From: Mark Hatle
To: Adhemerval Zanella, Khem Raj, Libc-alpha, Carlos O'Donell
Date: Thu, 13 Jan 2022 12:00:20 -0600
Subject: Re: [PATCH] elf/dl-deps.c: Make _dl_build_local_scope breadth first

On 1/13/22 11:20 AM, Adhemerval Zanella wrote:
>> When I last profiled this, roughly 3 1/2 years ago, the run-time
>> linking speedup was huge.  There were two main advantages to this:
>>
>> * Run-time linking speedup - primarily helped initial application
>>   loads.  System boot times went from 10-15 seconds down to 4-5
>>   seconds.  For embedded systems this was massive.
>
> Right, this is interesting. Any profile data showing where exactly the
> speedup is coming from? I wonder if we could get any gain by optimizing
> the normal path without the need to resort to prelink.

glibc's runtime linker is very efficient; I don't honestly expect many
speedups at this point.

This is partially from memory, so I may have a few details wrong...
But LD_DEBUG=statistics gives a baseline. On my Ubuntu machine, just
setting it and running /bin/bash results in:

   334067:
   334067:   runtime linker statistics:
   334067:         total startup time in dynamic loader: 252415 cycles
   334067:                   time needed for relocation: 119006 cycles (47.1%)
   334067:                        number of relocations: 412
   334067:             number of relocations from cache: 3
   334067:               number of relative relocations: 5100
   334067:                  time needed to load objects: 92655 cycles (36.7%)
   334068:
   334068:   runtime linker statistics:
   334068:         total startup time in dynamic loader: 125018 cycles
   334068:                   time needed for relocation: 40554 cycles (32.4%)
   334068:                        number of relocations: 176
   334068:             number of relocations from cache: 3
   334068:               number of relative relocations: 1534
   334068:                  time needed to load objects: 45882 cycles (36.7%)
   334069:
   334069:   runtime linker statistics:
   334069:         total startup time in dynamic loader: 121500 cycles
   334069:                   time needed for relocation: 39067 cycles (32.1%)
   334069:                        number of relocations: 136
   334069:             number of relocations from cache: 3
   334069:               number of relative relocations: 1274
   334069:                  time needed to load objects: 47505 cycles (39.0%)
   334071:
   334071:   runtime linker statistics:
   334071:         total startup time in dynamic loader: 111850 cycles
   334071:                   time needed for relocation: 35089 cycles (31.3%)
   334071:                        number of relocations: 135
   334071:             number of relocations from cache: 3
   334071:               number of relative relocations: 1272
   334071:                  time needed to load objects: 45746 cycles (40.8%)
   334072:
   334072:   runtime linker statistics:
   334072:         total startup time in dynamic loader: 109827 cycles
   334072:                   time needed for relocation: 34863 cycles (31.7%)
   334072:                        number of relocations: 145
   334072:             number of relocations from cache: 3
   334072:               number of relative relocations: 1351
   334072:                  time needed to load objects: 45565 cycles (41.4%)

(Why so many processes? Because bash runs through the profile and other
startup files, which end up executing additional programs.)

When the prelinker worked, the number of relocations (and especially the
cycles they required) dropped to about 1-10% of the original
application's. Compounded by the large number of executables loaded at
boot (think of sysvinit, with all of the shells started and destroyed),
this turned into a massive speedup during the early boot process.

For normal "user" behavior the speedup is negligible, because the amount
of time spent loading versus running is nothing... but in automated
processing, where something like bash is started, runs for a fraction of
a second, exits, and repeats thousands of times, it really becomes a
massive part of the time scale.

So back to the above: I know that in one instance bash would end up with
about 4 relocations, with 400+ coming from the cache, when prelinked.
That put the cycles required for relocations at around 10% of the
overall load time, with the time needed to load objects being roughly
90%.

>> * Memory usage.  The COW page usage for runtime linking can be
>> significant on memory constrained systems.  Prelinking dropped the
>> COW page usage in the systems I was looking at to about 10% of what
>> it was prior.  This is believed to have further contributed to the
>> boot time optimizations.
>
> Interesting, why exactly does prelinking help with COW usage? I would
> expect memory utilization to be roughly the same; is prelinking
> helping align the segments in a better way?

Each time a relocation occurs, the runtime linker needs to write into
the page holding that address. No relocation, no runtime write, no COW
page created.
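To make that concrete, here is a minimal C sketch (hypothetical, not
from this thread) of the kind of data that forces such a write. Built
as a PIE or shared object, the pointer-valued global below needs a
relocation that ld.so resolves by writing the final address into the
data page at load time, dirtying it and forcing a private COW copy in
every process that maps it; prelink precomputed those values so the
page could stay clean and shared:

  /* cow-demo.c -- hypothetical illustration.  Build without
     optimization as a PIE, e.g. "gcc -O0 cow-demo.c -o cow-demo" on
     a distro that defaults to PIE. */
  #include <stdio.h>

  static const char msg[] = "hello";

  /* The address of 'msg' is only known at load time, so this pointer
     gets a relative relocation: ld.so writes the resolved address
     into the page holding 'msg_ptr', dirtying it (one COW copy per
     process). */
  static const char *const msg_ptr = msg;

  /* Pure position-independent data: no relocation, so the page
     holding it can stay a shared read-only mapping. */
  static const char msg_copy[] = "hello";

  int main(void)
  {
      printf("%s %s\n", msg_ptr, msg_copy);
      return 0;
  }

Multiply that one pointer by every GOT entry and function-pointer table
in every loaded library, and the per-process dirty-page cost adds up
quickly.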
Add to this mmap sharing between applications: you can run, say, 100
bash sessions, and each session would use a fraction of the COW pages
it would use without prelinking. At one point I had statistics on this,
but I don't even remember how it was calculated or done anymore. (I had
help from some kernel people to show me kernel memory use, contiguous
pages, etc.)

>> Last I looked at this, only about 20-30% of the system is capable of
>> prelinking anymore, due to the evolutionary changes in the various
>> toolchain elements introducing new relocations and related
>> components.  Even things like re-ordering sections (done a couple of
>> years ago in binutils) have broken the prelinker in mysterious ways.
>
> Yes, and it is even harder for a project that depends on both the
> static and dynamic linker to have out-of-tree development without a
> proper ABI definition. That's why I think prelink is currently a
> hackish solution with a niche usage that adds a lot of complexity to
> the code base.
>
> For instance, we are aiming to support DT_RELR, which would help
> decrease the relocation segment size for PIE binaries. It will
> probably be another feature that prelink lacks support for.
>
> In fact, this information you provided, that only 20-30% of all
> binaries are supported, makes me even more willing to really
> deprecate prelink.

prelink has a huge advantage on embedded systems -- but it hasn't
worked well for about 3 years now... I was hoping someone would step up
and contribute more than life support, and it never really happened.
There were a few bugs/fixes sent by Mentor that kept things going on a
few platforms -- but even that eventually dried up. (This is meant as a
thank-you to them for the code and contributions they did make!)

>> Add to this the IMHO mistaken belief that ASLR is some magic
>> security device.  I see it more as a component of security that
>> needs to be part of a broader system, one that you can decide to
>> trade off against load performance (including memory).  But the
>> broad "if you don't use ASLR your device is vulnerable" mentality
>> has seriously restricted the desire for people to use, improve, and
>> contribute back to the prelinker.
>
> ASLR/PIE is not a silver bullet, especially with limited mmap entropy
> on 32-bit systems. But it is a gradual improvement among the multiple
> security features we support (from generic ones such as relro and
> malloc safe-linking to arch-specific ones such as x86_64 CET or
> aarch64 BTI or PAC/RET).

Exactly: it's multiple security features working together for a
purpose. But everyone got convinced ASLR was a silver bullet, and that
is what started the final death spiral of the prelinker (as it is
today).

> My point is more that what we usually see in generic distributions is
> the use of broader security features. I am not sure about embedded
> though.

Embedded needs security, no doubt... but with the limited entropy (even
on 64-bit, the entropy is truly limited -- great, now I have to run my
attack 15 times instead of 5; that really isn't much of an
improvement!) ASLR has become a checklist item for some security
consultant to approve a product release. Things like CET and
BTI/PAC/RET have a much larger real-world security impact, IMHO.
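The limited entropy mentioned above is easy to observe firsthand with a
minimal C sketch (hypothetical, not from this thread): build it as a
PIE, run it several times, and compare how many address bits actually
change between runs -- on 32-bit systems far fewer bits vary than on
64-bit:

  /* aslr-peek.c -- hypothetical demo.  Build with the default PIE
     ("gcc aslr-peek.c -o aslr-peek") and run it a few times; the
     text, data, heap, and stack addresses move around from run to
     run.  The number of bits that actually change is the ASLR
     entropy being discussed. */
  #include <stdio.h>
  #include <stdlib.h>

  int data_var;                      /* data segment */

  int main(void)
  {
      int stack_var;                 /* stack */
      void *heap_ptr = malloc(16);   /* heap */

      printf("text : %p\n", (void *) main);
      printf("data : %p\n", (void *) &data_var);
      printf("heap : %p\n", heap_ptr);
      printf("stack: %p\n", (void *) &stack_var);

      free(heap_ptr);
      return 0;
  }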
So in the end, the embedded development that I've been involved with
has always had a series of "these are our options; in a perfect world
we'd use them all" trade-offs: we don't have the memory (prelink
helped); we've got disk space limits (can't use PAC/RET, binaries get
bigger); we need to be able to upgrade the software (prelink on the
device, or send pre-prelinked images to all devices?); we've got
industry requirements (not all devices should have the same memory map
-- can prelink randomize addresses?); we've got maximum boot time
requirements; etc. It's not cut and dried which combination of those
requirements, and which technologies (such as the prelinker), should be
used to meet them.

As we have fewer operating system engineers, the preference is moving
away from tools like prelink and lots of simple utilities toward
alternatives like "jumbo do it all" binaries that only get loaded once:
avoiding initscript systems and packing system initialization into
those binaries, or even moving to other libcs that have less relocation
pressure (due to smaller libraries, feature sets, etc.).

If you declare prelink dead, then it's dead... nobody will be bringing
it back. But I do still believe that, technology-wise, it's a good fit
for embedded systems (remember, embedded doesn't mean "small") to help
them meet specific integration needs. But without help from people with
the appropriate knowledge to implement new features, like DT_RELR, in
the prelinker, there is little chance that it is anything but on life
support.

--Mark

>>> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
>>> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
>>> [3] https://git.yoctoproject.org/prelink-cross/