From: Mark Hatle
To: Adhemerval Zanella, Khem Raj, Libc-alpha, Carlos O'Donell
Date: Thu, 13 Jan 2022 12:00:20 -0600
Subject: Re: [PATCH] elf/dl-deps.c: Make _dl_build_local_scope breadth first

On 1/13/22 11:20 AM, Adhemerval Zanella wrote:
>> When I last profiled this, roughly 3 1/2 years ago, the run-time
>> linking speedup was huge.  There were two main advantages to this:
>>
>> * Run-time linking speedup - primarily helped initial application
>>   loads.  System boot times went from 10-15 seconds down to 4-5
>>   seconds.  For embedded systems this was massive.
>
> Right, this is interesting. Any profile data showing where exactly the
> speedup is coming from? I wonder if we could get any gain by optimizing
> the normal path without the need to resort to prelink.

glibc's runtime linker is very efficient; I don't honestly expect many
speedups at this point.

This is partially from memory, so I may have a few details wrong...
But LD_DEBUG=statistics gives a baseline. On my Ubuntu machine, just
setting it and running /bin/bash results in:

   334067:
   334067:   runtime linker statistics:
   334067:         total startup time in dynamic loader: 252415 cycles
   334067:                   time needed for relocation: 119006 cycles (47.1%)
   334067:                        number of relocations: 412
   334067:             number of relocations from cache: 3
   334067:               number of relative relocations: 5100
   334067:                  time needed to load objects: 92655 cycles (36.7%)
   334068:
   334068:   runtime linker statistics:
   334068:         total startup time in dynamic loader: 125018 cycles
   334068:                   time needed for relocation: 40554 cycles (32.4%)
   334068:                        number of relocations: 176
   334068:             number of relocations from cache: 3
   334068:               number of relative relocations: 1534
   334068:                  time needed to load objects: 45882 cycles (36.7%)
   334069:
   334069:   runtime linker statistics:
   334069:         total startup time in dynamic loader: 121500 cycles
   334069:                   time needed for relocation: 39067 cycles (32.1%)
   334069:                        number of relocations: 136
   334069:             number of relocations from cache: 3
   334069:               number of relative relocations: 1274
   334069:                  time needed to load objects: 47505 cycles (39.0%)
   334071:
   334071:   runtime linker statistics:
   334071:         total startup time in dynamic loader: 111850 cycles
   334071:                   time needed for relocation: 35089 cycles (31.3%)
   334071:                        number of relocations: 135
   334071:             number of relocations from cache: 3
   334071:               number of relative relocations: 1272
   334071:                  time needed to load objects: 45746 cycles (40.8%)
   334072:
   334072:   runtime linker statistics:
   334072:         total startup time in dynamic loader: 109827 cycles
   334072:                   time needed for relocation: 34863 cycles (31.7%)
   334072:                        number of relocations: 145
   334072:             number of relocations from cache: 3
   334072:               number of relative relocations: 1351
   334072:                  time needed to load objects: 45565 cycles (41.4%)

(Why so many processes? Because bash runs through the profile and other
startup files, which end up executing additional programs.)

When the prelinker worked, the number of relocations (and especially the
cycles they required) dropped to about 1-10% of the original
application's. Compounded by the large number of executables loaded at
boot (think of sysvinit, with all of the shells started and destroyed),
this turned into a massive speedup during the early boot process.

For normal "user" behavior the speedup is negligible, because the amount
of time spent loading versus running is nothing... but in automated
processing, where something like bash is started, runs for a fraction of
a second, exits, and repeats thousands of times, it really becomes a
massive part of the time scale.

So back to the above: I know that in one instance bash would end up with
about 4 relocations, with 400+ coming from the cache, when prelinked.
That put the cycles required for relocations at around 10% of the
overall load time, with the time needed to load objects being roughly
90%.

>> * Memory usage.  The COW page usage for runtime linking can be
>> significant on memory constrained systems.  Prelinking dropped the
>> COW page usage in the systems I was looking at to about 10% of what
>> it was prior.  This is believed to have further contributed to the
>> boot time optimizations.
>
> Interesting, why exactly does prelinking help with COW usage? I would
> expect memory utilization to be roughly the same; is prelinking
> helping align the segments in a better way?

Each time a relocation occurs, the runtime linker needs to write into
the page holding that address. No relocation, no runtime write, no COW
page created.
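To make that concrete, here is a minimal C sketch (hypothetical, not
from this thread) of the kind of data that forces such a write. Built
as a PIE or shared object, the pointer-valued global below needs a
relocation that ld.so resolves by writing the final address into the
data page at load time, dirtying it and forcing a private COW copy in
every process that maps it; prelink precomputed those values so the
page could stay clean and shared:

  /* cow-demo.c -- hypothetical illustration.  Build without
     optimization as a PIE, e.g. "gcc -O0 cow-demo.c -o cow-demo" on
     a distro that defaults to PIE. */
  #include <stdio.h>

  static const char msg[] = "hello";

  /* The address of 'msg' is only known at load time, so this pointer
     gets a relative relocation: ld.so writes the resolved address
     into the page holding 'msg_ptr', dirtying it (one COW copy per
     process). */
  static const char *const msg_ptr = msg;

  /* Pure position-independent data: no relocation, so the page
     holding it can stay a shared read-only mapping. */
  static const char msg_copy[] = "hello";

  int main(void)
  {
      printf("%s %s\n", msg_ptr, msg_copy);
      return 0;
  }

Multiply that one pointer by every GOT entry and function-pointer table
in every loaded library, and the per-process dirty-page cost adds up
quickly.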
Add to this mmap sharing between applications: you can run, say, 100
bash sessions, and each session would use a fraction of the COW pages
it would use without prelinking. At one point I had statistics on this,
but I don't even remember how it was calculated or done anymore. (I had
help from some kernel people to show me kernel memory use, contiguous
pages, etc.)

>> Last I looked at this, only about 20-30% of the system is capable of
>> prelinking anymore, due to the evolutionary changes in the various
>> toolchain elements introducing new relocations and related
>> components.  Even things like re-ordering sections (done a couple of
>> years ago in binutils) have broken the prelinker in mysterious ways.
>
> Yes, and it is even harder for a project that depends on both the
> static and dynamic linker to have out-of-tree development without a
> proper ABI definition. That's why I think prelink is currently a
> hackish solution with a niche usage that adds a lot of complexity to
> the code base.
>
> For instance, we are aiming to support DT_RELR, which would help
> decrease the relocation segment size for PIE binaries. It will
> probably be another feature that prelink lacks support for.
>
> In fact, this information you provided, that only 20-30% of all
> binaries are supported, makes me even more willing to really
> deprecate prelink.

prelink has a huge advantage on embedded systems -- but it hasn't
worked well for about 3 years now... I was hoping someone would step up
and contribute more than life support, and it never really happened.
There were a few bugs/fixes sent by Mentor that kept things going on a
few platforms -- but even that eventually dried up. (This is meant as a
thank-you to them for the code and contributions they did make!)

>> Add to this the IMHO mistaken belief that ASLR is some magic
>> security device.  I see it more as a component of security that
>> needs to be part of a broader system, one that you can decide to
>> trade off against load performance (including memory).  But the
>> broad "if you don't use ASLR your device is vulnerable" mentality
>> has seriously restricted the desire for people to use, improve, and
>> contribute back to the prelinker.
>
> ASLR/PIE is not a silver bullet, especially with limited mmap entropy
> on 32-bit systems. But it is a gradual improvement among the multiple
> security features we support (from generic ones such as relro and
> malloc safe-linking to arch-specific ones such as x86_64 CET or
> aarch64 BTI or PAC/RET).

Exactly: it's multiple security features working together for a
purpose. But everyone got convinced ASLR was a silver bullet, and that
is what started the final death spiral of the prelinker (as it is
today).

> My point is more that what we usually see in generic distributions is
> the use of broader security features. I am not sure about embedded
> though.

Embedded needs security, no doubt... but with the limited entropy (even
on 64-bit, the entropy is truly limited -- great, now I have to run my
attack 15 times instead of 5; that really isn't much of an
improvement!) ASLR has become a checklist item for some security
consultant to approve a product release. Things like CET and
BTI/PAC/RET have a much larger real-world security impact, IMHO.
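The limited entropy mentioned above is easy to observe firsthand with a
minimal C sketch (hypothetical, not from this thread): build it as a
PIE, run it several times, and compare how many address bits actually
change between runs -- on 32-bit systems far fewer bits vary than on
64-bit:

  /* aslr-peek.c -- hypothetical demo.  Build with the default PIE
     ("gcc aslr-peek.c -o aslr-peek") and run it a few times; the
     text, data, heap, and stack addresses move around from run to
     run.  The number of bits that actually change is the ASLR
     entropy being discussed. */
  #include <stdio.h>
  #include <stdlib.h>

  int data_var;                      /* data segment */

  int main(void)
  {
      int stack_var;                 /* stack */
      void *heap_ptr = malloc(16);   /* heap */

      printf("text : %p\n", (void *) main);
      printf("data : %p\n", (void *) &data_var);
      printf("heap : %p\n", heap_ptr);
      printf("stack: %p\n", (void *) &stack_var);

      free(heap_ptr);
      return 0;
  }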
So in the end, the embedded development that I've been involved with
has always had a series of "these are our options; in a perfect world
we'd use them all" trade-offs: we don't have the memory (prelink
helped); we've got disk space limits (can't use PAC/RET, binaries get
bigger); we need to be able to upgrade the software (prelink on the
device, or send pre-prelinked images to all devices?); we've got
industry requirements (not all devices should have the same memory map
-- can prelink randomize addresses?); we've got maximum boot time
requirements; etc. It's not cut and dried which combination of those
requirements, and which technologies (such as the prelinker), should be
used to meet them.

As we have fewer operating system engineers, the preference is moving
away from tools like prelink and lots of simple utilities toward
alternatives like "jumbo do it all" binaries that only get loaded once:
avoiding initscript systems and packing system initialization into
those binaries, or even moving to other libcs that have less relocation
pressure (due to smaller libraries, feature sets, etc.).

If you declare prelink dead, then it's dead... nobody will be bringing
it back. But I do still believe that, technology-wise, it's a good fit
for embedded systems (remember, embedded doesn't mean "small") to help
them meet specific integration needs. But without help from people with
the appropriate knowledge to implement new features, like DT_RELR, in
the prelinker, there is little chance that it is anything but on life
support.

--Mark

>>> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=19861
>>> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/130404.html
>>> [3] https://git.yoctoproject.org/prelink-cross/