Date: Thu, 13 Jan 2022 15:37:48 -0300
Subject: Re: [PATCH] elf/dl-deps.c: Make _dl_build_local_scope breadth first
To: Mark Hatle, Khem Raj, Libc-alpha, Carlos O'Donell
From: Adhemerval Zanella
In-Reply-To: <9f77a6e3-fa70-d445-0e70-45a176cb0a7f@kernel.crashing.org>

On 13/01/2022 15:00, Mark Hatle wrote:
>
>
> On 1/13/22 11:20 AM, Adhemerval Zanella wrote:
>>> When I last profiled this, roughly 3 1/2 years ago, the run-time linking speedup was huge. There were two main advantages to this:
>>>
>>> * Run Time linking speedup - primarily helped initial application loads. System boot times went from 10-15 seconds down to 4-5 seconds. For embedded systems this was massive.
>>
>> Right, this is interesting. Any profile data on where exactly the speedup is coming
>> from? I wonder if we could get any gain by optimizing the normal path without
>> the need to resort to prelink.
>
> glibc's runtime linker is very efficient, I don't honestly expect many speedups at this point.
>
> This is partially from memory, so I may have a few details wrong... but
>
> LD_DEBUG=statistics
>
> On my ubuntu machine, just setting that and running /bin/bash results in:
>
>     334067:
>     334067:    runtime linker statistics:
>     334067:      total startup time in dynamic loader: 252415 cycles
>     334067:                time needed for relocation: 119006 cycles (47.1%)
>     334067:                     number of relocations: 412
>     334067:          number of relocations from cache: 3
>     334067:            number of relative relocations: 5100
>     334067:               time needed to load objects: 92655 cycles (36.7%)
>     334068:
>     334068:    runtime linker statistics:
>     334068:      total startup time in dynamic loader: 125018 cycles
>     334068:                time needed for relocation: 40554 cycles (32.4%)
>     334068:                     number of relocations: 176
>     334068:          number of relocations from cache: 3
>     334068:            number of relative relocations: 1534
>     334068:               time needed to load objects: 45882 cycles (36.7%)
>     334069:
>     334069:    runtime linker statistics:
>     334069:      total startup time in dynamic loader: 121500 cycles
>     334069:                time needed for relocation: 39067 cycles (32.1%)
>     334069:                     number of relocations: 136
>     334069:          number of relocations from cache: 3
>     334069:            number of relative relocations: 1274
>     334069:               time needed to load objects: 47505 cycles (39.0%)
>     334071:
>     334071:    runtime linker statistics:
>     334071:      total startup time in dynamic loader: 111850 cycles
>     334071:                time needed for relocation: 35089 cycles (31.3%)
>     334071:                     number of relocations: 135
>     334071:          number of relocations from cache: 3
>     334071:            number of relative relocations: 1272
>     334071:               time needed to load objects: 45746 cycles (40.8%)
>     334072:
>     334072:    runtime linker statistics:
>     334072:      total startup time in dynamic loader: 109827 cycles
>     334072:                time needed for relocation: 34863 cycles (31.7%)
>     334072:                     number of relocations: 145
>     334072:          number of relocations from cache: 3
>     334072:            number of relative relocations: 1351
>     334072:               time needed to load objects: 45565 cycles (41.4%)
>
> (why so many, because it's running through the profile and other bash startups which end up running additional items.)
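As an aside, the percentages in that output are simply each phase's cycle count divided by the total startup cycles, so runs are easy to compare with a small script. What follows is only a minimal Python sketch, assuming the text layout quoted above; the names parse_ld_stats and relocation_share and the regular expression are hypothetical and not part of glibc.

```python
import re

# Hypothetical pattern for "<pid>: <label>: <number> [cycles] [(NN.N%)]" lines,
# assuming the LD_DEBUG=statistics layout quoted in this thread.
_STAT_LINE = re.compile(
    r"^\s*(\d+):\s+(.*?):\s+(\d+)(?:\s+cycles)?(?:\s+\([\d.]+%\))?\s*$"
)


def parse_ld_stats(text):
    """Return {pid: {label: integer value}} for every statistics line found."""
    stats = {}
    for line in text.splitlines():
        match = _STAT_LINE.match(line)
        if match is None:
            continue  # blank lines and the "runtime linker statistics:" header
        pid, label, value = match.groups()
        stats.setdefault(int(pid), {})[label] = int(value)
    return stats


def relocation_share(entry):
    """Percentage of loader startup cycles spent on relocation."""
    total = entry["total startup time in dynamic loader"]
    return 100.0 * entry["time needed for relocation"] / total


# First block of the quoted output, copied as-is.
SAMPLE = """\
    334067:
    334067:    runtime linker statistics:
    334067:      total startup time in dynamic loader: 252415 cycles
    334067:                time needed for relocation: 119006 cycles (47.1%)
    334067:                     number of relocations: 412
    334067:          number of relocations from cache: 3
    334067:            number of relative relocations: 5100
    334067:               time needed to load objects: 92655 cycles (36.7%)
"""

if __name__ == "__main__":
    for pid, entry in parse_ld_stats(SAMPLE).items():
        print(pid, round(relocation_share(entry), 1))
```

Real output could be captured with something like "LD_DEBUG=statistics /bin/bash -c true 2>stats.txt", since the loader writes these lines to stderr unless LD_DEBUG_OUTPUT is set.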
>
> When prelinker worked... the number of relocations (and especially cycles) required dropped to about 1-10% of the original application. This compounded by the large number of executables loaded at boot (think of sysvinit with all of the shells started and destroyed) turned into a massive speedup during early boot process.
>
> As a normal "user" behavior, the speedup is negligible, because the amount of time spent loading vs running is nothing.... but in automated processing where something, like bash, is started, runs for a fraction of a second, exits.. "repeat" 1000s of times.. it really becomes a massive part of the time scale.
>
> So back to the above, I know that in one instance bash would end up with about 4 relocations, with 400+ from cache with the prelinker. Resulting in the cycles required for relocations to be in the 10% of overall load time, with time needed to load objects being roughly 90%.
>

Right, the compound improvements over all binaries make sense.

>>>
>>> * Memory usage. The COW page usage for runtime linking can be significant on memory constrained systems. Prelinking dropped the COW page usage in the systems I was looking at to about 10% of what it was prior. This is believed to have further contributed to the boot time optimizations.
>>
>> Interesting, why exactly does prelinking help with COW usage? I would expect memory
>> utilization to be roughly the same, is prelinking helping in aligning the segment
>> in a better way?
>
> Each time a relocation occurs, the runtime linker needs to write into a page with that address. No relocation, no runtime write, no COW page created.
>
> Add to this mmap usage between applications, and you can run say 100 bash sessions and each session would use a fraction of the COW pages that it would without prelinking.

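To make the claim above concrete, here is a deliberately simplified model (an assumption for illustration, not a measurement): every relocation the loader applies writes to some address, and each distinct page written in a private mapping ends up as a private copy. It assumes 4 KiB pages and ignores RELRO, lazy binding and everything else; the names dirtied_pages and private_bytes and the addresses are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def dirtied_pages(reloc_addrs, page_size=PAGE_SIZE):
    """Set of page indexes written when each relocation target is patched."""
    return {addr // page_size for addr in reloc_addrs}


def private_bytes(reloc_addrs, page_size=PAGE_SIZE):
    """Memory that stops being shared in this model: one page per dirtied page."""
    return len(dirtied_pages(reloc_addrs, page_size)) * page_size


if __name__ == "__main__":
    # Hypothetical layouts: 512 contiguous 8-byte slots versus 16 writes
    # that each land on a different page.
    contiguous = [0x10000 + 8 * i for i in range(512)]
    scattered = [0x20000 + PAGE_SIZE * i for i in range(16)]
    print(private_bytes(contiguous))  # all writes share a single page
    print(private_bytes(scattered))   # one private page per write
    print(private_bytes([]))          # no relocation, no write, no copy
```

In this model the cost depends on how many distinct pages are touched rather than on the raw relocation count, which is why a few scattered writes can cost more than many contiguous ones.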
Yes, but leaving aside TEXTREL or a writable PLT, I am not
seeing how COW would help, since the GOT (where most if not all
relocations would happen) is an anonymous mapping.

>
> At one point I had statistics on this, but I don't even remember how this was calculated or done anymore. (I had help from some kernel people to show me kernel memory use, contiguous pages, etc..)
>
>>>
>>> Last I looked at this, only about 20-30% of the system is capable of prelinking anymore due to the evolutionary changes in the various toolchain elements, introducing new relocations and related components. Even things like re-ordering sections (done a couple years ago in binutils) has broken the prelinker in mysterious ways.
>>
>> Yes and it is even harder to have a project that is dependent on both
>> static and dynamic linker to have out-of-tree development without a
>> proper ABI definition. That's why I think currently prelink is a hackish
>> solution with a niche usage that adds a lot of complexity to the code base.
>>
>> For instance, we are aiming to support DT_RELR which would help to
>> decrease the relocation segment size for PIE binaries. It would
>> probably be another feature that prelink will lack support for.
>>
>> In fact this information you provided that only 20-30% of all binaries
>> are supported makes me even more willing to really deprecate prelink.
>
> prelink has a huge advantage on embedded systems -- but it hasn't worked well for about 3 years now... I was hoping other than life support someone would step up and contribute, and it never really happened. There were a few bugs/fixes sent by Mentor that kept things going on a few platforms -- but even that eventually dried up. (This is meant to thank them for the code and contributions they did!)
>
>>>
>>> Add to this the IMHO mistaken belief that ASLR is some magic security device. I see it more as a component of security that needs to be part of a broader system that you can decide to trade off against load performance (including memory). But the broad "if you don't use ASLR your device is vulnerable" mentality has seriously restricted the desire for people to use, improve and contribute back to the prelinker.
>>
>> ASLR/PIE is not a silver bullet, especially with limited mmap entropy on
>> 32 bit systems. But it is a gradual improvement over the multiple security
>> features we support (from generic ones such as relro, malloc safe-linking, etc.
>> to arch-specific ones such as x86_64 CET or aarch64 BTI or PAC/RET).
>
> Exactly, it's multiple security features working together for a purpose. But everyone got convinced ASLR was a silver bullet and that is what started the final death spiral of the prelinker (as it is today).
>
>> My point is more that what we usually see in generic distributions is
>> the use of broader security features. I am not sure about embedded though.
>
> Embedded needs security, no doubt.. but with the limited entropy (even on 64-bit, the entropy is truly limited.. great I now have to run my attack 15 times instead of 5.. that really isn't much of an improvement!) ASLR has become a check list item for some security consultant to approve a product release.
>
> Things like the CET, BTI / PAC/RET have a much larger real-world security impact, IMHO.
>
> So in the end the embedded development that I've been involved with has always had a series of "these are our options, in a perfect world we'd use them all -- but we don't have the memory (prelink helped), we've got disk space limits (can't use PAC/RET, binaries get bigger), we need to be able to upgrade the software (prelink on the device? send pre-prelinked to all devices), we've got industry requirements (not all devices should have the same memory map, prelink randomize addresses?), we've got maximum boot time requirements, etc." It's not cut and dried what combination of those requirements, and which technologies (such as the prelinker) should be used to meet them. As we have fewer operating system engineers, the preference is going away from using tools like prelink and lots of simple utilities into alternatives like "jumbo do it all" binaries that only get loaded once. Avoiding initscript systems and packing system initialization into those binaries, or even moving to other libc's that have less relocation pressure (due to smaller libraries, feature sets, etc.)
>
> If you declare prelink dead, then it's dead.. nobody will be bringing it back. But I do still believe technology wise it's a good technology for the embedded systems (remember embedded doesn't mean "small") to help them meet specific integration needs. But without help from people with the appropriate knowledge to implement new features, like DT_RELR, in the prelinker -- there is little chance that it is anything but on life support.
>

More and more, I see proper static linking as a *much* simpler
solution than prelink, with the advantage that it also decreases code complexity
and attack surface and can keep up with ABI extensions far more easily.

For instance, static PIE is now supported on both glibc and musl.

And it is not that I am declaring it dead, but it will become dead-weight
support that we will need to provide for the sake of a handful of specific
use cases and that, due to lack of maintenance, will have subtle issues and
lack support for the new ABI extensions.