* Re: hwcaps subdirectory selection in the dynamic loader
2020-05-08 18:26 hwcaps subdirectory selection in the dynamic loader Florian Weimer
@ 2020-05-12 15:23 ` Stefan Liebler
2020-05-13 16:47 ` Florian Weimer
2020-05-20 12:51 ` Adhemerval Zanella
1 sibling, 1 reply; 5+ messages in thread
From: Stefan Liebler @ 2020-05-12 15:23 UTC (permalink / raw)
To: libc-alpha
On 5/8/20 8:26 PM, Florian Weimer via Libc-alpha wrote:
> As part of my work on bug 23249, I looked at how the dynamic loader
> finds and selects alternative implementations of shared objects based on
> hardware capabilities (hwcaps). This message intends to capture my
> understanding of this feature.
>
> The implementation largely happens via elf/dl-hwcaps.c, dl-procinfo.h,
> elf/dl-load.c, and elf/ldconfig.c, elf/dl-cache.c for ld.so.cache. On
> typical targets, the kernel provides hardware capability bits via
> AT_HWCAP auxiliary vector entry, and a platform string AT_PLATFORM.
>
> # Non-cache lookups
>
> For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
> guess pathnames. It does not use readdir. The supported hwcap bits
> (usually supplied by the kernel via AT_HWCAP) are filtered with the
> compile-time mask HWCAP_IMPORTANT. Each bit corresponds to a
> subdirectory name, as returned by _dl_hwcap_string. Two fake hwcap bits
> and corresponding subdirectory are added by the loader: the TLS bit with
> the directory name "tls", and the platform bit, with the AT_PLATFORM
> string provided by the kernel as the directory name. The dynamic loader
> then computes the power set of those directory names. The full paths
> are constructed by concatenating the subdirectory names of the set bits,
> starting with "tls", the AT_PLATFORM directory, and then the active real
> hwcap bits, going from more significant to less significant bits. The
> power set is enumerated starting with all bits set, and then proceeds to
> remove bits according to an integer decrementing pattern.
>
> (Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
> it is no longer used in practice since the nosegneg removal on i686.)
>
> There is no sysdeps override for this search path construction. An
> architecture can only affect how the hwcap bits are computed, to which
> strings individual bits correspond, and what the platform subdirectory is
> called. The fake two bits (TLS and platform) and the power-set
> construction always apply.
>
> I'm using s390x as an example now because the situation is fairly simple
> compared to other architectures and I have it around for testing. I
> think it's broadly representative of what other architectures do.
>
> On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
> vx and later bits), the search paths look like this:
>
> tls/zEC12/dfp/eimm/ldisp/zarch
> tls/zEC12/dfp/eimm/ldisp
> tls/zEC12/dfp/eimm/zarch
> tls/zEC12/dfp/eimm
> tls/zEC12/dfp/ldisp/zarch
> tls/zEC12/dfp/ldisp
> tls/zEC12/dfp/zarch
> tls/zEC12/dfp
> tls/zEC12/eimm/ldisp/zarch
> tls/zEC12/eimm/ldisp
> tls/zEC12/eimm/zarch
> tls/zEC12/eimm
> tls/zEC12/ldisp/zarch
> tls/zEC12/ldisp
> tls/zEC12/zarch
> tls/zEC12
> tls/dfp/eimm/ldisp/zarch
> tls/dfp/eimm/ldisp
> tls/dfp/eimm/zarch
> tls/dfp/eimm
> tls/dfp/ldisp/zarch
> tls/dfp/ldisp
> tls/dfp/zarch
> tls/dfp
> tls/eimm/ldisp/zarch
> tls/eimm/ldisp
> tls/eimm/zarch
> tls/eimm
> tls/ldisp/zarch
> tls/ldisp
> tls/zarch
> tls
> zEC12/dfp/eimm/ldisp/zarch
> zEC12/dfp/eimm/ldisp
> zEC12/dfp/eimm/zarch
> zEC12/dfp/eimm
> zEC12/dfp/ldisp/zarch
> zEC12/dfp/ldisp
> zEC12/dfp/zarch
> zEC12/dfp
> zEC12/eimm/ldisp/zarch
> zEC12/eimm/ldisp
> zEC12/eimm/zarch
> zEC12/eimm
> zEC12/ldisp/zarch
> zEC12/ldisp
> zEC12/zarch
> zEC12
> dfp/eimm/ldisp/zarch
> dfp/eimm/ldisp
> dfp/eimm/zarch
> dfp/eimm
> dfp/ldisp/zarch
> dfp/ldisp
> dfp/zarch
> dfp
> eimm/ldisp/zarch
> eimm/ldisp
> eimm/zarch
> eimm
> ldisp/zarch
> ldisp
> zarch
>
> And finally the actual search path entry is searched. On a z13 machine,
> there would be one more bit (vx), and the platform directory has a
> different name, "z13". So the first path is
> tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
>
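The enumeration described above can be sketched as follows. This is only a Python model of the description, not the loader's C code; the function name hwcap_search_paths is made up for illustration, and the caller is assumed to pass the names ordered from most to least significant bit.

```python
def hwcap_search_paths(names):
    """Enumerate hwcap subdirectory candidates for one search path entry.

    `names` is ordered from most to least significant bit: "tls" first,
    then the platform name, then the real hwcap bits. The mask counts
    down from "all bits set" to 1. Mask 0 (the plain search path entry,
    which is searched last) is not included.
    """
    n = len(names)
    paths = []
    for mask in range((1 << n) - 1, 0, -1):
        parts = [names[i] for i in range(n) if mask & (1 << (n - 1 - i))]
        paths.append("/".join(parts))
    return paths


zec12 = hwcap_search_paths(["tls", "zEC12", "dfp", "eimm", "ldisp", "zarch"])
z13 = hwcap_search_paths(["tls", "z13", "vx", "dfp", "eimm", "ldisp", "zarch"])
print(len(zec12), zec12[0], zec12[-1])  # 63 tls/zEC12/dfp/eimm/ldisp/zarch zarch
print(len(z13), z13[0])                 # 127 tls/z13/vx/dfp/eimm/ldisp/zarch
```

Counting the plain search path entry as well, that is 64 lookups on zEC12 and 128 on z13, which matches the "twice as many lookups" remark.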
> This scheme allows a library developer to require any combination of the
> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
> appropriate subdirectory. But it does not scale well as more bits are
> added. There is some path blacklisting in elf/dl-load.c, so this is not
> as bad as it looks here, but the first lookup in a library search path
> entry will consult all the directories (i.e., there is no blacklisting
> of say the tls/ subtree if the tls subdirectory does not exist).
Would it be possible to blacklist the remaining tls/... paths in this case?
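To make the question concrete, here is a rough Python simulation of such subtree blacklisting. It is a sketch under assumptions, not what elf/dl-load.c does: the names count_probes and enumerate_candidates are hypothetical, the file system is simulated by a set of existing directories, and the model spends one extra probe per top-level directory to learn whether the subtree exists at all.

```python
def enumerate_candidates(names):
    """Power-set enumeration, most significant name first, mask counting down."""
    n = len(names)
    result = []
    for mask in range(2**n - 1, 0, -1):
        parts = [names[i] for i in range(n) if (mask >> (n - 1 - i)) & 1]
        result.append("/".join(parts))
    return result


def count_probes(candidates, existing_dirs):
    """Count simulated directory probes for one search path entry.

    Before probing anything below a top-level name such as "tls", probe
    that name once; if it is missing, skip the whole subtree.
    """
    top_exists = {}  # top-level name -> result of its single probe
    probes = 0
    hits = []
    for path in candidates:
        top = path.split("/", 1)[0]
        if top not in top_exists:
            probes += 1
            top_exists[top] = top in existing_dirs
        if not top_exists[top]:
            continue  # subtree is blacklisted
        if path == top:
            hits.append(path)  # result already known from the top-level probe
            continue
        probes += 1
        if path in existing_dirs:
            hits.append(path)
    return probes, hits


candidates = enumerate_candidates(["tls", "zEC12", "dfp", "eimm", "ldisp", "zarch"])
probes, hits = count_probes(candidates, existing_dirs={"zEC12"})
print(len(candidates), probes, hits)  # 63 21 ['zEC12']
```

In this model, a search path entry that only contains a zEC12 subdirectory needs 21 probes instead of 63. The extra top-level probe is wasted when the subtree is fully populated, so whether this wins in practice depends on how often the directories exist.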
>
> # Cache lookups
>
> ldconfig uses a completely different way to locate objects in hwcaps
> subdirectories. To build the cache, it lists directories, and if in
> those directories, it encounters a name that corresponds to a hwcap
> directory name or a (hard-coded) platform name, it queues this
> subdirectory for later listing, descending further in the tree along
> these paths. This means that paths like those quoted above are also
> supported by ldconfig, except that it is more lenient and does not
> enforce any particular order on hwcap names.
>
> Only the second cache format (involving struct file_entry_new) can
> represent libraries in hwcaps subdirectories. There is a single
> uint64_t field which identifies the implied hardware capabilities.
> Regular hwcap bits are represented as themselves (after converting from
> the subdirectory name to the bit value), and all the bits are OR-ed
> together. If a platform directory is encountered in the path, a number
> is computed using _dl_string_platform from its name, and this number is
> then used as a fake bit index (outside of the supported real hwcap bits,
> see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
> the hwcap field in the cache.
>
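A small model of the cache encoding described above may help. All numeric values below are placeholders chosen for illustration; the real bit positions, _dl_string_platform and _DL_FIRST_PLATFORM are architecture-specific and are not reproduced here. The function name cache_hwcap_field is hypothetical.

```python
# Placeholder values for illustration only.
FIRST_PLATFORM = 48                                        # stand-in for _DL_FIRST_PLATFORM
HWCAP_BIT_INDEX = {"zarch": 1, "ldisp": 2, "eimm": 3, "dfp": 4}
PLATFORM_BIT_INDEX = {                                     # stand-in for _dl_string_platform
    "zEC12": FIRST_PLATFORM + 0,
    "z13": FIRST_PLATFORM + 1,
}


def cache_hwcap_field(subdir_path):
    """OR together the bits implied by each component of a hwcaps path."""
    value = 0
    for name in subdir_path.split("/"):
        if name in HWCAP_BIT_INDEX:
            value |= 1 << HWCAP_BIT_INDEX[name]
        elif name in PLATFORM_BIT_INDEX:
            value |= 1 << PLATFORM_BIT_INDEX[name]  # fake bit above the real hwcap bits
        else:
            raise ValueError(f"not a hwcap or platform directory: {name}")
    return value


field = cache_hwcap_field("zEC12/dfp/zarch")
print(hex(field))
```

Because the bits are simply OR-ed, "dfp/zarch" and "zarch/dfp" produce the same field, which is consistent with ldconfig not enforcing an order on hwcap names.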
> ldconfig tries to sort entries for the same soname according to some
> heuristic (see the compare function in elf/cache.c): hwcap entries with
> more bits generally come first.
>
> At run time, the dynamic loader finds all matching path entries for a
> soname in the cache, and then picks the first entry that matches the
> hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
>
> # Discussion
>
> I think there are a couple of problems with this approach. One subtle
> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
> But I think there are other issues.
>
> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
> calls, even with the blacklisting in place.
Yes, that's true. And I assume most of these paths are never used.
>
> The heuristic for choosing the implementation is not very obvious. Of
> course, with bitmasks of opaque CPU features, there is no generic
> winner. For example, on s390x z13, a library in a subdirectory
> ldisp/zarch would be preferred over one in vx because the former has
> more matching hwcap bits and comes earlier in the ld.so.cache sort order
> (but not the LD_LIBRARY_PATH order). This is counter-intuitive because
> vx (the z13 vector capability) should imply the other capabilities—the
> library was just placed into the wrong directory.
>
> The most tempting choice for such optimizations is the platform
> directory for architectures that have it ("zEC12" in the example above).
> But the problem is that if the system administrator upgrades the machine
> to z13, the directory name would change to "z13", and the optimized code
> would no longer be loaded! (Presumably, the zEC12-optimized code is
> still better than the generic code on z13. The same issue would apply
> to z13-optimized code vs z14-optimized code.) This would be a reason
> not to use AT_PLATFORM from the kernel even on s390x.
>
> There is another reason to distrust AT_PLATFORM: virtualization. If
> AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
> might not match the actual hwcap bits available to the guest because
> they are subject to separate knobs.
>
> The complexity of the trade-offs here suggests to me that we (the GNU
> toolchain as a whole) should try to pre-define names for collections of
> hwcap flags, so that we can get a monotonic progression of features
> under a clearly defined name. This will allow programmers to optimize
> for subsequent microarchitecture revisions. So instead of "x86_64" we
> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
> so on, more or less mirroring the "zEC12", "z13" &c platform directories
> on s390x, even though the kernel does not provide such platform names on
> x86-64. Even on platforms that provide an AT_PLATFORM name, in most
> cases, it would make sense to use *earlier* platform names as a fallback
> (so that a z15 system would also use z14- and z13-optimized libraries if
> available). This would mean that the dynamic loader would need to know
> more about these relationships.
Even on s390x, it's not really simple. E.g. platform "z13" does not
automatically mean that hwcap "vx" is available (e.g. if you are running
as a zVM guest where an old zVM version does not support vx). But if hwcap
"vx" is available, the platform is at least "z13".
As far as I know, the kernel currently provides "z900" as AT_PLATFORM
for new unknown machines instead of the latest known platform string,
e.g. "z15". But there could be hwcap flags for newer machines than
"z900" (e.g. hwcap "vx"). Would the loader also recognize this and test
z13 and all the earlier platforms?
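One possible answer can be sketched in Python. This is purely hypothetical and not existing loader behavior: the platform list is an illustrative subset, the table HWCAP_IMPLIES_PLATFORM and the function platform_fallbacks are invented names, and treating an unknown AT_PLATFORM string as the oldest entry is an assumption.

```python
# Oldest to newest; an illustrative subset of s390x platform names.
PLATFORM_ORDER = ["z900", "zEC12", "z13", "z14", "z15"]

# Hypothetical table: a hwcap that implies a minimum platform level.
HWCAP_IMPLIES_PLATFORM = {"vx": "z13"}


def platform_fallbacks(at_platform, hwcaps):
    """Return platform directory names to try, newest first.

    The starting level is taken from AT_PLATFORM and then raised if a
    hwcap implies a newer platform. Unknown platform strings are treated
    like the oldest entry (an assumption of this sketch).
    """
    level = PLATFORM_ORDER.index(at_platform) if at_platform in PLATFORM_ORDER else 0
    for cap in hwcaps:
        implied = HWCAP_IMPLIES_PLATFORM.get(cap)
        if implied is not None:
            level = max(level, PLATFORM_ORDER.index(implied))
    return PLATFORM_ORDER[level::-1]


print(platform_fallbacks("z900", {"vx"}))  # ['z13', 'zEC12', 'z900']
print(platform_fallbacks("z15", set()))    # ['z15', 'z14', 'z13', 'zEC12', 'z900']
```

Note that the implication only goes one way here: a "z13" platform directory still could not assume "vx", for the zVM reason given above.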
>
> The current hwcap construction is not really suited to that.
> ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
> mandatory power set construction. Even aggressive tree pruning will
> still see it make at least one system call per search path entry and
> hwcap. So I don't think we can use this mechanism for future changes.
>
> The way we store hwcap bits in ld.so.cache is also not ideal. It would
> be nice if ldconfig could be hwcap-agnostic, not having to care at all
> about the correspondence between subdirectory name and hwcap bit (or
> AT_PLATFORM pseudo-hwcap bit). I think I have a way to encode that
> while still maintaining ld.so.cache backwards compatibility (basically,
> set the currently unused bit 62 on those new hwcap entries, so that
> older loaders ignore them because of a missed hwcap requirement).
>
> If we put new hwcap subdirectories under a *single* subdirectory (say
> "glibc-hwcaps"), then we could prune paths more aggressively, and use
> the new scheme in parallel to the old without much impact on performance
> until these subdirectories are actually used. ldconfig could also treat
> the presence of a glibc-hwcaps subdirectory as an instruction to
> descend into each subdirectory of the glibc-hwcaps directory, but not
> further, and store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.
Does this mean that the LD_LIBRARY_PATH/non-cache case would first try all
directories inside glibc-hwcaps and, if no suitable library is found,
fall back to the current approach?
Are "new hwcaps" also allowed in the current approach, or are they only
allowed in the "glibc-hwcaps" directory?
Is nesting of "new hwcaps" allowed inside the "glibc-hwcaps" directory and,
if so, which heuristic is used to choose the library? Compare your
example above: "ldisp/zarch" vs "vx".
"store the names of those subdirectories in ld.so.cache, so
that the loader can match them at run time.": Does this mean that a library
placed in a new subdirectory is not found unless ldconfig is run
again?
>
> In any case, I do not see a way to make good progress on bug 23249 (the
> "haswell" platform subdirectory issue on various x86-64 variants)
> without tackling some of these issues.
>
> Thoughts?
>
> Thanks,
> Florian
>
* Re: hwcaps subdirectory selection in the dynamic loader
2020-05-08 18:26 hwcaps subdirectory selection in the dynamic loader Florian Weimer
2020-05-12 15:23 ` Stefan Liebler
@ 2020-05-20 12:51 ` Adhemerval Zanella
2020-05-28 10:54 ` Florian Weimer
1 sibling, 1 reply; 5+ messages in thread
From: Adhemerval Zanella @ 2020-05-20 12:51 UTC (permalink / raw)
To: libc-alpha
On 08/05/2020 15:26, Florian Weimer via Libc-alpha wrote:
> As part of my work on bug 23249, I looked at how the dynamic loader
> finds and selects alternative implementations of shared objects based on
> hardware capabilities (hwcaps). This message intends to capture my
> understanding of this feature.
This is a very nice piece of code digging; should we add it as a wiki entry?
>
> The implementation largely happens via elf/dl-hwcaps.c, dl-procinfo.h,
> elf/dl-load.c, and elf/ldconfig.c, elf/dl-cache.c for ld.so.cache. On
> typical targets, the kernel provides hardware capability bits via
> AT_HWCAP auxiliary vector entry, and a platform string AT_PLATFORM.
By "typical", do you mean all targets except x86, which uses a different
mechanism (cpuid through sysdeps/x86/cpu-features.c)?
So it seems that currently we have x86, i686, powerpc32, powerpc64,
powerpc64le, sparc32, sparc64, s390, s390x, and aarch64 that use the
hwcap subdirectories, right?
>
> # Non-cache lookups
>
> For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
> guess pathnames. It does not use readdir. The supported hwcap bits
> (usually supplied by the kernel via AT_HWCAP) are filtered with the
> compile-time mask HWCAP_IMPORTANT. Each bit corresponds to a
> subdirectory name, as returned by _dl_hwcap_string. Two fake hwcap bits
> and corresponding subdirectory are added by the loader: the TLS bit with
> the directory name "tls", and the platform bit, with the AT_PLATFORM
> string provided by the kernel as the directory name. The dynamic loader
> then computes the power set of those directory names. The full paths
> are constructed by concatenating the subdirectory names of the set bits,
> starting with "tls", the AT_PLATFORM directory, and then the active real
> hwcap bits, going from more significant to less significant bits. The
> power set is enumerated starting with all bits set, and then proceeds to
> remove bits according to an integer decrementing pattern.
>
> (Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
> it is no longer used in practice since the nosegneg removal on i686.)
And I noticed that you sent a patch to remove it as well.
>
> There is no sysdeps override for this search path construction. An
> architecture can only affect how the hwcap bits are computed, to which
> strings individual bits correspond, and what the platform subdirectory is
> called. The fake two bits (TLS and platform) and the power-set
> construction always apply.
The first question is whether it still makes sense to provide the fake TLS
bit that adds "tls" during path construction.
>
> I'm using s390x as an example now because the situation is fairly simple
> compared to other architectures and I have it around for testing. I
> think it's broadly representative of what other architectures do.
>
> On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
> vx and later bits), the search paths look like this:
>
> tls/zEC12/dfp/eimm/ldisp/zarch
> tls/zEC12/dfp/eimm/ldisp
> tls/zEC12/dfp/eimm/zarch
> tls/zEC12/dfp/eimm
> tls/zEC12/dfp/ldisp/zarch
> tls/zEC12/dfp/ldisp
> tls/zEC12/dfp/zarch
> tls/zEC12/dfp
> tls/zEC12/eimm/ldisp/zarch
> tls/zEC12/eimm/ldisp
> tls/zEC12/eimm/zarch
> tls/zEC12/eimm
> tls/zEC12/ldisp/zarch
> tls/zEC12/ldisp
> tls/zEC12/zarch
> tls/zEC12
> tls/dfp/eimm/ldisp/zarch
> tls/dfp/eimm/ldisp
> tls/dfp/eimm/zarch
> tls/dfp/eimm
> tls/dfp/ldisp/zarch
> tls/dfp/ldisp
> tls/dfp/zarch
> tls/dfp
> tls/eimm/ldisp/zarch
> tls/eimm/ldisp
> tls/eimm/zarch
> tls/eimm
> tls/ldisp/zarch
> tls/ldisp
> tls/zarch
> tls
> zEC12/dfp/eimm/ldisp/zarch
> zEC12/dfp/eimm/ldisp
> zEC12/dfp/eimm/zarch
> zEC12/dfp/eimm
> zEC12/dfp/ldisp/zarch
> zEC12/dfp/ldisp
> zEC12/dfp/zarch
> zEC12/dfp
> zEC12/eimm/ldisp/zarch
> zEC12/eimm/ldisp
> zEC12/eimm/zarch
> zEC12/eimm
> zEC12/ldisp/zarch
> zEC12/ldisp
> zEC12/zarch
> zEC12
> dfp/eimm/ldisp/zarch
> dfp/eimm/ldisp
> dfp/eimm/zarch
> dfp/eimm
> dfp/ldisp/zarch
> dfp/ldisp
> dfp/zarch
> dfp
> eimm/ldisp/zarch
> eimm/ldisp
> eimm/zarch
> eimm
> ldisp/zarch
> ldisp
> zarch
>
> And finally the actual search path entry is searched. On a z13 machine,
> there would be one more bit (vx), and the platform directory has a
> different name, "z13". So the first path is
> tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
>
> This scheme allows a library developer to require any combination of the
> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
> appropriate subdirectory. But it does not scale well as more bits are
> added. There is some path blacklisting in elf/dl-load.c, so this is not
> as bad as it looks here, but the first lookup in a library search path
> entry will consult all the directories (i.e., there is no blacklisting
> of say the tls/ subtree if the tls subdirectory does not exist).
>
> # Cache lookups
>
> ldconfig uses a completely different way to locate objects in hwcaps
> subdirectories. To build the cache, it lists directories, and if in
> those directories, it encounters a name that corresponds to a hwcap
> directory name or a (hard-coded) platform name, it queues this
> subdirectory for later listing, descending further in the tree along
> these paths. This means that paths like those quoted above are also
> supported by ldconfig, except that it is more lenient and does not
> enforce any particular order on hwcap names.
>
> Only the second cache format (involving struct file_entry_new) can
> represent libraries in hwcaps subdirectories. There is a single
> uint64_t field which identifies the implied hardware capabilities.
> Regular hwcap bits are represented as themselves (after converting from
> the subdirectory name to the bit value), and all the bits are OR-ed
> together. If a platform directory is encountered in the path, a number
> is computed using _dl_string_platform from its name, and this number is
> then used as a fake bit index (outside of the supported real hwcap bits,
> see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
> the hwcap field in the cache.
>
> ldconfig tries to sort entries for the same soname according to some
> heuristic (see the compare function in elf/cache.c): hwcap entries with
> more bits generally come first.
>
> At run time, the dynamic loader finds all matching path entries for a
> soname in the cache, and then picks the first entry that matches the
> hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
>
> # Discussion
>
> I think there are a couple of problems with this approach. One subtle
> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
> But I think there are other issues.
>
> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
> calls, even with the blacklisting in place.
Indeed. One option might be to use a different scheme than nesting the
capability strings, and instead join them into a single directory name. So
instead of:
zEC12/dfp/eimm/ldisp/zarch
zEC12/dfp/eimm/ldisp
zEC12/dfp/eimm/zarch
zEC12/dfp/eimm
zEC12/dfp/ldisp/zarch
we might have
zEC12-dfp-eimm-ldisp-zarch
zEC12-dfp-eimm-ldisp
zEC12-dfp-eimm-zarch
zEC12-dfp-eimm
zEC12-dfp-ldisp-zarch
Thus an openat plus a getdents might be less costly than the multiple
opens the loader issues today. It should also scale better with new additions
or permutations.
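A rough Python sketch of that idea: list the search directory once and filter the flat names against the capabilities the machine supports. The function name flat_hwcap_candidates and the "most components first" preference order are assumptions for illustration. Splitting on "-" also assumes that no capability or platform name itself contains a dash, which a real scheme would have to handle.

```python
import os
import tempfile


def flat_hwcap_candidates(search_dir, supported):
    """List `search_dir` once and keep subdirectories whose dash-separated
    components are all in `supported`, most components first."""
    matches = []
    with os.scandir(search_dir) as entries:  # one directory listing
        for entry in entries:
            if entry.is_dir() and set(entry.name.split("-")) <= supported:
                matches.append(entry.name)
    return sorted(matches, key=lambda name: (-len(name.split("-")), name))


with tempfile.TemporaryDirectory() as root:
    for name in ["zEC12-dfp-eimm", "dfp", "z13-vx", "unrelated"]:
        os.mkdir(os.path.join(root, name))
    result = flat_hwcap_candidates(root, {"zEC12", "dfp", "eimm", "ldisp", "zarch"})

print(result)  # ['zEC12-dfp-eimm', 'dfp']
```

The number of system calls then depends on the number of directories that actually exist, not on the size of the power set.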
>
> The heuristic for choosing the implementation is not very obvious. Of
> course, with bitmasks of opaque CPU features, there is no generic
> winner. For example, on s390x z13, a library in a subdirectory
> ldisp/zarch would be preferred over one in vx because the former has
> more matching hwcap bits and comes earlier in the ld.so.cache sort order
> (but not the LD_LIBRARY_PATH order). This is counter-intuitive because
> vx (the z13 vector capability) should imply the other capabilities—the
> library was just placed into the wrong directory.
It was not clear to me why exactly the heuristic used in cached lookups
differs from the one used in non-cached lookups.
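The disagreement between the two orders can be reproduced with a small Python model. It is a simplification: the cache order is reduced to "more bits first" as stated in the quoted text (the real compare function in elf/cache.c does more), and the "tls" and platform bits are left out. The helper name mask_of is made up for this sketch.

```python
# Real hwcap names on a z13, most significant bit first.
NAMES = ["vx", "dfp", "eimm", "ldisp", "zarch"]


def mask_of(path):
    """Bitmask for a hwcaps subdirectory path such as "ldisp/zarch"."""
    n = len(NAMES)
    return sum(1 << (n - 1 - NAMES.index(part)) for part in path.split("/"))


installed = ["vx", "ldisp/zarch"]

# LD_LIBRARY_PATH order: the mask counts down, so larger masks are tried first.
path_order = sorted(installed, key=mask_of, reverse=True)

# Cache order, simplified to "entries with more bits come first".
cache_order = sorted(installed, key=lambda p: bin(mask_of(p)).count("1"), reverse=True)

print(path_order)   # ['vx', 'ldisp/zarch']
print(cache_order)  # ['ldisp/zarch', 'vx']
```

So the non-cached order ranks by the value of the mask, while the cached order ranks by the number of set bits, and the two only agree by accident.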
>
> The most tempting choice for such optimizations is the platform
> directory for architectures that have it ("zEC12" in the example above).
> But the problem is that if the system administrator upgrades the machine
> to z13, the directory name would change to "z13", and the optimized code
> would no longer be loaded! (Presumably, the zEC12-optimized code is
> still better than the generic code on z13. The same issue would apply
> to z13-optimized code vs z14-optimized code.) This would be a reason
> not to use AT_PLATFORM from the kernel even on s390x.
>
> There is another reason to distrust AT_PLATFORM: virtualization. If
> AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
> might not match the actual hwcap bits available to the guest because
> they are subject to separate knobs.
>
> The complexity of the trade-offs here suggests to me that we (the GNU
> toolchain as a whole) should try to pre-define names for collections of
> hwcap flags, so that we can get a monotonic progression of features
> under a clearly defined name. This will allow programmers to optimize
> for subsequent microarchitecture revisions. So instead of "x86_64" we
> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
> so on, more or less mirroring the "zEC12", "z13" &c platform directories
> on s390x, even though the kernel does not provide such platform names on
> x86-64. Even on platforms that provide an AT_PLATFORM name, in most
> cases, it would make sense to use *earlier* platform names as a fallback
> (so that a z15 system would also use z14- and z13-optimized libraries if
> available). This would mean that the dynamic loader would need to know
> more about these relationships.
I agree with you. My understanding is that the current scheme tries to free
the platform maintainer from having to pre-define such a list (which has its
own drawbacks, as the AMD/Intel selection issue is showing us).
>
> The current hwcap construction is not really suited to that.
> ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
> mandatory power set construction. Even aggressive tree pruning will
> still see it make at least one system call per search path entry and
> hwcap. So I don't think we can use this mechanism for future changes.
Agreed.
>
> The way we store hwcap bits in ld.so.cache is also not ideal. It would
> be nice if ldconfig could be hwcap-agnostic, not having to care at all
> about the correspondence between subdirectory name and hwcap bit (or
> AT_PLATFORM pseudo-hwcap bit). I think I have a way to encode that
> while still maintaining ld.so.cache backwards compatibility (basically,
> set the currently unused bit 62 on those new hwcap entries, so that
> older loaders ignore them because of a missed hwcap requirement).
From previous discussion, I take it that the glibc-defined paths currently
searched are not part of the ABI, so we are free to tune them in future
releases.
>
> If we put new hwcap subdirectories under a *single* subdirectory (say
> "glibc-hwcaps"), then we could prune paths more aggressively, and use
> the new scheme in parallel to the old without much impact on performance
> until these subdirectories are actually used. ldconfig could also treat
> the presence of a glibc-hwcaps subdirectory as an instruction to
> descend into each subdirectory of the glibc-hwcaps directory, but not
> further, and store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.
>
> In any case, I do not see a way to make good progress on bug 23249 (the
> "haswell" platform subdirectory issue on various x86-64 variants)
> without tackling some of these issues.
>
> Thoughts?
Agreed. What I think we need to do is move the logic that provides
the paths from the generic code to the arch-specific bits. One thing that
became clear is that not every hwcap bit represents a meaningful ABI variant
for which developers will actively provide optimized builds.
So one option I see is to create an architecture hook similar to
_dl_hwcap_string, where the loader provides the AT_PLATFORM, the hwcaps,
and any other meaningful information, and architecture-specific code
creates the desired search list (which might either use a pre-defined
database or be computed from the hwcap bits).
As you put it in another reply, I think folder nesting only adds complexity
without much gain, and the preference order should be defined in the
arch-specific code instead of by generic heuristics.
This will also require proper documentation on how the hwcap
path list is selected, instead of the implicit system we have so far (where
the most straightforward way to obtain the list is to use LD_DEBUG).
The final question is whether we will continue to use the current scheme
in parallel or whether the idea is to eventually phase it out.