public inbox for libc-alpha@sourceware.org
* hwcaps subdirectory selection in the dynamic loader
@ 2020-05-08 18:26 Florian Weimer
  2020-05-12 15:23 ` Stefan Liebler
  2020-05-20 12:51 ` Adhemerval Zanella
  0 siblings, 2 replies; 5+ messages in thread
From: Florian Weimer @ 2020-05-08 18:26 UTC (permalink / raw)
  To: libc-alpha

As part of my work on bug 23249, I looked at how the dynamic loader
finds and selects alternative implementations of shared objects based on
hardware capabilities (hwcaps).  This message intends to capture my
understanding of this feature.

The implementation largely lives in elf/dl-hwcaps.c, dl-procinfo.h, and
elf/dl-load.c, plus elf/ldconfig.c and elf/dl-cache.c for ld.so.cache.  On
typical targets, the kernel provides hardware capability bits via the
AT_HWCAP auxiliary vector entry, and a platform string via AT_PLATFORM.

# Non-cache lookups

For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
guess pathnames.  It does not use readdir.  The supported hwcap bits
(usually supplied by the kernel via AT_HWCAP) are filtered with the
compile-time mask HWCAP_IMPORTANT.  Each bit corresponds to a
subdirectory name, as returned by _dl_hwcap_string.  Two fake hwcap bits
and corresponding subdirectories are added by the loader: the TLS bit with
the directory name "tls", and the platform bit, with the AT_PLATFORM
string provided by the kernel as the directory name.  The dynamic loader
then computes the power set of those directory names.  The full paths
are constructed by concatenating the subdirectory names of the set bits,
starting with "tls", the AT_PLATFORM directory, and then the active real
hwcap bits, going from more significant to less significant bits.  The
power set is enumerated starting with all bits set, and then proceeds to
remove bits according to an integer decrementing pattern.

(Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
it is no longer used in practice since the nosegneg removal on i686.)

There is no sysdeps override for this search path construction.  An
architecture can only affect how the hwcap bits are computed, to which
strings individual bits correspond, and what the platform subdirectory is
called.  The two fake bits (TLS and platform) and the power-set
construction always apply.

I'm using s390x as an example now because the situation is fairly simple
compared to other architectures and I have it around for testing.  I
think it's broadly representative of what other architectures do.

On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
vx and later bits), the search paths look like this:

  tls/zEC12/dfp/eimm/ldisp/zarch
  tls/zEC12/dfp/eimm/ldisp
  tls/zEC12/dfp/eimm/zarch
  tls/zEC12/dfp/eimm
  tls/zEC12/dfp/ldisp/zarch
  tls/zEC12/dfp/ldisp
  tls/zEC12/dfp/zarch
  tls/zEC12/dfp
  tls/zEC12/eimm/ldisp/zarch
  tls/zEC12/eimm/ldisp
  tls/zEC12/eimm/zarch
  tls/zEC12/eimm
  tls/zEC12/ldisp/zarch
  tls/zEC12/ldisp
  tls/zEC12/zarch
  tls/zEC12
  tls/dfp/eimm/ldisp/zarch
  tls/dfp/eimm/ldisp
  tls/dfp/eimm/zarch
  tls/dfp/eimm
  tls/dfp/ldisp/zarch
  tls/dfp/ldisp
  tls/dfp/zarch
  tls/dfp
  tls/eimm/ldisp/zarch
  tls/eimm/ldisp
  tls/eimm/zarch
  tls/eimm
  tls/ldisp/zarch
  tls/ldisp
  tls/zarch
  tls
  zEC12/dfp/eimm/ldisp/zarch
  zEC12/dfp/eimm/ldisp
  zEC12/dfp/eimm/zarch
  zEC12/dfp/eimm
  zEC12/dfp/ldisp/zarch
  zEC12/dfp/ldisp
  zEC12/dfp/zarch
  zEC12/dfp
  zEC12/eimm/ldisp/zarch
  zEC12/eimm/ldisp
  zEC12/eimm/zarch
  zEC12/eimm
  zEC12/ldisp/zarch
  zEC12/ldisp
  zEC12/zarch
  zEC12
  dfp/eimm/ldisp/zarch
  dfp/eimm/ldisp
  dfp/eimm/zarch
  dfp/eimm
  dfp/ldisp/zarch
  dfp/ldisp
  dfp/zarch
  dfp
  eimm/ldisp/zarch
  eimm/ldisp
  eimm/zarch
  eimm
  ldisp/zarch
  ldisp
  zarch

And finally the actual search path entry is searched.  On a z13 machine,
there would be one more bit (vx), and the platform directory has a
different name, "z13".  So the first path is
tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
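
The enumeration can be reproduced with a short sketch.  This is Python for
illustration only, not the loader's C code.  The function name
hwcap_search_subdirs is made up, and the input is assumed to be already
filtered by HWCAP_IMPORTANT and ordered from the most significant to the
least significant bit.

```python
def hwcap_search_subdirs(platform, hwcap_names):
    """Enumerate hwcap subdirectories in probe order (illustrative only).

    platform:    AT_PLATFORM string, e.g. "zEC12".
    hwcap_names: names of the active hwcap bits, most significant first.
    """
    # "tls" and the platform act as the two most significant fake bits.
    names = ["tls", platform] + list(hwcap_names)
    n = len(names)
    paths = []
    # Start with all bits set and count down like an integer.  Mask 0 is
    # the plain search path entry itself and is not listed here.
    for mask in range((1 << n) - 1, 0, -1):
        parts = [name for i, name in enumerate(names)
                 if mask & (1 << (n - 1 - i))]
        paths.append("/".join(parts))
    return paths
```

Calling hwcap_search_subdirs("zEC12", ["dfp", "eimm", "ldisp", "zarch"])
gives the 63 subdirectories listed above.  Adding "vx" in front of "dfp"
gives 127, so with the plain search path entry the count goes from 64 to
128 lookups.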

This scheme allows a library developer to require any combination of the
HWCAP_IMPORTANT bits for an optimized object, by placing it in the
appropriate subdirectory.  But it does not scale well as more bits are
added.  There is some path blacklisting in elf/dl-load.c, so this is not
as bad as it looks here, but the first lookup in a library search path
entry will consult all the directories (i.e., there is no blacklisting
of say the tls/ subtree if the tls subdirectory does not exist).

# Cache lookups

ldconfig uses a completely different way to locate objects in hwcaps
subdirectories.  To build the cache, it lists directories, and if in
those directories, it encounters a name that corresponds to a hwcap
directory name or a (hard-coded) platform name, it queues this
subdirectory for later listing, descending further in the tree along
these paths.  This means that paths like those quoted above are also
supported by ldconfig, except that it is more lenient and does not
enforce any particular order on hwcap names.

Only the second cache format (involving struct file_entry_new) can
represent libraries in hwcaps subdirectories.  There is a single
uint64_t field which identifies the implied hardware capabilities.
Regular hwcap bits are represented as themselves (after converting from
the subdirectory name to the bit value), and all the bits are OR-ed
together.  If a platform directory is encountered in the path, a number
is computed using _dl_string_platform from its name, and this number is
then used as a fake bit index (outside of the supported real hwcap bits,
see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
the hwcap field in the cache.
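
As a rough illustration of how such a field is assembled, here is a
Python sketch.  The bit positions, the platform numbering, and the value
of FIRST_PLATFORM are placeholders and not the real s390x values; only
the shape of the computation follows the description above.

```python
# All three tables are hypothetical stand-ins for what _dl_string_hwcap,
# _dl_string_platform and _DL_FIRST_PLATFORM provide in glibc.
HWCAP_BIT = {"zarch": 1, "ldisp": 2, "eimm": 3, "dfp": 4}
PLATFORM_INDEX = {"zEC12": 0, "z13": 1}
FIRST_PLATFORM = 48


def cache_hwcap_field(subdir_path):
    """Compute the 64-bit hwcap field for a path such as "zEC12/dfp"."""
    field = 0
    for component in subdir_path.split("/"):
        if component in HWCAP_BIT:
            # Regular hwcap bits are represented as themselves.
            field |= 1 << HWCAP_BIT[component]
        elif component in PLATFORM_INDEX:
            # The platform becomes a fake bit above the real hwcap bits.
            field |= 1 << (FIRST_PLATFORM + PLATFORM_INDEX[component])
        else:
            raise ValueError("not a hwcap or platform name: " + component)
    return field
```

Because the bits are simply OR-ed together, "dfp/eimm" and "eimm/dfp"
produce the same field, which matches ldconfig not enforcing an order on
the directory names.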

ldconfig tries to sort entries for the same soname according to some
heuristic (see the compare function in elf/cache.c): hwcap entries with
more bits generally come first.

At run time, the dynamic loader finds all matching path entries for a
soname in the cache, and then picks the first entry that matches the
hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).

# Discussion

I think there are a couple of problems with this approach.  One subtle
problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
But I think there are other issues.

The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
calls, even with the blacklisting in place.

The heuristic for choosing the implementation is not very obvious.  Of
course, with bitmasks of opaque CPU features, there is no generic
winner.  For example, on s390x z13, a library in a subdirectory
ldisp/zarch would be preferred over one in vx because the former has
more matching hwcap bits and comes earlier in the ld.so.cache sort order
(but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
vx (the z13 vector capability) should imply the other capabilities—the
library was just placed into the wrong directory.
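
This preference can be reproduced with a toy model of the cache ordering
and matching.  It is a simplification: the real compare function in
elf/cache.c has further tie-breakers, the bit values below are invented,
and "libfoo.so" is a made-up name.

```python
# Hypothetical bit assignments, for illustration only.
ZARCH, LDISP, VX = 1 << 0, 1 << 1, 1 << 2


def popcount(value):
    return bin(value).count("1")


def sort_cache_entries(entries):
    """Order (path, required_bits) pairs so more bits come first."""
    return sorted(entries, key=lambda e: popcount(e[1]), reverse=True)


def pick_entry(entries, available_bits):
    """Return the first sorted entry whose required bits are all available."""
    for path, required in sort_cache_entries(entries):
        if (required & ~available_bits) == 0:
            return path
    return None


ENTRIES = [
    ("vx/libfoo.so", VX),
    ("ldisp/zarch/libfoo.so", LDISP | ZARCH),
    ("libfoo.so", 0),
]
```

With all three bits available, pick_entry(ENTRIES, ZARCH | LDISP | VX)
returns the ldisp/zarch copy, because two required bits sort ahead of
one, even though vx is the newer capability.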

The most tempting choice for such optimizations is the platform
directory for architectures that have it ("zEC12" in the example above).
But the problem is that if the system administrator upgrades the machine
to z13, the directory name would change to "z13", and the optimized code
would no longer be loaded!  (Presumably, the zEC12-optimized code is
still better than the generic code on z13.  The same issue would apply
to z13-optimized code vs z14-optimized code.)  This would be a reason
not to use AT_PLATFORM from the kernel even on s390x.

There is another reason to distrust AT_PLATFORM: virtualization.  If
AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
might not match the actual hwcap bits available to the guest because
they are subject to separate knobs.

The complexity of the trade-offs here suggests to me that we (the GNU
toolchain as a whole) should try to pre-define names for collections of
hwcap flags, so that we can get a monotonic progression of features
under a clearly defined name.  This will allow programmers to optimize
for subsequent microarchitecture revisions.  So instead of "x86_64" we
would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
so on, more or less mirroring the "zEC12", "z13" &c platform directories
on s390x, even though the kernel does not provide such platform names on
x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
cases, it would make sense to use *earlier* platform names as a fallback
(so that a z15 system would also use z14- and z13-optimized libraries if
available).  This would mean that the dynamic loader would need to know
more about these relationships.
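
A minimal sketch of such a relationship table, in Python.  The list and
the function name platform_fallbacks are hypothetical; nothing like this
exists in the loader today, and the ordering is simply the machine
generations mentioned in this message.

```python
# Oldest first.  Hypothetical table; the loader has no such data today.
PLATFORM_ORDER = ["zEC12", "z13", "z14", "z15"]


def platform_fallbacks(current):
    """Return directory names to try, best match first, oldest last."""
    index = PLATFORM_ORDER.index(current)
    # Walk from the current platform back to the oldest known one.
    return PLATFORM_ORDER[index::-1]
```

For a z15 system this gives z15, z14, z13, zEC12, so z13-optimized
libraries would still be picked up if nothing newer is installed.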

The current hwcap construction is not really suited to that.
ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
mandatory power set construction.  Even aggressive tree pruning will
still see it make at least one system call per search path entry and
hwcap.  So I don't think we can use this mechanism for future changes.

The way we store hwcap bits in ld.so.cache is also not ideal.  It would
be nice if ldconfig could be hwcap-agnostic, not having to care at all
about the correspondence between subdirectory name and hwcap bit (or
AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
while still maintaining ld.so.cache backwards compatibility (basically,
set the currently unused bit 62 on those new hwcap entries, so that
older loaders ignore them because of a missed hwcap requirement).
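
The compatibility argument can be sketched as follows, in Python.  It
assumes that older loaders skip any cache entry that requires a bit
outside the set they consider supported, and that they never treat bit
62 as supported.  The names are made up for the example.

```python
NEW_SCHEME_BIT = 1 << 62  # marker bit proposed above for new-style entries


def old_loader_accepts(entry_hwcap, supported_mask):
    """Model of an older loader: reject entries with unsupported bits."""
    return (entry_hwcap & ~supported_mask) == 0
```

Any entry carrying NEW_SCHEME_BIT fails this check on an older loader, no
matter what the remaining bits mean, so the rest of the field is free to
be reinterpreted by newer loaders.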

If we put new hwcap subdirectories under a *single* subdirectory (say
"glibc-hwcaps"), then we could prune paths more aggressively, and use
the new scheme in parallel to the old without much impact on performance
until these subdirectories are actually used.  ldconfig could also treat
the presence of a glibc-hwcaps subdirectory as an instruction to
descend into each subdirectory of the glibc-hwcaps directory, but not
further, and store the names of those subdirectories in ld.so.cache, so
that the loader can match them at run time.
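
One possible reading of the loader side of this, as a Python sketch.  The
function name, the preference list, and the exact probing order are
assumptions; the only point is that a missing glibc-hwcaps directory
costs a single check.

```python
import os


def find_in_glibc_hwcaps(search_dir, soname, preferred_names):
    """Look for soname under search_dir/glibc-hwcaps/<name>/.

    preferred_names is a hypothetical preference list, best first.
    Returns the path of the first match, or None.
    """
    base = os.path.join(search_dir, "glibc-hwcaps")
    # One check prunes the whole subtree if the directory is absent.
    if not os.path.isdir(base):
        return None
    for name in preferred_names:
        candidate = os.path.join(base, name, soname)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If this returns None, the caller would fall back to whatever lookup it
used before.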

In any case, I do not see a way to make good progress on bug 23249 (the
"haswell" platform subdirectory issue on various x86-64 variants)
without tackling some of these issues.

Thoughts?

Thanks,
Florian



* Re: hwcaps subdirectory selection in the dynamic loader
  2020-05-08 18:26 hwcaps subdirectory selection in the dynamic loader Florian Weimer
@ 2020-05-12 15:23 ` Stefan Liebler
  2020-05-13 16:47   ` Florian Weimer
  2020-05-20 12:51 ` Adhemerval Zanella
  1 sibling, 1 reply; 5+ messages in thread
From: Stefan Liebler @ 2020-05-12 15:23 UTC (permalink / raw)
  To: libc-alpha

On 5/8/20 8:26 PM, Florian Weimer via Libc-alpha wrote:
> As part of my work on bug 23249, I looked at how the dynamic loader
> finds and selects alternative implementations of shared objects based on
> hardware capabilities (hwcaps).  This message intends to capture my
> understanding of this feature.
> 
> The implementation largely lives in elf/dl-hwcaps.c, dl-procinfo.h, and
> elf/dl-load.c, plus elf/ldconfig.c and elf/dl-cache.c for ld.so.cache.  On
> typical targets, the kernel provides hardware capability bits via the
> AT_HWCAP auxiliary vector entry, and a platform string via AT_PLATFORM.
> 
> # Non-cache lookups
> 
> For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
> guess pathnames.  It does not use readdir.  The supported hwcap bits
> (usually supplied by the kernel via AT_HWCAP) are filtered with the
> compile-time mask HWCAP_IMPORTANT.  Each bit corresponds to a
> subdirectory name, as returned by _dl_hwcap_string.  Two fake hwcap bits
> and corresponding subdirectories are added by the loader: the TLS bit with
> the directory name "tls", and the platform bit, with the AT_PLATFORM
> string provided by the kernel as the directory name.  The dynamic loader
> then computes the power set of those directory names.  The full paths
> are constructed by concatenating the subdirectory names of the set bits,
> starting with "tls", the AT_PLATFORM directory, and then the active real
> hwcap bits, going from more significant to less significant bits.  The
> power set is enumerated starting with all bits set, and then proceeds to
> remove bits according to an integer decrementing pattern.
> 
> (Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
> it is no longer used in practice since the nosegneg removal on i686.)
> 
> There is no sysdeps override for this search path construction.  An
> architecture can only affect how the hwcap bits are computed, to which
> strings individual bits correspond, and what the platform subdirectory is
> called.  The two fake bits (TLS and platform) and the power-set
> construction always apply.
> 
> I'm using s390x as an example now because the situation is fairly simple
> compared to other architectures and I have it around for testing.  I
> think it's broadly representative of what other architectures do.
> 
> On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
> vx and later bits), the search paths look like this:
> 
>    tls/zEC12/dfp/eimm/ldisp/zarch
>    tls/zEC12/dfp/eimm/ldisp
>    tls/zEC12/dfp/eimm/zarch
>    tls/zEC12/dfp/eimm
>    tls/zEC12/dfp/ldisp/zarch
>    tls/zEC12/dfp/ldisp
>    tls/zEC12/dfp/zarch
>    tls/zEC12/dfp
>    tls/zEC12/eimm/ldisp/zarch
>    tls/zEC12/eimm/ldisp
>    tls/zEC12/eimm/zarch
>    tls/zEC12/eimm
>    tls/zEC12/ldisp/zarch
>    tls/zEC12/ldisp
>    tls/zEC12/zarch
>    tls/zEC12
>    tls/dfp/eimm/ldisp/zarch
>    tls/dfp/eimm/ldisp
>    tls/dfp/eimm/zarch
>    tls/dfp/eimm
>    tls/dfp/ldisp/zarch
>    tls/dfp/ldisp
>    tls/dfp/zarch
>    tls/dfp
>    tls/eimm/ldisp/zarch
>    tls/eimm/ldisp
>    tls/eimm/zarch
>    tls/eimm
>    tls/ldisp/zarch
>    tls/ldisp
>    tls/zarch
>    tls
>    zEC12/dfp/eimm/ldisp/zarch
>    zEC12/dfp/eimm/ldisp
>    zEC12/dfp/eimm/zarch
>    zEC12/dfp/eimm
>    zEC12/dfp/ldisp/zarch
>    zEC12/dfp/ldisp
>    zEC12/dfp/zarch
>    zEC12/dfp
>    zEC12/eimm/ldisp/zarch
>    zEC12/eimm/ldisp
>    zEC12/eimm/zarch
>    zEC12/eimm
>    zEC12/ldisp/zarch
>    zEC12/ldisp
>    zEC12/zarch
>    zEC12
>    dfp/eimm/ldisp/zarch
>    dfp/eimm/ldisp
>    dfp/eimm/zarch
>    dfp/eimm
>    dfp/ldisp/zarch
>    dfp/ldisp
>    dfp/zarch
>    dfp
>    eimm/ldisp/zarch
>    eimm/ldisp
>    eimm/zarch
>    eimm
>    ldisp/zarch
>    ldisp
>    zarch
> 
> And finally the actual search path entry is searched.  On a z13 machine,
> there would be one more bit (vx), and the platform directory has a
> different name, "z13".  So the first path is
> tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
> 
> This scheme allows a library developer to require any combination of the
> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
> appropriate subdirectory.  But it does not scale well as more bits are
> added.  There is some path blacklisting in elf/dl-load.c, so this is not
> as bad as it looks here, but the first lookup in a library search path
> entry will consult all the directories (i.e., there is no blacklisting
> of say the tls/ subtree if the tls subdirectory does not exist).
Would it be possible to blacklist the remaining tls/... paths in this case?
> 
> # Cache lookups
> 
> ldconfig uses a completely different way to locate objects in hwcaps
> subdirectories.  To build the cache, it lists directories, and if in
> those directories, it encounters a name that corresponds to a hwcap
> directory name or a (hard-coded) platform name, it queues this
> subdirectory for later listing, descending further in the tree along
> these paths.  This means that paths like those quoted above are also
> supported by ldconfig, except that it is more lenient and does not
> enforce any particular order on hwcap names.
> 
> Only the second cache format (involving struct file_entry_new) can
> represent libraries in hwcaps subdirectories.  There is a single
> uint64_t field which identifies the implied hardware capabilities.
> Regular hwcap bits are represented as themselves (after converting from
> the subdirectory name to the bit value), and all the bits are OR-ed
> together.  If a platform directory is encountered in the path, a number
> is computed using _dl_string_platform from its name, and this number is
> then used as a fake bit index (outside of the supported real hwcap bits,
> see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
> the hwcap field in the cache.
> 
> ldconfig tries to sort entries for the same soname according to some
> heuristic (see the compare function in elf/cache.c): hwcap entries with
> more bits generally come first.
> 
> At run time, the dynamic loader finds all matching path entries for a
> soname in the cache, and then picks the first entry that matches the
> hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
> 
> # Discussion
> 
> I think there are a couple of problems with this approach.  One subtle
> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
> But I think there are other issues.
> 
> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
> calls, even with the blacklisting in place.
Yes, that's true. And I assume most of the paths are never used.

> 
> The heuristic for choosing the implementation is not very obvious.  Of
> course, with bitmasks of opaque CPU features, there is no generic
> winner.  For example, on s390x z13, a library in a subdirectory
> ldisp/zarch would be preferred over one in vx because the former has
> more matching hwcap bits and comes earlier in the ld.so.cache sort order
> (but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
> vx (the z13 vector capability) should imply the other capabilities—the
> library was just placed into the wrong directory.
> 
> The most tempting choice for such optimizations is the platform
> directory for architectures that have it ("zEC12" in the example above).
> But the problem is that if the system administrator upgrades the machine
> to z13, the directory name would change to "z13", and the optimized code
> would no longer be loaded!  (Presumably, the zEC12-optimized code is
> still better than the generic code on z13.  The same issue would apply
> to z13-optimized code vs z14-optimized code.)  This would be a reason
> not to use AT_PLATFORM from the kernel even on s390x.
> 
> There is another reason to distrust AT_PLATFORM: virtualization.  If
> AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
> might not match the actual hwcap bits available to the guest because
> they are subject to separate knobs.
> 
> The complexity of the trade-offs here suggests to me that we (the GNU
> toolchain as a whole) should try to pre-define names for collections of
> hwcap flags, so that we can get a monotonic progression of features
> under a clearly defined name.  This will allow programmers to optimize
> for subsequent microarchitecture revisions.  So instead of "x86_64" we
> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
> so on, more or less mirroring the "zEC12", "z13" &c platform directories
> on s390x, even though the kernel does not provide such platform names on
> x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
> cases, it would make sense to use *earlier* platform names as a fallback
> (so that a z15 system would also use z14- and z13-optimized libraries if
> available).  This would mean that the dynamic loader would need to know
> more about these relationships.
Even on s390x, it's not really simple. E.g. platform "z13" does not 
automatically mean that hwcap "vx" is available (e.g. if you are running 
as zVM guest where an old zVM version does not support vx). But if hwcap 
"vx" is available, it is at least platform "z13".

As far as I know, the kernel currently provides "z900" as AT_PLATFORM 
for new unknown machines instead of the latest known platform string, 
e.g. "z15". But there could be hwcap flags for newer machines than 
"z900" (e.g. hwcap "vx"). Would the loader also recognize this and test 
z13 and all the former platforms?
> 
> The current hwcap construction is not really suited to that.
> ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
> mandatory power set construction.  Even aggressive tree pruning will
> still see it make at least one system call per search path entry and
> hwcap.  So I don't think we can use this mechanism for future changes.
> 
> The way we store hwcap bits in ld.so.cache is also not ideal.  It would
> be nice if ldconfig could be hwcap-agnostic, not having to care at all
> about the correspondence between subdirectory name and hwcap bit (or
> AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
> while still maintaining ld.so.cache backwards compatibility (basically,
> set the currently unused bit 62 on those new hwcap entries, so that
> older loaders ignore them because of a missed hwcap requirement).
> 
> If we put new hwcap subdirectories under a *single* subdirectory (say
> "glibc-hwcaps"), then we could prune paths more aggressively, and use
> the new scheme in parallel to the old without much impact on performance
> until these subdirectories are actually used.  ldconfig could also treat
> the presence of a glibc-hwcaps subdirectory as an instruction to
> descend into each subdirectory of the glibc-hwcaps directory, but not
> further, and store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.
This means, the LD_LIBRARY_PATH/non-cache case would first try all 
directories inside glibc-hwcaps and if no suitable library was found, 
the current approach is used?

Are "new hwcaps" also allowed in the current approach or are those only 
allowed in the "glibc-hwcaps" directory?

Is nesting the "new hwcaps" allowed in the "glibc-hwcaps" directory and, if
yes, which heuristic for choosing the library is used? Compare to your
example above: "ldisp/zarch" vs "vx".

"store the names of those subdirectories in ld.so.cache, so
that the loader can match them at run time.": This means if a library is 
placed in a new subdirectory without calling ldconfig again, this 
library is not found?
> 
> In any case, I do not see a way to make good progress on bug 23249 (the
> "haswell" platform subdirectory issue on various x86-64 variants)
> without tackling some of these issues.
> 
> Thoughts?
> 
> Thanks,
> Florian
> 



* Re: hwcaps subdirectory selection in the dynamic loader
  2020-05-12 15:23 ` Stefan Liebler
@ 2020-05-13 16:47   ` Florian Weimer
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Weimer @ 2020-05-13 16:47 UTC (permalink / raw)
  To: Stefan Liebler via Libc-alpha; +Cc: Stefan Liebler

* Stefan Liebler via Libc-alpha:

>> This scheme allows a library developer to require any combination of the
>> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
>> appropriate subdirectory.  But it does not scale well as more bits are
>> added.  There is some path blacklisting in elf/dl-load.c, so this is not
>> as bad as it looks here, but the first lookup in a library search path
>> entry will consult all the directories (i.e., there is no blacklisting
>> of say the tls/ subtree if the tls subdirectory does not exist).

> Would it be possible to blacklist the remaining tls/... paths in this
> case?

By changing how we do the probing, yes.  It's not too hard to implement
this, but reviewing the patch will be difficult.
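
Purely as an illustration of the idea, and not a proposed patch, here is
a Python sketch of probing with prefix pruning.  The function name and
the dir_exists callback (standing in for one system call per check) are
made up.

```python
def probe_with_pruning(candidates, dir_exists):
    """Return the candidate directories that exist, skipping dead subtrees.

    candidates: relative paths in probe order, e.g. "tls/zEC12/dfp".
    dir_exists: callback that checks one directory.
    """
    missing = set()   # prefixes known not to exist
    present = set()   # prefixes known to exist
    found = []
    for path in candidates:
        parts = path.split("/")
        prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
        # Skip everything below a directory already known to be missing.
        if any(p in missing for p in prefixes):
            continue
        exists = True
        for p in prefixes:
            if p in present:
                continue
            if dir_exists(p):
                present.add(p)
            else:
                missing.add(p)
                exists = False
                break
        if exists:
            found.append(path)
    return found
```

If "tls" does not exist, the first candidate triggers one failed check
and every later tls/... candidate is skipped without touching the file
system.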

>> The complexity of the trade-offs here suggests to me that we (the GNU
>> toolchain as a whole) should try to pre-define names for collections of
>> hwcap flags, so that we can get a monotonic progression of features
>> under a clearly defined name.  This will allow programmers to optimize
>> for subsequent microarchitecture revisions.  So instead of "x86_64" we
>> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
>> so on, more or less mirroring the "zEC12", "z13" &c platform directories
>> on s390x, even though the kernel does not provide such platform names on
>> x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
>> cases, it would make sense to use *earlier* platform names as a fallback
>> (so that a z15 system would also use z14- and z13-optimized libraries if
>> available).  This would mean that the dynamic loader would need to know
>> more about these relationships.

> Even on s390x, it's not really simple. E.g. platform "z13" does not
> automatically mean that hwcap "vx" is available (e.g. if you are
> running as zVM guest where an old zVM version does not support
> vx). But if hwcap "vx" is available, it is at least platform "z13".

In that case, it would probably make sense to select the "z13" directory
once the "vx" HWCAP is present, for the new scheme at least.

(But if the hwcaps can be masked individually by virtualization, we
should probably probe them all, to cope with weird machine models, and
not present something as z13 if it would crash on valid userspace z13
code.)

> As far as I know, the kernel currently provides "z900" as AT_PLATFORM
> for new unknown machines instead of the latest known platform string,
> e.g. "z15". But there could be hwcap flags for newer machines than
> "z900" (e.g. hwcap "vx"). Would the loader also recognize this and
> test z13 and all the former platforms?

The loader needs to know the string for the hwcap flag.  These real
hwcap flags are not covered by the NEED_DL_SYSINFO_DSO mechanism which
allows in theory to receive named flags from the kernel.  (The
_DL_FIRST_EXTRA shift keeps those flags completely separate from the
hwcap flags that s390x uses.)

With the current scheme, based on your vx vs z13 comment, it would be
possible to get the expected behavior (z13 code selected for z13 and
later) if the z13 code is installed in the "vx" subdirectory.  A z14
would get the "z14" AT_PLATFORM directory, but still have the vx hwcap, so
the dynamic loader probes those paths.

>> If we put new hwcap subdirectories under a *single* subdirectory (say
>> "glibc-hwcaps"), then we could prune paths more aggressively, and use
>> the new scheme in parallel to the old without much impact on performance
>> until these subdirectories are actually used.  ldconfig could also treat
>> the presence of a glibc-hwcaps subdirectory as an instruction to
>> descend into each subdirectory of the glibc-hwcaps directory, but not
>> further, and store the names of those subdirectories in ld.so.cache, so
>> that the loader can match them at run time.

> This means, the LD_LIBRARY_PATH/non-cache case would first try all
> directories inside glibc-hwcaps and if no suitable library was found,
> the current approach is used?

Yes, that's the plan.  We would check for the existence of glibc-hwcaps
first, and if that's not there, we wouldn't probe further.  (Similar to
what you suggested about "tls/" earlier.)  So the incremental cost is
manageable, I think.

> Are "new hwcaps" also allowed in the current approach or are those
> only allowed in the "glibc-hwcaps" directory?

Sorry, I don't understand this question.

We can add more hwcaps under the current approach, but without the more
aggressive shortcuts (which would have to be implemented separately),
each new hwcap flag doubles the number of directories that are probed.

> Is nesting the "new hwcaps" allowed in the "glibc-hwcaps" directory and, if
> yes, which heuristic for choosing the library is used? Compare to
> your example above: "ldisp/zarch" vs "vx".

No nesting would be allowed.  (Or at least libraries would not be found
using the hwcaps selection mechanism.)  For the new mechanism, we would
have to come up with some total preference ordering, with sensible
directory names and selection criteria (and matching -mcpu=/-march=
flags for GCC, for good developer experience).  We have that today as
well, of course, but the directory probe order is non-obvious.

> "store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.": This means if a library
> is placed in a new subdirectory without calling ldconfig again, this
> library is not found?

Yes, if the soname is present in the cache (and can be used, subject to
hwcaps conditions etc.), ld.so will not chase for it along the search
path.  If the soname is completely missing, ld.so will do path probing.

This is how things are handled today: ld.so does not check if the cache
is actually up-to-date.  I do not plan to change this.
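
In sketch form (Python, with made-up names, and leaving out the hwcaps
conditions), the decision looks like this:

```python
def resolve_soname(soname, cache, probe_search_path):
    """Model of the lookup order described above.

    cache:             dict mapping soname to a usable path (simplified).
    probe_search_path: callback that probes directories for the soname.
    """
    if soname in cache:
        # The cache entry wins; its freshness is never checked.
        return cache[soname]
    # Only a soname missing from the cache triggers path probing.
    return probe_search_path(soname)
```

So a newer copy dropped into a hwcaps subdirectory stays invisible until
ldconfig is run again, as long as the cache already lists that soname.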

Thanks,
Florian



* Re: hwcaps subdirectory selection in the dynamic loader
  2020-05-08 18:26 hwcaps subdirectory selection in the dynamic loader Florian Weimer
  2020-05-12 15:23 ` Stefan Liebler
@ 2020-05-20 12:51 ` Adhemerval Zanella
  2020-05-28 10:54   ` Florian Weimer
  1 sibling, 1 reply; 5+ messages in thread
From: Adhemerval Zanella @ 2020-05-20 12:51 UTC (permalink / raw)
  To: libc-alpha



On 08/05/2020 15:26, Florian Weimer via Libc-alpha wrote:
> As part of my work on bug 23249, I looked at how the dynamic loader
> finds and selects alternative implementations of shared objects based on
> hardware capabilities (hwcaps).  This message intends to capture my
> understanding of this feature.

This is very nice code digging; should we add it as a wiki entry?

> 
> The implementation largely lives in elf/dl-hwcaps.c, dl-procinfo.h, and
> elf/dl-load.c, plus elf/ldconfig.c and elf/dl-cache.c for ld.so.cache.  On
> typical targets, the kernel provides hardware capability bits via the
> AT_HWCAP auxiliary vector entry, and a platform string via AT_PLATFORM.

By typical do you mean all but x86 that uses a different mechanism
(cpuid through sysdeps/x86/cpu-features.c)?

So it seems that currently we have x86, i686, powerpc32, powerpc64,
powerpc64le, sparc32, sparc64, s390, s390x, and aarch64 that use the
hwcap subdirectories, right?

> 
> # Non-cache lookups
> 
> For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
> guess pathnames.  It does not use readdir.  The supported hwcap bits
> (usually supplied by the kernel via AT_HWCAP) are filtered with the
> compile-time mask HWCAP_IMPORTANT.  Each bit corresponds to a
> subdirectory name, as returned by _dl_hwcap_string.  Two fake hwcap bits
> and corresponding subdirectories are added by the loader: the TLS bit with
> the directory name "tls", and the platform bit, with the AT_PLATFORM
> string provided by the kernel as the directory name.  The dynamic loader
> then computes the power set of those directory names.  The full paths
> are constructed by concatenating the subdirectory names of the set bits,
> starting with "tls", the AT_PLATFORM directory, and then the active real
> hwcap bits, going from more significant to less significant bits.  The
> power set is enumerated starting with all bits set, and then proceeds to
> remove bits according to an integer decrementing pattern.
> 
> (Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
> it is no longer used in practice since the nosegneg removal on i686.)

And I noted you sent a patch to remove it as well.

> 
> There is no sysdeps override for this search path construction.  An
> architecture can only affect how the hwcap bits are computed, to which
> strings individual bits correspond, and what the platform subdirectory is
> called.  The two fake bits (TLS and platform) and the power-set
> construction always apply.

The first question is whether it still makes sense to provide the fake TLS
bit to add "tls" in path construction.

> 
> I'm using s390x as an example now because the situation is fairly simple
> compared to other architectures and I have it around for testing.  I
> think it's broadly representative of what other architectures do.
> 
> On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
> vx and later bits), the search paths look like this:
> 
>   tls/zEC12/dfp/eimm/ldisp/zarch
>   tls/zEC12/dfp/eimm/ldisp
>   tls/zEC12/dfp/eimm/zarch
>   tls/zEC12/dfp/eimm
>   tls/zEC12/dfp/ldisp/zarch
>   tls/zEC12/dfp/ldisp
>   tls/zEC12/dfp/zarch
>   tls/zEC12/dfp
>   tls/zEC12/eimm/ldisp/zarch
>   tls/zEC12/eimm/ldisp
>   tls/zEC12/eimm/zarch
>   tls/zEC12/eimm
>   tls/zEC12/ldisp/zarch
>   tls/zEC12/ldisp
>   tls/zEC12/zarch
>   tls/zEC12
>   tls/dfp/eimm/ldisp/zarch
>   tls/dfp/eimm/ldisp
>   tls/dfp/eimm/zarch
>   tls/dfp/eimm
>   tls/dfp/ldisp/zarch
>   tls/dfp/ldisp
>   tls/dfp/zarch
>   tls/dfp
>   tls/eimm/ldisp/zarch
>   tls/eimm/ldisp
>   tls/eimm/zarch
>   tls/eimm
>   tls/ldisp/zarch
>   tls/ldisp
>   tls/zarch
>   tls
>   zEC12/dfp/eimm/ldisp/zarch
>   zEC12/dfp/eimm/ldisp
>   zEC12/dfp/eimm/zarch
>   zEC12/dfp/eimm
>   zEC12/dfp/ldisp/zarch
>   zEC12/dfp/ldisp
>   zEC12/dfp/zarch
>   zEC12/dfp
>   zEC12/eimm/ldisp/zarch
>   zEC12/eimm/ldisp
>   zEC12/eimm/zarch
>   zEC12/eimm
>   zEC12/ldisp/zarch
>   zEC12/ldisp
>   zEC12/zarch
>   zEC12
>   dfp/eimm/ldisp/zarch
>   dfp/eimm/ldisp
>   dfp/eimm/zarch
>   dfp/eimm
>   dfp/ldisp/zarch
>   dfp/ldisp
>   dfp/zarch
>   dfp
>   eimm/ldisp/zarch
>   eimm/ldisp
>   eimm/zarch
>   eimm
>   ldisp/zarch
>   ldisp
>   zarch
> 
> And finally the actual search path entry is searched.  On a z13 machine,
> there would be one more bit (vx), and the platform directory has a
> different name, "z13".  So the first path is
> tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
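
The enumeration order described above can be modelled in a few lines.
This is only a sketch in Python, not the code in elf/dl-hwcaps.c; the
function name hwcap_subdirs is made up for illustration, and the input
list is assumed to be ordered "tls", platform name, then the hwcap
names from the most significant bit down.

```python
def hwcap_subdirs(names):
    """Return the subdirectories probed for one search path entry.

    names is ordered by priority: "tls", the platform name, then the
    hwcap names from the most to the least significant bit.
    """
    n = len(names)
    paths = []
    # Start with all bits set and decrement.  Mask 0 (the plain search
    # path entry itself) is left to the caller.
    for mask in range((1 << n) - 1, 0, -1):
        parts = [name for i, name in enumerate(names)
                 if mask & (1 << (n - 1 - i))]
        paths.append("/".join(parts))
    return paths


zec12 = hwcap_subdirs(["tls", "zEC12", "dfp", "eimm", "ldisp", "zarch"])
z13 = hwcap_subdirs(["tls", "z13", "vx", "dfp", "eimm", "ldisp", "zarch"])
print(zec12[0], zec12[-1], len(zec12), len(z13))
```

Counting the plain search path entry as well, this model gives 64
lookups per search path entry on zEC12 and 128 on z13.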
> 
> This scheme allows a library developer to require any combination of the
> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
> appropriate subdirectory.  But it does not scale well as more bits are
> added.  There is some path blacklisting in elf/dl-load.c, so this is not
> as bad as it looks here, but the first lookup in a library search path
> entry will consult all the directories (i.e., there is no blacklisting
> of say the tls/ subtree if the tls subdirectory does not exist).
> 
> # Cache lookups
> 
> ldconfig uses a completely different way to locate objects in hwcaps
> subdirectories.  To build the cache, it lists directories, and if in
> those directories, it encounters a name that corresponds to a hwcap
> directory name or a (hard-coded) platform name, it queues this
> subdirectory for later listing, descending further in the tree along
> these paths.  This means that paths like those quoted above are also
> supported by ldconfig, except that it is more lenient and does not
> enforce any particular order on hwcap names.
> 
> Only the second cache format (involving struct file_entry_new) can
> represent libraries in hwcaps subdirectories.  There is a single
> uint64_t field which identifies the implied hardware capabilities.
> Regular hwcap bits are represented as themselves (after converting from
> the subdirectory name to the bit value), and all the bits are OR-ed
> together.  If a platform directory is encountered in the path, a number
> is computed using _dl_string_platform from its name, and this number is
> then used as a fake bit index (outside of the supported real hwcap bits,
> see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
> the hwcap field in the cache.
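
To make the encoding concrete, here is a rough Python model of how a
subdirectory path could be folded into the single 64-bit field.  The
bit assignments, the platform indices and the value 48 are invented
for this sketch; the real values are architecture-specific, and
path_hwcap is a hypothetical name, not a claim about ldconfig's code.

```python
# Hypothetical tables for illustration only.
HWCAP_BITS = {"zarch": 1, "ldisp": 2, "eimm": 3, "dfp": 4}  # name -> bit index
PLATFORM_INDEX = {"zEC12": 0, "z13": 1}                     # name -> index
FIRST_PLATFORM = 48  # stand-in for _DL_FIRST_PLATFORM


def path_hwcap(subdir_path):
    """OR together the bits implied by each path component."""
    hwcap = 0
    for component in subdir_path.split("/"):
        if component in HWCAP_BITS:
            hwcap |= 1 << HWCAP_BITS[component]
        elif component in PLATFORM_INDEX:
            # The platform number becomes a fake bit index above the
            # real hwcap bits.
            hwcap |= 1 << (FIRST_PLATFORM + PLATFORM_INDEX[component])
        else:
            raise ValueError("not a hwcap directory: " + component)
    return hwcap


print(hex(path_hwcap("zEC12/dfp/eimm")))
```

Because the bits are simply OR-ed, the order of the components does
not matter in this model, which matches the leniency noted above.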
> 
> ldconfig tries to sort entries for the same soname according to some
> heuristic (see the compare function in elf/cache.c): hwcap entries with
> more bits generally come first.
> 
> At run time, the dynamic loader finds all matching path entries for a
> soname in the cache, and then picks the first entry that matches the
> hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
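
As a simplified model of the two steps (sort at ldconfig time, first
match at run time), consider the Python sketch below.  The popcount
sort key and the subset test are my assumptions about the general
shape of the logic, not a transcription of elf/cache.c or HWCAP_CHECK,
and the bit values and paths are invented.

```python
def popcount(value):
    return bin(value).count("1")


def sort_cache_entries(entries):
    """Entries are (hwcap_mask, path) pairs; more bits sort first."""
    return sorted(entries, key=lambda e: popcount(e[0]), reverse=True)


def pick_entry(sorted_entries, system_hwcap):
    """Return the first entry whose required bits the system provides."""
    for hwcap, path in sorted_entries:
        if (hwcap & ~system_hwcap) == 0:
            return path
    return None


DFP, VX = 1 << 4, 1 << 5  # hypothetical bit values
entries = sort_cache_entries([
    (0, "/usr/lib64/libfoo.so.1"),
    (DFP, "/usr/lib64/dfp/libfoo.so.1"),
    (VX | DFP, "/usr/lib64/vx/dfp/libfoo.so.1"),
])
print(pick_entry(entries, DFP))
print(pick_entry(entries, DFP | VX))
```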
> 
> # Discussion
> 
> I think there are a couple of problems with this approach.  One subtle
> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
> But I think there are other issues.
> 
> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
> calls, even with the blacklisting in place.

Indeed, one option might be to use a different scheme than nesting string
capabilities by appending them.  So instead of:

   zEC12/dfp/eimm/ldisp/zarch
   zEC12/dfp/eimm/ldisp
   zEC12/dfp/eimm/zarch
   zEC12/dfp/eimm
   zEC12/dfp/ldisp/zarch

we might have

   zEC12-dfp-eimm-ldisp-zarch
   zEC12-dfp-eimm-ldisp
   zEC12-dfp-eimm-zarch
   zEC12-dfp-eimm
   zEC12-dfp-ldisp-zarch

Thus an openat plus a getdents might be less costly than the multiple
opens the loader issues today.  It should also scale better with new
inclusions or permutations.
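
A rough sketch of what I mean, in Python rather than loader code: list
the search path entry once, then check the candidate names against the
result in preference order.  The function name and the temporary
directory layout are made up for the example.

```python
import os
import tempfile


def find_flat_hwcap_dirs(search_dir, candidates):
    """List search_dir once; keep existing candidates in preference order."""
    present = {e.name for e in os.scandir(search_dir) if e.is_dir()}
    return [name for name in candidates if name in present]


with tempfile.TemporaryDirectory() as libdir:
    os.mkdir(os.path.join(libdir, "zEC12-dfp-eimm"))
    os.mkdir(os.path.join(libdir, "unrelated"))
    candidates = ["zEC12-dfp-eimm-ldisp-zarch", "zEC12-dfp-eimm-ldisp",
                  "zEC12-dfp-eimm-zarch", "zEC12-dfp-eimm"]
    found = find_flat_hwcap_dirs(libdir, candidates)

print(found)
```

The cost of the single listing depends on how many entries the
directory holds, so this only pays off if the directory is not huge.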

> 
> The heuristics for choosing the implementation is not very obvious.  Of
> course, with bitmasks of opaque CPU features, there is no generic
> winner.  For example, on s390x z13, a library in a subdirectory
> ldisp/zarch would be preferred over one in vx because the former has
> more matching hwcap bits and comes earlier in the ld.so.cache sort order
> (but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
> vx (the z13 vector capability) should imply the other capabilities—the
> library was just placed into the wrong directory.
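
The mismatch between the two orders can be reproduced with a small
model.  This Python sketch assumes the non-cache order follows the
decrementing bitmask and that the cache order simply prefers more
hwcap components; both are simplifications, and the function names
are hypothetical.

```python
NAMES = ["tls", "z13", "vx", "dfp", "eimm", "ldisp", "zarch"]


def probe_mask(subdir):
    """Bitmask of a subdirectory in the power-set enumeration."""
    n = len(NAMES)
    mask = 0
    for component in subdir.split("/"):
        mask |= 1 << (n - 1 - NAMES.index(component))
    return mask


def path_order(subdirs):
    # Larger masks are probed earlier when counting down.
    return sorted(subdirs, key=probe_mask, reverse=True)


def cache_order(subdirs):
    # Assumption: more hwcap components means earlier in the cache.
    return sorted(subdirs, key=lambda s: len(s.split("/")), reverse=True)


candidates = ["vx", "ldisp/zarch"]
print(path_order(candidates), cache_order(candidates))
```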

It was not clear to me why exactly the heuristics used in cached lookups
differ from those used in non-cached lookups.

> 
> The most tempting choice for such optimizations is the platform
> directory for architectures that have it ("zEC12" in the example above).
> But the problem is that if the system administrator upgrades the machine
> to z13, the directory name would change to "z13", and the optimized code
> would no longer be loaded!  (Presumably, the zEC12-optimized code is
> still better than the generic code on z13.  The same issue would apply
> to z13-optimized code vs z14-optimized code.)  This would be a reason
> not to use AT_PLATFORM from the kernel even on s390x.
> 
> There is another reason to distrust AT_PLATFORM: virtualization.  If
> AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
> might not match the actual hwcap bits available to the guest because
> they are subject to separate knobs.
> 
> The complexity of the trade-offs here suggests to me that we (the GNU
> toolchain as a whole) should try to pre-define names for collections of
> hwcap flags, so that we can get a monotonic progression of features
> under a clearly defined name.  This will allow programmers to optimize
> for subsequent microarchitecture revisions.  So instead of "x86_64" we
> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
> so on, more or less mirroring the "zEC12", "z13" &c platform directories
> on s390x, even though the kernel does not provide such platform names on
> x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
> cases, it would make sense to use *earlier* platform names as a fallback
> (so that a z15 system would also use z14- and z13-optimized libraries if
> available).  This would mean that the dynamic loader would need to know
> more about these relationships.
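
The fallback relationship could be as simple as an ordered table.  A
minimal Python sketch, assuming a hypothetical per-architecture list
of levels from oldest to newest (the table and the function name are
not taken from glibc):

```python
# Hypothetical table, oldest level first.
S390X_LEVELS = ["zEC12", "z13", "z14", "z15"]


def platform_search_list(platform):
    """Return the platform and all earlier levels, newest first."""
    idx = S390X_LEVELS.index(platform)
    return S390X_LEVELS[idx::-1]


print(platform_search_list("z15"))
```

With such a table, an upgrade from zEC12 to z13 would keep the
zEC12-optimized libraries in the search list.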

I agree with you.  My understanding is that the current scheme tries to
free the platform maintainer from having to pre-define such a list
(which in turn has its own drawbacks, as the AMD/Intel selection issue
is showing us).

> 
> The current hwcap construction is not really suited to that.
> ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
> mandatory power set construction.  Even aggressive tree pruning will
> still see it make at least one system call per search path entry and
> hwcap.  So I don't think we can use this mechanism for future changes.

Agreed.

> 
> The way we store hwcap bits in ld.so.cache is also not ideal.  It would
> be nice if ldconfig could be hwcap-agnostic, not having to care at all
> about the correspondence between subdirectory name and hwcap bit (or
> AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
> while still maintaining ld.so.cache backwards compatibility (basically,
> set the currently unused bit 62 on those new hwcap entries, so that
> older loaders ignore them because of a missed hwcap requirement).

From the previous discussion I take it that the paths glibc currently
searches are not part of the ABI, so we are free to tune them in future
releases.

> 
> If we put new hwcap subdirectories under a *single* subdirectory (say
> "glibc-hwcaps"), then we could prune paths more aggressively, and use
> the new scheme in parallel to the old without much impact on performance
> until these subdirectories are actually used.  ldconfig could also treat
> the presence of a glibc-hwcaps subdirectory as an instruction to
> descend into each subdirectory of the glibc-hwcaps directory, but not
> further, and store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.
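
A sketch of the one-level scan described above, in Python.  The
function name and the returned dictionary are assumptions made for
the example; only the "glibc-hwcaps" directory name and the "x86-201"
style of subdirectory name come from the proposal.

```python
import os
import tempfile


def scan_glibc_hwcaps(libdir):
    """Map each glibc-hwcaps subdirectory name to its files, one level deep."""
    top = os.path.join(libdir, "glibc-hwcaps")
    result = {}
    if not os.path.isdir(top):
        return result  # Nothing to do if the directory is absent.
    for entry in os.scandir(top):
        if entry.is_dir():
            # Files only: nested directories are not descended into.
            result[entry.name] = sorted(
                f.name for f in os.scandir(entry.path) if f.is_file())
    return result


with tempfile.TemporaryDirectory() as libdir:
    level = os.path.join(libdir, "glibc-hwcaps", "x86-201")
    os.makedirs(os.path.join(level, "nested"))
    open(os.path.join(level, "libfoo.so.1"), "w").close()
    open(os.path.join(level, "nested", "libbar.so.1"), "w").close()
    cache_names = scan_glibc_hwcaps(libdir)
    empty = scan_glibc_hwcaps(level)

print(cache_names)
```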
> 
> In any case, I do not see a way to make good progress on bug 23249 (the
> "haswell" platform subdirectory issue on various x86-64 variants)
> without tackling some of these issues.
> 
> Thoughts?

Agreed.  What I think we need to do is move the logic of providing the
paths from generic to arch-specific bits.  One thing that became clear is
that not every hwcap bit represents a meaningful ABI variant for which
developers will actively provide optimized builds.

So one option I see is to create an architecture hook similar to
_dl_hwcap_string, where the loader provides the AT_PLATFORM, the hwcaps,
and any other meaningful information, and architecture-specific code
creates the desired search list (which might either use a pre-defined
database or be computed from the hwcap bits).

As you put it in another reply, I think folder nesting only adds
complexity without much gain, and the preference order should be defined
in the arch-specific code instead of by generic heuristics.

This will also require proper documentation of how the hwcap path list
is selected, instead of the implicit system we have so far (where the
most straightforward way to obtain the list is to use LD_DEBUG).

My final question is whether we will continue to use the current scheme
in parallel, or if the idea is to eventually phase it out.


* Re: hwcaps subdirectory selection in the dynamic loader
  2020-05-20 12:51 ` Adhemerval Zanella
@ 2020-05-28 10:54   ` Florian Weimer
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Weimer @ 2020-05-28 10:54 UTC (permalink / raw)
  To: Adhemerval Zanella via Libc-alpha

* Adhemerval Zanella via Libc-alpha:

>> The implementation largely happens via elf/dl-hwcaps.c, dl-procinfo.h,
>> elf/dl-load.c, and elf/ldconfig.c, elf/dl-cache.c for ld.so.cache.  On
>> typical targets, the kernel provides hardware capability bits via
>> AT_HWCAP auxiliary vector entry, and a platform string AT_PLATFORM.
>
> By typical do you mean all but x86 that uses a different mechanism
> (cpuid through sysdeps/x86/cpu-features.c)?

Yes, indeed.

> So it seems that currently we have x86, i686, powerpc32, powerpc64,
> powerpc64le, sparc32, sparc64, s390, s390x, and aarch64 that use the
> hwcap subdirectories, right?

I think 32-bit Arm uses it too:

sysdeps/unix/sysv/linux/arm/dl-procinfo.h:#define HWCAP_IMPORTANT		(HWCAP_ARM_VFP | HWCAP_ARM_NEON)

>> There is no sysdeps override for this search path construction.  An
>> architecture can only affect how the hwcap bits are computed, to which
>> strings individual bits correspond, and what the platform subdirectory is
>> called.  The fake two bits (TLS and platform) and the power-set
>> construction always apply.
>
> My first question is whether it still makes sense to provide the fake
> TLS bit that adds "tls" during path construction.

I don't think it does.  We really should deprecate it.

>> # Discussion
>> 
>> I think there are a couple of problems with this approach.  One subtle
>> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
>> But I think there are other issues.
>> 
>> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
>> calls, even with the blacklisting in place.
>
> Indeed, one option might be to use a different scheme than nesting string
> capabilities by appending them.  So instead of:
>
>    zEC12/dfp/eimm/ldisp/zarch
>    zEC12/dfp/eimm/ldisp
>    zEC12/dfp/eimm/zarch
>    zEC12/dfp/eimm
>    zEC12/dfp/ldisp/zarch
>
> we might have
>
>    zEC12-dfp-eimm-ldisp-zarch
>    zEC12-dfp-eimm-ldisp
>    zEC12-dfp-eimm-zarch
>    zEC12-dfp-eimm
>    zEC12-dfp-ldisp-zarch
>
> Thus an openat plus a getdents might be less costly than the multiple
> opens the loader issues today.  It should also scale better with new
> inclusions or permutations.

It's hard to tell what's better for the file system.  The intermediate
subdirectory could swing things in favor of getdents64.  But listing
everything that is on LD_LIBRARY_PATH strikes me as problematic because
we don't know how large those directories are.  Performance
characteristics will change again if the kernel ever implements negative
dentries.

However, I do not think these bitmask-based hwcaps are actually useful
to programmers.  It's simply too much choice.

>> The heuristics for choosing the implementation is not very obvious.  Of
>> course, with bitmasks of opaque CPU features, there is no generic
>> winner.  For example, on s390x z13, a library in a subdirectory
>> ldisp/zarch would be preferred over one in vx because the former has
>> more matching hwcap bits and comes earlier in the ld.so.cache sort order
>> (but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
>> vx (the z13 vector capability) should imply the other capabilities—the
>> library was just placed into the wrong directory.
>
> It was not clear to me why exactly the heuristics used in cached lookups
> differ from those used in non-cached lookups.

It's readdir vs path probing.  There's considerably more flexibility
with readdir in terms of directory layout.

>> The complexity of the trade-offs here suggests to me that we (the GNU
>> toolchain as a whole) should try to pre-define names for collections of
>> hwcap flags, so that we can get a monotonic progression of features
>> under a clearly defined name.  This will allow programmers to optimize
>> for subsequent microarchitecture revisions.  So instead of "x86_64" we
>> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
>> so on, more or less mirroring the "zEC12", "z13" &c platform directories
>> on s390x, even though the kernel does not provide such platform names on
>> x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
>> cases, it would make sense to use *earlier* platform names as a fallback
>> (so that a z15 system would also use z14- and z13-optimized libraries if
>> available).  This would mean that the dynamic loader would need to know
>> more about these relationships.
>
> I agree with you.  My understanding is that the current scheme tries to
> free the platform maintainer from having to pre-define such a list
> (which in turn has its own drawbacks, as the AMD/Intel selection issue
> is showing us).

Yes, and the selection approach we have on x86 uses hwcap in this way
not because hwcap is a good fit to the architecture requirements, but
because it's the only mechanism we've got.

I suspect s390x is similar in that regard.

>> The way we store hwcap bits in ld.so.cache is also not ideal.  It would
>> be nice if ldconfig could be hwcap-agnostic, not having to care at all
>> about the correspondence between subdirectory name and hwcap bit (or
>> AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
>> while still maintaining ld.so.cache backwards compatibility (basically,
>> set the currently unused bit 62 on those new hwcap entries, so that
>> older loaders ignore them because of a missed hwcap requirement).
>
> From the previous discussion I take it that the paths glibc currently
> searches are not part of the ABI, so we are free to tune them in future
> releases.

I'm not so sure.  We should probably make changes incrementally.  Add a
new mechanism, preferred over the existing one, wait a bit, and then
remove the old mechanism.

>> If we put new hwcap subdirectories under a *single* subdirectory (say
>> "glibc-hwcaps"), then we could prune paths more aggressively, and use
>> the new scheme in parallel to the old without much impact on performance
>> until these subdirectories are actually used.  ldconfig could also treat
>> the presence of a glibc-hwcaps subdirectory as an instruction to
>> descend into each subdirectory of the glibc-hwcaps directory, but not
>> further, and store the names of those subdirectories in ld.so.cache, so
>> that the loader can match them at run time.
>> 
>> In any case, I do not see a way to make good progress on bug 23249 (the
>> "haswell" platform subdirectory issue on various x86-64 variants)
>> without tackling some of these issues.
>> 
>> Thoughts?
>
> Agreed.  What I think we need to do is move the logic of providing the
> paths from generic to arch-specific bits.  One thing that became clear is
> that not every hwcap bit represents a meaningful ABI variant for which
> developers will actively provide optimized builds.

Right, that's currently expressed as HWCAP_IMPORTANT.

> So one option I see is to create an architecture hook similar to
> _dl_hwcap_string, where the loader provides the AT_PLATFORM, the hwcaps,
> and any other meaningful information, and architecture-specific code
> creates the desired search list (which might either use a pre-defined
> database or be computed from the hwcap bits).

Yes, and ldconfig puts the strings into the cache (not bits or indices).

> My final question is whether we will continue to use the current scheme
> in parallel, or if the idea is to eventually phase it out.

I think we'll have both in parallel for a while.  We can use an unused
hwcap bit (like bit 62) in the cache file entries to make old glibc
ignore the new entries, and add some secondary data later in the file
which is ignored by current ld.so.
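
To illustrate the bit 62 idea, here is a toy model in Python.  The
old-loader check is reduced to "reject an entry that requires a bit
the system lacks", and the way a new loader matches names is purely
an assumption for the sketch; none of the function names exist in
glibc.

```python
NEW_SCHEME_BIT = 1 << 62  # the currently unused bit


def old_loader_accepts(entry_hwcap, system_hwcap):
    """Old loaders skip entries that need a bit the system does not have."""
    return (entry_hwcap & ~system_hwcap) == 0


def new_loader_accepts(entry_hwcap, system_hwcap, entry_name, supported):
    """Assumed new behaviour: match new-style entries by name instead."""
    if entry_hwcap & NEW_SCHEME_BIT:
        return entry_name in supported
    return old_loader_accepts(entry_hwcap, system_hwcap)


system_hwcap = 0b1111  # made-up set of real hwcap bits
print(old_loader_accepts(NEW_SCHEME_BIT, system_hwcap))
print(new_loader_accepts(NEW_SCHEME_BIT, system_hwcap, "x86-201", {"x86-201"}))
```

Since no system reports bit 62, an old loader never sees the
requirement as satisfied and falls through to the older entries.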

Thanks,
Florian

