LD_AUDIT: Not enough space in static TLS block


* LD_AUDIT: Not enough space in static TLS block
@ 2022-04-11 20:24 Jonathon Anderson
  2022-04-12  7:44 ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathon Anderson @ 2022-04-11 20:24 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer, Ben Woodard,
	Adhemerval Zanella, Legendre, Matthew P.
  Cc: libc-alpha, John Mellor-Crummey

Hello all,

We (the HPCToolkit team) have encountered another critical LD_AUDIT bug. 
When LD_AUDIT is specified, the allocation of the static TLS block does 
not account for the TLS requirements of executable dependencies or of 
the auditors themselves. If:
  - an executable accesses a thread-local variable in a linked library 
with sufficiently large TLS requirements, or
  - an auditor itself uses sufficiently large TLS and optimizes access 
with `-ftls-model=initial-exec`,

then the process or auditor will fail with the error "cannot allocate 
memory in static TLS block."
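
A minimal sketch of the second case, assuming a hypothetical auditor 
whose only TLS is one large scratch buffer. The buffer name and its size 
are illustrative and are not taken from our reproducer; la_version and 
LAV_CURRENT are the standard rtld-audit interface from <link.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* Hypothetical size, chosen only to be large for a static TLS block. */
#define SCRATCH_BYTES 8192

/* Force the initial-exec model for this one variable.  Building the whole
   auditor with -ftls-model=initial-exec has the same effect on it. */
static __thread char scratch[SCRATCH_BYTES]
    __attribute__((tls_model("initial-exec")));

/* rtld-audit entry point: ld.so calls this right after loading the
   auditor.  Touch the buffer so the TLS access is really emitted. */
unsigned int la_version(unsigned int version)
{
    assert(version != 0);   /* ld.so offers its own, non-zero version */
    scratch[0] = 1;
    scratch[SCRATCH_BYTES - 1] = scratch[0];
    return LAV_CURRENT;     /* accept the audit interface */
}
```

Built with `gcc -shared -fPIC` and loaded through LD_AUDIT, an object 
like this has to be placed in the static TLS block that was sized before 
the auditor was mapped.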

This is a critical issue for us. We have observed it affecting RAJA, a 
template-based library for efficient parallel computation that is widely 
used in HPC applications. It would help us greatly if this issue were 
fixed for 2.36 and backported along with the other LD_AUDIT-related 
patches.

We have added this issue and a minimal reproducer to our document of 
auditor bugs: 
https://docs.google.com/document/d/1dVaDBdzySecxQqD6hLLzDrEF18M1UtjDna9gL5BWWI0

-Jonathon


* Re: LD_AUDIT: Not enough space in static TLS block
  2022-04-11 20:24 LD_AUDIT: Not enough space in static TLS block Jonathon Anderson
@ 2022-04-12  7:44 ` Florian Weimer
  2022-05-03  7:22   ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-04-12  7:44 UTC (permalink / raw)
  To: Jonathon Anderson
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey

* Jonathon Anderson:

> Hello all,
>
> We (the HPCToolkit team) have encountered another critical LD_AUDIT
> bug. When LD_AUDIT is specified, the allocation of the static TLS
> block does not account for the TLS requirements of executable
> dependencies or of the auditors themselves. If:
>  - an executable accesses a thread-local variable in a linked library
> with sufficiently large TLS requirements, or
>  - an auditor itself uses sufficiently large TLS and optimizes access
> with `-ftls-model=initial-exec`,
>
> then the process or auditor will fail with the error "cannot allocate
> memory in static TLS block."

We have a tunable that can be used as a workaround.  Your reproducer
passes for me with our 2.28 backport (glibc-2.28-164.el8) if I run it
like this:

  GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000 make
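
For launchers that start the audited program themselves, the same 
setting can be exported from C before the exec.  This is only a sketch: 
the two function names are made up for illustration, and the tunable 
name is the one from the command above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format "glibc.rtld.optional_static_tls=<bytes>" into buf.
   Returns 0 on success, -1 if the buffer is too small. */
int format_static_tls_tunable(char *buf, size_t len, unsigned long bytes)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "glibc.rtld.optional_static_tls=%lu", bytes);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Export GLIBC_TUNABLES for programs started afterwards.  The calling
   process has already sized its own static TLS, so it is not affected.
   Note that this overwrites any tunables that were already set. */
int export_static_tls_tunable(unsigned long bytes)
{
    char value[64];
    if (format_static_tls_tunable(value, sizeof value, bytes) != 0)
        return -1;
    return setenv("GLIBC_TUNABLES", value, 1);
}
```

A launcher would call export_static_tls_tunable(4000) and then exec the 
target with LD_AUDIT set.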

The best we can do in the short term would be an increase of the default
limit.  On 64-bit platforms, defaulting to a dozen or so kilobytes per
thread should not be a problem as far as virtual address space
consumption is concerned.  We can also add an additional reservation of
similar size for every auditor that is loaded, to compensate for the
lack of auto-tuning of the TLS allocation size in auditing mode.

The fundamental issue is that there is always going to be a hard limit
for initial-exec TLS.  Initial-exec TLS requires a fixed offset from the
thread pointer, and we cannot relocate TLS variables because they are
ordinary C objects with an observable address.  There are some other
things we can try to improve auto-tuning, but in the end, there is
always going to be a fixed-size reserved area dedicated to initial-exec
TLS set up at process startup, and with dlopen, that might not be enough
even without any auditor use.
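
To make the hard limit concrete, here is a toy model of the reserved
area as a bump allocator.  It is not the ld.so implementation; the
struct and function names are invented for illustration.  Offsets are
handed out once and never move, and a request that does not fit fails,
which is the situation behind the error message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: capacity is fixed once, at process startup. */
struct static_tls_area {
    size_t capacity;
    size_t used;
};

/* Reserve size bytes aligned to align (a power of two).  On success the
   fixed offset is stored in *offset.  Fails once the area is exhausted;
   nothing already placed can be moved to make room. */
bool static_tls_reserve(struct static_tls_area *area, size_t size,
                        size_t align, size_t *offset)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (area->used + align - 1) & ~(align - 1);
    if (start > area->capacity || size > area->capacity - start)
        return false;
    *offset = start;
    area->used = start + size;
    return true;
}
```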

Thanks,
Florian


* Re: LD_AUDIT: Not enough space in static TLS block
  2022-04-12  7:44 ` Florian Weimer
@ 2022-05-03  7:22   ` Florian Weimer
  2022-05-05 17:30     ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-05-03  7:22 UTC (permalink / raw)
  To: Jonathon Anderson
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey

* Florian Weimer:

> * Jonathon Anderson:
>
>> Hello all,
>>
>> We (the HPCToolkit team) have encountered another critical LD_AUDIT
>> bug. When LD_AUDIT is specified, the allocation of the static TLS
>> block does not account for the TLS requirements of executable
>> dependencies or of the auditors themselves. If:
>>  - an executable accesses a thread-local variable in a linked library
>> with sufficiently large TLS requirements, or
>>  - an auditor itself uses sufficiently large TLS and optimizes access
>> with `-ftls-model=initial-exec`,
>>
>> then the process or auditor will fail with the error "cannot allocate
>> memory in static TLS block."
>
> We have a tunable that can be used as a workaround.  Your reproducer
> passes for me with our 2.28 backport (glibc-2.28-164.el8) if I run it
> like this:
>
>   GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000 make
>
> The best we can do in the short term would be an increase of the default
> limit.  On 64-bit platforms, defaulting to a dozen or so kilobytes per
> thread should not be a problem as far as virtual address space
> consumption is concerned.  We can also add an additional reservation of
> similar size for every auditor that is loaded, to compensate for the
> lack of auto-tuning of the TLS allocation size in auditing mode.
>
> The fundamental issue is that there is always going to be a hard limit
> for initial-exec TLS.  Initial-exec TLS requires a fixed offset from the
> thread pointer, and we cannot relocate TLS variables because they are
> ordinary C objects with an observable address.  There are some other
> things we can try to improve auto-tuning, but in the end, there is
> always going to be a fixed-size reserved area dedicated to initial-exec
> TLS set up at process startup, and with dlopen, that might not be enough
> even without any auditor use.

Jonathon,

does setting the environment variable work for you?

Thanks,
Florian



* Re: LD_AUDIT: Not enough space in static TLS block
  2022-05-03  7:22   ` Florian Weimer
@ 2022-05-05 17:30     ` Florian Weimer
  2022-05-05 19:56       ` Jonathon Anderson
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-05-05 17:30 UTC (permalink / raw)
  To: Jonathon Anderson
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey

* Florian Weimer:

> * Florian Weimer:
>
>> * Jonathon Anderson:
>>
>>> Hello all,
>>>
>>> We (the HPCToolkit team) have encountered another critical LD_AUDIT
>>> bug. When LD_AUDIT is specified, the allocation of the static TLS
>>> block does not account for the TLS requirements of executable
>>> dependencies or of the auditors themselves. If:
>>>  - an executable accesses a thread-local variable in a linked library
>>> with sufficiently large TLS requirements, or
>>>  - an auditor itself uses sufficiently large TLS and optimizes access
>>> with `-ftls-model=initial-exec`,
>>>
>>> then the process or auditor will fail with the error "cannot allocate
>>> memory in static TLS block."
>>
>> We have a tunable that can be used as a workaround.  Your reproducer
>> passes for me with our 2.28 backport (glibc-2.28-164.el8) if I run it
>> like this:
>>
>>   GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000 make
>>
>> The best we can do in the short term would be an increase of the default
>> limit.  On 64-bit platforms, defaulting to a dozen or so kilobytes per
>> thread should not be a problem as far as virtual address space
>> consumption is concerned.  We can also add an additional reservation of
>> similar size for every auditor that is loaded, to compensate for the
>> lack of auto-tuning of the TLS allocation size in auditing mode.
>>
>> The fundamental issue is that there is always going to be a hard limit
>> for initial-exec TLS.  Initial-exec TLS requires a fixed offset from the
>> thread pointer, and we cannot relocate TLS variables because they are
>> ordinary C objects with an observable address.  There are some other
>> things we can try to improve auto-tuning, but in the end, there is
>> always going to be a fixed-size reserved area dedicated to initial-exec
>> TLS set up at process startup, and with dlopen, that might not be enough
>> even without any auditor use.
>
> Jonathon,
>
> does setting the environment variable work for you?

Do you have any additional feedback here?

In the meantime, we have updated Fedora rawhide with the bug fix to
enable early <dlfcn.h> usage from auditors, and the new RTLD_DI_PHDR
dlinfo is included as well.  If you could test glibc-2.35.9000-16 or
later, that would be great.

Thanks,
Florian



* Re: LD_AUDIT: Not enough space in static TLS block
  2022-05-05 17:30     ` Florian Weimer
@ 2022-05-05 19:56       ` Jonathon Anderson
  2022-05-11 13:59         ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathon Anderson @ 2022-05-05 19:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey



On 5/5/22 12:30, Florian Weimer wrote:
> * Florian Weimer:
>
>> * Florian Weimer:
>>
>>> * Jonathon Anderson:
>>>
>>>> Hello all,
>>>>
>>>> We (the HPCToolkit team) have encountered another critical LD_AUDIT
>>>> bug. When LD_AUDIT is specified, the allocation of the static TLS
>>>> block does not account for the TLS requirements of executable
>>>> dependencies or of the auditors themselves. If:
>>>>   - an executable accesses a thread-local variable in a linked library
>>>> with sufficiently large TLS requirements, or
>>>>   - an auditor itself uses sufficiently large TLS and optimizes access
>>>> with `-ftls-model=initial-exec`,
>>>>
>>>> then the process or auditor will fail with the error "cannot allocate
>>>> memory in static TLS block."
>>> We have a tunable that can be used as a workaround.  Your reproducer
>>> passes for me with our 2.28 backport (glibc-2.28-164.el8) if I run it
>>> like this:
>>>
>>>    GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000 make
>>>
>>> The best we can do in the short term would be an increase of the default
>>> limit.  On 64-bit platforms, defaulting to a dozen or so kilobytes per
>>> thread should not be a problem as far as virtual address space
>>> consumption is concerned.  We can also add an additional reservation of
>>> similar size for every auditor that is loaded, to compensate for the
>>> lack of auto-tuning of the TLS allocation size in auditing mode.
>>>
>>> The fundamental issue is that there is always going to be a hard limit
>>> for initial-exec TLS.  Initial-exec TLS requires a fixed offset from the
>>> thread pointer, and we cannot relocate TLS variables because they are
>>> ordinary C objects with an observable address.  There are some other
>>> things we can try to improve auto-tuning, but in the end, there is
>>> always going to be a fixed-size reserved area dedicated to initial-exec
>>> TLS set up at process startup, and with dlopen, that might not be enough
>>> even without any auditor use.
>> Jonathon,
>>
>> does setting the environment variable work for you?
> Do you have any additional feedback here?
Sorry for the delayed response (it's ECP conference week).

*This tunable works for us as a stopgap until a long-term solution can 
be implemented.*

I had a separate (email) chat with Ben Woodard bouncing ideas for a 
long-term solution. A major difficulty is that LD_AUDIT currently 
introduces a cyclic dependency:
  - auditors must be loaded before searching for the application's 
dependencies (since la_objsearch may modify the results), and
  - dependency searches must complete before the static TLS auto-tuning 
(since the TLS sizes of the initial link-map must be known), but
  - the static TLS block must be allocated before auditors are loaded 
(since auditors may also use initial-exec TLS).
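
The cycle can be checked mechanically.  The sketch below encodes the 
three constraints above as "must happen before" edges, plus the implied 
edge that sizing precedes allocation, and runs Kahn's algorithm on them. 
The step names are shorthand of mine, not ld.so identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum step {
    LOAD_AUDITORS, SEARCH_DEPENDENCIES,
    SIZE_STATIC_TLS, ALLOCATE_STATIC_TLS, NSTEPS
};

/* Fill before[a][b] = true where step a must happen before step b. */
void ld_audit_constraints(bool before[NSTEPS][NSTEPS])
{
    assert(before != NULL);
    for (int a = 0; a < NSTEPS; a++)
        for (int b = 0; b < NSTEPS; b++)
            before[a][b] = false;
    before[LOAD_AUDITORS][SEARCH_DEPENDENCIES] = true;   /* la_objsearch */
    before[SEARCH_DEPENDENCIES][SIZE_STATIC_TLS] = true; /* auto-tuning */
    before[SIZE_STATIC_TLS][ALLOCATE_STATIC_TLS] = true; /* implied */
    before[ALLOCATE_STATIC_TLS][LOAD_AUDITORS] = true;   /* auditor IE TLS */
}

/* Kahn's algorithm: returns true if some order satisfies every edge. */
bool has_valid_order(bool before[NSTEPS][NSTEPS])
{
    int indegree[NSTEPS] = {0};
    bool placed[NSTEPS] = {false};
    int count = 0;

    for (int a = 0; a < NSTEPS; a++)
        for (int b = 0; b < NSTEPS; b++)
            if (before[a][b])
                indegree[b]++;

    for (int round = 0; round < NSTEPS; round++)
        for (int s = 0; s < NSTEPS; s++)
            if (!placed[s] && indegree[s] == 0) {
                placed[s] = true;
                count++;
                for (int b = 0; b < NSTEPS; b++)
                    if (before[s][b])
                        indegree[b]--;
            }
    return count == NSTEPS;
}
```

With all four edges present no step has in-degree zero, so no order 
exists.  Dropping the last edge (no initial-exec TLS in auditors) makes 
the graph orderable again.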

So, I'm not hopeful for a long-term solution that does not involve 
another LAV_CURRENT bump. We (me and Ben) came up with a couple of 
initial solutions: disallowing initial-exec TLS in auditors, or 
per-auditor static TLS blocks (ie. TLS namespaces). Comments and ideas 
are welcome. (I would love to have a detailed LD_AUDIT discussion at STW 
in June.)

> In the meantime, we have updated Fedora rawhide with the bug fix to
> enable early <dlfcn.h> usage from auditors, and the new RTLD_DI_PHDR
> dlinfo is included as well.  If you could test glibc-2.35.9000-16 or
> later, that would be great.
Thanks! Our reproducer for the early dl* bug passes with the latest 
Fedora Rawhide. I'll look into using RTLD_DI_PHDR in HPCToolkit in the 
coming weeks.

-Jonathon


* Re: LD_AUDIT: Not enough space in static TLS block
  2022-05-05 19:56       ` Jonathon Anderson
@ 2022-05-11 13:59         ` Florian Weimer
  2022-05-11 17:31           ` Jonathon Anderson
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-05-11 13:59 UTC (permalink / raw)
  To: Jonathon Anderson
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey

* Jonathon Anderson:

> This tunable works for us as a stopgap until a long-term solution can be
> implemented.

Good to know, thanks.

> I had a separate (email) chat with Ben Woodard bouncing ideas for a long-term solution. A
> major difficulty is that LD_AUDIT currently introduces a cyclic dependency:
>  - auditors must be loaded before searching for the application's dependencies (since
> la_objsearch may modify the results), and
>  - dependency searches must complete before the static TLS auto-tuning (since the TLS
> sizes of the initial link-map must be known), but
>  - the static TLS block must be allocated before auditors are loaded (since auditors may also
> use initial-exec TLS).
>
> So, I'm not hopeful for a long-term solution that does not involve
> another LAV_CURRENT bump. We (me and Ben) came up with a couple of
> initial solutions: disallowing initial-exec TLS in auditors,

I'm not sure if this is feasible.  It would mean we cannot use initial-exec
TLS in glibc at all, or in libstdc++ (in case auditors are written in
C++).

And we don't want to build libraries twice (for auditor usage).

> or per-auditor static TLS blocks (ie. TLS namespaces).

We already have that, but there is just one thread pointer, so that does
not solve the problem.  Using a secondary thread pointer has the second
build problem, too.

Auditor TLS usage has a conceptually simple fix, though.  (It's simple
in concept, but implementation requires some refactoring.)  Recall that
for regular process startup (without auditing), we do this:

  (1) map the main executable (the kernel may do this for us)
  (2) recursively map all the dependencies
  (3) calculate static TLS usage
  (4) allocate static TLS space
  (5) perform relocation
  (6) assign TLS variables their initial values
  (7) start running user code (initializers, main)

Once auditors are in the mix, we do this instead:

  (1) guess static TLS usage
  (2) allocate static TLS space
  (3) load each audit module individually, in sequence, as if per dlmopen:
    (3.1) map the auditor
    (3.2) recursively map all its dependencies
    (3.3) perform relocation
    (3.4) calculate and allocate static TLS space (from the global area)
    (3.5) assign TLS variables their initial values
    (3.6) start running auditor code (ELF constructors, la_version)
  (4) map the main executable (the kernel may do this for us)
  (5) recursively map all the dependencies (may involve la_objsearch)
  (6) calculate and allocate static TLS space (from the global area)
  (7) perform relocation
  (8) assign TLS variables their initial values
  (9) start running user code (initializers, main)

Step (1) is the big problem here.  It's just a quick hack to get things
going with TLS, but it has been around for a long time.  What we should
be doing instead is this:

  (1) load each audit module individually, in sequence (no relocation here):
    (1.1) map the auditor
    (1.2) recursively map all its dependencies
    (1.3) calculate static TLS usage for this auditor namespace
  (2) map the main executable (the kernel may do this for us)
  (3) calculate static TLS usage using all TLS size information seen so far
  (4) allocate static TLS space
  (5) complete loading the auditors (relocation and startup):
    (5.1) perform relocation
    (5.2) calculate and allocate static TLS space (from the global area)
    (5.3) assign TLS variables their initial values
    (5.4) start running auditor code (ELF constructors, la_version)
  (6) recursively map all the dependencies (may involve la_objsearch)
  (7) calculate and allocate static TLS space (from the global area)
  (8) perform relocation
  (9) assign TLS variables their initial values
  (10) start running user code (initializers, main)

With this sequence, direct static TLS usage from auditors is taken into
account for the fixed-size TLS allocation at (4), eliminating the
guesswork.  (Step (2) could actually come right before step (6), it
would not alter the picture.)
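
The size calculation in the two-phase sequence can be sketched as
follows.  This is a simplified model, not ld.so code: alignment is
ignored, and the function names and the "surplus" parameter (headroom
for objects mapped after the allocation) are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Size of the static TLS area at allocation time: everything known so
   far (each auditor namespace plus the main executable) plus headroom
   for objects that are mapped later. */
size_t two_phase_static_tls_size(const size_t *auditor_tls,
                                 size_t n_auditors,
                                 size_t exe_tls, size_t surplus)
{
    assert(auditor_tls != NULL || n_auditors == 0);
    size_t total = exe_tls + surplus;
    for (size_t i = 0; i < n_auditors; i++)
        total += auditor_tls[i];
    return total;
}

/* Dependencies are mapped after the allocation, so their static TLS
   has to fit into whatever headroom was left. */
bool dependencies_fit(size_t surplus, const size_t *dep_tls, size_t n_deps)
{
    assert(dep_tls != NULL || n_deps == 0);
    size_t need = 0;
    for (size_t i = 0; i < n_deps; i++)
        need += dep_tls[i];
    return need <= surplus;
}
```

In this model the auditors' own TLS is always covered, because it is
part of the sum, while dependencies still rely on the headroom.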

When no auditor defines la_objsearch, we can do even better and map the
executable and its dependencies before computing the static TLS size,
and only run step (5), complete loading the auditors, after mapping
everything (but before relocation, which needs working auditors for
la_symbind).  In this case, we'd have the same level of information
regarding TLS usage as in the non-auditor case (which is still not
enough in all cases, but another incremental improvement).

With la_objsearch, we could pull a few more tricks.  Auditors could
advertise that the addresses of their TLS variables do not matter, which
would enable us to relocate the TLS space as we discover more objects
that need more static TLS.  Or we could unload the auditors on TLS
exhaustion and start again with a larger space estimate.  Auditors could
provide their own guesses for static TLS usage that we query upfront and
take into account for the size calculation.

None of this solves the general dlopen case, though.  I have some ideas
for that, which boil down to “just provide enough address space during
early startup, so that you never exceed it until the initialization
phase with dlopen is complete”.  This needs a new TCB allocator, though,
so it's also quite involved to implement.  It does not solve the problem
completely, but I expect that it will eliminate pretty much all
shortcomings of initial-exec TLS we have seen in practice.
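
One way to read "provide enough address space" is a reserve-then-commit
scheme: map a large range with no access rights up front and make pages
usable only as they are needed.  The sketch below shows that general
technique on Linux with mmap and mprotect.  It is an assumption about
the approach, not the planned TCB allocator, and all names are
illustrative.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

struct reserved_area {
    char *base;        /* start of the reserved range */
    size_t reserved;   /* bytes of address space set aside */
    size_t committed;  /* bytes currently readable and writable */
};

/* Reserve address space only; the pages are inaccessible until
   committed.  Returns 0 on success, -1 on failure. */
int area_reserve(struct reserved_area *a, size_t reserve_bytes)
{
    assert(a != NULL && reserve_bytes > 0);
    void *p = mmap(NULL, reserve_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->reserved = reserve_bytes;
    a->committed = 0;
    return 0;
}

/* Make the first upto bytes usable.  The base address never changes,
   so offsets handed out earlier stay valid.  Returns -1 if the request
   exceeds the reservation. */
int area_commit(struct reserved_area *a, size_t upto)
{
    assert(a != NULL);
    if (upto == 0 || upto > a->reserved)
        return -1;
    if (mprotect(a->base, upto, PROT_READ | PROT_WRITE) != 0)
        return -1;
    if (upto > a->committed)
        a->committed = upto;
    return 0;
}
```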

With that change, we might not even need the two-phased auditor loading.

> Comments and ideas are welcome. (I would love to have a detailed
> LD_AUDIT discussion at STW in June.)

Uhm, what's STW?

> Our reproducer for the early dl* bug passes with the latest Fedora
> Rawhide, I'll look into using RTLD_DI_PHDR in HPCToolkit in the coming
> weeks.

Thanks!

Florian


* Re: LD_AUDIT: Not enough space in static TLS block
  2022-05-11 13:59         ` Florian Weimer
@ 2022-05-11 17:31           ` Jonathon Anderson
  0 siblings, 0 replies; 7+ messages in thread
From: Jonathon Anderson @ 2022-05-11 17:31 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, Ben Woodard, Adhemerval Zanella, Legendre,
	Matthew P.,
	libc-alpha, John Mellor-Crummey



On 5/11/22 08:59, Florian Weimer wrote:
>> I had a separate (email) chat with Ben Woodard bouncing ideas for a long-term solution. A
>> major difficulty is that LD_AUDIT currently introduces a cyclic dependency:
>>   - auditors must be loaded before searching for the application's dependencies (since
>> la_objsearch may modify the results), and
>>   - dependency searches must complete before the static TLS auto-tuning (since the TLS
>> sizes of the initial link-map must be known), but
>>   - the static TLS block must be allocated before auditors are loaded (since auditors may also
>> use initial-exec TLS).
>>
>> So, I'm not hopeful for a long-term solution that does not involve
>> another LAV_CURRENT bump. We (me and Ben) came up with a couple of
>> initial solutions: disallowing initial-exec TLS in auditors,
I wasn't very clear with my half-sentence descriptions, so let me add some 
more detail to these ideas.
> I'm not sure if this is feasible.  It would mean we cannot use initial-exec
> TLS in glibc at all, or in libstdc++ (in case auditors are written in
> C++).
>
> And we don't want to build libraries twice (for auditor usage).
I believe this can be avoided if they're compiled with the gnu2 TLS 
variant (-mtls-dialect=gnu2), which would allow ld.so to relocate so that:
  - the Glibc in the auditor namespace(s) uses a dynamic TLS segment 
(with a performance hit, of course), but
  - the Glibc in the main namespace(s) uses the static TLS segment (with 
no measurable performance degradation).

This extends to other libraries as well, libstdc++ included. On the 
ld.so side, the startup sequence would then become something like this:

   (1) allocate TCB and DTV (but not static TLS space)
   (2) load each audit module
     (2.1) map the auditor
     (2.2) recursively map all its dependencies
     (2.3) relocate, where:
       (2.3.1) TLSDESC is always relocated in dynamic form
       (2.3.2) any initial-exec relocation (eg. R_X86_64_TPOFF64) throws 
an error
     (2.4) assign TLS variables their initial values
     (2.5) start running auditor code (ELF constructors, la_version)
   (3) map the main executable
   (4) recursively map all dependencies
   (5) calculate static TLS usage
   (6) allocate static TLS + new TCB
   (7) move old TCB to new location
   (8) relocate
   (9) assign TLS variables their initial values
   (10) start running user code (initializers, main)

Where steps (1-2) and (7) are skipped in the non-auditor case.
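
The relocation rule in steps (2.3.1) and (2.3.2) amounts to a small 
decision table.  A sketch, with enum and function names made up for 
illustration (these are not ld.so identifiers):

```c
#include <assert.h>
#include <stdbool.h>

/* The two TLS relocation flavours discussed above. */
enum tls_reloc_kind { TLS_RELOC_TLSDESC, TLS_RELOC_INITIAL_EXEC };

/* What the loader would do with such a relocation. */
enum tls_reloc_action { TLS_BIND_STATIC, TLS_BIND_DYNAMIC, TLS_REJECT };

/* Auditor namespaces are relocated before any static TLS exists, so
   TLSDESC takes its dynamic form and initial-exec is an error.  In the
   main namespace both kinds can use the static TLS segment. */
enum tls_reloc_action tls_reloc_policy(enum tls_reloc_kind kind,
                                       bool in_auditor_namespace)
{
    assert(kind == TLS_RELOC_TLSDESC || kind == TLS_RELOC_INITIAL_EXEC);
    if (!in_auditor_namespace)
        return TLS_BIND_STATIC;
    return kind == TLS_RELOC_TLSDESC ? TLS_BIND_DYNAMIC : TLS_REJECT;
}
```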

(Of course, this idea leaves us in the unenviable position of 
negotiating with our dependencies to compile everything with gnu2 TLS. 
Caveat emptor.)

>> or per-auditor static TLS blocks (ie. TLS namespaces).
> We already have that, but there is just one thread pointer, so that does
> not solve the problem.  Using a secondary thread pointer has the second
> build problem, too.
In this idea the auditors would be responsible for "switching" between 
the multiple thread pointers, where all non-auditor namespaces share one 
"main" thread pointer. The audit API would need to be expanded with a 
few function(-like macros) to manipulate the thread pointer, such as:
  - struct tls_restore tls_switch_to_main_tp();
  - struct tls_restore tls_switch_to_caller_tp();
  - void tls_restore_tp(const struct tls_restore*);

Then auditors (and ld.so) would need to implement a series of 
restrictions/conventions to properly maintain TP, one set of rules could be:
  - All la_* notification calls occur with the auditor's thread pointer set.
  - All calls to code outside the auditor namespace (eg. user code) must 
occur with the main thread pointer set (ie. wrapped in a 
tls_switch_to_main_tp()/tls_restore_tp() pair).
  - All calls from outside the auditor namespace (eg. wrapped symbols) 
occur with the main thread pointer set (ie. the auditor should wrap the 
contents in a tls_switch_to_caller_tp()/tls_restore_tp() pair).
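
To show how the second rule would look in auditor code, here is a mock 
of the proposed interface.  None of this exists in Glibc: the thread 
pointer is simulated with an ordinary variable, the contents of struct 
tls_restore are a guess, and tls_switch_to_caller_tp() is left out 
because its semantics are not pinned down yet.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two thread control blocks.  A real implementation
   would switch the hardware thread-pointer register instead. */
static char main_tcb, auditor_tcb;
static void *current_tp = &auditor_tcb;   /* la_* calls start here */

struct tls_restore { void *saved_tp; };

/* Switch to the main thread pointer and remember the previous one. */
struct tls_restore tls_switch_to_main_tp(void)
{
    struct tls_restore r = { current_tp };
    current_tp = &main_tcb;
    return r;
}

/* Restore the thread pointer saved by a switch call. */
void tls_restore_tp(const struct tls_restore *r)
{
    assert(r != NULL);
    current_tp = r->saved_tp;
}

/* Inspection helper for the mock. */
int running_on_main_tp(void)
{
    return current_tp == &main_tcb;
}

/* Rule 2: code outside the auditor namespace is only ever called
   inside a switch/restore pair. */
int call_user_code(int (*fn)(void))
{
    struct tls_restore r = tls_switch_to_main_tp();
    int rc = fn();
    tls_restore_tp(&r);
    return rc;
}
```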

(Of course, any bugs in any auditor's TP management will cause very 
subtle errors in the application, and auditors will need to be careful 
that their dependencies don't naively call code loaded via 
dlmopen(LM_ID_BASE). Caveat emptor.)

> Auditor TLS usage has a conceptually simple fix, though.  (It's simple
> in concept, but implementation requires some refactoring.)  Recall that
> for regular process startup (without auditing), we do this:
>
>    (1) map the main executable (the kernel may do this for us)
>    (2) recursively map all the dependencies
>    (3) calculate static TLS usage
>    (4) allocate static TLS space
>    (5) perform relocation
>    (6) assign TLS variables their initial values
>    (7) start running user code (initializers, main)
>
> Once auditors are in the mix, we do this instead:
>
>    (1) guess static TLS usage
>    (2) allocate static TLS space
>    (3) load each audit module individually, in sequence, as if per dlmopen:
>      (3.1) map the auditor
>      (3.2) recursively map all its dependencies
>      (3.3) perform relocation
>      (3.4) calculate and allocate static TLS space (from the global area)
>      (3.5) assign TLS variables their initial values
>      (3.6) start running auditor code (ELF constructors, la_version)
>    (4) map the main executable (the kernel may do this for us)
>    (5) recursively map all the dependencies (may involve la_objsearch)
>    (6) calculate and allocate static TLS space (from the global area)
>    (7) perform relocation
>    (8) assign TLS variables their initial values
>    (9) start running user code (initializers, main)
>
> Step (1) is the big problem here, it's just a quick hack to get things
> going with TLS, but it has been around for a long time.  What we should
> be doing instead is this:
>
>    (1) load each audit module individually, in sequence (no relocation here):
>      (1.1) map the auditor
>      (1.2) recursively map all its dependencies
>      (1.3) calculate static TLS usage for this auditor namespace
>    (2) map the main executable (the kernel may do this for us)
>    (3) calculate static TLS usage using all TLS size information seen so far
>    (4) allocate static TLS space
>    (5) complete loading the auditors (relocation and startup):
>      (5.1) perform relocation
>      (5.2) calculate and allocate static TLS space (from the global area)
>      (5.3) assign TLS variables their initial values
>      (5.4) start running auditor code (ELF constructors, la_version)
>    (6) recursively map all the dependencies (may involve la_objsearch)
>    (7) calculate and allocate static TLS space (from the global area)
>    (8) perform relocation
>    (9) assign TLS variables their initial values
>    (10) start running user code (initializers, main)
>
> With this sequence, direct static TLS usage from auditors is taken into
> account for the fixed-size TLS allocation at (4), eliminating the
> guesswork.  (Step (2) could actually come right before step (6), it
> would not alter the picture.)
Step (3) doesn't account for the static TLS usage of the executable's 
dependencies, so this doesn't completely solve the issue. (The cyclic 
dependency I mentioned before is roughly (3) -> (4) -> (5.4) -> (6) -> (3).)

> When no auditor defines la_objsearch, we can do even better and map the
> executable and its dependencies before computing the static TLS size,
> and only run step (5), complete loading the auditors, after mapping
> everything (but before relocation, which needs working auditors for
> la_symbind).  In this case, we'd have the same level of information
> regarding TLS usage as in the non-auditor case (which is still not
> enough in all cases, but another incremental improvement).
> With la_objsearch, we could pull a few more tricks.  Auditors could
> advertise that the addresses of their TLS variables do not matter, which
> would enable us to relocate the TLS space as we discover more objects
> that need more static TLS.
If I understand correctly, this optimization would require that the 
auditor and all its dependencies don't take the address of TLS 
variables. This won't be feasible for us (we have complex dependencies). 
To be usable by any reasonable auditor, this restriction would have to be 
satisfied by (at least) Glibc and libstdc++.

Compiler assistance would help here, although I highly doubt this 
restriction will be often achieved with C++ code (eg. returning a const 
reference to a TLS variable is enough to break this restriction).
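
The hazard is easy to simulate.  In the sketch below the "TLS block" is 
just heap memory and the relocation is a copy into a second, larger 
allocation; the function name is made up.  A pointer taken before the 
move still refers to the old storage afterwards, which is why an escaped 
address rules out moving the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simulate growing a TLS block by moving it.  Returns true only if a
   pointer taken before the move still points at the variable's new
   location. */
bool cached_pointer_survives_move(size_t old_size, size_t new_size)
{
    assert(old_size >= sizeof(int) && new_size >= old_size);

    char *old_block = calloc(1, old_size);
    char *new_block = malloc(new_size);
    if (old_block == NULL || new_block == NULL) {
        free(old_block);
        free(new_block);
        return false;
    }

    int *cached = (int *)old_block;   /* address escapes, e.g. via a getter */
    *cached = 42;

    /* "Relocate" the block: contents move, the cached pointer does not. */
    memcpy(new_block, old_block, old_size);
    bool survives = ((char *)cached == new_block);

    free(old_block);
    free(new_block);
    return survives;
}
```

Both blocks are alive at the moment of comparison, so the addresses 
differ and the function reports that the cached pointer is stale.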

>    Or we could unload the auditors on TLS
> exhaustion and start again with a larger space estimate.
We would need some additions to the auditor API to indicate when this is 
happening (vs. normal program termination) and to move data to the 
reloaded instance. This would also require significant violence to our 
measurement infrastructure to support save/reload like this, so this is 
not really a preferable solution.

>    Auditors could
> provide their own guesses for static TLS usage that we query upfront and
> take into account for the size calculation.
Is this different from setting the tunable (except with a fancier 
interface)?

> None of this solves the general dlopen case, though.  I have some ideas
> for that, which boils down to “just provide enough address space during
> early startup, so that you never exceed it until the initialization
> phase with dlopen is complete”.  This needs a new TCB allocator, though,
> so it's also quite involved to implement.  It does not solve the problem
> completely, but I expect that it will eliminate pretty much all
> shortcomings of initial-exec TLS we have seen in practice.
>
> With that change, we might not even need the two-phased auditor loading.
I'm not sure I understand this solution. Would this require the static 
TLS block to grow as new libraries are loaded during ELF constructors? 
Or for the entire execution? Would this then mean that any library can 
use initial-exec TLS, regardless of whether it's part of the initial 
link map?

It sounds magical, but if it's possible it would definitely solve the issue.

>> Comments and ideas are welcome. (I would love to have a detailed
>> LD_AUDIT discussion at STW in June.)
> Uhm, what's STW?
As Dr. Mellor-Crummey corrected me, it's the Scalable Tools Workshop 
(https://dyninst.github.io/scalable_tools_workshop/petascale2022/). 
Force of habit using the acronym, sorry for the confusion.

-Jonathon

