Counting static __cxa

public inbox for binutils@sourceware.org

* Counting static __cxa_atexit calls
@ 2022-08-23 11:58 Florian Weimer
  2022-08-23 12:28 ` Nick Clifton
  2022-08-23 13:40 ` Michael Matz
  0 siblings, 2 replies; 7+ messages in thread
From: Florian Weimer @ 2022-08-23 11:58 UTC (permalink / raw)
  To: binutils; +Cc: gcc, libc-alpha

We currently have a latent bug in glibc where C++ constructor calls can
fail if they have static or thread storage duration and a non-trivial
destructor.  The reason is that __cxa_atexit (and
__cxa_thread_atexit_impl) may have to allocate memory.  We can avoid
that if we know how many such static calls exist in an object (for C++,
the compiler will never emit these calls repeatedly in a loop).  Then we
can allocate the resources beforehand, either during process and thread
start, or when dlopen is called and new objects are loaded.
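
To illustrate the pre-allocation idea, here is a minimal sketch. It is not glibc code: HandlerTable, table_reserve, table_register and table_run_and_free are hypothetical names, and the count is assumed to be supplied by the loader from whatever metadata the object ends up carrying. The only step that can run out of memory is the reservation, which happens where an error can still be returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical record for one registered destructor call.
struct ExitHandler {
  void (*fn)(void *);
  void *arg;
};

// Hypothetical per-object (or per-thread) table of exit handlers.
struct HandlerTable {
  ExitHandler *slots = nullptr;
  std::size_t capacity = 0;
  std::size_t used = 0;
};

// Load-time step (process start, dlopen, thread creation).  This is the
// only place that allocates, so it is the only place that can fail.
bool table_reserve(HandlerTable &t, std::size_t count) {
  if (count == 0) return true;
  if (count > SIZE_MAX / sizeof(ExitHandler)) return false;
  void *p = std::malloc(count * sizeof(ExitHandler));
  if (p == nullptr) return false;  // soft failure, reported to the caller
  t.slots = static_cast<ExitHandler *>(p);
  t.capacity = count;
  return true;
}

// Constructor-time step.  Never allocates; it only fails if the
// advertised count was too small.
bool table_register(HandlerTable &t, void (*fn)(void *), void *arg) {
  assert(fn != nullptr);
  if (t.used == t.capacity) return false;
  t.slots[t.used++] = ExitHandler{fn, arg};
  return true;
}

// Exit-time step: run handlers in reverse registration order, then
// release the table.
void table_run_and_free(HandlerTable &t) {
  while (t.used > 0) {
    ExitHandler h = t.slots[--t.used];
    h.fn(h.arg);
  }
  std::free(t.slots);
  t.slots = nullptr;
  t.capacity = 0;
}
```

With a table reserved for two entries, two registrations succeed, a third is rejected, and the handlers run last-in, first-out.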

What would be the most ELF-flavored way to implement this?  After the
final link, I expect that the count (or counts, we need a separate
counter for thread-local storage) would show up under a new dynamic tag
in the dynamic segment.  This is actually a very good fit because older
loaders will just ignore it.  But the question remains what GCC should
emit into assembler & object files, so that the link editor can compute
the total count from that.
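
A rough sketch of the consumer side of such a dynamic tag, assuming a 64-bit ELF target and the Linux <elf.h> definitions. The tag names and numeric values below are placeholders picked from the OS-specific range (DT_LOOS to DT_HIOS); no such tags have been allocated. A loader that does not know the tag never looks for it, which is why older loaders are unaffected.

```cpp
#include <cassert>
#include <cstdint>
#include <elf.h>

// Hypothetical tags; the values are placeholders, not allocated numbers.
constexpr Elf64_Sxword DT_HYP_ATEXIT_COUNT = 0x6000f001;
constexpr Elf64_Sxword DT_HYP_TLS_ATEXIT_COUNT = 0x6000f002;

// Walk a DT_NULL-terminated dynamic array and return the value stored
// under `tag`, or `fallback` if the object does not carry that tag.
std::uint64_t dyn_value(const Elf64_Dyn *dyn, Elf64_Sxword tag,
                        std::uint64_t fallback) {
  assert(dyn != nullptr);
  for (; dyn->d_tag != DT_NULL; ++dyn) {
    if (dyn->d_tag == tag) return dyn->d_un.d_val;
  }
  return fallback;
}
```

An object without the tag yields the fallback, so a loader could treat a missing tag as "count unknown" and keep today's behavior.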

Thanks,
Florian

* Re: Counting static __cxa_atexit calls
  2022-08-23 11:58 Counting static __cxa_atexit calls Florian Weimer
@ 2022-08-23 12:28 ` Nick Clifton
  2022-08-23 13:40 ` Michael Matz
  1 sibling, 0 replies; 7+ messages in thread
From: Nick Clifton @ 2022-08-23 12:28 UTC (permalink / raw)
  To: Florian Weimer, binutils; +Cc: gcc, libc-alpha

Hi Florian,

> What would be the most ELF-flavored way to implement this?  After the
> final link, I expect that the count (or counts, we need a separate
> counter for thread-local storage) would show up under a new dynamic tag
> in the dynamic segment.  This is actually a very good fit because older
> loaders will just ignore it.  But the question remains what GCC should
> emit into assembler & object files, so that the link editor can compute
> the total count from that.

(It would be worthwhile asking this question of the LLVM community too,
since ideally we would like to use the same method in both compilers).


This sounds like an opportunity to add a couple of new GNU object
attributes:

   .gnu_attribute Tag_gnu_destructor_count, <number>
   .gnu_attribute Tag_gnu_tld_count, <count>

These would then translate into GNU object attribute notes in the
object file.

Cheers
   Nick


* Re: Counting static __cxa_atexit calls
  2022-08-23 11:58 Counting static __cxa_atexit calls Florian Weimer
  2022-08-23 12:28 ` Nick Clifton
@ 2022-08-23 13:40 ` Michael Matz
  2022-08-24 12:06   ` Florian Weimer
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Matz @ 2022-08-23 13:40 UTC (permalink / raw)
  To: Florian Weimer; +Cc: binutils, gcc, libc-alpha

Hello,

On Tue, 23 Aug 2022, Florian Weimer via Gcc wrote:

> We currently have a latent bug in glibc where C++ constructor calls can
> fail if they have static or thread storage duration and a non-trivial
> destructor.  The reason is that __cxa_atexit (and
> __cxa_thread_atexit_impl) may have to allocate memory.  We can avoid
> that if we know how many such static calls exist in an object (for C++,
> the compiler will never emit these calls repeatedly in a loop).  Then we
> can allocate the resources beforehand, either during process and thread
> start, or when dlopen is called and new objects are loaded.

Isn't this merely moving the failure point from exception-at-ctor to 
dlopen-fails?  If an individual __cxa_atexit can't allocate memory anymore 
for its list structure, why should pre-allocation (which is still dynamic, 
based on the number of actual atexit calls) have any more luck?

> What would be the most ELF-flavored way to implement this?  After the
> final link, I expect that the count (or counts, we need a separate
> counter for thread-local storage) would show up under a new dynamic tag
> in the dynamic segment.  This is actually a very good fit because older
> loaders will just ignore it.  But the question remains what GCC should
> emit into assembler & object files, so that the link editor can compute
> the total count from that.

Probably a note section, which the link editor could either transform into 
a dynamic tag or leave as note(s) in the PT_NOTE segment.  The latter 
wouldn't require any specific tooling support in the link editor.  But the 
consumer would have to iterate through all the notes to add the 
individual counts together.  Might be acceptable, though.
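
A sketch of what "iterate through all the notes" could look like for the consumer, under several assumptions: the note type NT_HYP_ATEXIT_COUNT and the use of the "GNU" owner name are invented for illustration, each note carries one 32-bit count in native byte order, and name and descriptor are padded to 4 bytes. append_count_note exists only to build test input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <elf.h>
#include <vector>

// Hypothetical note type and owner; no such note is defined anywhere.
constexpr std::uint32_t NT_HYP_ATEXIT_COUNT = 0x48595001;
constexpr char kOwner[4] = {'G', 'N', 'U', '\0'};

// Assumption: name and descriptor are each padded to a 4-byte boundary.
static std::size_t pad4(std::size_t n) {
  return (n + 3) & ~static_cast<std::size_t>(3);
}

// Sum the counts of all matching notes in a buffer laid out like the
// contents of a PT_NOTE segment.  Other notes are skipped.
std::uint64_t sum_note_counts(const unsigned char *data, std::size_t size) {
  std::uint64_t total = 0;
  std::size_t off = 0;
  while (size - off >= sizeof(Elf64_Nhdr)) {
    Elf64_Nhdr nh;
    std::memcpy(&nh, data + off, sizeof nh);
    const std::size_t name_off = off + sizeof nh;
    const std::size_t desc_off = name_off + pad4(nh.n_namesz);
    const std::size_t next = desc_off + pad4(nh.n_descsz);
    if (next > size) break;  // truncated note: stop scanning
    if (nh.n_type == NT_HYP_ATEXIT_COUNT && nh.n_namesz == sizeof kOwner &&
        std::memcmp(data + name_off, kOwner, sizeof kOwner) == 0 &&
        nh.n_descsz == sizeof(std::uint32_t)) {
      std::uint32_t count;
      std::memcpy(&count, data + desc_off, sizeof count);
      total += count;
    }
    off = next;
  }
  return total;
}

// Test helper: append one note with the "GNU" owner and a 32-bit count.
void append_count_note(std::vector<unsigned char> &buf, std::uint32_t type,
                       std::uint32_t count) {
  Elf64_Nhdr nh;
  nh.n_namesz = sizeof kOwner;
  nh.n_descsz = sizeof count;
  nh.n_type = type;
  const unsigned char *h = reinterpret_cast<const unsigned char *>(&nh);
  buf.insert(buf.end(), h, h + sizeof nh);
  buf.insert(buf.end(), kOwner, kOwner + sizeof kOwner);
  const unsigned char *c = reinterpret_cast<const unsigned char *>(&count);
  buf.insert(buf.end(), c, c + sizeof count);
}
```

The cost is a linear scan over every note in the segment at load time, which is the trade-off against having the link editor fold the counts into one dynamic tag.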

Ciao,
Michael.

* Re: Counting static __cxa_atexit calls
  2022-08-23 13:40 ` Michael Matz
@ 2022-08-24 12:06   ` Florian Weimer
  2022-08-24 12:53     ` Michael Matz
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-08-24 12:06 UTC (permalink / raw)
  To: Michael Matz; +Cc: binutils, gcc, libc-alpha

* Michael Matz:

> Hello,
>
> On Tue, 23 Aug 2022, Florian Weimer via Gcc wrote:
>
>> We currently have a latent bug in glibc where C++ constructor calls can
>> fail if they have static or thread storage duration and a non-trivial
>> destructor.  The reason is that __cxa_atexit (and
>> __cxa_thread_atexit_impl) may have to allocate memory.  We can avoid
>> that if we know how many such static calls exist in an object (for C++,
>> the compiler will never emit these calls repeatedly in a loop).  Then we
>> can allocate the resources beforehand, either during process and thread
>> start, or when dlopen is called and new objects are loaded.
>
> Isn't this merely moving the failure point from exception-at-ctor to 
> dlopen-fails?

Yes, and that is a soft error that can be handled (likewise for
pthread_create).

> If an individual __cxa_atexit can't allocate memory anymore for its
> list structure, why should pre-allocation (which is still dynamic,
> based on the number of actual atexit calls) have any more luck?

We can report the error properly, and not just terminate the process.

The existing ABI functions are mostly noexcept.  For C++ constructors of
global objects, there cannot even be a handler because they are invoked
by an ELF constructor, and throwing through an ELF constructor is
undefined.

>> What would be the most ELF-flavored way to implement this?  After the
>> final link, I expect that the count (or counts, we need a separate
>> counter for thread-local storage) would show up under a new dynamic tag
>> in the dynamic segment.  This is actually a very good fit because older
>> loaders will just ignore it.  But the question remains what GCC should
>> emit into assembler & object files, so that the link editor can compute
>> the total count from that.
>
> Probably a note section, which the link editor could either transform into 
> a dynamic tag or leave as note(s) in the PT_NOTE segment.  The latter 
> wouldn't require any specific tooling support in the link editor.  But the 
> consumer would have to iterate through all the notes to add the 
> individual counts together.  Might be acceptable, though.

I think we need some level of link editor support to avoid drastically
over-counting multiple static calls that get merged into one
implementation as the result of vague linkage.  Not sure how to express
that at the ELF level?

Thanks,
Florian


* Re: Counting static __cxa_atexit calls
  2022-08-24 12:06   ` Florian Weimer
@ 2022-08-24 12:53     ` Michael Matz
  2022-08-24 14:31       ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Matz @ 2022-08-24 12:53 UTC (permalink / raw)
  To: Florian Weimer; +Cc: binutils, gcc, libc-alpha

Hello,

On Wed, 24 Aug 2022, Florian Weimer wrote:

> > Isn't this merely moving the failure point from exception-at-ctor to 
> > dlopen-fails?
> 
> Yes, and that is a soft error that can be handled (likewise for
> pthread_create).

Makes sense.  Though that actually hints at a design problem with ELF 
static ctors/dtors: they should be able to soft-fail (leading to dlopen or 
pthread_create error returns).  So, maybe the _best_ way to deal with this 
is to extend the definition of the various object-initialization means 
in ELF to allow propagating failure.

> > Probably a note section, which the link editor could either transform into 
> > a dynamic tag or leave as note(s) in the PT_NOTE segment.  The latter 
> > wouldn't require any specific tooling support in the link editor.  But the 
> > consumer would have to iterate through all the notes to add the 
> > individual counts together.  Might be acceptable, though.
> 
> I think we need some level of link editor support to avoid drastically
> over-counting multiple static calls that get merged into one
> implementation as the result of vague linkage.  Not sure how to express
> that at the ELF level?

Hmm.  The __cxa_atexit calls are coming from the per-file local static 
initialization_and_destruction routine which doesn't have vague linkage, 
so its contribution to the overall number of cxa_atexit calls doesn't 
change from .o to final-exe.  Can you show an example of what you're 
worried about?

A completely different way would be to not use cxa_atexit at all: allocate 
memory statically for the object and dtor addresses in .rodata (instead of 
in .text right now), and iterate over those at static_destruction time.  
(For the thread-local ones it would need to store arguments to 
__tls_get_addr).

Doing that or defining failure modes for ELF init/fini seems a better 
design than hacking around the current limitation via counting static 
cxa_atexit calls.

Ciao,
Michael.

* Re: Counting static __cxa_atexit calls
  2022-08-24 12:53     ` Michael Matz
@ 2022-08-24 14:31       ` Florian Weimer
  2022-08-24 15:25         ` Michael Matz
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2022-08-24 14:31 UTC (permalink / raw)
  To: Michael Matz; +Cc: binutils, gcc, libc-alpha

* Michael Matz:

> Hello,
>
> On Wed, 24 Aug 2022, Florian Weimer wrote:
>
>> > Isn't this merely moving the failure point from exception-at-ctor to 
>> > dlopen-fails?
>> 
>> Yes, and that is a soft error that can be handled (likewise for
>> pthread_create).
>
> Makes sense.  Though that actually hints at a design problem with ELF 
> static ctors/dtors: they should be able to soft-fail (leading to dlopen or 
> pthread_create error returns).  So, maybe the _best_ way to deal with this 
> is to extend the definition of the various object-initialization means 
> in ELF to allow propagating failure.

We could enable unwinding through the dynamic linker perhaps.  But as I
said, those Itanium ABI functions tend to be noexcept, so there's work
on that front as well.

For thread-local storage, it's even more difficult because any first
access can throw even if the constructor is noexcept.

>> > Probably a note section, which the link editor could either transform into 
>> > a dynamic tag or leave as note(s) in the PT_NOTE segment.  The latter 
>> > wouldn't require any specific tooling support in the link editor.  But the 
>> > consumer would have to iterate through all the notes to add the 
>> > individual counts together.  Might be acceptable, though.
>> 
>> I think we need some level of link editor support to avoid drastically
>> over-counting multiple static calls that get merged into one
>> implementation as the result of vague linkage.  Not sure how to express
>> that at the ELF level?
>
> Hmm.  The __cxa_atexit calls are coming from the per-file local static 
> initialization_and_destruction routine which doesn't have vague linkage, 
> so its contribution to the overall number of cxa_atexit calls doesn't 
> change from .o to final-exe.  Can you show an example of what you're 
> worried about?

Sorry if I didn't use the correct terminology.

I was thinking about this:

#include <vector>

template <int i>
struct S {
  static std::vector<int *> vec;
};

template <int i> std::vector<int *> S<i>::vec(i);

std::vector<int *> &
f()
{
  return S<1009>::vec;
}

The initialization is deduplicated with the help of a guard variable,
and that also bounds the number of __cxa_atexit invocations to at most
one per type.
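
The guard-variable behavior can be modeled with a small hand-written simulation. This is not compiler output: Shared stands in for the single merged copy of the guard and object that survives linking, init_from_translation_unit stands in for the code each translation unit carries, and the exact order of the steps inside the guarded block is an assumption.

```cpp
#include <cassert>

// Stand-in for the merged (vague-linkage) state of one static member.
struct Shared {
  unsigned char guard = 0;  // guard variable, shared by all copies
  int constructed = 0;      // how often the constructor ran
  int atexit_calls = 0;     // how often __cxa_atexit would be called
};

// Roughly what every translation unit that instantiates the member does
// during static initialization.  The call site exists once per
// translation unit, but the guarded body runs only the first time.
void init_from_translation_unit(Shared &s) {
  if (s.guard == 0) {
    s.guard = 1;
    ++s.constructed;   // stands in for running the constructor
    ++s.atexit_calls;  // stands in for __cxa_atexit(dtor, &obj, dso)
  }
}

// Three translation units contain the call site, yet only one
// registration happens at run time.
void example() {
  Shared s;
  for (int tu = 0; tu < 3; ++tu) init_from_translation_unit(s);
  assert(s.constructed == 1);
  assert(s.atexit_calls == 1);
}
```

A naive per-object count would add three here, while only one slot is ever needed; that gap is the over-counting concern.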

> A completely different way would be to not use cxa_atexit at all: allocate 
> memory statically for the object and dtor addresses in .rodata (instead of 
> in .text right now), and iterate over those at static_destruction time.  
> (For the thread-local ones it would need to store arguments to 
> __tls_get_addr).

That only works if the compiler and linker can figure out the
construction order.  In general, that is not possible, and that case
seems even quite common with C++.  If the construction order is not
known ahead of time, it is necessary to record it somewhere, so that
destruction can happen in reverse.  So I think storing things in .rodata
is out.

Thanks,
Florian


* Re: Counting static __cxa_atexit calls
  2022-08-24 14:31       ` Florian Weimer
@ 2022-08-24 15:25         ` Michael Matz
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Matz @ 2022-08-24 15:25 UTC (permalink / raw)
  To: Florian Weimer; +Cc: binutils, gcc, libc-alpha

Hello,

On Wed, 24 Aug 2022, Florian Weimer wrote:

> > On Wed, 24 Aug 2022, Florian Weimer wrote:
> >
> >> > Isn't this merely moving the failure point from exception-at-ctor to 
> >> > dlopen-fails?
> >> 
> >> Yes, and that is a soft error that can be handled (likewise for
> >> pthread_create).
> >
> > Makes sense.  Though that actually hints at a design problem with ELF 
> > static ctors/dtors: they should be able to soft-fail (leading to dlopen or 
> > pthread_create error returns).  So, maybe the _best_ way to deal with this 
> > is to extend the definition of the various object-initialization means 
> > in ELF to allow propagating failure.
> 
> We could enable unwinding through the dynamic linker perhaps.  But as I
> said, those Itanium ABI functions tend to be noexcept, so there's work
> on that front as well.

Yeah, my idea would have been slightly less ambitious: redefine the ABI of 
.init_array functions to be able to return an int.  The loader would abort 
loading if any of them return non-zero.  Now change GCC code emission of 
those helper functions placed in .init_array to catch all exceptions and 
(in case an exception happened) return non-zero.  Or, even easier, don't 
deal with exceptions, but rather just check if __cxa_atexit worked, and if 
not return non-zero right away.  That way all the exception propagation 
(or cxa_atexit error handling) stays purely within the GCC generated code 
and the dynamic loader only needs to deal with return values, not 
exceptions and unwinding.

For backward compat we can't just change the ABI of .init_array, but we 
can devise an alternative: .init_array_mayfail and the associated DT tags.
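
A minimal sketch of the loader side of a failable init array, with invented names throughout: FailableInit is the assumed signature of an .init_array_mayfail entry, run_failable_inits is the loop the loader would run, and fake_cxa_atexit merely imitates the documented convention that __cxa_atexit returns non-zero on failure.

```cpp
#include <cassert>
#include <cstddef>

// Assumed signature for entries of the proposed section: 0 is success.
using FailableInit = int (*)();

// Call each initializer in order and stop at the first failure.
// *ran reports how many initializers were actually called.
int run_failable_inits(const FailableInit *fns, std::size_t n,
                       std::size_t *ran) {
  assert(ran != nullptr);
  *ran = 0;
  for (std::size_t i = 0; i < n; ++i) {
    ++*ran;
    if (int rc = fns[i]()) return rc;  // loader would abort loading here
  }
  return 0;
}

// Stand-in for __cxa_atexit with an injected failure.
int fake_cxa_atexit(bool fail) { return fail ? -1 : 0; }

// Stand-ins for compiler-generated helpers that check the registration
// result and report it through the return value instead of throwing.
int init_ok() { return fake_cxa_atexit(false) != 0; }
int init_out_of_memory() { return fake_cxa_atexit(true) != 0; }
```

The sketch leaves open what the loader does about initializers that already ran before the failing one; the thread does not settle that either.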

> For thread-local storage, it's even more difficult because any first
> access can throw even if the constructor is noexcept.

That's extending the scope somewhat, pre-counting cxa_atexit wouldn't 
solve this problem either, right?

> >> I think we need some level of link editor support to avoid drastically
> >> over-counting multiple static calls that get merged into one
> >> implementation as the result of vague linkage.  Not sure how to express
> >> that at the ELF level?
> >
> > Hmm.  The __cxa_atexit calls are coming from the per-file local static 
> > initialization_and_destruction routine which doesn't have vague linkage, 
> > so its contribution to the overall number of cxa_atexit calls doesn't 
> > change from .o to final-exe.  Can you show an example of what you're 
> > worried about?
> 
> Sorry if I didn't use the correct terminology.
> 
> I was thinking about this:
> 
> #include <vector>
> 
> template <int i>
> struct S {
>   static std::vector<int *> vec;
> };
> 
> template <int i> std::vector<int *> S<i>::vec(i);
> 
> std::vector<int *> &
> f()
> {
>   return S<1009>::vec;
> }
> 
> The initialization is deduplicated with the help of a guard variable,
> and that also bounds the number of __cxa_atexit invocations to at most
> one per type.

Ah, right, thanks.  The guard variable is for class-local statics; I was 
thinking of file-scope globals.  Double-hmm.  I don't readily see a nice way 
to correctly precalculate the number of cxa_atexit calls here.  A simple 
problem is the following: assume a couple files each defining such class 
templates, that ultimately define and initialize static members A<1>::a 
and B<1>::b (assume vague linkage).  Assume we have four files:

a:  defines A::a
b:  defines B::b
ab: defines A::a and B::b
ba: defines B::b and A::a

Now link order influences which file gets to actually initialize the 
members and which ones skip it due to guard variables.  But the object 
files themselves don't know enough context to tell which will be which.  Not 
even the link editor knows that, because the non-taken cxa_atexit calls aren't 
in linkonce/group sections; they are all there in 
object.o:.text:_Z41__static_initialization_and_destruction_0ii .

So, what would need to be emitted is for instance a list of cxa_atexit 
calls plus guard variable; the link editor could then count all unguarded 
cxa_atexit calls plus all guarded ones, but the latter only once per 
guard.  The key would be the identity of the guard variable.
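
The counting rule can be sketched as follows, assuming the compiler emitted one record per cxa_atexit call site. AtexitCallRecord and count_atexit_slots are hypothetical, and the guard is identified here by a plain string, where a real link editor would use the guard symbol.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-call-site record.  An empty guard name means the
// call is unguarded and always executes.
struct AtexitCallRecord {
  std::string guard;
};

using ObjectFile = std::vector<AtexitCallRecord>;

// Count every unguarded call, plus each distinct guard exactly once.
std::size_t count_atexit_slots(const std::vector<ObjectFile> &objects) {
  std::size_t unguarded = 0;
  std::set<std::string> guards;
  for (const ObjectFile &obj : objects) {
    for (const AtexitCallRecord &rec : obj) {
      if (rec.guard.empty()) {
        ++unguarded;
      } else {
        guards.insert(rec.guard);
      }
    }
  }
  return unguarded + guards.size();
}

// The four-file case from above: a, b, ab and ba.
std::size_t four_file_example() {
  const std::vector<ObjectFile> objects = {
      {{"guard:A<1>::a"}},
      {{"guard:B<1>::b"}},
      {{"guard:A<1>::a"}, {"guard:B<1>::b"}},
      {{"guard:B<1>::b"}, {"guard:A<1>::a"}},
  };
  return count_atexit_slots(objects);
}
```

For the four files the result is two, whatever the link order, whereas a plain sum of call sites would be six.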

That seems like an awful lot of complexity at the wrong level for a very 
specific usecase when we could also make .init_array failable, which then 
even might have more usecases.

> > A completely different way would be to not use cxa_atexit at all: 
> > allocate memory statically for the object and dtor addresses in 
> > .rodata (instead of in .text right now), and iterate over those at 
> > static_destruction time.  (For the thread-local ones it would need to 
> > store arguments to __tls_get_addr).
> 
> That only works if the compiler and linker can figure out the
> construction order.  In general, that is not possible, and that case
> seems even quite common with C++.  If the construction order is not
> known ahead of time, it is necessary to record it somewhere, so that
> destruction can happen in reverse.  So I think storing things in .rodata
> is out.

Hmm, right.  The basic idea could be salvaged by also pre-allocating a 
linked list field in .data (or .tdata), and a per-object-file entry to 
such list.  But failable .init_array looks nicer to me right now.
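
A sketch of the salvaged variant, with invented names: each object that needs destruction gets a statically allocated DtorNode (conceptually in .data, or .tdata for thread-locals), and construction links the node into a per-object-file DtorList. Linking cannot fail because nothing is allocated, and the list records the actual construction order.

```cpp
#include <cassert>

// Statically allocated node, one per object with a non-trivial destructor.
struct DtorNode {
  void (*dtor)(void *);
  void *object;
  DtorNode *next;  // link field, filled in when the object is constructed
};

// Per-object-file list head.
struct DtorList {
  DtorNode *head = nullptr;
};

// Called right after the object is constructed.  Only writes to
// pre-allocated storage, so it cannot run out of memory.
void dtor_list_push(DtorList &list, DtorNode &node) noexcept {
  assert(node.dtor != nullptr);
  node.next = list.head;
  list.head = &node;
}

// Called at destruction time: the most recently constructed object is
// destroyed first.
void dtor_list_run(DtorList &list) {
  while (DtorNode *n = list.head) {
    list.head = n->next;
    n->dtor(n->object);
  }
}
```

Pushing at the head makes the reverse-of-construction order fall out of a plain forward walk.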

Ciao,
Michael.
