From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: Florian Weimer <fweimer@redhat.com>,
Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org>
Cc: Jakub Jelinek <jakub@redhat.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 3/3] elf: Add _dl_find_eh_frame function
Date: Tue, 23 Nov 2021 13:52:17 -0300 [thread overview]
Message-ID: <9d7d6043-734a-48b9-578e-1c1924896a28@linaro.org> (raw)
In-Reply-To: <87czmy7vf0.fsf@oldenburg.str.redhat.com>
On 17/11/2021 10:40, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
>
>> However the code is somewhat complex and I would like to have some feedback
>> if gcc will be willing to accept this change (I assume it would require
>> this code merge on glibc beforehand).
>
> There's a long review queue on the GCC side due to the stage1 close.
> It may still be considered for GCC 12. Jakub has also requested that
> we hold off committing the glibc side until the GCC side is reviewed.
>
> I'll flesh out the commit message and NEWS entry once we have agreed
> upon the interface.
>
>>> new file mode 100644
>>> index 0000000000..c7313c122d
>>> --- /dev/null
>>> +++ b/elf/dl-find_eh_frame.c
>
>>> +/* Data for the main executable. There is usually a large gap between
>>> + the main executable and initially loaded shared objects. Record
>>> + the main executable separately, to increase the chance that the
>>> + range for the non-closeable mappings below covers only the shared
>>> + objects (and not also the gap between main executable and shared
>>> + objects). */
>>> +static uintptr_t _dl_eh_main_map_start attribute_relro;
>>> +static struct dl_eh_frame_info _dl_eh_main_info attribute_relro;
>>> +
>>> +/* Data for initally loaded shared objects that cannot be unlaoded.
>>
>> s/initally/initially and s/unlaoded/unloaded.
>
> Fixed.
>
>>
>>> + The mapping base addresses are stored in address order in the
>>> + _dl_eh_nodelete_mappings_bases array (containing
>>> + _dl_eh_nodelete_mappings_size elements). The EH data for a base
>>> + address is stored in the parallel _dl_eh_nodelete_mappings_infos.
>>> + These arrays are not modified after initialization. */
>>> +static uintptr_t _dl_eh_nodelete_mappings_end attribute_relro;
>>> +static size_t _dl_eh_nodelete_mappings_size attribute_relro;
>>> +static uintptr_t *_dl_eh_nodelete_mappings_bases attribute_relro;
>>> +static struct dl_eh_frame_info *_dl_eh_nodelete_mappings_infos
>>> + attribute_relro;
>>> +
>>> +/* Mappings created by dlopen can go away with dlclose, so a data
>>> + dynamic data structure with some synchronization is needed.
>>
>> This sounds strange ("a data dynamic data").
>
> I dropped the first data.
>
>>
>>> + Individual segments are similar to the _dl_eh_nodelete_mappings
>>
>> Maybe use _dl_eh_nodelete_mappings_*, because '_dl_eh_nodelete_mappings'
>> itself if not defined anywhere.
>
> Right.
>
>>> + Adding new elements to this data structure is another source of
>>> + quadratic behavior for dlopen. If the other causes of quadratic
>>> + behavior are eliminated, a more complicated data structure will be
>>> + needed. */
>>
>> This worries me, specially we have reports that python and other dynamic
>> environments do use a lot of plugin and generates a lot of dlopen() calls.
>> What kind of performance implication do you foresee here?
>
> The additional overhead is not disproportionate to the other sources of
> quadratic behavior. With 1,000 dlopen'ed objects, overall run-time
> seems to be comparable to the strcmp time required soname matching, for
> example, and is quite difficult to measure. So we could fix the
> performance regression if we used a hash table for that …
>
> It's just an undesirable complexity class. The implementation is not
> actually slow because it's a mostly-linear copy (although a backwards
> one). Other parts of dlopen involve pointer chasing and are much
> slower.
Right, I agree this should probably won't incur in performance issues,
I was curious if you have any numbers about it.
>
>>> +/* Allocate an empty segment that is at least SIZE large. PREVIOUS */
>>
>> What this PREVIOUS refer to?
>
> Oops, it's now:
>
> /* Allocate an empty segment that is at least SIZE large. PREVIOUS
> points to the chain of previously allocated segments and can be
> NULL. */
>
>>> +/* Update the version to reflect that an update is happening. This
>>> + does not change the bit that controls the active segment chain.
>>> + Returns the index of the currently active segment chain. */
>>> +static inline unsigned int
>>> +_dl_eh_mappings_begin_update (void)
>>> +{
>>> + unsigned int v
>>> + = __atomic_wide_counter_fetch_add_relaxed (&_dl_eh_loaded_mappings_version,
>>> + 2);
>>
>> Why use an 'unsigned int' for the wide counter here?
>
> Because …
>
>>> + /* Subsequent stores to the TM data must not be reordered before the
>>> + store above with the version update. */
>>> + atomic_thread_fence_release ();
>>> + return v & 1;
>>> +}
>
> … we only need the lower bit.
Ack, I guess it won't matter to compiler.
>
>>> + /* Other initially loaded objects. */
>>> + if (pc >= *_dl_eh_nodelete_mappings_bases
>>> + && pc < _dl_eh_nodelete_mappings_end)
>>> + {
>>> + size_t idx = _dl_eh_find_lower_bound (pc,
>>> + _dl_eh_nodelete_mappings_bases,
>>> + _dl_eh_nodelete_mappings_size);
>>> + const struct dl_eh_frame_info *info
>>> + = _dl_eh_nodelete_mappings_infos + idx;
>>
>> Ins't a UB if idx is not a valid one?
>
> idx is always valid here.
>
>>> + bool match;
>>> + if (idx < _dl_eh_nodelete_mappings_size
>>> + && pc == _dl_eh_nodelete_mappings_bases[idx])
>>> + match = true;
>>> + else
>>> + {
>>> + /* PC might be in the previous mapping. */
>>> + --idx;
>>> + --info;
>>> + match = pc - _dl_eh_nodelete_mappings_bases[idx] < info->size;
>>> + }
>>
>> It is not clear to me why _dl_eh_find_lower_bound() can not handle this
>> specific exception, since this is also used in the audit, dlopen case
>> below.
>
> The struct layouts are different. Maybe I should use a segment
> structure for _dl_eh_nodelete_mappings_bases as well?
I am not sure, it is just that check does seems similar and it is used on
a couple of places. Maybe you can consolidate it.
>
>>> + /* Handle audit modules, dlopen, dlopen objects. This uses software
>>> + transactional memory, with a retry loop in case the version
>>> + changes during execution. */
>>> + while (true)
>>> + {
>>> + retry:
>>> + ;
>>> + uint64_t start_version = _dl_eh_read_start_version ();
>
>>> + for (struct dl_eh_mappings_segment *seg
>>> + = _dl_eh_mappings_active_segment (start_version);
>>> + seg != NULL && seg->size > 0; seg = seg->previous)
>>> + if (pc >= seg->bases[0])
>>> + {
>>> + /* PC may lie within this segment. If it is less than the
>
>>> + if (match)
>>> + {
>>> + /* Found the right mapping. Copy out the data prior to
>>> + checking if the read transaction was successful. */
>
>>> + if (_dl_eh_read_success (start_version))
>
>>> + else
>>> + /* Read transaction failure. */
>>> + goto retry;
>>
>> Why not 'continue' here?
>
> We need to re-run the outer while loop, not the inner segment
> iteration.
Indeed.
>
>>> +/* _dl_eh_process_initial is called twice. First to compute the array
>>> + sizes from the initial loaded mappings. Second to fill in the
>>> + bases and infos arrays with the (still unsorted) data. Returns the
>>> + number of loaded (non-nodelete) mappings. */
>>
>> I usually see this called twice functionn confunsing because it not clear
>> which is the default patch used for each call (although the comments do
>> help it).
>>
>> Maybe you can add two function, one that calculates the initial size and
>> another that actually populates it. It allows, for instance, to call
>> the initial_map _dl_get_eh_frame() once (you will need to keep the
>> struct dl_eh_frame_info between calls).
>
> I started out with this and found it more confusing because of the
> complicated conditions (apart from the phase distinction).
Right.
>
>>> +static void
>>> +_dl_eh_frame_link_map_sort (struct link_map **loaded, size_t size)
>>> +{
>>> + /* Selection sort based on map_start. */
>>> + if (size < 2)
>>> + return;
>>> + for (size_t i = 0; i < size - 1; ++i)
>>> + {
>>> + /* Find minimum. */
>>> + size_t min_idx = i;
>>> + ElfW(Addr) min_val = loaded[i]->l_map_start;
>>> + for (size_t j = i + 1; j < size; ++j)
>>> + if (loaded[j]->l_map_start < min_val)
>>> + {
>>> + min_idx = j;
>>> + min_val = loaded[j]->l_map_start;
>>> + }
>>> +
>>> + /* Swap into place. */
>>> + struct link_map *tmp = loaded[min_idx];
>>> + loaded[min_idx] = loaded[i];
>>> + loaded[i] = tmp;
>>> + }
>>> +}
>>
>> Could this be a problem with programs with a lot of dependencies?
>
> I don't see it in profiles. It's quadratic for sure, but mostly linear
> accesses, and the array fits into the L1 cache. We could optimize it
> and check if the array is already sorted in reverse order, reflecting
> the typical kernel mapping behavior.
If this is not really a hotspot I would prefer the simplest code.
>
>>> +/* Invoked from _dl_eh_frame_frame after sorting. */
>>> +static bool
>>> +_dl_find_eh_frame_update_1 (struct link_map **loaded, size_t count)
>>> +{
>>> + int active_idx = _dl_eh_mappings_begin_update ();
>>> +
>>> + struct dl_eh_mappings_segment *current_seg
>>> + = _dl_eh_loaded_mappings[active_idx];
>>> + size_t current_used = _dl_eh_mappings_segment_count_used (current_seg);
>>> +
>>> + struct dl_eh_mappings_segment *target_seg
>>> + = _dl_eh_loaded_mappings[!active_idx];
>>> + size_t remaining_to_add = current_used + count;
>>> +
>>> + /* Ensure that the new segment chain has enough space. */
>>> + {
>>> + size_t new_allocated
>>> + = _dl_eh_mappings_segment_count_allocated (target_seg);
>>> + if (new_allocated < remaining_to_add)
>>> + {
>>> + size_t more = remaining_to_add - new_allocated;
>>> + target_seg = _dl_eh_mappings_segment_allocate (more, target_seg);
>>> + if (target_seg == NULL)
>>> + /* Out of memory. */
>>> + return false;
>>
>> Shouldn't it have a _dl_eh_mappings_end_update() here?
>
> No, we should keep the original version. I added:
>
> /* Out of memory. Do not end the update and keep the
> current version unchanged. */
>
> We could split the counter update and current version access, and start
> the update only after the malloc, I think. This would make it less
> likely that a concurrent thread sees multiple version updates, causing
> multiple retries, I think.
This sounds a nice improvement, although I think malloc failure would
be unlikely on most systems with overcommit.
>
>>> +bool
>>> +_dl_find_eh_frame_update (struct link_map *new_map)
>>> +{
>>> + /* Copy the newly-loaded link maps into an array for sorting. */
>>> + size_t count = 0;
>>> + for (struct link_map *l = new_map; l != NULL; l = l->l_next)
>>> + count += !l->l_eh_frame_processed;
>>> + struct link_map **map_array = malloc (count * sizeof (*map_array));
>>
>> Maybe use a scratch_buffer here? It should cover 128 entries for 64-bit,
>
> This is only used for dlopen, and dlopen already has many malloc calls.
> I don't think it matters.
Ok.
>>> diff --git a/elf/dl-find_eh_frame.h b/elf/dl-find_eh_frame.h
>>> new file mode 100644
>>> index 0000000000..4bde9b14db
>>> --- /dev/null
>>> +++ b/elf/dl-find_eh_frame.h
>
>>> +/* Extract the exception handling data from a link map and writes it
>>> + to *INFO. If no such data is available INFO->eh_frame will be
>>> + NULL. */
>>> +static void __attribute__ ((unused))
>>
>> Maybe use inline here and let compiler optimize?
>
> I don't think inlining is beneficial.
>
>> Or move to the be a proper function?
>
> It's also used from tests, so it's convenient to have it in a header.
I usually don't like to resort extra attributes to avoid compiler warnings
when we do have a standard solution. But I don't have a strong preference
here.
>
>>> +/* This function is similar to _dl_find_eh_frame, but travers the link
>>> + maps directly. It is used from audit modules before
>>> + _dl_find_eh_frame_init has been called, and for testing. */
>>> +static void *
>>> +_dl_find_eh_frame_slow (void *pc
>>> +#if DL_FIND_EH_FRAME_DBASE
>>> + , void **dbase
>>> +#endif
>>> + )
>>> +{
>>> + ElfW(Addr) addr = (ElfW(Addr)) pc;
>>> + for (Lmid_t ns = 0; ns < GL(dl_nns); ++ns)
>>> + for (struct link_map *l = GL(dl_ns)[ns]._ns_loaded; l != NULL;
>>> + l = l->l_next)
>>> + if (addr >= l->l_map_start && addr < l->l_map_end
>>> + && (l->l_contiguous || _dl_addr_inside_object (l, addr)))
>>> + {
>>> + assert (ns == l->l_ns);
>>> + struct dl_eh_frame_info info;
>>> + _dl_get_eh_frame (l, &info);
>>> +#if DL_FIND_EH_FRAME_DBASE
>>> + *dbase = info.dbase;
>>> +#endif
>>> + return info.eh_frame;
>>> + }
>>> +
>>> + /* Object not found. */
>>> + return NULL;
>>> +}
>>
>> This is only used on dl-find-eh_frame.c, so there is no much gain in
>> adding header to define it.
>
> I expected it to use it in tests.
But currently it is not used, right? The tests seems to use solely
_dl_find_eh_frame().
Do you plan to add other tests to stress it? Also, it always called
on initialization, so you can indirectly call it through
_dl_find_eh_frame().
>
>>> +@deftypefun {void *} _dl_find_eh_frame (void *@var{pc})
>>> +@standards{GNU, dlfcn.h}
>>> +@safety{@mtsafe{}@assafe{}@acsafe{}}
>>> +This function returns a pointer to the unwinding information for the
>>> +object that contains the program coder @var{pc}. If the platform uses
>>> +DWARF unwinding information, this is the in-memory address of the
>>> +@code{PT_GNU_EH_FRAME} segment.
>>> +
>>> +In case @var{pc} resides in an object that lacks unwinding information,
>>> +the function returns @code{NULL}. If no object matches @var{pc},
>>> +@code{NULL} is returned as well.
>>> +
>>> +@code{_dl_find_eh_frame} itself is thread-safe. However, if the
>>> +application invokes @code{dlclose} for the object that contains @var{pc}
>>> +concurrently with @code{_dl_find_eh_frame} or after the call returns,
>>> +accessing the unwinding data for that object is not safe. Therefore,
>>> +the application needs to ensure by other means (e.g., by convention)
>>> +that @var{pc} remains a valid code address while the unwinding
>>> +information is processed.
>>> +
>>> +This function is a GNU extension.
>>
>> In the code you state this interface should be async-signa-safe, but there it
>> no explicit documentation about it.
>
> It says so in the header:
>
> +@safety{@mtsafe{}@assafe{}@acsafe{}}
Oops, thanks.
>
>>> diff --git a/sysdeps/i386/bits/dlfcn_eh_frame.h b/sysdeps/i386/bits/dlfcn_eh_frame.h
>>> new file mode 100644
>>> index 0000000000..98f6b37029
>>> --- /dev/null
>>> +++ b/sysdeps/i386/bits/dlfcn_eh_frame.h
>
>>> +#ifndef _DLFCN_H
>>> +# error "Never use <bits/dlfcn_eh_frame.h> directly; include <dlfcn.h> instead."
>>> +#endif
>>> +
>>> +/* This implementation uses a DBASE pointer argument in
>>> + _dl_find_eh_frame. */
>>> +#define DL_FIND_EH_FRAME_DBASE 1
>>
>> Maybe move DL_FIND_EH_FRAME_DBASE to its own header and sets the expected
>> _dl_find_eh_frame() prototype based on it (so you don't need to replicate
>> it on i386 and nios2).
>
> I think I wrote elsewhere that I could glibc/GCC settle on a different
> interface eventually, e.g.
>
> struct dl_object_info
> {
> uint64_t flags; /* Would indicate NODELETE/otherwise persistent mappings. */
> struct link_map *map;
> void *begin;
> void *end;
> void *eh_frame;
> #if DL_OBJECT_INFO_DBASE
> void *dbase;
> #endif
> #if DL_OBJECT_INFO_PCOUNT
> unsigned long int pcount; /* Used on Arm, see __gnu_Unwind_Find_exidx. */
> #endif
> };
>
> int _dl_find_object (void *address, struct dl_object_info *result);
>
> This interface would allow some additional optimizations in libgcc_s
> (e.g., cache its own eh_frame value, and the libstdc++ eh_frame value if
> it's a persistent mapping). The object boundary information could be
> useful in other cases. And it's trivial to implement
> _dl_find_dso_for_object on top of it, eliminating a potential bottleneck
> from caller-sensitive functions.
>
> Thanks,
> Florian
>
next prev parent reply other threads:[~2021-11-23 16:52 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-03 16:27 [PATCH 0/3] Add _dl_find_eh_frame function for unwinder optimization Florian Weimer
2021-11-03 16:27 ` [PATCH 1/3] nptl: Extract <bits/atomic_wide_counter.h> from pthread_cond_common.c Florian Weimer
2021-11-15 19:24 ` Adhemerval Zanella
2021-11-18 13:19 ` Jakub Jelinek
2021-11-18 13:23 ` Florian Weimer
2021-11-03 16:27 ` [PATCH 2/3] elf: Introduce GLRO (dl_libc_freeres), called from __libc_freeres Florian Weimer
2021-11-15 19:28 ` Adhemerval Zanella
2021-11-03 16:28 ` [PATCH 3/3] elf: Add _dl_find_eh_frame function Florian Weimer
2021-11-16 12:42 ` Adhemerval Zanella
2021-11-17 13:40 ` Florian Weimer
2021-11-23 16:52 ` Adhemerval Zanella [this message]
2021-11-18 13:43 ` Jakub Jelinek
2021-11-18 14:09 ` Florian Weimer
2021-11-18 15:37 ` Jakub Jelinek
2021-11-18 16:28 ` Florian Weimer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9d7d6043-734a-48b9-578e-1c1924896a28@linaro.org \
--to=adhemerval.zanella@linaro.org \
--cc=fweimer@redhat.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=jakub@redhat.com \
--cc=libc-alpha@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).