* [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-11-28 19:03 UTC
To: GNU C Library; +Cc: nd, Torvald Riegel

Before sending an updated version of https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html I will try to provide a high level description of the issues (which can go into the concurrency notes of the patch), in the hope that agreeing on the design will reduce the number of patch review iterations.

Overview:

Dynamic linker internal data structures for thread local storage (TLS) may be accessed concurrently, causing assertion failures in dl-tls.c when dlopen loads modules that have TLS concurrently with pthread_create.

Assertions that may trigger:

  Inconsistency detected by ld.so: dl-tls.c: 520: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
  Inconsistency detected by ld.so: dl-tls.c: 556: _dl_allocate_tls_init: Assertion `listp != NULL' failed!

Related recent discussions:

  TLS performance degradation after dlopen: https://sourceware.org/ml/libc-alpha/2016-06/msg00203.html
  nptl/tst-stack4 failure: https://sourceware.org/ml/libc-alpha/2016-11/msg00574.html
  Test case patch: https://sourceware.org/ml/libc-alpha/2016-11/msg00917.html

Current behaviour (minor details omitted):

dlopen (operations in program order):

  Load a module and its dependencies, assigning a modid to each while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv and slotinfo index of the module).

  Add each module to GL(dl_tls_dtv_slotinfo_list) using _dl_add_to_slotinfo, assigning the next generation count to these slotinfo entries.

  Increment the GL(dl_tls_generation) counter.

  Update the dtv of the current thread following the slotinfo changes by calling _dl_update_slotinfo (this only resizes the dtv and fills it with unallocated slots; actual TLS allocation and initialization is deferred).
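The write ordering above is the crux of why pthread_create's assertions can fire: the module count runs ahead of the slotinfo list, which runs ahead of the generation counter. As a toy single-threaded model (illustrative names, not glibc code) of the three bookkeeping updates and of assumption (a) asserted by pthread_create:

```c
#include <stddef.h>

/* Toy model (not glibc code) of the three TLS bookkeeping writes that
   dlopen performs, in the program order described above.  Between the
   first and second step a reader sees dl_tls_max_dtv_idx run ahead of
   the slotinfo list, which is exactly what assertion (a) rules out.  */
struct tls_globals { size_t max_dtv_idx; size_t slotinfo_len; size_t generation; };

static void step_assign_modid (struct tls_globals *g)    { g->max_dtv_idx++; }
static void step_add_slotinfo (struct tls_globals *g)    { g->slotinfo_len++; }
static void step_bump_generation (struct tls_globals *g) { g->generation++; }

/* assumption (a) asserted by pthread_create's TLS setup */
static int assumption_a_holds (const struct tls_globals *g)
{
  return g->slotinfo_len >= g->max_dtv_idx;
}

/* returns 1 if a reader between the first two steps would trip assert (a) */
int race_window_exists (void)
{
  struct tls_globals g = { 1, 1, 1 };    /* one module already loaded */
  step_assign_modid (&g);
  int tripped = !assumption_a_holds (&g); /* reader here: idx > list length */
  step_add_slotinfo (&g);
  step_bump_generation (&g);
  return tripped && assumption_a_holds (&g);
}
```

Even seq_cst accesses would not help here, as noted below: the problem is the program-order gap between the steps, not the visibility of any single store.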
pthread_create:

  Call _dl_allocate_tls_init to allocate the dtv and TLS for the new thread, covering all currently loaded modules.  ("Currently loaded" means up to a generation count; the dtv and TLS for modules with a larger generation count are allocated lazily on first access.)

Other APIs:

  dlclose can write the relevant globals too, and TLS access and dl_iterate_phdr read them (I focus on dlopen and pthread_create first; similar concerns affect these APIs).  dl_main can write these globals too, but I think it is single threaded (early startup, no pthreads yet; can audit modules change that?).

In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.

Current concurrency design:

dlopen holds GL(dl_load_lock) while writing the relevant globals (using volatile loads and stores).  The lock serializes dlopen calls, but there is no synchronization with pthread_create (other than mmap and mprotect calls, which are likely synchronized but not part of the C11 memory model).

pthread_create accesses the globals without locks and uses volatile loads: the dtv size and the valid entries in the slotinfo list are determined based on GL(dl_tls_max_dtv_idx), with the following asserted assumptions:

  (a) the slotinfo list has at least GL(dl_tls_max_dtv_idx) entries set up,
  (b) slotinfo entries have generation count <= GL(dl_tls_generation).

These would be wrong even if all memory accesses were mo_seq_cst: slotinfo entries are added after the modid is assigned, and the generation counter is incremented even later.

Current design bugs in glibc:

(0) There are data races on the relevant globals, and there are asserts that are not guaranteed to hold (a and b above).  This is what I was trying to fix.
(1) Any access to the link_map struct is invalid if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held, because a concurrent dlclose may free the map.  This affects:

  pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
  __tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)

(2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:

  pthread_create (_dl_allocate_tls_init, reads .gen and .map)
  __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)

(3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).

(4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).

(5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.

(6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.
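For (5), one conceivable guard (a sketch with made-up names, not glibc's actual code) is to refuse the increment rather than hand out a wrapped generation, failing the dlopen/dlclose instead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guarded bump for a size_t generation counter: refuse to
   wrap rather than assign an overflowed generation to a slotinfo entry.
   Illustrative only; glibc would need this at every increment site.  */
static int bump_generation (size_t *gen)
{
  if (*gen == SIZE_MAX)
    return -1;        /* caller must fail the dlopen/dlclose */
  (*gen)++;
  return 0;
}

int overflow_demo (void)
{
  size_t g = SIZE_MAX - 1;
  if (bump_generation (&g) != 0 || g != SIZE_MAX)
    return 0;
  return bump_generation (&g) == -1;  /* next bump must be refused */
}
```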
Proposed fix(es):

(0) can be fixed by adding atomics and removing the invalid asserts: it can be guaranteed that whatever GL(dl_tls_generation) is read in pthread_create, all slotinfo entries with that generation will be read too, and the read GL(dl_tls_max_dtv_idx) is no less than the modids of those entries.  Then walking the slotinfo list until the end and only considering entries up to the observed generation count would set up the dtv consistently, without races, assuming there is no concurrent dlclose.  (I think my previous patch did this.)

On the other hand, (1) and (2) clearly point in the direction that pthread_create should take the lock, in which case thread creation performance may go down a bit, the influence of (6) increases (but that is broken anyway), and we still haven't done anything about __tls_get_addr.  (I don't see another reason to avoid taking the lock.)

Fixing (3) or (6) would need a significant redesign.

Fixing (4) and (5) may be possible independently.

I think these are the immediately available options:

  - a lock-free pthread_create that is broken with concurrent dlclose,
  - a pthread_create that locks and may deadlock when synchronized with a ctor,
  - or the current brokenness (concurrent pthread_create and dlopen asserts).
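The fix for (0) described above can be sketched as follows, under the assumption that dlopen fully publishes each slotinfo entry before bumping the generation with a release store; the struct and function names are illustrative, not glibc's actual code:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of fix (0): pthread_create acquire-loads the generation once,
   then walks the whole slotinfo list and *skips* entries newer than the
   observed generation instead of asserting they cannot exist.  The
   acquire load pairs with a hypothetical release store in dlopen made
   after the entries are published.  Names are illustrative.  */
struct slotinfo { size_t gen; void *map; };

static size_t dtv_setup (const struct slotinfo *list, size_t len,
                         const _Atomic size_t *global_gen)
{
  size_t gen = atomic_load_explicit (global_gen, memory_order_acquire);
  size_t installed = 0;
  for (size_t i = 0; i < len; i++)
    if (list[i].map != NULL && list[i].gen <= gen)
      installed++;   /* would set up this module's dtv slot here */
  return installed;  /* entries with gen > observed are left for lazy init */
}

size_t dtv_setup_demo (void)
{
  static int m1, m2, m3;   /* stand-ins for link maps */
  struct slotinfo list[] = { { 1, &m1 }, { 2, &m2 }, { 3, &m3 } };
  _Atomic size_t generation = 2;   /* dlopen of the third module not yet done */
  return dtv_setup (list, 3, &generation);
}
```

As the text notes, this only holds if there is no concurrent dlclose; a dlclose could free a map the walk has already decided to use.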
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel @ 2016-12-05 19:24 UTC
To: Szabolcs Nagy; +Cc: GNU C Library, nd

On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> Before sending an updated version of https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html I will try to provide a high level description of the issues (which can go into the concurrency notes of the patch), in the hope that agreeing on the design will reduce the number of patch review iterations.

That's a good approach.  Thanks.

> dlopen (operations in program order):
>
>   Load a module and its dependencies, assigning a modid to each while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv and slotinfo index of the module).
>
>   Add each module to GL(dl_tls_dtv_slotinfo_list) using _dl_add_to_slotinfo, assigning the next generation count to these slotinfo entries.
>
>   Increment the GL(dl_tls_generation) counter.
>
>   Update the dtv of the current thread following the slotinfo changes by calling _dl_update_slotinfo (this only resizes the dtv and fills it with unallocated slots; actual TLS allocation and initialization is deferred).
>
> pthread_create:
>
>   Call _dl_allocate_tls_init to allocate the dtv and TLS for the new thread, covering all currently loaded modules.  ("Currently loaded" means up to a generation count; the dtv and TLS for modules with a larger generation count are allocated lazily on first access.)
>
> Other APIs:
>
>   dlclose can write the relevant globals too, and TLS access and dl_iterate_phdr read them (I focus on dlopen and pthread_create first; similar concerns affect these APIs).
>   dl_main can write these globals too, but I think it is single threaded (early startup, no pthreads yet; can audit modules change that?).
>
> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
>
> Current concurrency design:

Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)

There are a couple of reasons why I'm suggesting that.  First, it helps people that are not familiar in detail with all of the requirements (such as myself :).  Second, and perhaps more immediately important from your perspective, it will likely guide us to what we need to do in the implementation.  In my experience, the tighter your understanding of a concurrency problem is on the abstract level, the better you can understand what the options are and how the solution should look.

For example, assume some operation needs to be atomic.  A natural reaction is to use a lock to guarantee mutual exclusion -- but do we really need a lock?  One question to ask then is whether other operations really depend on all the details of the operation, or whether it can be reduced to something smaller and its effects "replayed" at other threads.  If so, then one can often announce that the operation has happened in some other mechanism that summarizes the current state of the world.  Otherwise, is it actually important which thread executes the operation?
If it is, then we may actually need a lock, because the abstract essence of the concurrency problem is that thread B has to wait for thread A to perform the operation.

Third, this abstract description gives the code a vocabulary that can then be used to cleanly describe why the actual implementation is supposed to work.

In your v3 patch from January, you had something in that direction: a list of readers and writers.  I'd suggest to take this one step further.  For example, there are certain operations like pthread_create and the sub-operations these have to perform.  Then there's code that uses TLS, dlopen calls, etc.  The language specs and POSIX make requirements on these, which already restricts the valid executions that we need to support.  That's the start of the requirements.  Then there are requirements, or a wish-list of what we'd like to have, such as AS-Safety.

Then we can break these down into the abstract steps we need to perform, such as initializing TLS, or memory reclamation.  Combined with the constraints we have on these (e.g., uses of a piece of memory need to happen before the memory is released), this gives us a high-level plan of the concurrency problem.  We can then iterate and refine that until we have something that's implementable.

Doing that iterating on an abstract level will allow us to see possibilities to apply patterns such as a need for mutual exclusion, a need for a simple ordering of events, a need to do atomic snapshots, etc.  If you skip the abstract level, it becomes too complex quickly, and you run the risk of inventing a fully custom concurrent algorithm, instead of being able to apply known techniques (e.g., locking is one, but also only just one).  See some of my questions below for examples.

Note that by "abstract", I don't mean vague or so high-level that it lacks important detail -- but rather something that has been reduced to the core of what we need to reason about when treating this as a concurrency problem.
This is why I've talked about atomicity and ordering requirements.

> dlopen holds GL(dl_load_lock) while writing the relevant globals (using volatile loads and stores).  The lock serializes dlopen calls, but there is no synchronization with pthread_create (other than mmap and mprotect calls, which are likely synchronized but not part of the C11 memory model).
>
> pthread_create accesses the globals without locks and uses volatile loads: the dtv size and the valid entries in the slotinfo list are determined based on GL(dl_tls_max_dtv_idx), with the following asserted assumptions:
>   (a) the slotinfo list has at least GL(dl_tls_max_dtv_idx) entries set up,
>   (b) slotinfo entries have generation count <= GL(dl_tls_generation).
> These would be wrong even if all memory accesses were mo_seq_cst: slotinfo entries are added after the modid is assigned, and the generation counter is incremented even later.
>
> Current design bugs in glibc:
>
> (0) There are data races on the relevant globals, and there are asserts that are not guaranteed to hold (a and b above).  This is what I was trying to fix.
>
> (1) Any access to the link_map struct is invalid if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held, because a concurrent dlclose may free the map.  This affects:
>   pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
>   __tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)

You are describing the current issue in the implementation, which is perfectly fine.  What is the underlying synchronization problem we need to solve?  Do we just need to reclaim memory after all of its uses?  Locking may get this done, but other techniques such as hazard pointers might be better, in particular if we can reclaim memory lazily and don't want to use blocking synchronization.
> (2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:
>   pthread_create (_dl_allocate_tls_init, reads .gen and .map)
>   __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)

Can we version the modids?  Or do we generally need a mechanism for atomic snapshots?  What are the abstract operations and the requirements for them?

> (3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).

Are signal handlers allowed to access TLS?  I don't know what POSIX requires (or doesn't disallow), but C++, for example, is moving towards not allowing TLS (if I remember the draft wording correctly).

> (4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).

Do we need to be able to publish changes atomically?  If so, there are known techniques for that.

Could we perhaps deal with most or all of this if we had simple transactions?  If these aren't arbitrary operations but are limited in some way, and can be used for lots of the operations we need to synchronize, this may perhaps be the easiest way to implement all of this.  Once we know about the abstract operations, we can better assess whether that would be a helpful approach.
> (5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.

Can the size be extended on 32-bit targets?  If so, we can build a monotonically increasing, larger atomic counter on 32-bit targets too.

> (6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.

Do you mean dtors?

> Proposed fix(es):
>
> (0) can be fixed by adding atomics and removing the invalid asserts: it can be guaranteed that whatever GL(dl_tls_generation) is read in pthread_create, all slotinfo entries with that generation will be read too, and the read GL(dl_tls_max_dtv_idx) is no less than the modids of those entries.  Then walking the slotinfo list until the end and only considering entries up to the observed generation count would set up the dtv consistently, without races, assuming there is no concurrent dlclose.  (I think my previous patch did this.)
>
> On the other hand, (1) and (2) clearly point in the direction that pthread_create should take the lock, in which case thread creation performance may go down a bit, the influence of (6) increases (but that is broken anyway), and we still haven't done anything about __tls_get_addr.  (I don't see another reason to avoid taking the lock.)

Before focusing on taking the lock, I'd suggest to look at the bigger picture (i.e., the abstract operations and the atomicity and ordering requirements).  It is possible that a clean redesign is actually simpler than trying to fix a system that is already complex and which we don't understand 100%.

> Fixing (3) or (6) would need a significant redesign.
>
> Fixing (4) and (5) may be possible independently.
> I think these are the immediately available options:
>
>   - a lock-free pthread_create that is broken with concurrent dlclose,
>   - a pthread_create that locks and may deadlock when synchronized with a ctor,
>   - or the current brokenness (concurrent pthread_create and dlopen asserts).

Looking at the abstract operations will allow us to make such conclusions.  Also, if we document these, it will be much clearer to future developers why we did what we did (i.e., the reasoning behind options and decisions such as the ones you list above).

I hope this helps.
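One known technique behind the "publish changes atomically" question raised above is release/acquire publication: fully initialize the new state, then make it reachable with a single release store, so a reader either sees a fully initialized object or nothing.  A generic sketch of the pattern (not glibc code; names are made up):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Generic release/acquire publication sketch: the writer initializes
   the object completely and only then makes it reachable with a release
   store; an acquire load on the reader side guarantees the reader also
   sees the prior initialization.  */
struct slot { size_t gen; int value; };

static _Atomic (struct slot *) published;

static void publish (struct slot *s)
{
  atomic_store_explicit (&published, s, memory_order_release);
}

static struct slot *observe (void)
{
  return atomic_load_explicit (&published, memory_order_acquire);
}

int publish_demo (void)
{
  static struct slot s;
  s.gen = 7;
  s.value = 42;        /* fully initialize before publication */
  publish (&s);
  struct slot *r = observe ();
  return r != NULL && r->gen == 7 && r->value == 42;
}
```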
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-12-06 17:23 UTC
To: Torvald Riegel; +Cc: nd, GNU C Library

On 05/12/16 19:24, Torvald Riegel wrote:
> On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
>> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
>>
>> Current concurrency design:
>
> Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)

I know you wanted this the last time too, but I don't plan to rewrite the dynamic linker from scratch; I'm trying to make a minimal change that is acceptable without years of work.

The current design does not have a clean set of invariants and interface contracts, so it's hard to describe at a higher level of abstraction: there is always yet another special case that does not fit into a simple model, and nailing down all those special cases is a lot more effort than necessary.

Ignoring the implementation and only describing the user visible requirements is not enough: glibc has a lot of extensions and non-standard behaviours users may depend on (so whatever the current implementation does is the spec); following the standard would not end up with something compatible with the current code, and it would still take a lot of time.
That's why my approach was to make a small change and to reason about the changed semantics compared to the original behaviour, instead of figuring out all the atomicity and ordering requirements (which can be specified in many ways).

>> (2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:
>>   pthread_create (_dl_allocate_tls_init, reads .gen and .map)
>>   __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
>
> Can we version the modids?  Or do we generally need a mechanism for atomic snapshots?  What are the abstract operations and the requirements for them?

This is probably not as big an issue as I originally thought: reuse of .gen and .map is possible to detect, because gen can only increase (so it can be compared to a previous read of the global generation counter and the entry ignored if it is bigger; if it is smaller, then it can be guaranteed that .map and .gen are consistent).

>> (3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).
>
> Are signal handlers allowed to access TLS?  I don't know what POSIX requires (or doesn't disallow), but C++, for example, is moving towards not allowing TLS (if I remember the draft wording correctly).

If that's true, it is just another brokenness in C++; I would not worry about it, since dynamic loading of libraries and TLS is unfixably broken in C++ anyway.
Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).

>> (4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).
>
> Do we need to be able to publish changes atomically?  If so, there are known techniques for that.

That may not be necessary: I think this can be solved without being atomic if some resource leaks are allowed (an argument can be made that that's fine; it's one of those special case decisions).

> Could we perhaps deal with most or all of this if we had simple transactions?  If these aren't arbitrary operations but are limited in some way, and can be used for lots of the operations we need to synchronize, this may perhaps be the easiest way to implement all of this.  Once we know about the abstract operations, we can better assess whether that would be a helpful approach.

You can think of dlopen (up to the ctors) as a transaction, but there are mallocs, mmaps and even debug messages that are not transactional operations, and some state does not have to be rolled back (e.g. the generation count increment), so I don't think this helps.

>> (5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.
>
> Can the size be extended on 32-bit targets?
> If so, we can build a monotonically increasing, larger atomic counter on 32-bit targets too.

That would need some changes: the generation count up to which a dtv is set up in a thread is stored in dtv[0], which is 32-bit on a 32-bit system.  Making the dtv 64-bit will break anything that tries to look at the dtv (asan?); hacking around it by using dtv[-1] is possible but not elegant (the dtv length is already stored there, it would just have to be moved to dtv[-2]).

>> (6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.
>
> Do you mean dtors?

Sorry, correctly: dlopen holds the lock while running ctors, and dlclose holds the lock while running dtors.

>> Proposed fix(es):
>
> Before focusing on taking the lock, I'd suggest to look at the bigger picture (i.e., the abstract operations and the atomicity and ordering requirements).  It is possible that a clean redesign is actually simpler than trying to fix a system that is already complex and which we don't understand 100%.

I think that's optimistic, but I can write up what I think the requirements are.
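The "monotonically increasing, larger atomic counter" idea for 32-bit targets could be built seqlock-style from two 32-bit words plus a sequence word.  This is only a sketch of the technique, assuming writers stay serialized by GL(dl_load_lock) while readers are lock-free; it does not address the dtv[0] layout problem discussed above:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch: a 64-bit monotonic generation on a 32-bit target, built from
   two 32-bit atomics guarded by a seqlock-style sequence counter.  The
   writer is assumed to be serialized by a lock (GL(dl_load_lock) in the
   real code); readers retry if they may have observed a torn value.  */
static _Atomic uint32_t seq;       /* odd while a write is in progress */
static _Atomic uint32_t gen_hi, gen_lo;

void gen64_bump (void)             /* writer, lock held */
{
  uint32_t s = atomic_load_explicit (&seq, memory_order_relaxed);
  atomic_store_explicit (&seq, s + 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_release);
  uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
  if (lo == UINT32_MAX)
    {                              /* carry into the high word */
      atomic_fetch_add_explicit (&gen_hi, 1, memory_order_relaxed);
      atomic_store_explicit (&gen_lo, 0, memory_order_relaxed);
    }
  else
    atomic_store_explicit (&gen_lo, lo + 1, memory_order_relaxed);
  atomic_store_explicit (&seq, s + 2, memory_order_release);
}

uint64_t gen64_read (void)         /* reader, no lock */
{
  for (;;)
    {
      uint32_t s1 = atomic_load_explicit (&seq, memory_order_acquire);
      if (s1 & 1)
        continue;                  /* write in progress, retry */
      uint32_t hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      if (atomic_load_explicit (&seq, memory_order_relaxed) == s1)
        return ((uint64_t) hi << 32) | lo;
    }
}

uint64_t gen64_demo (void)
{
  for (int i = 0; i < 5; i++)
    gen64_bump ();
  return gen64_read ();
}
```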
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Joseph Myers @ 2016-12-06 17:48 UTC
To: Szabolcs Nagy; +Cc: Torvald Riegel, nd, GNU C Library

On Tue, 6 Dec 2016, Szabolcs Nagy wrote:

> Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).

errno, the powerpc soft-float floating-point environment and the libdfp soft-dfp decimal rounding mode are all initial-exec TLS (and other cases of floating-point environment are hardware registers rather than TLS variables).  Does being initial-exec help?

-- 
Joseph S. Myers
joseph@codesourcery.com
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-12-06 18:31 UTC
To: Joseph Myers; +Cc: Szabolcs Nagy, Torvald Riegel, nd, GNU C Library

* Joseph Myers <joseph@codesourcery.com> [2016-12-06 17:47:45 +0000]:
> On Tue, 6 Dec 2016, Szabolcs Nagy wrote:
>> Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).
>
> errno, the powerpc soft-float floating-point environment and the libdfp soft-dfp decimal rounding mode are all initial-exec TLS (and other cases of floating-point environment are hardware registers rather than TLS variables).  Does being initial-exec help?

Yes, that works: only TLS access through __tls_get_addr may be non-AS-safe now.  (But if a language standard does not guarantee that TLS access is AS-safe, then implementation behaviour does not help portable code.)
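The initial-exec model discussed above can also be requested explicitly.  A minimal sketch (the variable name is made up for illustration) of why such an access stays AS-safe: it compiles to a fixed offset from the thread pointer and never goes through __tls_get_addr, so no allocation or locking can happen at access time:

```c
/* Hypothetical soft-float status word forced into the initial-exec TLS
   model, the same model errno uses.  The access below is a load/store
   at a fixed offset from the thread pointer: no dtv walk, no malloc,
   no __tls_get_addr call, hence usable from a signal handler.  */
static __thread int soft_fenv_flags __attribute__ ((tls_model ("initial-exec")));

int raise_flag (int f)
{
  soft_fenv_flags |= f;
  return soft_fenv_flags;
}
```

The trade-off is that initial-exec TLS must fit in the static TLS block, so it works for libc-internal state like errno but not for arbitrary dlopened modules.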
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel @ 2016-12-07 14:08 UTC
To: Szabolcs Nagy; +Cc: nd, GNU C Library

On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote:
> On 05/12/16 19:24, Torvald Riegel wrote:
> > On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> >> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
> >>
> >> Current concurrency design:
> >
> > Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)
>
> I know you wanted this the last time too, but I don't plan to rewrite the dynamic linker from scratch,

That is not what I'm asking for, or suggesting.  What I do want is that we deal with concurrency in a thorough manner.  What I suggested is how this can be done most easily, in my experience -- while still being thorough.

> I'm trying to make a minimal change that is acceptable without years of work.

Even for a minimal change (it looked more like minimal changes), you need to be able to show that it is not breaking something else.  Even if we're fine with accepting that we may not be able to prove that and still want to apply the patch, we need to be thorough about the assumptions we make about how the code we're changing works.
> The current design does not have a clean set of invariants and interface contracts, so it's hard to describe at a higher level of abstraction: there is always yet another special case that does not fit into a simple model, and nailing down all those special cases is a lot more effort than necessary.

Well, if you change something in a way that is not a mechanical change (e.g., not a change that allows only the same executions), you must have some understanding of the system, right?  This understanding must be sufficient to explain why your change is correct.  That's the core of what I'm asking for.  I'm not asking you to describe every possible special case -- but instead that you thoroughly document how the system you are trying to change looks.

So, for example, if you assume that two operations don't run concurrently, then document that; if we find out later that this is in fact not the case because of some special case, at least we have a proper way to cross-check this against the assumptions you made when proposing your change.  This incremental approach is what you'd do if you tried to describe the full system too, so I don't mind if we only model and describe parts of the system, as long as the way we do that is thorough.

The big problem with not being thorough and not documenting the abstract-level core of your understanding of the system is that all future changes then have to cross-check against every tiny detail again.  This does not scale, and it's error-prone.  You have observed that problem already, I guess, in that I suppose it would have been easier for you to reason about your fix if you could have cross-checked it against existing thorough documentation and rules.  We have to be thorough to be able to deal with concurrency in a way that's somewhat easier on our brains.  Delaying that does not help glibc.
I know it can be daunting to start it, because there are so many open questions (I've been working on fixing the nscd code, so trust me...), but it will make things easier for us in the long run. > ignoring the implementation and only describing the user visible > requirements is not enough: glibc has a lot of extensions and > non-standard behaviours users may depend on (so whatever the > current implementation does is the spec), following the standard > would not end up with something compatible with the current code > and it would still take a lot of time. This is not about standards for me. They are just one source of constraints or requirements. What I want to see is that we make it clear which requirements we think there are, irrespective of where they come from; likewise, what a correct program is required not to do. This is what we start with; once that's done, checking for compliance against some standard can still be done. > that's why my approach was to make a small change and try to > reason about the changed semantics compared to the original > behaviour instead of figuring out all the atomicity and ordering > requirements (which can be specified in many ways). This is not a bad approach, so don't get me wrong. There can be changes that cannot break a program because they just constrain execution to a subset of executions that would have been possible before too; for example, changing a finite number of steps into one atomic operation (though this needs to be nonblocking or you can interfere with forward progress (think AS-Safety)). However, this is often not sufficient. First, arguing why the change is necessary is important because it adds to what future changes have to pay attention to (unless the change does not affect correctness). Second, look at the first of your goals, for example: Wanting to avoid a data race is good; but what memory orders do you need?
Do you want to make them SC because you don't know what is required (and document the uncertainty!)? If not, you'll have to reason based on assumptions about the happens-before relations the code needs to enforce, and the kind of "incoming" executions it has to face. This then ultimately boils down to having some mental model of the abstract-level concurrency in the code, or you can't really reason about what is truly required. > >> (2) dlopen can reuse modids of dlclosed modules (and dlclose can decrement > >> GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be > >> inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) are > >> held (e.g. concurrent dlclose and dlopen between reading the .gen and > >> .map members of the slotinfo entry), this affects: > >> pthread_create (_dl_allocate_tls_init, reads .gen and .map) > >> __tls_get_addr (_dl_update_slotinfo, reads .gen and .map) > > > > Can we version the modids? Or do we generally need a mechanism for > > atomic snapshots? What are the abstract operations and the requirements > > for them? > > this is probably not such a big issue as i originally thought, > reuse of .gen and .map is possible to detect, because gen can > only increase (so it can be compared to a previous read of the > global gen counter and ignored if bigger; if smaller, then it > can be guaranteed that .map and .gen are consistent). Maybe that's a good piece to start with. Let me try to give a brief outline of how you could describe that more thoroughly. You seem to need an atomic snapshot spanning more than one read of a slotinfo entry (if that's not the case, just treat the following as an example for illustrative purposes). This will be required for some higher-level operation, which you refer to. That's the highest level of abstraction in this example.
You know that modifications are ordered (through some mechanism in the program that you mention), and you know that this order is represented by the generation number (so, if modification A happens before modification B, then A's generation will be less than B's generation). You try to solve the atomic snapshot problem by ignoring everything with a generation less than a number chosen by you. That's kind of the middle level of abstraction in this example; that is, how you intend to solve the synchronization problem. Finally, you need to implement this with atomics. The middle level tells you which happens-before relationships you need, and thus what kind of memory orders you need. For example, if a slotinfo entry is published by writing a generation number, then you write this number with release MO, and read it with acquire MO in the consumers; they will synchronize, and they will ensure that if you read a certain number you also see the matching content of the entry. Because you've described the middle abstraction layer, you can argue why you need those MOs by just going up one level higher, and referring to your code comments. This is easier than trying to argue directly why this is sufficient to solve the problem at the highest level of abstraction (ie, atomic snapshot). So you build up these different abstraction layers, which helps make the relationship between layer N and layer N-1 tractable for one's brain. You're thorough about statements at each layer, which ensures that you don't have to consider anything but the layer at hand (eg, because there's no ambiguity). You document the models and reasoning you build at each layer so it's easier for yourself -- and all other and future developers -- to remember and refer to. Also, when combined with being thorough, this makes stuff easier to understand because what you documented for a particular layer is law; you don't have to second-guess it, which frees up resources in your brain.
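To make the lowest layer concrete, such a publication scheme could be sketched with C11 atomics as below. All names here are invented for illustration -- this is not the actual dl-tls.c code -- and the sketch assumes producers are serialized with each other (eg, by GL(dl_load_lock)):

```c
/* Illustrative sketch only: hypothetical names, not glibc's real
   slotinfo structures.  A producer (think dlopen) publishes an entry
   by storing its generation last, with release MO; a consumer (think
   pthread_create or __tls_get_addr) loads the generation first, with
   acquire MO, and is then guaranteed to see the matching contents.  */
#include <stdatomic.h>
#include <stddef.h>

struct slotinfo
{
  atomic_size_t gen;  /* 0 means "not yet published".  */
  void *map;          /* stands in for the module's link_map.  */
};

/* Producer side; serialized with other producers in this sketch.  */
static void
publish_entry (struct slotinfo *e, void *map, size_t new_gen)
{
  e->map = map;
  /* The release store orders the ->map write before the generation
     becomes visible to acquire loads.  */
  atomic_store_explicit (&e->gen, new_gen, memory_order_release);
}

/* Consumer side: return the map if the entry was published with a
   generation <= gen_limit, otherwise NULL (unset or too new).  */
static void *
read_entry (struct slotinfo *e, size_t gen_limit)
{
  size_t g = atomic_load_explicit (&e->gen, memory_order_acquire);
  if (g == 0 || g > gen_limit)
    return NULL;
  /* The acquire load synchronizes with the release store above, so
     ->map here is the value that was written before gen.  */
  return e->map;
}
```

The point of the middle layer is exactly this pairing: the consumer's acquire load synchronizes with the producer's release store, which is what lets "reading gen also validates map" be justified by referring one level up.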
In my experience, thinking in terms of atomicity and ordering at these different layers is what works best, plus common synchronization patterns and techniques (eg, atomic snapshots, mutual exclusion, ...) and problems (eg, ABA). Atomicity is important because it means there are fewer interleavings to think about; ordering is essential because the ordering of operations is what defines how state evolves. This is what I do too when I build concurrent code like the new condvar or rwlock. And if you do all that thoroughly, you will see, for example, that my brief outline of a description of course has holes in it (eg, which generation number is the threshold? can slotinfo entries be reused so that you'd perhaps need to restart an atomic snapshot if the version number changes? ...). The thorough process I described is aimed at revealing such holes in the reasoning we do about concurrent code. I hope that this clarifies why I'm always pushing for being thorough in how we reason about concurrent code, and how we communicate such reasoning to our fellow developers. > >> (3) tls access must be as-safe, but currently tls is allocated at first > >> access if the containing module was not loaded at the time the accessing > >> thread was created. This means __tls_get_addr may malloc or lock > >> GL(dl_load_lock) or modify the thread local dtv. This can cause deadlocks, > >> oom crashes or memory corruption if it is in a signal handler (malloc in > >> signal handler is also problematic for malloc interposition). > > > > Are signal handlers allowed to access TLS? I don't know what POSIX > > requires (or doesn't disallow), but C++, for example, is moving to not > > allowing TLS (if I remember the draft wording correctly). > > if that's true it is just another brokenness in c++, No, it's not broken, it's simply taking a conservative approach: If signal handlers are not allowed to touch TLS (or there's UB), you don't require anything special from implementations.
Whether that means that you can't do all the funny things in a signal handler you might want to do is another question. > > Can we perhaps deal with most or all of this if we had simple > > transactions? If these aren't arbitrary operations but in some way > > limited, and can be used for lots of the operations we need to > > synchronize, this may perhaps be the easiest way to implement all of > > this. Once we know about the abstract operations, we can better assess > > whether that would be a helpful approach. > > you can think of dlopen (up to ctors) as a transaction.. > but there are mallocs, mmaps and even debug messages that > are not transactional operations malloc can usually be rolled back fairly easily because doing an unnecessary malloc is not a problem. The same might apply for mmap (but not for munmap). Debug messages can be buffered. > and some state does not > have to be rolled back (e.g. generation count increment) > so i don't think this helps. It's fine if we don't have to roll back something; that's the easy case :) (I'm assuming that needlessly incrementing the generation count is harmless.) What I was thinking about was a custom transaction mechanism based on some of the simpler STM algorithms. We can mark transactional operations as such, and the transaction mechanism becomes much easier if we, for example, know all the operations of a transaction at the start of the transaction. > >> (5) GL(dl_tls_generation) is not checked for overflow consistently and > >> slotinfo entries can get assigned an overflowed generation count, its type > >> is size_t and both dlopen and dlclose increment it, so overflow may be a > >> concern on 32bit targets. > > > > Can the size be extended on 32b targets? If so, we can build a > > monotonically increasing, larger atomic counter on 32b targets too. > > that would need some changes > > the generation count up to which a dtv is set up in a > thread is stored in dtv[0] which is 32bit on a 32bit > system.
The other 32b half could be somewhere else. It's more efficient if it's in the same cacheline of course, but it's not required to be adjacent. > >> Proposed fix(es): > > > > Before focusing on taking the lock, I'd suggest to look at the bigger > > picture (ie, the abstract operations and atomicity and ordering > > requirements). > > It can be possible that a clean redesign is actually simpler than trying > > to fix a system that is already complex and which we don't understand > > 100%. > > i think that's optimistic but i can write up what i think > the requirements are. Thanks. Better documentation of what the requirements are would help even if a rewrite is not what we decide to do. ^ permalink raw reply [flat|nested] 8+ messages in thread
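One possible shape for that split counter (a hypothetical sketch, not glibc code): two 32bit atomics that need not be adjacent, a single writer serialized by the dlopen lock, and a seqlock-style retry on the reader side so that a reader never observes a torn or non-monotonic 64bit value:

```c
/* Hypothetical sketch, not from glibc: a 64-bit monotonic generation
   counter assembled from 32-bit atomics, for 32-bit targets without
   native 64-bit atomic load/store.  Single writer (assumed serialized
   by the dlopen lock); readers retry while an update is in flight.  */
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_least32_t gen_seq;         /* odd => update in flight */
static atomic_uint_least32_t gen_lo, gen_hi;  /* the two halves */

static void
gen_bump (void)
{
  uint32_t s = atomic_load_explicit (&gen_seq, memory_order_relaxed);
  atomic_store_explicit (&gen_seq, s + 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_release);
  uint32_t l = atomic_load_explicit (&gen_lo, memory_order_relaxed) + 1;
  atomic_store_explicit (&gen_lo, l, memory_order_relaxed);
  if (l == 0)  /* carry into the high half on wraparound */
    atomic_fetch_add_explicit (&gen_hi, 1, memory_order_relaxed);
  atomic_store_explicit (&gen_seq, s + 2, memory_order_release);
}

static uint64_t
gen_read (void)
{
  uint32_t s1, s2, l, h;
  do
    {
      s1 = atomic_load_explicit (&gen_seq, memory_order_acquire);
      l = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      h = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      s2 = atomic_load_explicit (&gen_seq, memory_order_relaxed);
    }
  while ((s1 & 1) != 0 || s1 != s2);
  return ((uint64_t) h << 32) | l;
}
```

For a per-thread counter like dtv[0] that only its own thread reads, something simpler might suffice; the retry loop is only needed for a counter that other threads read concurrently with the writer.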
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races 2016-12-07 14:08 ` Torvald Riegel @ 2016-12-21 15:27 ` Szabolcs Nagy 2017-01-05 17:53 ` Szabolcs Nagy 0 siblings, 1 reply; 8+ messages in thread From: Szabolcs Nagy @ 2016-12-21 15:27 UTC (permalink / raw) To: Torvald Riegel; +Cc: nd, GNU C Library On 07/12/16 14:07, Torvald Riegel wrote: > On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote: >> On 05/12/16 19:24, Torvald Riegel wrote: >>> On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote: >>>> (5) GL(dl_tls_generation) is not checked for overflow consistently and >>>> slotinfo entries can get assigned an overflowed generation count, its type >>>> is size_t and both dlopen and dlclose increment it, so overflow may be a >>>> concern on 32bit targets. >>> >>> Can the size be extended on 32b targets? If so, we can build a >>> monotonically increasing, larger atomic counter on 32b targets too. >> >> the generation count up to which a dtv is set up in a >> thread is stored in dtv[0] which is 32bit on a 32bit >> system. > > The other 32b half could be somewhere else. It's more efficient if it's > in the same cacheline of course, but it's not required to be adjacent. i was wrong about this: a dtv entry already holds 2 pointers, so it is 64bit even on a 32bit target and a 64bit counter fits. >>> Before focusing on taking the lock, I'd suggest to look at the bigger >>> picture (ie, the abstract operations and atomicity and ordering >>> requirements). >>> It can be possible that a clean redesign is actually simpler than trying >>> to fix a system that is already complex and which we don't understand >>> 100%. >> >> i think that's optimistic but i can write up what i think >> the requirements are. > > Thanks. Better documentation of what the requirements are would help > even if a rewrite is not what we decide to do.
i tried to collect the requirements, the current implementation is far from them: as-safety of tls access and ctor/dtor handling without locks are difficult to fix. if as-safe tls access is not required then the concurrency fixes are much simpler.

simplified model:

m: module.
m.i: modid.
t: thread.
t.dtv: dtv pointer.
t.dtv[i]: dtv entry i (tls for t,m where m.i==i).

constraints:

tls of a currently loaded module m in thread t is at t.dtv[m.i].

m.i is unique during the lifetime of m: it is assigned before ctors of m are entered and may be reused after dtors of m return.

tls access to m in thread t is only valid during the lifetime of t and m (after ctors of m start and before dtors of m end).

during the lifetime of a thread its dtv may be updated:
t.dtv may need to be resized (if an m is loaded with a larger m.i).
t.dtv[i] may need to be freed (if an m is dlclosed).
t.dtv[i] may need to be allocated (if an m is dlopened).

if dtv updates are not immediate for all threads at dlopen and dlclose time, then the threads need a way to tell if t.dtv[i] is out-of-date in case modid i is reused by the time the dtv update happens. (this can be done by counting dlopen/dlclose calls, remembering the counter of the last dtv update, and globally tracking the last counter for each modid i: if global_counter[i] > t.dtv_counter then t.dtv[i] is out-of-date. such a counter should be 64bit.)

dtv update consists of three kinds of operations:
1) allocate dtv and dtv entries (malloc)
2) unset out-of-date entries (free)
3) resize dtv, set dtv entries (memcpy and ptr update)

1) alloc:
pthread_create and dlopen need to be synchronized such that either the sync in dlopen or the sync in pthread_create happens before the other (for all dlopen/pthread_create pairs); the one that happens later should do the allocation. (this is needed because tls access is valid right after dlopen and pthread_create, but must be as-safe, so it cannot easily allocate.)
2) free:
t.dtv[m.i] should be freed eventually after dlclose of m or after t exits. this is difficult because t.dtv[m.i] needs to be updated if m.i is reused and the tls of the new module is accessed, but tls access cannot do the free (not as-safe). so the options are
- dlclose of m frees t.dtv[m.i] for all t (non-trivial).
- allocated dtv entry pointers are remembered somewhere and garbage collected in some way (such that overwriting t.dtv[i] does not leak memory).

3) update dtv pointer and dtv entry:
either dlopen stops all threads and takes care of dtv updates, or it has to be done lazily during tls access, in which case signals may need to be blocked.

an m with static tls is a special case:
1) if m is already loaded when t is created, then pthread_create needs to do setup (copy the tls init image) that requires accessing m, so either pthread_create needs to sync with dlclose or it is invalid to unload an m with static tls; in the latter case pthread_create should be able to walk the currently loaded modules and tell if they have static tls without accessing the module structures that are freed during dlclose.
2) if m is loaded after t is created, then dlopen should do the setup for the current thread, but i think it has to do the setup for other threads as well (?). (in principle static tls cannot be dynamically loaded/unloaded, but i'm not sure what the requirements are if glibc internal libraries are dlopened.)

indirectly related to tls:

ctor/dtor handling should be callback safe: dlopen and dlclose must not hold internal locks, otherwise correct ctor/dtor code may deadlock.

dlopen and pthread_create should roll back state changes on allocation failure and report the errors (some state changes may be harmless; this needs further analysis). ^ permalink raw reply [flat|nested] 8+ messages in thread
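The out-of-date check in the model above (global_counter[i] > t.dtv_counter) could be sketched as follows. All names are invented for illustration, and the fixed-size table stands in for whatever growable structure a real implementation would use:

```c
/* Sketch of the out-of-date detection described in the model; every
   name here is hypothetical.  Each modid slot records the 64-bit
   dlopen/dlclose counter at which it last changed; a thread records
   the counter up to which its dtv is current.  */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_MODID 16

/* Last change time per modid, written by dlopen/dlclose.  */
static _Atomic uint64_t global_counter[MAX_MODID];

struct thread
{
  uint64_t dtv_counter;   /* counter as of the last dtv update */
  void *dtv[MAX_MODID];
};

/* dlopen/dlclose side: record that modid i changed at time 'now'.  */
static void
mod_changed (size_t i, uint64_t now)
{
  atomic_store_explicit (&global_counter[i], now, memory_order_release);
}

/* tls access side: is t->dtv[i] usable, or does it belong to a module
   that was dlclosed (and the modid possibly reused) since the
   thread's last dtv update?  */
static int
dtv_entry_stale (const struct thread *t, size_t i)
{
  uint64_t c = atomic_load_explicit (&global_counter[i],
                                     memory_order_acquire);
  return c > t->dtv_counter;
}
```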
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races 2016-12-21 15:27 ` Szabolcs Nagy @ 2017-01-05 17:53 ` Szabolcs Nagy 0 siblings, 0 replies; 8+ messages in thread From: Szabolcs Nagy @ 2017-01-05 17:53 UTC (permalink / raw) To: Torvald Riegel; +Cc: nd, GNU C Library On 21/12/16 15:27, Szabolcs Nagy wrote: > i tried to collect the requirements, the current implementation > is far from it: as-safety of tls access and ctor/dtor without > locks are difficult to fix. if as-safe tls access is not required > then the concurrency fixes are much simpler. the following steps might work for incremental bug fixing:

1) fix the gen counter to be 64bit on 32bit targets and remove all overflow checks and related aborts. (requires 64bit atomic load and store.)
2) make sure the slotinfo update does not access the link_map of a module that may be dlclosed concurrently. (the tricky bit is handling static tls initialization in pthread_create: it needs either syncing with dlclose or remembering static tls info in the global slotinfo list; i'd try to do the latter.)
3) fix pthread_create so that it initializes the dtv with the right size for a particular generation and iterates over slotinfo entries of modules with static tls correctly to initialize tls and the related dtv entries.

this would fix BZ 19329 and the following issues would remain:
- tls access allocates and frees dtv entries, may resize the dtv and may lock the load_lock.
- dlopen and pthread_create abort if resize_dtv fails, and dlopen may leave inconsistent state behind if another allocation fails.
- dtors and ctors are run while the load_lock is held.
these seem hard to fix to me. > simplified model: > > m: module. > m.i: modid. > t: thread. > t.dtv: dtv pointer. > t.dtv[i]: dtv entry i (tls for t,m where m.i==i). > > constraints: > > tls of a currently loaded module m in thread t is at t.dtv[m.i].
> > m.i is unique during the lifetime of m: it is assigned before > ctors of m are entered and may be reused after dtors of m return. > > tls access to m in thread t is only valid during the lifetime > of t and m (after ctors of m start and before dtors of m end). > > during the lifetime of a thread its dtv may be updated: > t.dtv may need to be resized (if an m is loaded with a larger m.i). > t.dtv[i] may need to be freed (if an m is dlclosed). > t.dtv[i] may need to be allocated (if an m is dlopened). > > if dtv updates are not immediate for all threads at dlopen and > dlclose time, then the threads need a way to tell if t.dtv[i] > is out-of-date in case modid i is reused by the time the dtv > update happens. (this can be done by counting dlopen/dlclose > calls and remembering the counter of the last dtv update and > globally tracking the last counter for each modid i: if > global_counter[i] > t.dtv_counter then t.dtv[i] is out-of-date. > such a counter should be 64bit). > > dtv update consists of three kinds of operations: > 1) allocate dtv and dtv entries (malloc) > 2) unset out-of-date entries (free) > 3) resize dtv, set dtv entries (memcpy and ptr update) > > 1) alloc: > pthread_create and dlopen need to be synchronized such that > either the sync in dlopen or the sync in pthread_create happens before > the other (for all dlopen/pthread_create pairs); the one that > happens later should do the allocation. > (this is needed because right after dlopen and pthread_create > tls access is valid, but must be as-safe so it cannot easily > allocate). > > 2) free: > t.dtv[m.i] should be freed eventually after dlclose of m or > after t exits. this is difficult because t.dtv[m.i] needs to > be updated if m.i is reused and the tls of the new module is > accessed, but tls access cannot do the free (not as-safe). > so the options are > - dlclose of m frees t.dtv[m.i] for all t (non-trivial).
> - allocated dtv entry pointers are remembered somewhere > and garbage collected in some way (such that overwriting > t.dtv[i] does not leak memory). > > 3) update dtv pointer and dtv entry: > either dlopen stops all threads and takes care of dtv updates > or it has to be done during tls access lazily in which case > signals may need to be blocked. > > an m with static tls is a special case: > 1) if m is already loaded when t is created, then pthread_create > needs to do setups (copy tls init image) that requires accessing > m, so either pthread_create needs to sync with dlclose or it > is invalid to unload an m with static tls, in the latter case > pthread_create should be able to walk the currently loaded > modules and tell if they have static tls without accessing the > module structures that are freed during dlclose. > 2) if m is loaded after t is created, then dlopen should do the > setup for the current thread, but i think it has to do the setup > for other threads as well (?). (in principle static tls cannot > be dynamically loaded/unloaded but i'm not sure what the > requirements are if glibc internal libraries are dlopened.) > > indirectly related to tls: > > ctor,dtor handling should be callback safe: dlopen and dlclose > must not hold internal locks otherwise correct ctor/dtor code may > deadlock. > > dlopen and pthread_create should roll back state changes on allocation > failure and report the errors (some state changes may be harmless; > this needs further analysis). > ^ permalink raw reply [flat|nested] 8+ messages in thread
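For step 3 of the incremental fixes proposed above (initializing the dtv for a particular generation), the pthread_create side could be sketched like this -- invented names, not the actual _dl_allocate_tls_init. The key point is that entries newer than the generation snapshot are skipped and left for lazy allocation, rather than asserted on:

```c
/* Hypothetical sketch of dtv initialization against a generation
   snapshot; not glibc code.  */
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 8

struct slot
{
  atomic_size_t gen;   /* 0 = unused; otherwise generation of entry */
  void *map;           /* stands in for the module's tls setup info */
};

static struct slot slots[NSLOTS];
static atomic_size_t dl_tls_generation;

/* Initialize a new thread's dtv up to a snapshot of the generation
   counter.  Entries newer than the snapshot are deliberately left
   unset (deferred to first access) instead of asserting, which is the
   assertion-(b) failure mode in current dl-tls.c.  Returns the
   generation the dtv is now current up to.  */
static size_t
dtv_init (void **dtv, size_t ndtv)
{
  size_t snap = atomic_load_explicit (&dl_tls_generation,
                                      memory_order_acquire);
  for (size_t i = 0; i < NSLOTS && i < ndtv; i++)
    {
      size_t g = atomic_load_explicit (&slots[i].gen,
                                       memory_order_acquire);
      if (g == 0 || g > snap)
        dtv[i] = NULL;          /* unused, or too new: allocate lazily */
      else
        dtv[i] = slots[i].map;  /* stands in for tls setup for module */
    }
  return snap;
}
```

This is precisely the situation behind the assertion failures quoted at the start of the thread: a slotinfo entry added by a concurrent dlopen is simply treated as not-yet-loaded for the new thread, instead of tripping an assert.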
Thread overview: 8+ messages -- 2016-11-28 19:03 [RFC][BZ #19329] high level description of pthread_create vs dlopen races Szabolcs Nagy 2016-12-05 19:24 ` Torvald Riegel 2016-12-06 17:23 ` Szabolcs Nagy 2016-12-06 17:48 ` Joseph Myers 2016-12-06 18:31 ` Szabolcs Nagy 2016-12-07 14:08 ` Torvald Riegel 2016-12-21 15:27 ` Szabolcs Nagy 2017-01-05 17:53 ` Szabolcs Nagy