Message-ID: <1481119669.14990.511.camel@redhat.com>
Subject: Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel
To: Szabolcs Nagy
Cc: nd@arm.com, GNU C Library
Date: Wed, 07 Dec 2016 14:08:00 -0000
In-Reply-To: <5846F3EB.2060003@arm.com>
References: <583C7F76.3000601@arm.com> <1480965875.14990.199.camel@redhat.com> <5846F3EB.2060003@arm.com>

On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote:
> On 05/12/16 19:24, Torvald Riegel wrote:
> > On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> >> In short: relevant globals (dl_tls_max_dtv_idx, dl_tls_generation,
> >> dl_tls_dtv_slotinfo_list entries) are written by dlopen and
> >> pthread_create reads them when initializing the dtv of a new thread.
> >>
> >> Current concurrency design:
> >
> > Generally, I'd suggest starting one level of abstraction higher: What
> > are the abstract operations that we need to perform?  Which of these
> > need to appear to be atomic wrt. which other operations?  Which
> > ordering relations need to hold in every execution?
> > (In contrast, you explain things below closer to the implementation
> > level (eg, this lock is acquired for these operations, etc.).)
>
> i know you wanted this the last time too, but i don't plan to
> rewrite the dynamic linker from scratch,

That is not what I'm asking for, or suggesting.  What I do want is for
us to deal with concurrency in a thorough manner.  What I suggested is
how this can, in my experience, be done most easily -- while still
being thorough.

> i'm trying to make a
> minimal change that is acceptable without years of work.

Even for a minimal change (it looked more like minimal changes), you
need to be able to show that it does not break something else.  Even if
we're fine with accepting that we may not be able to prove that and
still want to apply the patch, we need to be thorough about the
assumptions we make about how the code we're changing works.

> the current design does not have a clean set of invariants and
> interface contracts, so it's hard to describe at a higher level of
> abstraction: there is always yet another special case that does
> not fit into a simple model, and nailing down all those special
> cases is a lot more effort than necessary.

Well, if you change something in a way that is not purely mechanical
(ie, a change that allows for exactly the same executions as before),
you must have some understanding of the system, right?  This
understanding must be sufficient to explain why your change is correct.
That's the core of what I'm asking for.  I'm not asking you to describe
every possible special case -- but rather to thoroughly document what
the system you are trying to change looks like.
So, for example, if you assume that two operations don't run
concurrently, then document that; if we find out later that this is in
fact not the case because of some special case, at least we have a
proper way to cross-check it against the assumptions you made when
proposing your change.  This incremental approach is what you'd use if
you tried to describe the full system too, so I don't mind if we only
model and describe parts of the system, as long as the way we do that
is thorough.

The big problem with not being thorough, and not documenting the
abstract-level core of your understanding of the system, is that all
future changes then have to cross-check against every tiny detail
again.  That does not scale, and it's error-prone.  I guess you have
observed this problem already, in that I suppose it would have been
easier for you to reason about your fix if you could have cross-checked
it against existing thorough documentation and rules.

We have to be thorough to be able to deal with concurrency in a way
that's somewhat easier on our brains.  Delaying that does not help
glibc.  I know it can be daunting to start, because there are so many
open questions (I've been working on fixing the nscd code, so trust
me...), but it will make things easier for us in the long run.

> ignoring the implementation and only describing the user visible
> requirements is not enough: glibc has a lot of extensions and
> non-standard behaviours users may depend on (so whatever the
> current implementation does is the spec); following the standard
> would not end up with something compatible with the current code,
> and it would still take a lot of time.

This is not about standards for me.  They are just one source of
constraints or requirements.  What I want to see is that we make it
clear which requirements we think there are, irrespective of where they
come from; likewise, what a correct program is required not to do.
This is what we start with; once that's done, checking for compliance
against some standard can still be done.

> that's why my approach was to make a small change and to try to
> reason about the changed semantics compared to the original
> behaviour instead of figuring out all the atomicity and ordering
> requirements (which can be specified in many ways).

This is not a bad approach, so don't get me wrong.  There can be
changes that cannot break a program because they just constrain
execution to a subset of the executions that would have been possible
before; for example, changing a finite number of steps into one atomic
operation (though this needs to be nonblocking, or you can interfere
with forward progress (think AS-Safety)).

However, this is often not sufficient.  First, arguing why the change
is necessary is important because it adds to what future changes have
to pay attention to (unless the change does not affect correctness).
Second, look at the first of your goals, for example: wanting to avoid
a data race is good; but which memory orders do you need?  Do you want
to make them SC because you don't know what is required (and document
the uncertainty!)?  If not, you'll have to reason based on assumptions
about the happens-before relations the code needs to enforce, and the
kind of "incoming" executions it has to face.  This ultimately boils
down to having some mental model of the abstract-level concurrency in
the code, or you can't really reason about what is truly required.
> >> (2) dlopen can reuse modids of dlclosed modules (and dlclose can
> >> decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo
> >> entry may be inconsistent if neither GL(dl_load_lock) nor
> >> GL(dl_load_write_lock) is held (e.g. concurrent dlclose and dlopen
> >> between reading the .gen and .map members of the slotinfo entry);
> >> this affects:
> >> pthread_create (_dl_allocate_tls_init, reads .gen and .map)
> >> __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
> >
> > Can we version the modids?  Or do we generally need a mechanism for
> > atomic snapshots?  What are the abstract operations and the
> > requirements for them?
>
> this is probably not such a big issue as i originally thought:
> reuse of .gen and .map is possible to detect, because gen can
> only increase (so it can be compared to a previous read of the
> global gen counter and ignored if bigger; if smaller, then it
> can be guaranteed that .map and .gen are consistent).

Maybe that's a good piece to start with.  Let me try to give a brief
outline of how you could describe this more thoroughly.

You seem to need an atomic snapshot across more than one read of a
slotinfo entry (if that's not the case, just treat the following as an
example for illustrative purposes).  This snapshot will be required by
some higher-level operation, which you refer to.  That's the highest
level of abstraction in this example.

You know that modifications are ordered (through some mechanism in the
program, which you mention), and you know that this order is
represented by the generation number (so, if modification A happens
before modification B, then A's generation will be less than B's
generation).  You try to solve the atomic snapshot problem by ignoring
everything with a generation greater than a number chosen by you.
That's the middle level of abstraction in this example: how you intend
to solve the synchronization problem.

Finally, you need to implement this with atomics.  The middle level
tells you which happens-before relationships you need, and thus which
memory orders you need.  For example, if a slotinfo entry is published
by writing a generation number, then you write this number with release
MO and read it with acquire MO in the consumers; these will
synchronize, and they ensure that if you read a certain number, you
also see the matching contents of the entry.  (I sketch this in code
below.)  Because you've described the middle abstraction layer, you can
argue why you need those MOs by just going up one level and referring
to your code comments.  This is easier than trying to argue directly
why this is sufficient to solve the problem at the highest level of
abstraction (ie, the atomic snapshot).

So you build up these different abstraction layers, which helps make
the relationship between layer N and layer N-1 tractable for one's
brain.  You're thorough about the statements you make at each layer,
which ensures that you don't have to consider anything but the layers
at hand (eg, because there's no ambiguity).  You document the models
and reasoning you build at each layer, so that it's easier for yourself
-- and all other current and future developers -- to remember and refer
to them.  Also, when combined with being thorough, this makes things
easier to understand, because what you documented for a particular
layer is law; you don't have to second-guess it, which frees up
resources in your brain.
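To make the lowest layer concrete, here is a minimal sketch of that
publish/consume step in C11 atomics.  All names and the layout are made
up for illustration; this is not the actual slotinfo structure:

#include <stdatomic.h>
#include <stddef.h>

struct link_map;                  /* Opaque stand-in for the real map.  */

/* Hypothetical, simplified slotinfo entry.  */
struct slotinfo
{
  _Atomic size_t gen;             /* Generation that published the entry.  */
  struct link_map *_Atomic map;   /* Written before gen is published.  */
};

/* Writer (dlopen): publish an entry.  The release store to gen makes
   the preceding store to map visible to every reader that observes the
   new generation through an acquire load.  */
static void
publish_entry (struct slotinfo *e, struct link_map *m, size_t new_gen)
{
  atomic_store_explicit (&e->map, m, memory_order_relaxed);
  atomic_store_explicit (&e->gen, new_gen, memory_order_release);
}

/* Reader (eg, _dl_allocate_tls_init): read a consistent (gen, map)
   pair, given a threshold generation read earlier from the global
   counter.  Anything published after the threshold is ignored.  */
static struct link_map *
read_entry (struct slotinfo *e, size_t threshold_gen)
{
  size_t g = atomic_load_explicit (&e->gen, memory_order_acquire);
  if (g > threshold_gen)
    return NULL;                  /* Newer than our snapshot; skip.  */
  /* The acquire load above synchronizes with the release store in
     publish_entry, so this load sees the map matching gen -- unless
     the entry can be reused concurrently, which is exactly one of the
     holes mentioned below (the reader would then have to re-check gen
     after reading map, and restart).  */
  return atomic_load_explicit (&e->map, memory_order_relaxed);
}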
In my experience, thinking in terms of atomicity and ordering at these
different layers is what works best, plus common synchronization
patterns and techniques (eg, atomic snapshots, mutual exclusion, ...)
and problems (eg, ABA).  Atomicity is important because it means there
are fewer interleavings to think about; ordering is essential because
the ordering of operations is what defines how state evolves.  This is
what I do too when I build concurrent code like the new condvar or
rwlock.  And if you do all of that thoroughly, you will see, for
example, that my brief outline above of course has holes in it (eg,
which generation number is the threshold? can slotinfo entries be
reused, so that you'd perhaps need to restart an atomic snapshot if the
version number changes? ...).  The thorough process I described is
aimed at revealing such holes in the reasoning we do about concurrent
code.

I hope this clarifies why I'm always pushing for being thorough in how
we reason about concurrent code, and in how we communicate such
reasoning to our fellow developers.

> >> (3) tls access must be as-safe, but currently tls is allocated at
> >> first access if the containing module was not loaded at the time
> >> the accessing thread was created.  This means __tls_get_addr may
> >> malloc, or lock GL(dl_load_lock), or modify the thread-local dtv.
> >> This can cause deadlocks, oom crashes or memory corruption if it
> >> happens in a signal handler (malloc in a signal handler is also
> >> problematic for malloc interposition).
> >
> > Are signal handlers allowed to access TLS?  I don't know what POSIX
> > requires (or doesn't disallow), but C++, for example, is moving
> > towards not allowing TLS (if I remember the draft wording correctly).
>
> if that's true it is just another brokenness in c++,

No, it's not broken; it's simply taking a conservative approach: if
signal handlers are not allowed to touch TLS (or doing so is UB), you
don't require anything special from implementations.  Whether that
means you can't do all the funny things in a signal handler you might
want to do is another question.

> > Could we perhaps deal with most or all of this if we had simple
> > transactions?  If these aren't arbitrary operations but are limited
> > in some way, and can be used for lots of the operations we need to
> > synchronize, this may perhaps be the easiest way to implement all
> > of this.  Once we know about the abstract operations, we can better
> > assess whether that would be a helpful approach.
>
> you can think of dlopen (up to ctors) as a transaction..
> but there are mallocs, mmaps and even debug messages that
> are not transactional operations

malloc can usually be rolled back fairly easily because doing an
unnecessary malloc is not a problem.  The same might apply to mmap (but
not to munmap).  Debug messages can be buffered.

> and some state does not
> have to be rolled back (e.g. generation count increment)
> so i don't think this helps.

It's fine if we don't have to roll back something; that's the easy
case :)  (I'm assuming that needlessly incrementing the generation
count is harmless.)

What I was thinking about was a custom transaction mechanism based on
some of the simpler STM algorithms.  We can mark transactional
operations as such, and the transaction mechanism becomes much easier
if we, for example, know all the operations of a transaction at the
start of the transaction.  A rough sketch of the shape I mean follows
below.
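Purely to illustrate that shape (made-up names; not a proposal for the
actual dlopen internals): a trivial undo log that frees speculative
allocations on abort and emits buffered debug output only on commit:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_UNDO 32

struct txn
{
  void *allocs[MAX_UNDO];       /* Speculative allocations.  */
  size_t nallocs;
  char msgbuf[1024];            /* Buffered debug output.  */
  size_t msglen;
};

/* malloc inside the transaction: remember the block so that abort can
   undo it; an unnecessary malloc is harmless, so rollback is just
   free.  */
static void *
txn_malloc (struct txn *t, size_t sz)
{
  if (t->nallocs == MAX_UNDO)
    return NULL;                /* Transaction too large for this sketch.  */
  void *p = malloc (sz);
  if (p != NULL)
    t->allocs[t->nallocs++] = p;
  return p;
}

/* Debug output inside the transaction is buffered, not printed, so an
   aborted transaction produces no output.  Messages that do not fit
   are silently dropped in this sketch.  */
static void
txn_debug (struct txn *t, const char *msg)
{
  size_t len = strlen (msg);
  if (t->msglen + len <= sizeof t->msgbuf)
    {
      memcpy (t->msgbuf + t->msglen, msg, len);
      t->msglen += len;
    }
}

static void
txn_abort (struct txn *t)
{
  while (t->nallocs > 0)
    free (t->allocs[--t->nallocs]);
  t->msglen = 0;
}

static void
txn_commit (struct txn *t)
{
  fwrite (t->msgbuf, 1, t->msglen, stderr);
  t->nallocs = 0;               /* Ownership passes to the caller.  */
  t->msglen = 0;
}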
> >> (5) GL(dl_tls_generation) is not checked for overflow consistently,
> >> and slotinfo entries can get assigned an overflowed generation
> >> count; its type is size_t and both dlopen and dlclose increment it,
> >> so overflow may be a concern on 32bit targets.
> >
> > Can the size be extended on 32b targets?  If so, we can build a
> > monotonically increasing, larger atomic counter on 32b targets too.
>
> that would need some changes
>
> the generation count up to which a dtv is set up in a
> thread is stored in dtv[0] which is 32bit on a 32bit
> system.

The other 32b half could be somewhere else.  It's more efficient if it
is in the same cacheline, of course, but it's not required to be
adjacent.  (I sketch one way to build such a counter at the end of this
mail.)

> >> Proposed fix(es):
> >
> > Before focusing on taking the lock, I'd suggest looking at the
> > bigger picture (ie, the abstract operations and the atomicity and
> > ordering requirements).
> > It may well be that a clean redesign is actually simpler than trying
> > to fix a system that is already complex and that we don't understand
> > 100%.
>
> i think that's optimistic but i can write up what i think
> the requirements are.

Thanks.  Better documentation of what the requirements are would help
even if a rewrite is not what we decide to do.
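For reference, this is the kind of construction I mean for the larger
counter on 32b targets: a 64-bit value split into two 32-bit halves,
with a seqlock-style sequence counter so that lock-free readers get a
consistent snapshot.  A sketch only; all names are made up, and it
assumes increments are already serialized (eg, under GL(dl_load_lock)):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical split 64-bit generation counter for 32-bit targets.
   The halves need not be adjacent (eg, the low half could stay in
   dtv[0]); sharing a cacheline is merely cheaper.  */
static _Atomic uint32_t gen_seq;  /* Even when stable, odd mid-update.  */
static _Atomic uint32_t gen_hi;
static _Atomic uint32_t gen_lo;

/* Writer: callers are assumed to be serialized by a lock.  */
static void
gen_increment (void)
{
  uint32_t s = atomic_load_explicit (&gen_seq, memory_order_relaxed);
  uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
  uint32_t hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);

  atomic_store_explicit (&gen_seq, s + 1, memory_order_relaxed);
  /* Order the odd sequence number before the data stores, so a reader
     that sees mid-update halves cannot see a stable sequence number.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&gen_lo, lo + 1, memory_order_relaxed);
  if (lo + 1 == 0)                /* Low half wrapped.  */
    atomic_store_explicit (&gen_hi, hi + 1, memory_order_relaxed);
  /* Release: the data stores become visible before the even sequence.  */
  atomic_store_explicit (&gen_seq, s + 2, memory_order_release);
}

/* Reader: lock-free, so usable from paths that must not block; it
   retries while an update is in flight.  */
static uint64_t
gen_read (void)
{
  uint32_t s1, s2, lo, hi;
  do
    {
      s1 = atomic_load_explicit (&gen_seq, memory_order_acquire);
      lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      s2 = atomic_load_explicit (&gen_seq, memory_order_relaxed);
    }
  while (s1 != s2 || (s1 & 1) != 0);
  return ((uint64_t) hi << 32) | lo;
}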