Message-ID: <1481119669.14990.511.camel@redhat.com>
Subject: Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel
To: Szabolcs Nagy
Cc: nd@arm.com, GNU C Library
Date: Wed, 07 Dec 2016 14:08:00 -0000
In-Reply-To: <5846F3EB.2060003@arm.com>
References: <583C7F76.3000601@arm.com> <1480965875.14990.199.camel@redhat.com> <5846F3EB.2060003@arm.com>

On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote:
> On 05/12/16 19:24, Torvald Riegel wrote:
> > On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> >> In short: relevant globals (dl_tls_max_dtv_idx, dl_tls_generation,
> >> dl_tls_dtv_slotinfo_list entries) are written by dlopen and
> >> pthread_create reads them when initializing the dtv of a new thread.
> >>
> >> Current concurrency design:
> >
> > Generally, I'd suggest starting one level of abstraction higher: What
> > are the abstract operations that we need to perform?  Which of these
> > need to appear to be atomic wrt. which other operations?  Which
> > ordering relations need to hold in every execution?
> > (In contrast, you explain things below closer to the implementation
> > level (eg, this lock is acquired for these operations, etc.).)
>
> i know you wanted this the last time too, but i don't plan to
> rewrite the dynamic linker from scratch,

That is not what I'm asking for, or suggesting.  What I do want is for
us to deal with concurrency in a thorough manner.  What I suggested is
how this can, in my experience, be done most easily -- while still
being thorough.

> i'm trying to make a
> minimal change that is acceptable without years of work.

Even for a minimal change (it looked more like minimal changes), you
need to be able to show that it does not break something else.  Even if
we're fine with accepting that we may not be able to prove that and
still want to apply the patch, we need to be thorough about the
assumptions we make about how the code we're changing works.

> the current design does not have a clean set of invariants and
> interface contracts, so it's hard to describe at a higher level of
> abstraction: there is always yet another special case that does
> not fit into a simple model, and nailing down all those special
> cases is a lot more effort than necessary.

Well, if you change something in a way that is not purely mechanical
(ie, a change that allows for exactly the same executions as before),
you must have some understanding of the system, right?  This
understanding must be sufficient to explain why your change is correct.
That's the core of what I'm asking for.  I'm not asking you to describe
every possible special case -- but rather to thoroughly document what
the system you are trying to change looks like.
So, for example, if you assume that two operations don't run
concurrently, then document that; if we find out later that this is in
fact not the case because of some special case, at least we have a
proper way to cross-check it against the assumptions you made when
proposing your change.  This incremental approach is what you'd use if
you tried to describe the full system too, so I don't mind if we only
model and describe parts of the system, as long as the way we do that
is thorough.

The big problem with not being thorough, and not documenting the
abstract-level core of your understanding of the system, is that all
future changes then have to cross-check against every tiny detail
again.  That does not scale, and it's error-prone.  I guess you have
observed this problem already, in that I suppose it would have been
easier for you to reason about your fix if you could have cross-checked
it against existing thorough documentation and rules.

We have to be thorough to be able to deal with concurrency in a way
that's somewhat easier on our brains.  Delaying that does not help
glibc.  I know it can be daunting to start, because there are so many
open questions (I've been working on fixing the nscd code, so trust
me...), but it will make things easier for us in the long run.

> ignoring the implementation and only describing the user visible
> requirements is not enough: glibc has a lot of extensions and
> non-standard behaviours users may depend on (so whatever the
> current implementation does is the spec); following the standard
> would not end up with something compatible with the current code,
> and it would still take a lot of time.

This is not about standards for me.  They are just one source of
constraints or requirements.  What I want to see is that we make it
clear which requirements we think there are, irrespective of where they
come from; likewise, what a correct program is required not to do.
This is what we start with; once that's done, checking for compliance
against some standard can still be done.

> that's why my approach was to make a small change and to try to
> reason about the changed semantics compared to the original
> behaviour instead of figuring out all the atomicity and ordering
> requirements (which can be specified in many ways).

This is not a bad approach, so don't get me wrong.  There can be
changes that cannot break a program because they just constrain
execution to a subset of the executions that would have been possible
before; for example, changing a finite number of steps into one atomic
operation (though this needs to be nonblocking, or you can interfere
with forward progress (think AS-Safety)).

However, this is often not sufficient.  First, arguing why the change
is necessary is important because it adds to what future changes have
to pay attention to (unless the change does not affect correctness).
Second, look at the first of your goals, for example: wanting to avoid
a data race is good; but which memory orders do you need?  Do you want
to make them SC because you don't know what is required (and document
the uncertainty!)?  If not, you'll have to reason based on assumptions
about the happens-before relations the code needs to enforce, and the
kind of "incoming" executions it has to face.  This ultimately boils
down to having some mental model of the abstract-level concurrency in
the code, or you can't really reason about what is truly required.
> >> (2) dlopen can reuse modids of dlclosed modules (and dlclose can
> >> decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo
> >> entry may be inconsistent if neither GL(dl_load_lock) nor
> >> GL(dl_load_write_lock) is held (e.g. concurrent dlclose and dlopen
> >> between reading the .gen and .map members of the slotinfo entry);
> >> this affects:
> >> pthread_create (_dl_allocate_tls_init, reads .gen and .map)
> >> __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
> >
> > Can we version the modids?  Or do we generally need a mechanism for
> > atomic snapshots?  What are the abstract operations and the
> > requirements for them?
>
> this is probably not such a big issue as i originally thought:
> reuse of .gen and .map is possible to detect, because gen can
> only increase (so it can be compared to a previous read of the
> global gen counter and ignored if bigger; if smaller, then it
> can be guaranteed that .map and .gen are consistent).

Maybe that's a good piece to start with.  Let me try to give a brief
outline of how you could describe this more thoroughly.

You seem to need an atomic snapshot across more than one read of a
slotinfo entry (if that's not the case, just treat the following as an
example for illustrative purposes).  This snapshot will be required by
some higher-level operation, which you refer to.  That's the highest
level of abstraction in this example.

You know that modifications are ordered (through some mechanism in the
program, which you mention), and you know that this order is
represented by the generation number (so, if modification A happens
before modification B, then A's generation will be less than B's
generation).  You try to solve the atomic snapshot problem by ignoring
everything with a generation greater than a number chosen by you.
That's the middle level of abstraction in this example: how you intend
to solve the synchronization problem.

Finally, you need to implement this with atomics.  The middle level
tells you which happens-before relationships you need, and thus which
memory orders you need.  For example, if a slotinfo entry is published
by writing a generation number, then you write this number with release
MO and read it with acquire MO in the consumers; these will
synchronize, and they ensure that if you read a certain number, you
also see the matching contents of the entry.  (I sketch this in code
below.)  Because you've described the middle abstraction layer, you can
argue why you need those MOs by just going up one level and referring
to your code comments.  This is easier than trying to argue directly
why this is sufficient to solve the problem at the highest level of
abstraction (ie, the atomic snapshot).

So you build up these different abstraction layers, which helps make
the relationship between layer N and layer N-1 tractable for one's
brain.  You're thorough about the statements you make at each layer,
which ensures that you don't have to consider anything but the layers
at hand (eg, because there's no ambiguity).  You document the models
and reasoning you build at each layer, so that it's easier for yourself
-- and all other current and future developers -- to remember and refer
to them.  Also, when combined with being thorough, this makes things
easier to understand, because what you documented for a particular
layer is law; you don't have to second-guess it, which frees up
resources in your brain.
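To make the lowest layer concrete, here is a minimal sketch of that
publish/consume step in C11 atomics.  All names and the layout are made
up for illustration; this is not the actual slotinfo structure:

#include <stdatomic.h>
#include <stddef.h>

struct link_map;                  /* Opaque stand-in for the real map.  */

/* Hypothetical, simplified slotinfo entry.  */
struct slotinfo
{
  _Atomic size_t gen;             /* Generation that published the entry.  */
  struct link_map *_Atomic map;   /* Written before gen is published.  */
};

/* Writer (dlopen): publish an entry.  The release store to gen makes
   the preceding store to map visible to every reader that observes the
   new generation through an acquire load.  */
static void
publish_entry (struct slotinfo *e, struct link_map *m, size_t new_gen)
{
  atomic_store_explicit (&e->map, m, memory_order_relaxed);
  atomic_store_explicit (&e->gen, new_gen, memory_order_release);
}

/* Reader (eg, _dl_allocate_tls_init): read a consistent (gen, map)
   pair, given a threshold generation read earlier from the global
   counter.  Anything published after the threshold is ignored.  */
static struct link_map *
read_entry (struct slotinfo *e, size_t threshold_gen)
{
  size_t g = atomic_load_explicit (&e->gen, memory_order_acquire);
  if (g > threshold_gen)
    return NULL;                  /* Newer than our snapshot; skip.  */
  /* The acquire load above synchronizes with the release store in
     publish_entry, so this load sees the map matching gen -- unless
     the entry can be reused concurrently, which is exactly one of the
     holes mentioned below (the reader would then have to re-check gen
     after reading map, and restart).  */
  return atomic_load_explicit (&e->map, memory_order_relaxed);
}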
In my experience, thinking in terms of atomicity and ordering at these
different layers is what works best, plus common synchronization
patterns and techniques (eg, atomic snapshots, mutual exclusion, ...)
and problems (eg, ABA).  Atomicity is important because it means there
are fewer interleavings to think about; ordering is essential because
the ordering of operations is what defines how state evolves.  This is
what I do too when I build concurrent code like the new condvar or
rwlock.  And if you do all of that thoroughly, you will see, for
example, that my brief outline above of course has holes in it (eg,
which generation number is the threshold? can slotinfo entries be
reused, so that you'd perhaps need to restart an atomic snapshot if the
version number changes? ...).  The thorough process I described is
aimed at revealing such holes in the reasoning we do about concurrent
code.

I hope this clarifies why I'm always pushing for being thorough in how
we reason about concurrent code, and in how we communicate such
reasoning to our fellow developers.

> >> (3) tls access must be as-safe, but currently tls is allocated at
> >> first access if the containing module was not loaded at the time
> >> the accessing thread was created.  This means __tls_get_addr may
> >> malloc, or lock GL(dl_load_lock), or modify the thread-local dtv.
> >> This can cause deadlocks, oom crashes or memory corruption if it
> >> happens in a signal handler (malloc in a signal handler is also
> >> problematic for malloc interposition).
> >
> > Are signal handlers allowed to access TLS?  I don't know what POSIX
> > requires (or doesn't disallow), but C++, for example, is moving
> > towards not allowing TLS (if I remember the draft wording correctly).
>
> if that's true it is just another brokenness in c++,

No, it's not broken; it's simply taking a conservative approach: if
signal handlers are not allowed to touch TLS (or doing so is UB), you
don't require anything special from implementations.  Whether that
means you can't do all the funny things in a signal handler you might
want to do is another question.

> > Could we perhaps deal with most or all of this if we had simple
> > transactions?  If these aren't arbitrary operations but are limited
> > in some way, and can be used for lots of the operations we need to
> > synchronize, this may perhaps be the easiest way to implement all
> > of this.  Once we know about the abstract operations, we can better
> > assess whether that would be a helpful approach.
>
> you can think of dlopen (up to ctors) as a transaction..
> but there are mallocs, mmaps and even debug messages that
> are not transactional operations

malloc can usually be rolled back fairly easily because doing an
unnecessary malloc is not a problem.  The same might apply to mmap (but
not to munmap).  Debug messages can be buffered.

> and some state does not
> have to be rolled back (e.g. generation count increment)
> so i don't think this helps.

It's fine if we don't have to roll back something; that's the easy
case :)  (I'm assuming that needlessly incrementing the generation
count is harmless.)

What I was thinking about was a custom transaction mechanism based on
some of the simpler STM algorithms.  We can mark transactional
operations as such, and the transaction mechanism becomes much easier
if we, for example, know all the operations of a transaction at the
start of the transaction.  A rough sketch of the shape I mean follows
below.
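Purely to illustrate that shape (made-up names; not a proposal for the
actual dlopen internals): a trivial undo log that frees speculative
allocations on abort and emits buffered debug output only on commit:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_UNDO 32

struct txn
{
  void *allocs[MAX_UNDO];       /* Speculative allocations.  */
  size_t nallocs;
  char msgbuf[1024];            /* Buffered debug output.  */
  size_t msglen;
};

/* malloc inside the transaction: remember the block so that abort can
   undo it; an unnecessary malloc is harmless, so rollback is just
   free.  */
static void *
txn_malloc (struct txn *t, size_t sz)
{
  if (t->nallocs == MAX_UNDO)
    return NULL;                /* Transaction too large for this sketch.  */
  void *p = malloc (sz);
  if (p != NULL)
    t->allocs[t->nallocs++] = p;
  return p;
}

/* Debug output inside the transaction is buffered, not printed, so an
   aborted transaction produces no output.  Messages that do not fit
   are silently dropped in this sketch.  */
static void
txn_debug (struct txn *t, const char *msg)
{
  size_t len = strlen (msg);
  if (t->msglen + len <= sizeof t->msgbuf)
    {
      memcpy (t->msgbuf + t->msglen, msg, len);
      t->msglen += len;
    }
}

static void
txn_abort (struct txn *t)
{
  while (t->nallocs > 0)
    free (t->allocs[--t->nallocs]);
  t->msglen = 0;
}

static void
txn_commit (struct txn *t)
{
  fwrite (t->msgbuf, 1, t->msglen, stderr);
  t->nallocs = 0;               /* Ownership passes to the caller.  */
  t->msglen = 0;
}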
> >> (5) GL(dl_tls_generation) is not checked for overflow consistently,
> >> and slotinfo entries can get assigned an overflowed generation
> >> count; its type is size_t and both dlopen and dlclose increment it,
> >> so overflow may be a concern on 32bit targets.
> >
> > Can the size be extended on 32b targets?  If so, we can build a
> > monotonically increasing, larger atomic counter on 32b targets too.
>
> that would need some changes
>
> the generation count up to which a dtv is set up in a
> thread is stored in dtv[0] which is 32bit on a 32bit
> system.

The other 32b half could be somewhere else.  It's more efficient if it
is in the same cacheline, of course, but it's not required to be
adjacent.  (I sketch one way to build such a counter at the end of this
mail.)

> >> Proposed fix(es):
> >
> > Before focusing on taking the lock, I'd suggest looking at the
> > bigger picture (ie, the abstract operations and the atomicity and
> > ordering requirements).
> > It may well be that a clean redesign is actually simpler than trying
> > to fix a system that is already complex and that we don't understand
> > 100%.
>
> i think that's optimistic but i can write up what i think
> the requirements are.

Thanks.  Better documentation of what the requirements are would help
even if a rewrite is not what we decide to do.
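For reference, this is the kind of construction I mean for the larger
counter on 32b targets: a 64-bit value split into two 32-bit halves,
with a seqlock-style sequence counter so that lock-free readers get a
consistent snapshot.  A sketch only; all names are made up, and it
assumes increments are already serialized (eg, under GL(dl_load_lock)):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical split 64-bit generation counter for 32-bit targets.
   The halves need not be adjacent (eg, the low half could stay in
   dtv[0]); sharing a cacheline is merely cheaper.  */
static _Atomic uint32_t gen_seq;  /* Even when stable, odd mid-update.  */
static _Atomic uint32_t gen_hi;
static _Atomic uint32_t gen_lo;

/* Writer: callers are assumed to be serialized by a lock.  */
static void
gen_increment (void)
{
  uint32_t s = atomic_load_explicit (&gen_seq, memory_order_relaxed);
  uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
  uint32_t hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);

  atomic_store_explicit (&gen_seq, s + 1, memory_order_relaxed);
  /* Order the odd sequence number before the data stores, so a reader
     that sees mid-update halves cannot see a stable sequence number.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&gen_lo, lo + 1, memory_order_relaxed);
  if (lo + 1 == 0)                /* Low half wrapped.  */
    atomic_store_explicit (&gen_hi, hi + 1, memory_order_relaxed);
  /* Release: the data stores become visible before the even sequence.  */
  atomic_store_explicit (&gen_seq, s + 2, memory_order_release);
}

/* Reader: lock-free, so usable from paths that must not block; it
   retries while an update is in flight.  */
static uint64_t
gen_read (void)
{
  uint32_t s1, s2, lo, hi;
  do
    {
      s1 = atomic_load_explicit (&gen_seq, memory_order_acquire);
      lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      s2 = atomic_load_explicit (&gen_seq, memory_order_relaxed);
    }
  while (s1 != s2 || (s1 & 1) != 0);
  return ((uint64_t) hi << 32) | lo;
}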