* [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-11-28 19:03 UTC
To: GNU C Library; +Cc: nd, Torvald Riegel

Before sending an updated version of https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html I will try to provide a high level description of the issues (which can go into the concurrency notes of the patch), in the hope that agreeing on the design will reduce the number of patch review iterations.

Overview:

Dynamic linker internal data structures for thread local storage (TLS) may be accessed concurrently, causing assertion failures in dl-tls.c when dlopen loads modules that have TLS concurrently with pthread_create.

Assertions that may trigger:

  Inconsistency detected by ld.so: dl-tls.c: 520: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
  Inconsistency detected by ld.so: dl-tls.c: 556: _dl_allocate_tls_init: Assertion `listp != NULL' failed!

Related recent discussions:

  TLS performance degradation after dlopen: https://sourceware.org/ml/libc-alpha/2016-06/msg00203.html
  nptl/tst-stack4 failure: https://sourceware.org/ml/libc-alpha/2016-11/msg00574.html
  Test case patch: https://sourceware.org/ml/libc-alpha/2016-11/msg00917.html

Current behaviour (minor details omitted):

dlopen (operations in program order):

  Load a module and its dependencies, assigning a modid to each while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv and slotinfo index of the module).

  Add each module to GL(dl_tls_dtv_slotinfo_list) using _dl_add_to_slotinfo, assigning the next generation count to these slotinfo entries.

  Increment the GL(dl_tls_generation) counter.

  Update the dtv of the current thread following the slotinfo changes by calling _dl_update_slotinfo (this only resizes the dtv and fills it with unallocated slots; actual TLS allocation and initialization is deferred).
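The write ordering above is the crux of why pthread_create's assertions can fire: the module count runs ahead of the slotinfo list, which runs ahead of the generation counter. As a toy single-threaded model (illustrative names, not glibc code) of the three bookkeeping updates and of assumption (a) asserted by pthread_create:

```c
#include <stddef.h>

/* Toy model (not glibc code) of the three TLS bookkeeping writes that
   dlopen performs, in the program order described above.  Between the
   first and second step a reader sees dl_tls_max_dtv_idx run ahead of
   the slotinfo list, which is exactly what assertion (a) rules out.  */
struct tls_globals { size_t max_dtv_idx; size_t slotinfo_len; size_t generation; };

static void step_assign_modid (struct tls_globals *g)    { g->max_dtv_idx++; }
static void step_add_slotinfo (struct tls_globals *g)    { g->slotinfo_len++; }
static void step_bump_generation (struct tls_globals *g) { g->generation++; }

/* assumption (a) asserted by pthread_create's TLS setup */
static int assumption_a_holds (const struct tls_globals *g)
{
  return g->slotinfo_len >= g->max_dtv_idx;
}

/* returns 1 if a reader between the first two steps would trip assert (a) */
int race_window_exists (void)
{
  struct tls_globals g = { 1, 1, 1 };    /* one module already loaded */
  step_assign_modid (&g);
  int tripped = !assumption_a_holds (&g); /* reader here: idx > list length */
  step_add_slotinfo (&g);
  step_bump_generation (&g);
  return tripped && assumption_a_holds (&g);
}
```

Even seq_cst accesses would not help here, as noted below: the problem is the program-order gap between the steps, not the visibility of any single store.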
pthread_create:

  Call _dl_allocate_tls_init to allocate the dtv and TLS for the new thread, covering all currently loaded modules.  ("Currently loaded" means up to a generation count; the dtv and TLS for modules with a larger generation count are allocated lazily on first access.)

Other APIs:

  dlclose can write the relevant globals too, and TLS access and dl_iterate_phdr read them (I focus on dlopen and pthread_create first; similar concerns affect these APIs).  dl_main can write these globals too, but I think it is single threaded (early startup, no pthreads yet; can audit modules change that?).

In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.

Current concurrency design:

dlopen holds GL(dl_load_lock) while writing the relevant globals (using volatile loads and stores).  The lock serializes dlopen calls, but there is no synchronization with pthread_create (other than mmap and mprotect calls, which are likely synchronized but not part of the C11 memory model).

pthread_create accesses the globals without locks and uses volatile loads: the dtv size and the valid entries in the slotinfo list are determined based on GL(dl_tls_max_dtv_idx), with the following asserted assumptions:

  (a) the slotinfo list has at least GL(dl_tls_max_dtv_idx) entries set up,
  (b) slotinfo entries have generation count <= GL(dl_tls_generation).

These would be wrong even if all memory accesses were mo_seq_cst: slotinfo entries are added after the modid is assigned, and the generation counter is incremented even later.

Current design bugs in glibc:

(0) There are data races on the relevant globals, and there are asserts that are not guaranteed to hold (a and b above).  This is what I was trying to fix.
(1) Any access to the link_map struct is invalid if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held, because a concurrent dlclose may free the map.  This affects:

  pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
  __tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)

(2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:

  pthread_create (_dl_allocate_tls_init, reads .gen and .map)
  __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)

(3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).

(4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).

(5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.

(6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.
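For (5), one conceivable guard (a sketch with made-up names, not glibc's actual code) is to refuse the increment rather than hand out a wrapped generation, failing the dlopen/dlclose instead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guarded bump for a size_t generation counter: refuse to
   wrap rather than assign an overflowed generation to a slotinfo entry.
   Illustrative only; glibc would need this at every increment site.  */
static int bump_generation (size_t *gen)
{
  if (*gen == SIZE_MAX)
    return -1;        /* caller must fail the dlopen/dlclose */
  (*gen)++;
  return 0;
}

int overflow_demo (void)
{
  size_t g = SIZE_MAX - 1;
  if (bump_generation (&g) != 0 || g != SIZE_MAX)
    return 0;
  return bump_generation (&g) == -1;  /* next bump must be refused */
}
```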
Proposed fix(es):

(0) can be fixed by adding atomics and removing the invalid asserts: it can be guaranteed that whatever GL(dl_tls_generation) is read in pthread_create, all slotinfo entries with that generation will be read too, and the read GL(dl_tls_max_dtv_idx) is no less than the modids of those entries.  Then walking the slotinfo list until the end and only considering entries up to the observed generation count would set up the dtv consistently, without races, assuming there is no concurrent dlclose.  (I think my previous patch did this.)

On the other hand, (1) and (2) clearly point in the direction that pthread_create should take the lock, in which case thread creation performance may go down a bit, the influence of (6) increases (but that is broken anyway), and we still haven't done anything about __tls_get_addr.  (I don't see another reason to avoid taking the lock.)

Fixing (3) or (6) would need a significant redesign.

Fixing (4) and (5) may be possible independently.

I think these are the immediately available options:

  - a lock-free pthread_create that is broken with concurrent dlclose,
  - a pthread_create that locks and may deadlock when synchronized with a ctor,
  - or the current brokenness (concurrent pthread_create and dlopen asserts).
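The fix for (0) described above can be sketched as follows, under the assumption that dlopen fully publishes each slotinfo entry before bumping the generation with a release store; the struct and function names are illustrative, not glibc's actual code:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of fix (0): pthread_create acquire-loads the generation once,
   then walks the whole slotinfo list and *skips* entries newer than the
   observed generation instead of asserting they cannot exist.  The
   acquire load pairs with a hypothetical release store in dlopen made
   after the entries are published.  Names are illustrative.  */
struct slotinfo { size_t gen; void *map; };

static size_t dtv_setup (const struct slotinfo *list, size_t len,
                         const _Atomic size_t *global_gen)
{
  size_t gen = atomic_load_explicit (global_gen, memory_order_acquire);
  size_t installed = 0;
  for (size_t i = 0; i < len; i++)
    if (list[i].map != NULL && list[i].gen <= gen)
      installed++;   /* would set up this module's dtv slot here */
  return installed;  /* entries with gen > observed are left for lazy init */
}

size_t dtv_setup_demo (void)
{
  static int m1, m2, m3;   /* stand-ins for link maps */
  struct slotinfo list[] = { { 1, &m1 }, { 2, &m2 }, { 3, &m3 } };
  _Atomic size_t generation = 2;   /* dlopen of the third module not yet done */
  return dtv_setup (list, 3, &generation);
}
```

As the text notes, this only holds if there is no concurrent dlclose; a dlclose could free a map the walk has already decided to use.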
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel @ 2016-12-05 19:24 UTC
To: Szabolcs Nagy; +Cc: GNU C Library, nd

On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> Before sending an updated version of https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html I will try to provide a high level description of the issues (which can go into the concurrency notes of the patch), in the hope that agreeing on the design will reduce the number of patch review iterations.

That's a good approach.  Thanks.

> dlopen (operations in program order):
>
>   Load a module and its dependencies, assigning a modid to each while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv and slotinfo index of the module).
>
>   Add each module to GL(dl_tls_dtv_slotinfo_list) using _dl_add_to_slotinfo, assigning the next generation count to these slotinfo entries.
>
>   Increment the GL(dl_tls_generation) counter.
>
>   Update the dtv of the current thread following the slotinfo changes by calling _dl_update_slotinfo (this only resizes the dtv and fills it with unallocated slots; actual TLS allocation and initialization is deferred).
>
> pthread_create:
>
>   Call _dl_allocate_tls_init to allocate the dtv and TLS for the new thread, covering all currently loaded modules.  ("Currently loaded" means up to a generation count; the dtv and TLS for modules with a larger generation count are allocated lazily on first access.)
>
> Other APIs:
>
>   dlclose can write the relevant globals too, and TLS access and dl_iterate_phdr read them (I focus on dlopen and pthread_create first; similar concerns affect these APIs).
>   dl_main can write these globals too, but I think it is single threaded (early startup, no pthreads yet; can audit modules change that?).
>
> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
>
> Current concurrency design:

Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)

There are a couple of reasons why I'm suggesting that.  First, it helps people that are not familiar in detail with all of the requirements (such as myself :).  Second, and perhaps more immediately important from your perspective, it will likely guide us to what we need to do in the implementation.  In my experience, the tighter your understanding of a concurrency problem is on the abstract level, the better you can understand what the options are and how the solution should look.

For example, assume some operation needs to be atomic.  A natural reaction is to use a lock to guarantee mutual exclusion -- but do we really need a lock?  One question to ask then is whether other operations really depend on all the details of the operation, or whether it can be reduced to something smaller and its effects "replayed" at other threads.  If so, then one can often announce that the operation has happened in some other mechanism that summarizes the current state of the world.  Otherwise, is it actually important which thread executes the operation?
If it is, then we may actually need a lock, because the abstract essence of the concurrency problem is that thread B has to wait for thread A to perform the operation.

Third, this abstract description gives the code a vocabulary that can then be used to cleanly describe why the actual implementation is supposed to work.

In your v3 patch from January, you had something in that direction: a list of readers and writers.  I'd suggest to take this one step further.  For example, there are certain operations like pthread_create and the sub-operations these have to perform.  Then there's code that uses TLS, dlopen calls, etc.  The language specs and POSIX make requirements on these, which already restricts the valid executions that we need to support.  That's the start of the requirements.  Then there are requirements, or a wish-list of what we'd like to have, such as AS-Safety.

Then we can break these down into the abstract steps we need to perform, such as initializing TLS, or memory reclamation.  Combined with the constraints we have on these (e.g., uses of a piece of memory need to happen before the memory is released), this gives us a high-level plan of the concurrency problem.  We can then iterate and refine that until we have something that's implementable.

Doing that iterating on an abstract level will allow us to see possibilities to apply patterns such as a need for mutual exclusion, a need for a simple ordering of events, a need to do atomic snapshots, etc.  If you skip the abstract level, it becomes too complex quickly, and you run the risk of inventing a fully custom concurrent algorithm, instead of being able to apply known techniques (e.g., locking is one, but also only just one).  See some of my questions below for examples.

Note that by "abstract", I don't mean vague or so high-level that it lacks important detail -- but rather something that has been reduced to the core of what we need to reason about when treating this as a concurrency problem.
This is why I've talked about atomicity and ordering requirements.

> dlopen holds GL(dl_load_lock) while writing the relevant globals (using volatile loads and stores).  The lock serializes dlopen calls, but there is no synchronization with pthread_create (other than mmap and mprotect calls, which are likely synchronized but not part of the C11 memory model).
>
> pthread_create accesses the globals without locks and uses volatile loads: the dtv size and the valid entries in the slotinfo list are determined based on GL(dl_tls_max_dtv_idx), with the following asserted assumptions:
>   (a) the slotinfo list has at least GL(dl_tls_max_dtv_idx) entries set up,
>   (b) slotinfo entries have generation count <= GL(dl_tls_generation).
> These would be wrong even if all memory accesses were mo_seq_cst: slotinfo entries are added after the modid is assigned, and the generation counter is incremented even later.
>
> Current design bugs in glibc:
>
> (0) There are data races on the relevant globals, and there are asserts that are not guaranteed to hold (a and b above).  This is what I was trying to fix.
>
> (1) Any access to the link_map struct is invalid if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held, because a concurrent dlclose may free the map.  This affects:
>   pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
>   __tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)

You are describing the current issue in the implementation, which is perfectly fine.  What is the underlying synchronization problem we need to solve?  Do we just need to reclaim memory after all of its uses?  Locking may get this done, but other techniques such as hazard pointers might be better, in particular if we can reclaim memory lazily and don't want to use blocking synchronization.
> (2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:
>   pthread_create (_dl_allocate_tls_init, reads .gen and .map)
>   __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)

Can we version the modids?  Or do we generally need a mechanism for atomic snapshots?  What are the abstract operations and the requirements for them?

> (3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).

Are signal handlers allowed to access TLS?  I don't know what POSIX requires (or doesn't disallow), but C++, for example, is moving towards not allowing TLS (if I remember the draft wording correctly).

> (4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).

Do we need to be able to publish changes atomically?  If so, there are known techniques for that.

Could we perhaps deal with most or all of this if we had simple transactions?  If these aren't arbitrary operations but are limited in some way, and can be used for lots of the operations we need to synchronize, this may perhaps be the easiest way to implement all of this.  Once we know about the abstract operations, we can better assess whether that would be a helpful approach.
> (5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.

Can the size be extended on 32-bit targets?  If so, we can build a monotonically increasing, larger atomic counter on 32-bit targets too.

> (6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.

Do you mean dtors?

> Proposed fix(es):
>
> (0) can be fixed by adding atomics and removing the invalid asserts: it can be guaranteed that whatever GL(dl_tls_generation) is read in pthread_create, all slotinfo entries with that generation will be read too, and the read GL(dl_tls_max_dtv_idx) is no less than the modids of those entries.  Then walking the slotinfo list until the end and only considering entries up to the observed generation count would set up the dtv consistently, without races, assuming there is no concurrent dlclose.  (I think my previous patch did this.)
>
> On the other hand, (1) and (2) clearly point in the direction that pthread_create should take the lock, in which case thread creation performance may go down a bit, the influence of (6) increases (but that is broken anyway), and we still haven't done anything about __tls_get_addr.  (I don't see another reason to avoid taking the lock.)

Before focusing on taking the lock, I'd suggest to look at the bigger picture (i.e., the abstract operations and the atomicity and ordering requirements).  It is possible that a clean redesign is actually simpler than trying to fix a system that is already complex and which we don't understand 100%.

> Fixing (3) or (6) would need a significant redesign.
>
> Fixing (4) and (5) may be possible independently.
> I think these are the immediately available options:
>
>   - a lock-free pthread_create that is broken with concurrent dlclose,
>   - a pthread_create that locks and may deadlock when synchronized with a ctor,
>   - or the current brokenness (concurrent pthread_create and dlopen asserts).

Looking at the abstract operations will allow us to make such conclusions.  Also, if we document these, it will be much clearer to future developers why we did what we did (i.e., the reasoning behind options and decisions such as the ones you list above).

I hope this helps.
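One known technique behind the "publish changes atomically" question raised above is release/acquire publication: fully initialize the new state, then make it reachable with a single release store, so a reader either sees a fully initialized object or nothing.  A generic sketch of the pattern (not glibc code; names are made up):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Generic release/acquire publication sketch: the writer initializes
   the object completely and only then makes it reachable with a release
   store; an acquire load on the reader side guarantees the reader also
   sees the prior initialization.  */
struct slot { size_t gen; int value; };

static _Atomic (struct slot *) published;

static void publish (struct slot *s)
{
  atomic_store_explicit (&published, s, memory_order_release);
}

static struct slot *observe (void)
{
  return atomic_load_explicit (&published, memory_order_acquire);
}

int publish_demo (void)
{
  static struct slot s;
  s.gen = 7;
  s.value = 42;        /* fully initialize before publication */
  publish (&s);
  struct slot *r = observe ();
  return r != NULL && r->gen == 7 && r->value == 42;
}
```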
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-12-06 17:23 UTC
To: Torvald Riegel; +Cc: nd, GNU C Library

On 05/12/16 19:24, Torvald Riegel wrote:
> On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
>> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
>>
>> Current concurrency design:
>
> Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)

I know you wanted this the last time too, but I don't plan to rewrite the dynamic linker from scratch; I'm trying to make a minimal change that is acceptable without years of work.

The current design does not have a clean set of invariants and interface contracts, so it's hard to describe at a higher level of abstraction: there is always yet another special case that does not fit into a simple model, and nailing down all those special cases is a lot more effort than necessary.

Ignoring the implementation and only describing the user visible requirements is not enough: glibc has a lot of extensions and non-standard behaviours users may depend on (so whatever the current implementation does is the spec); following the standard would not end up with something compatible with the current code, and it would still take a lot of time.
That's why my approach was to make a small change and to reason about the changed semantics compared to the original behaviour, instead of figuring out all the atomicity and ordering requirements (which can be specified in many ways).

>> (2) dlopen can reuse the modids of dlclosed modules (and dlclose can decrement GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) is held (e.g. a concurrent dlclose and dlopen between reading the .gen and .map members of the slotinfo entry).  This affects:
>>   pthread_create (_dl_allocate_tls_init, reads .gen and .map)
>>   __tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
>
> Can we version the modids?  Or do we generally need a mechanism for atomic snapshots?  What are the abstract operations and the requirements for them?

This is probably not as big an issue as I originally thought: reuse of .gen and .map is possible to detect, because gen can only increase (so it can be compared to a previous read of the global generation counter and the entry ignored if it is bigger; if it is smaller, then it can be guaranteed that .map and .gen are consistent).

>> (3) TLS access must be AS-safe, but currently TLS is allocated at first access if the containing module was not loaded at the time the accessing thread was created.  This means __tls_get_addr may call malloc, lock GL(dl_load_lock) or modify the thread-local dtv.  This can cause deadlocks, OOM crashes or memory corruption if it happens in a signal handler (malloc in a signal handler is also problematic for malloc interposition).
>
> Are signal handlers allowed to access TLS?  I don't know what POSIX requires (or doesn't disallow), but C++, for example, is moving towards not allowing TLS (if I remember the draft wording correctly).

If that's true, it is just another brokenness in C++; I would not worry about it, since dynamic loading of libraries and TLS is unfixably broken in C++ anyway.
Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).

>> (4) Allocation failures can cause OOM crashes in pthread_create and dlopen (_dl_resize_dtv) and at TLS access (various mallocs), or may leave inconsistent state behind in dlopen (_dl_add_to_slotinfo, or any later call in dlopen that fails: TLS changes are not rolled back).
>
> Do we need to be able to publish changes atomically?  If so, there are known techniques for that.

That may not be necessary: I think this can be solved without being atomic if some resource leaks are allowed (an argument can be made that that's fine; it's one of those special case decisions).

> Could we perhaps deal with most or all of this if we had simple transactions?  If these aren't arbitrary operations but are limited in some way, and can be used for lots of the operations we need to synchronize, this may perhaps be the easiest way to implement all of this.  Once we know about the abstract operations, we can better assess whether that would be a helpful approach.

You can think of dlopen (up to the ctors) as a transaction, but there are mallocs, mmaps and even debug messages that are not transactional operations, and some state does not have to be rolled back (e.g. the generation count increment), so I don't think this helps.

>> (5) GL(dl_tls_generation) is not checked for overflow consistently, and slotinfo entries can get assigned an overflowed generation count.  Its type is size_t, and both dlopen and dlclose increment it, so overflow may be a concern on 32-bit targets.
>
> Can the size be extended on 32-bit targets?
> If so, we can build a monotonically increasing, larger atomic counter on 32-bit targets too.

That would need some changes: the generation count up to which a dtv is set up in a thread is stored in dtv[0], which is 32-bit on a 32-bit system.  Making the dtv 64-bit will break anything that tries to look at the dtv (asan?); hacking around it by using dtv[-1] is possible but not elegant (the dtv length is already stored there, it would just have to be moved to dtv[-2]).

>> (6) dlclose holds GL(dl_load_lock) while running ctors, so if those ctors wait for anything that may take the same lock, there is a deadlock.
>
> Do you mean dtors?

Sorry, correctly: dlopen holds the lock while running ctors, and dlclose holds the lock while running dtors.

>> Proposed fix(es):
>
> Before focusing on taking the lock, I'd suggest to look at the bigger picture (i.e., the abstract operations and the atomicity and ordering requirements).  It is possible that a clean redesign is actually simpler than trying to fix a system that is already complex and which we don't understand 100%.

I think that's optimistic, but I can write up what I think the requirements are.
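The "monotonically increasing, larger atomic counter" idea for 32-bit targets could be built seqlock-style from two 32-bit words plus a sequence word.  This is only a sketch of the technique, assuming writers stay serialized by GL(dl_load_lock) while readers are lock-free; it does not address the dtv[0] layout problem discussed above:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch: a 64-bit monotonic generation on a 32-bit target, built from
   two 32-bit atomics guarded by a seqlock-style sequence counter.  The
   writer is assumed to be serialized by a lock (GL(dl_load_lock) in the
   real code); readers retry if they may have observed a torn value.  */
static _Atomic uint32_t seq;       /* odd while a write is in progress */
static _Atomic uint32_t gen_hi, gen_lo;

void gen64_bump (void)             /* writer, lock held */
{
  uint32_t s = atomic_load_explicit (&seq, memory_order_relaxed);
  atomic_store_explicit (&seq, s + 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_release);
  uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
  if (lo == UINT32_MAX)
    {                              /* carry into the high word */
      atomic_fetch_add_explicit (&gen_hi, 1, memory_order_relaxed);
      atomic_store_explicit (&gen_lo, 0, memory_order_relaxed);
    }
  else
    atomic_store_explicit (&gen_lo, lo + 1, memory_order_relaxed);
  atomic_store_explicit (&seq, s + 2, memory_order_release);
}

uint64_t gen64_read (void)         /* reader, no lock */
{
  for (;;)
    {
      uint32_t s1 = atomic_load_explicit (&seq, memory_order_acquire);
      if (s1 & 1)
        continue;                  /* write in progress, retry */
      uint32_t hi = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      uint32_t lo = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      if (atomic_load_explicit (&seq, memory_order_relaxed) == s1)
        return ((uint64_t) hi << 32) | lo;
    }
}

uint64_t gen64_demo (void)
{
  for (int i = 0; i < 5; i++)
    gen64_bump ();
  return gen64_read ();
}
```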
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Joseph Myers @ 2016-12-06 17:48 UTC
To: Szabolcs Nagy; +Cc: Torvald Riegel, nd, GNU C Library

On Tue, 6 Dec 2016, Szabolcs Nagy wrote:

> Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).

errno, the powerpc soft-float floating-point environment and the libdfp soft-dfp decimal rounding mode are all initial-exec TLS (and other cases of floating-point environment are hardware registers rather than TLS variables).  Does being initial-exec help?

-- 
Joseph S. Myers
joseph@codesourcery.com
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Szabolcs Nagy @ 2016-12-06 18:31 UTC
To: Joseph Myers; +Cc: Szabolcs Nagy, Torvald Riegel, nd, GNU C Library

* Joseph Myers <joseph@codesourcery.com> [2016-12-06 17:47:45 +0000]:
> On Tue, 6 Dec 2016, Szabolcs Nagy wrote:
>> Meanwhile, the rest of the world needs a way to detect reentry into an AS-safe library call, which currently requires AS-safe TLS (the signal mask is not an option: syscalls are too slow), a way to do soft float correctly (the fenv is TLS, and arithmetic must be AS-safe), and a way to just access errno (which could be made AS-safe even if TLS access is not AS-safe otherwise, but specifying that correctly in the standard is more hassle than requiring all TLS access to be AS-safe).
>
> errno, the powerpc soft-float floating-point environment and the libdfp soft-dfp decimal rounding mode are all initial-exec TLS (and other cases of floating-point environment are hardware registers rather than TLS variables).  Does being initial-exec help?

Yes, that works: only TLS access through __tls_get_addr may be non-AS-safe now.  (But if a language standard does not guarantee that TLS access is AS-safe, then implementation behaviour does not help portable code.)
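The initial-exec model discussed above can also be requested explicitly.  A minimal sketch (the variable name is made up for illustration) of why such an access stays AS-safe: it compiles to a fixed offset from the thread pointer and never goes through __tls_get_addr, so no allocation or locking can happen at access time:

```c
/* Hypothetical soft-float status word forced into the initial-exec TLS
   model, the same model errno uses.  The access below is a load/store
   at a fixed offset from the thread pointer: no dtv walk, no malloc,
   no __tls_get_addr call, hence usable from a signal handler.  */
static __thread int soft_fenv_flags __attribute__ ((tls_model ("initial-exec")));

int raise_flag (int f)
{
  soft_fenv_flags |= f;
  return soft_fenv_flags;
}
```

The trade-off is that initial-exec TLS must fit in the static TLS block, so it works for libc-internal state like errno but not for arbitrary dlopened modules.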
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
From: Torvald Riegel @ 2016-12-07 14:08 UTC
To: Szabolcs Nagy; +Cc: nd, GNU C Library

On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote:
> On 05/12/16 19:24, Torvald Riegel wrote:
> > On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote:
> >> In short: the relevant globals (dl_tls_max_dtv_idx, dl_tls_generation, dl_tls_dtv_slotinfo_list entries) are written by dlopen, and pthread_create reads them when initializing the dtv of a new thread.
> >>
> >> Current concurrency design:
> >
> > Generally, I'd suggest to start one level of abstraction higher: What are the abstract operations that we need to perform?  Which of these need to appear to be atomic wrt. which other operations?  Which ordering relations need to hold in every execution?  (In contrast, you explain things below closer to the implementation level (e.g., this lock is acquired for these operations, etc.).)
>
> I know you wanted this the last time too, but I don't plan to rewrite the dynamic linker from scratch,

That is not what I'm asking for, or suggesting.  What I do want is that we deal with concurrency in a thorough manner.  What I suggested is how this can be done most easily, in my experience -- while still being thorough.

> I'm trying to make a minimal change that is acceptable without years of work.

Even for a minimal change (it looked more like minimal changes), you need to be able to show that it is not breaking something else.  Even if we're fine with accepting that we may not be able to prove that and still want to apply the patch, we need to be thorough about the assumptions we make about how the code we're changing works.
> The current design does not have a clean set of invariants and interface contracts, so it's hard to describe at a higher level of abstraction: there is always yet another special case that does not fit into a simple model, and nailing down all those special cases is a lot more effort than necessary.

Well, if you change something in a way that is not a mechanical change (e.g., not a change that allows only the same executions), you must have some understanding of the system, right?  This understanding must be sufficient to explain why your change is correct.  That's the core of what I'm asking for.  I'm not asking you to describe every possible special case -- but instead that you thoroughly document how the system you are trying to change looks.

So, for example, if you assume that two operations don't run concurrently, then document that; if we find out later that this is in fact not the case because of some special case, at least we have a proper way to cross-check this against the assumptions you made when proposing your change.  This incremental approach is what you'd do if you tried to describe the full system too, so I don't mind if we only model and describe parts of the system, as long as the way we do that is thorough.

The big problem with not being thorough and not documenting the abstract-level core of your understanding of the system is that all future changes then have to cross-check against every tiny detail again.  This does not scale, and it's error-prone.  You have observed that problem already, I guess, in that I suppose it would have been easier for you to reason about your fix if you could have cross-checked it against existing thorough documentation and rules.  We have to be thorough to be able to deal with concurrency in a way that's somewhat easier on our brains.  Delaying that does not help glibc.
I know it can be daunting to start it, because there are so many open questions (I've been working on fixing the nscd code, so trust me...), but it will make things easier for us in the long run. > ignoring the implementation and only describing the user visible > requirements is not enough: glibc has a lot of extensions and > non-standard behaviours users may depend on (so whatever the > current implementation does is the spec), following the standard > would not end up with something compatible with the current code > and it would still take a lot of time. This is not about standards for me. They are just one source of constraints or requirements. What I want to see is that we make it clear which requirements we think there are, irrespective of where they come from; likewise, what a correct program is required not to do. This is what we start with; once that's done, checking for compliance against some standard can still be done. > that's why my approach was to make a small change and try to > reason about the changed semantics compared to the original > behaviour instead of figuring out all the atomicity and ordering > requirements (which can be specified in many ways). This is not a bad approach, so don't get me wrong. There can be changes that cannot break a program because they just constrain execution to a subset of executions that would have been possible before too; for example, changing a finite number of steps into one atomic operation (though this needs to be nonblocking or you can interfere with forward progress (think AS-Safety)). However, this is often not sufficient. First, arguing why the change is necessary is important because it adds to what future changes have to pay attention to (unless the change does not affect correctness). Second, look at the first of your goals, for example: Wanting to avoid a data race is good; but what memory orders do you need?
Do you want to make them SC because you don't know what is required (and document the uncertainty!)? If not, you'll have to reason based on assumptions about the happens-before relations the code needs to enforce, and the kind of "incoming" executions it has to face. This then ultimately boils down to having some mental model of the abstract-level concurrency in the code, or you can't really reason about what is truly required. > >> (2) dlopen can reuse modids of dlclosed modules (and dlclose can decrement > >> GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be > >> inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) are > >> held (e.g. concurrent dlclose and dlopen between reading the .gen and > >> .map members of the slotinfo entry), this affects: > >> pthread_create (_dl_allocate_tls_init, reads .gen and .map) > >> __tls_get_addr (_dl_update_slotinfo, reads .gen and .map) > > > > Can we version the modids? Or do we generally need a mechanism for > > atomic snapshots? What are the abstract operations and the requirements > > for them? > > this is probably not such a big issue as i originally thought, > reuse of .gen and .map is possible to detect, because gen can > only increase (so it can be compared to a previous read of the > global gen counter and ignored if bigger; if smaller, then it > can be guaranteed that .map and .gen are consistent). Maybe that's a good piece to start with. Let me try to give a brief outline of how you could describe that more thoroughly. You seem to need an atomic snapshot spanning more than one read of a slotinfo entry (if that's not the case, just treat the following as an example for illustrative purposes). This will be required for some higher-level operation, which you refer to. That's the highest level of abstraction in this example.
You know that modifications are ordered (through some mechanism in the program that you mention), and you know that this order is represented by the generation number (so, if modification A happens before modification B, then A's generation will be less than B's generation). You try to solve the atomic snapshot problem by ignoring everything with a generation less than a number chosen by you. That's kind of the middle level of abstraction in this example; that is, how you intend to solve the synchronization problem. Finally, you need to implement this with atomics. The middle level tells you which happens-before relationships you need, and thus what kind of memory orders you need. For example, if a slotinfo entry is published by writing a generation number, then you write this number with release MO, and read it with acquire MO in the consumers; they will synchronize, and they will ensure that if you read a certain number you also see the matching content of the entry. Because you've described the middle abstraction layer, you can argue why you need those MOs by just going up one level higher, and referring to your code comments. This is easier than trying to argue directly why this is sufficient to solve the problem at the highest level of abstraction (ie, atomic snapshot). So you build up these different abstraction layers, which helps make the relationship between layer N and layer N-1 tractable for one's brain. You're thorough about statements at each layer, which ensures that you don't have to consider anything but the layer at hand (eg, because there's no ambiguity). You document the models and reasoning you build at each layer so it's easier for yourself -- and all other and future developers -- to remember and refer to. Also, when combined with being thorough, this makes stuff easier to understand because what you documented for a particular layer is law; you don't have to second-guess it, which frees up resources in your brain.
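To make the lowest layer concrete, such a publication scheme could be sketched with C11 atomics as below. All names here are invented for illustration -- this is not the actual dl-tls.c code -- and the sketch assumes producers are serialized with each other (eg, by GL(dl_load_lock)):

```c
/* Illustrative sketch only: hypothetical names, not glibc's real
   slotinfo structures.  A producer (think dlopen) publishes an entry
   by storing its generation last, with release MO; a consumer (think
   pthread_create or __tls_get_addr) loads the generation first, with
   acquire MO, and is then guaranteed to see the matching contents.  */
#include <stdatomic.h>
#include <stddef.h>

struct slotinfo
{
  atomic_size_t gen;  /* 0 means "not yet published".  */
  void *map;          /* stands in for the module's link_map.  */
};

/* Producer side; serialized with other producers in this sketch.  */
static void
publish_entry (struct slotinfo *e, void *map, size_t new_gen)
{
  e->map = map;
  /* The release store orders the ->map write before the generation
     becomes visible to acquire loads.  */
  atomic_store_explicit (&e->gen, new_gen, memory_order_release);
}

/* Consumer side: return the map if the entry was published with a
   generation <= gen_limit, otherwise NULL (unset or too new).  */
static void *
read_entry (struct slotinfo *e, size_t gen_limit)
{
  size_t g = atomic_load_explicit (&e->gen, memory_order_acquire);
  if (g == 0 || g > gen_limit)
    return NULL;
  /* The acquire load synchronizes with the release store above, so
     ->map here is the value that was written before gen.  */
  return e->map;
}
```

The point of the middle layer is exactly this pairing: the consumer's acquire load synchronizes with the producer's release store, which is what lets "reading gen also validates map" be justified by referring one level up.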
In my experience, thinking in terms of atomicity and ordering at these different layers is what works best, plus common synchronization patterns and techniques (eg, atomic snapshots, mutual exclusion, ...) and problems (eg, ABA). Atomicity is important because it means there are fewer interleavings to think about; ordering is essential because the ordering of operations is what defines how state evolves. This is what I do too when I build concurrent code like the new condvar or rwlock. And if you do all that thoroughly, you will see, for example, that my brief outline of a description of course has holes in it (eg, which generation number is the threshold? can slotinfo entries be reused so that you'd perhaps need to restart an atomic snapshot if the version number changes? ...). The thorough process I described is aimed at revealing such holes in the reasoning we do about concurrent code. I hope that this clarifies why I'm always pushing for being thorough in how we reason about concurrent code, and how we communicate such reasoning to our fellow developers. > >> (3) tls access must be as-safe, but currently tls is allocated at first > >> access if the containing module was not loaded at the time the accessing > >> thread was created. This means __tls_get_addr may malloc or lock > >> GL(dl_load_lock) or modify the thread local dtv. This can cause deadlocks, > >> oom crashes or memory corruption if it is in a signal handler (malloc in > >> signal handler is also problematic for malloc interposition). > > > > Are signal handlers allowed to access TLS? I don't know what POSIX > > requires (or doesn't disallow), but C++, for example, is moving to not > > allowing TLS (if I remember the draft wording correctly). > > if that's true it is just another brokenness in c++, No, it's not broken, it's simply taking a conservative approach: If signal handlers are not allowed to touch TLS (or there's UB), you don't require anything special from implementations.
Whether that means that you can't do all the funny things in a signal handler you might want to do is another question. > > Can we perhaps deal with most or all of this if we had simple > > transactions? If these aren't arbitrary operations but in some way > > limited, and can be used for lots of the operations we need to > > synchronize, this may perhaps be the easiest way to implement all of > > this. Once we know about the abstract operations, we can better assess > > whether that would be a helpful approach. > > you can think of dlopen (up to ctors) as a transaction.. > but there are mallocs, mmaps and even debug messages that > are not transactional operations malloc can usually be rolled back fairly easily because doing an unnecessary malloc is not a problem. The same might apply for mmap (but not for munmap). Debug messages can be buffered. > and some state does not > have to be rolled back (e.g. generation count increment) > so i don't think this helps. It's fine if we don't have to roll back something; that's the easy case :) (I'm assuming that needlessly incrementing the generation count is harmless.) What I was thinking about was a custom transaction mechanism based on some of the simpler STM algorithms. We can mark transactional operations as such, and the transaction mechanism becomes much easier if we, for example, know all the operations of a transaction at the start of the transaction. > >> (5) GL(dl_tls_generation) is not checked for overflow consistently and > >> slotinfo entries can get assigned an overflowed generation count, its type > >> is size_t and both dlopen and dlclose increment it, so overflow may be a > >> concern on 32bit targets. > > > > Can the size be extended on 32b targets? If so, we can build a > > monotonically increasing, larger atomic counter on 32b targets too. > > that would need some changes > > the generation count up to which a dtv is set up in a > thread is stored in dtv[0] which is 32bit on a 32bit > system.
The other 32b half could be somewhere else. It's more efficient if it's in the same cacheline of course, but it's not required to be adjacent. > >> Proposed fix(es): > > > > Before focusing on taking the lock, I'd suggest to look at the bigger > > picture (ie, the abstract operations and atomicity and ordering > > requirements). > > It can be possible that a clean redesign is actually simpler than trying > > to fix a system that is already complex and which we don't understand > > 100%. > > i think that's optimistic but i can write up what i think > the requirements are. Thanks. Better documentation of what the requirements are would help even if a rewrite is not what we decide to do. ^ permalink raw reply [flat|nested] 8+ messages in thread
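One possible shape for that split counter (a hypothetical sketch, not glibc code): two 32bit atomics that need not be adjacent, a single writer serialized by the dlopen lock, and a seqlock-style retry on the reader side so that a reader never observes a torn or non-monotonic 64bit value:

```c
/* Hypothetical sketch, not from glibc: a 64-bit monotonic generation
   counter assembled from 32-bit atomics, for 32-bit targets without
   native 64-bit atomic load/store.  Single writer (assumed serialized
   by the dlopen lock); readers retry while an update is in flight.  */
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_least32_t gen_seq;         /* odd => update in flight */
static atomic_uint_least32_t gen_lo, gen_hi;  /* the two halves */

static void
gen_bump (void)
{
  uint32_t s = atomic_load_explicit (&gen_seq, memory_order_relaxed);
  atomic_store_explicit (&gen_seq, s + 1, memory_order_relaxed);
  atomic_thread_fence (memory_order_release);
  uint32_t l = atomic_load_explicit (&gen_lo, memory_order_relaxed) + 1;
  atomic_store_explicit (&gen_lo, l, memory_order_relaxed);
  if (l == 0)  /* carry into the high half on wraparound */
    atomic_fetch_add_explicit (&gen_hi, 1, memory_order_relaxed);
  atomic_store_explicit (&gen_seq, s + 2, memory_order_release);
}

static uint64_t
gen_read (void)
{
  uint32_t s1, s2, l, h;
  do
    {
      s1 = atomic_load_explicit (&gen_seq, memory_order_acquire);
      l = atomic_load_explicit (&gen_lo, memory_order_relaxed);
      h = atomic_load_explicit (&gen_hi, memory_order_relaxed);
      atomic_thread_fence (memory_order_acquire);
      s2 = atomic_load_explicit (&gen_seq, memory_order_relaxed);
    }
  while ((s1 & 1) != 0 || s1 != s2);
  return ((uint64_t) h << 32) | l;
}
```

For a per-thread counter like dtv[0] that only its own thread reads, something simpler might suffice; the retry loop is only needed for a counter that other threads read concurrently with the writer.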
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races 2016-12-07 14:08 ` Torvald Riegel @ 2016-12-21 15:27 ` Szabolcs Nagy 2017-01-05 17:53 ` Szabolcs Nagy 0 siblings, 1 reply; 8+ messages in thread From: Szabolcs Nagy @ 2016-12-21 15:27 UTC (permalink / raw) To: Torvald Riegel; +Cc: nd, GNU C Library On 07/12/16 14:07, Torvald Riegel wrote: > On Tue, 2016-12-06 at 17:22 +0000, Szabolcs Nagy wrote: >> On 05/12/16 19:24, Torvald Riegel wrote: >>> On Mon, 2016-11-28 at 19:03 +0000, Szabolcs Nagy wrote: >>>> (5) GL(dl_tls_generation) is not checked for overflow consistently and >>>> slotinfo entries can get assigned an overflowed generation count, its type >>>> is size_t and both dlopen and dlclose increment it, so overflow may be a >>>> concern on 32bit targets. >>> >>> Can the size be extended on 32b targets? If so, we can build a >>> monotonically increasing, larger atomic counter on 32b targets too. >> >> the generation count up to which a dtv is set up in a >> thread is stored in dtv[0] which is 32bit on a 32bit >> system. > > The other 32b half could be somewhere else. It's more efficient if it's > in the same cacheline of course, but it's not required to be adjacent. i was wrong about this: a dtv entry already holds 2 pointers, so it is 64bit even on a 32bit target and a 64bit counter fits. >>> Before focusing on taking the lock, I'd suggest to look at the bigger >>> picture (ie, the abstract operations and atomicity and ordering >>> requirements). >>> It can be possible that a clean redesign is actually simpler than trying >>> to fix a system that is already complex and which we don't understand >>> 100%. >> >> i think that's optimistic but i can write up what i think >> the requirements are. > > Thanks. Better documentation of what the requirements are would help > even if a rewrite is not what we decide to do.
i tried to collect the requirements, the current implementation is far from them: as-safety of tls access and ctor/dtor handling without locks are difficult to fix. if as-safe tls access is not required then the concurrency fixes are much simpler.

simplified model:

m: module.
m.i: modid.
t: thread.
t.dtv: dtv pointer.
t.dtv[i]: dtv entry i (tls for t,m where m.i==i).

constraints:

tls of a currently loaded module m in thread t is at t.dtv[m.i].

m.i is unique during the lifetime of m: it is assigned before ctors of m are entered and may be reused after dtors of m return.

tls access to m in thread t is only valid during the lifetime of t and m (after ctors of m start and before dtors of m end).

during the lifetime of a thread its dtv may be updated:
t.dtv may need to be resized (if an m is loaded with a larger m.i).
t.dtv[i] may need to be freed (if an m is dlclosed).
t.dtv[i] may need to be allocated (if an m is dlopened).

if dtv updates are not immediate for all threads at dlopen and dlclose time, then the threads need a way to tell if t.dtv[i] is out-of-date in case modid i is reused by the time the dtv update happens. (this can be done by counting dlopen/dlclose calls, remembering the counter of the last dtv update, and globally tracking the last counter for each modid i: if global_counter[i] > t.dtv_counter then t.dtv[i] is out-of-date. such a counter should be 64bit.)

dtv update consists of three kinds of operations:
1) allocate dtv and dtv entries (malloc)
2) unset out-of-date entries (free)
3) resize dtv, set dtv entries (memcpy and ptr update)

1) alloc:
pthread_create and dlopen need to be synchronized such that either the sync in dlopen or the sync in pthread_create happens before the other (for all dlopen/pthread_create pairs); the one that happens later should do the allocation. (this is needed because tls access is valid right after dlopen and pthread_create, but must be as-safe, so it cannot easily allocate.)
2) free:
t.dtv[m.i] should be freed eventually after dlclose of m or after t exits. this is difficult because t.dtv[m.i] needs to be updated if m.i is reused and the tls of the new module is accessed, but tls access cannot do the free (not as-safe). so the options are
- dlclose of m frees t.dtv[m.i] for all t (non-trivial).
- allocated dtv entry pointers are remembered somewhere and garbage collected in some way (such that overwriting t.dtv[i] does not leak memory).

3) update dtv pointer and dtv entry:
either dlopen stops all threads and takes care of dtv updates, or it has to be done lazily during tls access, in which case signals may need to be blocked.

an m with static tls is a special case:
1) if m is already loaded when t is created, then pthread_create needs to do setup (copy the tls init image) that requires accessing m, so either pthread_create needs to sync with dlclose or it is invalid to unload an m with static tls; in the latter case pthread_create should be able to walk the currently loaded modules and tell if they have static tls without accessing the module structures that are freed during dlclose.
2) if m is loaded after t is created, then dlopen should do the setup for the current thread, but i think it has to do the setup for other threads as well (?). (in principle static tls cannot be dynamically loaded/unloaded, but i'm not sure what the requirements are if glibc internal libraries are dlopened.)

indirectly related to tls:

ctor/dtor handling should be callback safe: dlopen and dlclose must not hold internal locks, otherwise correct ctor/dtor code may deadlock.

dlopen and pthread_create should roll back state changes on allocation failure and report the errors (some state changes may be harmless; this needs further analysis). ^ permalink raw reply [flat|nested] 8+ messages in thread
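The out-of-date check in the model above (global_counter[i] > t.dtv_counter) could be sketched as follows. All names are invented for illustration, and the fixed-size table stands in for whatever growable structure a real implementation would use:

```c
/* Sketch of the out-of-date detection described in the model; every
   name here is hypothetical.  Each modid slot records the 64-bit
   dlopen/dlclose counter at which it last changed; a thread records
   the counter up to which its dtv is current.  */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_MODID 16

/* Last change time per modid, written by dlopen/dlclose.  */
static _Atomic uint64_t global_counter[MAX_MODID];

struct thread
{
  uint64_t dtv_counter;   /* counter as of the last dtv update */
  void *dtv[MAX_MODID];
};

/* dlopen/dlclose side: record that modid i changed at time 'now'.  */
static void
mod_changed (size_t i, uint64_t now)
{
  atomic_store_explicit (&global_counter[i], now, memory_order_release);
}

/* tls access side: is t->dtv[i] usable, or does it belong to a module
   that was dlclosed (and the modid possibly reused) since the
   thread's last dtv update?  */
static int
dtv_entry_stale (const struct thread *t, size_t i)
{
  uint64_t c = atomic_load_explicit (&global_counter[i],
                                     memory_order_acquire);
  return c > t->dtv_counter;
}
```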
* Re: [RFC][BZ #19329] high level description of pthread_create vs dlopen races 2016-12-21 15:27 ` Szabolcs Nagy @ 2017-01-05 17:53 ` Szabolcs Nagy 0 siblings, 0 replies; 8+ messages in thread From: Szabolcs Nagy @ 2017-01-05 17:53 UTC (permalink / raw) To: Torvald Riegel; +Cc: nd, GNU C Library On 21/12/16 15:27, Szabolcs Nagy wrote: > i tried to collect the requirements, the current implementation > is far from it: as-safety of tls access and ctor/dtor without > locks are difficult to fix. if as-safe tls access is not required > then the concurrency fixes are much simpler. the following steps might work for incremental bug fixing:

1) fix the gen counter to be 64bit on 32bit targets and remove all overflow checks and related aborts. (requires 64bit atomic load and store.)
2) make sure the slotinfo update does not access the link_map of a module that may be dlclosed concurrently. (the tricky bit is handling static tls initialization in pthread_create: it needs either syncing with dlclose or remembering static tls info in the global slotinfo list; i'd try to do the latter.)
3) fix pthread_create so that it initializes the dtv with the right size for a particular generation and iterates over slotinfo entries of modules with static tls correctly to initialize tls and the related dtv entries.

this would fix BZ 19329 and the following issues would remain:
- tls access allocates and frees dtv entries, may resize the dtv and may lock the load_lock.
- dlopen and pthread_create abort if resize_dtv fails, and dlopen may leave inconsistent state behind if another allocation fails.
- dtors and ctors are run while the load_lock is held.
these seem hard to fix to me. > simplified model: > > m: module. > m.i: modid. > t: thread. > t.dtv: dtv pointer. > t.dtv[i]: dtv entry i (tls for t,m where m.i==i). > > constraints: > > tls of a currently loaded module m in thread t is at t.dtv[m.i].
> > m.i is unique during the lifetime of m: it is assigned before > ctors of m are entered and may be reused after dtors of m return. > > tls access to m in thread t is only valid during the lifetime > of t and m (after ctors of m start and before dtors of m end). > > during the lifetime of a thread its dtv may be updated: > t.dtv may need to be resized (if an m is loaded with a larger m.i). > t.dtv[i] may need to be freed (if an m is dlclosed). > t.dtv[i] may need to be allocated (if an m is dlopened). > > if dtv updates are not immediate for all threads at dlopen and > dlclose time, then the threads need a way to tell if t.dtv[i] > is out-of-date in case modid i is reused by the time the dtv > update happens. (this can be done by counting dlopen/dlclose > calls and remembering the counter of the last dtv update and > globally tracking the last counter for each modid i: if > global_counter[i] > t.dtv_counter then t.dtv[i] is out-of-date. > such a counter should be 64bit). > > dtv update consists of three kinds of operations: > 1) allocate dtv and dtv entries (malloc) > 2) unset out-of-date entries (free) > 3) resize dtv, set dtv entries (memcpy and ptr update) > > 1) alloc: > pthread_create and dlopen need to be synchronized such that > either the sync in dlopen or the sync in pthread_create happens before > the other (for all dlopen/pthread_create pairs); the one that > happens later should do the allocation. > (this is needed because right after dlopen and pthread_create > tls access is valid, but must be as-safe so it cannot easily > allocate). > > 2) free: > t.dtv[m.i] should be freed eventually after dlclose of m or > after t exits. this is difficult because t.dtv[m.i] needs to > be updated if m.i is reused and the tls of the new module is > accessed, but tls access cannot do the free (not as-safe). > so the options are > - dlclose of m frees t.dtv[m.i] for all t (non-trivial).
> - allocated dtv entry pointers are remembered somewhere > and garbage collected in some way (such that overwriting > t.dtv[i] does not leak memory). > > 3) update dtv pointer and dtv entry: > either dlopen stops all threads and takes care of dtv updates > or it has to be done during tls access lazily in which case > signals may need to be blocked. > > an m with static tls is a special case: > 1) if m is already loaded when t is created, then pthread_create > needs to do setups (copy tls init image) that requires accessing > m, so either pthread_create needs to sync with dlclose or it > is invalid to unload an m with static tls, in the latter case > pthread_create should be able to walk the currently loaded > modules and tell if they have static tls without accessing the > module structures that are freed during dlclose. > 2) if m is loaded after t is created, then dlopen should do the > setup for the current thread, but i think it has to do the setup > for other threads as well (?). (in principle static tls cannot > be dynamically loaded/unloaded but i'm not sure what the > requirements are if glibc internal libraries are dlopened.) > > indirectly related to tls: > > ctor,dtor handling should be callback safe: dlopen and dlclose > must not hold internal locks otherwise correct ctor/dtor code may > deadlock. > > dlopen and pthread_create should roll back state changes on allocation > failure and report the errors (some state changes may be harmless; > this needs further analysis). > ^ permalink raw reply [flat|nested] 8+ messages in thread
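For step 3 of the incremental fixes proposed above (initializing the dtv for a particular generation), the pthread_create side could be sketched like this -- invented names, not the actual _dl_allocate_tls_init. The key point is that entries newer than the generation snapshot are skipped and left for lazy allocation, rather than asserted on:

```c
/* Hypothetical sketch of dtv initialization against a generation
   snapshot; not glibc code.  */
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 8

struct slot
{
  atomic_size_t gen;   /* 0 = unused; otherwise generation of entry */
  void *map;           /* stands in for the module's tls setup info */
};

static struct slot slots[NSLOTS];
static atomic_size_t dl_tls_generation;

/* Initialize a new thread's dtv up to a snapshot of the generation
   counter.  Entries newer than the snapshot are deliberately left
   unset (deferred to first access) instead of asserting, which is the
   assertion-(b) failure mode in current dl-tls.c.  Returns the
   generation the dtv is now current up to.  */
static size_t
dtv_init (void **dtv, size_t ndtv)
{
  size_t snap = atomic_load_explicit (&dl_tls_generation,
                                      memory_order_acquire);
  for (size_t i = 0; i < NSLOTS && i < ndtv; i++)
    {
      size_t g = atomic_load_explicit (&slots[i].gen,
                                       memory_order_acquire);
      if (g == 0 || g > snap)
        dtv[i] = NULL;          /* unused, or too new: allocate lazily */
      else
        dtv[i] = slots[i].map;  /* stands in for tls setup for module */
    }
  return snap;
}
```

This is precisely the situation behind the assertion failures quoted at the start of the thread: a slotinfo entry added by a concurrent dlopen is simply treated as not-yet-loaded for the new thread, instead of tripping an assert.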
Thread overview: 8+ messages -- 2016-11-28 19:03 [RFC][BZ #19329] high level description of pthread_create vs dlopen races Szabolcs Nagy 2016-12-05 19:24 ` Torvald Riegel 2016-12-06 17:23 ` Szabolcs Nagy 2016-12-06 17:48 ` Joseph Myers 2016-12-06 18:31 ` Szabolcs Nagy 2016-12-07 14:08 ` Torvald Riegel 2016-12-21 15:27 ` Szabolcs Nagy 2017-01-05 17:53 ` Szabolcs Nagy