Bug: Incorrect signal behavior in multi-threaded processes

public inbox for cygwin@cygwin.com
 help / color / mirror / Atom feed

* Bug: Incorrect signal behavior in multi-threaded processes
@ 2019-01-20 20:33 Dan Bonachea
  2019-01-22  9:13 ` Corinna Vinschen
  2019-01-22 11:16 ` E. Madison Bray
  0 siblings, 2 replies; 11+ messages in thread
From: Dan Bonachea @ 2019-01-20 20:33 UTC (permalink / raw)
  To: cygwin; +Cc: gasnet-devel, Dan Bonachea

I'm writing to report some POSIX compliance problems with Cygwin
signal handling in the presence of multiple pthreads that our group
has encountered in our parallel scientific computing codes.

A minimal test program is copied below and also available here:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

I believe the test program is fully compliant with ISO C 99 and POSIX
1003.1-2016. In a nutshell, it registers one signal handler, spawns a
number of pthreads, and then synchronously generates a signal from
exactly one thread while others sit in a pthread_barrier_wait. The
"throwing" thread and signal number can be varied from the command
line, and diagnostic output indicates what happened.

As a basis for comparison, here are a few examples of the test program
running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
demonstrating what I believe to be the *correct*/POSIX-required
behavior:

$ ./thread-signal 1 11    # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x7f8dd0b13700): Hello
thread 4 (0x7f8dcf310700): Hello
thread 2 (0x7f8dd0312700): Hello
thread 3 (0x7f8dcfb11700): Hello
thread 0 (0x7f8dd131a740): Hello
thread 1 (0x7f8dd0b13700): sending signal 11..
sig_handler: ENTERING
sig_handler: running on thread 0x7f8dd0b13700
sig_handler: calling _exit()

$ ./thread-signal 1 6    # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x7f1a2451d700): Hello
thread 2 (0x7f1a23d1c700): Hello
thread 0 (0x7f1a24d24740): Hello
thread 3 (0x7f1a2351b700): Hello
thread 4 (0x7f1a22d1a700): Hello
thread 1 (0x7f1a2451d700): sending signal 6..
sig_handler: ENTERING
sig_handler: running on thread 0x7f1a2451d700
sig_handler: calling _exit()

$ ./thread-signal 1 2        # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x7f2a29a3f700): Hello
thread 2 (0x7f2a2923e700): Hello
thread 0 (0x7f2a2a246740): Hello
thread 3 (0x7f2a28a3d700): Hello
thread 4 (0x7f2a2823c700): Hello
thread 1 (0x7f2a29a3f700): sending signal 2..
sig_handler: ENTERING
sig_handler: running on thread 0x7f2a29a3f700
sig_handler: calling _exit()

This output indicates that in all cases on Linux, the unique thread
generating the signal jumps to the pre-registered signal handler while
other threads remain stalled at the barrier, as required by POSIX
signalling semantics (e.g. see raise() on p.1765 of POSIX
1003.1-2016). The test program and commands above demonstrate the
substantially same, correct behavior on ALL of the following platform
combinations:

* Linux-3.10/{i686,x86_64}/{gcc-4.8.5,gcc-8.2.0,clang-7.0.0}
* Solaris-11.3/x86_64/{gcc-7.2.0,SunStudio-12.5}
* FreeBSD-12.0/x86_64/clang-6.0.1
* MicrosoftWSL-Ubuntu18.04/x86_64/{gcc-7.3.0,clang-6.0.0)
    - This notably runs on Microsoft Windows! (10.0.17763.288)

Unfortunately the observed behavior on Cygwin (various versions)
deviates far from our expectations and (based on my understanding)
from the behavior required by current POSIX specs. Here is example
output from Cygwin 2.11.1(0.329/5/3) 2018-09-05 on Windows 10, build
17763.288 with gcc 7.3.0:

$ ./thread-signal 1 11    # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 11..
<process terminated, without calling handler from ANY thread>

$ ./thread-signal 1 6    # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 4 (0x600048a70): Hello
thread 0 (0x600000010): Hello
thread 1 (0x600048770): sending signal 6..
sig_handler: ENTERING
Abort
<non-deterministic race leads to varying behavior here>

$ ./thread-signal 1 2    # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 2..
sig_handler: ENTERING
sig_handler: ERROR - signal delivered to wrong thread!
thread 1 (0x600048770): ERROR: STILL ALIVE!
sig_handler: running on thread 0x600000010
sig_handler: calling _exit()

The second case in particular (abort() called by one non-primordial
thread) appears to have non-deterministic/racing behavior. The
evidence seems to indicate the SIGABRT is delivered to the primordial
thread (the wrong thread) via the signal handler and concurrently also
delivered to the SIG_DFL handler of other threads who then race to
invoke abortive process termination (which should not be reachable in
any correct execution of the program). It's worth noting POSIX
1003.1-2016 sec XRAT.B.2.4.1 (p.3577) specifically requires that any
given signal should be delivered to exactly one thread. Also the spec
for abort (p.565) requires the signal to be delivered as if by
`raise(SIGABRT)` (p.1765) aka. `pthread_kill(pthread_self(),SIGABRT)`
(p.1657), which implies any registered SIGABRT handler should run only
on the thread which called abort().

The choice of SIGINT in the third example is arbitrary, and
representative of similar deliver-to-wrong-thread behavior also
observed on Cygwin for all of the following signals:
  HUP, INT, QUIT, ILL, EMT, TRAP, FPE, BUS, SYS, PIPE, ALRM, TERM, URG,
  TSTP, CONT, CHLD, TTIN, TTOU, IO, USR1, USR2, and RTMIN..RTMAX
All of which consequently appear to be unreliable for thread-specific
signalling in Cygwin programs.

Note that in all cases examined, generating the signal from the
"primordial" thread 0 (by changing the 1 to a 0 in the commands above)
yields nominally correct behavior; in that case, the signal handler is
correctly invoked by the primordial thread and the others remain
undisturbed. However it appears the primordial thread is the ONLY
thread that enjoys the special status of POSIX-compliant signal
behavior on Cygwin. Substantially similar broken behavior has been
observed for NON-primordial threads on ALL of the following Cygwin
version combinations (spread across three different workstations):

* Cygwin64-2.11.1(0.329/5/3)-{win7,win10}-{gcc-7.3.0,clang-5.0.1}
* Cygwin64-2.10.0(0.325/5/3)-{win7,win10}-{gcc-6.4.0,clang-5.0.1}
* Cygwin64-2.6.0(0.304/5/3)-win7-{gcc-5.4.0,clang-3.8.1}
* Cygwin64-2.6.0(0.304/5/3)-win7-{gcc-5.4.0,clang-3.8.1}

Possibly of note, a 32-bit version of Cygwin (i686 2.11.1(0.329/5/3))
correctly handles SIGSEGV, but fails all the other cases in
substantially the same manner as Cygwin64.

In case you're wondering why we care: The SIGABRT and SIGSEGV
misbehaviors are particularly problematic for our distributed-memory
codes that register fatal signal handlers to ensure correct tear-down
of a multi-process job if/when any process crashes or aborts (e.g. due
to an assertion failure). Cygwin unfortunately makes it effectively
impossible to reliably handle abort()'s or SIGSEGV's generated by
programming errors in a multi-threaded program, unless one can arrange
to only generate the signal from the primordial thread (impractical
for our applications).

Searching around the Cygwin lists I find some evidence that
tangentially similar problems with signals and multithreading have
been discussed before, but perhaps not adequately
isolated/demonstrated.

Is there any hope of this situation ever improving?

Thanks for your consideration.
-Dan Bonachea

Test program code below, also available for download at:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

=====================================================================
// Thread/signal tester by Dan Bonachea
// compile with a command like:
//   gcc -D_GNU_SOURCE -std=c99 -pedantic -pthread thread-signal.c -o
thread-signal
// usage:
//   thread-signal <thread_idx[0..4]> <signal_number>
//
// page numbers in comments below refer to POSIX IEEE Std 1003.1-2016
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <signal.h>

// Utilities
typedef void (*sig_handler_t)(int); // signal handler function pointer

unsigned long long thidtollu(pthread_t thid) { // map pthread_t to a
unique value
  // non-portable but sufficient on all systems of interest
  return (unsigned long long)(uintptr_t)thid;
}

pthread_barrier_t barrier_object;
void barrier(void) {
  int res = pthread_barrier_wait(&barrier_object); // p.1595
  assert(res == 0 || res == PTHREAD_BARRIER_SERIAL_THREAD);
}

#define FD_STDOUT 1
#define FD_STDERR 2
void writeout(const char *msg) { // signal-safe string output and flush
  int sz = strlen(msg)+1;
  int res = write(FD_STDOUT, msg, sz);
  if (res != sz) {
    const char err[] = "write failed!\n";
    write(FD_STDERR, err, sizeof(err));
    _exit(-1);
  }
  (void)fsync(FD_STDOUT);
}

#ifndef NUMTHREAD
#define NUMTHREAD 5
#endif

// state variables
int sigid = SIGSEGV;
int sender = 1;
volatile sig_atomic_t sender_aid = 0;
volatile sig_atomic_t errs = 0;

// registered signal handler function
void sig_handler(int signum) { // p.494 defines permitted calls
  pthread_t thid = pthread_self();
  writeout("sig_handler: ENTERING\n");
  sig_atomic_t my_aid = (sig_atomic_t)thidtollu(thid);
  if (my_aid != sender_aid) {
    errs++;
    writeout("sig_handler: ERROR - signal delivered to wrong thread!\n");
  }
  #if !STRICT // sprintf technically forbidden, but doesn't affect
behavior in practice
  { char tmp[200];
    sprintf(tmp,"sig_handler: running on thread 0x%llx\n",thidtollu(thid));
    writeout(tmp);
  }
  #endif
  writeout("sig_handler: calling _exit()\n");
  _exit(errs);
}
struct thinfo {
  pthread_t thid;
  int idx;
} thread_info[NUMTHREAD];

// thread entry point
void * thread_main(void *arg) {
  struct thinfo *myinfo = arg;
  pthread_t thid = pthread_self();
  assert(pthread_equal(thid, myinfo->thid));

  printf("thread %i (0x%llx): Hello\n",myinfo->idx, thidtollu(thid));
fflush(NULL);

  if (myinfo->idx == sender) { // this thread will send the signal
    sender_aid = (sig_atomic_t)thidtollu(thid); // record for signal handler
  }

  barrier(); // wait for all threads

  if (myinfo->idx == sender) { // this thread sends the signal
    printf("thread %i (0x%llx): sending signal %i..\n",
           myinfo->idx, thidtollu(thid), sigid);
    fflush(NULL);

    switch (sigid) {
      case SIGABRT:
        abort(); // p.565
      break;
      case SIGSEGV: {
        int *nullpt = NULL;
        *nullpt = 0; // SEGV
      }
      break;
      default: {
        int res = raise(sigid); // p.1765
        if (res) {
          errs++;
          printf("thread %i (0x%llx): ERROR: raise failed: %i %s\n",
                  myinfo->idx, thidtollu(thid), res, strerror(res));
fflush(NULL);
        }
      }
    }
    errs++;
    printf("thread %i (0x%llx): ERROR: STILL ALIVE!\n",myinfo->idx,
thidtollu(thid));
    fflush(NULL);
  }

  barrier(); // wait for all threads
  return NULL;
}

// process entry point
int main(int argc, char **argv) {

  if (argc > 1) sender = atoi(argv[1]);
  if (argc > 2) sigid = atoi(argv[2]);

  printf("Running test with %i threads and thread %i sending signal=%i\n",
         NUMTHREAD,sender,sigid); fflush(NULL);

  int ret = pthread_barrier_init(&barrier_object, NULL, NUMTHREAD); // p.1593
  assert(!ret);

  // establish a signal handler
  sig_handler_t init = signal(sigid, sig_handler); // p.1971
  assert(init == SIG_DFL || init == SIG_IGN);

  // ensure it is registered
  sig_handler_t res = signal(sigid, sig_handler);
  assert(res == sig_handler);

  printf("Spawning pthreads..\n"); fflush(NULL);

  for (int i=1; i < NUMTHREAD; i++) { // create threads
    thread_info[i].idx = i;
    int res = pthread_create(&(thread_info[i].thid), NULL,
                             thread_main, &(thread_info[i])); // p.1633
    assert(!res);
  }

  // primordial thread is "thread 0"
  thread_info[0].idx = 0;
  thread_info[0].thid = pthread_self();
  thread_main(&(thread_info[0]));

  // should never reach this point for a catchable signal
  for (int i=1; i < NUMTHREAD; i++) { // join threads
    int res = pthread_join(thread_info[i].thid, NULL); // p.1649
    assert(!res);
  }

  printf("all threads exited!\n");
  errs++;

  return errs;
}

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-20 20:33 Bug: Incorrect signal behavior in multi-threaded processes Dan Bonachea
@ 2019-01-22  9:13 ` Corinna Vinschen
  2019-01-22 11:16 ` E. Madison Bray
  1 sibling, 0 replies; 11+ messages in thread
From: Corinna Vinschen @ 2019-01-22  9:13 UTC (permalink / raw)
  To: cygwin; +Cc: Dan Bonachea, gasnet-devel

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]

On Jan 20 15:33, Dan Bonachea wrote:
> I'm writing to report some POSIX compliance problems with Cygwin
> signal handling in the presence of multiple pthreads that our group
> has encountered in our parallel scientific computing codes.
> 
> A minimal test program is copied below and also available here:
> https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> 
> I believe the test program is fully compliant with ISO C 99 and POSIX
> 1003.1-2016. In a nutshell, it registers one signal handler, spawns a
> number of pthreads, and then synchronously generates a signal from
> exactly one thread while others sit in a pthread_barrier_wait. The
> "throwing" thread and signal number can be varied from the command
> line, and diagnostic output indicates what happened.
> [...]
> Is there any hope of this situation ever improving?

Signal handling is a real beast and phread_barriers probably haven't
been tested as extensively yet as your testcase.

No guarantees for a patch, but I'll look into that as time permits.

Patches to fix bugs are very welcome: https://cygwin.com/contrib.html


Corinna

-- 
Corinna Vinschen
Cygwin Maintainer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-20 20:33 Bug: Incorrect signal behavior in multi-threaded processes Dan Bonachea
  2019-01-22  9:13 ` Corinna Vinschen
@ 2019-01-22 11:16 ` E. Madison Bray
  2019-01-22 20:43   ` Dan Bonachea
  1 sibling, 1 reply; 11+ messages in thread
From: E. Madison Bray @ 2019-01-22 11:16 UTC (permalink / raw)
  To: cygwin; +Cc: gasnet-devel, Dan Bonachea

On Sun, Jan 20, 2019 at 9:34 PM Dan Bonachea wrote:
>
> I'm writing to report some POSIX compliance problems with Cygwin
> signal handling in the presence of multiple pthreads that our group
> has encountered in our parallel scientific computing codes.
>
> A minimal test program is copied below and also available here:
> https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
>
> I believe the test program is fully compliant with ISO C 99 and POSIX
> 1003.1-2016. In a nutshell, it registers one signal handler, spawns a
> number of pthreads, and then synchronously generates a signal from
> exactly one thread while others sit in a pthread_barrier_wait. The
> "throwing" thread and signal number can be varied from the command
> line, and diagnostic output indicates what happened.
>
> As a basis for comparison, here are a few examples of the test program
> running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
> demonstrating what I believe to be the *correct*/POSIX-required
> behavior:

Thank you for the detailed analysis of this problem.  I haven't
personally encountered a problem like this in any of my own code,
though I'm not relying on pthread_barrier, or signal handlers being
run from specific threads.  This is relevant to my interests though,
so time permitting I might look into it just out of curiosity of
nothing else.  The behavior with SIGSEGV in particular is very
reminiscent (possibly same as) a problem I reported last year, but
never got around to fixing (except for the local workaround I used):
https://cygwin.com/ml/cygwin/2018-05/msg00333.html

I wonder if the same problem applies to thread-local stacks.  Indeed,
I ran your test program in gdb with arguments (1, 11) with a
breakpoint in myfault_altstack_handler [1] and wound up there.  But
since the segfault did not come from Cygwin itself (me.andreas is a
"san" fault handler for the current exception being handled by Cygwin,
but this is only set for exceptions generated by Cygwin itself (with
its __try/__except blocks).  In this case it's 0x0 so the exception is
not handled and the process just runs off into the weeds until it
(quickly) runs out of "vectored continue handlers" and so the process
exits (at the Windows level, without Cygwin controlling its shutdown).

For the other cases I'm not as sure what's going on, but possibly
related problems.

[1] https://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/exceptions.cc;h=205ad850e4c7b69954fadd1efe3ae9ff65e5f806;hb=HEAD#l594

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-22 11:16 ` E. Madison Bray
@ 2019-01-22 20:43   ` Dan Bonachea
  2019-01-23 12:44     ` E. Madison Bray
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Bonachea @ 2019-01-22 20:43 UTC (permalink / raw)
  To: E. Madison Bray; +Cc: cygwin, gasnet-devel

Hi Corinna and Madison, thanks for your responses.

To clarify, I'm reasonably confident the problem I'm reporting has
NOTHING to do with pthread_barrier. Our real application which
exhibits very similar symptoms does not use pthread_barrier *at all*;
pthread_barrier was merely the most convenient/concise synchronization
mechanism to produce deterministic output behavior in the minimal
example.

Indeed, when I comment out the pthread_barrier code entirely from the
example program (causing unselected non-primordial threads to exit and
the primordial thread to stall in pthread_join), I see substantially
the same misbehaviors.

Thanks,
-Dan Bonachea

On Tue, Jan 22, 2019 at 6:16 AM E. Madison Bray <erik.m.bray@gmail.com> wrote:
>
> On Sun, Jan 20, 2019 at 9:34 PM Dan Bonachea wrote:
> >
> > I'm writing to report some POSIX compliance problems with Cygwin
> > signal handling in the presence of multiple pthreads that our group
> > has encountered in our parallel scientific computing codes.
> >
> > A minimal test program is copied below and also available here:
> > https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> >
> > I believe the test program is fully compliant with ISO C 99 and POSIX
> > 1003.1-2016. In a nutshell, it registers one signal handler, spawns a
> > number of pthreads, and then synchronously generates a signal from
> > exactly one thread while others sit in a pthread_barrier_wait. The
> > "throwing" thread and signal number can be varied from the command
> > line, and diagnostic output indicates what happened.
> >
> > As a basis for comparison, here are a few examples of the test program
> > running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
> > demonstrating what I believe to be the *correct*/POSIX-required
> > behavior:
>
> Thank you for the detailed analysis of this problem.  I haven't
> personally encountered a problem like this in any of my own code,
> though I'm not relying on pthread_barrier, or signal handlers being
> run from specific threads.  This is relevant to my interests though,
> so time permitting I might look into it just out of curiosity of
> nothing else.  The behavior with SIGSEGV in particular is very
> reminiscent (possibly same as) a problem I reported last year, but
> never got around to fixing (except for the local workaround I used):
> https://cygwin.com/ml/cygwin/2018-05/msg00333.html
>
> I wonder if the same problem applies to thread-local stacks.  Indeed,
> I ran your test program in gdb with arguments (1, 11) with a
> breakpoint in myfault_altstack_handler [1] and wound up there.  But
> since the segfault did not come from Cygwin itself (me.andreas is a
> "san" fault handler for the current exception being handled by Cygwin,
> but this is only set for exceptions generated by Cygwin itself (with
> its __try/__except blocks).  In this case it's 0x0 so the exception is
> not handled and the process just runs off into the weeds until it
> (quickly) runs out of "vectored continue handlers" and so the process
> exits (at the Windows level, without Cygwin controlling its shutdown).
>
> For the other cases I'm not as sure what's going on, but possibly
> related problems.
>
> [1] https://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/exceptions.cc;h=205ad850e4c7b69954fadd1efe3ae9ff65e5f806;hb=HEAD#l594

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-22 20:43   ` Dan Bonachea
@ 2019-01-23 12:44     ` E. Madison Bray
  2019-01-29 23:22       ` Dan Bonachea
  0 siblings, 1 reply; 11+ messages in thread
From: E. Madison Bray @ 2019-01-23 12:44 UTC (permalink / raw)
  To: Dan Bonachea; +Cc: cygwin, gasnet-devel

On Tue, Jan 22, 2019 at 9:43 PM Dan Bonachea <dobonachea@lbl.gov> wrote:
>
> Hi Corinna and Madison, thanks for your responses.
>
> To clarify, I'm reasonably confident the problem I'm reporting has
> NOTHING to do with pthread_barrier. Our real application which
> exhibits very similar symptoms does not use pthread_barrier *at all*;
> pthread_barrier was merely the most convenient/concise synchronization
> mechanism to produce deterministic output behavior in the minimal
> example.
>
> Indeed, when I comment out the pthread_barrier code entirely from the
> example program (causing unselected non-primordial threads to exit and
> the primordial thread to stall in pthread_join), I see substantially
> the same misbehaviors.

Thank you for the clarification, that does make sense.

Per my previous message, the problem with SIGSEGV in particular is
(IMO a somewhat serious problem) caused by the fact that Windows
apparently does not recognize the stacks allocated for pthreads as
valid stacks, so it bypasses SEH entirely, and exceptions like
segfaults in threads do not get handled well by Cygwin.

I'm less clear on what the problem is with other signals, but it might
also be related somehow (though the SIGSEGV problem is obviously more
severe since it just results in the process immediately dying (with
exit code 0 no less :(


> On Tue, Jan 22, 2019 at 6:16 AM E. Madison Bray <erik.m.bray@gmail.com> wrote:
> >
> > On Sun, Jan 20, 2019 at 9:34 PM Dan Bonachea wrote:
> > >
> > > I'm writing to report some POSIX compliance problems with Cygwin
> > > signal handling in the presence of multiple pthreads that our group
> > > has encountered in our parallel scientific computing codes.
> > >
> > > A minimal test program is copied below and also available here:
> > > https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> > >
> > > I believe the test program is fully compliant with ISO C 99 and POSIX
> > > 1003.1-2016. In a nutshell, it registers one signal handler, spawns a
> > > number of pthreads, and then synchronously generates a signal from
> > > exactly one thread while others sit in a pthread_barrier_wait. The
> > > "throwing" thread and signal number can be varied from the command
> > > line, and diagnostic output indicates what happened.
> > >
> > > As a basis for comparison, here are a few examples of the test program
> > > running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
> > > demonstrating what I believe to be the *correct*/POSIX-required
> > > behavior:
> >
> > Thank you for the detailed analysis of this problem.  I haven't
> > personally encountered a problem like this in any of my own code,
> > though I'm not relying on pthread_barrier, or signal handlers being
> > run from specific threads.  This is relevant to my interests though,
> > so time permitting I might look into it just out of curiosity of
> > nothing else.  The behavior with SIGSEGV in particular is very
> > reminiscent (possibly same as) a problem I reported last year, but
> > never got around to fixing (except for the local workaround I used):
> > https://cygwin.com/ml/cygwin/2018-05/msg00333.html
> >
> > I wonder if the same problem applies to thread-local stacks.  Indeed,
> > I ran your test program in gdb with arguments (1, 11) with a
> > breakpoint in myfault_altstack_handler [1] and wound up there.  But
> > since the segfault did not come from Cygwin itself (me.andreas is a
> > "san" fault handler for the current exception being handled by Cygwin,
> > but this is only set for exceptions generated by Cygwin itself (with
> > its __try/__except blocks).  In this case it's 0x0 so the exception is
> > not handled and the process just runs off into the weeds until it
> > (quickly) runs out of "vectored continue handlers" and so the process
> > exits (at the Windows level, without Cygwin controlling its shutdown).
> >
> > For the other cases I'm not as sure what's going on, but possibly
> > related problems.
> >
> > [1] https://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/exceptions.cc;h=205ad850e4c7b69954fadd1efe3ae9ff65e5f806;hb=HEAD#l594

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-23 12:44     ` E. Madison Bray
@ 2019-01-29 23:22       ` Dan Bonachea
  2019-01-30 10:44         ` Corinna Vinschen
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Bonachea @ 2019-01-29 23:22 UTC (permalink / raw)
  To: E. Madison Bray; +Cc: cygwin, gasnet-devel

> A minimal test program is copied below and also available here:
> https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

> It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577)
> specifically requires that any given signal should be delivered to
> exactly one thread. Also the spec for abort (p.565) requires the
> signal to be delivered as if by `raise(SIGABRT)` (p.1765) aka.
> `pthread_kill(pthread_self(),SIGABRT)` (p.1657), which implies
> any registered SIGABRT handler should run only on the thread
> which called abort().

Poking around further, I find that replacing the signal generation
code in the test program for all cases with :

  pthread_kill(pthread_self(),sigid)

generates compliant signal delivery behavior!

This reveals that Cygwin is theoretically capable of correctly
delivering signals to a selected "non-primordial" thread; but the
various forms of signal generation exercised in the original test are
apparently not leading to correct use of that internal mechanism.

To review, the POSIX 1003.1-2017 specification for abort() says:

   The SIGABRT signal shall be sent to the calling process as if by means
   of raise() with the argument SIGABRT.

and the specification for raise() says:

    The effect of the raise() function shall be equivalent to calling:
    pthread_kill(pthread_self(), sig);

but this appears to NOT currently be the case in Cygwin.
The current implementation of raise() in winsup/cygwin/signal.cc:

 300 extern "C" int
 301 raise (int sig)
 302 {
 303   return kill (myself->pid, sig);
 304 }

I believe this is the root cause of the observed misbehaviors with
both raise() and abort(). The Cygwin implementation of raise(sig) is
incorrectly generating a process-scope signal (discarding thread
information) rather than sending the signal to the *calling* thread,
as required by POSIX, via the same mechanism as
pthread_kill(pthread_self(),sig).

If the implementation of raise() in libc was internally replaced with
pthread_kill(pthread_self(), sig), I believe that should resolve two
of the three failure modes we've seen. I have no idea what negative
consequences (if any) there may be to that proposed change.

It's worth noting that an end user could potentially deploy a
(fragile) partial workaround by macro-defining abort and raise to
pthread_kill; but that notably would fail to capture calls made from
within libc (such as the abort() call made from
cygwin/assert.cc:__assert_func() when an invocation of assert() from
<assert.h> fails).

The remaining failure mode is a SIGSEGV generated from a programming
error (e.g. null pointer dereference) on a non-primordial thread. This
should ideally be fixed to deliver a pthread_kill() to the offending
thread, instead of the current process-wide abnormal termination that
ignores signal handlers. I agree with Madison that there is probably
no user-level workaround to cover this case at all, and I don't know
what may be required in the Win API to make this happen correctly.

Thoughts?
-D

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-29 23:22       ` Dan Bonachea
@ 2019-01-30 10:44         ` Corinna Vinschen
  2019-01-30 15:48           ` Corinna Vinschen
  0 siblings, 1 reply; 11+ messages in thread
From: Corinna Vinschen @ 2019-01-30 10:44 UTC (permalink / raw)
  To: Dan Bonachea; +Cc: E. Madison Bray, cygwin, gasnet-devel

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

On Jan 29 18:21, Dan Bonachea wrote:
> > A minimal test program is copied below and also available here:
> > https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> 
> > It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577)
> > specifically requires that any given signal should be delivered to
> > exactly one thread. Also the spec for abort (p.565) requires the
> > signal to be delivered as if by `raise(SIGABRT)` (p.1765) aka.
> > `pthread_kill(pthread_self(),SIGABRT)` (p.1657), which implies
> > any registered SIGABRT handler should run only on the thread
> > which called abort().
> 
> Poking around further, I find that replacing the signal generation
> code in the test program for all cases with :
> 
>   pthread_kill(pthread_self(),sigid)
> 
> generates compliant signal delivery behavior!

Thanks, I fixed that in Cygwin.  It will be part of the 3.0 release.


Corinna

-- 
Corinna Vinschen
Cygwin Maintainer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-30 10:44         ` Corinna Vinschen
@ 2019-01-30 15:48           ` Corinna Vinschen
  2019-01-30 21:23             ` Corinna Vinschen
  0 siblings, 1 reply; 11+ messages in thread
From: Corinna Vinschen @ 2019-01-30 15:48 UTC (permalink / raw)
  To: Dan Bonachea; +Cc: E. Madison Bray, cygwin, gasnet-devel

[-- Attachment #1: Type: text/plain, Size: 1178 bytes --]

On Jan 30 11:44, Corinna Vinschen wrote:
> On Jan 29 18:21, Dan Bonachea wrote:
> > > A minimal test program is copied below and also available here:
> > > https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> > 
> > > It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577)
> > > specifically requires that any given signal should be delivered to
> > > exactly one thread. Also the spec for abort (p.565) requires the
> > > signal to be delivered as if by `raise(SIGABRT)` (p.1765) aka.
> > > `pthread_kill(pthread_self(),SIGABRT)` (p.1657), which implies
> > > any registered SIGABRT handler should run only on the thread
> > > which called abort().
> > 
> > Poking around further, I find that replacing the signal generation
> > code in the test program for all cases with :
> > 
> >   pthread_kill(pthread_self(),sigid)
> > 
> > generates compliant signal delivery behavior!
> 
> Thanks, I fixed that in Cygwin.  It will be part of the 3.0 release.

I think I have a solution for the SISEGV misbehaviour.  I have to
test it a bit but this may make it into Cygwin 3.0 if all is well.


Corinna

-- 
Corinna Vinschen
Cygwin Maintainer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-30 15:48           ` Corinna Vinschen
@ 2019-01-30 21:23             ` Corinna Vinschen
  2019-01-31  0:12               ` Dan Bonachea
  0 siblings, 1 reply; 11+ messages in thread
From: Corinna Vinschen @ 2019-01-30 21:23 UTC (permalink / raw)
  To: Dan Bonachea; +Cc: E. Madison Bray, cygwin, gasnet-devel

[-- Attachment #1: Type: text/plain, Size: 1438 bytes --]

On Jan 30 16:48, Corinna Vinschen wrote:
> On Jan 30 11:44, Corinna Vinschen wrote:
> > On Jan 29 18:21, Dan Bonachea wrote:
> > > > A minimal test program is copied below and also available here:
> > > > https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589
> > > 
> > > > It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577)
> > > > specifically requires that any given signal should be delivered to
> > > > exactly one thread. Also the spec for abort (p.565) requires the
> > > > signal to be delivered as if by `raise(SIGABRT)` (p.1765) aka.
> > > > `pthread_kill(pthread_self(),SIGABRT)` (p.1657), which implies
> > > > any registered SIGABRT handler should run only on the thread
> > > > which called abort().
> > > 
> > > Poking around further, I find that replacing the signal generation
> > > code in the test program for all cases with :
> > > 
> > >   pthread_kill(pthread_self(),sigid)
> > > 
> > > generates compliant signal delivery behavior!
> > 
> > Thanks, I fixed that in Cygwin.  It will be part of the 3.0 release.
> 
> I think I have a solution for the SISEGV misbehaviour.  I have to
> test it a bit but this may make it into Cygwin 3.0 if all is well.

Please try the latest developer snapshot from https://cygwin.com/snapshots/
or the new 3.0.0-0.3 test release.  I hope this fixes the problem
sufficiently.


Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Maintainer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-30 21:23             ` Corinna Vinschen
@ 2019-01-31  0:12               ` Dan Bonachea
  2019-01-31 19:48                 ` Corinna Vinschen
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Bonachea @ 2019-01-31  0:12 UTC (permalink / raw)
  To: cygwin, Dan Bonachea; +Cc: E. Madison Bray, gasnet-devel

On Wed, Jan 30, 2019 at 4:23 PM Corinna Vinschen
<corinna-cygwin@cygwin.com> wrote:
> > > > Poking around further, I find that replacing the signal generation
> > > > code in the test program for all cases with :
> > > >
> > > >   pthread_kill(pthread_self(),sigid)
> > > >
> > > > generates compliant signal delivery behavior!
> > >
> > > Thanks, I fixed that in Cygwin.  It will be part of the 3.0 release.
> >
> > I think I have a solution for the SISEGV misbehaviour.  I have to
> > test it a bit but this may make it into Cygwin 3.0 if all is well.
>
> Please try the latest developer snapshot from https://cygwin.com/snapshots/
> or the new 3.0.0-0.3 test release.  I hope this fixes the problem
> sufficiently.

Hi Corinna -

I've installed the 3.0.0-0.3 test release on my workstation and
confirmed that it fixes all the reported misbehaviors in the test
program.
It also appears to fix the signal-related problems encountered in our
multi-threaded production code.

Thanks so much for the prompt fixes!
-D

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Bug: Incorrect signal behavior in multi-threaded processes
  2019-01-31  0:12               ` Dan Bonachea
@ 2019-01-31 19:48                 ` Corinna Vinschen
  0 siblings, 0 replies; 11+ messages in thread
From: Corinna Vinschen @ 2019-01-31 19:48 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

On Jan 30 19:11, Dan Bonachea wrote:
> On Wed, Jan 30, 2019 at 4:23 PM Corinna Vinschen
> <corinna-cygwin@cygwin.com> wrote:
> > > > > Poking around further, I find that replacing the signal generation
> > > > > code in the test program for all cases with :
> > > > >
> > > > >   pthread_kill(pthread_self(),sigid)
> > > > >
> > > > > generates compliant signal delivery behavior!
> > > >
> > > > Thanks, I fixed that in Cygwin.  It will be part of the 3.0 release.
> > >
> > > I think I have a solution for the SISEGV misbehaviour.  I have to
> > > test it a bit but this may make it into Cygwin 3.0 if all is well.
> >
> > Please try the latest developer snapshot from https://cygwin.com/snapshots/
> > or the new 3.0.0-0.3 test release.  I hope this fixes the problem
> > sufficiently.
> 
> Hi Corinna -
> 
> I've installed the 3.0.0-0.3 test release on my workstation and
> confirmed that it fixes all the reported misbehaviors in the test
> program.
> It also appears to fix the signal-related problems encountered in our
> multi-threaded production code.
> 
> Thanks so much for the prompt fixes!

Thanks for testing!


Corinna

-- 
Corinna Vinschen
Cygwin Maintainer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-01-31 19:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-20 20:33 Bug: Incorrect signal behavior in multi-threaded processes Dan Bonachea
2019-01-22  9:13 ` Corinna Vinschen
2019-01-22 11:16 ` E. Madison Bray
2019-01-22 20:43   ` Dan Bonachea
2019-01-23 12:44     ` E. Madison Bray
2019-01-29 23:22       ` Dan Bonachea
2019-01-30 10:44         ` Corinna Vinschen
2019-01-30 15:48           ` Corinna Vinschen
2019-01-30 21:23             ` Corinna Vinschen
2019-01-31  0:12               ` Dan Bonachea
2019-01-31 19:48                 ` Corinna Vinschen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).