From: Dan Bonachea
Date: Sun, 20 Jan 2019 20:33:00 -0000
Subject: Bug: Incorrect signal behavior in multi-threaded processes
To: cygwin@cygwin.com
Cc: gasnet-devel@lbl.gov, Dan Bonachea
X-SW-Source: 2019-01/txt/msg00149.txt.bz2

I'm writing to report some POSIX compliance problems with Cygwin signal
handling in the presence of multiple pthreads that our group has encountered
in our parallel scientific computing codes.
A minimal test program is copied below and also available here:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

I believe the test program is fully compliant with ISO C 99 and POSIX
1003.1-2016. In a nutshell, it registers one signal handler, spawns a number
of pthreads, and then synchronously generates a signal from exactly one thread
while others sit in a pthread_barrier_wait. The "throwing" thread and signal
number can be varied from the command line, and diagnostic output indicates
what happened.

As a basis for comparison, here are a few examples of the test program running
on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5 demonstrating what I
believe to be the *correct*/POSIX-required behavior:

$ ./thread-signal 1 11 # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x7f8dd0b13700): Hello
thread 4 (0x7f8dcf310700): Hello
thread 2 (0x7f8dd0312700): Hello
thread 3 (0x7f8dcfb11700): Hello
thread 0 (0x7f8dd131a740): Hello
thread 1 (0x7f8dd0b13700): sending signal 11..
sig_handler: ENTERING
sig_handler: running on thread 0x7f8dd0b13700
sig_handler: calling _exit()

$ ./thread-signal 1 6 # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x7f1a2451d700): Hello
thread 2 (0x7f1a23d1c700): Hello
thread 0 (0x7f1a24d24740): Hello
thread 3 (0x7f1a2351b700): Hello
thread 4 (0x7f1a22d1a700): Hello
thread 1 (0x7f1a2451d700): sending signal 6..
sig_handler: ENTERING
sig_handler: running on thread 0x7f1a2451d700
sig_handler: calling _exit()

$ ./thread-signal 1 2 # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x7f2a29a3f700): Hello
thread 2 (0x7f2a2923e700): Hello
thread 0 (0x7f2a2a246740): Hello
thread 3 (0x7f2a28a3d700): Hello
thread 4 (0x7f2a2823c700): Hello
thread 1 (0x7f2a29a3f700): sending signal 2..
sig_handler: ENTERING
sig_handler: running on thread 0x7f2a29a3f700
sig_handler: calling _exit()

This output indicates that in all cases on Linux, the unique thread generating
the signal jumps to the pre-registered signal handler while other threads
remain stalled at the barrier, as required by POSIX signalling semantics (e.g.
see raise() on p.1765 of POSIX 1003.1-2016).

The test program and commands above demonstrate substantially the same,
correct behavior on ALL of the following platform combinations:

* Linux-3.10/{i686,x86_64}/{gcc-4.8.5,gcc-8.2.0,clang-7.0.0}
* Solaris-11.3/x86_64/{gcc-7.2.0,SunStudio-12.5}
* FreeBSD-12.0/x86_64/clang-6.0.1
* MicrosoftWSL-Ubuntu18.04/x86_64/{gcc-7.3.0,clang-6.0.0}
  - This notably runs on Microsoft Windows! (10.0.17763.288)

Unfortunately the observed behavior on Cygwin (various versions) deviates far
from our expectations and (based on my understanding) from the behavior
required by current POSIX specs.

Here is example output from Cygwin 2.11.1(0.329/5/3) 2018-09-05 on Windows 10,
build 17763.288 with gcc 7.3.0:

$ ./thread-signal 1 11 # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 11..

$ ./thread-signal 1 6 # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 4 (0x600048a70): Hello
thread 0 (0x600000010): Hello
thread 1 (0x600048770): sending signal 6..
sig_handler: ENTERING
Abort

$ ./thread-signal 1 2 # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 2..
sig_handler: ENTERING
sig_handler: ERROR - signal delivered to wrong thread!
thread 1 (0x600048770): ERROR: STILL ALIVE!
sig_handler: running on thread 0x600000010
sig_handler: calling _exit()

The second case in particular (abort() called by one non-primordial thread)
appears to have non-deterministic/racing behavior. The evidence seems to
indicate the SIGABRT is delivered to the primordial thread (the wrong thread)
via the signal handler and concurrently also delivered to the SIG_DFL handler
of other threads, which then race to invoke abortive process termination
(which should not be reachable in any correct execution of the program).

It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577) specifically
requires that any given signal should be delivered to exactly one thread. Also
the spec for abort (p.565) requires the signal to be delivered as if by
`raise(SIGABRT)` (p.1765), a.k.a. `pthread_kill(pthread_self(),SIGABRT)`
(p.1657), which implies any registered SIGABRT handler should run only on the
thread which called abort().

The choice of SIGINT in the third example is arbitrary, and representative of
similar deliver-to-wrong-thread behavior also observed on Cygwin for all of
the following signals:

HUP, INT, QUIT, ILL, EMT, TRAP, FPE, BUS, SYS, PIPE, ALRM, TERM, URG, TSTP,
CONT, CHLD, TTIN, TTOU, IO, USR1, USR2, and RTMIN..RTMAX

All of these consequently appear to be unreliable for thread-specific
signalling in Cygwin programs.

Note that in all cases examined, generating the signal from the "primordial"
thread 0 (by changing the 1 to a 0 in the commands above) yields nominally
correct behavior; in that case, the signal handler is correctly invoked by the
primordial thread and the others remain undisturbed.
However it appears the primordial thread is the ONLY thread that enjoys the
special status of POSIX-compliant signal behavior on Cygwin.

Substantially similar broken behavior has been observed for NON-primordial
threads on ALL of the following Cygwin version combinations (spread across
three different workstations):

* Cygwin64-2.11.1(0.329/5/3)-{win7,win10}-{gcc-7.3.0,clang-5.0.1}
* Cygwin64-2.10.0(0.325/5/3)-{win7,win10}-{gcc-6.4.0,clang-5.0.1}
* Cygwin64-2.6.0(0.304/5/3)-win7-{gcc-5.4.0,clang-3.8.1}

Possibly of note, a 32-bit version of Cygwin (i686 2.11.1(0.329/5/3))
correctly handles SIGSEGV, but fails all the other cases in substantially the
same manner as Cygwin64.

In case you're wondering why we care: The SIGABRT and SIGSEGV misbehaviors are
particularly problematic for our distributed-memory codes that register fatal
signal handlers to ensure correct tear-down of a multi-process job if/when any
process crashes or aborts (e.g. due to an assertion failure). Cygwin
unfortunately makes it effectively impossible to reliably handle abort() calls
or SIGSEGVs generated by programming errors in a multi-threaded program,
unless one can arrange to only generate the signal from the primordial thread
(impractical for our applications).

Searching around the Cygwin lists I find some evidence that tangentially
similar problems with signals and multithreading have been discussed before,
but perhaps not adequately isolated/demonstrated.

Is there any hope of this situation ever improving?

Thanks for your consideration.
-Dan Bonachea

Test program code below, also available for download at:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

=====================================================================
// Thread/signal tester by Dan Bonachea
// compile with a command like:
//   gcc -D_GNU_SOURCE -std=c99 -pedantic -pthread thread-signal.c -o thread-signal
// usage:
//   thread-signal [sender_thread_index] [signal_number]
//
// page numbers in comments below refer to POSIX IEEE Std 1003.1-2016

#include <stdio.h>    // printf, sprintf, fflush
#include <stdlib.h>   // atoi, abort
#include <string.h>   // strlen, strerror
#include <stdint.h>   // uintptr_t
#include <unistd.h>   // write, fsync, _exit
#include <signal.h>   // signal, raise, sig_atomic_t
#include <pthread.h>  // pthread_* calls
#include <assert.h>   // assert

// Utilities
typedef void (*sig_handler_t)(int); // signal handler function pointer

unsigned long long thidtollu(pthread_t thid) { // map pthread_t to a unique value
  // non-portable but sufficient on all systems of interest
  return (unsigned long long)(uintptr_t)thid;
}

pthread_barrier_t barrier_object;
void barrier(void) {
  int res = pthread_barrier_wait(&barrier_object); // p.1595
  assert(res == 0 || res == PTHREAD_BARRIER_SERIAL_THREAD);
}

#define FD_STDOUT 1
#define FD_STDERR 2
void writeout(const char *msg) { // signal-safe string output and flush
  int sz = strlen(msg)+1;
  int res = write(FD_STDOUT, msg, sz);
  if (res != sz) {
    const char err[] = "write failed!\n";
    write(FD_STDERR, err, sizeof(err));
    _exit(-1);
  }
  (void)fsync(FD_STDOUT);
}

#ifndef NUMTHREAD
#define NUMTHREAD 5
#endif

// state variables
int sigid = SIGSEGV;  // signal to generate (second command-line argument)
int sender = 1;       // index of the sending thread (first command-line argument)
volatile sig_atomic_t sender_aid = 0;
volatile sig_atomic_t errs = 0;

// registered signal handler function
void sig_handler(int signum) { // p.494 defines permitted calls
  pthread_t thid = pthread_self();
  writeout("sig_handler: ENTERING\n");

  sig_atomic_t my_aid = (sig_atomic_t)thidtollu(thid);
  if (my_aid != sender_aid) {
    errs++;
    writeout("sig_handler: ERROR - signal delivered to wrong thread!\n");
  }

#if !STRICT // sprintf technically forbidden, but doesn't affect behavior in practice
  {
    char tmp[200];
    sprintf(tmp,"sig_handler: running on thread 0x%llx\n",thidtollu(thid));
    writeout(tmp);
  }
#endif

  writeout("sig_handler: calling _exit()\n");
  _exit(errs);
}

struct thinfo {
  pthread_t thid;
  int idx;
} thread_info[NUMTHREAD];

// thread entry point
void * thread_main(void *arg) {
  struct thinfo *myinfo = arg;
  pthread_t thid = pthread_self();
  assert(pthread_equal(thid, myinfo->thid));

  printf("thread %i (0x%llx): Hello\n",myinfo->idx, thidtollu(thid));
  fflush(NULL);

  if (myinfo->idx == sender) { // this thread will send the signal
    sender_aid = (sig_atomic_t)thidtollu(thid); // record for signal handler
  }

  barrier(); // wait for all threads

  if (myinfo->idx == sender) { // this thread sends the signal
    printf("thread %i (0x%llx): sending signal %i..\n",
           myinfo->idx, thidtollu(thid), sigid);
    fflush(NULL);

    switch (sigid) {
      case SIGABRT:
        abort(); // p.565
        break;
      case SIGSEGV: {
        int *nullpt = NULL;
        *nullpt = 0; // SEGV
      } break;
      default: {
        int res = raise(sigid); // p.1765
        if (res) {
          errs++;
          printf("thread %i (0x%llx): ERROR: raise failed: %i %s\n",
                 myinfo->idx, thidtollu(thid), res, strerror(res));
          fflush(NULL);
        }
      }
    }

    errs++;
    printf("thread %i (0x%llx): ERROR: STILL ALIVE!\n",myinfo->idx, thidtollu(thid));
    fflush(NULL);
  }

  barrier(); // wait for all threads

  return NULL;
}

// process entry point
int main(int argc, char **argv) {
  if (argc > 1) sender = atoi(argv[1]);
  if (argc > 2) sigid = atoi(argv[2]);

  printf("Running test with %i threads and thread %i sending signal=%i\n",
         NUMTHREAD,sender,sigid);
  fflush(NULL);

  int ret = pthread_barrier_init(&barrier_object, NULL, NUMTHREAD); // p.1593
  assert(!ret);

  // establish a signal handler
  sig_handler_t init = signal(sigid, sig_handler); // p.1971
  assert(init == SIG_DFL || init == SIG_IGN);

  // ensure it is registered
  sig_handler_t res = signal(sigid, sig_handler);
  assert(res == sig_handler);

  printf("Spawning pthreads..\n");
  fflush(NULL);

  for (int i=1; i < NUMTHREAD; i++) { // create threads
    thread_info[i].idx = i;
    int res = pthread_create(&(thread_info[i].thid), NULL,
                             thread_main, &(thread_info[i])); // p.1633
    assert(!res);
  }

  // primordial thread is "thread 0"
  thread_info[0].idx = 0;
  thread_info[0].thid = pthread_self();
  thread_main(&(thread_info[0]));

  // should never reach this point for a catchable signal

  for (int i=1; i < NUMTHREAD; i++) { // join threads
    int res = pthread_join(thread_info[i].thid, NULL); // p.1649
    assert(!res);
  }

  printf("all threads exited!\n");
  errs++;
  return errs;
}
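The property the report relies on, that raise(sig) behaves like pthread_kill(pthread_self(), sig) and so runs the handler on the calling thread, can also be probed with a much smaller check. The following is only a sketch and is not part of the original tester: the names handler_runs_on_sender, probe_main, record_thread and struct probe are hypothetical, and the handler returns instead of calling _exit() so the check can be repeated within one process. It assumes a POSIX system with pthreads and should be built with -pthread.

```c
#define _POSIX_C_SOURCE 200809L /* assumption: request POSIX declarations such as pthread_kill */
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* All names below are hypothetical helpers for this sketch. */
static pthread_t handler_thread;              /* thread the handler ran on */
static volatile sig_atomic_t handler_ran = 0; /* set to 1 by the handler */

struct probe {
    int sig;              /* signal to generate */
    int use_pthread_kill; /* 0: raise(sig); 1: pthread_kill(pthread_self(), sig) */
    int same_thread;      /* result: 1 if the handler ran on the sending thread */
};

/* Handler that records where it ran and then returns (no _exit). */
static void record_thread(int signum) {
    (void)signum;
    handler_thread = pthread_self();
    handler_ran = 1;
}

/* Body of a non-primordial thread: send the signal to itself, then check. */
static void *probe_main(void *arg) {
    struct probe *p = arg;
    int rc = p->use_pthread_kill ? pthread_kill(pthread_self(), p->sig)
                                 : raise(p->sig);
    /* With thread-directed delivery, the handler has already run on this
       thread by the time the sending call returns. */
    p->same_thread = (rc == 0) && handler_ran &&
                     pthread_equal(handler_thread, pthread_self());
    return NULL;
}

/* Returns 1 if the handler ran on the sending thread, 0 if it did not,
   and -1 if the handler or the thread could not be set up. */
int handler_runs_on_sender(int sig, int use_pthread_kill) {
    struct probe p = { sig, use_pthread_kill, 0 };
    pthread_t th;
    void (*old)(int) = signal(sig, record_thread);
    if (old == SIG_ERR) return -1;
    handler_ran = 0;
    if (pthread_create(&th, NULL, probe_main, &p) != 0) {
        signal(sig, old);
        return -1;
    }
    pthread_join(th, NULL);
    signal(sig, old); /* restore the previous disposition */
    return p.same_thread;
}
```

Calling handler_runs_on_sender(SIGINT, 0) and handler_runs_on_sender(SIGINT, 1) from main compares the raise() and pthread_kill() paths. Going by the report above, the affected Cygwin builds would presumably return 0 for the raise() path, but that is an inference from the reported output and has not been checked with this sketch.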