From: Andrew Burgess <andrew.burgess@embecosm.com>
To: gdb-patches@sourceware.org
Cc: Richard Bunt
Subject: Re: [PATCH] gdb: Don't drop SIGSTOP during stop_all_threads
Date: Mon, 11 Jun 2018 12:55:00 -0000
Message-ID: <20180611125535.GK15881@embecosm.com>
References: <20180518105654.6932-1-andrew.burgess@embecosm.com>
In-Reply-To: <20180518105654.6932-1-andrew.burgess@embecosm.com>

Ping.

Thanks,
Andrew

* Andrew Burgess <andrew.burgess@embecosm.com> [2018-05-18 11:56:54 +0100]:

> This patch fixes an issue where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issue was especially likely to
> trigger if the machine (running the inferior) was under load.
>
> In summary, the problem is an imbalance between two functions in
> linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback we
> send SIGSTOP to a thread, but _only_ if the thread is not already
> stopped and not already signalled (signalled meaning a SIGSTOP has
> already been sent to it, so it should stop soon anyway).
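>
> To make the imbalance concrete, here is a minimal sketch of the two
> callbacks (illustrative only; the real code in gdb/linux-nat.c differs
> in detail, and the struct below is a stand-in for lwp_info):
>
>   #include <signal.h>
>   #include <sys/types.h>
>
>   struct lwp_sketch
>   {
>     pid_t pid;
>     int stopped;     /* Thread is known to be stopped.  */
>     int signalled;   /* We have already sent it a SIGSTOP.  */
>   };
>
>   /* Only send SIGSTOP if the thread is neither stopped nor already
>      signalled, so sometimes no new SIGSTOP is sent from here.  */
>   static void
>   stop_callback_sketch (struct lwp_sketch *lp)
>   {
>     if (!lp->stopped && !lp->signalled)
>       {
>         kill (lp->pid, SIGSTOP);
>         lp->signalled = 1;
>       }
>   }
>
>   /* Wait for the thread to stop.  Any SIGSTOP that arrives is assumed
>      to be the one sent above and is discarded, even when it was in
>      fact sent from elsewhere in GDB, which is the bug fixed below.  */
>   static void
>   stop_wait_callback_sketch (struct lwp_sketch *lp, int stopsig)
>   {
>     if (stopsig == SIGSTOP)
>       {
>         lp->signalled = 0;
>         /* Nothing recorded: the stop is lost if GDB was actually
>            waiting to report it.  */
>       }
>   }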
> In stop_wait_callback we wait for the SIGSTOP to arrive.  However, we
> are aware that the thread might have been signalled for some other
> reason, so if a signal other than SIGSTOP causes the thread to stop
> then we stash that signal away so it can be reported back later.  If
> we get a SIGSTOP then it is discarded; after all, that signal was sent
> from stop_callback.  Except that this might not be the case: the
> SIGSTOP could have been sent to the thread from elsewhere in GDB, in
> which case we would not have sent another SIGSTOP from stop_callback,
> and the SIGSTOP received in stop_wait_callback should not be ignored.
>
> Below I've laid out the exact sequence of events that I saw, which led
> me to the above diagnosis.
>
> After attaching to the inferior, GDB sends a SIGSTOP to all of the
> threads and then returns to the event loop waiting for interesting
> things to happen.
>
> Eventually the first target event is detected (this will be the first
> SIGSTOP arriving) and GDB calls inferior_event_handler, which calls
> fetch_inferior_event.  Inside fetch_inferior_event GDB calls
> do_target_wait, which calls target_wait to find a thread with an
> event.
>
> The target_wait call ends up in linux_nat_wait_1, which first checks
> whether any threads already have stashed stop events to report; if
> there are none then we enter a loop fetching as many events as
> possible out of the kernel.  This event fetching is non-blocking, and
> we give up once the kernel has no more events ready to give us.
>
> All of the events from the kernel are passed through
> linux_nat_filter_event, which stashes the wait status for all of the
> threads that reported a SIGSTOP; these will be returned by future
> calls to linux_nat_wait_1.
>
> Let's assume for a moment that we've attached to a multi-threaded
> inferior, and that all but one thread has reported its stop during the
> initial wait call in linux_nat_wait_1.  The remaining thread will be
> reporting a SIGSTOP, but the kernel did not manage to deliver that
> signal to GDB before GDB gave up waiting and continued handling the
> events it already had.  GDB selects one of the threads that has
> reported a SIGSTOP and passes this thread ID back to
> fetch_inferior_event.
>
> To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
> calls stop_all_threads; this calls wait_one, which in turn calls
> target_wait.
>
> The first call to target_wait at this point will result in a stashed
> wait status being returned, at which point we call setup_inferior.
> The call to setup_inferior leads to a call into try_thread_db_load_1,
> which results in a call to linux_stop_and_wait_all_lwps.  This in turn
> calls stop_callback on each thread, followed by stop_wait_callback on
> each thread.
>
> We're now ready to make the mistake.  In stop_callback we see that our
> problem thread is not stopped, but is signalled, so it should stop
> soon.  As a result we don't send another SIGSTOP.
>
> We then enter stop_wait_callback; eventually the problem thread stops
> with a SIGSTOP, which we _incorrectly_ assume came from stop_callback,
> and which we discard.
>
> Once stop_wait_callback has done its damage we return from
> linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
> eventually unwind back to the call to setup_inferior in
> stop_all_threads.  GDB now loops around and performs another
> target_wait to get the next event from the inferior.
>
> The target_wait call causes us to once again reach linux_nat_wait_1,
> and we pass through some code that calls resume_stopped_resumed_lwps.
> This allows GDB to resume threads that are physically stopped, but for
> which GDB sees no good reason to remain stopped.  In our case, the
> problem thread which had its SIGSTOP discarded is stopped, but doesn't
> have a stashed wait status to report, and so GDB sets the thread going
> again.
>
> We are now stuck waiting for an event on the problem thread that might
> never arrive.
>
> When considering how to write a test for this bug I struggled.  The
> issue was only spotted _randomly_ when a machine was heavily loaded
> with many multi-threaded applications, and GDB was being attached (by
> script) to all of these applications in parallel.  In one reproducer I
> required around 5 applications, each with 5 threads, per machine core
> in order to reproduce the bug 2 out of 3 times.
>
> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution
> I came up with was to write an LD_PRELOAD library that intercepts
> (some) waitpid calls and rate-limits them to one per second.  Any
> additional calls simply return 0, indicating that there's no event
> available.  Obviously this can only be applied to waitpid calls that
> have the WNOHANG flag set.
>
> Unfortunately, this library does break the "rest" of GDB.  I suspect
> the issue is that in some places GDB knows there's an event pending,
> calls non-blocking waitpid, and when the event fails to arrive moves
> into some unexpected/broken state from which it can't recover.  Still,
> the preload library does, at the moment, trigger the bug during
> attach.
>
> gdb/ChangeLog:
>
> 	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
> 	was requested by GDB.
>
> gdb/testsuite/ChangeLog:
>
> 	* gdb.threads/attach-slow-waitpid.c: New file.
> 	* gdb.threads/attach-slow-waitpid.exp: New file.
> 	* gdb.threads/slow-waitpid.c: New file.
> ---
>  gdb/ChangeLog                                     |   6 ++
>  gdb/linux-nat.c                                   |  14 ++-
>  gdb/testsuite/ChangeLog                           |   7 ++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 ++++++++++++++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.exp |  95 ++++++++++++++++++
>  gdb/testsuite/gdb.threads/slow-waitpid.c          | 117 ++++++++++++++++++++++
>  6 files changed, 312 insertions(+), 4 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
>  create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
>
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>          }
>        else
>          {
> -          /* We caught the SIGSTOP that we intended to catch, so
> -             there's no SIGSTOP pending.  */
> +          /* We caught the SIGSTOP that we intended to catch.  */
>
>            if (debug_linux_nat)
>              fprintf_unfiltered (gdb_stdlog,
>                                  "SWC: Expected SIGSTOP caught for %s.\n",
>                                  target_pid_to_str (lp->ptid));
>
> -          /* Reset SIGNALLED only after the stop_wait_callback call
> -             above as it does gdb_assert on SIGNALLED.  */
>            lp->signalled = 0;
> +
> +          /* If we are waiting for this stop so we can report the thread
> +             stopped then we need to record this status.  Otherwise, we
> +             can now discard this stop event.  */
> +          if (lp->last_resume_kind == resume_stop)
> +            {
> +              lp->status = status;
> +              save_stop_reason (lp);
> +            }
>          }
>      }
>
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> new file mode 100644
> index 00000000000..06e99ab22d1
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> @@ -0,0 +1,77 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <unistd.h>
> +#include <assert.h>
> +
> +#define NUM_THREADS 4
> +
> +/* Crude spin lock.  Threads all spin until this is set to 0.  */
> +int go = 1;
> +
> +/* Thread function, just spin until GO is set to 0.  */
> +void *
> +perform_work (void *argument)
> +{
> +  /* Cast to volatile to ensure that ARGUMENT is loaded each time around
> +     the loop.  */
> +  while (*((volatile int *) argument))
> +    {
> +      /* Nothing.  */
> +    }
> +  return NULL;
> +}
> +
> +/* The spin loop for the main thread.  */
> +void
> +function (void)
> +{
> +  (void) perform_work (&go);
> +  printf ("Finished from function\n");
> +}
> +
> +/* Main program, create some threads which all spin waiting for GO to be
> +   set to 0.  */
> +int
> +main (void)
> +{
> +  pthread_t threads[NUM_THREADS];
> +  int result_code;
> +  unsigned index;
> +
> +  /* Create some threads.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      printf ("In main: creating thread %d\n", index);
> +      result_code = pthread_create (&threads[index], NULL, perform_work, &go);
> +      assert (!result_code);
> +    }
> +
> +  function ();
> +
> +  /* Wait for each thread to complete.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      /* Block until thread INDEX completes.  */
> +      result_code = pthread_join (threads[index], NULL);
> +      assert (!result_code);
> +      printf ("In main: thread %d has completed\n", index);
> +    }
> +  printf ("In main: All threads completed successfully\n");
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> new file mode 100644
> index 00000000000..d4b12727842
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> @@ -0,0 +1,95 @@
> +# Copyright 2018 Free Software Foundation, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test script tries to expose a bug in some of the use of waitpid
> +# in the Linux native support within GDB.  The problem was spotted on
> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads, which would
> +# continue to wait for all of the outstanding threads to stop.  When
> +# the outstanding stop events finally arrived, GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  Waitpid calls targeting pid
> +# -1 with the WNOHANG flag are rate-limited so that only one per
> +# second can complete; any additional calls are forced to return 0,
> +# indicating that no event is waiting.  This is enough to trigger the
> +# bug during the attach phase; however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than just attach with this library in
> +# place.
> +
> +# This test only works on Linux.
> +if { ![isnative] || [is_remote host] || [use_gdb_stub]
> +     || ![istarget *-linux*] } {
> +    continue
> +}
> +
> +standard_testfile
> +
> +set libfile slow-waitpid
> +set libsrc "${srcdir}/${subdir}/${libfile}.c"
> +set libobj [standard_output_file ${libfile}.so]
> +
> +# Compile the preload library.  We only get away with this as we limit
> +# this test to running when ISNATIVE is true.
> +if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
> +    return -1
> +}
> +
> +# Compile the test program.
> +if { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
> +    return -1
> +}
> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +        if { ![info exists env(LD_PRELOAD)]
> +             || $env(LD_PRELOAD) == "" } {
> +            set env(LD_PRELOAD) "$lib"
> +        } else {
> +            append env(LD_PRELOAD) ":$lib"
> +        }
> +
> +        gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}
> +
> +# Run test program in the background.
> +set test_spawn_id [spawn_wait_for_attach $binfile]
> +set testpid [spawn_id_get_pid $test_spawn_id]
> +
> +# Start GDB with preload library in place.
> +gdb_spawn_with_ld_preload $libobj ""
> +
> +# Load binary, and attach to running program.
> +gdb_load ${binfile}
> +gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
> +
> +gdb_exit
> +
> +# Kill off the test program.
> +kill_wait_spawned_process $test_spawn_id
> diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
> new file mode 100644
> index 00000000000..40b5c958899
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
> @@ -0,0 +1,117 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#define _GNU_SOURCE
> +
> +#include <dlfcn.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdarg.h>
> +#include <string.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/time.h>
> +
> +/* Logging.  */
> +
> +static void
> +log_msg (const char *fmt, ...)
> +{
> +#ifdef LOGGING
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +#endif /* LOGGING */
> +}
> +
> +/* Error handling, print message and exit.  */
> +
> +static void
> +error (const char *fmt, ...)
> +{
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +
> +  exit (EXIT_FAILURE);
> +}
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> +   (zero) if this waitpid call should be handled by the "real" waitpid
> +   function.  Allows one waitpid call per second.  */
> +
> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +        return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));
> +    }
> +
> +  /* Don't skip.  */
> +  return 0;
> +}
> +
> +/* The waitpid entry point function.  */
> +
> +pid_t
> +waitpid (pid_t pid, int *wstatus, int options)
> +{
> +  typedef pid_t (*fptr_t) (pid_t, int *, int);
> +  static fptr_t real_func = NULL;
> +
> +  if (real_func == NULL)
> +    {
> +      real_func = dlsym (RTLD_NEXT, "waitpid");
> +      if (real_func == NULL)
> +        error ("error: failed to find real waitpid\n");
> +    }
> +
> +  log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
> +
> +  if ((options & WNOHANG) != 0
> +      && pid == -1
> +      && should_skip_waitpid ())
> +    {
> +      log_msg ("waitpid: skipping\n");
> +      return 0;
> +    }
> +
> +  return (*real_func) (pid, wstatus, options);
> +}
> -- 
> 2.14.3
>