From: Andrew Burgess <andrew.burgess@embecosm.com>
To: gdb-patches@sourceware.org
Cc: Richard Bunt
Subject: Re: [PATCH] gdb: Don't drop SIGSTOP during stop_all_threads
Date: Mon, 11 Jun 2018 12:55:00 -0000
Message-ID: <20180611125535.GK15881@embecosm.com>
References: <20180518105654.6932-1-andrew.burgess@embecosm.com>
In-Reply-To: <20180518105654.6932-1-andrew.burgess@embecosm.com>

Ping.

Thanks,
Andrew

* Andrew Burgess <andrew.burgess@embecosm.com> [2018-05-18 11:56:54 +0100]:

> This patch fixes an issue where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issue was especially likely to
> trigger if the machine (running the inferior) was under load.
>
> In summary, the problem is an imbalance between two functions in
> linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback we
> send SIGSTOP to a thread, but _only_ if the thread is not already
> stopped and not already signalled (signalled meaning a SIGSTOP has
> already been sent to it, so it should stop soon anyway).
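>
> To make the imbalance concrete, here is a minimal sketch of the two
> callbacks (illustrative only; the real code in gdb/linux-nat.c differs
> in detail, and the struct below is a stand-in for lwp_info):
>
>   #include <signal.h>
>   #include <sys/types.h>
>
>   struct lwp_sketch
>   {
>     pid_t pid;
>     int stopped;     /* Thread is known to be stopped.  */
>     int signalled;   /* We have already sent it a SIGSTOP.  */
>   };
>
>   /* Only send SIGSTOP if the thread is neither stopped nor already
>      signalled, so sometimes no new SIGSTOP is sent from here.  */
>   static void
>   stop_callback_sketch (struct lwp_sketch *lp)
>   {
>     if (!lp->stopped && !lp->signalled)
>       {
>         kill (lp->pid, SIGSTOP);
>         lp->signalled = 1;
>       }
>   }
>
>   /* Wait for the thread to stop.  Any SIGSTOP that arrives is assumed
>      to be the one sent above and is discarded, even when it was in
>      fact sent from elsewhere in GDB, which is the bug fixed below.  */
>   static void
>   stop_wait_callback_sketch (struct lwp_sketch *lp, int stopsig)
>   {
>     if (stopsig == SIGSTOP)
>       {
>         lp->signalled = 0;
>         /* Nothing recorded: the stop is lost if GDB was actually
>            waiting to report it.  */
>       }
>   }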
> In stop_wait_callback we wait for the SIGSTOP to arrive.  However, we
> are aware that the thread might have been signalled for some other
> reason, so if a signal other than SIGSTOP causes the thread to stop
> then we stash that signal away so it can be reported back later.  If
> we get a SIGSTOP then it is discarded; after all, that signal was sent
> from stop_callback.  Except that this might not be the case: the
> SIGSTOP could have been sent to the thread from elsewhere in GDB, in
> which case we would not have sent another SIGSTOP from stop_callback,
> and the SIGSTOP received in stop_wait_callback should not be ignored.
>
> Below I've laid out the exact sequence of events that I saw, which led
> me to the above diagnosis.
>
> After attaching to the inferior, GDB sends a SIGSTOP to all of the
> threads and then returns to the event loop waiting for interesting
> things to happen.
>
> Eventually the first target event is detected (this will be the first
> SIGSTOP arriving) and GDB calls inferior_event_handler, which calls
> fetch_inferior_event.  Inside fetch_inferior_event GDB calls
> do_target_wait, which calls target_wait to find a thread with an
> event.
>
> The target_wait call ends up in linux_nat_wait_1, which first checks
> whether any threads already have stashed stop events to report; if
> there are none then we enter a loop fetching as many events as
> possible out of the kernel.  This event fetching is non-blocking, and
> we give up once the kernel has no more events ready to give us.
>
> All of the events from the kernel are passed through
> linux_nat_filter_event, which stashes the wait status for all of the
> threads that reported a SIGSTOP; these will be returned by future
> calls to linux_nat_wait_1.
>
> Let's assume for a moment that we've attached to a multi-threaded
> inferior, and that all but one thread has reported its stop during the
> initial wait call in linux_nat_wait_1.  The remaining thread will be
> reporting a SIGSTOP, but the kernel did not manage to deliver that
> signal to GDB before GDB gave up waiting and continued handling the
> events it already had.  GDB selects one of the threads that has
> reported a SIGSTOP and passes this thread ID back to
> fetch_inferior_event.
>
> To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
> calls stop_all_threads; this calls wait_one, which in turn calls
> target_wait.
>
> The first call to target_wait at this point will result in a stashed
> wait status being returned, at which point we call setup_inferior.
> The call to setup_inferior leads to a call into try_thread_db_load_1,
> which results in a call to linux_stop_and_wait_all_lwps.  This in turn
> calls stop_callback on each thread, followed by stop_wait_callback on
> each thread.
>
> We're now ready to make the mistake.  In stop_callback we see that our
> problem thread is not stopped, but is signalled, so it should stop
> soon.  As a result we don't send another SIGSTOP.
>
> We then enter stop_wait_callback; eventually the problem thread stops
> with a SIGSTOP, which we _incorrectly_ assume came from stop_callback,
> and which we discard.
>
> Once stop_wait_callback has done its damage we return from
> linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
> eventually unwind back to the call to setup_inferior in
> stop_all_threads.  GDB now loops around and performs another
> target_wait to get the next event from the inferior.
>
> The target_wait call causes us to once again reach linux_nat_wait_1,
> and we pass through some code that calls resume_stopped_resumed_lwps.
> This allows GDB to resume threads that are physically stopped, but for
> which GDB sees no good reason to remain stopped.  In our case, the
> problem thread which had its SIGSTOP discarded is stopped, but doesn't
> have a stashed wait status to report, and so GDB sets the thread going
> again.
>
> We are now stuck waiting for an event on the problem thread that might
> never arrive.
>
> When considering how to write a test for this bug I struggled.  The
> issue was only spotted _randomly_ when a machine was heavily loaded
> with many multi-threaded applications, and GDB was being attached (by
> script) to all of these applications in parallel.  In one reproducer I
> required around 5 applications, each with 5 threads, per machine core
> in order to reproduce the bug 2 out of 3 times.
>
> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution
> I came up with was to write an LD_PRELOAD library that intercepts
> (some) waitpid calls and rate-limits them to one per second.  Any
> additional calls simply return 0, indicating that there's no event
> available.  Obviously this can only be applied to waitpid calls that
> have the WNOHANG flag set.
>
> Unfortunately, this library does break the "rest" of GDB.  I suspect
> the issue is that in some places GDB knows there's an event pending,
> calls non-blocking waitpid, and when the event fails to arrive moves
> into some unexpected/broken state from which it can't recover.  Still,
> the preload library does, at the moment, trigger the bug during
> attach.
>
> gdb/ChangeLog:
>
> 	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
> 	was requested by GDB.
>
> gdb/testsuite/ChangeLog:
>
> 	* gdb.threads/attach-slow-waitpid.c: New file.
> 	* gdb.threads/attach-slow-waitpid.exp: New file.
> 	* gdb.threads/slow-waitpid.c: New file.
> ---
>  gdb/ChangeLog                                     |   6 ++
>  gdb/linux-nat.c                                   |  14 ++-
>  gdb/testsuite/ChangeLog                           |   7 ++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 ++++++++++++++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.exp |  95 ++++++++++++++++++
>  gdb/testsuite/gdb.threads/slow-waitpid.c          | 117 ++++++++++++++++++++++
>  6 files changed, 312 insertions(+), 4 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
>  create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
>
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>          }
>        else
>          {
> -          /* We caught the SIGSTOP that we intended to catch, so
> -             there's no SIGSTOP pending.  */
> +          /* We caught the SIGSTOP that we intended to catch.  */
>
>            if (debug_linux_nat)
>              fprintf_unfiltered (gdb_stdlog,
>                                  "SWC: Expected SIGSTOP caught for %s.\n",
>                                  target_pid_to_str (lp->ptid));
>
> -          /* Reset SIGNALLED only after the stop_wait_callback call
> -             above as it does gdb_assert on SIGNALLED.  */
>            lp->signalled = 0;
> +
> +          /* If we are waiting for this stop so we can report the thread
> +             stopped then we need to record this status.  Otherwise, we
> +             can now discard this stop event.  */
> +          if (lp->last_resume_kind == resume_stop)
> +            {
> +              lp->status = status;
> +              save_stop_reason (lp);
> +            }
>          }
>      }
>
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> new file mode 100644
> index 00000000000..06e99ab22d1
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> @@ -0,0 +1,77 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <unistd.h>
> +#include <assert.h>
> +
> +#define NUM_THREADS 4
> +
> +/* Crude spin lock.  Threads all spin until this is set to 0.  */
> +int go = 1;
> +
> +/* Thread function, just spin until GO is set to 0.  */
> +void *
> +perform_work (void *argument)
> +{
> +  /* Cast to volatile to ensure that ARGUMENT is loaded each time around
> +     the loop.  */
> +  while (*((volatile int *) argument))
> +    {
> +      /* Nothing.  */
> +    }
> +  return NULL;
> +}
> +
> +/* The spin loop for the main thread.  */
> +void
> +function (void)
> +{
> +  (void) perform_work (&go);
> +  printf ("Finished from function\n");
> +}
> +
> +/* Main program, create some threads which all spin waiting for GO to be
> +   set to 0.  */
> +int
> +main (void)
> +{
> +  pthread_t threads[NUM_THREADS];
> +  int result_code;
> +  unsigned index;
> +
> +  /* Create some threads.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      printf ("In main: creating thread %d\n", index);
> +      result_code = pthread_create (&threads[index], NULL, perform_work, &go);
> +      assert (!result_code);
> +    }
> +
> +  function ();
> +
> +  /* Wait for each thread to complete.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      /* Block until thread INDEX completes.  */
> +      result_code = pthread_join (threads[index], NULL);
> +      assert (!result_code);
> +      printf ("In main: thread %d has completed\n", index);
> +    }
> +  printf ("In main: All threads completed successfully\n");
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> new file mode 100644
> index 00000000000..d4b12727842
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> @@ -0,0 +1,95 @@
> +# Copyright 2018 Free Software Foundation, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test script tries to expose a bug in some of the use of waitpid
> +# in the Linux native support within GDB.  The problem was spotted on
> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads, which would
> +# continue to wait for all of the outstanding threads to stop.  When
> +# the outstanding stop events finally arrived, GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  Waitpid calls targeting pid
> +# -1 with the WNOHANG flag are rate-limited so that only one per
> +# second can complete; any additional calls are forced to return 0,
> +# indicating that no event is waiting.  This is enough to trigger the
> +# bug during the attach phase; however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than just attach with this library in
> +# place.
> +
> +# This test only works on Linux.
> +if { ![isnative] || [is_remote host] || [use_gdb_stub]
> +     || ![istarget *-linux*] } {
> +    continue
> +}
> +
> +standard_testfile
> +
> +set libfile slow-waitpid
> +set libsrc "${srcdir}/${subdir}/${libfile}.c"
> +set libobj [standard_output_file ${libfile}.so]
> +
> +# Compile the preload library.  We only get away with this as we limit
> +# this test to running when ISNATIVE is true.
> +if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
> +    return -1
> +}
> +
> +# Compile the test program.
> +if { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
> +    return -1
> +}
> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +        if { ![info exists env(LD_PRELOAD)]
> +             || $env(LD_PRELOAD) == "" } {
> +            set env(LD_PRELOAD) "$lib"
> +        } else {
> +            append env(LD_PRELOAD) ":$lib"
> +        }
> +
> +        gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}
> +
> +# Run test program in the background.
> +set test_spawn_id [spawn_wait_for_attach $binfile]
> +set testpid [spawn_id_get_pid $test_spawn_id]
> +
> +# Start GDB with preload library in place.
> +gdb_spawn_with_ld_preload $libobj ""
> +
> +# Load binary, and attach to running program.
> +gdb_load ${binfile}
> +gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
> +
> +gdb_exit
> +
> +# Kill off the test program.
> +kill_wait_spawned_process $test_spawn_id
> diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
> new file mode 100644
> index 00000000000..40b5c958899
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
> @@ -0,0 +1,117 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#define _GNU_SOURCE
> +
> +#include <dlfcn.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdarg.h>
> +#include <string.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/time.h>
> +
> +/* Logging.  */
> +
> +static void
> +log_msg (const char *fmt, ...)
> +{
> +#ifdef LOGGING
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +#endif /* LOGGING */
> +}
> +
> +/* Error handling, print message and exit.  */
> +
> +static void
> +error (const char *fmt, ...)
> +{
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +
> +  exit (EXIT_FAILURE);
> +}
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> +   (zero) if this waitpid call should be handled by the "real" waitpid
> +   function.  Allows one waitpid call per second.  */
> +
> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +        return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));
> +    }
> +
> +  /* Don't skip.  */
> +  return 0;
> +}
> +
> +/* The waitpid entry point function.  */
> +
> +pid_t
> +waitpid (pid_t pid, int *wstatus, int options)
> +{
> +  typedef pid_t (*fptr_t) (pid_t, int *, int);
> +  static fptr_t real_func = NULL;
> +
> +  if (real_func == NULL)
> +    {
> +      real_func = dlsym (RTLD_NEXT, "waitpid");
> +      if (real_func == NULL)
> +        error ("error: failed to find real waitpid\n");
> +    }
> +
> +  log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
> +
> +  if ((options & WNOHANG) != 0
> +      && pid == -1
> +      && should_skip_waitpid ())
> +    {
> +      log_msg ("waitpid: skipping\n");
> +      return 0;
> +    }
> +
> +  return (*real_func) (pid, wstatus, options);
> +}
> -- 
> 2.14.3
>