From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 53292 invoked by alias); 6 Mar 2015 19:58:16 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 53217 invoked by uid 89); 6 Mar 2015 19:58:16 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS,SPF_PASS,T_RP_MATCHES_RCVD autolearn=ham version=3.3.2 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Fri, 06 Mar 2015 19:58:13 +0000 Received: from int-mx10.intmail.prod.int.phx2.redhat.com (int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t26JwBr7018383 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL) for ; Fri, 6 Mar 2015 14:58:11 -0500 Received: from brno.lan (ovpn01.gateway.prod.ext.ams2.redhat.com [10.39.146.11]) by int-mx10.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t26Jw60n024787 for ; Fri, 6 Mar 2015 14:58:11 -0500 From: Pedro Alves To: gdb-patches@sourceware.org Subject: [PATCH 3/6] Fix race exposed by gdb.threads/killed.exp Date: Fri, 06 Mar 2015 19:58:00 -0000 Message-Id: <1425671886-7798-4-git-send-email-palves@redhat.com> In-Reply-To: <1425671886-7798-1-git-send-email-palves@redhat.com> References: <1425671886-7798-1-git-send-email-palves@redhat.com> X-SW-Source: 2015-03/txt/msg00178.txt.bz2 On GNU/Linux, this test sometimes FAILs like this: (gdb) run Starting program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/killed [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". ptrace: No such process. (gdb) Program terminated with signal SIGKILL, Killed. The program no longer exists. FAIL: gdb.threads/killed.exp: run program to completion (timeout) Note the suspicious "No such process" line (that's errno==ESRCH). Adding debug output we see: linux_nat_wait: [process -1], [TARGET_WNOHANG] LLW: enter LNW: waitpid(-1, ...) returned 18465, ERRNO-OK LLW: waitpid 18465 received Stopped (signal) (stopped) LNW: waitpid(-1, ...) returned 18461, ERRNO-OK LLW: waitpid 18461 received Trace/breakpoint trap (stopped) LLW: Handling extended status 0x03057f LHEW: Got clone event from LWP 18461, new child is LWP 18465 LNW: waitpid(-1, ...) returned 0, ERRNO-OK RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0 RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0 sigchld ptrace: No such process. (gdb) linux_nat_wait: [process -1], [TARGET_WNOHANG] LLW: enter LNW: waitpid(-1, ...) returned 18465, ERRNO-OK LLW: waitpid 18465 received Killed (terminated) LLW: LWP 18465 exited. LNW: waitpid(-1, ...) returned 18461, No child processes LLW: waitpid 18461 received Killed (terminated) Process 18461 exited LNW: waitpid(-1, ...) returned -1, No child processes LLW: exit sigchld infrun: target_wait (-1, status) = infrun: 18461 [process 18461], infrun: status->kind = signalled, signal = GDB_SIGNAL_KILL infrun: TARGET_WAITKIND_SIGNALLED Program terminated with signal SIGKILL, Killed. The program no longer exists. infrun: stop_waiting FAIL: gdb.threads/killed.exp: run program to completion (timeout) The ptrace call that fails was a PTRACE_CONTINUE. The issue is that here: RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0 RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0 The first line shows we had just resumed LWP 18465, which does: void * child_func (void *dummy) { kill (pid, SIGKILL); exit (1); } So if the kernel manages to schedule that thread fast enough, the process may be killed before GDB has a chance to resume LWP 18461. GDBserver has code at the tail end of linux_resume_one_lwp to cope with this: ~~~ ptrace (step ? PTRACE_SINGLESTEP : PTRACE_CONT, lwpid_of (thread), (PTRACE_TYPE_ARG3) 0, /* Coerce to a uintptr_t first to avoid potential gcc warning of coercing an 8 byte integer to a 4 byte pointer. */ (PTRACE_TYPE_ARG4) (uintptr_t) signal); current_thread = saved_thread; if (errno) { /* ESRCH from ptrace either means that the thread was already running (an error) or that it is gone (a race condition). If it's gone, we will get a notification the next time we wait, so we can ignore the error. We could differentiate these two, but it's tricky without waiting; the thread still exists as a zombie, so sending it signal 0 would succeed. So just ignore ESRCH. */ if (errno == ESRCH) return; perror_with_name ("ptrace"); } ~~~ However, that's not a complete fix, because between starting to handle the resume request and getting that PTRACE_CONTINUE, we run other ptrace calls that can also fail with ESRCH, and that end up throwing an error. So instead of ignoring ESRCH locally at each ptrace call, let ptrace errors throw a new THREAD_NOT_FOUND_ERROR, and wrap the resume request in a TRY_CATCH that swallows it. For now, nothing in the core ever sees or handles this error, but I envision that we'll find uses for it there too. gdb/gdbserver/ChangeLog: 2015-03-06 Pedro Alves * linux-low.c (linux_resume_one_lwp): Rename to ... (linux_resume_one_lwp_throw): ... this. Don't handle ESRCH here, instead call throw_ptrace_error. (linux_resume_one_lwp): Reimplement as wrapper around linux_resume_one_lwp_throw that swallows THREAD_NOT_FOUND_ERROR. gdb/ChangeLog: 2015-03-06 Pedro Alves * linux-nat.c (linux_resume_one_lwp): Rename to ... (linux_resume_one_lwp_throw): ... this. Don't handle ESRCH here, instead call throw_ptrace_error. (linux_resume_one_lwp): Reimplement as wrapper around linux_resume_one_lwp_throw that swallows THREAD_NOT_FOUND_ERROR. --- gdb/common/common-exceptions.h | 3 +++ gdb/gdbserver/linux-low.c | 54 +++++++++++++++++++++++++++++------------- gdb/linux-nat.c | 32 ++++++++++++++++++++++++- gdb/nat/ptrace-utils.c | 22 ++++++++++++++++- 4 files changed, 93 insertions(+), 18 deletions(-) diff --git a/gdb/common/common-exceptions.h b/gdb/common/common-exceptions.h index a32e6f9..1a13f18 100644 --- a/gdb/common/common-exceptions.h +++ b/gdb/common/common-exceptions.h @@ -93,6 +93,9 @@ enum errors { aborted as the inferior state is no longer valid. */ TARGET_CLOSE_ERROR, + /* The thread is gone. */ + THREAD_NOT_FOUND_ERROR, + /* An undefined command was executed. */ UNDEFINED_COMMAND_ERROR, diff --git a/gdb/gdbserver/linux-low.c b/gdb/gdbserver/linux-low.c index 48d905b..f9c4079 100644 --- a/gdb/gdbserver/linux-low.c +++ b/gdb/gdbserver/linux-low.c @@ -28,6 +28,7 @@ #include "nat/linux-ptrace.h" #include "nat/linux-procfs.h" #include "nat/linux-personality.h" +#include "nat/ptrace-utils.h" #include #include #include @@ -3378,13 +3379,12 @@ stop_all_lwps (int suspend, struct lwp_info *except) } } -/* Resume execution of the inferior process. - If STEP is nonzero, single-step it. - If SIGNAL is nonzero, give it that signal. */ +/* Like linux_resume_one_lwp but throws THREAD_NOT_FOUND_ERROR if the + LWP has disappeared (ptrace returns ESRCH). */ static void -linux_resume_one_lwp (struct lwp_info *lwp, - int step, int signal, siginfo_t *info) +linux_resume_one_lwp_throw (struct lwp_info *lwp, + int step, int signal, siginfo_t *info) { struct thread_info *thread = get_lwp_thread (lwp); struct thread_info *saved_thread; @@ -3576,18 +3576,40 @@ linux_resume_one_lwp (struct lwp_info *lwp, current_thread = saved_thread; if (errno) - { - /* ESRCH from ptrace either means that the thread was already - running (an error) or that it is gone (a race condition). If - it's gone, we will get a notification the next time we wait, - so we can ignore the error. We could differentiate these - two, but it's tricky without waiting; the thread still exists - as a zombie, so sending it signal 0 would succeed. So just - ignore ESRCH. */ - if (errno == ESRCH) - return; + throw_ptrace_error ("resuming thread"); +} - perror_with_name ("ptrace"); +/* Resume execution of LWP. If STEP is nonzero, single-step it. If + SIGNAL is nonzero, give it that signal. No error is throw if the + LWP disappears while we try to resume it, no error is thrown. */ + +static void +linux_resume_one_lwp (struct lwp_info *lwp, + int step, int signal, siginfo_t *info) +{ + struct gdb_exception ex; + + TRY_CATCH (ex, RETURN_MASK_ERROR) + { + linux_resume_one_lwp_throw (lwp, step, signal, info); + } + if (ex.reason < 0) + { + if (ex.error == THREAD_NOT_FOUND_ERROR) + { + /* We got an ESRCH while resuming the LWP. ESRCH from + ptrace either means that the thread was already running + (an error/bug) or that it is gone (a race condition, e.g, + another thread exited the whole process.), or it was + killed by SIGKILL. If it's gone, we will get a + notification the next time we wait, so we can ignore the + error. We could differentiate these, but it's tricky + without waiting; the thread still exists as a zombie, so + sending it signal 0 would succeed. So just ignore + ESRCH. */ + } + else + throw_exception (ex); } } diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c index 20fe533..6bb62fd 100644 --- a/gdb/linux-nat.c +++ b/gdb/linux-nat.c @@ -1504,7 +1504,7 @@ linux_nat_detach (struct target_ops *ops, const char *args, int from_tty) single-step it. If SIGNAL is nonzero, give it that signal. */ static void -linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) +linux_resume_one_lwp_throw (struct lwp_info *lp, int step, enum gdb_signal signo) { lp->step = step; @@ -1528,6 +1528,36 @@ linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) registers_changed_ptid (lp->ptid); } +static void +linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) +{ + struct gdb_exception ex; + + TRY_CATCH (ex, RETURN_MASK_ERROR) + { + linux_resume_one_lwp_throw (lp, step, signo); + } + if (ex.reason < 0) + { + if (ex.error == THREAD_NOT_FOUND_ERROR) + { + /* We got an ESRCH while resuming the LWP. ESRCH from + ptrace either means that the thread was already running + (an error/bug) or that it is gone (a race condition, e.g, + another thread exited the whole process.), or it was + killed by SIGKILL. If it's gone, we will get a + notification the next time we wait, so we can ignore the + error. We could differentiate these, but it's tricky + without waiting; the thread still exists as a zombie, so + sending it signal 0 would succeed. So just ignore + ESRCH. */ + } + else + throw_exception (ex); + } +} + + /* Resume LP. */ static void diff --git a/gdb/nat/ptrace-utils.c b/gdb/nat/ptrace-utils.c index a3dc9b5..ca59454 100644 --- a/gdb/nat/ptrace-utils.c +++ b/gdb/nat/ptrace-utils.c @@ -25,5 +25,25 @@ void throw_ptrace_error (const char *string) { - perror_with_name (string); + enum errors errcode; + + switch (errno) + { + case ESRCH: + /* This either means that: + + - the thread was already running (a GDB error/bug). + + - the thread is gone, a race condition. E.g, another thread + was set running and it exited the whole process, or the + process was killed by a SIGKILL. + + Assume the latter. */ + errcode = THREAD_NOT_FOUND_ERROR; + break; + default: + errcode = GENERIC_ERROR; + break; + } + throw_perror_with_name (errcode, string); } -- 1.9.3