From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-112787-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 756 invoked by alias); 26 May 2014 04:55:40 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Received: (qmail 742 invoked by uid 89); 26 May 2014 04:55:38 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-yk0-f170.google.com
Received: from mail-yk0-f170.google.com (HELO mail-yk0-f170.google.com) (209.85.160.170) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Mon, 26 May 2014 04:55:37 +0000
Received: by mail-yk0-f170.google.com with SMTP id 10so5810919ykt.1        for <gdb-patches@sourceware.org>; Sun, 25 May 2014 21:55:35 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.236.88.193 with SMTP id a41mr32672383yhf.22.1401080135157; Sun, 25 May 2014 21:55:35 -0700 (PDT)
Received: by 10.170.150.70 with HTTP; Sun, 25 May 2014 21:55:35 -0700 (PDT)
In-Reply-To: <1400885374-18915-1-git-send-email-donb@codesourcery.com>
References: <1398885482-8449-1-git-send-email-donb@codesourcery.com>	<1400885374-18915-1-git-send-email-donb@codesourcery.com>
Date: Mon, 26 May 2014 04:55:00 -0000
Message-ID: <CAP9bCMSLJwYn3RHzAwvfE6T05SdsEKXY1bivXwjOjrdZsvjwoQ@mail.gmail.com>
Subject: Re: [PATCH 0/4 v3] Exec events in gdbserver on Linux
From: Doug Evans <xdje42@gmail.com>
To: Don Breazeal <donb@codesourcery.com>
Cc: "gdb-patches@sourceware.org" <gdb-patches@sourceware.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2014-05/txt/msg00625.txt.bz2

On Fri, May 23, 2014 at 3:49 PM, Don Breazeal <donb@codesourcery.com> wrote:
> This patch series is an update to the gdbserver Linux exec event patches =
based on review comments for the previous version.  The changes from the pr=
evious version are summarized below.
>
> [...]
>
> RACE CONDITION
>
> This section explains why the existing techniques for detecting thread ex=
it weren't sufficient for gdbserver exec events, necessitating the use of P=
TRACE_EVENT_EXIT.  The short answer is that there is a race condition in th=
e current implementation that can leave a dangling entry in the lwp list (a=
n entry that doesn't have a corresponding actual lwp).  In this case gdbser=
ver will hang waiting for the non-existent lwp to stop.  Using the exit eve=
nts eliminates this race condition.
>
> The same race may exist in the native implementation, since the two imple=
mentations are similar, but I haven't verified that.  It may be difficult t=
o concoct a test case that demonstrates the race since the window is so sma=
ll.
>
> Now for the long answer: in my testing I ran into a race condition in che=
ck_zombie_leaders, which detects when a thread group leader has exited and =
other threads still exist.  On the Linux kernel, ptrace/waitpid don't allow=
 reaping the leader thread until all other threads in the group are reaped.=
  When the leader exits, it goes zombie, but waitpid will not return an exi=
t status until the other threads are gone.  When a non-leader thread calls =
exec, all other non-leader threads are destroyed, the leader becomes a zomb=
ie, and once the "other" threads have been reaped, the execing thread takes=
 over the leader's pid (tgid) and appears to vanish.  In order to handle th=
is situation, check_zombie_leaders polls the process state in /proc and del=
etes thread group leaders that are in a zombie state.  The replacement is a=
dded to the lwp list when the exec event is reported.
>
> See https://sourceware.org/ml/gdb-patches/2011-10/msg00704.html for a mor=
e detailed explanation of how this works.
>
> Here is the relevant part of check_zombie_leaders:
>
> if (leader_lp !=3D NULL
>           /* Check if there are other threads in the group, as we may
>              have raced with the inferior simply exiting.  */
>           && !last_thread_of_process_p (leader_pid)
>           && linux_proc_pid_is_zombie (leader_pid))
>         {
>           /* ...large informative comment block... */
>           delete_lwp (leader_lp);
>
> The race occurred when there were two threads in the program, and the non=
-leader thread called exec.  In this case the leader thread passed through =
a very brief zombie state before being replaced by the exec'ing thread as t=
he thread group leader.  This state transition was asynchronous, with no de=
pendency on anything gdbserver did.  Because there were no other threads, t=
here were no thread exit events, and thus there was no synchronization with=
 the leader passing through the zombie state and the exec completing.  If t=
here had been more threads, the leader would remain in the zombie state unt=
il they were waited for.  In the two-thread case, sometimes the leader exit=
 was detected and sometimes it wasn't. (Recall that check_zombie_leaders is=
 polling the state, via linux_proc_pid_is_zombie.  The race is between the =
leader thread passing through the zombie state and check_zombie_leaders tes=
ting for zombie state.)  If leader exit wasn't detected, gdbserver would en=
d up with a dangl
>  ing lwp entry that didn't correspond to any real lwp, and would hang wai=
ting for that lwp to stop.  Using PTRACE_EVENT_EXIT guarantees that the lea=
der exit will be detected.
>
> Note that check_zombie_leaders works just fine for the scenarios where th=
e leader thread exits and the other threads continue to run, with no exec c=
alls.  It is required for systems that don't support the extended ptrace ev=
ents.
>
> The sequence of events resulting in the race condition was this:
>
>     1) In the program, a CLONE event for a new thread occurs.
>
>     2) In the program, both threads are resumed once gdbserver has
>        completed the new thread processing.
>
>     3) In gdbserver, the function linux_wait_for_event_filtered loops unt=
il
>        waitpid returns "no more events" for the SIGCHLD generated by the
>        CLONE event.  Then linux_wait_for_event_filtered calls
>        check_zombie_leaders.
>
>     4) In the program, the new thread is doing the exec.  During the exec
>        the leader thread will pass through a transitory zombie state.  If
>        there were more than two threads, the leader thread would remain a
>        zombie until all the non-leader, non-exec'ing threads were reaped =
by
>        gdbserver.  Since there are no such threads to reap, the leader ju=
st
>        becomes a zombie and is replaced by the exec'ing thread on-the-fly.
>        (Note that it appears that the leader thread is a zombie just for a
>        very brief instant.)
>
>     5) In gdbserver, check_zombie_leaders checks whether an lwp entry
>        corresponds to a zombie leader thread, and if so, deletes it.  Here
>        is the race: in (4) above, the leader may or may not be in the
>        transitory zombie state.  In the case where a zombie isn't detecte=
d,
>        delete_lwp is not called.
>
>     6) In gdbserver, an EXEC event is detected and processed.  When it ge=
ts
>        ready to report the event to GDB, it calls stop_all_lwps, which se=
nds
>        a SIGSTOP to each lwp in the list and the waits until all the lwps=
 in
>        the list have reported a stop event.  If the zombie leader wasn't
>        detected and processed in step (5), gdbserver blocks forever in
>        linux_wait_for_event_filtered, waiting for the undeleted lwp to be
>        stopped, which will never happen.

Hi.

How do I tweak your patch so that I can see the race condition for myself?
[I realize the window is small ... I'd just like to play with it.]

I tried just disabling PTRACE_O_TRACEEXIT but that causes segvs in the
inferior (using non-ldr-exc-1 testcase).
Still digging into that.