From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 80C98385802E; Mon, 19 Oct 2020 01:19:24 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 80C98385802E
From: "simark at simark dot ca" <sourceware-bugzilla@sourceware.org>
To: gdb-prs@sourceware.org
Subject: [Bug gdb/26754] New: Race condition when resuming threads and one
 does an exec
Date: Mon, 19 Oct 2020 01:19:24 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gdb
X-Bugzilla-Component: gdb
X-Bugzilla-Version: HEAD
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: simark at simark dot ca
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-26754-4717@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gdb-prs@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-prs mailing list <gdb-prs.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-prs>,
 <mailto:gdb-prs-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-prs/>
List-Help: <mailto:gdb-prs-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-prs>,
 <mailto:gdb-prs-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 19 Oct 2020 01:19:24 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=26754

            Bug ID: 26754
           Summary: Race condition when resuming threads and one does an
                    exec
           Product: gdb
           Version: HEAD
            Status: NEW
          Severity: normal
          Priority: P2
         Component: gdb
          Assignee: unassigned at sourceware dot org
          Reporter: simark at simark dot ca
  Target Milestone: ---

I stumbled on this while trying to write a test for when a non-leader thread
displace-steps an exec syscall instruction.  The thing to remember here is that
on Linux, when a non-leader thread does an exec syscall, all non-main threads
are removed and only the main thread starts executing the new executable (I
don't know what happens in the kernel's data structures exactly, but at least
that's how it looks from userspace, so that's the important part).
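
As a minimal illustration of that userspace view (this is not the attached
reproducer; the function names and the use of /bin/sh are assumptions made for
the sketch), the code below forks a child, parks the child's leader thread in
pause(), and lets a second thread call exec.  The parent then waits on the
original pid:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in a non-leader thread: replace the whole process image. */
static void *exec_in_thread(void *arg)
{
    (void)arg;
    execl("/bin/sh", "sh", "-c", "exit 42", (char *)NULL);
    _exit(127);                 /* only reached if execl failed */
}

/* Fork a child whose non-leader thread execs; return the child's exit
   status as seen by the parent, or -1 on error. */
int exec_from_non_leader(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        pthread_t t;
        if (pthread_create(&t, NULL, exec_in_thread, NULL) != 0)
            _exit(126);
        for (;;)
            pause();            /* the leader never exits by itself */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Built with -pthread, exec_from_non_leader() should return 42: the exit status
of the new program is reported under the original pid, even though the leader
thread itself never called exec or exit.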

Things go wrong when GDB tries to resume multiple threads, necessarily one
after the other, and one of these threads (a non-leader one) executes an exec
syscall before the main thread is resumed.

I'll attach the source for a reproducer.  It can be compiled with:

  $ gcc test.c -g3 -O0 -o test -pthread test_asm.S -fPIE

I run it with

  $ ../gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" \
      -ex "b main" -ex r -ex c

and then just "continue" to trigger the problem.

Since it involves a race, different things can happen if you execute the
reproducer multiple times.  I'll describe one possible outcome.  By applying
this small patch to GDB, you can pretty much guarantee to have this outcome:

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 8ae39a2877b3..450a7a37bc5b 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -2246,6 +2246,9 @@ do_target_resume (ptid_t resume_ptid, int step, enum gdb_signal sig)

   target_commit_resume ();

+  for (int i = 0; i < 500; i++)
+    usleep (1000);
+
   if (target_can_async_p ())
     target_async (1);
 }

---

Note that many usleep calls are used instead of one sleep because sleep
otherwise gets interrupted by incoming SIGCHLD signals.  I'm not sure it's
needed but I just wanted to play it safe and really make GDB sleep for a while.
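
For reference, a different way to get a delay that survives signal delivery is
to restart nanosleep with the remaining time whenever it returns EINTR.  The
sketch below is not what the patch above does; the function names are
hypothetical, and the second function only exists to exercise the first one
while a child exits and raises SIGCHLD:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* The handler does nothing; it only has to exist so that SIGCHLD can
   interrupt a blocking call instead of being ignored. */
static void on_sigchld(int sig) { (void)sig; }

/* Sleep for the full duration, restarting nanosleep after each EINTR. */
static void sleep_ms_uninterrupted(long ms)
{
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };

    /* On EINTR, nanosleep stores the remaining time back into req. */
    while (nanosleep(&req, &req) == -1 && errno == EINTR)
        ;
}

/* Sleep for ms milliseconds while a child exits in the background;
   return the measured elapsed time in milliseconds. */
long elapsed_ms_despite_sigchld(long ms)
{
    struct sigaction sa;
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCHLD, &sa, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pid_t pid = fork();
    if (pid == 0)
        _exit(0);               /* child exits at once: parent gets SIGCHLD */

    sleep_ms_uninterrupted(ms);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (pid > 0)
        waitpid(pid, NULL, 0);

    long long ns = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                   + (t1.tv_nsec - t0.tv_nsec);
    return (long)(ns / 1000000LL);
}
```

Whether the signal actually lands during the sleep depends on timing, but the
loop makes the total delay independent of that.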

So this is the sequence of events.  Let's assume we have a process with pid
1000 and two threads with tid 1000 (the main one) and 1001 (a user-created one,
which will execute the exec).  Both threads are stopped.  Thread 1001 is
stopped just before an exec syscall.

1. User does "continue"
2. Since thread 1001 needs to initiate a displaced-step, it is resumed first
(the displaced-step is not really at fault here, but having it makes it so that
we resume this particular thread first, so it helps trigger the issue).
3. The now-resumed thread 1001 does an exec syscall
4. The kernel deletes all non-main threads of process 1000 (so, deletes thread
1001).  It sets up thread 1000 with the new executable.  It sends GDB (the
ptracer) a PTRACE_EVENT_EXEC.  The thread is stopped as it will need GDB to
continue it before it starts executing the code of the new executable.
5. GDB, still processing the "continue" command, now resumes thread 1000 -
which succeeds.
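
The exec stop from step 4 can be observed in isolation with a small ptrace
tracer.  This is only a sketch under simplifying assumptions: the tracee is
single-threaded (so it does not reproduce the race), the function name is
hypothetical, and /bin/sh is used as an arbitrary program to exec.  It shows
that the tracee stays stopped at the PTRACE_EVENT_EXEC stop until the tracer
continues it:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a PTRACE_EVENT_EXEC stop was observed, 0 if the child
   finished without one, -1 on error. */
int observe_exec_event(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(125);         /* tracing not permitted here */
        raise(SIGSTOP);         /* stop so the tracer can set its options */
        execl("/bin/sh", "sh", "-c", "exit 0", (char *)NULL);
        _exit(127);             /* only reached if execl failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACEEXEC) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }
    ptrace(PTRACE_CONT, pid, NULL, NULL);   /* swallow the initial SIGSTOP */

    int saw_exec = 0;
    for (;;) {
        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return saw_exec;

        long sig = WSTOPSIG(status);        /* signal to pass on by default */
        if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8))) {
            saw_exec = 1;       /* stopped before running any new code */
            sig = 0;            /* an event stop is not a real signal */
        }
        ptrace(PTRACE_CONT, pid, NULL, (void *)sig);
    }
}
```

In the scenario described above, the resume of thread 1000 in step 5 plays the
role of the last PTRACE_CONT here, except that GDB issues it without having
seen the event first.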

The thing is that the thread 1000 that GDB resumes now isn't the thread it
thinks it is resuming.  The thread that GDB thinks it is resuming is the one
stopped somewhere in the original executable.  In reality, it resumes the
post-exec thread 1000, stopped on the PTRACE_EVENT_EXEC event, about to start
execution of the new executable.

So GDB ends up resuming this thread 1000, and all kinds of funny things can
happen after that.  Normally, on exec, the linux-nat target should report the
exec event to the core, which would call follow_exec, which would install
breakpoints in the fresh program space, among other things.  None of that is
done and the program is resumed, so one visible consequence is that any
breakpoints that were set are not effective.

This is what it looks like when running the reproducer:

---8<---
$ ./gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" \
    -ex "b main" -ex r -ex c
Reading symbols from ./test...
Breakpoint 1 at 0x128a: file test_asm.S, line 11.
Breakpoint 2 at 0x11fb: file test.c, line 27.
Starting program: /home/simark/build/binutils-gdb/gdb/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/../lib/libthread_db.so.1".

Breakpoint 2, main (argc=1, argv=0x7fffffffe018) at test.c:27
27      {
Continuing.
[New Thread 0x7ffff7da6640 (LWP 2813391)]
[Switching to Thread 0x7ffff7da6640 (LWP 2813391)]

Thread 2 "test" hit Breakpoint 1, the_syscall () at test_asm.S:11
11              syscall
(gdb) c   <--- this is where things start to go wrong
Continuing.
Welcome
[Thread 0x7ffff7da7740 (LWP 2813387) exited]
...hangs...
--->8---

Normally, we should break on "main", but we missed it and the program ended.  I
think that at this point, GDB still believes there's a thread 2813391
executing that hasn't stopped.
