From mboxrd@z Thu Jan 1 00:00:00 1970
From: "simark at simark dot ca"
To: gdb-prs@sourceware.org
Subject: [Bug gdb/26754] New: Race condition when resuming threads and one does an exec
Date: Mon, 19 Oct 2020 01:19:24 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=26754

            Bug ID: 26754
           Summary: Race condition when resuming threads and one does an exec
           Product: gdb
           Version: HEAD
            Status: NEW
          Severity: normal
          Priority: P2
         Component: gdb
          Assignee: unassigned at sourceware dot org
          Reporter: simark at simark dot ca
  Target Milestone: ---

I stumbled on this while trying to write a test for when a non-leader thread
displace-steps an exec syscall instruction.
The thing to remember here is that on Linux, when a non-leader thread does an
exec syscall, all non-main threads are removed and only the main thread starts
executing the new executable (I don't know what happens in the kernel's data
structures exactly, but at least that's how it looks from userspace, so that's
the important part).

Things go wrong when GDB tries to resume multiple threads, necessarily one
after the other, and one of these threads (a non-leader one) executes an exec
syscall before the main thread is resumed.

I'll attach the source for a reproducer. It can be compiled with:

  $ gcc test.c -g3 -O0 -o test -pthread test_asm.S -fPIE

I run it with

  $ ../gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" -ex "b main" -ex r -ex c

and then just "continue" to trigger the problem.

Since it involves a race, different things can happen if you execute the
reproducer multiple times. I'll describe one possible outcome. By applying
this small patch to GDB, you can pretty much guarantee this outcome:

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 8ae39a2877b3..450a7a37bc5b 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -2246,6 +2246,9 @@ do_target_resume (ptid_t resume_ptid, int step, enum gdb_signal sig)
 
   target_commit_resume ();
 
+  for (int i = 0; i < 500; i++)
+    usleep (1000);
+
   if (target_can_async_p ())
     target_async (1);
 }
---

Note that I use many usleep calls rather than one sleep because sleep
otherwise gets interrupted by incoming SIGCHLD signals. I'm not sure it's
needed, but I just wanted to play it safe and really make GDB sleep for a
while.

So this is the sequence of events. Let's assume we have a process with pid
1000 and two threads with tid 1000 (the main one) and 1001 (a user-created
one, which will execute the exec). Both threads are stopped. Thread 1001 is
stopped just before an exec syscall.

1. User does "continue"
2. Since thread 1001 needs to initiate a displaced-step, it is resumed first
   (the displaced-step is not really at fault here, but having it means that
   we resume this particular thread first, so it helps trigger the issue).
3. The now-resumed thread 1001 does an exec syscall
4. The kernel deletes all non-main threads of process 1000 (so, deletes thread
   1001). It sets up thread 1000 with the new executable. It sends GDB (the
   ptracer) a PTRACE_EVENT_EXEC. The thread is stopped, as it will need GDB to
   continue it before it starts executing the code of the new executable.
5. GDB, still processing the "continue" command, now resumes thread 1000 -
   which succeeds.

The thing is that the thread 1000 that GDB resumes now isn't the thread it
thinks it is resuming. The thread that GDB thinks it is resuming is the one
stopped somewhere in the original executable. In reality, it resumes the
post-exec thread 1000, stopped on the PTRACE_EVENT_EXEC event, about to start
execution of the new executable.

So GDB ends up resuming this thread 1000, and all kinds of funny things can
happen after that. Normally, on exec, the linux-nat target should report the
exec event to the core, which would call follow_exec, which would install
breakpoints in the fresh program space, among other things. None of that is
done and the program is resumed, so one visible consequence is that any
breakpoints set are not effective.

This is what it looks like when running the reproducer:

---8<---
$ ./gdb -q -nx --data-directory=data-directory ./test -ex "b the_syscall" -ex "b main" -ex r -ex c
Reading symbols from ./test...
Breakpoint 1 at 0x128a: file test_asm.S, line 11.
Breakpoint 2 at 0x11fb: file test.c, line 27.
Starting program: /home/simark/build/binutils-gdb/gdb/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/../lib/libthread_db.so.1".

Breakpoint 2, main (argc=1, argv=0x7fffffffe018) at test.c:27
27      {
Continuing.
[New Thread 0x7ffff7da6640 (LWP 2813391)]
[Switching to Thread 0x7ffff7da6640 (LWP 2813391)]

Thread 2 "test" hit Breakpoint 1, the_syscall () at test_asm.S:11
11              syscall
(gdb) c      <--- this is where things start to go wrong
Continuing.
Welcome
[Thread 0x7ffff7da7740 (LWP 2813387) exited]
...hangs...
--->8---

Normally, we should break on "main", but we missed it and the program ended.
I think that at this point, GDB still believes there's a thread 2813391
executing that hasn't stopped.
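
The userspace-visible behaviour this report relies on (a non-leader thread
calls exec, and the new program ends up running as the only thread, under the
main thread's ID) can be checked outside GDB with a sketch along these lines.
This is not the attached reproducer. The names run_exec_from_thread, struct
exec_report and SKETCH_MAIN are made up for illustration, and the sketch
assumes Linux with /proc mounted and /bin/sh and ls available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result record, for illustration only.  */
struct exec_report {
    long child_pid;     /* PID of the forked child, as seen by the parent */
    long exec_tid;      /* TID of the non-leader thread that called exec */
    long post_exec_pid; /* PID reported by the program after the exec */
    long post_exec_tid; /* first TID listed in /proc/<pid>/task afterwards */
    int extra_tids;     /* nonzero if more than one TID was listed */
};

/* Runs in a non-leader thread: record our TID, then exec a shell.  */
static void *exec_in_thread(void *arg)
{
    char tid[32];

    (void)arg;
    snprintf(tid, sizeof tid, "%ld", (long)syscall(SYS_gettid));
    /* In the script, $0 is the pre-exec TID and $$ the post-exec PID.  */
    execl("/bin/sh", "sh", "-c", "echo $0 $$; ls /proc/$$/task",
          tid, (char *)NULL);
    _exit(127); /* only reached if the exec failed */
}

/* Fork a child whose second thread execs; collect what it reports.
   Returns 0 on success, -1 on failure.  */
int run_exec_from_thread(struct exec_report *r)
{
    int fds[2], fields;
    char buf[256];
    size_t len = 0;
    ssize_t n;
    long extra;
    pid_t pid;

    assert(r != NULL);
    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: this is the leader thread; it never calls exec itself.  */
        pthread_t t;

        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        if (pthread_create(&t, NULL, exec_in_thread, NULL) != 0)
            _exit(126);
        for (;;)
            pause();
    }
    /* Parent: read everything the post-exec program prints, then reap.  */
    close(fds[1]);
    while (len < sizeof buf - 1
           && (n = read(fds[0], buf + len, sizeof buf - 1 - len)) > 0)
        len += (size_t)n;
    buf[len] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);

    r->child_pid = (long)pid;
    fields = sscanf(buf, "%ld %ld %ld %ld", &r->exec_tid,
                    &r->post_exec_pid, &r->post_exec_tid, &extra);
    r->extra_tids = fields > 3;
    return fields >= 3 ? 0 : -1;
}

#ifdef SKETCH_MAIN
int main(void)
{
    struct exec_report r;

    if (run_exec_from_thread(&r) != 0) {
        fprintf(stderr, "sketch failed (no /proc or /bin/sh?)\n");
        return 1;
    }
    printf("child pid %ld, exec called by tid %ld\n",
           r.child_pid, r.exec_tid);
    printf("after exec: pid %ld, remaining tid %ld\n",
           r.post_exec_pid, r.post_exec_tid);
    return 0;
}
#endif
```

Built with something like "gcc -pthread -DSKETCH_MAIN", it prints the IDs
involved. If the description above holds, the post-exec PID and the single
remaining TID should both equal the child's PID, while the TID that called
exec differs from it. That is the ID reuse that lets GDB's later resume of
"thread 1000" succeed against the wrong thread.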