Should this be on the blocker list for the 7.10 release?

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

* Should this be on the blocker list for the 7.10 release?
@ 2015-07-06 20:26 Simon Marchi
  2015-07-07 13:25 ` Joel Brobecker
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Marchi @ 2015-07-06 20:26 UTC (permalink / raw)
  To: gdb-patches

Hi all,

I filed this bug recently:

https://sourceware.org/bugzilla/show_bug.cgi?id=18600

In short, gdb sometimes leaves stopped newly spawned threads created by a
fork child.  The "sometimes" happens quite often, so it's very easy to
reproduce.  This has the effect of leaving (part) of your program stopped
while you expect it to be running.  "info threads" shows the threads as
running while in reality they aren't.

Is this bug serious enough to be on the blocking list for 7.10?  I would
think so, as it's a regression compared to 7.9.  However, I don't want
to impose work on anybody (e.g. Pedro, or anybody else familiar with the
linux-nat subsystem), which is why I am asking.  I am willing to put time
on this, but the chances of me successfully fixing the bug are quite slim.

Thanks!

Simon

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-06 20:26 Should this be on the blocker list for the 7.10 release? Simon Marchi
@ 2015-07-07 13:25 ` Joel Brobecker
  2015-07-07 16:18   ` Pedro Alves
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Brobecker @ 2015-07-07 13:25 UTC (permalink / raw)
  To: Simon Marchi, Pedro Alves; +Cc: gdb-patches

> I filed this bug recently:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=18600
> 
> In short, gdb sometimes leaves stopped newly spawned threads created by a
> fork child.  The "sometimes" happens quite often, so it's very easy to
> reproduce.  This has the effect of leaving (part) of your program stopped
> while you expect it to be running.  "info threads" shows the threads as
> running while in reality they aren't.
> 
> Is this bug serious enough to be on the blocking list for 7.10?  I would
> think so, as it's a regression compared to 7.9.  However, I don't want
> to impose work on anybody (e.g. Pedro, or anybody else familiar with the
> linux-nat subsystem), which is why I am asking.  I am willing to put time
> on this, but the chances of me successfully fixing the bug are quite slim.

Not sure. I think Pedro would be in a better position to answer.
For now, I've put this issue as a "maybe" for 7.10; so we will not
release until this is fixed, or we explicitly decide it's OK for 7.10.

Pedro?

-- 
Joel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 13:25 ` Joel Brobecker
@ 2015-07-07 16:18   ` Pedro Alves
  2015-07-07 18:04     ` Pedro Alves
  0 siblings, 1 reply; 11+ messages in thread
From: Pedro Alves @ 2015-07-07 16:18 UTC (permalink / raw)
  To: Joel Brobecker, Simon Marchi; +Cc: gdb-patches

On 07/07/2015 02:24 PM, Joel Brobecker wrote:

> Not sure. I think Pedro would be in a better position to answer.
> For now, I've put this issue as a "maybe" for 7.10; so we will not
> release until this is fixed, or we explicitly decide it's OK for 7.10.
> 
> Pedro?

Let me take a look and understand this better.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 16:18   ` Pedro Alves
@ 2015-07-07 18:04     ` Pedro Alves
  2015-07-07 18:35       ` Simon Marchi
  0 siblings, 1 reply; 11+ messages in thread
From: Pedro Alves @ 2015-07-07 18:04 UTC (permalink / raw)
  To: Joel Brobecker, Simon Marchi; +Cc: gdb-patches

On 07/07/2015 05:18 PM, Pedro Alves wrote:
> On 07/07/2015 02:24 PM, Joel Brobecker wrote:
> 
>> Not sure. I think Pedro would be in a better position to answer.
>> For now, I've put this issue as a "maybe" for 7.10; so we will not
>> release until this is fixed, or we explicitly decide it's OK for 7.10.
>>
>> Pedro?
> 
> Let me take a look and understand this better.

OK, the issue is that the new clone thread is found while inside
the linux_stop_and_wait_all_lwps call in this new bit of
code in linux-thread-db.c:

      linux_stop_and_wait_all_lwps ();

      ALL_LWPS (lp)
	if (ptid_get_pid (lp->ptid) == pid)
	  thread_from_lwp (lp->ptid);

      linux_unstop_all_lwps ();

We reach linux_handle_extended_wait with the "stopping"
parameter set to 1, and because of that we don't mark the
new lwp as resumed.  As consequence, the subsequent
resume_stopped_resumed_lwps (called first from that
linux_unstop_all_lwps) never resumes the new LWP...

There's lots of cruft in linux_handle_extended_wait that no
longer makes sense.  This seems to fix your github test
for me, and causes no testsuite regressions.

Did you try converting your test case to a proper
GDB test?  That'd be much appreciated.

---
From a4f205a18dffaff3344b31e9b8009b1c0de8ba80 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 7 Jul 2015 17:42:52 +0100
Subject: [PATCH] fix

---
 gdb/linux-nat.c | 91 +++++++++++++++++++++++++--------------------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index be429f8..ea38ebb 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2086,43 +2086,7 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
 	  new_lp = add_lwp (ptid_build (ptid_get_pid (lp->ptid), new_pid, 0));
 	  new_lp->cloned = 1;
 	  new_lp->stopped = 1;
-
-	  if (WSTOPSIG (status) != SIGSTOP)
-	    {
-	      /* This can happen if someone starts sending signals to
-		 the new thread before it gets a chance to run, which
-		 have a lower number than SIGSTOP (e.g. SIGUSR1).
-		 This is an unlikely case, and harder to handle for
-		 fork / vfork than for clone, so we do not try - but
-		 we handle it for clone events here.  We'll send
-		 the other signal on to the thread below.  */
-
-	      new_lp->signalled = 1;
-	    }
-	  else
-	    {
-	      struct thread_info *tp;
-
-	      /* When we stop for an event in some other thread, and
-		 pull the thread list just as this thread has cloned,
-		 we'll have seen the new thread in the thread_db list
-		 before handling the CLONE event (glibc's
-		 pthread_create adds the new thread to the thread list
-		 before clone'ing, and has the kernel fill in the
-		 thread's tid on the clone call with
-		 CLONE_PARENT_SETTID).  If that happened, and the core
-		 had requested the new thread to stop, we'll have
-		 killed it with SIGSTOP.  But since SIGSTOP is not an
-		 RT signal, it can only be queued once.  We need to be
-		 careful to not resume the LWP if we wanted it to
-		 stop.  In that case, we'll leave the SIGSTOP pending.
-		 It will later be reported as GDB_SIGNAL_0.  */
-	      tp = find_thread_ptid (new_lp->ptid);
-	      if (tp != NULL && tp->stop_requested)
-		new_lp->last_resume_kind = resume_stop;
-	      else
-		status = 0;
-	    }
+	  new_lp->resumed = 1;
 
 	  /* If the thread_db layer is active, let it record the user
 	     level thread id and status, and add the thread to GDB's
@@ -2136,19 +2100,23 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
 	    }
 
 	  /* Even if we're stopping the thread for some reason
-	     internal to this module, from the user/frontend's
-	     perspective, this new thread is running.  */
+	     internal to this module, from the perspective of infrun
+	     and the user/frontend, this new thread is running until
+	     it next reports a stop.  */
 	  set_running (new_lp->ptid, 1);
-	  if (!stopping)
-	    {
-	      set_executing (new_lp->ptid, 1);
-	      /* thread_db_attach_lwp -> lin_lwp_attach_lwp forced
-		 resume_stop.  */
-	      new_lp->last_resume_kind = resume_continue;
-	    }
+	  set_executing (new_lp->ptid, 1);
 
-	  if (status != 0)
+	  if (WSTOPSIG (status) != SIGSTOP)
 	    {
+	      /* This can happen if someone starts sending signals to
+		 the new thread before it gets a chance to run, which
+		 have a lower number than SIGSTOP (e.g. SIGUSR1).
+		 This is an unlikely case, and harder to handle for
+		 fork / vfork than for clone, so we do not try - but
+		 we handle it for clone events here.  */
+
+	      new_lp->signalled = 1;
+
 	      /* We created NEW_LP so it cannot yet contain STATUS.  */
 	      gdb_assert (new_lp->status == 0);
 
@@ -2162,7 +2130,6 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
 	      new_lp->status = status;
 	    }
 
-	  new_lp->resumed = !stopping;
 	  return 1;
 	}
 
@@ -3673,9 +3640,31 @@ resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
 {
   ptid_t *wait_ptid_p = data;
 
-  if (lp->stopped
-      && lp->resumed
-      && !lwp_status_pending_p (lp))
+  if (!lp->stopped)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not stopped\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (!lp->resumed)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not resumed\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (lwp_status_pending_p (lp))
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "has pending status\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else
     {
       struct regcache *regcache = get_thread_regcache (lp->ptid);
       struct gdbarch *gdbarch = get_regcache_arch (regcache);
-- 
1.9.3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 18:04     ` Pedro Alves
@ 2015-07-07 18:35       ` Simon Marchi
  2015-07-07 18:47         ` Pedro Alves
  0 siblings, 1 reply; 11+ messages in thread
From: Simon Marchi @ 2015-07-07 18:35 UTC (permalink / raw)
  To: Pedro Alves, Joel Brobecker; +Cc: gdb-patches

> OK, the issue is that the new clone thread is found while inside
> the linux_stop_and_wait_all_lwps call in this new bit of
> code in linux-thread-db.c:
> 
>       linux_stop_and_wait_all_lwps ();
> 
>       ALL_LWPS (lp)
> 	if (ptid_get_pid (lp->ptid) == pid)
> 	  thread_from_lwp (lp->ptid);
> 
>       linux_unstop_all_lwps ();
> 
> We reach linux_handle_extended_wait with the "stopping"
> parameter set to 1, and because of that we don't mark the
> new lwp as resumed.  As consequence, the subsequent
> resume_stopped_resumed_lwps (called first from that
> linux_unstop_all_lwps) never resumes the new LWP...
> 
> There's lots of cruft in linux_handle_extended_wait that no
> longer makes sense.  This seems to fix your github test
> for me, and causes no testsuite regressions.

It seems to fix most of it.  The only odd thing left that I
noticed is that it leaves some of the inferiors there.  When I
type "info inferiors" after running the program, I see one or
two of them left.  I believe there should only be inferior #1
left.

> Did you try converting your test case to a proper
> GDB test?  That'd be much appreciated.

I haven't, but I will.

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 18:35       ` Simon Marchi
@ 2015-07-07 18:47         ` Pedro Alves
  2015-07-07 19:04           ` Pedro Alves
  0 siblings, 1 reply; 11+ messages in thread
From: Pedro Alves @ 2015-07-07 18:47 UTC (permalink / raw)
  To: Simon Marchi, Joel Brobecker; +Cc: gdb-patches

On 07/07/2015 07:35 PM, Simon Marchi wrote:
>> OK, the issue is that the new clone thread is found while inside
>> the linux_stop_and_wait_all_lwps call in this new bit of
>> code in linux-thread-db.c:
>>
>>       linux_stop_and_wait_all_lwps ();
>>
>>       ALL_LWPS (lp)
>> 	if (ptid_get_pid (lp->ptid) == pid)
>> 	  thread_from_lwp (lp->ptid);
>>
>>       linux_unstop_all_lwps ();
>>
>> We reach linux_handle_extended_wait with the "stopping"
>> parameter set to 1, and because of that we don't mark the
>> new lwp as resumed.  As consequence, the subsequent
>> resume_stopped_resumed_lwps (called first from that
>> linux_unstop_all_lwps) never resumes the new LWP...
>>
>> There's lots of cruft in linux_handle_extended_wait that no
>> longer makes sense.  This seems to fix your github test
>> for me, and causes no testsuite regressions.
> 
> It seems to fix most of it.  The only odd thing left that I
> noticed is that it leaves some of the inferiors there.  When I
> type "info inferiors" after running the program, I see one or
> two of them left.  I believe there should only be inferior #1
> left.

Hmm, indeed:

(gdb) info inferiors
  Num  Description       Executable
  4    process 8393      /home/pedro/bugs/src/test
  2    process 8388      /home/pedro/bugs/src/test
* 1    <null>            /home/pedro/bugs/src/test
(gdb) info threads

Calling prune_inferiors() at this point (from a top gdb)
does not remove them, because they still have inf->pid != 0.
Sounds like we miss mourning inferiors 2 and 4 somehow?

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 18:47         ` Pedro Alves
@ 2015-07-07 19:04           ` Pedro Alves
  2015-07-07 19:20             ` Pedro Alves
  0 siblings, 1 reply; 11+ messages in thread
From: Pedro Alves @ 2015-07-07 19:04 UTC (permalink / raw)
  To: Simon Marchi, Joel Brobecker; +Cc: gdb-patches

On 07/07/2015 07:47 PM, Pedro Alves wrote:

> Hmm, indeed:
> 
> (gdb) info inferiors
>   Num  Description       Executable
>   4    process 8393      /home/pedro/bugs/src/test
>   2    process 8388      /home/pedro/bugs/src/test
> * 1    <null>            /home/pedro/bugs/src/test
> (gdb) info threads
> 
> Calling prune_inferiors() at this point (from a top gdb)
> does not remove them, because they still have inf->pid != 0.
> Sounds like we miss mourning inferiors 2 and 4 somehow?

I think I got it.

Enabling logs (master + previous patch) I see:

...
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped)
WL: Handling extended status 0x03057f
LHEW: Got clone event from LWP 9513, new child is LWP 9579
[New Thread 0x7ffff37b8700 (LWP 9579)]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9508) exited.
                           ^^^^^^^^
[Thread 0x7ffff7fc2740 (LWP 9508) exited]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9499) exited.
[Thread 0x7ffff7fc2740 (LWP 9499) exited]
RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0
...
(gdb) info inferiors
  Num  Description       Executable
  5    process 9508      /home/pedro/bugs/src/test
               ^^^^
  4    process 9503      /home/pedro/bugs/src/test
  3    process 9500      /home/pedro/bugs/src/test
  2    process 9499      /home/pedro/bugs/src/test
* 1    <null>            /home/pedro/bugs/src/test
(gdb)
...

Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line.
That's this in wait_lwp:

      /* Check if the thread has exited.  */
      if (WIFEXITED (status) || WIFSIGNALED (status))
	{
	  thread_dead = 1;
	  if (debug_linux_nat)
	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
				target_pid_to_str (lp->ptid));
	}
    }

This code doesn't understand that an WIFEXITED status
of the last lwp of the process should be reported as
process exit.  Haven't tried to fix it yet.

I haven't tried 7.9, but I'd think it possible to trigger
this issue in all-stop there, because all it takes
is a whole process exiting while gdb is iterating
over lwps, stopping them.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 19:04           ` Pedro Alves
@ 2015-07-07 19:20             ` Pedro Alves
  2015-07-07 20:38               ` Simon Marchi
  2015-07-07 22:04               ` Simon Marchi
  0 siblings, 2 replies; 11+ messages in thread
From: Pedro Alves @ 2015-07-07 19:20 UTC (permalink / raw)
  To: Simon Marchi, Joel Brobecker; +Cc: gdb-patches

On 07/07/2015 08:04 PM, Pedro Alves wrote:

> ...
> WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped)
> WL: Handling extended status 0x03057f
> LHEW: Got clone event from LWP 9513, new child is LWP 9579
> [New Thread 0x7ffff37b8700 (LWP 9579)]
> WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited)
> WL: Thread 0x7ffff7fc2740 (LWP 9508) exited.
>                            ^^^^^^^^
> [Thread 0x7ffff7fc2740 (LWP 9508) exited]
> WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited)
> WL: Thread 0x7ffff7fc2740 (LWP 9499) exited.
> [Thread 0x7ffff7fc2740 (LWP 9499) exited]
> RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0
> ...
> (gdb) info inferiors
>   Num  Description       Executable
>   5    process 9508      /home/pedro/bugs/src/test
>                ^^^^
>   4    process 9503      /home/pedro/bugs/src/test
>   3    process 9500      /home/pedro/bugs/src/test
>   2    process 9499      /home/pedro/bugs/src/test
> * 1    <null>            /home/pedro/bugs/src/test
> (gdb)
> ...
> 
> Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line.
> That's this in wait_lwp:
> 
>       /* Check if the thread has exited.  */
>       if (WIFEXITED (status) || WIFSIGNALED (status))
> 	{
> 	  thread_dead = 1;
> 	  if (debug_linux_nat)
> 	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
> 				target_pid_to_str (lp->ptid));
> 	}
>     }
> 
> This code doesn't understand that an WIFEXITED status
> of the last lwp of the process should be reported as
> process exit.  Haven't tried to fix it yet.

This one (on top of the other) fixes this for me.  No
testsuite regressions on x86_64 F20.

-----
From 1290101d792c0e1d8c4e202cd7d900837db0ee84 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 7 Jul 2015 19:50:38 +0100
Subject: [PATCH] missing exit event

---
 gdb/linux-nat.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index ea38ebb..281a270 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2275,6 +2275,20 @@ wait_lwp (struct lwp_info *lp)
       /* Check if the thread has exited.  */
       if (WIFEXITED (status) || WIFSIGNALED (status))
 	{
+	  if (ptid_get_pid (lp->ptid) == ptid_get_lwp (lp->ptid))
+	    {
+	      if (debug_linux_nat)
+		fprintf_unfiltered (gdb_stdlog, "WL: Process %d exited.\n",
+				    ptid_get_pid (lp->ptid));
+
+	      /* This is the leader exiting, it means the whole
+		 process is gone.  Store the status to report to the
+		 core.  Store it in the lp->waitstatus, because
+		 W_EXITCODE(0,0) == 0.  */
+	      store_waitstatus (&lp->waitstatus, status);
+	      return 0;
+	    }
+
 	  thread_dead = 1;
 	  if (debug_linux_nat)
 	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
-- 
1.9.3


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 19:20             ` Pedro Alves
@ 2015-07-07 20:38               ` Simon Marchi
  2015-07-07 22:04               ` Simon Marchi
  1 sibling, 0 replies; 11+ messages in thread
From: Simon Marchi @ 2015-07-07 20:38 UTC (permalink / raw)
  To: Pedro Alves, Joel Brobecker; +Cc: gdb-patches

On 15-07-07 03:20 PM, Pedro Alves wrote:
> This one (on top of the other) fixes this for me.  No
> testsuite regressions on x86_64 F20.
> 
> -----
> From 1290101d792c0e1d8c4e202cd7d900837db0ee84 Mon Sep 17 00:00:00 2001
> From: Pedro Alves <palves@redhat.com>
> Date: Tue, 7 Jul 2015 19:50:38 +0100
> Subject: [PATCH] missing exit event
> 
> ---
>  gdb/linux-nat.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index ea38ebb..281a270 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2275,6 +2275,20 @@ wait_lwp (struct lwp_info *lp)
>        /* Check if the thread has exited.  */
>        if (WIFEXITED (status) || WIFSIGNALED (status))
>  	{
> +	  if (ptid_get_pid (lp->ptid) == ptid_get_lwp (lp->ptid))
> +	    {
> +	      if (debug_linux_nat)
> +		fprintf_unfiltered (gdb_stdlog, "WL: Process %d exited.\n",
> +				    ptid_get_pid (lp->ptid));
> +
> +	      /* This is the leader exiting, it means the whole
> +		 process is gone.  Store the status to report to the
> +		 core.  Store it in the lp->waitstatus, because
> +		 W_EXITCODE(0,0) == 0.  */
> +	      store_waitstatus (&lp->waitstatus, status);
> +	      return 0;
> +	    }
> +
>  	  thread_dead = 1;
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
> 

Indeed it looks good.  I'll work on the test.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 19:20             ` Pedro Alves
  2015-07-07 20:38               ` Simon Marchi
@ 2015-07-07 22:04               ` Simon Marchi
  2015-07-13 19:20                 ` Pedro Alves
  1 sibling, 1 reply; 11+ messages in thread
From: Simon Marchi @ 2015-07-07 22:04 UTC (permalink / raw)
  To: Pedro Alves, Joel Brobecker; +Cc: gdb-patches

On 15-07-07 03:20 PM, Pedro Alves wrote:
> This one (on top of the other) fixes this for me.  No
> testsuite regressions on x86_64 F20.
> 
> -----
> From 1290101d792c0e1d8c4e202cd7d900837db0ee84 Mon Sep 17 00:00:00 2001
> From: Pedro Alves <palves@redhat.com>
> Date: Tue, 7 Jul 2015 19:50:38 +0100
> Subject: [PATCH] missing exit event
> 
> ---
>  gdb/linux-nat.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index ea38ebb..281a270 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2275,6 +2275,20 @@ wait_lwp (struct lwp_info *lp)
>        /* Check if the thread has exited.  */
>        if (WIFEXITED (status) || WIFSIGNALED (status))
>  	{
> +	  if (ptid_get_pid (lp->ptid) == ptid_get_lwp (lp->ptid))
> +	    {
> +	      if (debug_linux_nat)
> +		fprintf_unfiltered (gdb_stdlog, "WL: Process %d exited.\n",
> +				    ptid_get_pid (lp->ptid));
> +
> +	      /* This is the leader exiting, it means the whole
> +		 process is gone.  Store the status to report to the
> +		 core.  Store it in the lp->waitstatus, because
> +		 W_EXITCODE(0,0) == 0.  */
> +	      store_waitstatus (&lp->waitstatus, status);
> +	      return 0;
> +	    }
> +
>  	  thread_dead = 1;
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
> 

Another question.  This is probably unrelated to the current issue, since
the behavior was the same in 7.9, but I am encountering it while writing
the test.  When running the program without the &, when should the prompt
come back?

See this transcript: http://pastebin.com/qyba8Ucd

We see that the prompt comes back the first time an inferior exits, well
before the end of the execution of inferior #1.  I would think that this
is not what we want.  If I run my program synchronously, then the inferior
forks and then the fork child exits, do we really want to bring back the
prompt at this point?  I would think that we should only show it when
the initial inferior that was ran exits (or if some other significant
event, such as a breakpoint hit, of course).

Would you agree?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Should this be on the blocker list for the 7.10 release?
  2015-07-07 22:04               ` Simon Marchi
@ 2015-07-13 19:20                 ` Pedro Alves
  0 siblings, 0 replies; 11+ messages in thread
From: Pedro Alves @ 2015-07-13 19:20 UTC (permalink / raw)
  To: Simon Marchi, Joel Brobecker; +Cc: gdb-patches

On 07/07/2015 11:04 PM, Simon Marchi wrote:

> Another question.  This is probably unrelated to the current issue, since
> the behavior was the same in 7.9, but I am encountering it while writing
> the test.  When running the program without the &, when should the prompt
> come back?

Yes, it's an issue currently that the prompt returns too eagerly.
It's "always" been that way though, not just in 7.9.

Your example was non-stop mode, but I'd say all-stop mode is even worse.
Consider the scenario of debugging the canonical multi-process
example (debug "make check" until some child process crashes).
Currently everything stops as soon as any child process exits...

> 
> See this transcript: http://pastebin.com/qyba8Ucd
> 
> We see that the prompt comes back the first time an inferior exits, well
> before the end of the execution of inferior #1.  I would think that this
> is not what we want.  If I run my program synchronously, then the inferior
> forks and then the fork child exits, do we really want to bring back the
> prompt at this point?  I would think that we should only show it when
> the initial inferior that was ran exits (or if some other significant
> event, such as a breakpoint hit, of course).

Yeah, I think we need something like this.  But should the prompt
be shown if the parent exits but the child carries on?  Why?
How do you define which process is special?

I think we should think of "run" as "create process" + "continue".
Because, say, you do "start", followed by "continue", and
then the inferior forks, and then child exits.  It's the same
situation.

Or any execution command really.  E.g., what if you already have two
processes under control of the debugger, and step a thread of inferior 2,
while running all threads of all processes
free (all-stop + schedule-multi on), and inferior 1 exits while the
step is still in progress?

It's not even clear if the prompt should be returned when a breakpoint
hits.  Since I know you've played with my itsets branch, let me put
this in terms of that branch:

Say you do "itfocus it1.1 next".  While "next" is in progress, a thread
from some other inferior (that was already running) hits a breakpoint, and
that breakpoint's suspend set does _not_ stop thread 1.1.  It would seem
that it would be reasonable to say that the "next" really hasn't finished yet,
and thread 1.1 should continue stepping?  It would follow then that
the prompt shouldn't be returned to the user yet.

Maybe what we need is for synchronous commands to define
the set of threads they're waiting for.  In "itfocus it1.1 next",
it would wait until any thread in inferior 1 reports a stop.
In "itfocus tt1.1 next", it would wait until thread 1.1 reports
a stop.  In "itfocus at1.1 next", it would wait until any thread
reports a stop.  Etc.
Since you can only run one synchronous command at a time,
the set of what to wait before the prompt is given back to the
user would be a property of the interpreter, not the thread.

> 
> Would you agree?
> 

I agree, though the devil is in the details.  I'd rather not
try to fix this for 7.10.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-07-13 19:20 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-06 20:26 Should this be on the blocker list for the 7.10 release? Simon Marchi
2015-07-07 13:25 ` Joel Brobecker
2015-07-07 16:18   ` Pedro Alves
2015-07-07 18:04     ` Pedro Alves
2015-07-07 18:35       ` Simon Marchi
2015-07-07 18:47         ` Pedro Alves
2015-07-07 19:04           ` Pedro Alves
2015-07-07 19:20             ` Pedro Alves
2015-07-07 20:38               ` Simon Marchi
2015-07-07 22:04               ` Simon Marchi
2015-07-13 19:20                 ` Pedro Alves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).