Strange behavior of sigstep-threads.exp?

public inbox for gdb@sourceware.org
 help / color / mirror / Atom feed

* Strange behavior of sigstep-threads.exp?
@ 2012-11-08 15:51 Andreas Arnez
  2012-11-12 16:38 ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Arnez @ 2012-11-08 15:51 UTC (permalink / raw)
  To: jan.kratochvil, Pedro Alves; +Cc: gdb

On x86-64 with upstream GDB, I'm observing what I consider strange
behavior of the sigstep-threads.exp test case.

Here's an excerpt from the log file:

  32        var++;                /* step-1 */
  (gdb) disable $step1
  (gdb) step

  Program received signal SIGUSR1, User defined signal 1.

  Program received signal SIGUSR1, User defined signal 1.

  Program received signal SIGUSR1, User defined signal 1.

  Program received signal SIGUSR1, User defined signal 1.
  33        tgkill (getpid (), gettid (), SIGUSR1);       /* step-2 */

It seems that running from step-1 to step-2 produced four SIGUSR1.  And
in total ca. 3 times as many signals are encountered than we pass
tgkill()'s:

  $ grep tgkill gdb.log | wc
      100    1000    5900
  $ grep 'Program received signal SIGUSR1' gdb.log | wc
      310    2480   17670

From the code I wouldn't expect that.  Is this a known bug or expected
behavior?  Note that the test case doesn't fail, since it doesn't care
about the number of incoming signals.

On s390x the test case actually fails sometimes.  In those cases, when
stepping from step-1 to step-2, a ton of SIGUSR1 are indicated, and then
the inferior seems to stop at the closing brace of the handler()
function instead of the tgkill().

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Strange behavior of sigstep-threads.exp?
  2012-11-08 15:51 Strange behavior of sigstep-threads.exp? Andreas Arnez
@ 2012-11-12 16:38 ` Pedro Alves
  2012-11-13 15:21   ` Andreas Arnez
  0 siblings, 1 reply; 6+ messages in thread
From: Pedro Alves @ 2012-11-12 16:38 UTC (permalink / raw)
  To: Andreas Arnez; +Cc: jan.kratochvil, gdb

On 11/08/2012 03:50 PM, Andreas Arnez wrote:
> On x86-64 with upstream GDB, I'm observing what I consider strange
> behavior of the sigstep-threads.exp test case.
> 
> Here's an excerpt from the log file:
> 
>   32        var++;                /* step-1 */
>   (gdb) disable $step1
>   (gdb) step
> 
>   Program received signal SIGUSR1, User defined signal 1.
> 
>   Program received signal SIGUSR1, User defined signal 1.
> 
>   Program received signal SIGUSR1, User defined signal 1.
> 
>   Program received signal SIGUSR1, User defined signal 1.
>   33        tgkill (getpid (), gettid (), SIGUSR1);       /* step-2 */
> 
> It seems that running from step-1 to step-2 produced four SIGUSR1.  And
> in total ca. 3 times as many signals are encountered than we pass
> tgkill()'s:
> 
>   $ grep tgkill gdb.log | wc
>       100    1000    5900
>   $ grep 'Program received signal SIGUSR1' gdb.log | wc
>       310    2480   17670
> 
> From the code I wouldn't expect that.  Is this a known bug or expected
> behavior?  Note that the test case doesn't fail, since it doesn't care
> about the number of incoming signals.

static void
handler (int signo)	/* step-0 */
{			/* step-0 */
  var++;		/* step-1 */
  tgkill (getpid (), gettid (), SIGUSR1);	/* step-2 */
}

static void *
start (void *arg)
{
  tgkill (getpid (), gettid (), SIGUSR1);
  assert (0);
  return NULL;
}

int
main (void)
{
  pthread_t thread;

  signal (SIGUSR1, handler);

  pthread_create (&thread, NULL, start, NULL);
  start (NULL);	/* main-start */
  return 0;
}

There are two threads in the test.  The main thread, and another
spawned by pthread_create.  They both start at "start", and
send SIGUSR1 to themselves.  The SIGUSR1 handler sends a
SIGUSR1 to the current thread, so each thread is constantly re-entering
"handler" ad infinitum.

The test does "handle SIGUSR1 nostop print pass", which means that
control is never passed to the user whenever the program gets a
SIGUSR1, but GDB still prints the signal (the "Program received ..." lines).

The test then steps _one_ of the threads (the one that happens to reach
the "step-1" line first), let's call it thread #1.  This "step" involves
multiple single-steps behind the scenes.  Whenever each of those single-steps
is done, the other thread (let's call it thread #2) is allowed to run free (the
"set scheduler-locking off" setting).  Occasionally, that other thread
will run enough to send itself another SIGUSR1.  And then, occasionally
GDB will see that SIGUSR1 in thread #2, print "Program received", and
go back to single-stepping thread #1, until it reaches a different line.

So "grep tgkill gdb.log" will only count the "step"s that have passed
through the "tgkill (getpid (), ..." line, while "grep 'Program received
signal SIGUSR1'" counts both threads sending themselves SIGUSR1.

Moreover, since thread #2 is "continued", while thread #1 is
single-stepped, it is expected that thread #2 makes progress faster, hence
the 3 to 1 ratio you see.  I see 4 to 1 on my machine even.

> On s390x the test case actually fails sometimes.  In those cases, when
> stepping from step-1 to step-2, a ton of SIGUSR1 are indicated, and then
> the inferior seems to stop at the closing brace of the handler()
> function instead of the tgkill().

That does sound like something's wrong.  Hacking the the test to force
"set debug infrun 1" and "set debug lin-lwp 1" would be my first move.

I wonder if this makes a difference?

 gdb/infrun.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gdb/infrun.c b/gdb/infrun.c
index de2cf19..9621b84 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -4698,8 +4698,10 @@ process_event_stop_test:
 	  ecs->event_thread = tp;
 	  ecs->ptid = tp->ptid;
 	  context_switch (ecs->ptid);
-	  keep_going (ecs);
-	  return;
+
+	  /* Keep checking.  The stepped thread might have already
+	     reached its destination, but not have reported it yet.
+	     If we just kept going, we could end up overstepping.  */
 	}
     }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Strange behavior of sigstep-threads.exp?
  2012-11-12 16:38 ` Pedro Alves
@ 2012-11-13 15:21   ` Andreas Arnez
  2012-11-14 16:15     ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Arnez @ 2012-11-13 15:21 UTC (permalink / raw)
  To: Pedro Alves; +Cc: jan.kratochvil, gdb

Pedro Alves <palves@redhat.com> writes:

> Whenever each of those single-steps is done, the other thread (let's
> call it thread #2) is allowed to run free (the "set scheduler-locking
> off" setting).

Got it.  Seems I misunderstood the "step" behavior in this case.  Thanks
for the explanation.

>> On s390x the test case actually fails sometimes.  In those cases,
>> when stepping from step-1 to step-2, a ton of SIGUSR1 are indicated,
>> and then the inferior seems to stop at the closing brace of the
>> handler() function instead of the tgkill().
>
> That does sound like something's wrong.  Hacking the the test to force
> "set debug infrun 1" and "set debug lin-lwp 1" would be my first move.

As soon as "lin-lwp" debugging is turned on the test always seems to
succeed.  But with "debug infrun" alone the failure still occurs, and I
observe the following:

1. The stepped thread reaches the last instruction inside the stepping
range.

2. After resuming the stepped thread again, it traps at getpid@plt.
Which is curious, because getpid() shouldn't be called until the
instruction _after_ the stepping range.  It seems like the trap for that
instruction was missed somehow.  (In the good case the thread always
traps at the subroutine call, before having carried out the call.)

3. The thread is single-stepped until the jump to getpid().  The
getpid() invocation itself is skipped with a step-resume breakpoint on
the instruction after the original subroutine call.

4. The step-resume breakpoint is reached.  Despite now being well
outside the original stepping range, the thread is resumed.  Upon the
next trap, an updated stepping range is shown, adjusted to fit the line
of the tgkill().  Then stepping continues until the next line, which is
the closing brace.

> I wonder if this makes a difference?
>
>  gdb/infrun.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gdb/infrun.c b/gdb/infrun.c
> index de2cf19..9621b84 100644
> --- a/gdb/infrun.c
> +++ b/gdb/infrun.c
> @@ -4698,8 +4698,10 @@ process_event_stop_test:
>  	  ecs->event_thread = tp;
>  	  ecs->ptid = tp->ptid;
>  	  context_switch (ecs->ptid);
> -	  keep_going (ecs);
> -	  return;
> +
> +	  /* Keep checking.  The stepped thread might have already
> +	     reached its destination, but not have reported it yet.
> +	     If we just kept going, we could end up overstepping.  */
>  	}
>      }

Yes, it does make a difference.  The test case still fails at a similar
rate as before, but this time after "continue", because the inferior
reaches "assert (0)".  Again, I can not reproduce this failure with "set
debug infrun 1" and "set debug lin-lwp 1".

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Strange behavior of sigstep-threads.exp?
  2012-11-13 15:21   ` Andreas Arnez
@ 2012-11-14 16:15     ` Pedro Alves
  2012-11-21 18:47       ` Andreas Arnez
  0 siblings, 1 reply; 6+ messages in thread
From: Pedro Alves @ 2012-11-14 16:15 UTC (permalink / raw)
  To: Andreas Arnez; +Cc: jan.kratochvil, gdb

On 11/13/2012 03:21 PM, Andreas Arnez wrote:
> Pedro Alves <palves@redhat.com> writes:
> 
>> > Whenever each of those single-steps is done, the other thread (let's
>> > call it thread #2) is allowed to run free (the "set scheduler-locking
>> > off" setting).
> Got it.  Seems I misunderstood the "step" behavior in this case.  Thanks
> for the explanation.
> 
>>> >> On s390x the test case actually fails sometimes.  In those cases,
>>> >> when stepping from step-1 to step-2, a ton of SIGUSR1 are indicated,
>>> >> and then the inferior seems to stop at the closing brace of the
>>> >> handler() function instead of the tgkill().
>> >
>> > That does sound like something's wrong.  Hacking the the test to force
>> > "set debug infrun 1" and "set debug lin-lwp 1" would be my first move.
> As soon as "lin-lwp" debugging is turned on the test always seems to
> succeed.  But with "debug infrun" alone the failure still occurs, and I
> observe the following:
> 
> 1. The stepped thread reaches the last instruction inside the stepping
> range.
> 
> 2. After resuming the stepped thread again, it traps at getpid@plt.
> Which is curious, because getpid() shouldn't be called until the
> instruction _after_ the stepping range.  It seems like the trap for that
> instruction was missed somehow.  (In the good case the thread always
> traps at the subroutine call, before having carried out the call.)

Indeed that's odd.  Is the event that is reported to GDB _before_ this
trap at getpid@plt a SIGUSR1 for the other thread?
linux-nat.c always gives preference to a SIGTRAP over other
signals, so it's unexpected that a trap could be lost.  Maybe while GDB goes
about stopping all threads with SIGSTOP (in effect, only the single-stepped
thread), the single step has actually completed, but the kernel manages
to report the SIGSTOP first, for some bizarre reason?  IOW, the kernel loses
the trap.  You could try to add some prints of the current PC of all lwps
at each event, so see if the stepped lwp makes the expected progress, but
we miss the trap, or even perhaps see if the lwp jumps between the last instruction
in the step range and getpid@plt, without the kernel ever reporting any
kind of stop at the branch.  Just adding a few prints instead of full
lin-lwp debug might be non disturbing enough.

> 
> 3. The thread is single-stepped until the jump to getpid().  The
> getpid() invocation itself is skipped with a step-resume breakpoint on
> the instruction after the original subroutine call.
> 
> 4. The step-resume breakpoint is reached.  Despite now being well
> outside the original stepping range, the thread is resumed.  Upon the
> next trap, an updated stepping range is shown, adjusted to fit the line
> of the tgkill().  Then stepping continues until the next line, which is
> the closing brace.
> 
>> > I wonder if this makes a difference?
>> >
>> >  gdb/infrun.c |    6 ++++--
>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/gdb/infrun.c b/gdb/infrun.c
>> > index de2cf19..9621b84 100644
>> > --- a/gdb/infrun.c
>> > +++ b/gdb/infrun.c
>> > @@ -4698,8 +4698,10 @@ process_event_stop_test:
>> >  	  ecs->event_thread = tp;
>> >  	  ecs->ptid = tp->ptid;
>> >  	  context_switch (ecs->ptid);
>> > -	  keep_going (ecs);
>> > -	  return;
>> > +
>> > +	  /* Keep checking.  The stepped thread might have already
>> > +	     reached its destination, but not have reported it yet.
>> > +	     If we just kept going, we could end up overstepping.  */
>> >  	}
>> >      }
> Yes, it does make a difference.  The test case still fails at a similar
> rate as before, but this time after "continue", because the inferior
> reaches "assert (0)".  Again, I can not reproduce this failure with "set
> debug infrun 1" and "set debug lin-lwp 1".

Hmm.  This confirms there's something not expected with single-step traps, like
we discuss above.  I shouldn't really make a difference for hardware step
Linux targets, as the backend always gives preference to single-step traps.  The
patch was originally for fixing something for software single-step targets, only.
If the assert is reached, a SIGUSR1 is mistakenly swallowed, which was the
whole point of the test.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Strange behavior of sigstep-threads.exp?
  2012-11-14 16:15     ` Pedro Alves
@ 2012-11-21 18:47       ` Andreas Arnez
  2012-11-21 19:19         ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Arnez @ 2012-11-21 18:47 UTC (permalink / raw)
  To: Pedro Alves; +Cc: jan.kratochvil, gdb

Pedro Alves <palves@redhat.com> writes:

> linux-nat.c always gives preference to a SIGTRAP over other signals,
> so it's unexpected that a trap could be lost.  Maybe while GDB goes
> about stopping all threads with SIGSTOP (in effect, only the
> single-stepped thread), the single step has actually completed, but
> the kernel manages to report the SIGSTOP first, for some bizarre
> reason?  IOW, the kernel loses the trap.

Right, the kernel loses the trap.

While processing the SIGUSR1 from thread #2, GDB stops all other threads
with SIGSTOP -- in this case the single-stepped thread #1.  Now if the
SIGSTOP is sent after thread #1 has received the single-step exception,
but before the kernel has decided what signals to present, the kernel
suppresses the SIGTRAP and just delivers the SIGSTOP instead.

Discussed this with the s390 kernel maintainer.  A kernel patch is on
the way.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Strange behavior of sigstep-threads.exp?
  2012-11-21 18:47       ` Andreas Arnez
@ 2012-11-21 19:19         ` Pedro Alves
  0 siblings, 0 replies; 6+ messages in thread
From: Pedro Alves @ 2012-11-21 19:19 UTC (permalink / raw)
  To: Andreas Arnez; +Cc: jan.kratochvil, gdb

On 11/21/2012 06:46 PM, Andreas Arnez wrote:
> Pedro Alves <palves@redhat.com> writes:
> 
>> linux-nat.c always gives preference to a SIGTRAP over other signals,
>> so it's unexpected that a trap could be lost.  Maybe while GDB goes
>> about stopping all threads with SIGSTOP (in effect, only the
>> single-stepped thread), the single step has actually completed, but
>> the kernel manages to report the SIGSTOP first, for some bizarre
>> reason?  IOW, the kernel loses the trap.
> 
> Right, the kernel loses the trap.

Great (that GDB isn't at fault).  :-)

> While processing the SIGUSR1 from thread #2, GDB stops all other threads
> with SIGSTOP -- in this case the single-stepped thread #1.  Now if the
> SIGSTOP is sent after thread #1 has received the single-step exception,
> but before the kernel has decided what signals to present, the kernel
> suppresses the SIGTRAP and just delivers the SIGSTOP instead.
> 
> Discussed this with the s390 kernel maintainer.  A kernel patch is on
> the way.

Thanks.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-11-21 19:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-08 15:51 Strange behavior of sigstep-threads.exp? Andreas Arnez
2012-11-12 16:38 ` Pedro Alves
2012-11-13 15:21   ` Andreas Arnez
2012-11-14 16:15     ` Pedro Alves
2012-11-21 18:47       ` Andreas Arnez
2012-11-21 19:19         ` Pedro Alves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).