sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux

public inbox for gdb-patches@sourceware.org

* sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
@ 2015-08-04 18:07 Joel Brobecker
  2015-08-04 18:54 ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Brobecker @ 2015-08-04 18:07 UTC
  To: gdb-patches

Hello,

I have an issue debugging a program which uses threads, and which
only reproduces on SuSE 10; I'm not sure whether this is thanks
to pure luck, or whether something is specific to that particular
distro. The test is also sufficiently racy that it does not happen
all the time. Unfortunately, I can't share the code, and attempts at
creating another reproducer I can share have failed miserably.

I think I understand what's happening, but I am not sure what GDB ought
to be doing in that scenario.

In our scenario, we are trying to next/step our program after
having stopped at the start of its main subprogram:

    % gdb -q .obj/gprof/main
    (gdb) start
    (gdb) n
    (gdb) step
    [...]/infrun.c:2391: internal-error: resume: Assertion `sig != GDB_SIGNAL_0' failed.

The summary:

    The issue is related to the use of the _Unwind_DebugHook breakpoint.
    More precisely, the "next" happens to stop the inferior at an
    address which is the return address from the _Unwind_DebugHook, and
    then the "step" operation triggers the _Unwind_DebugHook breakpoint
    for another thread, before the initial thread had a chance to
    advance at all.  The handling of the _Unwind_DebugHook breakpoint
    causes GDB to insert a "bp_step_resume" breakpoint at the function's
    return address again, which happens to be the same as in the
    previous _Unwind_DebugHook call.  So, when GDB tries to step the
    initial thread again, it finds that it is stepping from a location
    that has a breakpoint with SIG being GDB_SIGNAL_0.
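A minimal sketch of the condition that ends up failing, as I understand
it from the "if" quoted later in this message. The enum and the function
name are made up for illustration; they are not GDB's own identifiers,
and only the two signal values mentioned in this report are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for GDB's signal enum.  */
enum model_signal { MODEL_SIGNAL_0, MODEL_SIGNAL_TRAP };

/* Return true when a resume request matches the situation the
   assertion guards against: a forward single-step that starts at an
   address where a breakpoint is inserted, with no signal to deliver.  */
static bool
resume_would_trip_assertion (bool reverse, bool step,
                             bool breakpoint_inserted_at_pc,
                             enum model_signal sig)
{
  /* Only the two modelled values are valid inputs.  */
  assert (sig == MODEL_SIGNAL_0 || sig == MODEL_SIGNAL_TRAP);

  if (!reverse && step && breakpoint_inserted_at_pc)
    return sig == MODEL_SIGNAL_0;
  return false;
}
```

In the failing "step", all three conditions hold for thread 3 at
0x80542a2 and the signal is GDB_SIGNAL_0, so the model returns true.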

The detailed analysis so far:

At the end of the "start" command, we are stopped at the start
of function Main in main.adb.

    (gdb) start
    Temporary breakpoint 1 at 0x805451e: file /[...]/main.adb, line 57.
    Starting program: /[...]/main
    [New Thread 0xb7e5eba0 (LWP 28377)]
    [New Thread 0xb7c5aba0 (LWP 28378)]
    [New Thread 0xb7a56ba0 (LWP 28379)]

    Temporary breakpoint 1, main () at /[...]/main.adb:57
    [...]

There are 4 threads in total, and we are in the main thread (which is
thread 1):

    (gdb) info thread
      Id   Target Id         Frame
      4    Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall ()
      3    Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall ()
      2    Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall ()
    * 1    Thread 0xb7ea18c0 (LWP 28370) main ()
        at /glinos.a/users/brobecke/ex/O701-042/Pre_Post_Sub/main.adb:57

All the logs below reference Thread ID/LWP, but I think it'll be easier
to talk about the threads by thread number. For instance, thread 1
is LWP 28370 while thread 3 is LWP 28378. So, I will translate in my
explanations the LWPs into thread numbers.

Back to our program. At this point, we attempt a "next" (from thread 1),
and here is what happens:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8054523

We've resumed thread 1 (LWP 28370), and received in return a signal
that the same thread stopped slightly further. It's still in the range
of instructions for the line of source we started the "next" from,
as evidenced by the following trace...

    infrun: stepping inside range [0x805451e-0x8054531]

... and thus, we decide to continue stepping the same thread:

    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523
    infrun: prepare_to_wait
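The "stepping inside range" decision can be sketched as a simple range
test. This is a model only: the struct and function names are
hypothetical, and treating the range as half-open (start included, end
excluded) is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a stepping range, taken as [start, end).  */
struct step_range
{
  uint32_t start;
  uint32_t end;
};

/* Return true if PC is still inside the range being stepped, in which
   case the stepped thread is simply resumed again.  */
static bool
pc_in_step_range (uint32_t pc, struct step_range r)
{
  assert (r.start <= r.end);
  return pc >= r.start && pc < r.end;
}
```

With the range from the trace, 0x8054523 is inside it, so thread 1 keeps
stepping; 0x80542a2, where the "next" eventually stops, is not.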

That's when we get an event from a different thread (thread 3):

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)

... which we find to be at the address where we set a breakpoint
on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's
why GDB reports for this event that this is...

    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and analyzing the arguments passed to this function, it finds
that the return address is 0x80542a2. That's what the next trace
means, when it says:

    infrun: exception resume at 80542a2

The trace above also indicates that GDB created an internal
"bp_step_resume" breakpoint at that address. The address seems
innocent right now, but it'll become important later on, during
the step.

And now that the event has been analyzed and we've determined
that we stopped at an exception breakpoint, GDB needs to step
thread 3 past the breakpoint it just hit. Thus, it temporarily
disables the exception breakpoint, and requests a step of that
thread:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0
    infrun: prepare_to_wait

We then get a notification, still from thread 3, that it's now past
that breakpoint...

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... so we can resume what we were doing before, which is single-stepping
thread 1 until we get to a new line of code:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523

The "resume" log above shows that we're resuming thread 1 from
where we left off (0x8054523). We get one more stop at 0x8054529,
which is still inside our stepping range so we go again. That's
when we get the following event, from thread 3:

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80542a2

Now the stop_pc address is interesting, because it's the address
of the "exception resume" breakpoint. GDB recognizes it as such...

    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME

... and since the location is at a different line of code,
this is where it decides the "next" operation should stop:

    infrun: stop_waiting
    [Switching to Thread 0xb7c5aba0 (LWP 28378)]
    0x080542a2 in inte_tache_rt.ttache_rt (
        <_task>=0x80968ec <inte_tache_rt_inst.tache2>)
        at /[...]/inte_tache_rt.adb:54
    54            end loop;

So far, things seem to be all normal. But the important element
to consider, here, is the fact that we stopped at an address that
also happens to be the address of the "exception resume" breakpoint.

Now that the plot has been set, we can try to "step", and see what
happens when things do not go well. First, GDB resumes the current
thread, which is now thread 3:

    (gdb) step
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2

Enter thread 4, whom we haven't heard of yet, who now triggers
our exception breakpoint at _Unwind_DebugHook...

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7a56ba0 (LWP 28379)
    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and once more, GDB finds that the "exception resume" breakpoint
should be set at 0x80542a2:

    infrun: exception resume at 80542a2

Same dance as before, we need to single-step that thread out of
the breakpoint, so:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7a56ba0 (LWP 28379)] at 0x80782d0
    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... at which point we get back to where we left off again, which is
stepping thread 3 because it hasn't "moved" yet:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7a56ba0 (LWP 28379) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2

The error happens during the resume, because we're trying to step
from a location which has a breakpoint...

  if (execution_direction != EXEC_REVERSE
      && step && breakpoint_inserted_here_p (aspace, pc))

... and the code has a very long explanation of why this can only
happen if we have a signal to deliver (hence the assertion).
Note that the "exception resume" breakpoint has an additional
condition that limits its effectiveness to thread 4 only, which
is the thread that hit the _Unwind_DebugHook breakpoint. But,
at the target level, breakpoints apply to all threads, and thread-specific
handling is done by GDB, by automatically restarting a program
when a non-matching thread hits the thread-specific breakpoint.

I am not really sure what to think of that, whether the assertion
is just too strict, or whether we need special handling of that
situation. In particular, I'm wondering whether the code in the
"if" requires the assertion to be true in order to be correct.
Here is what that code does:

      tp->stepped_breakpoint = 1;

      /* Most targets can step a breakpoint instruction, thus
         executing it normally.  But if this one cannot, just
         continue and we will hit it anyway.  */
      if (gdbarch_cannot_step_breakpoint (gdbarch))
        step = 0;

Just for kicks, I tried commenting out the assertion, and after
a couple of tries, I got at the end of the step...

    (gdb) step
    /[...]/infrun.c:4865: internal-error: process_event_stop_test: Assertion `ecs->event_thread->control.exception_resume_breakpoint != NULL' failed.

Any suggestion? Maybe catch this specific case where we're trying
to single-step a thread from the exception_resume_breakpoint by
adding some code that would single-step the one thread alone out
of that address, before resuming a more normal single-step operation?

Thanks!
-- 
Joel


* Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  2015-08-04 18:07 sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux Joel Brobecker
@ 2015-08-04 18:54 ` Pedro Alves
  2015-08-05  1:04   ` Joel Brobecker
  0 siblings, 1 reply; 6+ messages in thread
From: Pedro Alves @ 2015-08-04 18:54 UTC
  To: Joel Brobecker, gdb-patches

On 08/04/2015 07:07 PM, Joel Brobecker wrote:
> 
> Back to our program. At this point, we attempt a "next" (from thread 1),
> and here is what happens:

If the "next" is for thread 1,

> That's when we get an event from a different thread (thread 3):
> 
>     infrun: target_wait (-1.0.0, status) =
>     infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
>     infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
>     infrun: TARGET_WAITKIND_STOPPED
>     infrun: stop_pc = 0x80782d0
>     infrun: context switch
>     infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
> 
> ... which we find to be at the address where we set a breakpoint
> on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's
> why GDB reports for this event that this is...
> 
>     infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

Why are we getting this?  longjmp/exception/step-resume breakpoints
are thread-specific.

I'd guess that the bug is in bpstat_what:

struct bpstat_what
bpstat_what (bpstat bs_head)
{
...
	case bp_longjmp:
	case bp_longjmp_call_dummy:
	case bp_exception:
	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
	  retval.is_longjmp = bptype != bp_exception;
	  break;
...

This bit is not considering "if (bs->stop)" like e.g.,
the bp_step_resume case.

I've seen something like this trigger before, and have a patch
somewhere to rewrite bpstat_what differently which fixes that.
I never managed to write a testcase for it so never submitted
it.  But, could you try the simpler approach?  Try making that:

	  if (bs->stop)
	    {
	       this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
	       retval.is_longjmp = bptype != bp_exception;
	    }
	  else
	    this_action = BPSTAT_WHAT_SINGLE;
	  break;

Thanks,
Pedro Alves


* Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  2015-08-04 18:54 ` Pedro Alves
@ 2015-08-05  1:04   ` Joel Brobecker
  2015-08-05 10:16     ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Brobecker @ 2015-08-05  1:04 UTC
  To: Pedro Alves; +Cc: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 2516 bytes --]

> If the "next" is for thread 1,
> 
> > That's when we get an event from a different thread (thread 3):
> > 
> >     infrun: target_wait (-1.0.0, status) =
> >     infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
> >     infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
> >     infrun: TARGET_WAITKIND_STOPPED
> >     infrun: stop_pc = 0x80782d0
> >     infrun: context switch
> >     infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
> > 
> > ... which we find to be at the address where we set a breakpoint
> > on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's
> > why GDB reports for this event that this is...
> > 
> >     infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME
> 
> Why are we getting this?  longjmp/exception/step-resume breakpoints
> are thread-specific.
> 
> I'd guess that the bug is in bpstat_what:
> 
> struct bpstat_what
> bpstat_what (bpstat bs_head)
> {
> ...
> 	case bp_longjmp:
> 	case bp_longjmp_call_dummy:
> 	case bp_exception:
> 	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
> 	  retval.is_longjmp = bptype != bp_exception;
> 	  break;
> ...
> 
> This bit is not considering "if (bs->stop)" like e.g.,
> the bp_step_resume case.
> 
> I've seen something like this trigger before, and have a patch
> somewhere to rewrite bpstat_what differently which fixes that.
> I never managed to write a testcase for it so never submitted
> it.  But, could you try the simpler approach?  Try making that:
> 
> 	  if (bs->stop)
> 	    {
> 	       this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
> 	       retval.is_longjmp = bptype != bp_exception;
> 	    }
> 	  else
> 	    this_action = BPSTAT_WHAT_SINGLE;
> 	  break;

Ah ha, I missed the fact that the exception breakpoint is
thread-specific. Your fix seems to be working very well; thanks for the suggestion,
Pedro! Attached is a patch with a slightly altered analysis as the revision
log. Our SuSE 10 machine is very slow, so I tested it on a more modern
machine with a slightly different distro.

I'm wondering if we shouldn't be doing the same for:

        case bp_longjmp_resume:
        case bp_exception_resume:
          this_action = BPSTAT_WHAT_CLEAR_LONGJMP_RESUME;
          retval.is_longjmp = bptype == bp_longjmp_resume;
          break;
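If the same guard were applied to that case, it might look roughly like
the model below. This is only a sketch of the idea being asked about,
not code from the attached patch: the types and names are hypothetical,
and falling back to a plain single-step when the stop flag is clear is
an assumption carried over from the suggestion quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two breakpoint types in question.  */
enum model_bptype
{
  MODEL_BP_LONGJMP_RESUME,
  MODEL_BP_EXCEPTION_RESUME
};

enum model_action
{
  MODEL_SINGLE,
  MODEL_CLEAR_LONGJMP_RESUME
};

struct model_what
{
  enum model_action action;
  bool is_longjmp;
};

/* Guarded version of the resume case: only clear the resume
   breakpoint when the hit really counts for this thread (STOP).  */
static struct model_what
model_resume_case (enum model_bptype bptype, bool stop)
{
  struct model_what retval = { MODEL_SINGLE, false };

  assert (bptype == MODEL_BP_LONGJMP_RESUME
          || bptype == MODEL_BP_EXCEPTION_RESUME);

  if (stop)
    {
      retval.action = MODEL_CLEAR_LONGJMP_RESUME;
      retval.is_longjmp = bptype == MODEL_BP_LONGJMP_RESUME;
    }
  return retval;
}
```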

gdb/ChangeLog:

        Pedro Alves  <palves@redhat.com>
        * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
        <bp_exception>: Correctly handle the case where BS->STOP is not set.

Thanks!
-- 
Joel

[-- Attachment #2: 0001-sig-GDB_SIGNAL_0-failed-assertion-stepping-program-o.patch --]
[-- Type: text/x-diff, Size: 7656 bytes --]

From 8ff769070f12eafd1b858a63a184a4be9f9a6500 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 4 Aug 2015 23:40:08 +0200
Subject: [PATCH] sig != GDB_SIGNAL_0 failed assertion stepping program on
 GNU/Linux

Trying to next/step a program on GNU/Linux sometimes results in
the following failed assertion:

    % gdb -q .obj/gprof/main
    (gdb) start
    (gdb) n
    (gdb) step
    [...]/infrun.c:2391: internal-error:
    resume: Assertion `sig != GDB_SIGNAL_0' failed.

What happens is that, during the "next" operation, GDB hits
a longjmp/exception/step-resume breakpoint but fails to see that
this breakpoint was set for a different thread than the one being
stepped.

More precisely, at the end of the "start" command, we are stopped
at the start of function Main in main.adb; there are 4 threads in
total, and we are in the main thread (which is thread 1):

    (gdb) info thread
      Id   Target Id         Frame
      4    Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall ()
      3    Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall ()
      2    Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall ()
    * 1    Thread 0xb7ea18c0 (LWP 28370) main () at /[...]/main.adb:57

All the logs below reference Thread ID/LWP, but I think it'll be easier
to talk about the threads by thread number. For instance, thread 1
is LWP 28370 while thread 3 is LWP 28378. So, I will translate in my
explanations the LWPs into thread numbers.

Back to what happens while we are trying to "next" our program:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8054523

We've resumed thread 1 (LWP 28370), and received in return a signal
that the same thread stopped slightly further. It's still in the range
of instructions for the line of source we started the "next" from,
as evidenced by the following trace...

    infrun: stepping inside range [0x805451e-0x8054531]

... and thus, we decide to continue stepping the same thread:

    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523
    infrun: prepare_to_wait

That's when we get an event from a different thread (thread 3)...

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)

... which we find to be at the address where we set a breakpoint
on "the unwinder debug hook" (namely "_Unwind_DebugHook"). But GDB
fails to notice that the breakpoint was inserted for thread 1 only,
and so decides to handle it as...

    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and inserts a breakpoint at the corresponding resume address,
as evidenced by the next log:

    infrun: exception resume at 80542a2

That breakpoint seems innocent right now, but will play a role fairly
quickly. But for now, GDB has inserted the exception-resume breakpoint,
and needs to single-step thread 3 past the breakpoint it just hit. Thus,
it temporarily disables the exception breakpoint, and requests a step of
that thread:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0
    infrun: prepare_to_wait

We then get a notification, still from thread 3, that it's now past
that breakpoint...

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... so we can resume what we were doing before, which is single-stepping
thread 1 until we get to a new line of code:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523

The "resume" log above shows that we're resuming thread 1 from
where we left off (0x8054523). We get one more stop at 0x8054529,
which is still inside our stepping range so we go again. That's
when we get the following event, from thread 3:

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80542a2

Now the stop_pc address is interesting, because it's the address
of the "exception resume" breakpoint. GDB recognizes it as such...

    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME

... and since the location is at a different line of code,
this is where it decides the "next" operation should stop:

    infrun: stop_waiting
    [Switching to Thread 0xb7c5aba0 (LWP 28378)]
    0x080542a2 in inte_tache_rt.ttache_rt (
        <_task>=0x80968ec <inte_tache_rt_inst.tache2>)
        at /[...]/inte_tache_rt.adb:54
    54            end loop;

Instead, GDB should notice that the exception breakpoint we hit was
for a different thread, single-step that thread out of the breakpoint
_without_ inserting the exception-return breakpoint, and then resume
the single-stepping of the initial thread (thread 1) until it steps
out of the stepping range.

This is what this patch does, and after applying it, GDB now correctly
stops on the next line of code.

gdb/ChangeLog:

        Pedro Alves  <palves@redhat.com>
        * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
        <bp_exception>: Correctly handle the case where BS->STOP is not set.

Tested on x86_64-linux, no regressions.
---
 gdb/breakpoint.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/gdb/breakpoint.c b/gdb/breakpoint.c
index 74c7a7b..da4ee82 100644
--- a/gdb/breakpoint.c
+++ b/gdb/breakpoint.c
@@ -5778,8 +5778,13 @@ bpstat_what (bpstat bs_head)
 	case bp_longjmp:
 	case bp_longjmp_call_dummy:
 	case bp_exception:
-	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
-	  retval.is_longjmp = bptype != bp_exception;
+	  if (bs->stop)
+	    {
+	      this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
+	      retval.is_longjmp = bptype != bp_exception;
+	    }
+	  else
+	    this_action = BPSTAT_WHAT_SINGLE;
 	  break;
 	case bp_longjmp_resume:
 	case bp_exception_resume:
-- 
1.7.10.4



* Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  2015-08-05  1:04   ` Joel Brobecker
@ 2015-08-05 10:16     ` Pedro Alves
  2015-08-05 17:17       ` Joel Brobecker
  0 siblings, 1 reply; 6+ messages in thread
From: Pedro Alves @ 2015-08-05 10:16 UTC
  To: Joel Brobecker; +Cc: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 760 bytes --]

On 08/05/2015 02:04 AM, Joel Brobecker wrote:

> Ah ha, I missed the fact that the exception breakpoint is
> thread-specific. Your fix seems to be working very well; thanks for the suggestion,
> Pedro! Attached is a patch with a slightly altered analysis as the revision
> log. Our SuSE 10 machine is very slow, so I tested it on a more modern
> machine with a slightly different distro.
> 
> I'm wondering if we shouldn't be doing the same for:
> 
>         case bp_longjmp_resume:
>         case bp_exception_resume:
>           this_action = BPSTAT_WHAT_CLEAR_LONGJMP_RESUME;
>           retval.is_longjmp = bptype == bp_longjmp_resume;
>           break;

Yep.

I've now written a testcase that triggers this, and tweaked the
commit log a bit further.

WDYT?

[-- Attachment #2: 0001-stepping-is-disturbed-by-setjmp-longjmp-try-catch-in.patch --]
[-- Type: text/x-patch, Size: 13978 bytes --]

From 2f1c5a8d90cfb71068d425ebdacb5b3108f13014 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Wed, 5 Aug 2015 10:32:49 +0100
Subject: [PATCH] stepping is disturbed by setjmp/longjmp | try/catch in other
 threads

At https://sourceware.org/ml/gdb-patches/2015-08/msg00097.html, Joel
observed that trying to next/step a program on GNU/Linux sometimes
results in the following failed assertion:

    % gdb -q .obj/gprof/main
    (gdb) start
    (gdb) n
    (gdb) step
    [...]/infrun.c:2391: internal-error:
    resume: Assertion `sig != GDB_SIGNAL_0' failed.

What happened is that, during the "next" operation, GDB hit a
longjmp/exception/step-resume breakpoint but failed to see that this
breakpoint was set for a different thread than the one being stepped.

Joel's detailed analysis follows:

More precisely, at the end of the "start" command, we are stopped at
the start of function Main in main.adb; there are 4 threads in total,
and we are in the main thread (which is thread 1):

    (gdb) info thread
      Id   Target Id         Frame
      4    Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall ()
      3    Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall ()
      2    Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall ()
    * 1    Thread 0xb7ea18c0 (LWP 28370) main () at /[...]/main.adb:57

All the logs below reference Thread ID/LWP, but it'll be easier to
talk about the threads by GDB thread number.  For instance, thread 1
is LWP 28370 while thread 3 is LWP 28378.  So, the explanations below
translate the LWPs into thread numbers.

Back to what happens while we are trying to "next" our program:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8054523

We've resumed thread 1 (LWP 28370), and received in return a signal
that the same thread stopped slightly further.  It's still in the
range of instructions for the line of source we started the "next"
from, as evidenced by the following trace...

    infrun: stepping inside range [0x805451e-0x8054531]

... and thus, we decide to continue stepping the same thread:

    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523
    infrun: prepare_to_wait

That's when we get an event from a different thread (thread 3)...

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)

... which we find to be at the address where we set a breakpoint on
"the unwinder debug hook" (namely "_Unwind_DebugHook").  But GDB fails
to notice that the breakpoint was inserted for thread 1 only, and so
decides to handle it as...

    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and inserts a breakpoint at the corresponding resume address, as
evidenced by the next log:

    infrun: exception resume at 80542a2

That breakpoint seems innocent right now, but will play a role fairly
quickly.  But for now, GDB has inserted the exception-resume
breakpoint, and needs to single-step thread 3 past the breakpoint it
just hit.  Thus, it temporarily disables the exception breakpoint, and
requests a step of that thread:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0
    infrun: prepare_to_wait

We then get a notification, still from thread 3, that it's now past
that breakpoint...

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... so we can resume what we were doing before, which is single-stepping
thread 1 until we get to a new line of code:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523

The "resume" log above shows that we're resuming thread 1 from where
we left off (0x8054523).  We get one more stop at 0x8054529, which is
still inside our stepping range so we go again.  That's when we get
the following event, from thread 3:

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80542a2

Now the stop_pc address is interesting, because it's the address of
the "exception resume" breakpoint...

    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME

... and since that location is at a different line of code, this is
where it decides the "next" operation should stop:

    infrun: stop_waiting
    [Switching to Thread 0xb7c5aba0 (LWP 28378)]
    0x080542a2 in inte_tache_rt.ttache_rt (
        <_task>=0x80968ec <inte_tache_rt_inst.tache2>)
        at /[...]/inte_tache_rt.adb:54
    54            end loop;

However, GDB should have noticed earlier that the exception
breakpoint we hit was for a different thread, and thus should have
single-stepped that thread out of the breakpoint _without_ inserting
the exception-return breakpoint, and then resumed the single-stepping
of the initial thread (thread 1) until that thread stepped out of its
stepping range.

This is what this patch does, and after applying it, GDB now correctly
stops on the next line of code.
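
To make the intended behavior concrete, here is a minimal,
self-contained C sketch of the decision described above.  It is a toy
model, not GDB's code: every name prefixed with model_ is hypothetical,
the value -1 for "any thread" is a convention of this sketch only, and
it assumes the "stop" flag of a hit is left unset whenever a
thread-specific breakpoint is reported by some other thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the breakpoint kinds touched by the patch.  */
enum model_bptype
{
  MODEL_BP_LONGJMP,
  MODEL_BP_EXCEPTION,
  MODEL_BP_LONGJMP_RESUME,
  MODEL_BP_EXCEPTION_RESUME
};

/* Hypothetical stand-ins for the actions named in the logs and the diff.  */
enum model_action
{
  MODEL_SINGLE,
  MODEL_SET_LONGJMP_RESUME,
  MODEL_CLEAR_LONGJMP_RESUME
};

/* One breakpoint hit.  THREAD is the thread the breakpoint was inserted
   for; -1 means "any thread" (an assumption of this model).  */
struct model_bpstat
{
  enum model_bptype type;
  int thread;
  bool stop;
};

/* Set BS->stop only when the breakpoint applies to EVENT_THREAD, the
   thread that reported the hit.  */
void
model_check_thread (struct model_bpstat *bs, int event_thread)
{
  assert (bs != NULL);
  bs->stop = (bs->thread == -1 || bs->thread == event_thread);
}

/* Pick the action for a hit.  A hit whose stop flag is unset is only
   stepped past; no resume breakpoint is set or cleared for it.  */
enum model_action
model_bpstat_what (const struct model_bpstat *bs)
{
  assert (bs != NULL);
  if (!bs->stop)
    return MODEL_SINGLE;

  switch (bs->type)
    {
    case MODEL_BP_LONGJMP:
    case MODEL_BP_EXCEPTION:
      return MODEL_SET_LONGJMP_RESUME;
    case MODEL_BP_LONGJMP_RESUME:
    case MODEL_BP_EXCEPTION_RESUME:
      return MODEL_CLEAR_LONGJMP_RESUME;
    }
  return MODEL_SINGLE;
}
```

In the scenario from the logs, a hit on a breakpoint inserted for
thread 1 but reported by thread 3 would map to MODEL_SINGLE in this
model, so no resume breakpoint would be inserted for thread 3 at all.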

The patch adds a C++ test that exercises this, both for setjmp/longjmp
and exception breakpoints.  With an unpatched GDB it shows:

 (gdb) next
 [Switching to Thread 22445.22455]
 thread_try_catch (arg=0x0) at /home/pedro/gdb/mygit/build/../src/gdb/testsuite/gdb.threads/next-other-thr-longjmp.c:59
 59            catch (...)
 (gdb) FAIL: gdb.threads/next-other-thr-longjmp.exp: next to line 1
 next
 /home/pedro/gdb/mygit/build/../src/gdb/infrun.c:4865: internal-error: process_event_stop_test: Assertion `ecs->event_thread->control.exception_resume_breakpoint != NULL' failed.
 A problem internal to GDB has been detected,
 further debugging may prove unreliable.
 Quit this debugging session? (y or n) FAIL: gdb.threads/next-other-thr-longjmp.exp: next to line 2 (GDB internal error)
 Resyncing due to internal error.
 n

Tested on x86_64-linux, no regressions.

gdb/ChangeLog:
2015-08-05  Pedro Alves  <palves@redhat.com>
	    Joel Brobecker  <brobecker@adacore.com>

        * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
	<bp_exception, bp_longjmp_resume, bp_exception_resume>: Handle the
	case where BS->STOP is not set.

gdb/testsuite/ChangeLog:
2015-08-05  Pedro Alves  <palves@redhat.com>

	* gdb.threads/next-other-thr-longjmp.c: New file.
	* gdb.threads/next-other-thr-longjmp.exp: New file.
---
 gdb/breakpoint.c                                   |  18 +++-
 .../gdb.threads/next-while-other-thread-longjmps.c | 116 +++++++++++++++++++++
 .../next-while-other-thread-longjmps.exp           |  40 +++++++
 3 files changed, 170 insertions(+), 4 deletions(-)
 create mode 100644 gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.c
 create mode 100644 gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.exp

diff --git a/gdb/breakpoint.c b/gdb/breakpoint.c
index 78a694e..125b22f 100644
--- a/gdb/breakpoint.c
+++ b/gdb/breakpoint.c
@@ -5752,13 +5752,23 @@ bpstat_what (bpstat bs_head)
 	case bp_longjmp:
 	case bp_longjmp_call_dummy:
 	case bp_exception:
-	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
-	  retval.is_longjmp = bptype != bp_exception;
+	  if (bs->stop)
+	    {
+	      this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
+	      retval.is_longjmp = bptype != bp_exception;
+	    }
+	  else
+	    this_action = BPSTAT_WHAT_SINGLE;
 	  break;
 	case bp_longjmp_resume:
 	case bp_exception_resume:
-	  this_action = BPSTAT_WHAT_CLEAR_LONGJMP_RESUME;
-	  retval.is_longjmp = bptype == bp_longjmp_resume;
+	  if (bs->stop)
+	    {
+	      this_action = BPSTAT_WHAT_CLEAR_LONGJMP_RESUME;
+	      retval.is_longjmp = bptype == bp_longjmp_resume;
+	    }
+	  else
+	    this_action = BPSTAT_WHAT_SINGLE;
 	  break;
 	case bp_step_resume:
 	  if (bs->stop)
diff --git a/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.c b/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.c
new file mode 100644
index 0000000..b786430
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.c
@@ -0,0 +1,116 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2015 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <pthread.h>
+#include <unistd.h>
+#include <setjmp.h>
+
+/* Number of threads.  */
+#define NTHREADS 10
+
+/* When set, threads exit.  */
+volatile int break_out;
+
+/* Entry point for threads that setjmp/longjmp.  */
+
+static void *
+thread_longjmp (void *arg)
+{
+  jmp_buf env;
+
+  while (!break_out)
+    {
+      if (setjmp (env) == 0)
+	longjmp (env, 1);
+
+      usleep (1);
+    }
+  return NULL;
+}
+
+/* Entry point for threads that try/catch.  */
+
+static void *
+thread_try_catch (void *arg)
+{
+  volatile unsigned int counter = 0;
+
+  while (!break_out)
+    {
+      try
+	{
+	  throw 1;
+	}
+      catch (...)
+	{
+	  counter++;
+	}
+
+      usleep (1);
+    }
+  return NULL;
+}
+
+int
+main (void)
+{
+  pthread_t threads[NTHREADS];
+  int i;
+  int ret;
+
+  /* Don't run forever.  */
+  alarm (180);
+
+  for (i = 0; i < NTHREADS; i++)
+    {
+      /* Half the threads do setjmp/longjmp, the other half do
+	 try/catch.  */
+      if ((i % 2) == 0)
+	ret = pthread_create (&threads[i], NULL, thread_longjmp , NULL);
+      else
+	ret = pthread_create (&threads[i], NULL, thread_try_catch , NULL);
+      assert (ret == 0);
+    }
+
+#define LINE usleep (1)
+
+  /* The other threads' setjmp/longjmp/try/catch should not disturb
+     this thread's stepping over these lines.  */
+
+  LINE; /* set break here */
+  LINE; /* line 1 */
+  LINE; /* line 2 */
+  LINE; /* line 3 */
+  LINE; /* line 4 */
+  LINE; /* line 5 */
+  LINE; /* line 6 */
+  LINE; /* line 7 */
+  LINE; /* line 8 */
+  LINE; /* line 9 */
+  LINE; /* line 10 */
+
+  break_out = 1;
+
+  for (i = 0; i < NTHREADS; i++)
+    {
+      ret = pthread_join (threads[i], NULL);
+      assert (ret == 0);
+    }
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.exp b/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.exp
new file mode 100644
index 0000000..72a0617
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/next-while-other-thread-longjmps.exp
@@ -0,0 +1,40 @@
+# Copyright (C) 2015 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This test has the main thread step over a few lines, while a few
+# threads constantly do setjmp/longjmp and others do try/catch.  The
+# "next" commands in the main thread should be able to complete
+# undisturbed.
+
+standard_testfile
+
+set linenum [gdb_get_line_number "set break here"]
+
+if {[prepare_for_testing "failed to prepare" \
+	 $testfile $srcfile {c++ debug pthreads}] == -1} {
+    return -1
+}
+
+if ![runto_main] then {
+    fail "Can't run to main"
+    return 0
+}
+
+gdb_breakpoint $linenum
+gdb_continue_to_breakpoint "start line"
+
+for {set i 1} {$i <= 10} {incr i} {
+    gdb_test "next" " line $i .*" "next to line $i"
+}
-- 
1.9.3



* Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  2015-08-05 10:16     ` Pedro Alves
@ 2015-08-05 17:17       ` Joel Brobecker
  2015-08-05 19:14         ` Pedro Alves
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Brobecker @ 2015-08-05 17:17 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches

> I've now written a testcase that triggers this, and tweaked the
> commit log a bit further.
> 
> WDYT?

Awesome having a testcase! It looks very good to me - I quickly tested
it on our SuSE 10 machine with our example, just in case, and also
ran AdaCore's testsuite just for kicks. Still all good.

Thanks!

> gdb/ChangeLog:
> 2015-08-05  Pedro Alves  <palves@redhat.com>
> 	    Joel Brobecker  <brobecker@adacore.com>
> 
>         * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
> 	<bp_exception, bp_longjmp_resume, bp_exception_resume>: Handle the
> 	case where BS->STOP is not set.
> 
> gdb/testsuite/ChangeLog:
> 2015-08-05  Pedro Alves  <palves@redhat.com>
> 
> 	* gdb.threads/next-other-thr-longjmp.c: New file.
> 	* gdb.threads/next-other-thr-longjmp.exp: New file.

-- 
Joel


* Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  2015-08-05 17:17       ` Joel Brobecker
@ 2015-08-05 19:14         ` Pedro Alves
  0 siblings, 0 replies; 6+ messages in thread
From: Pedro Alves @ 2015-08-05 19:14 UTC (permalink / raw)
  To: Joel Brobecker; +Cc: gdb-patches

On 08/05/2015 06:17 PM, Joel Brobecker wrote:
>> I've now written a testcase that triggers this, and tweaked the
>> commit log a bit further.
>>
>> WDYT?
> 
> Awesome having a testcase! It looks very good to me - I quickly tested
> it on our SuSE 10 machine with our example, just in case, and also
> ran AdaCore's testsuite just for kicks. Still all good.

Alright, now pushed to master and 7.10.

Thanks,
Pedro Alves


Thread overview: 6+ messages
2015-08-04 18:07 sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux Joel Brobecker
2015-08-04 18:54 ` Pedro Alves
2015-08-05  1:04   ` Joel Brobecker
2015-08-05 10:16     ` Pedro Alves
2015-08-05 17:17       ` Joel Brobecker
2015-08-05 19:14         ` Pedro Alves
