public inbox for gdb@sourceware.org
* "finish" command leads to SIGTRAP
@ 2019-02-21 11:21 David Griffiths
  2019-02-21 11:24 ` Pedro Alves
  0 siblings, 1 reply; 15+ messages in thread
From: David Griffiths @ 2019-02-21 11:21 UTC (permalink / raw)
  To: gdb

I have a strange situation where issuing the "finish" command always leads
to a SIGTRAP (this is gdb 8.1 on Ubuntu 16.04). Once this SIGTRAP occurs
every continue also produces SIGTRAP. Completely reproducible. In the run
up to the finish I'm doing single steps from a previous breakpoint:

=====

(gdb) display/i $pc
1: x/i $pc
=> 0x7fffe1923b84:    movabs $0x7ffff6d33b00,%r10
(gdb) si
0x00007fffe1923b8e in ?? ()
1: x/i $pc
=> 0x7fffe1923b8e:    callq  *%r10
(gdb)
0x00007ffff6d33b00 in os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b00 <_ZN2os14javaTimeMillisEv>:    push   %rbp
(gdb) finish
Run till exit from #0  0x00007ffff6d33b00 in os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so

Thread 2 "java" received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff6d33b01 in os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b01 <_ZN2os14javaTimeMillisEv+1>:    xor    %esi,%esi
(gdb) c
Continuing.

Thread 2 "java" received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff6d33b03 in os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b03 <_ZN2os14javaTimeMillisEv+3>:    mov    %rsp,%rbp

=====

Even more strangely I can execute finish on that function in general, e.g.
if I set a breakpoint on it:

=====

(gdb) br os::javaTimeMillis
Breakpoint 1 at 0x7ffff6d33b00
(gdb) c
Continuing.
[Switching to Thread 0x7ffff7fd8700 (LWP 12502)]

Thread 2 "java" hit Breakpoint 1, 0x00007ffff6d33b00 in
os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
(gdb) finish
Run till exit from #0  0x00007ffff6d33b00 in os::javaTimeMillis() () from
/mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
0x00007fffe1b4f75c in ?? ()
(gdb)

=====

So there must be something about the environment when it occurs but I don't
know what. And by the way the code runs fine without the finish/single
steps/etc.

I need it to work because I'm trying to automate something via gdb/MI. Any
suggestions as to how to debug this would be very welcome.

Thanks,

David
-- 

David Griffiths, Senior Software Engineer

Undo <https://undo.io> | Resolve even the most challenging software defects
with software flight recorder technology

Software reliability report: optimizing the software supplier and customer
relationship
<https://info.undo.io/software-reliability-report-optimizing-supplier-and-customer-relationship>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 11:21 "finish" command leads to SIGTRAP David Griffiths
@ 2019-02-21 11:24 ` Pedro Alves
  2019-02-21 12:13   ` David Griffiths
  0 siblings, 1 reply; 15+ messages in thread
From: Pedro Alves @ 2019-02-21 11:24 UTC (permalink / raw)
  To: David Griffiths, gdb

On 02/21/2019 11:21 AM, David Griffiths wrote:
> 
> I need it to work because I'm trying to automate something via gdb/MI. Any
> suggestions as to how to debug this would be very welcome.

Start with "set debug infrun 1".

And then "set debug lin-lwp 1" if debugging natively, or
"set debug remote 1" if using the remote serial protocol.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 11:24 ` Pedro Alves
@ 2019-02-21 12:13   ` David Griffiths
  2019-02-21 12:17     ` David Griffiths
  0 siblings, 1 reply; 15+ messages in thread
From: David Griffiths @ 2019-02-21 12:13 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb

Ok thanks, did that. If I compare the output for the bad case with the good
case, this seems to be the main difference:

< infrun: proceed: resuming Thread 0x7ffff7fd8700 (LWP 12901)
< infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current
thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b00
< LLR: Preparing to resume Thread 0x7ffff7fd8700 (LWP 12901), 0,
inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
< LLR: PTRACE_CONT Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event
thread)
---
> infrun: step-over queue now empty
> infrun: resuming [Thread 0x7ffff7fd8700 (LWP 12901)] for step-over
> infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current
thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b00
> LLR: Preparing to step Thread 0x7ffff7fd8700 (LWP 12901), 0,
inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
> LLR: PTRACE_SINGLESTEP Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event
thread)
10a11
> infrun: proceed: [Thread 0x7ffff7fd8700 (LWP 12901)] resumed
27c28,60
< infrun: random signal (GDB_SIGNAL_TRAP)
---
> infrun: no stepping, continue
> infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current
thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b01

Cheers,

David


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 12:13   ` David Griffiths
@ 2019-02-21 12:17     ` David Griffiths
  2019-02-21 13:12       ` Pedro Alves
  0 siblings, 1 reply; 15+ messages in thread
From: David Griffiths @ 2019-02-21 12:17 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb

Oh, I should add a bit extra to the end because in the good case it is also
doing the PTRACE_CONT:

> LLR: Preparing to resume Thread 0x7ffff7fd8700 (LWP 12901), 0,
inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
> LLR: PTRACE_CONT Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event
thread)
> infrun: prepare_to_wait
> linux_nat_wait: [process -1], [TARGET_WNOHANG]
> RSRL: NOT resuming LWP Thread 0x7ffff7fd8700 (LWP 12901), not stopped
> LLW: enter
> LNW: waitpid(-1, ...) returned 0, ERRNO-OK
> RSRL: NOT resuming LWP Thread 0x7ffff7fd8700 (LWP 12901), not stopped
> LLW: exit (ignore)

etc


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 12:17     ` David Griffiths
@ 2019-02-21 13:12       ` Pedro Alves
  2019-02-21 15:55         ` David Griffiths
  0 siblings, 1 reply; 15+ messages in thread
From: Pedro Alves @ 2019-02-21 13:12 UTC (permalink / raw)
  To: David Griffiths; +Cc: gdb

Might be unrelated, but ISTR that there used to be a kernel bug
that would lead to the cpu's trace flag getting stuck set
when you step in a signal handler.  That would result in
SIGTRAP happening at every step from that point on.  Could
that be the case here?

I'd look at "set debug displaced on" too.  Otherwise, it's a matter
of staring at the logs, and trying to understand what is happening.
Basically, "finish" sets a breakpoint at the caller and runs to it.
But all sorts of other things can happen behind the scenes.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 13:12       ` Pedro Alves
@ 2019-02-21 15:55         ` David Griffiths
  2019-02-21 17:50           ` Pedro Alves
  0 siblings, 1 reply; 15+ messages in thread
From: David Griffiths @ 2019-02-21 15:55 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb

It's something to do with the nature of single stepping through a "popfq"
instruction. Given the following instructions:

   0x7fffe104638f:    add    $0x8,%rsp
   0x7fffe1046393:    popfq
   0x7fffe1046394:    pop    %rbp
   0x7fffe1046395:    jmpq   *%rax

If I set a breakpoint at the first of that set and single step through, I
end up with:

eflags         0x346    [ PF ZF TF IF ]

but if I set a breakpoint on the last instruction and avoid single stepping
I get:

eflags         0x246    [ PF ZF IF ]

and I think it's that TF that is causing the SIGTRAP?
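
The two register dumps above differ in exactly one bit. As a minimal sketch (the helper names has_trap_flag and format_eflags are made up for illustration and are not part of GDB), the following C decodes an EFLAGS value using the architectural bit positions and prints it in a style similar to GDB's "info registers eflags" output:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural position of the x86 trap (single-step) flag. */
#define EFLAGS_TF (UINT64_C(1) << 8)

/* Hypothetical helper: nonzero when TF is set in `eflags`. */
int has_trap_flag(uint64_t eflags)
{
    return (eflags & EFLAGS_TF) != 0;
}

/* Hypothetical helper: write the names of the set status/control flags
   into `buf`, e.g. "[ PF ZF TF IF ]".  Only a subset of flag names is
   covered; output is truncated if `buf` is too small.  */
void format_eflags(uint64_t eflags, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } names[] = {
        {0, "CF"}, {2, "PF"}, {4, "AF"}, {6, "ZF"}, {7, "SF"},
        {8, "TF"}, {9, "IF"}, {10, "DF"}, {11, "OF"},
    };
    size_t used;

    assert(buf != NULL && len > 0);
    used = (size_t)snprintf(buf, len, "[ ");
    for (size_t i = 0; i < sizeof names / sizeof names[0] && used < len; i++)
        if (eflags & (UINT64_C(1) << names[i].bit))
            used += (size_t)snprintf(buf + used, len - used, "%s ",
                                     names[i].name);
    if (used < len)
        snprintf(buf + used, len - used, "]");
}
```

With these helpers, 0x346 ^ 0x246 is 0x100, which is EFLAGS_TF: the single-stepped run leaves the trap flag set and the breakpoint-only run does not.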



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 15:55         ` David Griffiths
@ 2019-02-21 17:50           ` Pedro Alves
  2019-02-21 18:03             ` Pedro Alves
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Pedro Alves @ 2019-02-21 17:50 UTC (permalink / raw)
  To: David Griffiths; +Cc: gdb

On 02/21/2019 03:54 PM, David Griffiths wrote:
> It's something to do with the nature of single stepping through a "popfq"
> instruction. Given the following instructions:

I assume you have a pushf somewhere earlier?

> 
>    0x7fffe104638f:    add    $0x8,%rsp
>    0x7fffe1046393:    popfq
>    0x7fffe1046394:    pop    %rbp
>    0x7fffe1046395:    jmpq   *%rax
> 
> If I set a breakpoint at the first of that set and single step through, I
> end up with:
> 
> eflags         0x346    [ PF ZF TF IF ]
> 
> but if I set a breakpoint on the last instruction and avoid single stepping
> I get:
> 
> eflags         0x246    [ PF ZF IF ]
> 
> and I think it's that TF that is causing the SIGTRAP?

Same as <https://sourceware.org/bugzilla/show_bug.cgi?id=13508> ?

I can reproduce that here, on Fedora 27 / Linux 4.17.17-100.fc27.x86_64.

Sounds like PTRACE_SINGLESTEP enables TF, which then causes pushf to push
the state with TF set.  And then popf restores that TF-enabled state.

I'd think this is a kernel bug, in the same vein as the signal issue
I mentioned below (in which TF would get stuck when you stepped into
a signal handler, or something like that).  The kernel could have special
handling for pushf, emulating it instead of actually single-stepping it?

Maybe newer Linux kernels do something else.  Haven't tried.

I wonder what other kernels, like e.g., FreeBSD do here?

Guess if GDB is to work around this, it'll have to either add
special treatment for this instruction (emulate, step over with a software
breakpoint, something like that), or clear TF manually after
single-stepping.  :-/

Thanks,
Pedro Alves

(please avoid top posting)

> 
> 
> On Thu, 21 Feb 2019 at 13:12, Pedro Alves <palves@redhat.com> wrote:
> 
>> Might be unrelated, but ISTR that there used to be a kernel bug
>> that would lead to the cpu's trace flag getting stuck set
>> when you step in a signal handler.  That would result in
>> SIGTRAP happening at every step from that point on.  Could
>> that be the case here?
>>
>> I'd look at "set debug displaced on" too.  Otherwise, it's a matter
>> at staring at the logs, and trying to understand what is happening.
>> Basically, "finish" sets a breakpoint at the caller and runs to it.
>> But all sorts of other things can happen behind the scenes.
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 17:50           ` Pedro Alves
@ 2019-02-21 18:03             ` Pedro Alves
  2019-02-21 18:22             ` David Griffiths
  2019-02-21 18:50             ` John Baldwin
  2 siblings, 0 replies; 15+ messages in thread
From: Pedro Alves @ 2019-02-21 18:03 UTC (permalink / raw)
  To: David Griffiths; +Cc: gdb

On 02/21/2019 05:50 PM, Pedro Alves wrote:

> (...) the signal issue
> I mentioned below (in which TF would get stuck when you stepped into
> a signal handler, or something like that).  The kernel could have special
> handling for pushf, emulating it instead of actually single-stepping it?

FYI, the signals + TF kernel bug:

 https://bugzilla.kernel.org/show_bug.cgi?id=16061

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 17:50           ` Pedro Alves
  2019-02-21 18:03             ` Pedro Alves
@ 2019-02-21 18:22             ` David Griffiths
  2019-02-21 18:50             ` John Baldwin
  2 siblings, 0 replies; 15+ messages in thread
From: David Griffiths @ 2019-02-21 18:22 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb

On Thu, 21 Feb 2019 at 17:50, Pedro Alves <palves@redhat.com> wrote:

>
> Same as <https://sourceware.org/bugzilla/show_bug.cgi?id=13508> ?
>
>
Yes, that's exactly it! I'd just written a simple test also to reproduce:

#include <stdio.h>

int main()
{
    asm("pushf\n\t"
        "popf\n\t");
    printf("after popfq\n");
}

> (please avoid top posting)

Sorry, gmail default!

Cheers,

David


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 17:50           ` Pedro Alves
  2019-02-21 18:03             ` Pedro Alves
  2019-02-21 18:22             ` David Griffiths
@ 2019-02-21 18:50             ` John Baldwin
  2019-02-21 19:34               ` Pedro Alves
  2 siblings, 1 reply; 15+ messages in thread
From: John Baldwin @ 2019-02-21 18:50 UTC (permalink / raw)
  To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/21/19 9:50 AM, Pedro Alves wrote:
> On 02/21/2019 03:54 PM, David Griffiths wrote:
>> It's something to do with the nature of single stepping through a "popfq"
>> instruction. Given the following instructions:
> 
> I assume you have a pushf somewhere earlier?
> 
>>
>>    0x7fffe104638f:    add    $0x8,%rsp
>>    0x7fffe1046393:    popfq
>>    0x7fffe1046394:    pop    %rbp
>>    0x7fffe1046395:    jmpq   *%rax
>>
>> If I set a breakpoint at the first of that set and single step through, I
>> end up with:
>>
>> eflags         0x346    [ PF ZF TF IF ]
>>
>> but if I set a breakpoint on the last instruction and avoid single stepping
>> I get:
>>
>> eflags         0x246    [ PF ZF IF ]
>>
>> and I think it's that TF that is causing the SIGTRAP?
> 
> Same as <https://sourceware.org/bugzilla/show_bug.cgi?id=13508> ?
> 
> I can reproduce that here, on Fedora 27 / Linux 4.17.17-100.fc27.x86_64.
> 
> Sounds like PTRACE_SINGLESTEP enables TF, which then causes pushf to push
> the state with TF set.  And then popf pops restores that TF-enabled state.
> 
> I'd think this is a kernel bug, in the same vein as the signal issue
> I mentioned below (in which TF would get stuck when you stepped into
> a signal handler, or something like that).  The kernel could have special
> handling for pushf, emulating it instead of actually single-stepping it?
> 
> Maybe newer Linux kernels do something else.  Haven't tried.
> 
> I wonder what other kernels, like e.g., FreeBSD do here?

FreeBSD also fails (and in the last year we had a set of changes to rework
TF handling in the kernel to boot).  This doesn't look trivial to solve.
To get the exception you have to have TF set in %rflags/%eflags, but that
means it is set when the pushf writes to the stack.  I think what would
have to happen (ugh) is that the kernel needs to recognize that the #DB
fault is due to a pushf instruction and that if the TF was a "shadow" TF
due to ptrace it needs to clear TF from the value written on the stack as
part of the fault handler.

> Guess if GDB is to workaround this, it'll have to either add
> special treatment for this instruction (emulate, step over with a software
> breakpoints, something like that), or clear TF manually after
> single-stepping.  :-/

I suspect it will be common for kernels to have this bug because the CPU
will always write a value onto the stack with TF set as part of
executing the instruction.  A workaround in GDB would be much like what I
described above with the advantage that GDB actually knows it is stepping a
pushf before it steps it, so it can know to rewrite the value on the
stack after it gets the SIGTRAP for the single step over the pushf.

This may actually be hard for a kernel to get right, as at the time of the
fault we don't get anything that says how long the faulting instruction was,
etc.  Thus, just looking at the byte before the current eip/rip in a #DB
fault handler for the pushf opcode (I believe it's a single byte) can get
false positives because you might have stepped over a mov instruction with
an immediate whose last byte happens to be the opcode, etc.
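
To make the false-positive concern concrete, here is a small sketch. The function and array names are invented for illustration; the encodings are standard x86 (0x9c is pushf/pushfq, and 0xb8 followed by a little-endian imm32 is "mov $imm32,%eax"). The naive backward check fires on both byte sequences even though only one of them is a pushf:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPCODE_PUSHF 0x9c   /* pushf / pushfq */

/* Naive check a trap handler might be tempted to use: is the byte just
   before the address of the next instruction the pushf opcode?
   `next_ip` is an offset into `code`.  */
int prev_byte_is_pushf(const uint8_t *code, size_t next_ip)
{
    assert(code != NULL && next_ip > 0);
    return code[next_ip - 1] == OPCODE_PUSHF;
}

/* A genuine one-byte pushfq.  */
const uint8_t real_pushf[] = { 0x9c };

/* mov $0x9c000000,%eax: the last immediate byte is 0x9c.  */
const uint8_t mov_imm[] = { 0xb8, 0x00, 0x00, 0x00, 0x9c };
```

Calling prev_byte_is_pushf on either array with the offset just past its end returns nonzero, so the check cannot tell the mov from the pushf.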

-- 
John Baldwin

                                                                            

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 18:50             ` John Baldwin
@ 2019-02-21 19:34               ` Pedro Alves
  2019-02-21 20:50                 ` John Baldwin
  0 siblings, 1 reply; 15+ messages in thread
From: Pedro Alves @ 2019-02-21 19:34 UTC (permalink / raw)
  To: John Baldwin, David Griffiths; +Cc: gdb

Hi John,

Thanks for stepping in.

On 02/21/2019 06:49 PM, John Baldwin wrote:
> On 2/21/19 9:50 AM, Pedro Alves wrote:

>> I wonder what other kernels, like e.g., FreeBSD do here?
> 
> FreeBSD also fails (and in the last year we had a set of changes to rework
> TF handling in the kernel to boot).  This doesn't look trivial to solve.
> To get the exception you have to have TF set in %rflags/%eflags, but that
> means it is set when the pushf writes to the stack.  I think what would
> have to happen (ugh) is that the kernel needs to recognize that the DB#
> fault is due to a pushf instruction and that if the TF was a "shadow" TF
> due to ptrace it needs to clear TF from the value written on the stack as
> part of the fault handler.
> 
>> Guess if GDB is to workaround this, it'll have to either add
>> special treatment for this instruction (emulate, step over with a software
>> breakpoints, something like that), or clear TF manually after
>> single-stepping.  :-/
> 
> I suspect it will be common for kernels to have this bug because the CPU
> will always write a value onto the stack with TF set as part of
> executing the instruction.  A workaround in GDB would be much like what I
> described above with the advantage that GDB actually knows it is stepping a
> pushf before it steps it, so it can know to rewrite the value on the
> stack after it gets the SIGTRAP for the single step over the pushf.
> 
> This may actually be hard for a kernel to get right as at the time of the
> fault we don't get anything that says how long the faulting instruction was,
> etc.  Thus, just looking at the byte before the current eip/rip in a DB#
> fault handler for the pushf opcode (I believe it's a single byte) can get
> false positives because you might have stepped over a mov instruction with
> an immediate whose last byte happens to be the opcode, etc.
I can think of other workarounds potentially possible:

#1 - emulate the instruction: i.e., if you know you're stepping a
   pushf instruction, you could instead push the flags state on the
   stack yourself manually, advance the PC, and then raise a
   fake trap.  Could be done by the kernel, or gdb.  Fixing it on
   the kernel side should be more efficient, and fixes it for
   all debuggers.  While fixing it on the debugger side fixes
   it for all kernels...

#2 - if you know you're stepping a pushf instruction, set a breakpoint
   after it, and PTRACE_CONTINUE instead of stepping.  (that's the software
   single-step workaround mentioned earlier).

#3 - have gdb always clear TF after a single-step.  This is the
   easiest, even if the "less technically cool" solution.  This
   would mean that it'd be impossible to debug a program that
   sets the trace flag manually.  I actually once co-wrote
   an in-process x86 debug stub, and in that use case
   preserving TF mattered, as it made it possible to debug that
   stub...  Quite a niche use case, though, and it'd have been
   trivial for me to hack gdb for that special use case, of course.

In order for GDB to know whether it is stepping a pushf instruction,
it needs to read the memory at PC, which has a cost.  Maybe that cost is
negligible if we already end up reading memory anyway (because of the
code cache), but I'm not sure we already do.  This can have a more
noticeable effect with remote debugging, which should weigh on whether
to do the workaround at the infrun.c level or in the target backend (thus
in gdbserver when remote).
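
As an illustration of the "read the memory at PC" step and of where workaround #2 would place its breakpoint, a minimal sketch follows. The function names are hypothetical and this is not how GDB decodes instructions; it assumes the bytes at PC have already been fetched into a buffer, and it recognises only the bare 0x9c opcode and the 0x66-prefixed form, ignoring other legal prefixes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of a pushf instruction starting
   at code[0], or 0 if the bytes are not one of the recognised forms.
   `avail` is how many bytes of `code` are valid.  */
size_t pushf_length(const uint8_t *code, size_t avail)
{
    assert(code != NULL);
    if (avail >= 1 && code[0] == 0x9c)
        return 1;                       /* pushf / pushfq */
    if (avail >= 2 && code[0] == 0x66 && code[1] == 0x9c)
        return 2;                       /* operand-size prefix + pushf */
    return 0;
}

/* Hypothetical helper for workaround #2: address at which to put a
   temporary breakpoint instead of hardware-stepping, or 0 when the
   instruction at `pc` is not a recognised pushf.  */
uint64_t breakpoint_after_pushf(uint64_t pc, const uint8_t *code,
                                size_t avail)
{
    size_t len = pushf_length(code, avail);
    return len ? pc + len : 0;
}
```

Because the decode starts at a known instruction boundary, it does not have the ambiguity of inspecting the byte before the trap address after the fact.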

Solution #3 would require extra ptrace commands anyway (read-modify-write
the flags), so it may end up being less performant, if #1 and #2 already
hit the code cache.

There are some extra complications around #1 and #2 for gdbserver,
because we need to consider the cases when gdbserver handles 
single-stepping without roundtripping to gdb:

  - range-stepping
  - stepping over breakpoints/tracepoints

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 19:34               ` Pedro Alves
@ 2019-02-21 20:50                 ` John Baldwin
  2019-02-22 15:09                   ` Pedro Alves
  0 siblings, 1 reply; 15+ messages in thread
From: John Baldwin @ 2019-02-21 20:50 UTC (permalink / raw)
  To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/21/19 11:34 AM, Pedro Alves wrote:
> I can think of other workarounds potentially possible:
> 
> #1 - emulate the instruction: i.e., if you know you're stepping a
>    pushf instruction, you could instead push the flags state on the
>    stack yourself manually, advance the PC, and then raise a
>    fake trap.  Could be done by the kernel, or gdb.  Fixing it on
>    the kernel side should be more efficient, and fixes it for
>    all debuggers.  While fixing it on the debugger side fixes
>    it for all kernels...

Actually, yes, the PTRACE_SINGLESTEP/PT_STEP handling can notice the pushf
before it executes it in the kernel.  That is not too bad then I guess.

> #2 - if you know you're stepping a pushf instruction, set a breakpoint
>    after it, and PTRACE_CONTINUE instead of stepping.  (that's the software
>    single-step workaround mentioned earlier).

I prefer that to my suggestion above, and if we chose to do it in GDB my
guess is that #2 is a simpler / smaller patch to implement than #1?

> #3 - have gdb always clear TF after a single-step.  This is the
>    easiest, even if the "less technically cool" solution.  This
>    would mean that it'd be impossible to debug a program that
>    sets the trace flag manually.  I've actually once co-wrote
>    an in-process x86 debug stub, and in that use case
>    preserving TF mattered, made it possible to debug that
>    stub...  Quite a niche use case, though, and it'd have been
>    trivial for me for hack gdb for that special use case, of course.
> 
> In order for GDB to know whether it is stepping a pushf instruction,
> it needs to read the memory at PC, which has a cost, but maybe it's
> negligible if we already end up reading memory anyway (because of the
> code cache), but I'm not sure we already do.  This can have a more
> noticeable effect with remote debugging (which should weigh on whether
> to do the workaround at the infrun.c level, or in the target backend (thus
> in gdbserver when remote).
> 
> Solution #3 would require extra ptrace commands anyway (read-modify-write
> the flags), so it may end up being less performant, if #1 and #2 already
> hit the code cache.
> 
> There are some extra complications around #1 and #2 for gdbserver,
> because we need to consider the cases when gdbserver handles 
> single-stepping without roundtripping to gdb:
> 
>   - range-stepping
>   - stepping over breakpoints/tracepoints

Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
kernel regardless, most likely using the approach in #1.  For GDB itself, I
have a slight preference for #2 over #1, but I haven't yet worked
with gdbserver, so I'd defer to you on whether #3 is the best solution when
taking gdbserver into account.  If the edge case of #3 matters (and it might
for other things, like the language runtimes that set TF and use SIGTRAP
handlers, which motivated FreeBSD's kernel changes last year), we
could perhaps provide a way for targets to override #3 if they know they
don't need it (e.g. a native target under a kernel known to work).  Not
sure how that would work over remote (e.g. whether you would want gdbserver
to internalize this behavior so that only it deals with it and hides it from
the remote debugger).

-- 
John Baldwin

                                                                            

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "finish" command leads to SIGTRAP
  2019-02-21 20:50                 ` John Baldwin
@ 2019-02-22 15:09                   ` Pedro Alves
  2019-02-22 16:42                     ` John Baldwin
  0 siblings, 1 reply; 15+ messages in thread
From: Pedro Alves @ 2019-02-22 15:09 UTC (permalink / raw)
  To: John Baldwin, David Griffiths; +Cc: gdb

On 02/21/2019 08:49 PM, John Baldwin wrote:
> On 2/21/19 11:34 AM, Pedro Alves wrote:
>> I can think of other workarounds potentially possible:
>>
>> #1 - emulate the instruction: i.e., if you know you're stepping a
>>    pushf instruction, you could instead push the flags state on the
>>    stack yourself manually, advance the PC, and then raise a
>>    fake trap.  Could be done by the kernel, or gdb.  Fixing it on
>>    the kernel side should be more efficient, and fixes it for
>>    all debuggers.  While fixing it on the debugger side fixes
>>    it for all kernels...
> 
> Actually, yes, the PTRACE_STEP/PT_STEP can notice the pushf before it
> executes it in the kernel.  That is not too bad then I guess.
> 
>> #2 - if you know you're stepping a pushf instruction, set a breakpoint
>>    after it, and PTRACE_CONTINUE instead of stepping.  (that's the software
>>    single-step workaround mentioned earlier).
> 
> I prefer that to my suggestion above, and if we chose to do it in GDB my
> guess is that #2 is simpler / smaller patch to implement than #1?

Not 100% sure, #1 feels simpler in some aspects; #2 feels simpler in others.  

A detail that I'm thinking of right now is that when we have a
signal to deliver, we had better deliver the signal first, before emulating
the instruction, because we don't know whether the signal will take us
to a signal handler (which may siglongjmp and thus skip the pushf).
IIRC, there's code in infrun.c to do something like that for other
cases, so it shouldn't be too hard.  #2 avoids this, because
PTRACE_CONTINUE would just take us to the signal handler as usual,
but both in-line and out-of-line stepping must be considered.

To me it feels like the kind of thing that would require
experimentation / prototyping to get a better feel and notice
the corner cases as one digs through the state machine code
in infrun.c.
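
To make option #1 a bit more concrete, here is a minimal sketch of what
emulating a one-byte pushfq could look like.  Everything in it is a
hypothetical stand-in (struct fake_thread, emulate_pushfq, the toy stack
array); it is not GDB or kernel code, and it assumes the caller already
knows that TF in the flags is the debugger's "shadow" TF.

```c
#include <stdint.h>
#include <string.h>

#define X86_EFLAGS_TF 0x00100u  /* trap flag, bit 8 */
#define X86_EFLAGS_RF 0x10000u  /* resume flag, bit 16 */
#define X86_EFLAGS_VM 0x20000u  /* virtual-8086 flag, bit 17 */

/* Hypothetical stand-in for the state of a stopped thread.  */
struct fake_thread {
    uint64_t rip;
    uint64_t rsp;
    uint64_t rflags;
    uint64_t stack_base;   /* address that corresponds to stack[0] */
    uint8_t  stack[64];    /* toy stack memory */
};

/* Emulate a one-byte pushfq located at t->rip.  The image written to
   the stack has RF and VM cleared, as the real instruction does, and
   additionally has TF cleared when TF is only set because the debugger
   is single-stepping.  Returns 0 on success, -1 if the toy stack has
   no room left.  */
static int emulate_pushfq(struct fake_thread *t, int tf_is_debugger_owned)
{
    uint64_t image = t->rflags & ~(uint64_t)(X86_EFLAGS_RF | X86_EFLAGS_VM);

    if (tf_is_debugger_owned)
        image &= ~(uint64_t)X86_EFLAGS_TF;

    if (t->rsp < t->stack_base + sizeof image)
        return -1;

    t->rsp -= sizeof image;
    memcpy(&t->stack[t->rsp - t->stack_base], &image, sizeof image);
    t->rip += 1;           /* skip the single 0x9c opcode byte */
    return 0;
}
```

The sketch leaves out the ordering concern above: a pending signal would
have to be delivered before the emulation runs, and the fake trap would
still have to be reported afterwards.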

> 
>> #3 - have gdb always clear TF after a single-step.  This is the
>>    easiest, even if the "less technically cool" solution.  This
>>    would mean that it'd be impossible to debug a program that
>>    sets the trace flag manually.  I've actually once co-wrote
>>    an in-process x86 debug stub, and in that use case
>>    preserving TF mattered, made it possible to debug that
>>    stub...  Quite a niche use case, though, and it'd have been
>>    trivial for me for hack gdb for that special use case, of course.
>>
>> In order for GDB to know whether it is stepping a pushf instruction,
>> it needs to read the memory at PC, which has a cost, but maybe it's
>> negligible if we already end up reading memory anyway (because of the
>> code cache), but I'm not sure we already do.  This can have a more
>> noticeable effect with remote debugging (which should weigh on whether
>> to do the workaround at the infrun.c level, or in the target backend (thus
>> in gdbserver when remote).
>>
>> Solution #3 would require extra ptrace commands anyway (read-modify-write
>> the flags), so it may end up being less performant, if #1 and #2 already
>> hit the code cache.
>>
>> There are some extra complications around #1 and #2 for gdbserver,
>> because we need to consider the cases when gdbserver handles 
>> single-stepping without roundtripping to gdb:
>>
>>   - range-stepping
>>   - stepping over breakpoints/tracepoints
> 
> Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
> kernel regardless probably using the approach in #1.  For GDB itself, I
> probably have a slight preference for #2 over #1, but I haven't yet worked
> with gdbserver, so I'd defer to you on if #3 is the best solution when
> taking gdbserver into account.  If the edge case of #3 matters, (which might
> matter for some other things like some language runtimes that set TF and use
> SIGTRAP handlers that motivated FreeBSD's kernel changes last year), we
> could perhaps provide a way for targets to override #3 if they know they
> don't need it (e.g. a native target under a kernel known to work).  Not
> sure how that would work over remote (e.g. if you would want gdbserver to
> internalize this behavior so that only it deals with it and hides it from
> the remote debugger).

I'd prefer #1 or #2 over #3.  As for gdbserver, the thing is that whatever
solution we implement in gdb isn't going to fix gdbserver; gdbserver
needs fixing as well.  gdbserver has its own run control loop that does
single-stepping behind gdb's back.  The most common case nowadays is
range-stepping.  When you do "next" or "step", as an optimization, gdb
tells gdbserver to single-step as long as the PC is within an address range
(the contiguous address range that corresponds to the current line
and includes PC).  gdbserver then continually single-steps, and only
reports back a stop to GDB once the PC leaves the range.  This avoids
many roundtrips between gdb and gdbserver.  This means that gdbserver
must have some workaround too.  For this case alone, we could just
make gdbserver punt and report a stop to gdb if the next instruction is
a pushf (gdb then continues stepping itself, which would trigger the workaround).
BUT, that wouldn't address the less frequent case -- tracepoints:
gdbserver needs to step over them without gdb involvement, and needs to
implement while-stepping actions.  So here we can't punt to gdb; there
may not even be one connected!  So we need a full workaround
in gdbserver.
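
As a rough illustration of the "punt" idea for range-stepping only, here
is a toy sketch.  The names (struct toy_target, range_step, the enum
values) are made up for this example and are not gdbserver's; the toy
target also assumes every instruction is one byte long, so that a
single-step is just pc + 1.

```c
#include <stddef.h>
#include <stdint.h>

enum step_result {
    LEFT_RANGE,    /* PC left [start, end): report the stop as usual */
    PUNT_TO_GDB    /* next instruction is pushf: let gdb handle it */
};

/* Hypothetical toy target where every instruction is one byte.  */
struct toy_target {
    const uint8_t *code;
    size_t size;
    size_t pc;
};

/* Keep single-stepping while PC stays inside [start, end), but stop
   early and hand control back when the instruction about to be
   executed is a pushf (opcode 0x9c).  */
static enum step_result range_step(struct toy_target *t,
                                   size_t start, size_t end)
{
    while (t->pc >= start && t->pc < end && t->pc < t->size) {
        if (t->code[t->pc] == 0x9c)
            return PUNT_TO_GDB;
        t->pc += 1;   /* stands in for one hardware single-step */
    }
    return LEFT_RANGE;
}
```

This only covers the range-stepping case; the tracepoint case has no gdb
to punt to, which is why a full workaround is still needed.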

Thanks,
Pedro Alves


* Re: "finish" command leads to SIGTRAP
  2019-02-22 15:09                   ` Pedro Alves
@ 2019-02-22 16:42                     ` John Baldwin
  2019-02-22 17:38                       ` David Griffiths
  0 siblings, 1 reply; 15+ messages in thread
From: John Baldwin @ 2019-02-22 16:42 UTC (permalink / raw)
  To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/22/19 7:09 AM, Pedro Alves wrote:
> On 02/21/2019 08:49 PM, John Baldwin wrote:
>> On 2/21/19 11:34 AM, Pedro Alves wrote:
>>> #3 - have gdb always clear TF after a single-step.  This is the
>>>    easiest, even if the "less technically cool" solution.  This
>>>    would mean that it'd be impossible to debug a program that
>>>    sets the trace flag manually.  I've actually once co-wrote
>>>    an in-process x86 debug stub, and in that use case
>>>    preserving TF mattered, made it possible to debug that
>>>    stub...  Quite a niche use case, though, and it'd have been
>>>    trivial for me for hack gdb for that special use case, of course.
>>>
>>> In order for GDB to know whether it is stepping a pushf instruction,
>>> it needs to read the memory at PC, which has a cost, but maybe it's
>>> negligible if we already end up reading memory anyway (because of the
>>> code cache), but I'm not sure we already do.  This can have a more
>>> noticeable effect with remote debugging (which should weigh on whether
>>> to do the workaround at the infrun.c level, or in the target backend (thus
>>> in gdbserver when remote).
>>>
>>> Solution #3 would require extra ptrace commands anyway (read-modify-write
>>> the flags), so it may end up being less performant, if #1 and #2 already
>>> hit the code cache.
>>>
>>> There are some extra complications around #1 and #2 for gdbserver,
>>> because we need to consider the cases when gdbserver handles 
>>> single-stepping without roundtripping to gdb:
>>>
>>>   - range-stepping
>>>   - stepping over breakpoints/tracepoints
>>
>> Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
>> kernel regardless probably using the approach in #1.  For GDB itself, I
>> probably have a slight preference for #2 over #1, but I haven't yet worked
>> with gdbserver, so I'd defer to you on if #3 is the best solution when
>> taking gdbserver into account.  If the edge case of #3 matters, (which might
>> matter for some other things like some language runtimes that set TF and use
>> SIGTRAP handlers that motivated FreeBSD's kernel changes last year), we
>> could perhaps provide a way for targets to override #3 if they know they
>> don't need it (e.g. a native target under a kernel known to work).  Not
>> sure how that would work over remote (e.g. if you would want gdbserver to
>> internalize this behavior so that only it deals with it and hides it from
>> the remote debugger).
> 
> I'd prefer #1 or #2 over #3.  As for gdbserver, the thing is that whatever
> solution we implement in gdb isn't going to fix gdbserver, gdbserver
> needs fixing as well.  gdbserver has its own run control loop that does
> single-stepping behind gdb's back.  The most common case nowadays is
> range-stepping.  When you do "next", or "step", as an optimization, gdb
> tells gdbserver to single-step as long the PC is within an address range
> (the continuous address range that corresponds to the current line
> that includes PC).  gdbserver then continually single-steps, and only
> reports back a stop to GDB once the PC leaves the range.  This avoids
> many roundtrips between gdb and gdbserver.  This means that gdbserver
> must have some workaround too.  For this case alone, we could just
> make gdbserver punt and report a stop to gdb if the next instruction is
> a pushf (gdb continues stepping itself, which would trigger the workaround).
> BUT, that wouldn't address the less frequent case -- tracepoints:
> gdbserver needs to step over them without gdb involvement, and needs to
> implement while-stepping actions.  So here we can't punt to gdb, there
> may not even be one connected!  So we need to a full workaround
> in gdbserver.

I thought of one more issue with #3 which is that it's not necessarily that
you need to clear TF after each step.  The way I reproduced this when I ran
the test program was to si over the pushf, then do a continue.  This meant
that we weren't stepping when the popf was executed, and the instruction
after popf then raised a spurious SIGTRAP.  At that point, the thread's
current state isn't stepping.  One way to handle this, perhaps, would be to
determine specifically that a SIGTRAP was a step and, if you get
an unexpected step trap, resume the thread anyway (possibly clearing TF as
part of the resume).  This wouldn't be hard to do in individual native
targets where you have the siginfo for the SIGTRAP.  It's harder to do at a
higher layer I think.  One thing I've wondered about when adding the siginfo
parsing for the FreeBSD native target is that it feels like it would be
nicer if a target could return more fine-grained waitkinds, something like
TARGET_WAITKIND_STEPPED, TARGET_WAITKIND_SW_BREAKPOINT, etc. instead of
requiring the various methods like 'supports_stopped_by_sw_breakpoint' and
'stopped_by_sw_breakpoint' and assuming that SIGTRAP is a step if the current
thread is stepping and none of the other 'stopped_by_foo' methods return
true.  You could maybe still have a fallback for TARGET_WAITKIND_STOPPED that
would use the same heuristics for targets that don't parse siginfo to infer
the more detailed stop type perhaps?  Having that detail at a higher level
would make it easier to recognize spurious step traps in the core I think.
That's probably too big a change just to work around this issue, but still a
thought I've had for a while.
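
A rough sketch of the kind of classification I mean, assuming the target
has already reduced the siginfo to a small trap code.  All of the names
here (enum trap_code, enum stop_kind, classify_sigtrap) are hypothetical
and only loosely mirror the TARGET_WAITKIND_* idea; none of this exists
in GDB as written.

```c
#include <stdbool.h>

/* Hypothetical summary of what the siginfo for a SIGTRAP said.  */
enum trap_code {
    TRAP_CODE_UNKNOWN,     /* target did not or could not parse siginfo */
    TRAP_CODE_TRACE,       /* single-step trap */
    TRAP_CODE_BREAKPOINT   /* software breakpoint trap */
};

/* Hypothetical fine-grained stop kinds.  */
enum stop_kind {
    STOP_STEPPED,
    STOP_SW_BREAKPOINT,
    STOP_SPURIOUS_STEP,    /* step trap while the thread wasn't stepping */
    STOP_GENERIC
};

static enum stop_kind classify_sigtrap(enum trap_code code,
                                       bool thread_is_stepping)
{
    switch (code) {
    case TRAP_CODE_BREAKPOINT:
        return STOP_SW_BREAKPOINT;
    case TRAP_CODE_TRACE:
        /* A step trap nobody asked for is the leaked-TF case.  */
        return thread_is_stepping ? STOP_STEPPED : STOP_SPURIOUS_STEP;
    default:
        /* Fallback heuristic: no siginfo detail, so assume a step
           only if the thread was being stepped.  */
        return thread_is_stepping ? STOP_STEPPED : STOP_GENERIC;
    }
}
```

With something like STOP_SPURIOUS_STEP visible in the core, the resume
path could clear TF and continue instead of reporting the SIGTRAP.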

-- 
John Baldwin

                                                                            


* Re: "finish" command leads to SIGTRAP
  2019-02-22 16:42                     ` John Baldwin
@ 2019-02-22 17:38                       ` David Griffiths
  0 siblings, 0 replies; 15+ messages in thread
From: David Griffiths @ 2019-02-22 17:38 UTC (permalink / raw)
  To: John Baldwin; +Cc: Pedro Alves, gdb

By the way, I'm just testing my workaround for this (setting a breakpoint and
continuing rather than single-stepping), and the problem appears to affect
both pushfq and popfq. Even after I fixed the pushfq case, the problem still
occurred because it set TF on the popfq even though the stack value didn't
contain TF.
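
For what it's worth, a minimal sketch of the check behind that kind of
workaround: decode forward from PC, and if the instruction is a pushf or
popf, return its length so a breakpoint can be placed at PC + length
instead of single-stepping.  The function name flags_insn_length is made
up for this example, and the prefix handling is a simplifying assumption
(at most one 0x66 operand-size prefix and one REX prefix, 64-bit mode).

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length in bytes of the pushf/popf instruction that starts
   at insn, or 0 if the bytes there are not a pushf/popf.  Only an
   optional 0x66 prefix followed by an optional REX prefix (0x40-0x4f)
   is recognised before the opcode.  */
static size_t flags_insn_length(const uint8_t *insn, size_t avail)
{
    size_t i = 0;

    if (i < avail && insn[i] == 0x66)            /* operand-size prefix */
        i++;
    if (i < avail && (insn[i] & 0xf0) == 0x40)   /* REX prefix */
        i++;
    if (i < avail && (insn[i] == 0x9c || insn[i] == 0x9d))
        return i + 1;                            /* 0x9c pushf, 0x9d popf */
    return 0;
}
```

Because this decodes from the start of the instruction, a mov whose
immediate merely contains 0x9c (for example b8 9c 00 00 00) is not
mistaken for a pushf.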

Cheers,

David
-- 

David Griffiths, Senior Software Engineer

Undo <https://undo.io> | Resolve even the most challenging software defects
with software flight recorder technology

Software reliability report: optimizing the software supplier and customer
relationship
<https://info.undo.io/software-reliability-report-optimizing-supplier-and-customer-relationship>


end of thread

Thread overview: 15+ messages
2019-02-21 11:21 "finish" command leads to SIGTRAP David Griffiths
2019-02-21 11:24 ` Pedro Alves
2019-02-21 12:13   ` David Griffiths
2019-02-21 12:17     ` David Griffiths
2019-02-21 13:12       ` Pedro Alves
2019-02-21 15:55         ` David Griffiths
2019-02-21 17:50           ` Pedro Alves
2019-02-21 18:03             ` Pedro Alves
2019-02-21 18:22             ` David Griffiths
2019-02-21 18:50             ` John Baldwin
2019-02-21 19:34               ` Pedro Alves
2019-02-21 20:50                 ` John Baldwin
2019-02-22 15:09                   ` Pedro Alves
2019-02-22 16:42                     ` John Baldwin
2019-02-22 17:38                       ` David Griffiths
