GDB 13.2: breakpoint at wrong line after unrelated change

public inbox for gdb@sourceware.org

* GDB 13.2: breakpoint at wrong line after unrelated change
@ 2024-03-11 18:28 Paul Smith
  2024-03-11 19:14 ` Simon Marchi
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Smith @ 2024-03-11 18:28 UTC
  To: gdb

Hi all;

I have an extremely odd error and I'm wondering if it rings any bells
with anyone.  If not I'll embark on an effort of upgrading my tools to
see if it's fixed in newer versions and if not trying to file a bug.

I have a C++ unit test program.  This is GNU/Linux 64bit compiled with
GCC 12.3 and I'm using GDB 13.2 to debug it.  The error happens
regardless of whether I compile with "-ggdb3 -O0" or with "-ggdb3 -O2".
I haven't tried other optimization levels.

In the current behavior, I can set a breakpoint at a function and GDB
will stop at the first line of the function; for example:

  class TestClass : public ... {
    ...
      void breakpointTest(TestData* data)
      {
          printf("obj = %p\n", data);
      }
    ...

If I run:

  (gdb) br TestClass::breakpointTest

  (gdb) run

then GDB will stop at the printf line, and the "data" variable is set
properly:

  Thread 1 "TestClass" hit Breakpoint 1, TestClass::breakpointTest
    (this=0x7ffff2a09a00, data=0x7ffff2aa7000) at TestClass.cpp:100
  100             printf("obj = %p\n", data);

Now if I make a change in my program in a completely different,
irrelevant spot (this change creates a new templated function that uses
Args... and perfect forwarding etc.: it's complex and uses the fmt
library, but it is not being used at all in this function, or even in
this class although it's used in a superclass), then after I do exactly
the same thing above, GDB will stop at the wrong location.  Instead of
stopping at line 100 at the first line of the function it stops
"before" the function is entered and the function arguments are not set
yet (in the example below note the values of "this" and "data" are
wrong).

I have noticed that if I only include the templated function definition
but don't call it, then the problem doesn't happen.  I have to use the
templated function somewhere in the translation unit, but it doesn't
have to be anywhere near the function.

In the failure case if I use "n" to go to the next line, THEN I get to
the first line in the function and everything is set properly.

Example:

  Thread 1 "TestClass" hit Breakpoint 1, TestClass::breakpointTest
    (this=0x0, data=0x21) at TestClass.cpp:98
  98         void breakpointTest(TestData* data)

  (gdb) n
  100             printf("obj = %p\n", data);

  (gdb) fr
  #0  TestClass::breakpointTest (this=0x7ffff2a09a00, data=0x7ffff2aa7000)
    at TestClass.cpp:100
  100             printf("obj = %p\n", data);

Is anyone aware of any issue in GDB, or GCC, where using templated
functions with perfect forwarding or other complex C++ template
features could cause GDB's understanding of the starting line number of
functions to be miscalculated like this?

Is there a way for me to investigate what information GDB is looking at
to determine where to set breakpoints when given a symbol name like
this?  Is this the same info available via addr2line etc.?

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-11 18:28 GDB 13.2: breakpoint at wrong line after unrelated change Paul Smith
@ 2024-03-11 19:14 ` Simon Marchi
  2024-03-11 19:38   ` Paul Smith
  0 siblings, 1 reply; 9+ messages in thread
From: Simon Marchi @ 2024-03-11 19:14 UTC
  To: psmith, gdb

On 3/11/24 14:28, Paul Smith via Gdb wrote:
> Hi all;
> 
> I have an extremely odd error and I'm wondering if it rings any bells
> with anyone.  If not I'll embark on an effort of upgrading my tools to
> see if it's fixed in newer versions and if not trying to file a bug.
> 
> I have a C++ unit test program.  This is GNU/Linux 64bit compiled with
> GCC 12.3 and I'm using GDB 13.2 to debug it.  The error happens
> regardless of whether I compile with "-ggdb3 -O0" or with "-ggdb3 -O2".
> I haven't tried other optimization levels.
> 
> In the current behavior, I can set a breakpoint at a function and GDB
> will stop at the first line of the function; for example:
> 
>   class TestClass : public ... {
>     ...
>       void breakpointTest(TestData* data)
>       {
>           printf("obj = %p\n", data);
>       }
>     ...
> 
> If I run:
> 
>   (gdb) br TestClass::breakpointTest
> 
>   (gdb) run
> 
> then GDB will stop at the printf line, and the "data" variable is set
> properly:
> 
>   Thread 1 "TestClass" hit Breakpoint 1, TestClass::breakpointTest
>     (this=0x7ffff2a09a00, data=0x7ffff2aa7000) at TestClass.cpp:100
>   100             printf("obj = %p\n", data);
> 
> 
> Now if I make a change in my program in a completely different,
> irrelevant spot (this change creates a new templated function that uses
> Args... and perfect forwarding etc.: it's complex and uses the fmt
> library, but it is not being used at all in this function, or even in
> this class although it's used in a superclass), then after I do exactly
> the same thing above, GDB will stop at the wrong location.  Instead of
> stopping at line 100 at the first line of the function it stops
> "before" the function is entered and the function arguments are not set
> yet (in the example below note the values of "this" and "data" are
> wrong).
> 
> I have noticed that if I only include the templated function definition
> but don't call it, then the problem doesn't happen.

When you say that, does it mean that you just define the templated
function, or do you manually instantiate it?  In other words, does it
cause any code to be generated?

> I have to use the
> templated function somewhere in the translation unit, but it doesn't
> have to be anywhere near the function.
> 
> In the failure case if I use "n" to go to the next line, THEN I get to
> the first line in the function and everything is set properly.
> 
> Example:
> 
>   Thread 1 "TestClass" hit Breakpoint 1, TestClass::breakpointTest
>     (this=0x0, data=0x21) at TestClass.cpp:98
>   98         void breakpointTest(TestData* data)
> 
>   (gdb) n
>   100             printf("obj = %p\n", data);
> 
>   (gdb) fr
>   #0  TestClass::breakpointTest (this=0x7ffff2a09a00, data=0x7ffff2aa7000)
>     at TestClass.cpp:100
>   100             printf("obj = %p\n", data);
> 
> 
> Is anyone aware of any issue in GDB, or GCC, where using templated
> functions with perfect forwarding or other complex C++ template
> features could cause GDB's understanding of the starting line number of
> functions to be miscalculated like this?
> 
> Is there a way for me to investigate what information GDB is looking at
> to determine where to set breakpoints when given a symbol name like
> this?  Is this the same info available via addr2line etc.?

When placing a breakpoint on a function name like this, on code compiled
by gcc/g++, GDB analyzes the prologue and tries to guess at which point
the stack for the function is set up and where the location expressions
given by the DWARF debug info for the local variables become
meaningful.  With optimizations, this can become tricky, but you said it
happens with -O0, so let's focus on that.

I don't really have an idea of what's happening, but you could try
showing what the "disas" command shows after hitting the breakpoint in
both cases (the `=>` should show where you are stopped, so where the
breakpoint was set).

Simon


* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-11 19:14 ` Simon Marchi
@ 2024-03-11 19:38   ` Paul Smith
  2024-03-11 19:50     ` Simon Marchi
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Smith @ 2024-03-11 19:38 UTC
  To: gdb

On Mon, 2024-03-11 at 15:14 -0400, Simon Marchi wrote:
> > I have noticed that if I only include the templated function
> > definition but don't call it, then the problem doesn't happen.
> 
> When you say that, does it mean that you just define the templated
> function, or do you manually instantiate it?  In other words, does it
> cause any code to be generated?

If I just define the templated function I don't see the issue.  If I
invoke the templated function, I get the problem.

FYI I'm switching to the fmt library (if you're familiar with that) and
the templated function invokes it; it's something like this:

    void criticalErrorV(fmt::string_view fmt, const char *file, int line,
                        fmt::format_args args);

    template <typename... Args>
    void criticalError(fmt::format_string<Args...> fmt,
                       const char* file, int line, Args &&...args)
    {
        criticalErrorV(fmt, file, line, fmt::make_format_args(args...));
    }

If I never call criticalError() then it works fine (or in my previous
implementation, which used printf-style calls with stdarg, it worked
fine as well).

If I have some invocation of criticalError() somewhere in the
translation unit, I get this problem.  I haven't checked moving it
around to see if it needs to be invoked before/after the "problem"
method in the TU to get this behavior.

> I don't really have an idea of what's happening, but you could try
> showing what the "disas" command shows after hitting the breakpoint
> in both cases (the `=>` should show where you are stopped, so where
> the breakpoint was set).

Good idea; here's what I get for the correct behavior:

   0x000000000053209c <+0>:     push   %rbp
   0x000000000053209d <+1>:     mov    %rsp,%rbp
   0x00000000005320a0 <+4>:     lea    -0x10(%rsp),%rsp
   0x00000000005320a5 <+9>:     mov    %rdi,-0x8(%rbp)
   0x00000000005320a9 <+13>:    mov    %rsi,-0x10(%rbp)
=> 0x00000000005320ad <+17>:    mov    -0x10(%rbp),%rax
   0x00000000005320b1 <+21>:    mov    %rax,%rsi
   0x00000000005320b4 <+24>:    lea    0x17e9a9(%rip),%rax        # 0x6b0a64
   0x00000000005320bb <+31>:    mov    %rax,%rdi
   0x00000000005320be <+34>:    mov    $0x0,%eax
   0x00000000005320c3 <+39>:    call   0x52bc00 <printf@plt>
   0x00000000005320c8 <+44>:    nop
   0x00000000005320c9 <+45>:    mov    %rbp,%rsp
   0x00000000005320cc <+48>:    pop    %rbp
   0x00000000005320cd <+49>:    ret

Here's what I get for the incorrect behavior:

=> 0x00000000005320ee <+0>:     push   %rbp
   0x00000000005320ef <+1>:     mov    %rsp,%rbp
   0x00000000005320f2 <+4>:     lea    -0x10(%rsp),%rsp
   0x00000000005320f7 <+9>:     mov    %rdi,-0x8(%rbp)
   0x00000000005320fb <+13>:    mov    %rsi,-0x10(%rbp)
   0x00000000005320ff <+17>:    mov    -0x10(%rbp),%rax
   0x0000000000532103 <+21>:    mov    %rax,%rsi
   0x0000000000532106 <+24>:    lea    0x17e9d6(%rip),%rax        # 0x6b0ae3
   0x000000000053210d <+31>:    mov    %rax,%rdi
   0x0000000000532110 <+34>:    mov    $0x0,%eax
   0x0000000000532115 <+39>:    call   0x52bc00 <printf@plt>
   0x000000000053211a <+44>:    nop
   0x000000000053211b <+45>:    mov    %rbp,%rsp
   0x000000000053211e <+48>:    pop    %rbp
   0x000000000053211f <+49>:    ret

It seems to have given up and just picked the first instruction :)

Here's the compile line args (removed extraneous stuff like warnings
and preprocessor options):

  g++ -std=gnu++20 -ggdb3 -fPIC -march=haswell -mtune=intel \
    -fno-omit-frame-pointer -O0 -pthread \
    -o TestClass.o -c TestClass.cpp

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-11 19:38   ` Paul Smith
@ 2024-03-11 19:50     ` Simon Marchi
  2024-03-11 20:17       ` Paul Smith
  2024-03-15 21:11       ` Paul Smith
  0 siblings, 2 replies; 9+ messages in thread
From: Simon Marchi @ 2024-03-11 19:50 UTC
  To: psmith, gdb

On 3/11/24 15:38, Paul Smith via Gdb wrote:
> On Mon, 2024-03-11 at 15:14 -0400, Simon Marchi wrote:
>>> I have noticed that if I only include the templated function
>>> definition but don't call it, then the problem doesn't happen.
>>
>> When you say that, does it mean that you just define the templated
>> function, or do you manually instantiate it?  In other words, does it
>> cause any code to be generated?
> 
> If I just define the templated function I don't see the issue.  If I
> invoke the templated function, I get the problem.
> 
> FYI I'm switching to the fmt library (if you're familiar with that)

Yes, I love it, we should use it in GDB :).

> and
> the templated function invokes it; it's something like this:
> 
>     void criticalErrorV(fmt::string_view fmt, const char *file, int line,
>                         fmt::format_args args);
> 
>     template <typename... Args>
>     void criticalError(fmt::format_string<Args...> fmt,
>                        const char* file, int line, Args &&...args)
>     {
>         criticalErrorV(fmt, file, line, fmt::make_format_args(args...));
>     }
> 
> If I never call criticalError() then it works fine (or in my previous
> implementation, which used printf-style calls with stdarg, it worked
> fine as well).

If you never call it, it never generates code, so it kinda makes sense
that it doesn't change anything.

> If I have some invocation of criticalError() somewhere in the
> translation unit, I get this problem.  I haven't checked moving it
> around to see if it needs to be invoked before/after the "problem"
> method in the TU to get this behavior.
> 
>> I don't really have an idea of what's happening, but you could try
>> showing what the "disas" command shows after hitting the breakpoint
>> in both cases (the `=>` should show where you are stopped, so where
>> the breakpoint was set).
> 
> Good idea; here's what I get for the correct behavior:
> 
>    0x000000000053209c <+0>:     push   %rbp
>    0x000000000053209d <+1>:     mov    %rsp,%rbp
>    0x00000000005320a0 <+4>:     lea    -0x10(%rsp),%rsp
>    0x00000000005320a5 <+9>:     mov    %rdi,-0x8(%rbp)
>    0x00000000005320a9 <+13>:    mov    %rsi,-0x10(%rbp)
> => 0x00000000005320ad <+17>:    mov    -0x10(%rbp),%rax
>    0x00000000005320b1 <+21>:    mov    %rax,%rsi
>    0x00000000005320b4 <+24>:    lea    0x17e9a9(%rip),%rax        # 0x6b0a64
>    0x00000000005320bb <+31>:    mov    %rax,%rdi
>    0x00000000005320be <+34>:    mov    $0x0,%eax
>    0x00000000005320c3 <+39>:    call   0x52bc00 <printf@plt>
>    0x00000000005320c8 <+44>:    nop
>    0x00000000005320c9 <+45>:    mov    %rbp,%rsp
>    0x00000000005320cc <+48>:    pop    %rbp
>    0x00000000005320cd <+49>:    ret
> 
> Here's what I get for the incorrect behavior:
> 
> => 0x00000000005320ee <+0>:     push   %rbp
>    0x00000000005320ef <+1>:     mov    %rsp,%rbp
>    0x00000000005320f2 <+4>:     lea    -0x10(%rsp),%rsp
>    0x00000000005320f7 <+9>:     mov    %rdi,-0x8(%rbp)
>    0x00000000005320fb <+13>:    mov    %rsi,-0x10(%rbp)
>    0x00000000005320ff <+17>:    mov    -0x10(%rbp),%rax
>    0x0000000000532103 <+21>:    mov    %rax,%rsi
>    0x0000000000532106 <+24>:    lea    0x17e9d6(%rip),%rax        # 0x6b0ae3
>    0x000000000053210d <+31>:    mov    %rax,%rdi
>    0x0000000000532110 <+34>:    mov    $0x0,%eax
>    0x0000000000532115 <+39>:    call   0x52bc00 <printf@plt>
>    0x000000000053211a <+44>:    nop
>    0x000000000053211b <+45>:    mov    %rbp,%rsp
>    0x000000000053211e <+48>:    pop    %rbp
>    0x000000000053211f <+49>:    ret
> 
> It seems to have given up and just picked the first instruction :)
> 
> Here's the compile line args (removed extraneous stuff like warnings
> and preprocessor options):
> 
>   g++ -std=gnu++20 -ggdb3 -fPIC -march=haswell -mtune=intel \
>     -fno-omit-frame-pointer -O0 -pthread \
>     -o TestClass.o -c TestClass.cpp

Ok, so clearly GDB failed to analyze the prologue.  Which is weird
because the two functions are identical (modulo the addresses).  To get
to the bottom of this, you (or someone else) would need to debug GDB
itself.  If you want to do this, I would start at function
skip_prologue_using_sal, in symtab.c.  Off hand, I don't think we have a
debug switch to enable logging for prologue skipping.  It would be
useful to have some here, as we would be able to compare the logging
shown in both cases.

When you have DWARF debug info (which is your case), prologue skipping
is done using the DWARF line tables.  You could try to extract the line
tables for both versions of the function and see what's different.  But
that would probably only be useful if you're debugging GDB already.
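
As a rough illustration of what line-table-based prologue skipping
amounts to, here is a toy model (not the code in symtab.c): the
breakpoint address is taken to be the first line-table row past the
function's entry address.  The LineRow type, the function name
skipPrologueUsingLineTable and the rows in exampleRows are all made up
for this sketch.  The rows were chosen to be consistent with the "good"
disassembly quoted above; line 101 for the closing brace is a guess.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// One row of a simplified line table: an address and the source line
// that starts there.  Real DWARF line tables carry more state.
struct LineRow {
    std::uint64_t address;
    int line;
};

// Toy model of line-table-based prologue skipping.  Returns the address
// of the first row that lies strictly after func_start and before
// func_end.  Rows are assumed to be sorted by address.  Returns
// std::nullopt when the function has no second row.
std::optional<std::uint64_t> skipPrologueUsingLineTable(
    const std::vector<LineRow>& rows,
    std::uint64_t func_start, std::uint64_t func_end)
{
    assert(func_start < func_end);
    for (const LineRow& row : rows) {
        if (row.address > func_start && row.address < func_end)
            return row.address;
    }
    return std::nullopt;
}

// Hypothetical rows for the "good" build: line 98 at <+0>,
// line 100 at <+17>, line 101 at <+44>.
inline std::vector<LineRow> exampleRows()
{
    return {
        {0x53209c, 98},
        {0x5320ad, 100},
        {0x5320c8, 101},
    };
}
```

With these rows and the function range [0x53209c, 0x5320ce), the model
picks 0x5320ad, which is where the `=>` marker sits in the "good"
disassembly.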

Simon

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-11 19:50     ` Simon Marchi
@ 2024-03-11 20:17       ` Paul Smith
  2024-03-15 21:11       ` Paul Smith
  1 sibling, 0 replies; 9+ messages in thread
From: Paul Smith @ 2024-03-11 20:17 UTC
  To: gdb

On Mon, 2024-03-11 at 15:50 -0400, Simon Marchi wrote:
> > FYI I'm switching to the fmt library (if you're familiar with that)
> 
> Yes, I love it, we should use it in GDB :).

It's very nice for portable programming, but I really hope that GCC
will add __attribute__ support for it like they have for printf()
formatting.  The errors generated are incomprehensible (basically you
get pages of error messages and the only useful thing is you get a
filename and line number to look at) and it has one significant failing:
it will throw a compile error if you have _too few_ arguments for the
formatting string, or the argument can't be formatted, but it is
completely silent if you have _too many_ arguments for the formatting
string.  In the abstract this makes sense since unlike with varargs
it's quite possible to have extra arguments that are not formatted...
but in real life I expect that capability is virtually never used, and
the lack of this warning makes porting very tricky (if you forget to
switch a "%s" to a "{}", the compiler will not warn you).

> If you never call it, it never generates code, so it kinda makes sense
> that it doesn't change anything.

Yes agreed.

> > If I have some invocation of criticalError() somewhere in the
> > translation unit, I get this problem.  I haven't checked moving it
> > around to see if it needs to be invoked before/after the "problem"
> > method in the TU to get this behavior.

Just to try it I put only one invocation near the start of the TU, then
only one invocation near the end of the TU.  I got the problem in both
cases, so that's kind of odd.  Maybe.

> Ok, so clearly GDB failed to analyze the prologue.  Which is weird
> because the two functions are identical (modulo the addresses).

Yes!  Very weird.

> To get to the bottom of this, you (or someone else) would need to
> debug GDB itself.

OK.  I will look into this but it may take a few days.  The first thing
I'll do is build the latest release of GDB and see if it makes a
difference.  I'm hopeful that the problem is in GDB not GCC or binutils
since those are harder to change, but we'll see.

Thanks for the conversation, Simon.  I'll let you know where I get, with
the goal of filing a bug if I can repro with the latest code and can
get some idea of what's going on.

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-11 19:50     ` Simon Marchi
  2024-03-11 20:17       ` Paul Smith
@ 2024-03-15 21:11       ` Paul Smith
  2024-03-15 22:19         ` Paul Smith
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Smith @ 2024-03-15 21:11 UTC
  To: Simon Marchi, gdb

On Mon, 2024-03-11 at 15:50 -0400, Simon Marchi wrote:
> Ok, so clearly GDB failed to analyze the prologue.  Which is weird
> because the two functions are identical (modulo the addresses).  To
> get to the bottom of this, you (or someone else) would need to debug
> GDB itself.  If you want to do this, I would start at function
> skip_prologue_using_sal, in symtab.c.  Off hand, I don't think we
> have a debug switch to enable logging for prologue skipping.  It
> would be useful to have some here, as we would be able to compare the
> logging shown in both cases.

FYI I have finally gotten back to looking at this.  I've only been at
it for a short time but just for information:

I was able to build GDB 14.2 (latest release) from source and I still
see the issue there.  So I started debugging.

I can tell you that in the "good" binary case I can see that
amd64_tdep.c:amd64_skip_prologue() is invoked which invokes
symtab.c:skip_prologue_using_sal() as you suggested.  In fact, these
methods are called numerous times.

In the "bad" binary case, neither of those methods is called, ever.  I
put a gdb_printf() in both functions and in the "good" binary I see
probably 20 invocations between starting, setting the breakpoint,
running, and exiting: in the "bad" binary zero invocations.  I do see
that we definitely invoke set_gdbarch_skip_prologue() with the amd64
function pointer in both cases, so it's not that.

I'm looking to see where *_skip_prologue() is called from to figure out
where the code paths diverge, just thought I'd send a note to let folks
know that I've not dropped this investigation.

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-15 21:11       ` Paul Smith
@ 2024-03-15 22:19         ` Paul Smith
  2024-03-16 16:33           ` Simon Marchi
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Smith @ 2024-03-15 22:19 UTC
  To: Simon Marchi, gdb

On Fri, 2024-03-15 at 17:11 -0400, Paul Smith via Gdb wrote:
> I can tell you that in the "good" binary case I can see that
> amd64_tdep.c:amd64_skip_prologue() is invoked which invokes
> symtab.c:skip_prologue_using_sal() as you suggested.  In fact, these
> methods are called numerous times.
> 
> In the "bad" binary case, neither of those methods is called, ever. 
> I put a gdb_printf() in both functions and in the "good" binary I see
> probably 20 invocations between starting, setting the breakpoint,
> running, and exiting: in the "bad" binary zero invocations.  I do see
> that we definitely invoke set_gdbarch_skip_prologue() with the amd64
> function pointer in both cases, so it's not that.

More details, no answers.

However, the problem is much deeper than some kind of incorrect
computation of the prologue length.  It appears to be a major
difference in the structure of the binary itself, which is weird.

The difference happens in symtab.c:find_function_start_sal_1().  When
this is called on the "good" binary,
sal.symtab->compunit()->locations_valid() is 0 so we fall through to
calling skip_prologue_sal().

In the "bad" binary, locations_valid() returns 1 instead.  This sends
us through this code starting at symtab.c:3607:

  if (funfirstline && sal.symtab != NULL
      && (sal.symtab->compunit ()->locations_valid ()
          || sal.symtab->language () == language_asm))
    {
      struct gdbarch *gdbarch
        = sal.symtab->compunit ()->objfile ()->arch ();

      sal.pc = func_addr;
      if (gdbarch_skip_entrypoint_p (gdbarch))
        sal.pc = gdbarch_skip_entrypoint (gdbarch, sal.pc);
      return sal;
    }

thus returning early.  I've checked and gdbarch_skip_entrypoint_p()
returns false, so gdbarch_skip_entrypoint() is not called.

I've also verified that all other aspects of the above if-statement
(funfirstline and sal.symtab->language()) are the same (1 and 4)
between the good and bad calls.  The difference appears to be the
return code of locations_valid().
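
To spell out the branch described above, here is a toy model of the
decision (a sketch only, not GDB's code).  CompUnitModel,
SourceLanguage and breakpointAddress are names invented for the
example, the symtab is assumed to be non-null, and prologue_end stands
in for whatever address prologue skipping would have computed.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the GDB state discussed above; these are not GDB's types.
enum class SourceLanguage { Asm, Cplus };

struct CompUnitModel {
    bool locations_valid;     // models compunit ()->locations_valid ()
    SourceLanguage language;  // models symtab->language ()
};

// Toy version of the choice made in find_function_start_sal_1.
// func_addr is the function's entry address; prologue_end is the
// address that prologue skipping would produce (assumed known here).
std::uint64_t breakpointAddress(const CompUnitModel& cu, bool funfirstline,
                                std::uint64_t func_addr,
                                std::uint64_t prologue_end)
{
    assert(prologue_end >= func_addr);

    // No prologue skipping was requested: stay at the entry address.
    if (!funfirstline)
        return func_addr;

    // Early return: the CU claims its locations are valid everywhere
    // (or the source is assembly), so the entry address is used as-is.
    if (cu.locations_valid || cu.language == SourceLanguage::Asm)
        return func_addr;

    // Otherwise the prologue is skipped.
    return prologue_end;
}
```

Fed the addresses from the two disassemblies, the model gives 0x5320ad
for the "good" binary (locations_valid false) and 0x5320ee, the very
first instruction, for the "bad" one (locations_valid true).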

Looking into this it appears to be something set for the entire binary,
differently between the "good" and "bad" binary.

In the "good" binary, when we enter read.c:process_full_comp_unit(), the
passed-in dwarf2_cu value of has_loclist is false.  Because of that,
this is not called:

      if (cu->has_loclist && gcc_4_minor >= 5)
        cust->set_locations_valid (true);

and because this is not called, the locations_valid() return above is
false.

In the "bad" binary when we enter process_full_comp_unit(), the value
of has_loclist is true.  Because of this we call
cust->set_locations_valid(true) above, and this means locations_valid()
returns true and we take the early-return path in
find_function_start_sal_1() instead of calling skip_prologue_sal().

I have to stop here for today but maybe I'll have more time later this
weekend.  If anyone has hints on how to determine why the settings of
struct dwarf2_cu are different let me know.

Cheers!

* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-15 22:19         ` Paul Smith
@ 2024-03-16 16:33           ` Simon Marchi
  2024-03-16 19:57             ` Paul Smith
  0 siblings, 1 reply; 9+ messages in thread
From: Simon Marchi @ 2024-03-16 16:33 UTC
  To: psmith, gdb



On 2024-03-15 18:19, Paul Smith wrote:
> On Fri, 2024-03-15 at 17:11 -0400, Paul Smith via Gdb wrote:
>> I can tell you that in the "good" binary case I can see that
>> amd64_tdep.c:amd64_skip_prologue() is invoked which invokes
>> symtab.c:skip_prologue_using_sal() as you suggested.  In fact, these
>> methods are called numerous times.
>>
>> In the "bad" binary case, neither of those methods is called, ever. 
>> I put a gdb_printf() in both functions and in the "good" binary I see
>> probably 20 invocations between starting, setting the breakpoint,
>> running, and exiting: in the "bad" binary zero invocations.  I do see
>> that we definitely invoke set_gdbarch_skip_prologue() with the amd64
>> function pointer in both cases, so it's not that.
> 
> More details, no answers.
> 
> However, the problem is much deeper than some kind of incorrect
> computation of the prologue length.  It appears to be a major
> difference in the structure of the binary itself, which is weird.
> 
> The difference happens in symtab.c:find_function_start_sal_1().  When
> this is called on the "good" binary,
> sal.symtab->compunit()->locations_valid() is 0 so we fall through to
> calling skip_prologue_sal().
> 
> In the "bad" binary, locations_valid() returns 1 instead.  This sends
> us through this code starting at symtab.c:3607:
> 
>   if (funfirstline && sal.symtab != NULL
>       && (sal.symtab->compunit ()->locations_valid ()
>           || sal.symtab->language () == language_asm))
>     {
>       struct gdbarch *gdbarch
>         = sal.symtab->compunit ()->objfile ()->arch ();
> 
>       sal.pc = func_addr;
>       if (gdbarch_skip_entrypoint_p (gdbarch))
>         sal.pc = gdbarch_skip_entrypoint (gdbarch, sal.pc);
>       return sal;
>     }
> 
> thus returning early.  I've checked and gdbarch_skip_entrypoint_p()
> returns false, so gdbarch_skip_entrypoint() is not called.
> 
> I've also verified that all other aspects of the above if-statement
> (funfirstline and sal.symtab->language()) are the same (1 and 4)
> between the good and bad calls.  The difference appears to be the
> return code of locations_valid().
> 
> 
> Looking into this it appears to be something set for the entire binary,
> differently between the "good" and "bad" binary.
> 
> In the "good" binary, when we enter read.c:process_full_comp_unit(), the
> passed-in dwarf2_cu value of has_loclist is false.  Because of that,
> this is not called:
> 
>       if (cu->has_loclist && gcc_4_minor >= 5)
>         cust->set_locations_valid (true);
> 
> and because this is not called, the locations_valid() return above is
> false.
> 
> In the "bad" binary when we enter process_full_comp_unit(), the value
> of has_loclist is true.  Because of this we call
> cust->set_locations_valid(true) above, and this means locations_valid()
> returns true and we take the early-return path in
> find_function_start_sal_1() instead of calling skip_prologue_sal().
> 
> I have to stop here for today but maybe I'll have more time later this
> weekend.  If anyone has hints on how to determine why the settings of
> struct dwarf2_cu are different let me know.

Hi Paul,

I started to look at this problem this week, because I hit a case in my
own C++ program very similar to yours.  I didn't have time to finish my
reply, but my findings were very similar to yours.  When compiled with
gcc 11, the prologue is skipped.  When compiled with gcc 12 and 13, the
prologue is not skipped.  All with -O0.

Here's my analysis (partly redundant with what you said):

First, what I see:

Here, GDB stopped at the very first instruction of the function.  The
arguments are wrong:

    (gdb) info args
    this = 0x3dd736
    msgType1 = ((anonymous namespace)::MsgType::MSG_ITER_INACTIVITY | unknown: 0x5554)
    msgType2 = (unknown: 0x555553f8)

If I step past the prologue, they become correct:

    (gdb) n
    183         const auto specTestName = makeSpecTestName(_mTestName, msgType1, msgType2);
    (gdb) info args
    this = 0x555555a65e40 <(anonymous namespace)::errorTestCases>
    msgType1 = (anonymous namespace)::MsgType::STREAM
    msgType2 = (anonymous namespace)::MsgType::STREAM

When the prologue is skipped in the gcc 11-compiled executable, we reach
the skip_prologue_sal function like this:

    #0  skip_prologue_sal (sal=0x7ffd145c64b0) at /home/smarchi/src/binutils-gdb/gdb/symtab.c:3852
    #1  0x0000561d4fa02155 in find_function_start_sal_1 (func_addr=1910486, section=0x561d5247d818, funfirstline=true)
        at /home/smarchi/src/binutils-gdb/gdb/symtab.c:3716
    #2  0x0000561d4fa0221e in find_function_start_sal (sym=0x561d52ed5900, funfirstline=true)
        at /home/smarchi/src/binutils-gdb/gdb/symtab.c:3744
    #3  0x0000561d4f7bc0e5 in symbol_to_sal (result=0x7ffd145c6570, funfirstline=1, sym=0x561d52ed5900)
        at /home/smarchi/src/binutils-gdb/gdb/linespec.c:4376
    #4  0x0000561d4f7b62f1 in convert_linespec_to_sals (state=0x7ffd145c69a0, ls=0x7ffd145c69f0)
        at /home/smarchi/src/binutils-gdb/gdb/linespec.c:2255
    #5  0x0000561d4f7b73a5 in parse_linespec (parser=0x7ffd145c6970, arg=0x7f002c018ce0 "_runOne",
        match_type=symbol_name_match_type::WILD) at /home/smarchi/src/binutils-gdb/gdb/linespec.c:2640
    #6  0x0000561d4f7b84e4 in location_spec_to_sals (parser=0x7ffd145c6970, locspec=0x561d52464650)
        at /home/smarchi/src/binutils-gdb/gdb/linespec.c:3080
    #7  0x0000561d4f7b890a in decode_line_full (locspec=0x561d52464650, flags=1, search_pspace=0x0, default_symtab=0x0, default_line=0,
        canonical=0x7ffd145c6e00, select_mode=0x0, filter=0x0) at /home/smarchi/src/binutils-gdb/gdb/linespec.c:3157
    #8  0x0000561d4f4989e9 in parse_breakpoint_sals (locspec=0x561d52464650, canonical=0x7ffd145c6e00)
        at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:8895
    #9  0x0000561d4f4a5077 in create_sals_from_location_spec_default (locspec=0x561d52464650, canonical=0x7ffd145c6e00)
        at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:13200
    #10 0x0000561d4f499a0f in create_breakpoint (gdbarch=0x561d5244cac0, locspec=0x561d52464650, cond_string=0x0, thread=-1, inferior=-1,
        extra_string=0x0, force_condition=false, parse_extra=1, tempflag=0, type_wanted=bp_breakpoint, ignore_count=0,
        pending_break_support=AUTO_BOOLEAN_AUTO, ops=0x561d501b5100 <code_breakpoint_ops>, from_tty=1, enabled=1, internal=0, flags=0)
        at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:9230
    #11 0x0000561d4f49a4da in break_command_1 (arg=0x561d521e1d49 "", flag=0, from_tty=1)
        at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:9415

When the prologue is not skipped, with gcc 12 and 13, skip_prologue_sal
is never called.  Backtracking a bit, I found that in that case
find_function_start_sal_1 returns early due to `sal.symtab->compunit
()->locations_valid ()` being true.  The locations_valid flag is set in
the DWARF reader (process_full_comp_unit function) whenever
dwarf2_cu::has_loclist is true.  That is set in var_decode_location when
processing a symbol whose location (DW_AT_location) is a loclist.

In my gcc 11-generated executable, I don't have a symbol whose location
is a loclist.  In my gcc 12 or 13-generated executable, I do:

    DW_AT_location [DW_FORM_sec_offset]   (0x0000000c:
       [0x000000000024ed64, 0x000000000024ed7b): DW_OP_reg5 RDI
       [0x000000000024ed7b, 0x000000000024ee14): DW_OP_reg3 RBX
       [0x000000000024ee14, 0x000000000024ee15): DW_OP_entry_value(DW_OP_reg5 RDI), DW_OP_stack_value
       [0x000000000024ee15, 0x000000000024ef0b): DW_OP_reg3 RBX)

The reasoning behind this is explained here in process_full_comp_unit:

      /* GCC-4.0 has started to support -fvar-tracking.  GCC-3.x still can
	 produce DW_AT_location with location lists but it can be possibly
	 invalid without -fvar-tracking.  Still up to GCC-4.4.x incl. 4.4.0
	 there were bugs in prologue debug info, fixed later in GCC-4.5
	 by "unwind info for epilogues" patch (which is not directly related).

	 For -gdwarf-4 type units LOCATIONS_VALID indication is fortunately not
	 needed, it would be wrong due to missing DW_AT_producer there.

	 Still one can confuse GDB by using non-standard GCC compilation
	 options - this waits on GCC PR other/32998 (-frecord-gcc-switches).
	 */
      if (cu->has_loclist && gcc_4_minor >= 5)
	cust->set_locations_valid (true);

So, as soon as it sees one loclist in the compilation unit, GDB assumes
that GCC has produced loclists that accurately describe variable values
everywhere, even in prologues.  This assumption is not true here.  The
locations for the two arguments I tried to print earlier are only valid
after the prologue, after the stack has been set up:


0x00055daa:     DW_TAG_formal_parameter
                  DW_AT_name [DW_FORM_strp]     ("msgType1")
                  DW_AT_location [DW_FORM_exprloc]      (DW_OP_fbreg -1660)

0x00055dba:     DW_TAG_formal_parameter
                  DW_AT_name [DW_FORM_strp]     ("msgType2")
                  DW_AT_location [DW_FORM_exprloc]      (DW_OP_fbreg -1664)
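
As a toy illustration of why such a DW_OP_fbreg location reads a wrong
value at the function's first instruction: the sketch below assumes
unoptimized code in which the argument arrives in a register and the
function's opening instructions copy it into its stack slot.  The Frame
type, the placeholder value for stale memory and the register contents
are hypothetical; only the offset -1660 comes from the dump above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical model of one stack frame as seen by a debugger.
struct Frame {
    std::map<int, std::uint32_t> slots;  // frame-base offset -> stored value
    std::uint32_t arg_register;          // where the caller passed the argument

    // What evaluating DW_OP_fbreg <offset> yields: the slot's content, or
    // a placeholder for stale memory if nothing has been stored there yet.
    std::uint32_t read_fbreg(int offset) const {
        auto it = slots.find(offset);
        return it == slots.end() ? 0xdeadbeefu : it->second;
    }

    // Effect of the opening instructions: spill the register to the slot.
    void run_prologue(int offset) { slots[offset] = arg_register; }
};
```

Reading offset -1660 before run_prologue gives the placeholder; reading
it afterwards gives the value that was passed in the register, which is
the "wrong, then correct after stepping" pattern seen in the session.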

Simon


* Re: GDB 13.2: breakpoint at wrong line after unrelated change
  2024-03-16 16:33           ` Simon Marchi
@ 2024-03-16 19:57             ` Paul Smith
  0 siblings, 0 replies; 9+ messages in thread
From: Paul Smith @ 2024-03-16 19:57 UTC (permalink / raw)
  To: Simon Marchi, gdb

On Sat, 2024-03-16 at 12:33 -0400, Simon Marchi wrote:
> Here, GDB stopped at the very first instruction of the function.  The
> arguments are wrong:
> If I step past the prologue, they become correct:

Everything you discovered is identical to my situation in every way,
and you got a little further than I did.  It's interesting that as long
as I don't invoke my perfect-forwarding fmt templated function it
works; as soon as I do GDB seems to misinterpret the handling of the
entire binary.

It seems this is a bug in GDB, after all.  I feel like you're in a
better position to file a bug (with a deeper understanding of the
problem) but if you would prefer me to do so I can.  I'm certainly
willing to test out any proposed fixes, if any were forthcoming.  I do
have the infrastructure here to build and test GDB.

In the meantime I will look into what I can do to work around this
issue as it's causing some of my tests to fail (we have a suite of GDB
Python macros we use and we wrote tests of these macros in our test
suite, which are failing due to this problem).  I don't really want to
delay deployment of my fmt changes until this GDB issue is fixed.
Perhaps I can modify the tests to add a "step" call when they detect
that the prologue was not skipped, or something like that.

Thanks!

-- 
Paul D. Smith <psmith@gnu.org>            Find some GNU Make tips at:
https://www.gnu.org                       http://make.mad-scientist.net
"Please remain calm...I may be mad, but I am a professional." --Mad
Scientist


end of thread, other threads:[~2024-03-16 19:57 UTC | newest]

Thread overview: 9+ messages
2024-03-11 18:28 GDB 13.2: breakpoint at wrong line after unrelated change Paul Smith
2024-03-11 19:14 ` Simon Marchi
2024-03-11 19:38   ` Paul Smith
2024-03-11 19:50     ` Simon Marchi
2024-03-11 20:17       ` Paul Smith
2024-03-15 21:11       ` Paul Smith
2024-03-15 22:19         ` Paul Smith
2024-03-16 16:33           ` Simon Marchi
2024-03-16 19:57             ` Paul Smith
