public inbox for elfutils@sourceware.org
 help / color / mirror / Atom feed
* How to debug broken unwinding?
@ 2017-06-02 11:53 Milian Wolff
  2017-06-02 15:03 ` Milian Wolff
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Milian Wolff @ 2017-06-02 11:53 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 5766 bytes --]

Hey all,

on my ArchLinux box I regularly see cases where libdw fails to unwind 
properly. I can reproduce this both with upstream perf as well as with the 
perfparser utility.

How should I debug this, or how can I report a good bug report for this? I 
guess I could upload a perf archive and document the steps required to build 
perf with libdw as the unwinder, as it allows to easily compare libunwind and 
libdw for unwinding. When I then diff the output of `perf script` for the two 
unwinders for one and the same perf.data file, I see issues like this:

$ diff -u script.libunwind script.elfutils
--- script.libunwind    2017-06-01 22:30:52.418029474 +0200
+++ script.elfutils2    2017-06-01 22:35:10.987823055 +0200
@@ -510,10 +510,6 @@
                    e8ed _dl_fixup (/usr/lib/ld-2.25.so)
                   15f06 _dl_runtime_resolve_sse_vex (/usr/lib/ld-2.25.so)
                   ed94c KDynamicJobTracker::KDynamicJobTracker (/home/milian/
projects/compiled/kf5/lib64/libKF5KIOWidgets.so.5.35.0)
-                  608f3 _GLOBAL__sub_I_kdynamicjobtracker.cpp (/home/milian/
projects/compiled/kf5/lib64/libKF5KIOWidgets.so.5.35.0)
-                   f199 call_init.part.0 (/usr/lib/ld-2.25.so)
-                   f2a5 _dl_init (/usr/lib/ld-2.25.so)
-                    db9 _dl_start_user (/usr/lib/ld-2.25.so)

NOTE: it seems as if unwinding through _GLOBAL__sub* always fails?

 heaptrack_gui  2228 135073.400474:     613969 cycles: 
                  108c8e [unknown] (/usr/lib/libQt5Core.so.5.8.0)
@@ -533,8 +529,6 @@
                  2170af [unknown] (/usr/lib/libQt5Core.so.5.8.0)
                  297c53 QCoreApplicationPrivate::init (/usr/lib/
libQt5Core.so.5.8.0)
                   f7cde QGuiApplicationPrivate::init (/usr/lib/libQt5Gui.so.
5.8.0)
-                 1589e8 QApplicationPrivate::init (/usr/lib/libQt5Widgets.so.
5.8.0)
-                  78622 main (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
                   20439 __libc_start_main (/usr/lib/libc-2.25.so)
                   78299 _start (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
 
NOTE: this is super odd, it simply misses two frames in the middle?!

@@ -557,14 +551,7 @@
                  2780c7 QFactoryLoader::update (/usr/lib/libQt5Core.so.5.8.0)
                  279525 QFactoryLoader::QFactoryLoader (/usr/lib/
libQt5Core.so.5.8.0)
                   e5bd0 QPlatformIntegrationFactory::create (/usr/lib/
libQt5Gui.so.5.8.0)
-                  f5a1c QGuiApplicationPrivate::createPlatformIntegration (/
usr/lib/libQt5Gui.so.5.8.0)
-                  f650c QGuiApplicationPrivate::createEventDispatcher (/usr/
lib/libQt5Gui.so.5.8.0)
-                 298524 QCoreApplicationPrivate::init (/usr/lib/
libQt5Core.so.5.8.0)
-                  f7cde QGuiApplicationPrivate::init (/usr/lib/libQt5Gui.so.
5.8.0)
-                 1589e8 QApplicationPrivate::init (/usr/lib/libQt5Widgets.so.
5.8.0)
-                  78622 main (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
-                  20439 __libc_start_main (/usr/lib/libc-2.25.so)
-                  78299 _start (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
+                 723dbf [unknown] ([unknown])
 
NOTE: this is sadly a very common issue with libdw where it fails to unwind 
after some frames. The computed address for the next frame is totally off 
(723dbf vs f5a1c) and then everything stops...

 heaptrack_gui  2228 135073.402820:     569191 cycles: 
                    90a2 do_lookup_x (/usr/lib/ld-2.25.so)
@@ -732,23 +670,12 @@
                    b768 _dl_relocate_object (/usr/lib/ld-2.25.so)
                   135e2 dl_open_worker (/usr/lib/ld-2.25.so)
                  126c93 _dl_catch_error (/usr/lib/libc-2.25.so)
-                  12f28 _dl_open (/usr/lib/ld-2.25.so)
                     f25 [unknown] (/usr/lib/libdl-2.25.so)
                  126c93 _dl_catch_error (/usr/lib/libc-2.25.so)
                    15f4 [unknown] (/usr/lib/libdl-2.25.so)
                     fb0 dlopen (/usr/lib/libdl-2.25.so)
                  28dd00 [unknown] (/usr/lib/libQt5Core.so.5.8.0)
-                 287164 [unknown] (/usr/lib/libQt5Core.so.5.8.0)
-                 2773af QFactoryLoader::instance (/usr/lib/libQt5Core.so.
5.8.0)
-                  e5b3e QPlatformIntegrationFactory::create (/usr/lib/
libQt5Gui.so.5.8.0)
-                  f5a1c QGuiApplicationPrivate::createPlatformIntegration (/
usr/lib/libQt5Gui.so.5.8.0)
-                  f650c QGuiApplicationPrivate::createEventDispatcher (/usr/
lib/libQt5Gui.so.5.8.0)
-                 298524 QCoreApplicationPrivate::init (/usr/lib/
libQt5Core.so.5.8.0)
-                  f7cde QGuiApplicationPrivate::init (/usr/lib/libQt5Gui.so.
5.8.0)
-                 1589e8 QApplicationPrivate::init (/usr/lib/libQt5Widgets.so.
5.8.0)
-                  78622 main (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
-                  20439 __libc_start_main (/usr/lib/libc-2.25.so)
-                  78299 _start (/home/milian/projects/compiled/other/bin/
heaptrack_gui)
+                 72499f [unknown] ([unknown])
 
NOTE: a mixture of missing frames in the middle as well as broken backtrace at 
the end

Since this is (sadly) very easy for me to reproduce, I'd be willing to invest 
some time to get this properly fixed. I fail to come up with a way to put this 
into a MWE that does not depend on tons of DSOs on my machine. I fear it's the 
only way to reproduce it though... But since I'm far from an expert in DWARF I 
have no clue on how to even begin to tackle this issue. Would someone more 
involved with this matter accept a `perf archive` to investigate what perf + 
libdw are doing with my DSOs?

Thanks
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-02 11:53 How to debug broken unwinding? Milian Wolff
@ 2017-06-02 15:03 ` Milian Wolff
  2017-06-07 12:16 ` Milian Wolff
  2017-06-13 16:17 ` Mark Wielaard
  2 siblings, 0 replies; 8+ messages in thread
From: Milian Wolff @ 2017-06-02 15:03 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2840 bytes --]

On Donnerstag, 1. Juni 2017 22:57:12 CEST Milian Wolff wrote:
> Hey all,
> 
> on my ArchLinux box I regularly see cases where libdw fails to unwind
> properly. I can reproduce this both with upstream perf as well as with the
> perfparser utility.
> 
> How should I debug this, or how can I report a good bug report for this? I
> guess I could upload a perf archive and document the steps required to build
> perf with libdw as the unwinder, as it allows to easily compare libunwind
> and libdw for unwinding. When I then diff the output of `perf script` for
> the two unwinders for one and the same perf.data file, I see issues like
> this:
> 
> $ diff -u script.libunwind script.elfutils
> --- script.libunwind    2017-06-01 22:30:52.418029474 +0200
> +++ script.elfutils2    2017-06-01 22:35:10.987823055 +0200
> @@ -510,10 +510,6 @@
>                     e8ed _dl_fixup (/usr/lib/ld-2.25.so)
>                    15f06 _dl_runtime_resolve_sse_vex (/usr/lib/ld-2.25.so)
>                    ed94c KDynamicJobTracker::KDynamicJobTracker
> (/home/milian/ projects/compiled/kf5/lib64/libKF5KIOWidgets.so.5.35.0)
> -                  608f3 _GLOBAL__sub_I_kdynamicjobtracker.cpp
> (/home/milian/ projects/compiled/kf5/lib64/libKF5KIOWidgets.so.5.35.0)
> -                   f199 call_init.part.0 (/usr/lib/ld-2.25.so)
> -                   f2a5 _dl_init (/usr/lib/ld-2.25.so)
> -                    db9 _dl_start_user (/usr/lib/ld-2.25.so)
> 
> NOTE: it seems as if unwinding through _GLOBAL__sub* always fails?

This part I now investigated with extensive debug output and figured out the 
issue: For the last frame that works in both cases, i.e. ed94c, libdwfl says 
it knows that this address belongs to /usr/lib/ld-2.25.so. In reality, it 
belongs to libKF5KIOWidgets.so.5.35.0.

Previously, perf just checked whether any module is known to libdwfl for a 
given address and then trusted it to do the right thing. Now I created a patch 
that double-checks whether the mapping known to libdwfl matches what perf 
knows. If not, we report the correct map (as known to perf) and this fixes the 
issue.

In general, I believe that libdwfl's API is lacking here. Both perf and 
perfparser know the exact mapping of a file, i.e. the file, it's start and end  
address, as well as the pgoff. But the integration with dwfl simply calls 
dwfl_report_elf, which only takes a start address. For things like ld-2.25.so 
this is often not enough. Is there any chance to expand the API to let us set 
the explicit mapping addresses?

I see there's dwfl_report_module, which at least takes start and end address, 
but so far I always failed to use it for unwinding - it seems as if that 
function is not setting up the internal ELF file and thus all of the functions 
relying on that will break.

Thanks

-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-02 11:53 How to debug broken unwinding? Milian Wolff
  2017-06-02 15:03 ` Milian Wolff
@ 2017-06-07 12:16 ` Milian Wolff
  2017-06-07 18:41   ` Milian Wolff
  2017-06-13 16:17 ` Mark Wielaard
  2 siblings, 1 reply; 8+ messages in thread
From: Milian Wolff @ 2017-06-07 12:16 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2052 bytes --]

On Donnerstag, 1. Juni 2017 22:57:12 CEST Milian Wolff wrote:
> Hey all,

<snip>

>  heaptrack_gui  2228 135073.400474:     613969 cycles:
>                   108c8e [unknown] (/usr/lib/libQt5Core.so.5.8.0)
> @@ -533,8 +529,6 @@
>                   2170af [unknown] (/usr/lib/libQt5Core.so.5.8.0)
>                   297c53 QCoreApplicationPrivate::init (/usr/lib/
> libQt5Core.so.5.8.0)
>                    f7cde QGuiApplicationPrivate::init
> (/usr/lib/libQt5Gui.so. 5.8.0)
> -                 1589e8 QApplicationPrivate::init
> (/usr/lib/libQt5Widgets.so. 5.8.0)
> -                  78622 main (/home/milian/projects/compiled/other/bin/
> heaptrack_gui)
>                    20439 __libc_start_main (/usr/lib/libc-2.25.so)
>                    78299 _start (/home/milian/projects/compiled/other/bin/
> heaptrack_gui)
> 
> NOTE: this is super odd, it simply misses two frames in the middle?!

This is really quite odd - looking at the debug output, the frames in the 
middle are really just skipped for some reason:

unwind: access_mem addr 0x7ffca0a88330, val 4edc50, offset 2808
unwind: access_mem addr 0x7ffca0a88338, val 7f69bfce443a, offset 2816
unwind: pc: = 0x7f69c10fecde
found map: 7f69c1007000 7f69c1766000
dso found: libQt5Gui.so.5.8.0 /usr/lib/libQt5Gui.so.5.8.0
reported: libQt5Gui.so.5.8.0 /usr/lib/libQt5Gui.so.5.8.0, 1
unwind: QGuiApplicationPrivate::init():ip = 0x7f69c10fecde (0xf7cde)

-> so far so good, this frame is properly found inside libQt5Gui, but then:

unwind: pc: = 0x7f69bfce4439
found map: 7f69bfcc4000 7f69c0069000
dso found: libc-2.25.so /usr/lib/libc-2.25.so
reported: libc-2.25.so /usr/lib/libc-2.25.so, 1
unwind: __libc_start_main:ip = 0x7f69bfce4439 (0x20439)

-> the next frame is is supposedly the one in libc, but what happened to the 
two frames in QApplicationPrivate::init and main? I also note that no calls to 
access_mem are occuring - is this maybe some (wrong) caching in libdw or so 
that interfers here?

Any insight would be appreciated, thanks!
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-07 12:16 ` Milian Wolff
@ 2017-06-07 18:41   ` Milian Wolff
  2017-06-14 18:27     ` Mark Wielaard
  0 siblings, 1 reply; 8+ messages in thread
From: Milian Wolff @ 2017-06-07 18:41 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 4145 bytes --]

On Freitag, 2. Juni 2017 15:26:10 CEST Milian Wolff wrote:
> On Donnerstag, 1. Juni 2017 22:57:12 CEST Milian Wolff wrote:
> > Hey all,
> 
> <snip>
> 
> >  heaptrack_gui  2228 135073.400474:     613969 cycles:
> >                   108c8e [unknown] (/usr/lib/libQt5Core.so.5.8.0)
> > 
> > @@ -533,8 +529,6 @@
> > 
> >                   2170af [unknown] (/usr/lib/libQt5Core.so.5.8.0)
> >                   297c53 QCoreApplicationPrivate::init (/usr/lib/
> > 
> > libQt5Core.so.5.8.0)
> > 
> >                    f7cde QGuiApplicationPrivate::init
> > 
> > (/usr/lib/libQt5Gui.so. 5.8.0)
> > -                 1589e8 QApplicationPrivate::init
> > (/usr/lib/libQt5Widgets.so. 5.8.0)
> > -                  78622 main (/home/milian/projects/compiled/other/bin/
> > heaptrack_gui)
> > 
> >                    20439 __libc_start_main (/usr/lib/libc-2.25.so)
> >                    78299 _start (/home/milian/projects/compiled/other/bin/
> > 
> > heaptrack_gui)
> > 
> > NOTE: this is super odd, it simply misses two frames in the middle?!
> 
> This is really quite odd - looking at the debug output, the frames in the
> middle are really just skipped for some reason:
> 
> unwind: access_mem addr 0x7ffca0a88330, val 4edc50, offset 2808
> unwind: access_mem addr 0x7ffca0a88338, val 7f69bfce443a, offset 2816
> unwind: pc: = 0x7f69c10fecde
> found map: 7f69c1007000 7f69c1766000
> dso found: libQt5Gui.so.5.8.0 /usr/lib/libQt5Gui.so.5.8.0
> reported: libQt5Gui.so.5.8.0 /usr/lib/libQt5Gui.so.5.8.0, 1
> unwind: QGuiApplicationPrivate::init():ip = 0x7f69c10fecde (0xf7cde)
> 
> -> so far so good, this frame is properly found inside libQt5Gui, but then:
> 
> unwind: pc: = 0x7f69bfce4439
> found map: 7f69bfcc4000 7f69c0069000
> dso found: libc-2.25.so /usr/lib/libc-2.25.so
> reported: libc-2.25.so /usr/lib/libc-2.25.so, 1
> unwind: __libc_start_main:ip = 0x7f69bfce4439 (0x20439)
> 
> -> the next frame is is supposedly the one in libc, but what happened to the
> two frames in QApplicationPrivate::init and main? I also note that no calls
> to access_mem are occuring - is this maybe some (wrong) caching in libdw or
> so that interfers here?
> 
> Any insight would be appreciated, thanks!

Some more debugging and going after my gut feeling brings me to the following 
conclusion: The real issue seems to be the on-demand reporting of the elf 
file. We used to do:

       Dwarf_Addr pc;
       bool isactivation;
 
       if (!dwfl_frame_pc(state, &pc, &isactivation)) {
               pr_err("%s", dwfl_errmsg(-1));
               return DWARF_CB_ABORT;
       }

       // report the module before we query for isactivation
       report_module(pc, ui);

This looks safe and fine and actually works most of the time. But passing a 
non-null isactivation flag to dwfl_frame_pc potentially leads to a second 
unwind step, before we got the change to report the module! I can workaround 
this by instead doing

       Dwarf_Addr pc;
       bool isactivation;
 
       if (!dwfl_frame_pc(state, &pc, NULL)) {
               pr_err("%s", dwfl_errmsg(-1));
               return DWARF_CB_ABORT;
       }

       // report the module before we query for isactivation
       report_module(pc, ui);

       if (!dwfl_frame_pc(state, &pc, &isactivation)) {
               pr_err("%s", dwfl_errmsg(-1));
               return DWARF_CB_ABORT;
       }

This fixes all the issues in my original email. So sorry for the noise - it 
doesn't see as if the unwinding code in elfutils is broken - quite the 
contrary! It's just our misuse of the API that is to blame.

May I suggest though to move the isactivation code into a separate function to 
prevent this issue from arising in the future? I.e. it would be nice if the 
code above could read:


       Dwarf_Addr pc;
       bool isactivation;
 
       if (!dwfl_frame_pc(state, &pc)) {
               pr_err("%s", dwfl_errmsg(-1));
               return DWARF_CB_ABORT;
       }

       // report the module before we query for isactivation
       report_module(pc, ui);

       if (!dwfl_frame_is_activation(state)) {
           --pc; 
       }


Thanks
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-02 11:53 How to debug broken unwinding? Milian Wolff
  2017-06-02 15:03 ` Milian Wolff
  2017-06-07 12:16 ` Milian Wolff
@ 2017-06-13 16:17 ` Mark Wielaard
  2017-06-14 13:12   ` Milian Wolff
  2 siblings, 1 reply; 8+ messages in thread
From: Mark Wielaard @ 2017-06-13 16:17 UTC (permalink / raw)
  To: Milian Wolff; +Cc: elfutils-devel

Hi Milian,

On Thu, 2017-06-01 at 22:57 +0200, Milian Wolff wrote:
> How should I debug this, or how can I report a good bug report for this? I 
> guess I could upload a perf archive and document the steps required to build 
> perf with libdw as the unwinder, as it allows to easily compare libunwind and 
> libdw for unwinding.

I am reading the rest of you messages in this thread now and it seems
you did find the issue already. But the above would still be useful if
you happen to have the instructions already written down. I was actually
looking at perf recently and couldn't immediately figure out how to
build it. It seems it depends on a whole kernel build. But if there is
an easy way to just build perf (with a variant that uses libunwind and
libdw) that would be useful in general I think.

More in general debugging the unwinder is indeed too hard. There are
some failures that are ignored because it isn't immediately clear
whether they are fatal. e.g. you might be unable to recover some
register value that then isn't ever used for a simple backtrace, or you
might not find the CFI for a pc address, but want to use the fallback
unwinder in that case. Maybe we should add a debug mode configuration
that compiles in tracing output of the unwinder operations?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-13 16:17 ` Mark Wielaard
@ 2017-06-14 13:12   ` Milian Wolff
  0 siblings, 0 replies; 8+ messages in thread
From: Milian Wolff @ 2017-06-14 13:12 UTC (permalink / raw)
  To: elfutils-devel; +Cc: Mark Wielaard

On Dienstag, 13. Juni 2017 16:20:25 CEST Mark Wielaard wrote:
> Hi Milian,
> 
> On Thu, 2017-06-01 at 22:57 +0200, Milian Wolff wrote:
> > How should I debug this, or how can I report a good bug report for this? I
> > guess I could upload a perf archive and document the steps required to
> > build perf with libdw as the unwinder, as it allows to easily compare
> > libunwind and libdw for unwinding.
> 
> I am reading the rest of you messages in this thread now and it seems
> you did find the issue already. But the above would still be useful if
> you happen to have the instructions already written down. I was actually
> looking at perf recently and couldn't immediately figure out how to
> build it. It seems it depends on a whole kernel build. But if there is
> an easy way to just build perf (with a variant that uses libunwind and
> libdw) that would be useful in general I think.

Perf is, you could say, a two-fold software.

On one hand you have the kernel side of things, which does the PMU 
configuration and actual sampling and data recording.

One the other hand you have a bunch of userspace tooling that tells the kernel 
side to start recording, but - more importantly - to analyse the recorded 
data.

The first part in the kernel is pretty stable - no need to compile your own 
kernel. Esp. on a recent distro this is not required.

For the second side, compiling perf's userland is pretty easy to get the 
latest and greatest features. And it also allows us to compare libunwind vs. 
elfutils:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux \
    -b perf/core
cd linux/tools/perf
make # will output what dependencies where found
cp perf perf-libunwind
make NO_LIBUNWIND=1
cp perf perf-elfutils

# record data with either version
./perf record --call-graph dwarf <something>

Then you can use different "perf" versions to compare the unwinders. I usually 
diff the output of `perf script` or `perf report`, e.g.:

./perf-libunwind stat --inline > libunwind.txt
./perf-elfutils stat --inline > elfutils.txt
diff -u libunwind.txt elfutils.txt

Hope this helps!

> More in general debugging the unwinder is indeed too hard. There are
> some failures that are ignored because it isn't immediately clear
> whether they are fatal. e.g. you might be unable to recover some
> register value that then isn't ever used for a simple backtrace, or you
> might not find the CFI for a pc address, but want to use the fallback
> unwinder in that case. Maybe we should add a debug mode configuration
> that compiles in tracing output of the unwinder operations?

That would be helpful, I guess. Most notably for laymen like me who still have 
trouble understanding what exactly is going on here. If there would be some 
magic flag I could use (like libunwind's UNW_DEBUG_LEVEL=15) to generate a 
trace I could append to a bug report, that would be good to have!

Bye
-- 
Milian Wolff
mail@milianw.de
http://milianw.de

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-07 18:41   ` Milian Wolff
@ 2017-06-14 18:27     ` Mark Wielaard
  2017-06-15 17:45       ` Mark Wielaard
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Wielaard @ 2017-06-14 18:27 UTC (permalink / raw)
  To: Milian Wolff; +Cc: elfutils-devel

Hi Milian,

On Fri, 2017-06-02 at 17:03 +0200, Milian Wolff wrote:
> Some more debugging and going after my gut feeling brings me to the following 
> conclusion: The real issue seems to be the on-demand reporting of the elf 
> file. We used to do:
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
> This looks safe and fine and actually works most of the time. But passing a 
> non-null isactivation flag to dwfl_frame_pc potentially leads to a second 
> unwind step, before we got the change to report the module! I can workaround 
> this by instead doing
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc, NULL)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
> This fixes all the issues in my original email. So sorry for the noise - it 
> doesn't see as if the unwinding code in elfutils is broken - quite the 
> contrary! It's just our misuse of the API that is to blame.

That is some very nice detective work.
The isactivation code is tricky. The logic is:

isactivation = (frame is initial_frame
                || frame is signal_frame
		|| unwind_frame is signal_frame)

The idea being that normally a frame isn't an activation.
So the given pc is really the return address, not the call address.
But if it is the initial frame or a frame caused by a signal, then the
pc really is where the code was when this frame was active.

The tricky case is that last case. If the previous frame is a signal
frame then this frame is also an activation (so you don't need to
subtract 1 to get the actual address that was executing). I am not sure
I fully understand anymore why that is. This must be a frame that is
itself not an initial or signal frame, but that is called (activated)
from a signal frame.

But assuming that is the correct semantics then your original
observation seems to explain what is going on. The code doesn't account
for the fact that during an unwind other modules can be reported that
might make it able to unwind the current frame. And it caches the
failure state of the unwind of the current frame, so it doesn't retry
when asked.

I think we could reset the state of the current frame if any more
modules are reported. But is it possible for your parser or perf to
report modules as soon as it knows about them, not lazily during an
unwind? Is reporting lazily at the last possible moment an optimization
that really helps? I would assume that you end up reporting all modules
anyway. So it actually seems easier to report everything upfront.

> May I suggest though to move the isactivation code into a separate function to 
> prevent this issue from arising in the future? I.e. it would be nice if the 
> code above could read:
> 
> 
>        Dwarf_Addr pc;
>        bool isactivation;
>  
>        if (!dwfl_frame_pc(state, &pc)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
> 
>        // report the module before we query for isactivation
>        report_module(pc, ui);
> 
>        if (!dwfl_frame_is_activation(state)) {
>            --pc; 
>        }

That is a good suggestion in any case. Maybe also introduce
dwfl_frame_is_signal_frame () and dwfl_frame_is_initial_frame () in case
the user wants to know the exact details.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to debug broken unwinding?
  2017-06-14 18:27     ` Mark Wielaard
@ 2017-06-15 17:45       ` Mark Wielaard
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Wielaard @ 2017-06-15 17:45 UTC (permalink / raw)
  To: Milian Wolff; +Cc: elfutils-devel

On Tue, 2017-06-13 at 18:17 +0200, Mark Wielaard wrote:
> The tricky case is that last case. If the previous frame is a signal
> frame then this frame is also an activation (so you don't need to
> subtract 1 to get the actual address that was executing). I am not sure
> I fully understand anymore why that is. This must be a frame that is
> itself not an initial or signal frame, but that is called (activated)
> from a signal frame.
> 
> But assuming that is the correct semantics then your original
> observation seems to explain what is going on. The code doesn't account
> for the fact that during an unwind other modules can be reported that
> might make it able to unwind the current frame. And it caches the
> failure state of the unwind of the current frame, so it doesn't retry
> when asked.
> 
> I think we could reset the state of the current frame if any more
> modules are reported. But is it possible for your parser or perf to
> report modules as soon as it knows about them, not lazily during an
> unwind? Is reporting lazily at the last possible moment an optimization
> that really helps? I would assume that you end up reporting all modules
> anyway. So it actually seems easier to report everything upfront.

The following should reset the state of the current thread frame if a
module is added during unwinding. I haven't replicated your issue yet to
verify. And adding modules outside of dwfl_report_begin/end during
unwind wasn't really intended to work. But since there is code out there
doing it anyway maybe we should try to support it. Does the following
help your case?

diff --git a/libdwfl/dwfl_module.c b/libdwfl/dwfl_module.c
index 510bd69..dc81538 100644
--- a/libdwfl/dwfl_module.c
+++ b/libdwfl/dwfl_module.c
@@ -156,6 +156,9 @@ use (Dwfl_Module *mod, Dwfl_Module **tailp, Dwfl *dwfl)
       dwfl->lookup_module = NULL;
     }
 
+  /* Note that the module list changed.  */
+  dwfl->dwfl_module_state++;
+
   return mod;
 }
 
diff --git a/libdwfl/frame_unwind.c b/libdwfl/frame_unwind.c
index 4dc9c43..9b2c2d4 100644
--- a/libdwfl/frame_unwind.c
+++ b/libdwfl/frame_unwind.c
@@ -520,6 +520,7 @@ new_unwound (Dwfl_Frame *state)
   unwound = malloc (sizeof (*unwound) + sizeof (*unwound->regs) * nregs);
   if (unlikely (unwound == NULL))
     return NULL;
+  thread->dwfl_module_state = process->dwfl->dwfl_module_state;
   state->unwound = unwound;
   unwound->thread = thread;
   unwound->unwound = NULL;
@@ -700,7 +701,22 @@ internal_function
 __libdwfl_frame_unwind (Dwfl_Frame *state)
 {
   if (state->unwound)
-    return;
+    {
+      /* If it wasn't an error, then go for it.  */
+      if (state->unwound->pc_state != DWFL_FRAME_STATE_ERROR)
+	return;
+
+      /* If it was an error and the dwfl module list is the same, too bad.  */
+      uint64_t t_state, d_state;
+      t_state = state->thread->dwfl_module_state;
+      d_state = state->thread->process->dwfl->dwfl_module_state;
+      if (t_state == d_state)
+	return;
+
+      /* OK, lets try again. */
+      free (state->unwound);
+      state->unwound = NULL;
+    }
   /* Do not ask dwfl_frame_pc for ISACTIVATION, it would try to unwind STATE
      which would deadlock us.  */
   Dwarf_Addr pc;
diff --git a/libdwfl/libdwflP.h b/libdwfl/libdwflP.h
index 7d5f795..9d38373 100644
--- a/libdwfl/libdwflP.h
+++ b/libdwfl/libdwflP.h
@@ -131,6 +131,11 @@ struct Dwfl
   Dwfl_Module **lookup_module;	/* Module associated with segment, or null.  */
   int *lookup_segndx;		/* User segment index, or -1.  */
 
+  /* State (counter) of Dwfl module list.  Used to check by Dwfl_Thread
+     when updating Dwfl_Frame state if there was a previous error that
+     might be resolved when using an updated Dwfl module list.  */
+  uint64_t dwfl_module_state;
+
   /* Cache from last dwfl_report_segment call.  */
   const void *lookup_tail_ident;
   GElf_Off lookup_tail_vaddr;
@@ -244,6 +249,10 @@ struct Dwfl_Thread
      Later the processed frames get freed and this pointer is updated.  */
   Dwfl_Frame *unwound;
   void *callbacks_arg;
+  /* State of the Dwfl Modules last time an unwind wof the Dwfl_Frame
+     was attempted that resulted in an error.  Indicates it can be
+     retried if it is different from the state of the Dwfl.  */
+  uint64_t dwfl_module_state;
 };
 
 /* See its typedef in libdwfl.h.  */


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-14 18:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-02 11:53 How to debug broken unwinding? Milian Wolff
2017-06-02 15:03 ` Milian Wolff
2017-06-07 12:16 ` Milian Wolff
2017-06-07 18:41   ` Milian Wolff
2017-06-14 18:27     ` Mark Wielaard
2017-06-15 17:45       ` Mark Wielaard
2017-06-13 16:17 ` Mark Wielaard
2017-06-14 13:12   ` Milian Wolff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).