full dwarf backtracing kernel to user

public inbox for systemtap@sourceware.org

* full dwarf backtracing kernel to user
@ 2011-07-11 11:35 Mark Wielaard
  2011-07-22 15:52 ` Mark Wielaard
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Mark Wielaard @ 2011-07-11 11:35 UTC
  To: systemtap

Hi,

I have been cleaning up the dwarf unwinder a bit, and after some small
fixes (all in git trunk now) it is now finally possible to unwind fully
from kernel space right into user space. This provides better user
backtraces when a probe point triggers in kernel space. Up till now,
when we wanted a user backtrace when the probe point was in kernel space
we took the set of registers saved by the kernel and erased all others
because they might have been changed by the kernel (and to not leak
possible private data). This often worked, but as soon as the unwinder
needed one of the registers possibly "trashed" and not saved by the
kernel we would be stuck. With the new setup it is now possible to get a
full user space register set to start the user space dwarf unwinder and
always do a full unwind. One example is setting a probe point on
syscall.close and then doing a print_ubacktrace(). Previously (on
x86_64), at least early on in a glibc wrapper, we would get stuck because
we needed access to an unsaved register. With "full" dwarf unwinding we
can just continue unwinding into the user space application.

There are a couple of ways to take advantage of this. Currently I just
have a hack in runtime/stack-x86_64.c: when a kernel backtrace finishes,
it checks that UNW_PC(info) == task_pt_regs(current)->ip, sets
info->regs.sp = task_pt_regs(current)->sp, and then continues unwinding
the user backtrace.
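
A minimal sketch of that handoff check, written as a user-space model
rather than the actual runtime code. The struct and function names
(regs_model, frame_info_model, handoff_to_user) are hypothetical
stand-ins for struct pt_regs, struct unwind_frame_info and the logic in
runtime/stack-x86_64.c; only the comparison of the unwound PC with the
saved user PC and the copy of the saved SP come from the description
above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved register state (struct pt_regs). */
struct regs_model {
    uint64_t ip;  /* program counter */
    uint64_t sp;  /* stack pointer */
};

/* Hypothetical stand-in for the unwinder state (struct unwind_frame_info). */
struct frame_info_model {
    struct regs_model regs;  /* state reached by the kernel unwind */
};

/*
 * Called once the kernel unwind has run out of kernel frames.
 * If the unwound PC equals the user PC saved at kernel entry, adopt the
 * saved user SP and report that unwinding may continue in user space.
 * Otherwise leave the state untouched and report failure.
 */
bool handoff_to_user(struct frame_info_model *info,
                     const struct regs_model *saved_user)
{
    if (info->regs.ip != saved_user->ip)
        return false;  /* kernel unwind did not land on the user PC */
    info->regs.sp = saved_user->sp;
    return true;
}
```

The caller would run the user space unwinder on info only when this
returns true.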

My thinking is that the (kernel) backtrace related functions shouldn't
change at all (except to make sure that non-x86 arches also use the
dwarf unwinder, that is on my list next). All ubacktrace related
functions should check whether they are called from a user context, in
which case we already have a full register set. Otherwise they should do
a kernel unwind, without setting/printing anything except recording the
final register set, and do the ubacktrace using that set. This is
roughly equal to the sanitize logic in arch_unw_init_frame_info. We
should also introduce a full_backtrace function that gets/prints a full
kernel&user backtrace in one go (which people should use instead of a
print_backtrace(); print_ubacktrace() to save some work in the probe).
That last one should NOT be marked unprivileged since we don't do that
for kernel backtraces either.

One "risk" of doing a kernel-to-user unwind for a ubacktrace is that
it is slightly more work. But backtraces are already a lot of work, the
kernel portion often is not deep, and the recent cleanups made things
slightly faster. Detecting that we need a certain register in a
ubacktrace, backtracking, doing a kernel unwind anyway, and then redoing
the user backtrace is an alternative that is a little too tricky in my
mind, and might actually lead to more work inside the probe. The other
risk is that somehow bad kernel unwind data gets used and through the
backtrace some private register value gets exposed to unprivileged
users. I think we should be able to trust the kernel unwind data. And
that risk seems small since the register value then also needs to be
somehow expressed in the final PC value that the user gets access to.
Opinions?

All but two of the [u]backtrace related tapset functions have been
marked EXPERIMENTAL, so IMHO we can change them to behave slightly
differently from what they do currently. The two exceptions follow.

There is one function, task_backtrace(), that I am unsure about. Given a
pointer (long) to a task struct it produces a hex backtrace of an
arbitrary task. It doesn't really fit with the rest of the backtrace
functions, which all act on the current probe context. Sadly this
function is not marked EXPERIMENTAL and people might already rely on it.

There is also print_stack(), also not marked EXPERIMENTAL, which really
does the same as sprint_stack(), turning a hex string list into a symbol
resolved kernel backtrace. They just differ slightly in how they print
out stuff.

There is also the print_ubacktrace_brief () function, which does print
things slightly differently from normal, and surprisingly doesn't have
any sprint counterpart, so cannot be used in tapsets splitting
ubacktrace collecting (space separated hex string list) and then
printing (using the sprint functions).

A lot of code actually deals with the various formats. I am unsure if we
really want to maintain that. Opinions?

While we are looking at all this, are there any opinions on the whole
split of collection versus printing? That is, using [u]backtrace() to
get a space separated list of addresses, and then at a later time using
sprint_[u]backtrace() to get a string representing the symbol values
associated with those addresses. The problem with that whole approach is
that there are no checks at all whether those address strings actually
correspond to the current task. I haven't come up with a better
representation. It would help if we had some kind of "stack type",
because using strings to pass these things around and then having to
parse them is somewhat awkward. Ideas?
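
To make the "stack type" idea concrete, here is a rough sketch of what
such a value could carry, as a plain C model. Everything in it is an
assumption for illustration: stack_model, stack_from_string and
stack_matches_task are hypothetical names, not existing runtime or
tapset functions, and tagging the stack with a pid is just one possible
way to check that the addresses belong to the current task.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_FRAMES 32

/* Hypothetical "stack type": the addresses plus the task they came from. */
struct stack_model {
    int pid;                    /* task the backtrace was collected in */
    size_t depth;               /* number of valid entries in addr[] */
    uint64_t addr[MAX_FRAMES];
};

/*
 * Parse a space separated hex string such as "0x4004f0 0x7faa1e35a670"
 * into a stack_model tagged with pid. Returns the number of addresses
 * stored (input beyond MAX_FRAMES is dropped), or -1 on malformed input.
 */
int stack_from_string(const char *s, int pid, struct stack_model *st)
{
    st->pid = pid;
    st->depth = 0;
    while (*s != '\0') {
        char *end;
        unsigned long long v;

        while (*s == ' ')
            s++;                            /* skip separators */
        if (*s == '\0' || st->depth == MAX_FRAMES)
            break;
        v = strtoull(s, &end, 16);          /* base 16 accepts an optional 0x */
        if (end == s || (*end != ' ' && *end != '\0'))
            return -1;                      /* not a clean hex token */
        st->addr[st->depth++] = (uint64_t)v;
        s = end;
    }
    return (int)st->depth;
}

/* Symbol resolution would only be attempted when the task matches. */
int stack_matches_task(const struct stack_model *st, int current_pid)
{
    return st->pid == current_pid;
}
```

With something like this the printing side could refuse, or at least
flag, a backtrace collected in a different task instead of silently
resolving the addresses against the wrong address space.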

BTW. While hacking on this I hit a kernel crash a couple of times. After
some investigation it became clear that it is a bug in the transport
layer that would react badly to the buffers being full. Apparently this
has always been there, but since the backtrace functions create more
output than usual it triggers more often. I am working on a fix:
http://sourceware.org/bugzilla/show_bug.cgi?id=12960
The bug report contains a simple workaround (increase the buffer size to
something huge) that makes the issue almost never trigger for me. But it
can still happen (and has), so for a real fix I will remove the
usleep() and add separate buffers for control messages that must never
be lost.
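
The idea of keeping control messages in their own buffers can be
illustrated with a small model. This is only a sketch under assumed
names (transport_model, pool_model, transport_send are hypothetical);
it does not reflect the actual transport layer code or the eventual
fix, it just shows why a flood of data messages can no longer crowd out
a control message once the two draw from separate pools.

```c
#include <stdbool.h>
#include <stddef.h>

enum msg_kind { MSG_DATA, MSG_CONTROL };

/* Hypothetical fixed-size pool of message buffers. */
struct pool_model {
    size_t used;
    size_t capacity;
};

/* Hypothetical transport with one pool per message kind. */
struct transport_model {
    struct pool_model data;     /* probe output, may be dropped when full */
    struct pool_model control;  /* control messages, reserved separately */
};

/* Take one buffer from the pool; fail instead of sleeping when it is full. */
static bool pool_reserve(struct pool_model *p)
{
    if (p->used == p->capacity)
        return false;
    p->used++;
    return true;
}

/* Returns false when the message had to be dropped. */
bool transport_send(struct transport_model *t, enum msg_kind kind)
{
    return pool_reserve(kind == MSG_CONTROL ? &t->control : &t->data);
}
```

In this model a full data pool only causes data messages to be dropped;
the control pool still has to be sized so that it never fills up itself.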

Cheers,

Mark


* Re: full dwarf backtracing kernel to user
  2011-07-11 11:35 full dwarf backtracing kernel to user Mark Wielaard
@ 2011-07-22 15:52 ` Mark Wielaard
  2011-08-01 22:16 ` amouehsan
  2011-09-26 11:10 ` Mark Wielaard
  2 siblings, 0 replies; 6+ messages in thread
From: Mark Wielaard @ 2011-07-22 15:52 UTC
  To: systemtap

On Mon, 2011-07-11 at 13:35 +0200, Mark Wielaard wrote:
> There is one function, task_backtrace() function, which given a pointer
> (long) to a task struct produces a hex backtrace of an arbitrary task,
> that I am unsure about. Because it doesn't really fit with the rest of
> the backtrace functions, which all act on the current probe context.
> Sadly this function is not marked EXPERIMENTAL and people might already
> rely on it.

We discussed it on irc and agreed it could be deprecated because nothing
really uses it (except one very small example dumpstack.stp that
basically does nothing else than just call this function).

I have to admit I didn't know we had a deprecation mechanism, so if you
also didn't know how to do this, then look at the attached patch (it is
really easy!). Now if you run stap with --check-version and you use a
deprecated function like this it will say:

WARNING: This function uses tapset constructs that are dependent on
systemtap version: identifier 'task_backtrace'
at /usr/local/install/systemtap/share/systemtap/tapset/context-unwind.stp:91:10

After 1.6 is released I'll remove this.

> There is also print_stack(), also not marked EXPERIMENTAL, which really
> does the same as sprint_stack(), turning a hex string list into a symbol
> resolved kernel backtrace. They just differ slightly in how they print
> out stuff.

I didn't deprecate this one because it isn't such a big deal to keep
supporting it. But we could if people think it would make the provided
tapsets more consistent.

> There is also the print_ubacktrace_brief () function, which does print
> things slightly differently from normal, and surprisingly doesn't have
> any sprint counterpart, so cannot be used in tapsets splitting
> ubacktrace collecting (space separated hex string list) and then
> printing (using the sprint functions).

Likewise kept for now.

> A lot of code actually deals with the various formats. I am unsure if we
> really want to maintain that. Opinions?

Cheers,

Mark

commit 20ab10df946293a9b9d403c6ac17f9b6a4351f0d
Author: Mark Wielaard <mjw@redhat.com>
Date:   Fri Jul 22 17:46:10 2011 +0200

    Deprecated task_backtrace:string (task:long).
    
    This function will go away after 1.6.

diff --git a/NEWS b/NEWS
index 4122692..35bc4c9 100644
--- a/NEWS
+++ b/NEWS
@@ -41,6 +41,9 @@
 
 - Depends on elfutils 0.142+.
 
+- Deprecated task_backtrace:string (task:long). This function will go
+  away after 1.6. Please run your scripts with stap --check-version.
+
 * What's new in version 1.5, 2011-05-23
 
 - The compile server and its related tools (stap-gen-ert, stap-authorize-cert,
diff --git a/tapset/context-unwind.stp b/tapset/context-unwind.stp
index 0b42201..656e01c 100644
--- a/tapset/context-unwind.stp
+++ b/tapset/context-unwind.stp
@@ -87,11 +87,13 @@ function backtrace:string () %{ /* pure */
  *  that are a backtrace of the stack of a particular task
  *  Output may be truncated as per maximum string length.
  */
+%( systemtap_v <= "1.6" %?
 function task_backtrace:string (task:long) %{ /* pure */
         _stp_stack_snprint_tsk(THIS->__retvalue, MAXSTRINGLEN,
                                (struct task_struct *)(unsigned long)THIS->task,
                                _STP_SYM_NONE, MAXTRACE);
 %}
+%: %)
 
 /**
  *  sfunction caller - Return name and address of calling function


* Re: full dwarf backtracing kernel to user
  2011-07-11 11:35 full dwarf backtracing kernel to user Mark Wielaard
  2011-07-22 15:52 ` Mark Wielaard
@ 2011-08-01 22:16 ` amouehsan
  2011-08-02  8:24   ` Mark Wielaard
  2011-09-26 11:10 ` Mark Wielaard
  2 siblings, 1 reply; 6+ messages in thread
From: amouehsan @ 2011-08-01 22:16 UTC
  To: systemtap


Hi, 
I am trying systemtap and when probing a kernel function ("bio_endio") I
need to get the user backtrace. When I use 
probe kernel.function("bio_endio@fs/bio.c:1443").call 
{
   print_ubacktrace();
}
I get the following message. 
<no ubacktrace: kernel.function("bio_endio@fs/bio.c:1443").call>

I tried using --ldd but it didn't help. By what you have written:

Mark Wielaard-4 wrote:
> 
> I have been cleaning up the dwarf unwinder a bit, and after some small
> fixes (all in git trunk now) it is now finally possible to unwind fully
> from kernel space right into user space. This provides better user
> backtraces when a probe point triggered in kernel space. 
> 
>  With the new setup it is now possible to get a 
> full user space register set to start the user space dwarf unwinder and
> always do a full unwind. One example is setting a probe point on
> syscall.close and then doing a print_ubacktrace().
> 
I understand the feature I wanted to use was not implemented before. Is my
understanding right? If not, how can I do that? If so, how can I enable this
new feature? Should I get the new code and recompile systemtap?

Thanks,



* Re: full dwarf backtracing kernel to user
  2011-08-01 22:16 ` amouehsan
@ 2011-08-02  8:24   ` Mark Wielaard
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Wielaard @ 2011-08-02  8:24 UTC
  To: amouehsan; +Cc: systemtap

On Mon, 2011-08-01 at 15:15 -0700, amouehsan wrote:
> I am trying systemtap and when probing a kernel function ("bio_endio") I
> need to get the user backtrace. When I use 
> probe kernel.function("bio_endio@fs/bio.c:1443").call 
> {
>    print_ubacktrace();
> }
> I get the following message. 
> <no ubacktrace: kernel.function("bio_endio@fs/bio.c:1443").call>
> 
> I tried using --ldd but it didn't help.

It might be that stap is already behaving correctly. Your probe point
may have triggered entirely in a kernel context, without any user space
task behind it. You can check by doing:

stap -e 'probe kernel.function("bio_endio@fs/bio.c:1443").call
{ printf("%d %s\n", pid(), execname()); print_ubacktrace(); }'

For example for me this gives output like:

0 swapper
<no ubacktrace: kernel.function("bio_endio@fs/bio.c:1442").call>
3373 firefox-bin
 0x7faa1e35a670

So, the first does indeed not have any user backtrace associated with
it. The second does have one, but since I didn't provide -d firefox-bin
--ldd stap couldn't unwind past the initial user space address.

>  By what you have written:
> 
> Mark Wielaard-4 wrote:
> > 
> > I have been cleaning up the dwarf unwinder a bit, and after some small
> > fixes (all in git trunk now) it is now finally possible to unwind fully
> > from kernel space right into user space. This provides better user
> > backtraces when a probe point triggered in kernel space. 
> > 
> >  With the new setup it is now possible to get a 
> > full user space register set to start the user space dwarf unwinder and
> > always do a full unwind. One example is setting a probe point on
> > syscall.close and then doing a print_ubacktrace().
> > 
> I understand the feature I wanted to use was not implemented before. Is my
> understanding right? If no, how can I do that? If yes, how I can have this
> new feature enabled? Should I get a new code and re-compile systemtap?

You are partly right. What I am working on will provide us with more
places where we will be able to grab a user space register context and
so a usable user backtrace. The work isn't finished yet though. I try
pushing early and often to systemtap git, so if you follow that and
build from source then you can see the progress.
http://sourceware.org/systemtap/getinvolved.html

Cheers,

Mark


* Re: full dwarf backtracing kernel to user
  2011-07-11 11:35 full dwarf backtracing kernel to user Mark Wielaard
  2011-07-22 15:52 ` Mark Wielaard
  2011-08-01 22:16 ` amouehsan
@ 2011-09-26 11:10 ` Mark Wielaard
  2011-09-27 15:15   ` Mark Wielaard
  2 siblings, 1 reply; 6+ messages in thread
From: Mark Wielaard @ 2011-09-26 11:10 UTC
  To: systemtap

On Mon, 2011-07-11 at 13:35 +0200, Mark Wielaard wrote:
> I have been cleaning up the dwarf unwinder a bit, and after some small
> fixes (all in git trunk now) it is now finally possible to unwind fully
> from kernel space right into user space. This provides better user
> backtraces when a probe point triggered in kernel space. Up till now,
> when we wanted a user backtrace when the probe point was in kernel space
> we took the set of registers saved by the kernel and erased all others
> because they might have been changed by the kernel (and to not leak
> possible private data). This often worked, but as soon as the unwinder
> needed one of the registers possibly "trashed" and not saved by the
> kernel we would be stuck. With the new setup it is now possible to get a
> full user space register set to start the user space dwarf unwinder and
> always do a full unwind. One example is setting a probe point on
> syscall.close and then doing a print_ubacktrace(). Previously (on
> x86_64) at least early on in a glibc wrapper we would get stuck because
> we needed access to an unsaved register, with "full" dwarf unwinding we
> can just continue unwinding into the user space application.
> [...]
> My thinking is that the (kernel) backtrace related functions shouldn't
> change at all (except to make sure that non-x86 arches also use the
> dwarf unwinder, that is on my list next). All ubacktrace related
> functions should check whether they are called from a user context, in
> which case we already have a full register set, otherwise it should do a
> kernel unwind, without setting/printing anything except recording the
> final register set, and do the ubacktrace using that set.

This has finally been implemented. A vacation came in between and there
was more runtime cleanup necessary than anticipated. Some lingering bugs
were found and fixed. But now this final commit implements the above step:

commit eacd41d38bfec95e39f32d399ecc4e4f98eafe3d
Author: Mark Wielaard <mjw@redhat.com>
Date:   Fri Sep 23 13:34:37 2011 +0200

  stack.c (_stp_get_uregs): Recover user registers from kernel context.

  When possible recover full pt_regs user register set by unwinding
  using kernel context till we hit user space.

Please let me know if you have any questions about this commit or any of
the runtime changes done earlier to support this functionality.

There is one annoying bug/regression left: PR13210 "vma/vdso tracking is
broken". This makes things much less useful since now we cannot unwind
through the vdso, which is always used on i386 at least. I don't know
when this regression was introduced. But I did add a testcase for it,
vma_vdso.exp, which checks a couple of ways of calling into the kernel
and makes sure that we always know the "originating" vma. Currently
this fails for the vdso vma. I am looking into this and it seems this is
only because our build-id checking is not working correctly, since the
vdso vma is detected, but then rejected.

> We should also introduce a full_backtrace function that gets/prints a full
> kernel&user backtrace in one go (which people should use instead of a
> print_backtrace(); print_ubacktrace() to save some work in the probe).
> That last one should NOT be marked unprivileged since we don't do that
> for kernel backtraces either.

I am not planning to add this anymore. Instead I'll introduce an
optimization for a [print_]backtrace() and [print_]ubacktrace() done in
the same probe.

Then I'll try and hook up some non-x86 platforms, since all of the above
only works when the DWARF unwinder is used.

Cheers,

Mark


* Re: full dwarf backtracing kernel to user
  2011-09-26 11:10 ` Mark Wielaard
@ 2011-09-27 15:15   ` Mark Wielaard
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Wielaard @ 2011-09-27 15:15 UTC
  To: systemtap

On Mon, 2011-09-26 at 13:09 +0200, Mark Wielaard wrote:
> There is one annoying bug/regression left. PR13210 "vma/vdso tracking is
> broken". This makes things much less useful since now we cannot unwind
> through the vdso, which is always used on i386 at least. I don't know
> when this regression was introduced. But I did add a testcase for it
> vma_vdso.exp, which checks a couple of ways of calling into the kernel
> and making sure the we always knows the "originating" vma. Currently
> this fails for the vdso vma. I am looking into this and it seems this is
> only because our build-id checking is not working correctly, since the
> vdso vma is detected, but then rejected.
> 
> > We should also introduce a full_backtrace function that gets/prints a full
> > kernel&user backtrace in one go (which people should use instead of a
> > print_backtrace(); print_ubacktrace() to save some work in the probe).
> > That last one should NOT be marked unprivileged since we don't do that
> > for kernel backtraces either.
> 
> I am not planning to add this anymore. Instead I'll introduce an
> optimization for a [print_]backtrace() and [print_]ubacktrace() done in
> the same probe.

Both these issues have been addressed now in git trunk.
So on i686 and x86_64 you should now almost always be able to get a
kernel and/or user backtrace from basically any probe point.

In the kernel case it will ultimately fall back to using the kernel
dump_trace() function if available (currently only x86 seems to have
this), which is the case for tracepoint probes for example. In the user
case it will use the (sanitized) results of _stp_current_pt_regs() if
kernel-to-user dwarf unwinding fails (possibly having a somewhat reduced
backtrace because of missing registers).
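
The order of preference for the user case can be sketched as follows.
This is a simplified model, not the code in stack.c: uregs_model,
uregs_source and choose_user_regs are hypothetical names, and the three
inputs stand for the registers at the probe point, the result of the
silent kernel-to-user dwarf unwind (NULL when that unwind failed), and
the sanitized register set described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a user register set (struct pt_regs). */
struct uregs_model {
    uint64_t ip;
    uint64_t sp;
};

enum uregs_source {
    UREGS_FROM_PROBE,   /* probe fired in user context, registers are full */
    UREGS_FROM_UNWIND,  /* recovered by unwinding the kernel stack */
    UREGS_SANITIZED     /* partial set saved at kernel entry (fallback) */
};

/*
 * Pick the register set to start the user space unwinder with.
 * unwound is NULL when the kernel-to-user dwarf unwind failed.
 * The chosen set is stored in *chosen and its origin is returned.
 */
enum uregs_source choose_user_regs(bool probe_in_user_mode,
                                   const struct uregs_model *probe_regs,
                                   const struct uregs_model *unwound,
                                   const struct uregs_model *sanitized,
                                   const struct uregs_model **chosen)
{
    if (probe_in_user_mode) {
        *chosen = probe_regs;
        return UREGS_FROM_PROBE;
    }
    if (unwound != NULL) {
        *chosen = unwound;
        return UREGS_FROM_UNWIND;
    }
    *chosen = sanitized;  /* backtrace may be shorter: registers missing */
    return UREGS_SANITIZED;
}
```

Only the last case can produce the reduced backtrace mentioned above,
because the sanitized set lacks the registers the kernel did not save.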

Please let me know if you play with this on x86 and are not getting the
backtraces you expect. I'll now try to get the dwarf unwinder and some
of the fallback code to work for non-x86 arches. Any hints and tips are
appreciated of course.

Cheers,

Mark
