user backtrace from kernel context status


* user backtrace from kernel context status
@ 2010-07-07 18:03 Mark Wielaard
  2010-07-09 18:12 ` Mark Wielaard
  0 siblings, 1 reply; 2+ messages in thread
From: Mark Wielaard @ 2010-07-07 18:03 UTC (permalink / raw)
  To: systemtap

Hi,

Here is a status update on our ability to produce user backtraces from
kernel space context. It is now sometimes possible to get a (partial)
user space backtrace on the architectures that use the dwarf unwinder
(i686 and x86_64). There are some limitations though.

Some examples:

Syscalls.

$ stap -d /bin/ls --ldd -e 'probe syscall.getdents
  { log(pn()); print_ubacktrace(); }' -c /bin/ls
syscall.getdents
 0x000000384f0a2f65 : __getdents+0x15/0x90 [libc-2.12.so]
 0x000000384f0a2962 : readdir64+0x82/0xdf [libc-2.12.so]
 0x0000000000407f1f : print_dir+0x1df/0x6f0 [ls]
 0x000000000040898d : main+0x55d/0x1900 [ls]
 0x000000384f01ec5d : __libc_start_main+0xfd/0x1d0 [libc-2.12.so]
 0x0000000000402799 : _start+0x29/0x2c [ls]

This example works for x86_64, but not for i686 because we don't track
the vdso yet (PR10080).
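
For readers new to this output, each frame line has the form
"address : symbol+offset/size [module]". The following is only a small
sketch of how such a line could be assembled, not the code the
SystemTap runtime uses. struct sym_info and format_frame() are
hypothetical names, and the symbol start address in the usage note is
simply the printed address minus the printed offset.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Hypothetical symbol record; not a SystemTap runtime type. */
struct sym_info {
    const char *name;    /* symbol name, e.g. "__getdents" */
    const char *module;  /* module or shared library name */
    uint64_t start;      /* address of the first byte of the symbol */
    uint64_t size;       /* symbol size in bytes */
};

/* Format one frame as " 0xADDR : name+0xOFF/0xSIZE [module]".
 * The offset is the distance from the symbol start to addr.
 * Returns what snprintf returns (the untruncated length). */
static int format_frame(char *buf, size_t len, uint64_t addr,
                        const struct sym_info *sym)
{
    return snprintf(buf, len,
                    " 0x%016" PRIx64 " : %s+0x%" PRIx64 "/0x%" PRIx64 " [%s]",
                    addr, sym->name, addr - sym->start, sym->size,
                    sym->module);
}
```

Called with address 0x384f0a2f65 and a symbol "__getdents" starting at
0x384f0a2f50 with size 0x90 in "libc-2.12.so", this reproduces the first
frame line of the syscall example.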

Timers.

$ stap -d /bin/sort --ldd -e 'probe timer.profile
  { if (execname() == "sort")
    { log(pn()); print_ubacktrace(); } }' \
  -c '/bin/sort /usr/share/dict/words > /dev/null'

timer.profile
 0x00913b18 : strcoll_l+0x158/0xeb0 [libc-2.12.so]
 0x0090f3f1 : strcoll+0x31/0x40 [libc-2.12.so]
 0x080568c3 : memcoll+0x73/0x150 [sort]
 0x08053d0b : xmemcoll+0x3b/0x150 [sort]
 0x0804a60b : compare+0xeb/0xf0 [sort]
 0x0804cb20 : sortlines+0xb0/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804c944 : sortlines_temp+0x54/0x180 [sort]
 0x0804c92e : sortlines_temp+0x3e/0x180 [sort]
 0x0804cac8 : sortlines+0x58/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804caae : sortlines+0x3e/0x1a0 [sort]
 0x0804fcba : .L799+0xe9f/0x1a55 [sort]
[... lots more ...]

We do seem to lose track at the end of the trace; I don't know why yet.
On x86_64 things look even nicer (all the way down to _start), but we
seem unable to unwind through some glibc functions like strcoll_l (I
suspect bad unwind data, but haven't inspected it yet).

The above relies on the fact that we now "know" when the full user
register set is available. The probe handlers set a new CONTEXT->regflags
to indicate this (except for the perf probes, since I didn't know which
events set which regs; it would be good if someone more knowledgeable
about the perf events could take a look). If that flag isn't set, we
know to use task_pt_regs() instead, and there is a new "sanitizing"
mechanism in the dwarf unwinder to scrub any registers that aren't
reliable. For now this is done by zeroing out a copy of the pt_regs; it
would be nicer to prime the unwinder state itself so that it marks those
registers undefined. The heuristics are kind of crude:

/* Whether all user registers are valid. If not, the pt_regs need
 * architecture-specific scrubbing before use (in the unwinder).
 * XXX Currently very simple heuristics, just check arch. Should
 * use user task and user pt_regs state.
 *
 * See arch specific "scrubbing" code in runtime/unwind/<arch>.h
 */
static inline int _stp_task_pt_regs_valid(struct task_struct *task,
                                          struct pt_regs *uregs)
{
/* It would be nice to just use syscall_get_nr(task, uregs) < 0
 * but that might trigger false negatives or false positives
 * (bad syscall numbers or syscall tracing being in effect).
 */
#if defined(__i386__)
  return 1; /* i386 has so few registers that all are saved. */
#elif defined(__x86_64__)
  return 0; /* Not all registers are saved; scrub before use. */
#endif
  return 0; /* Other architectures: assume scrubbing is needed. */
}

But it seems to work OK in the little tests I did.
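
To make the flow described above a bit more concrete, here is a rough
user-space sketch of the decision: use the probe's registers when the
full set is flagged as available, otherwise fall back to the task's
registers and scrub a copy when they cannot be trusted. Everything in
it is hypothetical: struct demo_regs is a reduced stand-in for the real,
architecture-specific struct pt_regs, the demo_* functions are not
runtime functions, and the choice of which registers get zeroed is only
an assumption for illustration (the real list lives in
runtime/unwind/<arch>.h).

```c
#include <string.h>

/* Hypothetical, reduced register set; NOT the kernel's struct pt_regs. */
struct demo_regs {
    unsigned long ip, sp;             /* assumed valid at kernel entry */
    unsigned long bx, bp;             /* assumed possibly stale */
    unsigned long r12, r13, r14, r15; /* assumed possibly stale */
};

/* Stand-in for the arch check in _stp_task_pt_regs_valid(). */
static int demo_regs_valid(int is_i386)
{
    return is_i386 ? 1 : 0;
}

/* Copy the registers and zero those assumed unreliable. The source
 * is left untouched, mirroring "zeroing out a copy of the pt_regs". */
static void demo_scrub_copy(struct demo_regs *dst,
                            const struct demo_regs *src)
{
    memcpy(dst, src, sizeof(*dst));
    dst->bx = dst->bp = 0;
    dst->r12 = dst->r13 = dst->r14 = dst->r15 = 0;
}

/* Choose the register set handed to the unwinder:
 *  - full_regs_flag set: the probe context registers are complete;
 *  - otherwise use the task registers, scrubbed if not valid. */
static const struct demo_regs *
demo_pick_regs(const struct demo_regs *ctx_regs,
               const struct demo_regs *task_regs,
               struct demo_regs *scratch,
               int full_regs_flag, int is_i386)
{
    if (full_regs_flag)
        return ctx_regs;
    if (demo_regs_valid(is_i386))
        return task_regs;
    demo_scrub_copy(scratch, task_regs);
    return scratch;
}
```

Scrubbing a copy rather than the original keeps the saved user state
intact, at the cost of the unwinder seeing zero instead of "undefined"
for the scrubbed registers.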

In theory we should now also try to get unwinding to "the red line"
(kernel space up to the border into user space) to work if the above
fails (there were several fixes to the dwarf unwinder, and the
context.exp backtrace test now rejects any "inexact" frames). But I
haven't yet tested against a kernel that has all CFI built into the
debuginfo (Fedora rawhide should have it though). It will also need even
more cleanup of the unwinder/symbol/stack-printing mechanism. Printing
the stack and unwinding are still somewhat intertwined, but a lot of
progress has been made in separating them. There are now also tapset
functions that return the backtrace as strings, for even more powerful
scripts.

The biggest hurdle for users is making the "task finder" keep track of
the vmas of the relevant processes and making sure the unwind data is
available. In the examples above, -d <mainprog> --ldd and -c <mainprog>
do that trick. It is slightly harder to get it all set up for "random"
processes. umodname(uaddr()) can sometimes help to see what stap would
like. Also, each line of the backtrace should, if available, end with
the module/shared library name (but that relies on the vma tracker
having figured out that the vma maps of that particular process should
be tracked).
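
As a rough illustration of why the module name depends on vma tracking,
the lookup can be thought of as a search over the address ranges that
were recorded for the process. This is only a sketch under that
assumption, not the runtime's data structure: struct demo_vma and
demo_umodname() are made-up names, and any ranges passed in are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one tracked mapping: [start, end). */
struct demo_vma {
    uint64_t start;
    uint64_t end;
    const char *module;
};

/* Return the module whose tracked range contains addr, or NULL when
 * no tracked mapping covers it (i.e. the vma was never recorded). */
static const char *demo_umodname(const struct demo_vma *vmas, size_t n,
                                 uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return vmas[i].module;
    }
    return NULL;
}
```

An address in a mapping that was never tracked yields NULL here, which
corresponds to a backtrace line without a trailing [module] name.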

Cheers,

Mark


* Re: user backtrace from kernel context status
  2010-07-07 18:03 user backtrace from kernel context status Mark Wielaard
@ 2010-07-09 18:12 ` Mark Wielaard
  0 siblings, 0 replies; 2+ messages in thread
From: Mark Wielaard @ 2010-07-09 18:12 UTC (permalink / raw)
  To: systemtap

On Wed, 2010-07-07 at 20:02 +0200, Mark Wielaard wrote:
> Some status update on our ability to produce user backtraces from kernel
> space context. It is now sometimes possible to get (a partial) user
> space backtrace. For those architectures (i686 and x86_64) that use the
> dwarf unwinder. There are some limitations though.
> 
> Some examples:
> 
> Syscalls.
> 
> $ stap -d /bin/ls --ldd -e 'probe syscall.getdents
>   { log(pn()); print_ubacktrace(); }' -c /bin/ls
> syscall.getdents
>  0x000000384f0a2f65 : __getdents+0x15/0x90 [libc-2.12.so]
>  0x000000384f0a2962 : readdir64+0x82/0xdf [libc-2.12.so]
>  0x0000000000407f1f : print_dir+0x1df/0x6f0 [ls]
>  0x000000000040898d : main+0x55d/0x1900 [ls]
>  0x000000384f01ec5d : __libc_start_main+0xfd/0x1d0 [libc-2.12.so]
>  0x0000000000402799 : _start+0x29/0x2c [ls]
> 
> This example works for x86_64, but not for i686 because we don't track
> the vdso yet (PR10080).

This is now fixed and tada (on i686 f13):

$ stap -d /bin/ls --ldd -e 'probe syscall.ioctl
  { log(pn() . ": " . argstr); print_ubacktrace(); }' -c '/bin/ls -d /'
/
syscall.ioctl: 1, 21505, 0xbff13718
 0x00f80416 : __kernel_vsyscall+0x2/0x0
 0x0096c710 : tcgetattr+0x30/0xd0 [libc-2.12.so]
 0x00966de4 : isatty+0x24/0x40 [libc-2.12.so]
 0x08050ba4 : main+0xe4/0x373 [ls]
syscall.ioctl: 1, 21523, 0xbff13888
 0x00f80416 : __kernel_vsyscall+0x2/0x0
 0x0096d1f9 : ioctl+0x19/0x40 [libc-2.12.so]
 0x08050d50 : main+0x290/0x373 [ls]

Tracking the vdso is sadly also architecture specific. So if you like
hacking on non-x86, take a look at the new _stp_vma_match_vdso()
function in vma.c.

Cheers,

Mark
