public inbox for systemtap@sourceware.org
* Getting user-space stack backtraces in more probe contexts
@ 2010-05-14  7:20 William Cohen
  2010-05-15 14:27 ` Roland McGrath
  0 siblings, 1 reply; 6+ messages in thread
From: William Cohen @ 2010-05-14  7:20 UTC (permalink / raw)
  To: SystemTAP


One of the common tasks that people would like to use systemtap for is
to identify where lock contention is happening in user code.  The
futexes.stp example already in systemtap will identify which processes
are waiting for locks, but it does not identify where in the code the
waiting is occurring.  What we would like is to be able to get a
user-space stack backtrace to give the developer some context for
where the FUTEX_WAITs (and other syscalls) occur in the code.
Ideally, we would just put a ubacktrace() in the syscall.futex probe
of futexes.stp.  However, ubacktrace() does not work in kernel
contexts, e.g. kernel.function("*"); ubacktrace() only works in the
user-space process and profile probes, where a pt_regs struct
initialized with user-space register values is available.

An alternative approach would be to have the kernel-space
syscall.futex probe record that a thread was waiting and then record
the rest of the information in user space.  This is complicated by the
multiple sites in the glibc pthread library that make the futex system
call.  It also means that futex use by other libraries might be
missed unless they are instrumented in the same way as the glibc
library.

Previous work placed user-space probes in glibc pthread, but the
uprobes overhead was too high. An implementation with static markers
was generated to reduce that overhead:

http://sourceware.org/ml/systemtap/2009-q1/msg00502.html

However, these patches have not been merged into any of the
distribution packages or into upstream glibc.

I took a closer look at how the systemtap runtime's stack.c and the
uprobes code operate to see whether there is some way to get the
appropriate location to start the user-space unwinding even when in a
kprobe or other kernel context.  Stack unwinding works for uprobes.
The runtime code uses register_uprobe() and register_uretprobe() to
insert the breakpoint into the user-space code and adds a handler for
it.

When the breakpoint from the registration is encountered in the user
code, a trap occurs.  The CPU's user-mode registers are saved and a
SIGTRAP signal is generated.  This signal is handled by a utrace engine
executing uprobe_report_signal() in uprobes.c.  The utrace engine
passes a pointer to the pt_regs into uprobe_report_signal().

The main mechanism used by most user-space processes to get into
kernel code is syscalls, using either int 0x80, syscall, or sysenter.
The mechanisms used for the syscalls do not store data in a pt_regs
structure.  The code that handles the syscall entry into the kernel
for x86_64 is located in linux/arch/x86/kernel/entry_64.S at
ENTRY(system_call).  Looking through the code for the system_call
entry one can see the old stack pointer being saved and the stack
pointer switched to the kernel stack:

        movq    %rsp,PER_CPU_VAR(old_rsp)
        movq    PER_CPU_VAR(kernel_stack),%rsp

I also ran across the following comment and macro in entry_64.S for
creating a full pt_regs for certain syscalls:

 /*
  * Certain special system calls that need to save a complete full stack frame.
  */
         .macro PTREGSCALL label,func,arg

The syscalls using it are:

        PTREGSCALL stub_clone, sys_clone, %r8
        PTREGSCALL stub_fork, sys_fork, %rdi
        PTREGSCALL stub_vfork, sys_vfork, %rdi
        PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
        PTREGSCALL stub_iopl, sys_iopl, %rsi

I looked at how the backtrace mechanism works in perf to see whether
something from that mechanism could be borrowed.  "perf record" has a
"-g" option, which sets the PERF_SAMPLE_CALLCHAIN flag in the event
attributes.  The sampling mechanism in the kernel for perf includes a
pt_regs entry, but it looks like it only records backtraces for
hardware performance events, which cause interrupts.  When the perf
sample is recorded with the "-g" option, the perf_callchain()
function is called.  perf_callchain_user() and
perf_callchain_user32() do the actual walk of the stack.  It looks to
be a simple-minded frame-pointer-based mechanism.
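
For reference, a rough sketch of that kind of frame-pointer walk (not
the actual perf code; the x86 register names and the frame layout here
are just illustrative assumptions) would look something like:

        #include <linux/uaccess.h>
        #include <asm/ptrace.h>

        /* Hypothetical sketch, not the perf source: walk a user-space
         * frame-pointer chain, assuming code built with
         * -fno-omit-frame-pointer and the usual x86 frame layout. */
        struct user_frame {
            unsigned long next_fp;          /* saved frame pointer */
            unsigned long return_address;   /* saved return address */
        };

        static void sketch_user_fp_walk(struct pt_regs *uregs,
                                        unsigned long *pcs, int max)
        {
            unsigned long fp = uregs->bp;
            int n = 0;

            if (n < max)
                pcs[n++] = uregs->ip;       /* PC at the sample point */

            while (n < max && fp) {
                struct user_frame frame;

                if (__copy_from_user_inatomic(&frame, (void __user *)fp,
                                              sizeof(frame)))
                    break;                  /* unreadable frame: stop */
                pcs[n++] = frame.return_address;
                if (frame.next_fp <= fp)    /* frames must move up the stack */
                    break;
                fp = frame.next_fp;
            }
        }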


* Re: Getting user-space stack backtraces in more probe contexts
  2010-05-14  7:20 Getting user-space stack backtraces in more probe contexts William Cohen
@ 2010-05-15 14:27 ` Roland McGrath
  2010-05-18  4:27   ` Frank Ch. Eigler
  0 siblings, 1 reply; 6+ messages in thread
From: Roland McGrath @ 2010-05-15 14:27 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP

Yes, perf is not even trying to address the same problem.  They've just
decided to punt to assuming -fno-omit-frame-pointer everywhere (like Sun).

For relying on CFI as we do, there are three basic approaches possible.
Before getting to those, I'll refine some details from your investigation.

> SIGTRAP signal is generated. This signal is handled by a utrace engine
> executing uprobe_report_signal() in uprobes.c.  The utrace engine
> passes a pointer to the pt_regs into uprobe_report_signal().

Moreover, all utrace callbacks are (by rule of the API) at a place where
it is kosher to use the user_regset interfaces.  'struct pt_regs' is a
convenience if you happen to know what you want in it, but user_regset
is the only thing that gets you all the correct values of all the user
registers on all machines.
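
For concreteness, fetching the general registers through user_regset
looks roughly like the following.  This is a from-memory sketch, not
code from any tree; the regset index and the exact .get() signature
should be double-checked against the kernel version at hand (regset 0
is conventionally the NT_PRSTATUS general registers).

        #include <linux/regset.h>
        #include <linux/sched.h>

        /* Rough sketch: copy @task's general-purpose user registers into
         * @buf via user_regset, from a callback where that is kosher. */
        static int sketch_fetch_user_gregs(struct task_struct *task,
                                           void *buf, unsigned int size)
        {
            const struct user_regset_view *view = task_user_regset_view(task);
            const struct user_regset *rs = &view->regsets[0];
            unsigned int count = rs->n * rs->size;

            if (count > size)
                count = size;
            return rs->get(task, rs, 0, count, buf, NULL);
        }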

> The mechanisms used for the syscalls do not store data in a pt_regs
> structure.

That's not quite true.  There is always a 'struct pt_regs' at the
pointer returned by task_pt_regs(), the same one in all utrace callbacks
and such.  There is always some correct data there.  That's how
asm/syscall.h, instruction_pointer(), user_stack_pointer(), all work.
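
In other words, something like this (just a sketch) always gives you
at least the user PC and SP:

        #include <linux/sched.h>
        #include <asm/ptrace.h>

        /* Sketch: PC and SP are always recoverable from task_pt_regs(),
         * whichever entry path the task took. */
        static void sketch_user_pc_sp(struct task_struct *task,
                                      unsigned long *pc, unsigned long *sp)
        {
            struct pt_regs *regs = task_pt_regs(task);

            *pc = instruction_pointer(regs);
            *sp = user_stack_pointer(regs);
        }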

The "fast" paths store some subset of the 'struct pt_regs'.  The details
vary widely by machine.  On the x86, the tail end of the struct is
actually stored by the hardware for 'int $0x80' and all non-system-call
kernel entries (this has the PC and SP).  In the syscall/sysenter cases
(both 32 and 64, which are different for each flavor with the same
name), it's the kernel's code at those special entry-points that does
this; the x86_64 kernel uses a lot of hairy assembly macros that make
this stuff hard to notice.  (The 'struct pt_regs' sits at the base of
the kernel stack and so it's filled by pushing on the stack, later
pushes setting earlier fields, as we read the struct definition in the
source.)  Then all kinds of entries push the system call number and
argument registers too.  All the non-system-call kinds of entries (page
faults, interrupts, machine traps from user mode code) then push all the
rest of the registers, so they are available for signal handling et al.
(It's consistent AFAIK on other machines that all the non-system-call
entries from user mode make all registers available.)

On the i386, all entries push all the registers in 'struct pt_regs',
because with 6 syscall arguments, the SP, and the syscall number
register, that's all the registers there are.  On most other machines
(all that I'm at all familiar with), only about as many registers as
are used for system calls are saved in the fast entry paths.

Importantly, in cases like the x86_64 where the kernel stack fills the
'struct pt_regs', the "unavailable" portions of task_pt_regs() are not
merely unrelated garbage that doesn't tell you the right values of user
registers.  They are whatever the kernel code in the call chain from the
sys_* function has pushed on the stack.  So referring to that data at an
inappropriate time doesn't just give you useless results for userland,
but to consider it as "userland data" might be leaking kernel bits so as
to violate some information security intent.

As you noticed, the clone/fork/vfork syscalls (and also execve) all take
a special path that ensures all the user registers are saved.  That's so
for those calls on all machines, and it's no coincidence that these are
the places where there are ptrace/utrace/tracehook report points.  (It's
also true of a few other syscalls dealing with signals or arch-specific
weirdness.)

Note that when using syscall tracing (via utrace or the equivalent newer
"syscall tracepoint" path), things are similar but not quite the same.
In those special cases just mentioned, where there are utrace stopping
points inside the call, the full 'struct pt_regs' is on the kernel stack
below the sys_* frame, so you can just access it there (i.e. user_regset
calls will do that safely).  But in syscall tracing, the full registers
are only on the stack and visible to user_regset et al during the
tracehook_report_syscall_* call (where you get utrace or tracepoint
callbacks).  After the entry tracing functions return, the extra
words are popped back off the stack into the registers, and then the
actual call to the sys_* function is clobbering that with private kernel
stack data.  (Then, after the call, all those registers are pushed back
on if there is syscall exit tracing to do.)  So while you have full
register access at the tracing callbacks, when actually inside the
particular system call code, you don't have any way to recover it short
of perfect kernel-side CFI unwinding back to the red line (more on that
later).


Now, to those three paths.

1. Work with what you got.

   This means, give the user unwinder some arch-specific code to prime
   its state from a known-to-be-partial struct pt_regs.  The various
   registers are marked as in "undefined" CFI state as opposed to the
   usual initial state of a known register value.  The ones that are
   available (roughly the syscall registers, PC and SP) are there.  Then
   process CFI as usual, and bail with a complaint when you find you
   need an undefined register value to figure out the next PC.
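
   As a sketch (the structures and the x86_64 DWARF numbering details
   here are illustrative, not the systemtap runtime's actual unwinder
   types):

        /* Hypothetical sketch of approach 1: seed the unwind state from
         * a possibly-partial task_pt_regs(), marking everything else
         * "undefined" so the CFI interpreter can bail out cleanly when
         * it needs a value that was never saved.  x86_64 DWARF columns:
         * 0=rax 4=rsi 5=rdi 7=rsp ... 16=return address. */
        enum sk_state { SK_UNDEFINED, SK_VALID };

        struct sk_reg {
            enum sk_state state;
            unsigned long value;
        };

        struct sk_frame {
            struct sk_reg reg[17];
        };

        static void sk_prime_from_fast_syscall(struct sk_frame *f,
                                               struct pt_regs *regs)
        {
            int i;

            for (i = 0; i < 17; i++)
                f->reg[i].state = SK_UNDEFINED;

            f->reg[7].state  = SK_VALID;  f->reg[7].value  = regs->sp;
            f->reg[16].state = SK_VALID;  f->reg[16].value = regs->ip;
            f->reg[5].state  = SK_VALID;  f->reg[5].value  = regs->di;
            f->reg[4].state  = SK_VALID;  f->reg[4].value  = regs->si;
            /* ... likewise for the other saved argument registers. */
        }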

   This is one of those double-edged things that has the (dis)advantage
   that it works 100% perfectly on i386 (what you got is all there is to
   have).  It may well even work most of the time on other machines, I
   have barely a guess off hand about that.  (It's sometimes possible to
   get a certain someone to write a fancy script that could analyze a
   decent corpus of binaries and prestidigitate about their FDEs'
   sensitivity to certain missing registers.)  But the bare guess is
   that it might well tend to cover just recovering the PC and CFA
   (enough to keep doing a basic backtrace) much more often than it
   covers all the registers (so that a full debugger or extracting
   application $variables from unwound frames, which we don't yet
   support anyway, would be happy).

   This has the feature that you can just try it in the unwinder code
   today without depending on any other moving parts, and see what you
   get.  It's known complete as is on i386, so you can just declare
   victory for that machine.  For others, you can see what happens in
   practice right away.  

   It's sure that full-frame unwinding (that is, calculating all the
   registers) will hit "undefined"s.  On-demand register finding was
   previously mentioned as an optimization for the basic backtrace,
   which rarely needs to figure many of the registers at each frame just
   to find the next PC.  Doing that would make the "bail on undefined"
   logic hit far less often, one presumes.  At that point, perhaps it
   becomes good enough, though never fully waterproof.

2. Turtles all the way down!

   (The turtles are made of CFI.)  That is, unwind in kernel space all
   the way back to the red line.  (The "red line" is what OS hackers of
   my vintage call the kernel-mode/user-mode boundary.)  If all the
   kernel CFI is correct, then you can unwind from anywhere all the way
   back to the frame that's the kernel entry point, and then unwind from
   there to full user registers.  It should be just like unwinding
   through an in-kernel interrupt or trap frame.  

   In 100% proper CFI these frames are marked as "signal frames" (it's
   part of the "augmentation string"), so you can see those and then
   check whether the "caller's PC" of that frame is < TASK_SIZE
   (i.e. outside kernel text) to tell whether it's the base kernel entry
   frame or is an in-kernel trap frame.  (The magic registers that
   user_mode() checks are not recorded in the CFI, though they could be.
   If they were, you could apply the exact check user_mode() does to the
   reconstructed registers to decide if you think they're from user
   mode.)  You can also do something much simpler like see if the
   unwound stack pointer matches user_stack_pointer(task_pt_regs(task)).
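
   In unwinder-code terms the loop is roughly this (a sketch reusing
   the illustrative sk_frame type from the option 1 sketch; the helper
   names stand in for the runtime's real CFI step and accessor
   functions):

        /* Hypothetical sketch of approach 2: apply kernel CFI frame by
         * frame until the recovered PC is a user address, then hand the
         * recovered register set over to the user-side unwind. */
        static int sk_unwind_to_red_line(struct task_struct *task,
                                         struct sk_frame *f)
        {
            while (sk_step_one_kernel_frame(f) == 0) {
                if (sk_frame_pc(f) < TASK_SIZE)
                    return 0;               /* crossed into user space */

                /* The cheaper cross-check mentioned above: does the
                 * unwound SP match the saved user SP? */
                if (sk_frame_sp(f) ==
                    user_stack_pointer(task_pt_regs(task)))
                    return 0;
            }
            return -1;                      /* ran out of (good) kernel CFI */
        }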
   
   All this requires is that all kernel code have CFI, that the CFI be
   correct, and that you have that CFI.  Three small matters.  I can
   only really speak to these for x86.

   For some time now, since sometime after the short-lived in-kernel CFI
   unwinder got removed, the linker script used to build vmlinux
   discards the .eh_frame section.  This is where all the hand-written
   CFI in assembly code has been going, since that's where the assembler
   puts it for .cfi_* directives.  So, in x86 kernels there is believed
   to be CFI for all the code and it is imagined to be correct, but it
   has not been in any binary you've seen in a long time.

   In the interim, the assembler has grown the .cfi_sections directive
   that lets us direct whether the .cfi_* directives in assembly code
   produce .eh_frame, .debug_frame, or both.  I have only just now sent
   a fix to the kernel x86 maintainers to use .cfi_sections .debug_frame
   in the x86 assembly code, so that CFI is preserved for us to find.
   (I've put that patch into the rawhide kernel, so kernel-debuginfo
   from rawhide will have full CFI the next time the rest of the rawhide
   kernel's patches start building again.  We can probably get it into
   Fedora 13 and update kernels too.)

   So, near-future kernels will have CFI for the kernel entry points
   (and other assembly) so we can find out concretely whether there is
   any CFI that is missing or wrong.  It seems to be maintained fairly
   judiciously despite the upstream kernel build not having any way for
   anyone ever to see it.  In the past I have volunteered to fix it as
   needed.  Hence, with vigilance, relying on it for "current" kernels
   is plausible.

   For other machines, I don't really know the details.  Unless there is
   some magic I don't know about, the powerpc assembly in the kernel has
   no CFI, for example.  It might be that the base kernel stack layout
   is formulaic enough to handle the kernel entry frames generically
   with a hard-coded rule on that machine or something like that, but
   you'd need an arch expert to tell you for each arch.

   Incidentally, ia64 (and arm?) has its own non-DWARF flavor of unwind
   info that the assembler generates mostly automagically without the
   hand-written directives, and an in-kernel unwinder for it.  In fact,
   on ia64, that unwinder is the one and only way you ever get the full
   user registers.  Its unw_unwind_to_user() is used by the ptrace code.
   So, while I have no idea about the 'struct pt_regs' story on ia64, I
   believe there it's actually safe to use the user_regset calls more or
   less anywhere you don't hold spinlocks or whatnot.

   This solution can be "smooth round the bend", as they say.  There's
   no messy "phase change" at all, it's just unwinding all the way
   through.  There's the minor bump of noticing when you shift from
   consulting kernel CFI to user CFI, but perhaps you just think of that
   as PC ranges in different modules, as with a kernel module's CFI.

   But, my bet is it may prove to be not quite perfect (needing assembly
   fixes) on x86, and difficult to get anyone to add hand-assembly CFI
   in its entirety to powerpc and other machines where it's absent now.
   (The assembler supports the same .cfi_* stuff for powerpc and other
   machines just fine, if the kernel assembly code wanted to use it.)
   You certainly can't use it on existing kernels, and there is only any
   kind of ETA on that as yet for the x86.

3. Two phase with a safe point

   This is the notion that Will mentioned, but there is a general and
   optimal way to do it.  It's a classic "software interrupt" scheme:
   at an arbitrary point, put down a marker; when you reach a safe
   point (here, just before returning to user mode), pick up the
   marker and do the rest of the work.

   There are a variety of ways to do this, but there are now (kind of)
   some good ones.  In recent kernels, the TIF_NOTIFY_RESUME flag
   exists just for this sort of thing.  In all kernels, TIF_SIGPENDING
   does a related thing.

   You can safely do set_thread_flag(TIF_NOTIFY_RESUME) from anywhere.
   This means tracehook_notify_resume() gets called before returning
   to user mode.  tracehook_notify_resume() is an arch-independent
   inline with one call site on each machine (in arch-specific code).
   In any kernel that has it, you could at least use a kprobe on that
   inline to get a callback at the safe user-mode boundary (where
   user_regset is kosher, if you don't have interrupts disabled or
   other locks held or whatnot).
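
   A bare-bones sketch of the two halves (the per-task pending marker
   and the sk_* helpers are purely illustrative, not existing kernel or
   systemtap interfaces):

        #include <linux/sched.h>
        #include <linux/thread_info.h>

        /* Phase 1: runs in an arbitrary probe context (kprobe, etc.).
         * Just leave a marker and ask for the safe-point callback. */
        static void sk_request_user_backtrace(void)
        {
            sk_mark_pending(current);           /* illustrative per-task flag */
            set_thread_flag(TIF_NOTIFY_RESUME); /* safe to do from anywhere */
        }

        /* Phase 2: runs from a probe at the tracehook_notify_resume()
         * boundary, where the full user registers are valid and
         * user_regset is kosher. */
        static void sk_resume_handler(struct task_struct *task)
        {
            if (!sk_test_and_clear_pending(task))  /* illustrative */
                return;
            sk_do_user_backtrace(task, task_pt_regs(task));
        }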

   In utrace, this is what passing UTRACE_REPORT to utrace_control()
   does.  But you can't call utrace_control() from interrupt level or
   with locks held or so forth, because of lockdep issues.  I'd always
   expected to have some manner of utrace call that can be made by the
   current task from interrupt level or anywhere, to demand a utrace
   report at the next safe point--i.e., what UTRACE_REPORT (or also a
   UTRACE_INTERRUPT option) would do if another thread were making a
   utrace_control() call.  I think we can add a simple thing like that
   to utrace easily.  But it's not there now.

   If you are considering something like a futex syscall, that might
   block (or already has), then TIF_NOTIFY_RESUME/UTRACE_REPORT
   doesn't do anything until the syscall finishes of its own accord.
   (For a problem futex wait you are investigating, that might be
   never.)  If you know that it's a syscall that restarts properly
   (futex does, correctly resuming a timeout if there is one), then
   you can use TIF_SIGPENDING (in utrace, UTRACE_INTERRUPT) instead.
   That will prevent it from blocking normally, instead going back to
   user mode to restart the syscall.  On its way back, you can see the
   full registers before or when it restarts.  Then of course you have
   to know not to loop when you hit your probe inside the futex code
   the second time.  (Note in the case of futex and some others, the
   second time around will be NR_restart_syscall rather than the
   original syscall, hence that path will be via futex_wait_restart()
   rather than sys_futex->do_futex as the first time's path was.)  In
   the case of a thread already blocked on a futex, you can already
   use utrace_control() on it to do UTRACE_INTERRUPT from another
   thread.  To do it from inside the futex call on the current thread,
   e.g. from a timer interrupt in the same thread context or something
   like that, you'd need the same new utrace interface as above.

   In the absence of utrace or that new feature, you can do a couple
   of things with TIF_SIGPENDING.  You can actually send a signal from
   anywhere (send_sig et al), and then catch that happening by normal
   means.  That is, you can use an ignored signal and then notice
   trace_signal_deliver(); or, with existing utrace you can use any
   signal and swallow it in the utrace report_signal handler.  Or, you
   can do plain set_thread_flag(TIF_SIGPENDING) from anywhere, and
   then use a kprobe on get_signal_to_deliver() to catch it before it
   checks and sees no signals and does nothing.

   Finally, you can just change the user PC to something that leads to
   an event you know how to trace.  (There's no point in just using an
   invalid PC, since that only leads to a signal you might as well
   just send.)  That could be the vDSO __kernel_rt_sigreturn if you
   have a probe in sys_rt_sigreturn or something like that, or could
   be some random PC where you previously installed a uprobe.  This is
   probably a poor choice, because of complications like restoring the
   real user PC if another signal comes along first.

   With any of those methods, what the low-level implementation provides
   is equivalent to two internal probes, hence "two phase".  There are
   lots of ways to deal with this in the script world.  e.g.

   a. For backtraces alone, you could have ubacktrace() store a magic
      object that is a placeholder for a backtrace to be done at the
      safe point.  For printing a backtrace from an inside-kernel probe,
      some magic nugget would be placed in the output buffer so that the
      stapio side would know not to actually deliver this buffer as
      printed text until the second phase probe comes along to fill in
      that portion of the output out of order.
   b. You could force the script/tapset to do it entirely in terms of
      two language-level probes:
	probe kernel... { notify_resume() /* could be embedded-c */ }
	probe user.resume { ... } // could be just = kernel.function("...")
   c. You could add two-phase probes as a first-class language feature:
	probe kernel... {
		print_firstpart();
		@resume {
			print_secondpart();
		}
	}
      Perhaps with some interrupting variant too, perhaps even one that
      rolls in the restart-once logic:
	probe kernel... {
		@restartsys { bt = ubacktrace() }
		printf("blah at %s\n", bt)
	}
      Inside the @resume et al clauses, you have no $ context (can use
      only kernel globals), but have full user registers.  Perhaps if it
      appears in a library... probe, then your globals-only $ context is
      for the user module named in the probe instead.

   This class of approach has the big advantage that it's entirely
   compatible with doing user unwinding from the .eh_frame CFI in the
   user text rather than relying on prepacking.  At these safe points,
   it's entirely fine to do full user memory access with uaccess.h or
   whatever, block in page faults to bring the necessary text in, etc.
   (Back at the dawn of time, I presumed this is how it would always be
   done, and hence thought it pretty nutty to be packing up userland CFI
   data into stap kernel modules.)  At worst, you risk nothing but
   wedging that one thread, and it can still be killed cleanly.

   Even with the prepacking, this can let the user unwinder code run
   preemptible (for voluntary preemption, you can sprinkle it with
   cond_resched, and for paranoia, fatal_signal_pending bail-out
   checks).  This removes the burden of dealing with userland CFI down
   in any sensitive places where delaying too long or getting led astray
   into an infinite loop is a big problem.  Then any unwinding work done
   in such places is only for the kernel, where the CFI to contend with
   is a finite known set we can have scoured thoroughly for bogons
   beforehand.  (Until someone compiles another kernel module, of
   course, but you get the idea.)
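
   That is, at the safe point the user unwind loop itself can be as
   relaxed as this (sketch; sk_step_one_user_frame() stands in for a
   real per-frame CFI step):

        /* Hypothetical sketch: a safe-point user unwind can yield and
         * bail out cleanly, as suggested above. */
        static int sk_safe_point_unwind(struct task_struct *task,
                                        unsigned long *pcs, int max)
        {
            int n = 0;

            while (n < max && sk_step_one_user_frame(task, &pcs[n]) == 0) {
                n++;
                cond_resched();                  /* voluntary preemption */
                if (fatal_signal_pending(current))
                    break;                       /* give up cleanly */
            }
            return n;
        }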

I said three paths, but the careful reader will have noticed when I was
talking about the nuances of entry paths earlier that there is also:

4. Pre-collect via syscall-entry tracing

   As I mentioned above, the syscall tracing callbacks have complete
   register access and can use user_regset.  So, you can enable syscall
   tracing via utrace or the "sys_enter" tracepoint.  In your callback,
   use user_regset (properly) or the 'struct pt_regs' (from argument or
   task_pt_regs(), improperly) to save off the complete user register
   data somewhere.  Then when in a later probe point before the syscall
   exit, you can use the saved regset block to prime the user unwinder.
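
   In callback form that is roughly the following (a sketch: the
   prototype matches the sys_enter tracepoint's (regs, id) arguments,
   the per-task save area is illustrative, and note this is the
   "improper" pt_regs shortcut rather than a proper user_regset copy):

        #include <linux/ptrace.h>
        #include <linux/sched.h>
        #include <linux/string.h>

        /* Hypothetical sketch of approach 4: at syscall entry, stash a
         * copy of the full user registers so a later probe inside the
         * syscall can prime the user unwinder from them.  The id
         * argument is the syscall number; see the filtering note below. */
        static void sk_sys_enter_probe(struct pt_regs *regs, long id)
        {
            /* sk_saved_regs_for() is an illustrative lookup of per-task
             * storage owned by the instrumentation. */
            struct pt_regs *save = sk_saved_regs_for(current);

            memcpy(save, regs, sizeof(*save));
        }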

   You might optimize out copying the registers based on looking for a
   syscall_get_nr() value.  Or you might make that syscall-entry probe
   the place where you check for and record a futex call, rather than a
   separate kernel probe inside the sys_futex call chain.

   Conversely, you could have a general mode (maybe even enabled just by
   using ubacktrace() in a script!) that just implicitly enables the
   syscall-entry tracing everywhere (or in targeted tasks, or whatever)
   with a canned probe in the runtime that stores the user registers.
   Both the tracing mode and the copying add overhead to every syscall,
   which can be measured to think about how desirable this is to do, and
   how automatically.  (I happen to know that on x86 there is some work in
   the low-level magic we can do to reduce the tracing mode part of the
   overhead on the syscall-exit side, which you incur by doing
   syscall-entry tracing.  So let me know if measurements suggest that
   optimizing that part of it could be the tipping point for a decision
   that's desirable in other regards such as punting on all the
   potential work involved in all the other avenues under discussion.)


Thanks,
Roland


* Re: Getting user-space stack backtraces in more probe contexts
  2010-05-15 14:27 ` Roland McGrath
@ 2010-05-18  4:27   ` Frank Ch. Eigler
  2010-05-18 15:56     ` Roland McGrath
  0 siblings, 1 reply; 6+ messages in thread
From: Frank Ch. Eigler @ 2010-05-18  4:27 UTC (permalink / raw)
  To: Roland McGrath; +Cc: William Cohen, SystemTAP


roland wrote:

> [...]
> Now, to those three paths.
> 1. Work with what you got.
> [...]
>    This means, give the user unwinder some arch-specific code to
>    prime its state from a known-to-be-partial struct pt_regs.  [...]
>    [...]  But the bare guess is that it might well tend to cover
>    just recovering the PC and CFA (enough to keep doing a basic
>    backtrace) much more often than it covers all the registers [...]

I like it as a fallback / heuristic.  (Plus we should be able to fall
back to frame-pointer heuristics and/or the kernel's guesswork.)


> 2. Turtles all the way down!
> [...]
>    (The turtles are made of CFI.)  That is, unwind in kernel space all
>    the way back to the red line.  [...]
>    In 100% proper CFI these frames are marked as "signal frames" (it's
>    part of the "augmentation string"), so you can see those and then
>    check whether the "caller's PC" of that frame is < TASK_SIZE
>    [...]
>    All this requires is that all kernel code have CFI, that the CFI be
>    correct, and that you have that CFI.  Three small matters.  [...]

I like it.  This seems like the best first try.


> 3. Two phase with a safe point
>    This is the notion that Will mentioned, but there is a general and
>    optimal way to do it.  It's a classic "software interrupt" scheme:
>    at an arbitrary point, put down a marker [...]

I don't like it as much.  It's far more complex, plus I would like not
to sacrifice the ability to process backtraces as first class run-time
objects.


> 4. Pre-collect via syscall-entry tracing
> [...]

I like this, but it doesn't appear to handle the interrupt / signal /
preemption type of involuntary jumps into kernel space.


- FChE


* Re: Getting user-space stack backtraces in more probe contexts
  2010-05-18  4:27   ` Frank Ch. Eigler
@ 2010-05-18 15:56     ` Roland McGrath
  2010-05-18 20:07       ` Frank Ch. Eigler
  0 siblings, 1 reply; 6+ messages in thread
From: Roland McGrath @ 2010-05-18 15:56 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: William Cohen, SystemTAP

> > 1. Work with what you got.
[...]
> I like it as a fallback / heuristic.  (Plus we should be able to fall
> back to frame-pointer heuristics and/or the kernel's guesswork.)

I wouldn't call this a "fallback".  Rather, it seems like the natural
thing to try first and then only fall back to other (or additional)
means when you run into a need for a trap-state register that you
don't have.

So, you'd proceed with user-only unwinding as I described.  When you
come across a need for a register value and that register's state is
"undefined", then you pause to go off and do kernel-side unwinding
from your base probe state back to a user-mode state.  If you succeed
in producing what you think are the user-mode registers at the
boundary crossing, then you can return to user user-mode unwinding
state and replace every user register there in "undefined" or
"same-value" state with the value just recovered via kernel unwinding.

Of course, if the same probe site also happens to use a kernel
backtrace, then you should just have it do the kernel backtrace
calculation beforehand (even if it's used later in the probe action
script code than the ubacktrace is) to prime the state for the user
backtrace.  (No need for laziness when you know you're going to do it
anyway.)

> > 2. Turtles all the way down!
[...]
> I like it.  This seems like the best first try.

Well, I'm not sure what you mean by "first".  This is little work on
the stap/runtime side and almost entirely just a big dependency on the
kernel compilation details being what you need.  So you can try it
just as soon as you are using a bleeding-edge kernel, as of this
writing not yet compiled anywhere by anyone except for my home build,
and only on x86.  (Or, go to town right away on ia64, where everything
is already peachy in its own way.)

The upstream x86 kernel, when built with a sufficiently recent
assembler, may well have the CFI for the important assembly layers by
2.6.35 or so.  Fedora x86 kernels will have it much sooner, but
probably only ever in updates for 12 and 13.

A further note about kernel CFI.  The current x86 kernels are built
with -fno-asynchronous-unwind-tables, so the compiler will sometimes
fudge the CFI state in between call sites.  This becomes relevant when
you want to unwind across an interrupt/trap frame where kernel-mode
code got interrupted.  The assembly CFI for the actual interrupt frame
will be correct, so you unwind through it to the exact state that got
interrupted.  But if what was interrupted was near code changing CFI
state between calls, there may be problems.  A likely example is if
the interrupted instruction was before pushing some arguments on the
stack for a call.  You may get CFI that thinks the SP is in the state
before any of the pushes, or after all of them, for a PC where that's
not right.  This could get you off by a word or more in judging the
CFA of that interrupted frame, which will lead you to wrong values for
its caller's PC, CFA, or other registers.

> > 3. Two phase with a safe point
[...]
> I don't like it as much.  It's far more complex, 

Agreed on complexity.  OTOH, it is in a larger sense a complexity
reducer when one takes advantage of it to abandon prepacking CFI from
userland binaries and instead do dynamic user CFI unwinding (just as
userland does).

> plus I would like not
> to sacrifice the ability to process backtraces as first class run-time
> objects.

I don't quite understand this.  Making backtraces a special type is
what I'd call a "first-class" object, contrary to the status quo.  I
suspect you mean "... process backtraces as normal run-time strings".

> > 4. Pre-collect via syscall-entry tracing
> > [...]
> 
> I like this, but it doesn't appear to handle the interrupt / signal /
> preemption type of involuntary jumps into kernel space.

I'm sorry I was not clear.  I referred to "non-system-call entries" or
"other entries" when talking about different kinds of kernel entry,
and these include all those kinds you mention.  You get complete
information for these and can use 'struct pt_regs' or user_regset
freely.  It is only the system call paths (and not even those on i386)
that save partial register information and could require any of these
complex techniques.


Thanks,
Roland


* Re: Getting user-space stack backtraces in more probe contexts
  2010-05-18 15:56     ` Roland McGrath
@ 2010-05-18 20:07       ` Frank Ch. Eigler
  2010-05-18 20:07         ` Roland McGrath
  0 siblings, 1 reply; 6+ messages in thread
From: Frank Ch. Eigler @ 2010-05-18 20:07 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap

Hi -

> [...]
> So, you'd proceed with user-only unwinding as I described.  When you
> come across a need for a register value and that register's state is
> "undefined", then you pause to go off and do kernel-side unwinding
> from your base probe state back to a user-mode state.  [...]

Right; this would require representation of known/unknown state in the
unwinder code, which is a new smop.  That's what I meant as opposed to
"first" below.

> 
> > > 2. Turtles all the way down!
> [...]
> > I like it.  This seems like the best first try.
> 
> Well, I'm not sure what you mean by "first".  This is little work on
> the stap/runtime side and almost entirely just a big dependency on the
> kernel compilation details being what you need.  [...]

OK, I didn't realize that the status quo was that broken.


> [...]
> > I don't like it as much.  It's far more complex, 
> 
> Agreed on complexity.  OTOH, it is in a larger sense a complexity
> reducer when one takes advantage of it to abandon prepacking CFI from
> userland binaries and instead do dynamic user CFI unwinding (just as
> userland does).

I don't see a strong connection between the unwind-data-preload vs.
two-phase unwinding.  We could attempt to access user .eh_frame data
from our current context too.  We could also attempt to page that stuff
in ahead of time (with the proviso that the kernel could throw it back
out again).


> > plus I would like not
> > to sacrifice the ability to process backtraces as first class run-time
> > objects.
> 
> I don't quite understand this.  Making backtraces a special type is
> what I'd call a "first-class" object, contrary to the status quo.  I
> suspect you mean "... process backtraces as normal run-time strings".

Yes.


> > > 4. Pre-collect via syscall-entry tracing
> > > [...]
> > 
> > I like this, but it doesn't appear to handle the interrupt / signal /
> > preemption type of involuntary jumps into kernel space.
> 
> I'm sorry I was not clear.  I referred to "non-system-call entries" or
> "other entries" when talking about different kinds of kernel entry,
> and these include all those kinds you mention.  You get complete
> information for these and can use 'struct pt_regs' or user_regset
> freely.

OK, I guess we'd then have to be able to know when we're in one of
these 'complete information' probe contexts.

- FChE


* Re: Getting user-space stack backtraces in more probe contexts
  2010-05-18 20:07       ` Frank Ch. Eigler
@ 2010-05-18 20:07         ` Roland McGrath
  0 siblings, 0 replies; 6+ messages in thread
From: Roland McGrath @ 2010-05-18 20:07 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

> I don't see a strong connection between the unwind-data-preload vs.
> two-phase unwinding.  We could attempt to access user .eh_frame data
> from our current context too.  We could also attempt to page that stuff
> in ahead of time (with the proviso that the kernel could throw it back
> out again).

The point is that at the safe point you can rely entirely on normal paging
of the user text and so not become unreliable under memory pressure.  As
long as you're avoiding real paging, then anyone wanting sure reliability
will need the prepacking or else a pre-mlock'ing of all the user .eh_frame
pages, which has roughly equivalent precommitted-RAM constraints to
prepacking.  Normal user paging is the only thing that will ever be
entirely reasonable for scripts that could be entirely unprivileged and
still work as reliably as the user data makes possible.

> OK, I guess we'd then have to be able to know when we're in one of
> these 'complete information' probe contexts.

To a first approximation syscall_get_nr(task, task_pt_regs(task)) < 0 tells
you.  That can actually also be telling you that it's a syscall tracing
stop for a bogus syscall entry with too many bits set in the syscall number
register.  The only place you could be other than the syscall tracing stop
is an in-kernel interrupt somewhere on the entry path (irq or preemption).
In that case, the task_pt_regs(current) state can be even more
partial--only the PC and SP are stored before interrupts are enabled.  
(The kernel assembly CFI should walk you through all such states fine, but
that's not the point here.)
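
In code, that first approximation is just (sketch):

        #include <asm/syscall.h>

        /* Sketch: first approximation of "full user registers are
         * available in task_pt_regs()" at the current stopping point. */
        static int sk_user_regs_complete(struct task_struct *task)
        {
            return syscall_get_nr(task, task_pt_regs(task)) < 0;
        }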


Thanks,
Roland
