How to get the detailed kernel stack trace if specified fuction takes too long to finish?

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* How to get the detailed kernel stack trace if specified fuction takes too long to finish?
@ 2014-03-28  4:22 liang xie
  2014-03-28 16:56 ` Frank Ch. Eigler
  2014-03-28 17:06 ` Josh Stone
  0 siblings, 2 replies; 8+ messages in thread
From: liang xie @ 2014-03-28  4:22 UTC (permalink / raw)
  To: systemtap

Hi all,

Probably it's a newbie question, to be honest, i am new to SystemTap,
currently i want to capture the kernel stack trace if "sys_write" runs
too long, if that i could know where it sucks.

My OS:2.6.32-220.el6.x86_64
# stap -V
Systemtap translator/driver (version 1.6/0.152 non-git sources)

My tried script:
global i=0
global tb=0
global te=0
probe kernel.function("sys_write") {
  tb=gettimeofday_us();
}
probe kernel.function("sys_write").return {
  te=gettimeofday_us();
  if(te-tb > 100000) {
   i++;
   printf("time_us: %d\n", te-tb);
   print_backtrace();
   if(i > 15) {
    exit();
   }
  }
}

# stap -x <pid> xxx.stp
time_us: 226795
Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
time_us: 101400
Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
time_us: 527853
Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
time_us: 133948
Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]

It doesn't show my "expected" kernel stack trace,e.g. file system or
block layer related stack trace. How to archive that? Any pointers
will be highly appreciated!

Best,
Liang

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-03-28  4:22 How to get the detailed kernel stack trace if specified fuction takes too long to finish? liang xie
@ 2014-03-28 16:56 ` Frank Ch. Eigler
  2014-03-28 17:06 ` Josh Stone
  1 sibling, 0 replies; 8+ messages in thread
From: Frank Ch. Eigler @ 2014-03-28 16:56 UTC (permalink / raw)
  To: liang xie; +Cc: systemtap

Hi -

xieliang007 wrote:

> Probably it's a newbie question, to be honest, i am new to SystemTap,
> currently i want to capture the kernel stack trace if "sys_write" runs
> too long, if that i could know where it sucks.

OK.

> [...] probe kernel.function("sys_write").return [...]
>    print_backtrace();

> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
>
> It doesn't show my "expected" kernel stack trace,e.g. file system or
> block layer related stack trace. 

The expectation is mistaken.  By the time that sys_write returns, all
that kernel activity is in the past, and will have left no trace on
the stack.

> How to archive that? Any pointers will be highly appreciated!

Not easy.  

One way could be to sample the kernel backtrace (from probe
timer.profile) while a sys_write is in effect.  Attributing the
backtraces to the particular thread suffering from a slow sys_write is
going to be tricky, but maybe this could work:

      global backtraces%
      global active
      probe syscall.write { 
            active[tid()]=1 
      }
      probe timer.profile {
            if (! (tid() in active)) next;
            backtraces[tid(), backtrace()] <<< 1
      }
      probe syscall.write.return { 
            delete active[tid()]
            elapsed = gettimeofday_ms() - @entry(gettimeofday_ms())
            if (elapsed <= 4)
                next; // oops, leaves behind backtraces[] records
            printf("%s[%d] sleepy for %d ms:\n", execname(), tid(), elapsed)
            foreach ([t,bt] in backtraces- limit /* top */ 10)
                    if (t == tid()) {
                       printf("\n%d hits:\n", @count(backtraces[t,bt]))
                       print_stack(bt) 
                       }
      }

      # stap --all-modules FOO.stp

- FChE

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-03-28  4:22 How to get the detailed kernel stack trace if specified fuction takes too long to finish? liang xie
  2014-03-28 16:56 ` Frank Ch. Eigler
@ 2014-03-28 17:06 ` Josh Stone
  2014-04-02 10:25   ` liang xie
  1 sibling, 1 reply; 8+ messages in thread
From: Josh Stone @ 2014-03-28 17:06 UTC (permalink / raw)
  To: systemtap

On 03/27/2014 09:22 PM, liang xie wrote:
> Hi all,
> 
> Probably it's a newbie question, to be honest, i am new to SystemTap,

No problem, welcome!

> currently i want to capture the kernel stack trace if "sys_write" runs
> too long, if that i could know where it sucks.
> 
> My OS:2.6.32-220.el6.x86_64
> # stap -V
> Systemtap translator/driver (version 1.6/0.152 non-git sources)
> 
> My tried script:
> global i=0
> global tb=0
> global te=0
> probe kernel.function("sys_write") {
>   tb=gettimeofday_us();
> }
> probe kernel.function("sys_write").return {
>   te=gettimeofday_us();
>   if(te-tb > 100000) {
>    i++;
>    printf("time_us: %d\n", te-tb);
>    print_backtrace();
>    if(i > 15) {
>     exit();
>    }
>   }
> }
> 
> # stap -x <pid> xxx.stp
> time_us: 226795
> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
> time_us: 101400
> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
> time_us: 527853
> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
> time_us: 133948
> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
> 
> It doesn't show my "expected" kernel stack trace,e.g. file system or
> block layer related stack trace. How to archive that? Any pointers
> will be highly appreciated!

Right, since you waited until return to print_backtrace, it will only be
from the end of sys_write.  (Though it appears to be the beginning,
sys_write+0x0, because of the way a kretprobe captures returns.)

Since you want info from deeper kernel internals, you'll need additional
probes.  One option is to use the para-callgraph example script, where
sys_write is your first argument to start a trace, and the second
argument gives a broad subsystem to dig through.  Like:

$ stap para-callgraph.stp 'kernel.function("sys_write")' \
        'kernel.function("*@fs/*")'

This script will show you timestamps, but you'll get all output whether
it's slow or not.  Maybe you can modify it to filter some of the
information.

Another option is to use timer.profile, and then you can watch for your
long waits.  Something like:

global start_time
probe syscall.write { start_time[tid()] = gettimeofday_us() }
probe syscall.write.return { delete start_time[tid()] }
probe timer.profile {
  t = tid()
  if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
  {
    printf("long sys_write in tid %d\n", t)
    print_backtrace()
    delete start_time[t]
  }
}

Even this assumes that the long wait is busy on the cpu, but I think
most likely it's an IO wait, which means you'll be scheduled off while
it completes.  So then instead of timer.profile, or in addition, you
could use a scheduler tracepoint.  Like:

global start_time
probe syscall.write { start_time[tid()] = gettimeofday_us() }
probe syscall.write.return { delete start_time[tid()] }
probe kernel.trace("sched_switch") {
  t = task_tid($next)
  if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
  {
    printf("long sys_write in tid %d\n", t)
    print_backtrace()
    delete start_time[t]
  }
}

So as your thread is scheduled back onto the cpu, if it has been a long
time, the backtrace will tell you what path it took to get scheduled off
in the first place.

Hope that's enough to give you some direction...

Josh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-03-28 17:06 ` Josh Stone
@ 2014-04-02 10:25   ` liang xie
  2014-04-02 16:32     ` Josh Stone
  0 siblings, 1 reply; 8+ messages in thread
From: liang xie @ 2014-04-02 10:25 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

I tried the last scheduler script, the output just like this:
# stap --all-modules y.stp
WARNING: missing unwind/symbol data for module 'uprobes'
long sys_write in tid 10162
Systemtap probe: kernel.trace("sched_switch")
long sys_write in tid 11510
Systemtap probe: kernel.trace("sched_switch")
long sys_write in tid 11288
Systemtap probe: kernel.trace("sched_switch")

I could not figure out why "print_backtrace()" didn't output any log:)

On Sat, Mar 29, 2014 at 12:30 AM, Josh Stone <jistone@redhat.com> wrote:
> On 03/27/2014 09:22 PM, liang xie wrote:
>> Hi all,
>>
>> Probably it's a newbie question, to be honest, i am new to SystemTap,
>
> No problem, welcome!
>
>> currently i want to capture the kernel stack trace if "sys_write" runs
>> too long, if that i could know where it sucks.
>>
>> My OS:2.6.32-220.el6.x86_64
>> # stap -V
>> Systemtap translator/driver (version 1.6/0.152 non-git sources)
>>
>> My tried script:
>> global i=0
>> global tb=0
>> global te=0
>> probe kernel.function("sys_write") {
>>   tb=gettimeofday_us();
>> }
>> probe kernel.function("sys_write").return {
>>   te=gettimeofday_us();
>>   if(te-tb > 100000) {
>>    i++;
>>    printf("time_us: %d\n", te-tb);
>>    print_backtrace();
>>    if(i > 15) {
>>     exit();
>>    }
>>   }
>> }
>>
>> # stap -x <pid> xxx.stp
>> time_us: 226795
>> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
>> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
>> time_us: 101400
>> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
>> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
>> time_us: 527853
>> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
>> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
>> time_us: 133948
>> Returning from:  0xffffffff81176f90 : sys_write+0x0/0x90 [kernel]
>> Returning to  :  0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b [kernel]
>>
>> It doesn't show my "expected" kernel stack trace,e.g. file system or
>> block layer related stack trace. How to archive that? Any pointers
>> will be highly appreciated!
>
> Right, since you waited until return to print_backtrace, it will only be
> from the end of sys_write.  (Though it appears to be the beginning,
> sys_write+0x0, because of the way a kretprobe captures returns.)
>
> Since you want info from deeper kernel internals, you'll need additional
> probes.  One option is to use the para-callgraph example script, where
> sys_write is your first argument to start a trace, and the second
> argument gives a broad subsystem to dig through.  Like:
>
> $ stap para-callgraph.stp 'kernel.function("sys_write")' \
>         'kernel.function("*@fs/*")'
>
> This script will show you timestamps, but you'll get all output whether
> it's slow or not.  Maybe you can modify it to filter some of the
> information.
>
> Another option is to use timer.profile, and then you can watch for your
> long waits.  Something like:
>
> global start_time
> probe syscall.write { start_time[tid()] = gettimeofday_us() }
> probe syscall.write.return { delete start_time[tid()] }
> probe timer.profile {
>   t = tid()
>   if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
>   {
>     printf("long sys_write in tid %d\n", t)
>     print_backtrace()
>     delete start_time[t]
>   }
> }
>
> Even this assumes that the long wait is busy on the cpu, but I think
> most likely it's an IO wait, which means you'll be scheduled off while
> it completes.  So then instead of timer.profile, or in addition, you
> could use a scheduler tracepoint.  Like:
>
> global start_time
> probe syscall.write { start_time[tid()] = gettimeofday_us() }
> probe syscall.write.return { delete start_time[tid()] }
> probe kernel.trace("sched_switch") {
>   t = task_tid($next)
>   if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
>   {
>     printf("long sys_write in tid %d\n", t)
>     print_backtrace()
>     delete start_time[t]
>   }
> }
>
> So as your thread is scheduled back onto the cpu, if it has been a long
> time, the backtrace will tell you what path it took to get scheduled off
> in the first place.
>
>
> Hope that's enough to give you some direction...
>
> Josh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-04-02 10:25   ` liang xie
@ 2014-04-02 16:32     ` Josh Stone
  2014-04-03  2:47       ` liang xie
  0 siblings, 1 reply; 8+ messages in thread
From: Josh Stone @ 2014-04-02 16:32 UTC (permalink / raw)
  To: liang xie; +Cc: systemtap

On 04/02/2014 03:25 AM, liang xie wrote:
> I tried the last scheduler script, the output just like this:
> # stap --all-modules y.stp
> WARNING: missing unwind/symbol data for module 'uprobes'
> long sys_write in tid 10162
> Systemtap probe: kernel.trace("sched_switch")
> long sys_write in tid 11510
> Systemtap probe: kernel.trace("sched_switch")
> long sys_write in tid 11288
> Systemtap probe: kernel.trace("sched_switch")
> 
> I could not figure out why "print_backtrace()" didn't output any log:)

Ah, this "Systemtap probe:" message is fairly old.

>>> My OS:2.6.32-220.el6.x86_64
>>> # stap -V
>>> Systemtap translator/driver (version 1.6/0.152 non-git sources)

Is it possible for you to update your systemtap rpm?  1.6 was released
with RHEL 6.2 (presumably CentOS 6.2 as well), and we've since released
systemtap-1.7 in 6.3, 1.8 in 6.4, and 2.3 in 6.5.  I believe at least
systemtap-1.8 will work better for you here.

Josh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-04-02 16:32     ` Josh Stone
@ 2014-04-03  2:47       ` liang xie
  2014-04-03 16:24         ` Josh Stone
  0 siblings, 1 reply; 8+ messages in thread
From: liang xie @ 2014-04-03  2:47 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

Yeh, systemtap-1.8 works for me, thanks! now i can get strace trace
like this with the above scheduler script:
long sys_write in tid 11892
 0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
 0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
 0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
 0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
 0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)
long sys_write in tid 16716
 0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
 0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
 0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
 0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
 0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)
long sys_write in tid 16716
 0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
 0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
 0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
 0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
 0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)
long sys_write in tid 17191
 0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
 0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
 0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
 0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
 0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)

But still no interesting filesystem related stack trace be found

On Thu, Apr 3, 2014 at 12:32 AM, Josh Stone <jistone@redhat.com> wrote:
> On 04/02/2014 03:25 AM, liang xie wrote:
>> I tried the last scheduler script, the output just like this:
>> # stap --all-modules y.stp
>> WARNING: missing unwind/symbol data for module 'uprobes'
>> long sys_write in tid 10162
>> Systemtap probe: kernel.trace("sched_switch")
>> long sys_write in tid 11510
>> Systemtap probe: kernel.trace("sched_switch")
>> long sys_write in tid 11288
>> Systemtap probe: kernel.trace("sched_switch")
>>
>> I could not figure out why "print_backtrace()" didn't output any log:)
>
> Ah, this "Systemtap probe:" message is fairly old.
>
>>>> My OS:2.6.32-220.el6.x86_64
>>>> # stap -V
>>>> Systemtap translator/driver (version 1.6/0.152 non-git sources)
>
> Is it possible for you to update your systemtap rpm?  1.6 was released
> with RHEL 6.2 (presumably CentOS 6.2 as well), and we've since released
> systemtap-1.7 in 6.3, 1.8 in 6.4, and 2.3 in 6.5.  I believe at least
> systemtap-1.8 will work better for you here.
>
> Josh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-04-03  2:47       ` liang xie
@ 2014-04-03 16:24         ` Josh Stone
  2014-04-08  2:15           ` liang xie
  0 siblings, 1 reply; 8+ messages in thread
From: Josh Stone @ 2014-04-03 16:24 UTC (permalink / raw)
  To: liang xie; +Cc: systemtap

On 04/02/2014 07:47 PM, liang xie wrote:
> Yeh, systemtap-1.8 works for me, thanks! now i can get strace trace
> like this with the above scheduler script:
> long sys_write in tid 11892
>  0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
>  0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
>  0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
>  0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
>  0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)
[...]
> But still no interesting filesystem related stack trace be found

Oh, I see, I made a mistake here:

> probe kernel.trace("sched_switch") {
>   t = task_tid($next)

The tracepoint will run when we are *about* to switch to $next, but
haven't yet, so the backtrace is for the previous task.  That appears to
be the idle task, so at least it seems clear you're not cpu-bound.

We have a 'scheduler.cpu_on' tapset which may work better.  That puts a
kprobe on "finish_task_switch", so it should be the right context for a
backtrace.  Like:

global start_time
probe syscall.write { start_time[tid()] = gettimeofday_us() }
probe syscall.write.return { delete start_time[tid()] }
probe scheduler.cpu_on {
  t = tid()
  if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
  {
    printf("long sys_write in tid %d\n", t)
    print_backtrace()
    delete start_time[t]
  }
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to get the detailed kernel stack trace if specified fuction takes too long to finish?
  2014-04-03 16:24         ` Josh Stone
@ 2014-04-08  2:15           ` liang xie
  0 siblings, 0 replies; 8+ messages in thread
From: liang xie @ 2014-04-08  2:15 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

Hi Josh,

Just report back, the last one is pretty helpful and works for my
situation, coooool:)  Thanks a lot!

On Fri, Apr 4, 2014 at 12:24 AM, Josh Stone <jistone@redhat.com> wrote:
> On 04/02/2014 07:47 PM, liang xie wrote:
>> Yeh, systemtap-1.8 works for me, thanks! now i can get strace trace
>> like this with the above scheduler script:
>> long sys_write in tid 11892
>>  0xffffffff814ed0a8 : thread_return+0x6d6/0x77e [kernel] (inexact)
>>  0xffffffff812c4a41 : intel_idle+0xc1/0x170 [kernel] (inexact)
>>  0xffffffff81097c6d : sched_clock_cpu+0xcd/0x110 [kernel] (inexact)
>>  0xffffffff81009e3e : cpu_idle+0xee/0x110 [kernel] (inexact)
>>  0xffffffff814e5f23 : start_secondary+0x202/0x245 [kernel] (inexact)
> [...]
>> But still no interesting filesystem related stack trace be found
>
> Oh, I see, I made a mistake here:
>
>> probe kernel.trace("sched_switch") {
>>   t = task_tid($next)
>
> The tracepoint will run when we are *about* to switch to $next, but
> haven't yet, so the backtrace is for the previous task.  That appears to
> be the idle task, so at least it seems clear you're not cpu-bound.
>
> We have a 'scheduler.cpu_on' tapset which may work better.  That puts a
> kprobe on "finish_task_switch", so it should be the right context for a
> backtrace.  Like:
>
> global start_time
> probe syscall.write { start_time[tid()] = gettimeofday_us() }
> probe syscall.write.return { delete start_time[tid()] }
> probe scheduler.cpu_on {
>   t = tid()
>   if (t in start_time && gettimeofday_us() - start_time[t] > 100000)
>   {
>     printf("long sys_write in tid %d\n", t)
>     print_backtrace()
>     delete start_time[t]
>   }
> }

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-04-08  2:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-28  4:22 How to get the detailed kernel stack trace if specified fuction takes too long to finish? liang xie
2014-03-28 16:56 ` Frank Ch. Eigler
2014-03-28 17:06 ` Josh Stone
2014-04-02 10:25   ` liang xie
2014-04-02 16:32     ` Josh Stone
2014-04-03  2:47       ` liang xie
2014-04-03 16:24         ` Josh Stone
2014-04-08  2:15           ` liang xie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).