public inbox for systemtap@sourceware.org
* Some newbie questions
@ 2016-08-07  8:08 Avi Kivity
  2016-08-07  9:16 ` Avi Kivity
  2016-08-08 14:58 ` David Smith
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-07  8:08 UTC (permalink / raw)
  To: systemtap

Hello,

A few newbie questions.

1.  Static library support

I am writing a static library, and would like to ship some scripts with 
the library for its consumers.  The library is named "seastar" and one 
of the consumers is named "scylla".

What should I write as the process name in the script?  As the writer of 
the library, I don't know the name of the process that will be consuming 
the library.  The consumer shouldn't have to write those scripts.  I 
could play sed games to adjust the scripts, but it's very sad to have to 
do this.

2. Inlined functions

 From the manual pages, it seems that inlined functions can be probed 
(minus the .return probe), but I just get an error:

semantic error: while resolving probe point: identifier 'process' at 
script/home/avi/seastar/debug/task-latency.stap:3:7
         source: probe process("scylla").function("reactor::run_tasks()") {
                       ^

semantic error: no match (similar functions: 
_ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)

I will note that "mkdir" does not sound very similar to 
"reactor::run_tasks()" (but I am not a native speaker).

3. Process CPU timers

(more of a feature request)

I'm trying to find causes of latency in my program.  To do that, I'm 
running a periodic timer and checking whether a function takes more time 
than some threshold.

Ideally, I'd be able to arm the timer on the function entry point and 
disarm it on exit, rather than have it run continuously; this would need 
to be a per-thread cpu-time timer (e.g. CLOCK_THREAD_CPUTIME_ID).

Here's my current script for reference ("running" and "start_time" need 
to become maps for it to be thread-safe):

#!/usr/bin/stap

global start_time
global running

probe begin {
     running = 0
}

probe 
process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start") 
{
     start_time = gettimeofday_us()
     running = 1
}

probe 
process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end") 
{
     running = 0
}

probe timer.ms(10) {
     now = gettimeofday_us()
     if (running && (now - start_time) > 30000) {
         printf("detected tasks running for >30ms\n")
         print_usyms(ubacktrace())
     }
}

I'd appreciate any tips as to whether there's a better way to do this.



* Re: Some newbie questions
  2016-08-07  8:08 Some newbie questions Avi Kivity
@ 2016-08-07  9:16 ` Avi Kivity
  2016-08-08 14:58 ` David Smith
  1 sibling, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-07  9:16 UTC (permalink / raw)
  To: systemtap



On 08/07/2016 11:07 AM, Avi Kivity wrote:
> Hello,
>

Hi, n00b,

> A few newbie questions.
>
> 1.  Static library support
>
> I am writing a static library, and would like to ship some scripts 
> with the library for its consumers.  The library is named "seastar" 
> and one of the consumers is named "scylla".
>
> What should I write as the process name in the script?  As the writer 
> of the library, I don't know the name of the process that will be 
> consuming the library.  The consumer shouldn't have to write those 
> scripts.  I could play sed games to adjust the scripts, but it's very 
> sad to have to do this.
>

You can use "@1" to paste a script parameter into the script.

> 2. Inlined functions
>
> From the manual pages, it seems that inlined functions can be probed 
> (minus the .return probe), but I just get an error:
>
> semantic error: while resolving probe point: identifier 'process' at 
> script/home/avi/seastar/debug/task-latency.stap:3:7
>         source: probe 
> process("scylla").function("reactor::run_tasks()") {
>                       ^
>
> semantic error: no match (similar functions: 
> _ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)
>
> I will note that "mkdir" does not sound very similar to 
> "reactor::run_tasks()" (but I am not a native speaker).
>
> 3. Process CPU timers
>
> (more of a feature request)
>
> I'm trying to find causes of latency in my program.  To do that, I'm 
> running a periodic timer and checking whether a function takes more 
> time than some threshold.
>
> Ideally, I'd be able to arm the timer on the function entry point and 
> disarm it on exit, rather than have it run continuously; this would 
> need to be a per-thread cpu-time timer (e.g. CLOCK_THREAD_CPUTIME_ID).
>
> Here's my current script for reference ("running" and "start_time" 
> need to become maps for it to be thread-safe):
>
> #!/usr/bin/stap
>
> global start_time
> global running
>
> probe begin {
>     running = 0
> }
>
> probe 
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start") 
> {
>     start_time = gettimeofday_us()
>     running = 1
> }
>
> probe 
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end") 
> {
>     running = 0
> }
>
> probe timer.ms(10) {

You should use timer.profile instead; see the sketch at the end of this 
message.

>     now = gettimeofday_us()
>     if (running && (now - start_time) > 30000) {

And pass the latency threshold as a second script parameter here.

>         printf("detected tasks running for >30ms\n")
>         print_usyms(ubacktrace())
>     }
> }
>
> I'd appreciate any tips as to whether there's a better way to do this.
>
>
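
To make the two suggestions above concrete, a minimal (untested) sketch, 
keeping the globals from the original script and taking the threshold in 
milliseconds as the second script argument:

probe timer.profile {
    if (running && (gettimeofday_us() - start_time) > $2 * 1000) {
        printf("detected tasks running for >%sms\n", @2)
        print_usyms(ubacktrace())
    }
}

(The thread-safety caveat about "running" and "start_time" still applies.)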




* Re: Some newbie questions
  2016-08-07  8:08 Some newbie questions Avi Kivity
  2016-08-07  9:16 ` Avi Kivity
@ 2016-08-08 14:58 ` David Smith
  2016-08-09  6:51   ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: David Smith @ 2016-08-08 14:58 UTC (permalink / raw)
  To: Avi Kivity, systemtap

On 08/07/2016 03:07 AM, Avi Kivity wrote:
> Hello,
> 
> A few newbie questions.
> 
> 1.  Static library support
> 
> I am writing a static library, and would like to ship some scripts with
> the library for its consumers.  The library is named "seastar" and one
> of the consumers is named "scylla".
> 
> What should I write as the process name in the script?  As the writer of
> the library, I don't know the name of the process that will be consuming
> the library.  The consumer shouldn't have to write those scripts.  I
> could play sed games to adjust the scripts, but it's very sad to have to
> do this.

The tricky part here is that your library is static. With a shared
library, you could do something like:

   probe process.library("libseastar.so").function("FOO") { }

which would get probe hits for any process that uses that shared library.

For a statically linked library, you can't do that.

Off the top of my head, I don't see any solution other than your '@1'
parameter idea.


> 2. Inlined functions
> 
> From the manual pages, it seems that inlined functions can be probed
> (minus the .return probe), but I just get an error:
> 
> semantic error: while resolving probe point: identifier 'process' at
> script/home/avi/seastar/debug/task-latency.stap:3:7
>         source: probe process("scylla").function("reactor::run_tasks()") {
>                       ^
> 
> semantic error: no match (similar functions:
> _ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)

We're at the mercy of the compiler and the quality of the debuginfo it
produces here when looking for inlined functions.

If you want to investigate further here, you'll need to do the following:

# eu-readelf -N --debug-dump=info scylla > scylla.log

In scylla.log, look and see if you can find a subprogram with the name
of the function you are interested in.
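
For example, something along these lines usually narrows it down (the 
dump prints the plain function name in quotes):

# grep -n -B2 -A8 '"run_tasks"' scylla.log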

> I will note that "mkdir" does not sound very similar to
> "reactor::run_tasks()" (but I am not a native speaker).

Actually, to my native speaker's eyes, none of the suggested functions
look that close. The way the levenshtein suggestive feature works is
that we report the top 5 scoring functions. I'd bet in this case the
scores were all bad, but these were the top 5 of the scores.

> 3. Process CPU timers
> 
> (more of a feature request)
> 
> I'm trying to find causes of latency in my program.  To do that, I'm
> running a periodic timer and checking whether a function takes more time
> than some threshold.
> 
> Ideally, I'd be able to arm the timer on the function entry point and
> disarm it on exit, rather than have it run continuously; this would need
> to be a per-thread cpu-time timer (e.g. CLOCK_THREAD_CPUTIME_ID).
> 
> Here's my current script for reference ("running" and "start_time" need
> to become maps for it to be thread-safe):
> 
> #!/usr/bin/stap
> 
> global start_time
> global running
> 
> probe begin {
>     running = 0
> }
> 
> probe
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start")
> {
>     start_time = gettimeofday_us()
>     running = 1
> }
> 
> probe
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end")
> {
>     running = 0
> }
> 
> probe timer.ms(10) {
>     now = gettimeofday_us()
>     if (running && (now - start_time) > 30000) {
>         printf("detected tasks running for >30ms\n")
>         print_usyms(ubacktrace())
>     }
> }
> 
> I'd appreciate any tips as to whether there's a better way to do this.

The above isn't really going to work well, for several reasons:

1) Your 'start_time' and 'running' variables are single globals. If
scylla is multithreaded or more than one instance is running, that isn't
going to work.

To fix this, you'd want to convert them to arrays and index them by
thread ids, like this:

    start_time[tid()] = gettimeofday_us()

2) Your ubacktrace() call in the timer function isn't going to report
what you want. At that point, you aren't running in the context of the
process, you are running in the context of whatever kernel thread is
running the timer. I'm not really sure what a user backtrace will report at
that point, but it certainly won't be useful.

Here's an (untested) version of what would work a bit better:

====
global start_time

probe
process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start")
{
    start_time[tid()] = gettimeofday_us()
}

probe
process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end")
{
    now = gettimeofday_us()
    start = start_time[tid()]
    delete start_time[tid()]
    if ((now - start) > 30000) {
         printf("detected tasks running for >30ms (%d)\n",
                (now - start))
         print_usyms(ubacktrace())
    }
}
====

If you like, you could add a timer probe to the above to report an
ongoing problem in the current process (but without printing a backtrace).
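
For example (equally untested; this reuses the start_time array and the 
same 30ms threshold):

probe timer.ms(10) {
    now = gettimeofday_us()
    foreach (t in start_time) {
        if ((now - start_time[t]) > 30000)
            printf("tid %d has been running a task for >30ms\n", t)
    }
}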

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: Some newbie questions
  2016-08-08 14:58 ` David Smith
@ 2016-08-09  6:51   ` Avi Kivity
  2016-08-09 13:54     ` Mark Wielaard
  2016-08-09 15:17     ` David Smith
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-09  6:51 UTC (permalink / raw)
  To: David Smith, systemtap



On 08/08/2016 05:58 PM, David Smith wrote:
>
>> 2. Inlined functions
>>
>>  From the manual pages, it seems that inlined functions can be probed
>> (minus the .return probe), but I just get an error:
>>
>> semantic error: while resolving probe point: identifier 'process' at
>> script/home/avi/seastar/debug/task-latency.stap:3:7
>>          source: probe process("scylla").function("reactor::run_tasks()") {
>>                        ^
>>
>> semantic error: no match (similar functions:
>> _ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)
> We're at the mercy of the compiler and the quality of the debuginfo it
> produces here when looking for inlined functions.
>
> If you want to investigate further here, you'll need to do the following:
>
> # eu-readelf -N --debug-dump=info scylla > scylla.log
>
> In scylla.log, look and see if you can find a subprogram with the name
> of the function you are interested in.

  [1417f4]      subprogram
                external             (flag_present) Yes
                name                 (strp) "run_tasks"
                decl_file            (data1) 5
                decl_line            (data2) 812
                linkage_name         (strp) 
"_ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE"
                declaration          (flag_present) Yes
                object_pointer       (ref4) [141808]
                sibling              (ref4) [141813]
  [141808]        formal_parameter
                  type                 (ref4) [1423bd]
                  artificial           (flag_present) Yes
  [14180d]        formal_parameter
                  type                 (ref4) [14a8a8]


The debugger has no trouble setting breakpoints on the function or 
showing it in backtraces.  In any case I sprinkled some static probes there.

>> 3. Process CPU timers
>>
>> (more of a feature request)
>>
>> I'm trying to find causes of latency in my program.  To do that, I'm
>> running a periodic timer and checking whether a function takes more time
>> than some threshold.
>>
>> Ideally, I'd be able to arm the timer on the function entry point and
>> disarm it on exit, rather than have it run continuously; this would need
>> to be a per-thread cpu-time timer (e.g. CLOCK_THREAD_CPUTIME_ID).
>>
>> Here's my current script for reference ("running" and "start_time" need
>> to become maps for it to be thread-safe):
>>
>> #!/usr/bin/stap
>>
>> global start_time
>> global running
>>
>> probe begin {
>>      running = 0
>> }
>>
>> probe
>> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start")
>> {
>>      start_time = gettimeofday_us()
>>      running = 1
>> }
>>
>> probe
>> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end")
>> {
>>      running = 0
>> }
>>
>> probe timer.ms(10) {
>>      now = gettimeofday_us()
>>      if (running && (now - start_time) > 30000) {
>>          printf("detected tasks running for >30ms\n")
>>          print_usyms(ubacktrace())
>>      }
>> }
>>
>> I'd appreciate any tips as to whether there's a better way to do this.
> The above isn't really going to work well, for several reasons:
>
> 1) You've only got one global 'start_time' and 'running' variables. If
> scylla is multithreaded or more than one instance is running, that isn't
> going to work.
>
> To fix this, you'd want to convert them to arrays and index them by
> thread ids, like this:
>
>      start_time[tid()] = gettimeofday_us()

Yes, I found that out and my latest version works like this, but I ran 
into severe scaling issues.

I also switched to timer.profile.  What's its resolution on normal machines?

> 2) Your ubacktrace() call in the timer function isn't going to report
> what you want. At that point, you aren't running in the context of the
> process, you are running in the context of whatever kernel thread is
> running the timer. I'm not really sure what a user backtrace will report at
> that point, but it certainly won't be useful.

Actually I got good backtraces when I instrumented a task to spin for 
100ms.  I also got some bad traces, but did not investigate more.

> Here's an (untested) version of what would work a bit better:
>
> ====
> global start_time
>
> probe
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start")
> {
>      start_time[tid()] = gettimeofday_us()
> }
>
> probe
> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end")
> {
>      now = gettimeofday_us()
>      start = start_time[tid()]
>      delete start_time[tid()]
>      if ((now - start) > 30000) {
>           printf("detected tasks running for >30ms (%d)\n",
>                  (now - start))
>           print_usyms(ubacktrace())

Here the backtrace is useless; I know exactly what it looks like at this 
point: it's two frames away from main().

To give some context, this is a run-to-completion scheduler. I have 
tasks pushed into a queue from various sources (including other tasks), 
and this bit is the part of the main loop that dequeues a task, runs it, 
and repeats.  Some tasks take too long, and I want to identify which 
ones.  I could print the last task's vtable, but that is much less 
useful than a full backtrace at the time the task is detected as taking 
too long.


>      }
> }
> ====
>
> If you like, you could add a timer probe to the above to report an
> ongoing problem in the current process (but without printing a backtrace).
>

Unfortunately I already know there is a problem and need to pinpoint 
it.  Which is where the backtrace is so useful.

I can fall back to instrumenting the code with a timer (from 
timer_create(3)), but I'd much rather have the flexibility of 
non-intrusive instrumentation.


* Re: Some newbie questions
  2016-08-09  6:51   ` Avi Kivity
@ 2016-08-09 13:54     ` Mark Wielaard
  2016-08-09 14:11       ` Avi Kivity
  2016-08-09 15:17     ` David Smith
  1 sibling, 1 reply; 32+ messages in thread
From: Mark Wielaard @ 2016-08-09 13:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: David Smith, systemtap

On Tue, Aug 09, 2016 at 09:51:22AM +0300, Avi Kivity wrote:
> On 08/08/2016 05:58 PM, David Smith wrote:
> > 2) Your ubacktrace() call in the timer function isn't going to report
> > what you want. At that point, you aren't running in the context of the
> > process, you are running in the context of whatever kernel thread is
> > running the timer. I'm not really sure what a user backtrace will report at
> > that point, but it certainly won't be useful.
> 
> Actually I got good backtraces when I instrumented a task to spin for 100ms.
> I also got some bad traces, but did not investigate more.

stap does indeed try to make this work even from a kernel context.
See runtime/stack.c:

/** Gets user space registers when available, also sets context
 * full_uregs_p if appropriate.  Should be used instead of accessing
 * context uregs field directly when (full) uregs are needed from
 * kernel context.
 */
static struct pt_regs *_stp_get_uregs(struct context *c)

But it might indeed depend on architecture and timer probe used.

Cheers,

Mark


* Re: Some newbie questions
  2016-08-09 13:54     ` Mark Wielaard
@ 2016-08-09 14:11       ` Avi Kivity
  2016-08-10 15:33         ` Frank Ch. Eigler
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-09 14:11 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: David Smith, systemtap



On 08/09/2016 02:25 PM, Mark Wielaard wrote:
> On Tue, Aug 09, 2016 at 09:51:22AM +0300, Avi Kivity wrote:
>> On 08/08/2016 05:58 PM, David Smith wrote:
>>> 2) Your ubacktrace() call in the timer function isn't going to report
>>> what you want. At that point, you aren't running in the context of the
>>> process, you are running in the context of whatever kernel thread is
>>> running the timer. I'm not really sure what a user backtrace will report at
>>> that point, but it certainly won't be useful.
>> Actually I got good backtraces when I instrumented a task to spin for 100ms.
>> I also got some bad traces, but did not investigate more.
> stap does indeed try to make this work even from a kernel context.
> See runtime/stack.c:
>
> /** Gets user space registers when available, also sets context
>   * full_uregs_p if appropriate.  Should be used instead of accessing
>   * context uregs field directly when (full) uregs are needed from
>   * kernel context.
>   */
> static struct pt_regs *_stp_get_uregs(struct context *c)
>
> But it might indeed depend on architecture and timer probe used.
>
>

I note that pid()/tid() also kept working.  A happy coincidence maybe 
(the thread was not switched out).  Maybe a per-thread timer can be added.

I'm surprised this didn't come up before.  Trying to find out why a 
function that is usually fast is occasionally slow seems like a good 
fit for systemtap.


* Re: Some newbie questions
  2016-08-09  6:51   ` Avi Kivity
  2016-08-09 13:54     ` Mark Wielaard
@ 2016-08-09 15:17     ` David Smith
  2016-08-09 15:28       ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: David Smith @ 2016-08-09 15:17 UTC (permalink / raw)
  To: Avi Kivity, systemtap

On 08/09/2016 01:51 AM, Avi Kivity wrote:
> 
> 
> On 08/08/2016 05:58 PM, David Smith wrote:
>>
>>> 2. Inlined functions
>>>
>>>  From the manual pages, it seems that inlined functions can be probed
>>> (minus the .return probe), but I just get an error:
>>>
>>> semantic error: while resolving probe point: identifier 'process' at
>>> script/home/avi/seastar/debug/task-latency.stap:3:7
>>>          source: probe
>>> process("scylla").function("reactor::run_tasks()") {
>>>                        ^
>>>
>>> semantic error: no match (similar functions:
>>> _ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)
>> We're at the mercy of the compiler and the quality of the debuginfo it
>> produces here when looking for inlined functions.
>>
>> If you want to investigate further here, you'll need to do the following:
>>
>> # eu-readelf -N --debug-dump=info scylla > scylla.log
>>
>> In scylla.log, look and see if you can find a subprogram with the name
>> of the function you are interested in.
> 
>  [1417f4]      subprogram
>                external             (flag_present) Yes
>                name                 (strp) "run_tasks"
>                decl_file            (data1) 5
>                decl_line            (data2) 812
>                linkage_name         (strp)
> "_ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE"
> 
>                declaration          (flag_present) Yes
>                object_pointer       (ref4) [141808]
>                sibling              (ref4) [141813]
>  [141808]        formal_parameter
>                  type                 (ref4) [1423bd]
>                  artificial           (flag_present) Yes
>  [14180d]        formal_parameter
>                  type                 (ref4) [14a8a8]

According to the above, that function isn't inlined. You might just try
the following to see what functions stap can see:

# stap -l 'process("scylla").function("*")' | grep run_tasks


> The debugger has no trouble setting breakpoints on the function or
> showing it in backtraces.  In any case I sprinkled some static probes
> there.
> 
>>> 3. Process CPU timers
>>>
>>> (more of a feature request)
>>>
>>> I'm trying to find causes of latency in my program.  To do that, I'm
>>> running a periodic timer and checking whether a function takes more time
>>> than some threshold.
>>>
>>> Ideally, I'd be able to arm the timer on the function entry point and
>>> disarm it on exit, rather than have it run continuously; this would need
>>> to be a per-thread cpu-time timer (e.g. CLOCK_THREAD_CPUTIME_ID).
>>>
>>> Here's my current script for reference ("running" and "start_time" need
>>> to become maps for it to be thread-safe):
>>>
>>> #!/usr/bin/stap
>>>
>>> global start_time
>>> global running
>>>
>>> probe begin {
>>>      running = 0
>>> }
>>>
>>> probe
>>> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_start")
>>>
>>> {
>>>      start_time = gettimeofday_us()
>>>      running = 1
>>> }
>>>
>>> probe
>>> process("/home/avi/urchin/build/release/scylla").mark("reactor_run_tasks_end")
>>>
>>> {
>>>      running = 0
>>> }
>>>
>>> probe timer.ms(10) {
>>>      now = gettimeofday_us()
>>>      if (running && (now - start_time) > 30000) {
>>>          printf("detected tasks running for >30ms\n")
>>>          print_usyms(ubacktrace())
>>>      }
>>> }
>>>
>>> I'd appreciate any tips as to whether there's a better way to do this.
>> The above isn't really going to work well, for several reasons:
>>
>> 1) You've only got one global 'start_time' and 'running' variables. If
>> scylla is multithreaded or more than one instance is running, that isn't
>> going to work.
>>
>> To fix this, you'd want to convert them to arrays and index them by
>> thread ids, like this:
>>
>>      start_time[tid()] = gettimeofday_us()
> 
> Yes, I found that out and my latest version works like this, but I ran
> into severe scaling issues.
> 
> I also switched to timer.profile.  What's its resolution on normal
> machines?

timer.profile probe resolution is CONFIG_HZ. Check your kernel config
and see "man stapprobes" for more details.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: Some newbie questions
  2016-08-09 15:17     ` David Smith
@ 2016-08-09 15:28       ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-09 15:28 UTC (permalink / raw)
  To: David Smith, systemtap



On 08/09/2016 06:17 PM, David Smith wrote:
> On 08/09/2016 01:51 AM, Avi Kivity wrote:
>>
>> On 08/08/2016 05:58 PM, David Smith wrote:
>>>> 2. Inlined functions
>>>>
>>>>   From the manual pages, it seems that inlined functions can be probed
>>>> (minus the .return probe), but I just get an error:
>>>>
>>>> semantic error: while resolving probe point: identifier 'process' at
>>>> script/home/avi/seastar/debug/task-latency.stap:3:7
>>>>           source: probe
>>>> process("scylla").function("reactor::run_tasks()") {
>>>>                         ^
>>>>
>>>> semantic error: no match (similar functions:
>>>> _ZN7reactor14run_exit_tasksEv, statfs, dup, mkdir, ntohs)
>>> We're at the mercy of the compiler and the quality of the debuginfo it
>>> produces here when looking for inlined functions.
>>>
>>> If you want to investigate further here, you'll need to do the following:
>>>
>>> # eu-readelf -N --debug-dump=info scylla > scylla.log
>>>
>>> In scylla.log, look and see if you can find a subprogram with the name
>>> of the function you are interested in.
>>   [1417f4]      subprogram
>>                 external             (flag_present) Yes
>>                 name                 (strp) "run_tasks"
>>                 decl_file            (data1) 5
>>                 decl_line            (data2) 812
>>                 linkage_name         (strp)
>> "_ZN7reactor9run_tasksER15circular_bufferISt10unique_ptrI4taskSt14default_deleteIS2_EESaIS5_EE"
>>
>>                 declaration          (flag_present) Yes
>>                 object_pointer       (ref4) [141808]
>>                 sibling              (ref4) [141813]
>>   [141808]        formal_parameter
>>                   type                 (ref4) [1423bd]
>>                   artificial           (flag_present) Yes
>>   [14180d]        formal_parameter
>>                   type                 (ref4) [14a8a8]
> According to the above, that function isn't inlined. You might just try
> the following to see what functions stap can see:
>
> # stap -l 'process("scylla").function("*")' | grep run_tasks

process("/usr/bin/scylla").function("run_tasks@core/reactor.cc:1362")

Strangely not qualified by "reactor::".

>> I also switched to timer.profile.  What's its resolution on normal
>> machines?
> timer.profile probe resolution is CONFIG_HZ. Check your kernel config
> and see "man stapprobes" for more details.
>

Thanks.


* Re: Some newbie questions
  2016-08-09 14:11       ` Avi Kivity
@ 2016-08-10 15:33         ` Frank Ch. Eigler
  2016-08-10 15:39           ` Avi Kivity
  2016-08-10 15:40           ` Avi Kivity
  0 siblings, 2 replies; 32+ messages in thread
From: Frank Ch. Eigler @ 2016-08-10 15:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark Wielaard, David Smith, systemtap

Avi Kivity <avi@scylladb.com> writes:

> [...]
> I'm surprised this didn't come up before.  Trying to find out why a
> function which is usually fast and very rarely slow seems to be a good
> fit for systemtap.

Like this?
https://sourceware.org/systemtap/examples/#profiling/fntimes.stp

Note though that such analysis probably cannot be performed based only
upon PC samples - or even backtrace samples.  We seem to require
trapping individual function entry/exit events.

- FChE


* Re: Some newbie questions
  2016-08-10 15:33         ` Frank Ch. Eigler
@ 2016-08-10 15:39           ` Avi Kivity
  2016-08-10 15:40           ` Avi Kivity
  1 sibling, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-10 15:39 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/10/2016 06:32 PM, Frank Ch. Eigler wrote:
> Avi Kivity <avi@scylladb.com> writes:
>
>> [...]
>> I'm surprised this didn't come up before.  Trying to find out why a
>> function which is usually fast and very rarely slow seems to be a good
>> fit for systemtap.
> Like this?
> https://sourceware.org/systemtap/examples/#profiling/fntimes.stp

Yes.  The problem is that if the function is called often (with a usually 
short running time), then systemtap will eat all of the cpu time 
spinning on an internal lock.

> Note though that such analysis probably cannot be performed based only
> upon PC samples - or even backtrace samples.  We seem to require
> trapping individual function entry/exit events.

That's why I tried systemtap.  It worked well on my desktop, but very 
badly in production.


* Re: Some newbie questions
  2016-08-10 15:33         ` Frank Ch. Eigler
  2016-08-10 15:39           ` Avi Kivity
@ 2016-08-10 15:40           ` Avi Kivity
  2016-08-10 16:47             ` Frank Ch. Eigler
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-10 15:40 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/10/2016 06:32 PM, Frank Ch. Eigler wrote:
> Avi Kivity <avi@scylladb.com> writes:
>
>> [...]
>> I'm surprised this didn't come up before.  Trying to find out why a
>> function which is usually fast and very rarely slow seems to be a good
>> fit for systemtap.
> Like this?
> https://sourceware.org/systemtap/examples/#profiling/fntimes.stp

Yes.  The problem is that if the function is called often (with a usually 
short running time), then systemtap will eat all of the cpu time 
spinning on an internal lock.

> Note though that such analysis probably cannot be performed based only
> upon PC samples - or even backtrace samples.  We seem to require
> trapping individual function entry/exit events.

That's why I tried systemtap.  It worked well on my desktop, but very 
badly in production.


* Re: Some newbie questions
  2016-08-10 15:40           ` Avi Kivity
@ 2016-08-10 16:47             ` Frank Ch. Eigler
  2016-08-11  7:51               ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Frank Ch. Eigler @ 2016-08-10 16:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark Wielaard, David Smith, systemtap

Hi -

On Wed, Aug 10, 2016 at 06:40:02PM +0300, Avi Kivity wrote:
> [...]
> Yes.  The problem is that if the function is called often (with a usual 
> short running time), then systemtap will eat all of the cpu time 
> spinning on an internal lock.

Well, not just that ...  trapping each function entry/exit has
unavoidable kernel uprobes context-switchy-type overheads.  Your perf
report may well be misattributing the cost.  (Have you tried a stap
script that merely traps all the same function calls, and has
empty probe handlers?)
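
For instance, roughly (untested; the marker names are the ones from your 
earlier script, and the counter is only there so the probes aren't 
optimized away as side-effect-free):

global hits

probe process(@1).mark("reactor_run_tasks_start"),
      process(@1).mark("reactor_run_tasks_end")
{
    hits <<< 1
}

probe end { printf("%d probe hits\n", @count(hits)) }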


> >Note though that such analysis probably cannot be performed based only
> >upon PC samples - or even backtrace samples.  We seem to require
> >trapping individual function entry/exit events.
> 
> That's why I tried systemtap.  It worked well on my desktop, but very 
> badly in production.

It may be worth experimenting with "stap --runtime=dyninst" if your
function analysis were restricted to basic Cish userspace that dyninst
can handle.


- FChE


* Re: Some newbie questions
  2016-08-10 16:47             ` Frank Ch. Eigler
@ 2016-08-11  7:51               ` Avi Kivity
  2016-08-11  8:13                 ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-11  7:51 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/10/2016 07:47 PM, Frank Ch. Eigler wrote:
> Hi -
>
> On Wed, Aug 10, 2016 at 06:40:02PM +0300, Avi Kivity wrote:
>> [...]
>> Yes.  The problem is that if the function is called often (with a usual
>> short running time), then systemtap will eat all of the cpu time
>> spinning on an internal lock.
> Well, not just that ...  trapping each function entry/exit has
> unavoidable kernel uprobes context-switchy-type overheads.

Like you say, those are unavoidable.  But at least those costs are 
handled by scalable resources.


>    Your perf
> report may well be misattributing the cost.

I think it's unlikely.  When perf says __raw_spin_lock is guilty, it 
usually is.

>    (Have you tried a stap
> script that merely traps all the same function calls, and has
> empty probe handlers?)

I can try it.

>
>
>>> Note though that such analysis probably cannot be performed based only
>>> upon PC samples - or even backtrace samples.  We seem to require
>>> trapping individual function entry/exit events.
>> That's why I tried systemtap.  It worked well on my desktop, but very
>> badly in production.
> It may be worth experimenting with "stap --runtime=dyninst" if your
> function analysis were restricted to basic Cish userspace that dyninst
> can handle.
>

Will timer.profile work with dyninst?


* Re: Some newbie questions
  2016-08-11  7:51               ` Avi Kivity
@ 2016-08-11  8:13                 ` Avi Kivity
  2016-08-11 16:18                   ` Frank Ch. Eigler
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-11  8:13 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/11/2016 10:51 AM, Avi Kivity wrote:
>
>
> On 08/10/2016 07:47 PM, Frank Ch. Eigler wrote:
>> Hi -
>>
>> On Wed, Aug 10, 2016 at 06:40:02PM +0300, Avi Kivity wrote:
>>> [...]
>>> Yes.  The problem is that if the function is called often (with a usual
>>> short running time), then systemtap will eat all of the cpu time
>>> spinning on an internal lock.
>> Well, not just that ...  trapping each function entry/exit has
>> unavoidable kernel uprobes context-switchy-type overheads.
>
> Like you say, those are unavoidable.  But at least those costs are 
> handled by scalable resources.
>
>
>>    Your perf
>> report may well be misattributing the cost.
>
> I think it's unlikely.  When perf says __raw_spin_lock is guilty, it 
> usually is.
>
>>    (Have you tried a stap
>> script that merely traps all the same function calls, and has
>> empty probe handlers?)
>
> I can try it.


It was actually pretty bad with empty handlers:

#
     66.25%  scylla [kernel.kallsyms]                             [k] 
_raw_spin_lock
                      |
                      ---_raw_spin_lock
                         |
                         |--49.95%-- 0x62ab
                         |          syscall_trace_leave
                         |          int_check_syscall_exit_work
                         |          |
                         |          |--99.08%-- epoll_pwait
                         |          | 
_ZN21reactor_backend_epoll16wait_and_processEiPK10__sigset_t
                         |          | 
_ZNSt17_Function_handlerIFbvEZN7reactor3runEvEUlvE5_E9_M_invokeERKSt9_Any_data
                         |          |          _ZN7reactor3runEv
                         |          |          |
                         |          |          |--91.80%-- 
_ZZN3smp9configureEN5boost15program_options13variables_mapEENKUlvE1_clEv.constprop.2783
                         |          |          | _Z19dpdk_thread_adaptorPv
                         |          |          | eal_thread_loop
                         |          |          | start_thread
                         |          |          |          __clone
                         |          |          |
                         |          |           --8.20%-- 
_ZN12app_template14run_deprecatedEiPPcOSt8functionIFvvEE
                         |          |                     |
                         |          | --100.00%-- main
                         |          | __libc_start_main
                         |          | _start
                         |           --0.92%-- [...]
                         |
                         |--49.46%-- 0x619b
                         |          syscall_trace_enter_phase2
                         |          tracesys_phase2
                         |          |
                         |          |--98.82%-- epoll_pwait
                         |          | 
_ZN21reactor_backend_epoll16wait_and_processEiPK10__sigset_t
                         |          | 
_ZNSt17_Function_handlerIFbvEZN7reactor3runEvEUlvE5_E9_M_invokeERKSt9_Any_data
                         |          |          _ZN7reactor3runEv
                         |          |          |
                         |          |          |--91.94%-- 
_ZZN3smp9configureEN5boost15program_options13variables_mapEENKUlvE1_clEv.constprop.2783
                         |          |          | _Z19dpdk_thread_adaptorPv
                         |          |          | eal_thread_loop
                         |          |          | start_thread
                         |          |          |          __clone
                         |          |          |
                         |          |           --8.06%-- 
_ZN12app_template14run_deprecatedEiPPcOSt8functionIFvvEE
                         |          |                     |
                         |          | --100.00%-- main
                         |          | __libc_start_main
                         |          | _start
                         |           --1.18%-- [...]
                          --0.58%-- [...]


I don't have any system call probes.  Just two empty static probes, and 
a timer.profile handler.

>
>>
>>
>>>> Note though that such analysis probably cannot be performed based only
>>>> upon PC samples - or even backtrace samples.  We seem to require
>>>> trapping individual function entry/exit events.
>>> That's why I tried systemtap.  It worked well on my desktop, but very
>>> badly in production.
>> It may be worth experimenting with "stap --runtime=dyninst" if your
>> function analysis were restricted to basic Cish userspace that dyninst
>> can handle.
>>
>
> Will timer.profile work with dyninst?
>


* Re: Some newbie questions
  2016-08-11  8:13                 ` Avi Kivity
@ 2016-08-11 16:18                   ` Frank Ch. Eigler
  2016-08-17 15:15                     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Frank Ch. Eigler @ 2016-08-11 16:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark Wielaard, David Smith, systemtap

Hi -

avi wrote:

> [...]
> >>(Have you tried a stap script that merely traps all the same
> >>function calls, and has empty probe handlers?)

> >I can try it.
> 
> It was actually pretty bad with empty handlers:
> 
> #
>     66.25%  scylla [kernel.kallsyms]   [k] > _raw_spin_lock
>                      |
>                      ---_raw_spin_lock
>                         |
>                         |--49.95%-- 0x62ab
>                         |          syscall_trace_leave
> [...]
>                         |--49.46%-- 0x619b
>                         |          syscall_trace_enter_phase2
> [...]
> 
> I don't have any system call probes.  Just two empty static probes, and 
> a timer.profile handler.

It would sure be nice if perf gave you an explanation of those 0x62ab
bits.  Maybe use dwarf unwind based callgraph collection?  The stap
runtime does sometimes create probes for internal purposes, such as
trapping exec/mmap operations (so probes on new programs can be
activated).  It is conceivable that this is the source of the spinlock
activity, but such a high rate doesn't make sense.
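
For example, something like (assuming a perf built with DWARF unwind 
support; the pid and duration are placeholders):

# perf record --call-graph dwarf -p <scylla-pid> -- sleep 10
# perf report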


> Will timer.profile work with dyninst?

Profile timers, not yet, but normal periodic ones like timer.hz(997) yes.
Unfortunately backtracing is also not yet implemented there.


- FChE


* Re: Some newbie questions
  2016-08-11 16:18                   ` Frank Ch. Eigler
@ 2016-08-17 15:15                     ` Avi Kivity
  2016-08-17 18:06                       ` David Smith
  2016-08-17 18:10                       ` Frank Ch. Eigler
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-17 15:15 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/11/2016 07:18 PM, Frank Ch. Eigler wrote:
> Hi -
>
> avi wrote:
>
>> [...]
>>>> (Have you tried a stap script that merely traps all the same
>>>> function calls, and has empty probe handlers?)
>>> I can try it.
>> It was actually pretty bad with empty handlers:
>>
>> #
>>      66.25%  scylla [kernel.kallsyms]   [k] > _raw_spin_lock
>>                       |
>>                       ---_raw_spin_lock
>>                          |
>>                          |--49.95%-- 0x62ab
>>                          |          syscall_trace_leave
>> [...]
>>                          |--49.46%-- 0x619b
>>                          |          syscall_trace_enter_phase2
>> [...]
>>
>> I don't have any system call probes.  Just two empty static probes, and
>> a timer.profile handler.
> It would sure be nice if perf gave you an explanation of those 0x62ab
> bits.  Maybe use dwarf unwind based callgraph collection?  The stap
> runtime does sometimes create probes for internal purposes, such as
> trapping exec/mmap operations (so probes on new programs can be
> activated).  It is conceivable that this is the source of the spinlock
> activity, but such a high rate doesn't make sense.

I replicated this on a more modern machine (also wider: 2s24c48t).  I had 
to pass --suppress-time-limits or stap would exit almost immediately.

My load indeed calls epoll_wait() at a high rate, with zero timeout.  
But now it's completely dominated by tracing overhead.


#!/usr/bin/stap

# usage: task_latency.stap process_name latency_threshold_ms

global start_time

probe process(@1).mark("reactor_run_tasks_single_start") {
     start_time[pid(), tid()] = gettimeofday_us()
}

probe process(@1).mark("reactor_run_tasks_single_end") {
     delete start_time[pid(), tid()]
}

probe timer.profile {
     if ([pid(), tid()] in start_time) {
         now = gettimeofday_us()
         start = start_time[pid(), tid()]
         if ((now - start) > $2 * 1000) {
             printf("detected tasks running for >%sms\n", @2)
             print_usyms(ubacktrace())
         }
     }
}


#
# Total Lost Samples: 357
#
# Samples: 308K of event 'cycles:ppp'
# Event count (approx.): 222398334161
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................... 
.....................................................................................................................................................................................................................................................
#
     50.66%  scylla           [kernel.kallsyms]    [k] 
queued_spin_lock_slowpath
             |
             ---queued_spin_lock_slowpath
                |
                |--49.99%--_raw_spin_lock
                |          |
                |          |--48.48%--task_utrace_struct
                |          |          |
                |          | |--25.21%--utrace_report_syscall_entry
                |          |          | syscall_trace_enter_phase2
                |          |          |          syscall_trace_enter
                |          |          |          do_syscall_64
                |          |          | return_from_SYSCALL_64
                |          |          |          |
                |          |          | --23.47%--epoll_pwait
                |          |          | 
reactor_backend_epoll::wait_and_process
                |          |          | |
                |          |          | 
|--15.91%--std::_Function_handler<bool (), 
reactor::run()::{lambda()#7}>::_M_invoke
                |          |          | |          reactor::run
                |          |          | | 
smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                |          |          | |          
posix_thread::start_routine
                |          |          | |          start_thread
                |          |          | |          __clone
                |          |          | |
                |          |          | 
--7.56%--std::_Function_handler<bool (), 
reactor::run()::{lambda()#8}>::_M_invoke
                |          | | logalloc::tracker::impl::compact_on_idle
                |          | |                                reactor::run
                |          | | 
smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                |          | |                                
posix_thread::start_routine
                |          | |                                start_thread
                |          | |                                __clone
                |          |          |
                |          | --23.27%--utrace_report_syscall_exit
                |          | syscall_slow_exit_work
                |          |                     do_syscall_64
                |          | return_from_SYSCALL_64
                |          |                     |
                |          | --21.92%--epoll_pwait
                |          | reactor_backend_epoll::wait_and_process
                |          | |
                |          | |--14.83%--std::_Function_handler<bool (), 
reactor::run()::{lambda()#7}>::_M_invoke
                |          | |          reactor::run
                |          | | 
smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                |          | |          posix_thread::start_routine
                |          | |          start_thread
                |          | |          __clone
                |          | |
                |          | --7.09%--std::_Function_handler<bool (), 
reactor::run()::{lambda()#8}>::_M_invoke
                | | logalloc::tracker::impl::compact_on_idle
                | |                                           reactor::run
                | | 
smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                | | posix_thread::start_routine
                | |                                           start_thread
                | |                                           __clone
                |          |
                |           --1.27%--__lock_task_sighand
                |                     |
                |                      --0.67%--send_sigqueue
                |                                posix_timer_event
                |                                cpu_timer_fire
                |                                run_posix_cpu_timers
                |                                update_process_times
                | tick_sched_handle.isra.16
                |                                tick_sched_timer
                |                                __hrtimer_run_queues
                |                                hrtimer_interrupt
:





>
>> Will timer.profile work with dyninst?
> Profile timers, not yet, but normal periodic ones like timer.hz(997) yes.
> Unfortunately backtracing is also not yet implemented there.
>
>

Well, it's  not very useful to me then.


* Re: Some newbie questions
  2016-08-17 15:15                     ` Avi Kivity
@ 2016-08-17 18:06                       ` David Smith
  2016-08-17 18:22                         ` Avi Kivity
  2016-08-17 18:10                       ` Frank Ch. Eigler
  1 sibling, 1 reply; 32+ messages in thread
From: David Smith @ 2016-08-17 18:06 UTC (permalink / raw)
  To: Avi Kivity, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

On 08/17/2016 10:15 AM, Avi Kivity wrote:

... stuff deleted ...

> My load indeed calls epoll_wait() at a high rate, with zero timeout. 
> But now it's completely dominated by tracing overhead.

I'm not sure how much it will help, but one easy fix would be to replace
all your '[pid(), tid()]' array indexing with just '[tid()]' array indexing.

Try changing that and see if the tracing overhead goes down a bit.
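
In other words, roughly (untested):

probe process(@1).mark("reactor_run_tasks_single_start") {
    start_time[tid()] = gettimeofday_us()
}

probe process(@1).mark("reactor_run_tasks_single_end") {
    delete start_time[tid()]
}

with the same change in the timer.profile probe's lookup.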

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: Some newbie questions
  2016-08-17 15:15                     ` Avi Kivity
  2016-08-17 18:06                       ` David Smith
@ 2016-08-17 18:10                       ` Frank Ch. Eigler
  2016-08-17 18:30                         ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Frank Ch. Eigler @ 2016-08-17 18:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark Wielaard, David Smith, systemtap


avi wrote:

> [...]
> My load indeed calls epoll_wait() at a high rate, with zero timeout.

OK.  That's an unusual and cpu-saturating method.  Without systemtap
active, the code must be doing millions of syscalls per second.  Are
you sure there's no better way?


> But now it's completely dominated by tracing overhead.
> [...]
>                |          |--48.48%--task_utrace_struct
>                |          | |--25.21%--utrace_report_syscall_entry
>                |          |          | syscall_trace_enter_phase2
>                |          |          |          syscall_trace_enter
>                |          |          |          do_syscall_64
>                |          |          | return_from_SYSCALL_64
>                |          |          |          |
>                |          |          | --23.47%--epoll_pwait
> [...]

OK, this implicates the systemtap runtime utrace-emulation code that
looks for mmap/exec syscalls.  Perhaps there could be a mode (a -D
setting?)  for selecting kprobes rather than syscall-tracepoints to
catch those events.


- FChE


* Re: Some newbie questions
  2016-08-17 18:06                       ` David Smith
@ 2016-08-17 18:22                         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-17 18:22 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

On 08/17/2016 09:06 PM, David Smith wrote:
> On 08/17/2016 10:15 AM, Avi Kivity wrote:
>
> ... stuff deleted ...
>
>> My load indeed calls epoll_wait() at a high rate, with zero timeout.
>> But now it's completely dominated by tracing overhead.
> I'm not sure how much it will help, but one easy fix would be to replace
> all your '[pid(), tid()]' array indexing with just '[tid()]' array indexing.
>
> Try changing that and see if the tracing overhead goes down a bit.
>

Yeah, I forgot to do that.


But it's really unlikely to help; all the overhead is coming from 
syscall tracing, which I didn't ask for.


* Re: Some newbie questions
  2016-08-17 18:10                       ` Frank Ch. Eigler
@ 2016-08-17 18:30                         ` Avi Kivity
  2016-08-17 18:35                           ` Frank Ch. Eigler
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-17 18:30 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap

On 08/17/2016 09:10 PM, Frank Ch. Eigler wrote:
> avi wrote:
>
>> [...]
>> My load indeed calls epoll_wait() at a high rate, with zero timeout.
> OK.  That's an unusual and cpu-saturating method.  Without systemtap
> active, the code must be doing millions of syscalls per second.  Are
> you sure there's no better way?

Yes.  It's not very common, but some high throughput or latency 
sensitive applications will poll.  In my case it's both high throughput 
and latency sensitive.

In any case, asking users to do fewer syscalls because it hurts 
systemtap will not work.

>
>> But now it's completely dominated by tracing overhead.
>> [...]
>>                 |          |--48.48%--task_utrace_struct
>>                 |          | |--25.21%--utrace_report_syscall_entry
>>                 |          |          | syscall_trace_enter_phase2
>>                 |          |          |          syscall_trace_enter
>>                 |          |          |          do_syscall_64
>>                 |          |          | return_from_SYSCALL_64
>>                 |          |          |          |
>>                 |          |          | --23.47%--epoll_pwait
>> [...]
> OK, this implicates the systemtap runtime utrace-emulation code that

Why is it emulating?  Is there something missing in Kconfig?  This is a 
4.6 Fedora kernel, so I expect the kitchen sink has been compiled in.

> looks for mmap/exec syscalls.  Perhaps there could be a mode (a -D
> setting?)  for selecting kprobes rather than syscall-tracepoints to
> catch those events.
>

Why not trace just mmap and exec syscalls, rather than everything? 
Surely there's a static tracepoint for those.

I'd say defaulting to taking a spinlock on every syscall is going to 
cause anyone who's using more than a laptop a lot of pain.


* Re: Some newbie questions
  2016-08-17 18:30                         ` Avi Kivity
@ 2016-08-17 18:35                           ` Frank Ch. Eigler
  2016-08-17 18:42                             ` Avi Kivity
  2016-08-25 16:21                             ` Avi Kivity
  0 siblings, 2 replies; 32+ messages in thread
From: Frank Ch. Eigler @ 2016-08-17 18:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark Wielaard, David Smith, systemtap

Hi -

On Wed, Aug 17, 2016 at 09:30:38PM +0300, Avi Kivity wrote:
> [...]
> >>My load indeed calls epoll_wait() at a high rate, with zero timeout.
> >[...]
> Yes.  It's not very common, but some high throughput or latency 
> sensitive applications will poll.  In my case it's both high throughput 
> and latency sensitive.

Interesting.  I assume you've measured that.


> [...]
> >looks for mmap/exec syscalls.  Perhaps there could be a mode (a -D
> >setting?)  for selecting kprobes rather than syscall-tracepoints to
> >catch those events.
> 
> Why not trace just mmap and exec syscalls, rather than everything? 
> Surely there's a static tracepoint for those.

Heh, in this case it's "surely not".  :-) There exists only a
multiplexed system-call-entry and a system-call-exit pair of
tracepoints.


> I'd say defaulting to taking a spinlock on every syscall is going to 
> cause anyone who's using more than a laptop a lot of pain.

There may well exist optimization opportunities that would benefit
syscall-saturated applications such as yours.


- FChE


* Re: Some newbie questions
  2016-08-17 18:35                           ` Frank Ch. Eigler
@ 2016-08-17 18:42                             ` Avi Kivity
  2016-08-25 16:21                             ` Avi Kivity
  1 sibling, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-17 18:42 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap

On 08/17/2016 09:35 PM, Frank Ch. Eigler wrote:
> Hi -
>
> On Wed, Aug 17, 2016 at 09:30:38PM +0300, Avi Kivity wrote:
>> [...]
>>>> My load indeed calls epoll_wait() at a high rate, with zero timeout.
>>> [...]
>> Yes.  It's not very common, but some high throughput or latency
>> sensitive applications will poll.  In my case it's both high throughput
>> and latency sensitive.
> Interesting.  I assume you've measured that.

Yes.  It's also well known in some circles.

If you're interested, some links:

http://www.seastar-project.org/
http://www.scylladb.com/technology/architecture/
http://dpdk.org/


>
>> [...]
>>> looks for mmap/exec syscalls.  Perhaps there could be a mode (a -D
>>> setting?)  for selecting kprobes rather than syscall-tracepoints to
>>> catch those events.
>> Why not trace just mmap and exec syscalls, rather than everything?
>> Surely there's a static tracepoint for those.
> Heh, in this case it's "surely not".  :-) There exists only a
> multiplexed system-call-entry and a system-call-exit pair of
> tracepoints.

Well, static tracepoints are a patch away, no?

Another alternative is to avoid taking that spinlock, or at least defer 
it until after you've verified the syscall is mmap or exec.  Can you 
attach an eBPF program to that tracepoint these days?  Maybe that's 
unneeded, since you're already loading native code into the kernel.
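
Concretely, I'm imagining something like this inside the handler, before 
any lock is taken (hand-waving sketch on my part; I haven't read the 
runtime code, and the helper and syscall names below are just what I'd 
guess exists):

    /* sketch: bail out lock-free for syscalls we don't care about */
    long nr = syscall_get_nr(task, regs);   /* <asm/syscall.h> */
    if (nr != __NR_mmap && nr != __NR_execve)
            return;  /* not mmap/exec: skip the locked bookkeeping entirely */
    /* only now look up the per-task state, the part that takes the lock */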

>
>
>> I'd say defaulting to taking a spinlock on every syscall is going to
>> cause anyone who's using more than a laptop a lot of pain.
> There may well exist optimization opportunities that would benefit
> syscall-saturated applications such as yours.

It's quite possible.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-17 18:35                           ` Frank Ch. Eigler
  2016-08-17 18:42                             ` Avi Kivity
@ 2016-08-25 16:21                             ` Avi Kivity
  2016-08-25 17:37                               ` David Smith
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-25 16:21 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Mark Wielaard, David Smith, systemtap



On 08/17/2016 09:35 PM, Frank Ch. Eigler wrote:
> Hi -
>
> On Wed, Aug 17, 2016 at 09:30:38PM +0300, Avi Kivity wrote:
>> [...]
>>>> My load indeed calls epoll_wait() at a high rate, with zero timeout.
>>> [...]
>> Yes.  It's not very common, but some high throughput or latency
>> sensitive applications will poll.  In my case it's both high throughput
>> and latency sensitive.
> Interesting.  I assume you've measured that.
>
>
>> [...]
>>> looks for mmap/exec syscalls.  Perhaps there could be a mode (a -D
>>> setting?)  for selecting kprobes rather than syscall-tracepoints to
>>> catch those events.
>> Why not trace just mmap and exec syscalls, rather than everything?
>> Surely there's a static tracepoint for those.
> Heh, in this case it's "surely not".  :-) There exists only a
> multiplexed system-call-entry and a system-call-exit pair of
> tracepoints.
>

Hi,

Should I wait for ongoing improvement here, or shall I look elsewhere 
for my tracing needs?

It would be a pity (for me) if I have to find another solution, because 
systemtap has all the expressiveness and integration I need.  But it has 
a dramatic impact on my application runtime.

I was able to extract some useful data with perf probe/perf record, but 
as soon as I need to qualify a probe point with runtime information, 
perf falls short.


Avi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-25 16:21                             ` Avi Kivity
@ 2016-08-25 17:37                               ` David Smith
  2016-08-26 18:39                                 ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: David Smith @ 2016-08-25 17:37 UTC (permalink / raw)
  To: Avi Kivity, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

On 08/25/2016 11:21 AM, Avi Kivity wrote:
> Hi,
> 
> Should I wait for ongoing improvement here, or shall I look elsewhere
> for my tracing needs?
> 
> It would be a pity (for me) if I have to find another solution, because
> systemtap has all the expressiveness and integration I need.  But it has
> a dramatic impact on my application runtime.
> 
> I was able to extract some useful data with perf probe/perf record, but
> as soon as I need to qualify a probe point with runtime information,
> perf falls short.

As Frank mentioned in a previous email, it might be possible for us to
switch to using straight kprobes instead of syscall tracing to handle
mmap tracing. In your use case of calling epoll_wait() lots of times per
second, that might be a *big* win.

I'll see what can be done to add that feature.
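
Very roughly, the idea is an untested sketch along these lines (the
symbol name is just a guess, and the real thing would go through our
existing kprobe machinery rather than a raw kprobe like this):

    #include <linux/kprobes.h>

    /* fires only on the mmap path, never on epoll_wait() and friends */
    static int mmap_entry(struct kprobe *p, struct pt_regs *regs)
    {
            return 0;
    }

    static struct kprobe mmap_kp = {
            .symbol_name = "sys_mmap_pgoff",  /* guessed target symbol */
            .pre_handler = mmap_entry,
    };

    /* in module init: register_kprobe(&mmap_kp); */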

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-25 17:37                               ` David Smith
@ 2016-08-26 18:39                                 ` Avi Kivity
  2016-08-26 20:08                                   ` David Smith
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-26 18:39 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

On 08/25/2016 08:37 PM, David Smith wrote:
> On 08/25/2016 11:21 AM, Avi Kivity wrote:
>> Hi,
>>
>> Should I wait for ongoing improvement here, or shall I look elsewhere
>> for my tracing needs?
>>
>> It would be a pity (for me) if I have to find another solution, because
>> systemtap has all the expressiveness and integration I need.  But it has
>> a dramatic impact on my application runtime.
>>
>> I was able to extract some useful data with perf probe/perf record, but
>> as soon as I need to qualify a probe point with runtime information,
>> perf falls short.
> As Frank mentioned in a previous email, it might be possible for us to
> switch to using straight kprobes instead of syscall tracing to handle
> mmap tracing. In your use case of calling epoll_wait() lots of times per
> second, that might be a *big* win.
>
> I'll see what can be done to add that feature.
>

Thanks a lot.  I'll be happy to test patches.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-26 18:39                                 ` Avi Kivity
@ 2016-08-26 20:08                                   ` David Smith
  2016-08-26 21:21                                     ` Josh Stone
  2016-08-28 13:30                                     ` Avi Kivity
  0 siblings, 2 replies; 32+ messages in thread
From: David Smith @ 2016-08-26 20:08 UTC (permalink / raw)
  To: Avi Kivity, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

[-- Attachment #1: Type: text/plain, Size: 1344 bytes --]

On 08/26/2016 01:39 PM, Avi Kivity wrote:
> On 08/25/2016 08:37 PM, David Smith wrote:
>> On 08/25/2016 11:21 AM, Avi Kivity wrote:
>>> Hi,
>>>
>>> Should I wait for ongoing improvement here, or shall I look elsewhere
>>> for my tracing needs?
>>>
>>> It would be a pity (for me) if I have to find another solution, because
>>> systemtap has all the expressiveness and integration I need.  But it has
>>> a dramatic impact on my application runtime.
>>>
>>> I was able to extract some useful data with perf probe/perf record, but
>>> as soon as I need to qualify a probe point with runtime information,
>>> perf falls short.
>> As Frank mentioned in a previous email, it might be possible for us to
>> switch to using straight kprobes instead of syscall tracing to handle
>> mmap tracing. In your use case of calling epoll_wait() lots of times per
>> second, that might be a *big* win.
>>
>> I'll see what can be done to add that feature.
>>
> 
> Thanks a lot.  I'll be happy to test patches.

OK, since you asked...

Here's a patch I'm testing that tries to do prefiltering when a syscall
occurs, so we don't have to take that lock.

Please rebuild with it, and let me know if it (a) works, and (b) has
lower overhead in your situation.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

[-- Attachment #2: prefilter.patch --]
[-- Type: text/x-patch, Size: 4003 bytes --]

diff --git a/runtime/stp_utrace.c b/runtime/stp_utrace.c
index bb2d663..263b302 100644
--- a/runtime/stp_utrace.c
+++ b/runtime/stp_utrace.c
@@ -26,6 +26,7 @@
 #include <trace/events/syscalls.h>
 #include "stp_task_work.c"
 #include "linux/stp_tracepoint.h"
+#include "syscall.h"
 
 #include "stp_helper_lock.h"
 
@@ -116,6 +117,8 @@ static void utrace_report_exec(void *cb_data __attribute__ ((unused)),
 #define __UTRACE_REGISTERED	1
 static atomic_t utrace_state = ATOMIC_INIT(__UTRACE_UNREGISTERED);
 
+static int __stp_utrace_only_vma_syscall_tracing = 0;
+
 // If wake_up_state() is exported, use it.
 #if defined(STAPCONF_WAKE_UP_STATE_EXPORTED)
 #define stp_wake_up_state wake_up_state
@@ -2064,6 +2067,23 @@ static void utrace_report_syscall_entry(void *cb_data __attribute__ ((unused)),
 
 	if (atomic_read(&utrace_state) != __UTRACE_REGISTERED)
 		return;
+
+	/* If we're only doing syscall tracing for VMA tracking, then
+	 * let's prefilter the syscall numbers that we need before
+	 * calling the handlers.
+	 *
+	 * This allows us to avoid finding the utrace struct, which
+	 * avoids some locking. */
+	if (__stp_utrace_only_vma_syscall_tracing) {
+		long syscall_no = _stp_syscall_get_nr(task, regs);
+		if (syscall_no != MMAP_SYSCALL_NO(task)
+		    && syscall_no != MMAP2_SYSCALL_NO(task)
+		    && syscall_no != MPROTECT_SYSCALL_NO(task)
+		    && syscall_no != MUNMAP_SYSCALL_NO(task))
+			return;
+	}
+
+	/* Grab the utrace struct for this task. */
 	utrace = task_utrace_struct(task);
 
 	/* FIXME: Is this 100% correct? */
@@ -2103,6 +2123,23 @@ static void utrace_report_syscall_exit(void *cb_data __attribute__ ((unused)),
 
 	if (atomic_read(&utrace_state) != __UTRACE_REGISTERED)
 		return;
+
+	/* If we're only doing syscall tracing for VMA tracking, then
+	 * let's prefilter the syscall numbers that we need before
+	 * calling the handlers.
+	 *
+	 * This allows us to avoid finding the utrace struct, which
+	 * avoids some locking. */
+	if (__stp_utrace_only_vma_syscall_tracing) {
+		long syscall_no = _stp_syscall_get_nr(task, regs);
+		if (syscall_no != MMAP_SYSCALL_NO(task)
+		    && syscall_no != MMAP2_SYSCALL_NO(task)
+		    && syscall_no != MPROTECT_SYSCALL_NO(task)
+		    && syscall_no != MUNMAP_SYSCALL_NO(task))
+			return;
+	}
+
+	/* Grab the utrace struct for this task. */
 	utrace = task_utrace_struct(task);
 
 	/* FIXME: Is this 100% correct? */
@@ -2469,4 +2506,10 @@ static void utrace_report_work(struct task_work *work)
 	stp_task_work_func_done();
 }
 
+/* If this is called, we're *only* doing vma syscall tracing, no other
+ * syscall tracing is being done. */
+static void stap_utrace_only_vma_syscall_tracing(void)
+{
+	__stp_utrace_only_vma_syscall_tracing = 1;
+}
 #endif	/* _STP_UTRACE_C */
diff --git a/runtime/stp_utrace.h b/runtime/stp_utrace.h
index 9f162bb..367109c 100644
--- a/runtime/stp_utrace.h
+++ b/runtime/stp_utrace.h
@@ -329,4 +329,8 @@ static inline enum utrace_resume_action utrace_resume_action(u32 action)
 	return action & UTRACE_RESUME_MASK;
 }
 
+/* If this is called, we're *only* doing vma syscall tracing, no other
+ * syscall tracing is being done. */
+static void stap_utrace_only_vma_syscall_tracing(void);
+
 #endif	/* _STP_UTRACE_H */
diff --git a/tapset-utrace.cxx b/tapset-utrace.cxx
index 8580800..4321053 100644
--- a/tapset-utrace.cxx
+++ b/tapset-utrace.cxx
@@ -1208,6 +1208,12 @@ utrace_derived_probe_group::emit_module_linux_init (systemtap_session& s)
     return;
 
   s.op->newline() << "/* ---- utrace probes ---- */";
+  if (!flags_seen[UDPF_SYSCALL] && !flags_seen[UDPF_SYSCALL_RETURN])
+    {
+      s.op->newline() << "#if !defined(CONFIG_UTRACE)";
+      s.op->newline() << "stap_utrace_only_vma_syscall_tracing();";
+      s.op->newline() << "#endif";
+    }
   s.op->newline() << "for (i=0; i<ARRAY_SIZE(stap_utrace_probes); i++) {";
   s.op->newline(1) << "struct stap_utrace_probe *p = &stap_utrace_probes[i];";
   s.op->newline() << "probe_point = p->probe->pp;"; // for error messages

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-26 20:08                                   ` David Smith
@ 2016-08-26 21:21                                     ` Josh Stone
  2016-08-28 13:30                                     ` Avi Kivity
  1 sibling, 0 replies; 32+ messages in thread
From: Josh Stone @ 2016-08-26 21:21 UTC (permalink / raw)
  To: systemtap, David Smith

On 08/26/2016 01:07 PM, David Smith wrote:
> --- a/tapset-utrace.cxx
> +++ b/tapset-utrace.cxx
> @@ -1208,6 +1208,12 @@ utrace_derived_probe_group::emit_module_linux_init (systemtap_session& s)
>      return;
>  
>    s.op->newline() << "/* ---- utrace probes ---- */";
> +  if (!flags_seen[UDPF_SYSCALL] && !flags_seen[UDPF_SYSCALL_RETURN])
> +    {
> +      s.op->newline() << "#if !defined(CONFIG_UTRACE)";
> +      s.op->newline() << "stap_utrace_only_vma_syscall_tracing();";
> +      s.op->newline() << "#endif";
> +    }

Will this be emitted if there are no utrace probes at all?

I think not, so perhaps the default should be flipped.  Assume filtered
syscalls by default, and let this code turn on the full mode if
process.syscall is in use.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-26 20:08                                   ` David Smith
  2016-08-26 21:21                                     ` Josh Stone
@ 2016-08-28 13:30                                     ` Avi Kivity
  2016-08-28 13:35                                       ` Avi Kivity
  2016-08-29 18:26                                       ` David Smith
  1 sibling, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-28 13:30 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap



On 08/26/2016 11:07 PM, David Smith wrote:
> On 08/26/2016 01:39 PM, Avi Kivity wrote:
>> On 08/25/2016 08:37 PM, David Smith wrote:
>>> On 08/25/2016 11:21 AM, Avi Kivity wrote:
>>>> Hi,
>>>>
>>>> Should I wait for ongoing improvement here, or shall I look elsewhere
>>>> for my tracing needs?
>>>>
>>>> It would be a pity (for me) if I have to find another solution, because
>>>> systemtap has all the expressiveness and integration I need.  But it has
>>>> a dramatic impact on my application runtime.
>>>>
>>>> I was able to extract some useful data with perf probe/perf record, but
>>>> as soon as I need to qualify a probe point with runtime information,
>>>> perf falls short.
>>> As Frank mentioned in a previous email, it might be possible for us to
>>> switch to using straight kprobes instead of syscall tracing to handle
>>> mmap tracing. In your use case of calling epoll_wait() lots of times per
>>> second, that might be a *big* win.
>>>
>>> I'll see what can be done to add that feature.
>>>
>> Thanks a lot.  I'll be happy to test patches.
> OK, since you asked...
>
> Here's a patch I'm testing that tries to do prefiltering when a syscall
> occurs, so we don't have to take that lock.
>
> Please rebuild with it, and let me know if it (a) works, and (b) has
> lower overhead in your situation.
>

With an unloaded system, systemtap almost vanishes from the profile.  
This is on a 2s24c48t system, running epoll_pwait() and polling on user 
memory locations in a tight loop.

When I load my system, I start to see contention on systemtap locks:

      9.26%  scylla         [kernel.kallsyms]    [k] delay_tsc
             |
             ---delay_tsc
                |
                 --9.25%--__const_udelay
                           stp_lock_probe.constprop.104
                           |
                           |--4.94%--probe_2819
                           |          stapiu_probe_prehandler
                           |          uprobe_notify_resume
                           |          exit_to_usermode_loop
                           |          prepare_exit_to_usermode
                           |          retint_user
                           |          reactor::run_tasks
                           |          reactor::run
                           |          |
                           |           --4.94%--smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                           |                     posix_thread::start_routine
                           |                     start_thread
                           |                     __clone
                           |
                            --4.30%--probe_2822
                                      stapiu_probe_prehandler
                                      uprobe_notify_resume
                                      exit_to_usermode_loop
                                      prepare_exit_to_usermode
                                      retint_user
                                      reactor::run_tasks
                                      reactor::run
                                      |
                                       --4.30%--smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                                                 posix_thread::start_routine
                                                 start_thread
                                                 __clone


      3.75%  scylla         [kernel.kallsyms]    [k] queued_spin_lock_slowpath
             |
             ---queued_spin_lock_slowpath
                |
                 --3.74%--_raw_spin_lock
                           |
                            --3.63%--task_utrace_struct
                                      |
                                      |--2.49%--utrace_report_syscall_entry
                                      |          syscall_trace_enter_phase2
                                      |          syscall_trace_enter
                                      |          do_syscall_64
                                      |          return_from_SYSCALL_64
                                      |          |
                                      |           --2.46%--epoll_pwait
                                      |                     reactor_backend_epoll::wait_and_process
                                      |                     |
                                      |                     |--1.65%--std::_Function_handler<bool (), reactor::run()::{lambda()#7}>::_M_invoke
                                      |                     |          reactor::run
                                      |                     |          |
                                      |                     |           --1.55%--smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                                      |                     |                     posix_thread::start_routine
                                      |                     |                     start_thread
                                      |                     |                     __clone
                                      |                     |
                                      |                      --0.81%--std::_Function_handler<bool (), reactor::run()::{lambda()#8}>::_M_invoke
                                      |                                logalloc::tracker::impl::compact_on_idle
                                      |                                reactor::run
                                      |                                |
                                      |                                 --0.76%--smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                                      |                                           posix_thread::start_routine
                                      |                                           start_thread
                                      |                                           __clone
                                      |
                                       --1.14%--utrace_report_syscall_exit
                                                 syscall_slow_exit_work
                                                 do_syscall_64
                                                 return_from_SYSCALL_64
                                                 |
                                                  --1.12%--epoll_pwait
                                                            reactor_backend_epoll::wait_and_process
                                                            |
                                                             --0.75%--std::_Function_handler<bool (), reactor::run()::{lambda()#7}>::_M_invoke
                                                                       reactor::run
                                                                       |
                                                                        --0.69%--smp::configure(boost::program_options::variables_map)::{lambda()#3}::operator()
                                                                                  posix_thread::start_routine
                                                                                  start_thread
                                                                                  __clone


I may be able to move the probe points to a less frequently accessed 
point (at some loss in accuracy), so this patch ought to get me 
started.  Thanks a lot!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-28 13:30                                     ` Avi Kivity
@ 2016-08-28 13:35                                       ` Avi Kivity
  2016-08-28 13:45                                         ` Avi Kivity
  2016-08-29 18:26                                       ` David Smith
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2016-08-28 13:35 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap



On 08/28/2016 04:29 PM, Avi Kivity wrote:
>
>
> I may be able to move the probe points to a less frequently accessed 
> point (at some loss in accuracy), so this patch ought to get me 
> started.  Thanks a lot!
>

Alas, even with a light load I see delay_tsc jump to the #1 spot in the 
profile, so without a thread local variable, I can't use systemtap.
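
(To spell out what I mean by thread-local, as a hand-wavy sketch rather
than a patch, with made-up names: keep the fast-path bookkeeping
per-CPU/per-thread instead of behind one shared lock.)

    /* today, roughly: every thread serializes on shared state */
    spin_lock(&shared_probe_lock);
    shared_probe_state++;
    spin_unlock(&shared_probe_lock);

    /* what I need: per-CPU (or per-task) state, lock-free on the fast path */
    static DEFINE_PER_CPU(unsigned long, probe_state);  /* <linux/percpu.h> */
    this_cpu_inc(probe_state);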

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-28 13:35                                       ` Avi Kivity
@ 2016-08-28 13:45                                         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-28 13:45 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

On 08/28/2016 04:34 PM, Avi Kivity wrote:
>
>
> On 08/28/2016 04:29 PM, Avi Kivity wrote:
>>
>>
>> I may be able to move the probe points to a less frequently accessed 
>> point (at some loss in accuracy), so this patch ought to get me 
>> started.  Thanks a lot!
>>
>
> Alas, even with a light load I see delay_tsc jump to the #1 spot in 
> the profile, so without a thread local variable, I can't use systemtap.

In addition, my server crashed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-28 13:30                                     ` Avi Kivity
  2016-08-28 13:35                                       ` Avi Kivity
@ 2016-08-29 18:26                                       ` David Smith
  2016-08-31 16:44                                         ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: David Smith @ 2016-08-29 18:26 UTC (permalink / raw)
  To: Avi Kivity, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap

[-- Attachment #1: Type: text/plain, Size: 1906 bytes --]

On 08/28/2016 08:29 AM, Avi Kivity wrote:
> 
> 
> On 08/26/2016 11:07 PM, David Smith wrote:
>> On 08/26/2016 01:39 PM, Avi Kivity wrote:
>>> On 08/25/2016 08:37 PM, David Smith wrote:
>>>> On 08/25/2016 11:21 AM, Avi Kivity wrote:
>>>>> Hi,
>>>>>
>>>>> Should I wait for ongoing improvement here, or shall I look elsewhere
>>>>> for my tracing needs?
>>>>>
>>>>> It would be a pity (for me) if I have to find another solution,
>>>>> because
>>>>> systemtap has all the expressiveness and integration I need.  But
>>>>> it has
>>>>> a dramatic impact on my application runtime.
>>>>>
>>>>> I was able to extract some useful data with perf probe/perf record,
>>>>> but
>>>>> as soon as I need to qualify a probe point with runtime information,
>>>>> perf falls short.
>>>> As Frank mentioned in a previous email, it might be possible for us to
>>>> switch to using straight kprobes instead of syscall tracing to handle
>>>> mmap tracing. In your use case of calling epoll_wait() lots of times
>>>> per
>>>> second, that might be a *big* win.
>>>>
>>>> I'll see what can be done to add that feature.
>>>>
>>> Thanks a lot.  I'll be happy to test patches.
>> OK, since you asked...
>>
>> Here's a patch I'm testing that tries to do prefiltering when a syscall
>> occurs, so we don't have to take that lock.
>>
>> Please rebuild with it, and let me know if it (a) works, and (b) has
>> lower overhead in your situation.
>>
> 
> With an unloaded system, systemtap almost vanishes from the profile. 
> This is on a 2s24c48t system, running epoll_pwait() and polling on user
> memory locations in a tight loop.

... stuff deleted ...

After looking into it, I think Josh was right: I had the test
backwards. Can you try the following patch (after reverting the last
patch) and let us know what you find?

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

[-- Attachment #2: prefilter2.patch --]
[-- Type: text/x-patch, Size: 3855 bytes --]

diff --git a/runtime/stp_utrace.c b/runtime/stp_utrace.c
index bb2d663..d2d5b70 100644
--- a/runtime/stp_utrace.c
+++ b/runtime/stp_utrace.c
@@ -26,6 +26,7 @@
 #include <trace/events/syscalls.h>
 #include "stp_task_work.c"
 #include "linux/stp_tracepoint.h"
+#include "syscall.h"
 
 #include "stp_helper_lock.h"
 
@@ -116,6 +117,8 @@ static void utrace_report_exec(void *cb_data __attribute__ ((unused)),
 #define __UTRACE_REGISTERED	1
 static atomic_t utrace_state = ATOMIC_INIT(__UTRACE_UNREGISTERED);
 
+static int __stp_utrace_syscall_tracing = 0;
+
 // If wake_up_state() is exported, use it.
 #if defined(STAPCONF_WAKE_UP_STATE_EXPORTED)
 #define stp_wake_up_state wake_up_state
@@ -2064,6 +2067,23 @@ static void utrace_report_syscall_entry(void *cb_data __attribute__ ((unused)),
 
 	if (atomic_read(&utrace_state) != __UTRACE_REGISTERED)
 		return;
+
+	/* If we're only doing syscall tracing for VMA tracking, then
+	 * let's prefilter the syscall numbers that we need before
+	 * calling the handlers.
+	 *
+	 * This allows us to avoid finding the utrace struct, which
+	 * avoids some locking. */
+	if (!__stp_utrace_syscall_tracing) {
+		long syscall_no = _stp_syscall_get_nr(task, regs);
+		if (syscall_no != MMAP_SYSCALL_NO(task)
+		    && syscall_no != MMAP2_SYSCALL_NO(task)
+		    && syscall_no != MPROTECT_SYSCALL_NO(task)
+		    && syscall_no != MUNMAP_SYSCALL_NO(task))
+			return;
+	}
+
+	/* Grab the utrace struct for this task. */
 	utrace = task_utrace_struct(task);
 
 	/* FIXME: Is this 100% correct? */
@@ -2103,6 +2123,23 @@ static void utrace_report_syscall_exit(void *cb_data __attribute__ ((unused)),
 
 	if (atomic_read(&utrace_state) != __UTRACE_REGISTERED)
 		return;
+
+	/* If we're only doing syscall tracing for VMA tracking, then
+	 * let's prefilter the syscall numbers that we need before
+	 * calling the handlers.
+	 *
+	 * This allows us to avoid finding the utrace struct, which
+	 * avoids some locking. */
+	if (!__stp_utrace_syscall_tracing) {
+		long syscall_no = _stp_syscall_get_nr(task, regs);
+		if (syscall_no != MMAP_SYSCALL_NO(task)
+		    && syscall_no != MMAP2_SYSCALL_NO(task)
+		    && syscall_no != MPROTECT_SYSCALL_NO(task)
+		    && syscall_no != MUNMAP_SYSCALL_NO(task))
+			return;
+	}
+
+	/* Grab the utrace struct for this task. */
 	utrace = task_utrace_struct(task);
 
 	/* FIXME: Is this 100% correct? */
@@ -2469,4 +2506,9 @@ static void utrace_report_work(struct task_work *work)
 	stp_task_work_func_done();
 }
 
+/* If this is called, we're doing utrace-based syscall tracing. */
+static void stap_utrace_syscall_tracing(void)
+{
+	__stp_utrace_syscall_tracing = 1;
+}
 #endif	/* _STP_UTRACE_C */
diff --git a/runtime/stp_utrace.h b/runtime/stp_utrace.h
index 9f162bb..34968b1 100644
--- a/runtime/stp_utrace.h
+++ b/runtime/stp_utrace.h
@@ -329,4 +329,7 @@ static inline enum utrace_resume_action utrace_resume_action(u32 action)
 	return action & UTRACE_RESUME_MASK;
 }
 
+/* If this is called, we're doing utrace-based syscall tracing. */
+static void stap_utrace_syscall_tracing(void);
+
 #endif	/* _STP_UTRACE_H */
diff --git a/tapset-utrace.cxx b/tapset-utrace.cxx
index 8580800..154c335 100644
--- a/tapset-utrace.cxx
+++ b/tapset-utrace.cxx
@@ -1208,6 +1208,12 @@ utrace_derived_probe_group::emit_module_linux_init (systemtap_session& s)
     return;
 
   s.op->newline() << "/* ---- utrace probes ---- */";
+  if (flags_seen[UDPF_SYSCALL] || flags_seen[UDPF_SYSCALL_RETURN])
+    {
+      s.op->newline() << "#if !defined(CONFIG_UTRACE)";
+      s.op->newline() << "stap_utrace_syscall_tracing();";
+      s.op->newline() << "#endif";
+    }
   s.op->newline() << "for (i=0; i<ARRAY_SIZE(stap_utrace_probes); i++) {";
   s.op->newline(1) << "struct stap_utrace_probe *p = &stap_utrace_probes[i];";
   s.op->newline() << "probe_point = p->probe->pp;"; // for error messages

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Some newbie questions
  2016-08-29 18:26                                       ` David Smith
@ 2016-08-31 16:44                                         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2016-08-31 16:44 UTC (permalink / raw)
  To: David Smith, Frank Ch. Eigler; +Cc: Mark Wielaard, systemtap



On 08/29/2016 09:26 PM, David Smith wrote:
> On 08/28/2016 08:29 AM, Avi Kivity wrote:
>>
>> On 08/26/2016 11:07 PM, David Smith wrote:
>>> On 08/26/2016 01:39 PM, Avi Kivity wrote:
>>>> On 08/25/2016 08:37 PM, David Smith wrote:
>>>>> On 08/25/2016 11:21 AM, Avi Kivity wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Should I wait for ongoing improvement here, or shall I look elsewhere
>>>>>> for my tracing needs?
>>>>>>
>>>>>> It would be a pity (for me) if I have to find another solution,
>>>>>> because
>>>>>> systemtap has all the expressiveness and integration I need.  But
>>>>>> it has
>>>>>> a dramatic impact on my application runtime.
>>>>>>
>>>>>> I was able to extract some useful data with perf probe/perf record,
>>>>>> but
>>>>>> as soon as I need to qualify a probe point with runtime information,
>>>>>> perf falls short.
>>>>> As Frank mentioned in a previous email, it might be possible for us to
>>>>> switch to using straight kprobes instead of syscall tracing to handle
>>>>> mmap tracing. In your use case of calling epoll_wait() lots of times
>>>>> per
>>>>> second, that might be a *big* win.
>>>>>
>>>>> I'll see what can be done to add that feature.
>>>>>
>>>> Thanks a lot.  I'll be happy to test patches.
>>> OK, since you asked...
>>>
>>> Here's a patch I'm testing that tries to do prefiltering when a syscall
>>> occurs, so we don't have to take that lock.
>>>
>>> Please rebuild with it, and let me know if it (a) works, and (b) has
>>> lower overhead in your situation.
>>>
>> With an unloaded system, systemtap almost vanishes from the profile.
>> This is on a 2s24c48t system, running epoll_pwait() and polling on user
>> memory locations in a tight loop.
> ... stuff deleted ...
>
> After looking into it, I think Josh was right: I had the test
> backwards. Can you try the following patch (after reverting the last
> patch) and let us know what you find?
>

With the new patch, I get

ERROR: probe overhead exceeded threshold
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 2, skipped probes: 0
WARNING: /home/avi/systemtap.f23/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

With -g --suppress-time-limits, I get the usual spinlock badness.


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-08-31 16:44 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-07  8:08 Some newbie questions Avi Kivity
2016-08-07  9:16 ` Avi Kivity
2016-08-08 14:58 ` David Smith
2016-08-09  6:51   ` Avi Kivity
2016-08-09 13:54     ` Mark Wielaard
2016-08-09 14:11       ` Avi Kivity
2016-08-10 15:33         ` Frank Ch. Eigler
2016-08-10 15:39           ` Avi Kivity
2016-08-10 15:40           ` Avi Kivity
2016-08-10 16:47             ` Frank Ch. Eigler
2016-08-11  7:51               ` Avi Kivity
2016-08-11  8:13                 ` Avi Kivity
2016-08-11 16:18                   ` Frank Ch. Eigler
2016-08-17 15:15                     ` Avi Kivity
2016-08-17 18:06                       ` David Smith
2016-08-17 18:22                         ` Avi Kivity
2016-08-17 18:10                       ` Frank Ch. Eigler
2016-08-17 18:30                         ` Avi Kivity
2016-08-17 18:35                           ` Frank Ch. Eigler
2016-08-17 18:42                             ` Avi Kivity
2016-08-25 16:21                             ` Avi Kivity
2016-08-25 17:37                               ` David Smith
2016-08-26 18:39                                 ` Avi Kivity
2016-08-26 20:08                                   ` David Smith
2016-08-26 21:21                                     ` Josh Stone
2016-08-28 13:30                                     ` Avi Kivity
2016-08-28 13:35                                       ` Avi Kivity
2016-08-28 13:45                                         ` Avi Kivity
2016-08-29 18:26                                       ` David Smith
2016-08-31 16:44                                         ` Avi Kivity
2016-08-09 15:17     ` David Smith
2016-08-09 15:28       ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).