public inbox for systemtap@sourceware.org
* LTTng-UST vs SystemTap userspace tracing benchmarks
@ 2011-02-15 15:53 Julien Desfossez
  2011-02-15 16:25 ` [ltt-dev] " William Cohen
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Julien Desfossez @ 2011-02-15 15:53 UTC (permalink / raw)
  To: ltt-dev, systemtap; +Cc: Mathieu Desnoyers, dominique.toupin

LTTng-UST vs SystemTap userspace tracing benchmarks

February 15th, 2011

Authors: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
         Julien Desfossez <julien.desfossez@polymtl.ca>

-- Introduction

The purpose of this benchmark is to compare the performance for
userspace tracing of SystemTap and LTTng-UST. The goal is to show that
the two tools are complementary since SystemTap doesn't seem to be able
to handle tracing applications with a high throughput of trace data.

-- Benchmark
10 million events generated per thread; the number of threads varies.  Each
event generates a time-stamp and contains a 4-byte integer value.
Synthetic workload: cache-hot test, with the event-writing function called
in a loop.  On an 8-core Intel Xeon (2x 4-core) E5405 at 2.0GHz, 16GB RAM,
running Linux 2.6.37 (custom build, with utrace patches, debuginfo
enabled and LTTng trace clock available)

UST 0.11, hooking on user-space Tracepoints
* UST tuning : Normal (blocking) mode, 16 buffers, 4k each
* We test UST with the LTTng Trace Clock (w/ TC) and with the standard
clock infrastructure (w/o TC)

SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space
static markup.
* SystemTap probe (stap testutrace.stp -F) :
probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
    printf("%d : %s\n", gettimeofday_ns(), $arg1);
}

-- Results
0) Baseline : running the program without any instrumentation

                            TOTAL CPU TIME
Number of threads           baseline
                1           0:0.33
                2           0:0.33
                4           0:0.33
                8           0:0.33


1) Flight recorder tracing comparison UST vs SystemTap
                            TOTAL CPU TIME
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       0:01.81         0:02.25         0:58.36
                2       0:01.86         0:02.13         1:49.94
                4       0:01.86         0:02.22         2:38.49
                8       0:01.97         0:02.14         9:29.58

                            TOTAL CPU TIME (ns/event)
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       181             225             5836
                2       186             213             10994
                4       186             222             15849
                8       197             204             56958

                            UST SPEEDUP
Number of threads       UST w/ TC       UST w/o TC
                1       32x             25x
                2       59x             51x
                4       85x             71x
                8       289x            279x


2) Tracing to disk comparison UST vs SystemTap (trace output fits in
page cache)
                            TOTAL CPU TIME
Number of threads       UST w/ TC    UST w/o TC   SystemTap
                1       0:01.82      0:02.11      1:01.12 (128622 lost)
                2       0:01.95      0:02.14      1:44.20 (397859 lost)
                4       0:01.97      0:02.31      2:38.13 (360549 lost)
                8       0:02.28      0:02.68      9:29.36 (158538 lost)

                            TOTAL CPU TIME (ns/event)
Number of threads       UST w/ TC       UST w/o TC      SystemTap
                1       182             211             6112
                2       195             214             10420
                4       197             231             15813
                8       228             268             56936

                            UST SPEEDUP
Number of threads       UST w/ TC       UST w/o TC
                1       33x             28x
                2       53x             48x
                4       80x             68x
                8       249x            212x

                            OUTPUT SIZE (MB)
Number of threads       UST         SystemTap     UST Output compression
                1       77          271           3.52
                2       153         554           3.62
                4       306         1097          3.58
                8       612         2214          3.61
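
The ns/event and speedup columns can be re-derived from the CPU times.
A minimal C sketch, assuming (as the tables suggest) that each reported
time is divided by 10 million events and that speedups are truncated to
integers; the helper names are made up for illustration:

```c
#include <assert.h>

/* Event count used as the divisor in the ns/event tables (assumption:
 * 10 million, regardless of the number of threads). */
#define EVENTS_PER_RUN 10000000LL

/* Convert a reported time "M:SS.hh" (minutes, seconds, hundredths of a
 * second) to nanoseconds. One hundredth of a second is 10^7 ns. */
long long time_to_ns(int minutes, int seconds, int hundredths)
{
    assert(seconds >= 0 && seconds < 60);
    assert(hundredths >= 0 && hundredths < 100);
    return ((minutes * 60LL + seconds) * 100 + hundredths) * 10000000LL;
}

/* Average cost of one event, in nanoseconds. */
long long ns_per_event(long long total_ns)
{
    return total_ns / EVENTS_PER_RUN;
}

/* Ratio of the two per-event costs, truncated like the table entries. */
long long speedup(long long stap_ns_per_event, long long ust_ns_per_event)
{
    assert(ust_ns_per_event > 0);
    return stap_ns_per_event / ust_ns_per_event;
}
```

For example, ns_per_event(time_to_ns(0, 1, 81)) gives 181 and
speedup(56958, 197) gives 289, matching the 1-thread and 8-thread
"UST w/ TC" flight recorder entries.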


--  Conclusions

For flight recorder tracing, UST is 289 times faster than SystemTap on
an 8-core system with a LTTng kernel and 279 times with a vanilla+utrace
kernel.

When recording traces to disk, UST is 249 times faster than SystemTap on
an 8-core system with a LTTng kernel and 212 times with a vanilla+utrace
kernel.
Only a small part of the UST speedup over SystemTap is due to the more
compressed size of its output (binary for UST vs text for SystemTap).

SystemTap does not scale for multithreaded applications running on
multi-core systems. UST scales linearly with the number of cores for
flight recorder tracing, and almost linearly when saving tracing output
to the page cache.

This study proves that LTTng-UST and SystemTap are two tools with a
complementary purpose. LTTng-UST is more efficient in extracting a high
volume of trace data, which allows a developer or a system engineer to
diagnose an unknown problem, whereas SystemTap is more targeted to
provide a quick interface for instrumenting specific problems.


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 15:53 LTTng-UST vs SystemTap userspace tracing benchmarks Julien Desfossez
@ 2011-02-15 16:25 ` William Cohen
  2011-02-17 16:21   ` Julien Desfossez
  2011-02-15 16:26 ` Frank Ch. Eigler
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: William Cohen @ 2011-02-15 16:25 UTC (permalink / raw)
  To: Julien Desfossez; +Cc: ltt-dev, systemtap, dominique.toupin, Mathieu Desnoyers

On 02/15/2011 10:53 AM, Julien Desfossez wrote:
> LTTng-UST vs SystemTap userspace tracing benchmarks
> 
> February 15th, 2011
> 
> Authors: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>          Julien Desfossez <julien.desfossez@polymtl.ca>

> * SystemTap probe (stap testutrace.stp -F) :
> probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
>     printf("%d : %s\n", gettimeofday_ns(), $arg1);
> }



Hi Julien,

How much of the SystemTap overhead is due to the printf() statement in the probe? What is the run time for the following:

probe process("./.libs/tracepoint_benchmark").mark("single_trace") {}


Is the code for the benchmarks available, so we can take a look at reducing the overhead of SystemTap?

-Will


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 15:53 LTTng-UST vs SystemTap userspace tracing benchmarks Julien Desfossez
  2011-02-15 16:25 ` [ltt-dev] " William Cohen
@ 2011-02-15 16:26 ` Frank Ch. Eigler
  2011-02-15 16:54   ` Mathieu Desnoyers
  2011-02-15 17:00   ` [ltt-dev] " Stefan Hajnoczi
  2011-02-15 17:03 ` Mark Wielaard
  2011-02-16 15:30 ` Tom Tromey
  3 siblings, 2 replies; 23+ messages in thread
From: Frank Ch. Eigler @ 2011-02-15 16:26 UTC (permalink / raw)
  To: Julien Desfossez; +Cc: ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin


Julien Desfossez <julien.desfossez@polymtl.ca> writes:

> LTTng-UST vs SystemTap userspace tracing benchmarks

Thank you.

> [...]  For flight recorder tracing, UST is 289 times faster than
> SystemTap on an 8-core system with a LTTng kernel and 279 times with
> a vanilla+utrace kernel.

This is not that surprising, considering how the two tools work.  UST
does its work in userspace, and is therefore focused on an individual
process's activities.  Systemtap does its work in kernelspace, and can
therefore focus on many different processes and the kernel at the same
time.  This entails some ring transitions.

(One may imagine a future version of systemtap where scripts that
happen to independently probe single processes are executed with a
pure userspace backend, but this is not in our immediate roadmap.)

> SystemTap does not scale for multithreaded applications running on
> multi-core systems.  [...]

We know of at least one kernel problem in this area,
<http://sourceware.org/PR5660>, which may be fixable via core or
utrace or uprobes changes.


> This study proves that LTTng-UST and SystemTap are two tools with a
> complementary purpose.  [...]

Strictly speaking, it shows that their performance differs
dramatically in this sort of microbenchmark.  

Thank you for your data gathering.

- FChE


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 16:26 ` Frank Ch. Eigler
@ 2011-02-15 16:54   ` Mathieu Desnoyers
  2011-02-15 17:39     ` Frank Ch. Eigler
  2011-02-15 17:00   ` [ltt-dev] " Stefan Hajnoczi
  1 sibling, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2011-02-15 16:54 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Julien Desfossez, ltt-dev, systemtap, dominique.toupin

Hi Frank,

* Frank Ch. Eigler (fche@redhat.com) wrote:
> 
> Julien Desfossez <julien.desfossez@polymtl.ca> writes:
> 
> > LTTng-UST vs SystemTap userspace tracing benchmarks
> 
> Thank you.
> 
> > [...]  For flight recorder tracing, UST is 289 times faster than
> > SystemTap on an 8-core system with a LTTng kernel and 279 times with
> > a vanilla+utrace kernel.
> 
> This is not that surprising, considering how the two tools work.  UST
> does its work in userspace,

This first part of the statement is true,

> and is therefore focused on an individual
> process's activities.

This is incorrect. LTTng and UST gather traces from multiple processes and from
the kernel, and merge them in post-processing. This toolset is therefore focused
on system-wide activity analysis.

> Systemtap does its work in kernelspace, and can
> therefore focus on many different processes and the kernel at the same
> time.  This entails some ring transitions.

The difference between UST and SystemTAP is not the target goal, but rather
where the computation is done: UST uses buffering to send its trace output;
conversely, SystemTAP performs the ring transition for each individual event.
This is a core design difference that partly explains the dramatic
performance difference we see here.

> 
> (One may imagine a future version of systemtap where scripts that
> happen to independently probe single processes are executed with a
> pure userspace backend, but this is not in our immediate roadmap.)
> 
> > SystemTap does not scale for multithreaded applications running on
> > multi-core systems.  [...]
> 
> We know of at least one kernel problem in this area,
> <http://sourceware.org/PR5660>, which may be fixable via core or
> utrace or uprobes changes.
> 
> 
> > This study proves that LTTng-UST and SystemTap are two tools with a
> > complementary purpose.  [...]
> 
> Strictly speaking, it shows that their performance differs
> dramatically in this sort of microbenchmark.  

Strictly speaking, you are right. I've done performance testing on LTTng (the
kernel equivalent of UST, using very similar technology) on real workloads
traced at the kernel level, and this kind of microbenchmark actually shows a
lower bound of the tracer performance impact per probe (the upper bound being up
to a factor of 3 higher due to cache misses in the trace buffers). All the details
are presented in http://lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf,
Chapters 5.5, 8.4 and 8.5. Now the overall performance impact must indeed be
weighted by the number of times the tracer is called by the application. If, for
example, we trace standard tests like "dbench" at the kernel-level with LTTng,
we get a 3% performance hit. If we multiply this by 294, this gets in the area
of a 882% performance hit on the system, which is likely to have some noticeable
impact on the end user experience.

> 
> Thank you for your data gathering.

Thanks for your reply. We'll be glad to help out if we can.

Mathieu

> 
> - FChE

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 16:26 ` Frank Ch. Eigler
  2011-02-15 16:54   ` Mathieu Desnoyers
@ 2011-02-15 17:00   ` Stefan Hajnoczi
  2011-02-15 17:04     ` Mathieu Desnoyers
  2011-02-16 10:56     ` Mark Wielaard
  1 sibling, 2 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2011-02-15 17:00 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Julien Desfossez, dominique.toupin, ltt-dev, Mathieu Desnoyers,
	systemtap

On Tue, Feb 15, 2011 at 4:26 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
>
> Julien Desfossez <julien.desfossez@polymtl.ca> writes:
>
>> LTTng-UST vs SystemTap userspace tracing benchmarks
>
> Thank you.
>
>> [...]  For flight recorder tracing, UST is 289 times faster than
>> SystemTap on an 8-core system with a LTTng kernel and 279 times with
>> a vanilla+utrace kernel.
>
> This is not that surprising, considering how the two tools work.  UST
> does its work in userspace, and is therefore focused on an individual
> process's activities.  Systemtap does its work in kernelspace, and can
> therefore focus on many different processes and the kernel at the same
> time.  This entails some ring transitions.
>
> (One may imagine a future version of systemtap where scripts that
> happen to independently probe single processes are executed with a
> pure userspace backend, but this is not in our immediate roadmap.)

What is the fundamental mechanism that UST and SystemTap use for tracing?

e.g. Here's a guess:
UST: a conditional function call within the same process
SystemTap: a software interrupt on x86

I don't know the implementation details but would be interested in
understanding this.

Stefan


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 15:53 LTTng-UST vs SystemTap userspace tracing benchmarks Julien Desfossez
  2011-02-15 16:25 ` [ltt-dev] " William Cohen
  2011-02-15 16:26 ` Frank Ch. Eigler
@ 2011-02-15 17:03 ` Mark Wielaard
  2011-02-16 15:30 ` Tom Tromey
  3 siblings, 0 replies; 23+ messages in thread
From: Mark Wielaard @ 2011-02-15 17:03 UTC (permalink / raw)
  To: Julien Desfossez; +Cc: ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

On Tue, 2011-02-15 at 10:53 -0500, Julien Desfossez wrote:
> The purpose of this benchmark is to compare the performance for
> userspace tracing of SystemTap and LTTng-UST. The goal is to show that
> the two tools are complementary since SystemTap doesn't seem to be able
> to handle tracing applications with a high throughput of trace data.

Yeah, the more "natural" way for someone using SystemTap would be to
write probe handlers that can check some logic conditions about the
probe environment and/or keep statistics instead of just dumping output
whenever a probe is being hit.

> UST 0.11, hooking on user-space Tracepoints

Can UST user-space tracepoints hook onto systemtap sdt user space
markers? That would be ideal since then programs need to be instrumented
only once. That is why the systemtap sdt markers are source compatible
with the dtrace macros, so you can just reuse any that are already
there. gdb is also adding support for using them now, so that you can
also just put a "normal" breakpoint on them and use them in the context
of a gdb debugging session.

> SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space
> static markup.
> * SystemTap probe (stap testutrace.stp -F) :
> probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
>     printf("%d : %s\n", gettimeofday_ns(), $arg1);
> }

Does it matter what you put in the printf statement? The formatting and
getting the date stamp might impact performance of course. If the goal
is just to see how many probes are being hit, then you could also use a
simple counter instead of a printf statement in the probe handler.

Thanks,

Mark


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 17:00   ` [ltt-dev] " Stefan Hajnoczi
@ 2011-02-15 17:04     ` Mathieu Desnoyers
  2011-02-16 10:56     ` Mark Wielaard
  1 sibling, 0 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2011-02-15 17:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Frank Ch. Eigler, Julien Desfossez, dominique.toupin, ltt-dev, systemtap

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Tue, Feb 15, 2011 at 4:26 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
> >
> > Julien Desfossez <julien.desfossez@polymtl.ca> writes:
> >
> >> LTTng-UST vs SystemTap userspace tracing benchmarks
> >
> > Thank you.
> >
> >> [...]  For flight recorder tracing, UST is 289 times faster than
> >> SystemTap on an 8-core system with a LTTng kernel and 279 times with
> >> a vanilla+utrace kernel.
> >
> > This is not that surprising, considering how the two tools work.  UST
> > does its work in userspace, and is therefore focused on an individual
> > process's activities.  Systemtap does its work in kernelspace, and can
> > therefore focus on many different processes and the kernel at the same
> > time.  This entails some ring transitions.
> >
> > (One may imagine a future version of systemtap where scripts that
> > happen to independently probe single processes are executed with a
> > pure userspace backend, but this is not in our immediate roadmap.)
> 

Hi Stefan,

> What is the fundamental mechanism that UST and SystemTap use for tracing?
> 
> e.g. Here's a guess:
> UST: a conditional function call within the same process

Yes, UST can manage to stay within the same process because tracing is buffered:
it only has to write the trace data into shared-memory buffers. Therefore, a
simple function call is sufficient.

> SystemTap: a software interrupt on x86

Yep, AFAIK, SystemTap needs to receive everything at the kernel-level to perform
its system-wide data processing at kernel-level, without buffering between the
data extraction from the instrumented applications and the in-kernel execution
of SystemTap. This leads to a strong dependency on using a software interrupt
for every event.
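
As a rough illustration of the buffering point above, here is a minimal
C sketch. It is not UST's actual code: the names (trace_buf, trace_event,
buf_flush, tracing_enabled) are hypothetical, the 12-byte record layout
(8-byte timestamp plus the 4-byte payload) is an assumption, and the
"flush" only counts hand-offs instead of passing data to a consumer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 4096            /* 4k buffers, as in the benchmark setup */

struct trace_buf {
    unsigned char data[BUF_SIZE];
    size_t used;                 /* bytes currently occupied */
    unsigned long events;        /* events recorded so far */
    unsigned long flushes;       /* times a full buffer was handed off */
};

int tracing_enabled;             /* hypothetical global on/off switch */

void trace_buf_init(struct trace_buf *b)
{
    memset(b, 0, sizeof *b);
}

/* Stand-in for handing a full buffer to the consumer: the only point
 * where work outside the traced thread would be needed. */
void buf_flush(struct trace_buf *b)
{
    b->flushes++;
    b->used = 0;
}

/* Record one event with a plain function call: check the flag, then
 * copy the timestamp and payload into the in-process buffer. */
void trace_event(struct trace_buf *b, uint64_t timestamp, int32_t value)
{
    const size_t rec_size = sizeof timestamp + sizeof value;

    assert(b != NULL);
    if (!tracing_enabled)
        return;
    if (b->used + rec_size > BUF_SIZE)
        buf_flush(b);
    memcpy(b->data + b->used, &timestamp, sizeof timestamp);
    memcpy(b->data + b->used + sizeof timestamp, &value, sizeof value);
    b->used += rec_size;
    b->events++;
}
```

With these sizes a buffer holds 341 records, so the hand-off cost is paid
once per 341 events rather than once per event.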

Thanks,

Mathieu

> I don't know the implementation details but would be interested in
> understanding this.
> 
> Stefan

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 16:54   ` Mathieu Desnoyers
@ 2011-02-15 17:39     ` Frank Ch. Eigler
  2011-02-15 22:26       ` Mathieu Desnoyers
  0 siblings, 1 reply; 23+ messages in thread
From: Frank Ch. Eigler @ 2011-02-15 17:39 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: dominique.toupin, ltt-dev, systemtap


Hi -


mathieu.desnoyers wrote:

> [...]
>> This is not that surprising, considering how the two tools work.  UST
>> does its work in userspace,
> This first part of the statement is true,
>
>> and is therefore focused on an individual process's activities.
> This is incorrect. LTTng and UST gather traces from multiple processes and from
> the kernel, and merge them in post-processing. [...]

Isn't such after-the-fact merging out of scope of the present
microbenchmark, and in any case applicable to both tools?

- FChE


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 17:39     ` Frank Ch. Eigler
@ 2011-02-15 22:26       ` Mathieu Desnoyers
  0 siblings, 0 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2011-02-15 22:26 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: dominique.toupin, ltt-dev, systemtap

* Frank Ch. Eigler (fche@redhat.com) wrote:
> 
> Hi -
> 
> 
> mathieu.desnoyers wrote:
> 
> > [...]
> >> This is not that surprising, considering how the two tools work.  UST
> >> does its work in userspace,
> > This first part of the statement is true,
> >
> >> and is therefore focused on an individual process's activities.
> > This is incorrect. LTTng and UST gather traces from multiple processes and from
> > the kernel, and merge them in post-processing. [...]
> 
> Isn't such after-the-fact merging out of scope of the present
> microbenchmark, and in any case applicable to both tools?

This merging is indeed out of the scope of the microbenchmark, but your
statement that UST focuses on an "individual process's activities" is still
incorrect -- both SystemTAP and UST target system-wide analysis as one of their
main use-cases. We just use different techniques to get there (in-kernel probe
execution called from the instrumentation site vs streaming data through
buffers). SystemTAP allows flexibility with a powerful scripting language, which
complements the "data gathering" approach taken by UST nicely. We could even
think of combining the two approaches eventually: running scripts on information
extracted from UST buffers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 17:00   ` [ltt-dev] " Stefan Hajnoczi
  2011-02-15 17:04     ` Mathieu Desnoyers
@ 2011-02-16 10:56     ` Mark Wielaard
  2011-02-16 18:51       ` Roland McGrath
  1 sibling, 1 reply; 23+ messages in thread
From: Mark Wielaard @ 2011-02-16 10:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Frank Ch. Eigler, Julien Desfossez, dominique.toupin, ltt-dev,
	Mathieu Desnoyers, systemtap

On Tue, 2011-02-15 at 17:00 +0000, Stefan Hajnoczi wrote:
> On Tue, Feb 15, 2011 at 4:26 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
> > (One may imagine a future version of systemtap where scripts that
> > happen to independently probe single processes are executed with a
> > pure userspace backend, but this is not in our immediate roadmap.)
> 
> What is the fundamental mechanism that UST and SystemTap use for tracing?
> 
> e.g. Here's a guess:
> UST: a conditional function call within the same process
> SystemTap: a software interrupt on x86
> 
> I don't know the implementation details but would be interested in
> understanding this.

I don't know the precise implementation details for ltt. But for
SystemTap you could divide the "tracing" process into a couple of steps:

1) The probe marking. The way you embed where you can place probes and
   how to get at arguments/context of the probe. For userspace probes
   SystemTap mainly relies on two mechanisms:

 - dwarf debuginfo. This is the same mechanism debuggers use. It is a
   very low level description of how the source program maps to the
   binary. Through it you can determine locations for probes based on
   source lines, function names, etc. and get a description of how to
   get at local variables and arguments. Advantage is that it is already
   there (when compiled -g), so you don't need to do anything special.
   Downside is that it is pretty low level, so you do need to know a bit
   about the program structure before you can "trace" effectively.
   Recent advancements in gcc made the dwarf debuginfo pretty reliable.

 - sdt markers. This is a mechanism also employed by dtrace (although
   the way the markers and arguments are embedded is slightly different,
   this is an implementation detail though). A program #includes
   <sys/sdt.h> and places PROBE markers in its source code to indicate
   "high-level events" and relevant arguments for that event.
   The macros get translated to special code that places the name,
   address and where to find the arguments into a special ELF note.
   Advantage is that as a "trace user" you get an overview of high level
   events that might be interesting to introspect. Disadvantage is that
   the programmer needs to explicitly embed them in their program (but
   since dtrace and now gdb can also hook onto them they are getting
   used more and more).

2) The probe and context selection. In a systemtap stap script you
   list all places/events you want to place a probe on. These can be
   low level kernel events (tracepoints, based on kernel debuginfo,
   timers, perf events, etc) or user level events (based on the dwarf
   debuginfo or sdt markers placed in the program). Then for each (group
   of) probe events, you write a handler listing the context you are
   interested in (variables, arguments, etc.). These can then be used to
   filter and/or log the event (see under 5. The actual "trace").

4) Hooking onto the probe. Based on the stap script you provide the
   systemtap runtime decides which addresses to place probes on (or hook
   into event notifiers). It also extracts the location of each context
   variable and/or parameter used in the probe handler for that
   location. Currently for each user space address derived (which could
   be multiple if the probe point is inlined in various places) it uses
   uprobes to place a breakpoint instruction at that location and
   inserts a callback handler to the handler responsible for that probe
   event. All the nitty-gritty of placing the probes and handling the
   software interrupt is delegated to uprobes (it saves a full roundtrip
   user/kernel/user necessary with for example ptrace), which is being
   pushed into the upstream kernel so it can be used by others like perf
   and gdb in the future. But you could imagine hooking being done
   through other mechanisms, like in-process function calls in the
   user process. If the code injection techniques of ltt are reusable
   that would be a very cool idea.

5) The actual "trace"/data gathering step. Depending on the stap
   handler you wrote for the probe the SystemTap runtime (called
   through the probe hook) will extract the context variables
   and/or parameters you are interested in. They are then used for
   filtering (based on the conditionals used in your handler), after
   which you can either assign derived values to global (script)
   variables or statistical containers, or log the event
   and/or some of the context. Basically you write a log or printf
   statement in your handler when you want to "trace" it. Depending
   on how you invoked stap it is then placed in a file or some buffer
   through procfs, relayfs, debugfs or ring_buffers. Alternatively
   you can write an "end" handler that just spits out the data you
   accumulated and stored in the script variables and statistics
   (so as not to have to output anything at all during the probe
    event itself to save data output and processing time).
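
To make the sdt marker part of step 1 concrete, here is a minimal C
sketch of a marked-up workload loop. The probe name single_trace is the
one used in the benchmark script; the provider name tracepoint_benchmark
and the function run_workload are assumptions for illustration. The
no-op fallback macro is only there so the sketch still compiles on a
machine without <sys/sdt.h>.

```c
#include <assert.h>

/* Use the real marker macro when the header is available. */
#if defined(__has_include)
#  if __has_include(<sys/sdt.h>)
#    include <sys/sdt.h>
#  endif
#endif

/* Fallback for this sketch only: evaluate the argument, emit no marker. */
#ifndef DTRACE_PROBE1
#  define DTRACE_PROBE1(provider, name, arg1) ((void)(arg1))
#endif

/* Hypothetical workload: fire the marker once per iteration, passing
 * the loop counter as the probe argument. Returns the number of
 * iterations executed. */
long run_workload(long iterations)
{
    long i;

    assert(iterations >= 0);
    for (i = 0; i < iterations; i++)
        DTRACE_PROBE1(tracepoint_benchmark, single_trace, i);
    return i;
}
```

With the real header, each marker site is recorded in the binary so that
a probe such as process("...").mark("single_trace") can be attached
later without recompiling.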

Hope that helps. And if someone could give a similar overview of ltt
then we could see how we can more easily mix and match these various
steps in the future. Since it seems the mechanisms used are nicely
complementary.

Cheers,

Mark


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 15:53 LTTng-UST vs SystemTap userspace tracing benchmarks Julien Desfossez
                   ` (2 preceding siblings ...)
  2011-02-15 17:03 ` Mark Wielaard
@ 2011-02-16 15:30 ` Tom Tromey
  2011-02-16 18:19   ` Roland McGrath
  2011-02-17 17:33   ` Julien Desfossez
  3 siblings, 2 replies; 23+ messages in thread
From: Tom Tromey @ 2011-02-16 15:30 UTC (permalink / raw)
  To: Julien Desfossez; +Cc: ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

>>>>> "Julien" == Julien Desfossez <julien.desfossez@polymtl.ca> writes:

Julien> SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space
Julien> static markup.

The sdt.h stuff has been rewritten at least once since then.  I'd
suggest trying the latest.  I think it probably won't matter for the
flight recorder mode, but it may matter for measuring overhead.

Julien> 0) Baseline : running the program without any instrumentation
Julien> 1) Flight recorder tracing comparison UST vs SystemTap

I'd be interested to also see the numbers when the probes are in place
in the source, but not enabled.  That is, what is the overhead of a
disabled probe?

When doing this with SystemTap it would be interesting to try twice:
once with a semaphore for each probe, and once without.

Tom


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 15:30 ` Tom Tromey
@ 2011-02-16 18:19   ` Roland McGrath
  2011-02-17 17:33   ` Julien Desfossez
  1 sibling, 0 replies; 23+ messages in thread
From: Roland McGrath @ 2011-02-16 18:19 UTC (permalink / raw)
  To: Tom Tromey
  Cc: Julien Desfossez, ltt-dev, systemtap, Mathieu Desnoyers,
	dominique.toupin

> The sdt.h stuff has been rewritten at least once since then.  I'd
> suggest trying the latest.  I think it probably won't matter for the
> flight recorder mode, but it may matter for measuring overhead.

The v3 and v2 versions have the same runtime/code overhead.  The
difference is in the data overhead, which includes some startup-time
dynamic linking overhead for PIC code.  In v3, there is no data
overhead except possibly the one semaphore word, and none of that
startup-time overhead at all.

> I'd be interested to also see the numbers when the probes are in place
> in the source, but not enabled.  That is, what is the overhead of a
> disabled probe?
> 
> When doing this with SystemTap it would be interesting to try twice:
> once with a semaphore for each probe, and once without.

In v2 and v3, the only runtime overhead of a disabled probe is the
presence of a single 'nop' instruction in the code path.  AIUI, on
modern processors that costs less than one cycle.  The overhead of
a probe firing is unchanged, being a full breakpoint trap and the
kernel side of handling that and running the probe machinery.


Thanks,
Roland


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 10:56     ` Mark Wielaard
@ 2011-02-16 18:51       ` Roland McGrath
       [not found]         ` <20110216200034.GA6066@Krystal>
  2011-02-16 20:55         ` [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks Stefan Hajnoczi
  0 siblings, 2 replies; 23+ messages in thread
From: Roland McGrath @ 2011-02-16 18:51 UTC (permalink / raw)
  To: Mark Wielaard
  Cc: Stefan Hajnoczi, Frank Ch. Eigler, Julien Desfossez,
	dominique.toupin, ltt-dev, Mathieu Desnoyers, systemtap

Stefan was referring to #4 in your taxonomy.

It's indeed the case that what UST uses today is an always-there normal
C code sequence that loads global variables to decide whether to make
indirect function calls.  I don't recall off hand how many layers of
function calls to the libust DSO and such there are in either the
disabled or enabled cases.  At best, there is always the overhead of
several instructions and at least one load in the hot code path, and the
i-cache pollution that goes with that.
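
A minimal C sketch of that kind of always-present check, using
hypothetical names (struct tracepoint, tracepoint_hit, count_probe);
it is not libust's actual code, only the general shape of "load a
global, then maybe make an indirect call":

```c
#include <assert.h>
#include <stddef.h>

typedef void (*probe_fn)(int value);

/* Per-tracepoint state living in ordinary global data. */
struct tracepoint {
    int enabled;      /* loaded on every pass through the hot path */
    probe_fn probe;   /* called indirectly only when enabled */
};

struct tracepoint single_trace_tp;   /* zero-initialized: disabled */
unsigned long probe_hits;            /* what the example probe counts */

/* Example probe: just counts how often it was called. */
void count_probe(int value)
{
    (void)value;
    probe_hits++;
}

void tracepoint_enable(struct tracepoint *tp, probe_fn fn)
{
    assert(fn != NULL);
    tp->probe = fn;
    tp->enabled = 1;
}

void tracepoint_disable(struct tracepoint *tp)
{
    tp->enabled = 0;
}

/* The instrumentation site: even when disabled, this costs a load and
 * a conditional branch in the hot code path. */
void tracepoint_hit(struct tracepoint *tp, int value)
{
    if (tp->enabled)
        tp->probe(value);
}
```

The disabled case never leaves the process, but it still executes the
load and the branch, which is the overhead described above.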

It's indeed the case that what Systemtap uses today is a
sometimes-inserted normal breakpoint instruction, which is indeed a
software interrupt that requires kernel mediation.  When disabled, there
is as close to zero overhead as you can have, being a tiny placeholder
instruction sequence (currently just one nop), so the runtime overhead
is under a cycle and the i-cache pollution is the smallest possible unit
(one instruction, being just one byte on x86).

The "sweet spot" between the two is to have overhead close to
Systemtap's epsilon for a disabled probe, while having overhead close to
UST's pure-user method when a probe is enabled.  In the in-kernel
context, this is what the Linux kernel's latest code (still being hashed
out, but mostly done) has for kernel tracepoints using the so-called
"jump label" method.  That is also possible for sdt markers with some
careful consideration and attention to machine-specific details for each
machine architecture of concern.  It entails making the placeholder in
the hot code path slightly larger (at least for x86, it has to be a
"long nop", being probably negligibly more runtime overhead, and a few
bytes more i-cache pollution), and adding some additional static code
outside the hot path.  The work to enable or disable a probe becomes
just as costly as the current Systemtap method, since it involves
modifying the program text in place (inserting jump instructions rather
than breakpoint ones).  Once enabled, the runtime work of the probes
firing can be very much like what UST does today.
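
To make the size requirement concrete: on x86 a near jump with a 32-bit
displacement is encoded as opcode 0xE9 followed by a little-endian
displacement measured from the end of the instruction, five bytes in all,
so the placeholder has to reserve at least that much.  The sketch below
only computes those five bytes.  The function name is hypothetical, and
actually patching live program text (page protections, atomicity with
respect to running threads) is deliberately left out.

```c
#include <stdint.h>

#define JMP_REL32_LEN 5   /* opcode 0xE9 plus a 32-bit displacement */

/*
 * Fill out[] with the x86 "jmp rel32" that, placed at address `site`,
 * transfers control to `target`.  Returns 0 on success, or -1 when the
 * target is out of range for a 32-bit signed displacement.
 */
int encode_jmp_rel32(uint8_t out[JMP_REL32_LEN], uint64_t site, uint64_t target)
{
    /* The displacement is relative to the address of the next instruction. */
    int64_t disp = (int64_t)(target - (site + JMP_REL32_LEN));
    uint32_t d;

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    d = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    out[1] = (uint8_t)(d & 0xff);
    out[2] = (uint8_t)((d >> 8) & 0xff);
    out[3] = (uint8_t)((d >> 16) & 0xff);
    out[4] = (uint8_t)((d >> 24) & 0xff);
    return 0;
}
```

A one-byte nop leaves no room for these five bytes, which is one way to
see why the placeholder has to grow.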


Thanks,
Roland


* Re: Porting "jump labels" to userspace (was: Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks)
       [not found]         ` <20110216200034.GA6066@Krystal>
@ 2011-02-16 20:04           ` Roland McGrath
       [not found]             ` <4D5C30D8.8070107@caviumnetworks.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Roland McGrath @ 2011-02-16 20:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Mark Wielaard, Stefan Hajnoczi, Frank Ch. Eigler,
	Julien Desfossez, dominique.toupin, ltt-dev, systemtap,
	linux-kernel, Jason Baron, hpa, rostedt, mingo, tglx, andi, rth,
	masami.hiramatsu.pt, fweisbec, avi, davem, sam, ddaney, michael,
	Peter Zijlstra

IMHO there is not really so much to the in-kernel implementation that it's
worth attempting to reuse the code in userland.  Pretty much all the work
is in the details of the implementation that would naturally differ a lot
in a different context.  If you understand the mechanism and the machine
details, then implementing it well for a userland context is not a big deal
and is cleaner to do from scratch than shoe-horning kernel-centric code
into a wildly different context.


Thanks,
Roland


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 18:51       ` Roland McGrath
       [not found]         ` <20110216200034.GA6066@Krystal>
@ 2011-02-16 20:55         ` Stefan Hajnoczi
  2011-02-16 21:05           ` Mathieu Desnoyers
  2011-02-16 21:16           ` Mark Wielaard
  1 sibling, 2 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2011-02-16 20:55 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Mark Wielaard, Frank Ch. Eigler, Julien Desfossez,
	dominique.toupin, ltt-dev, Mathieu Desnoyers, systemtap

On Wed, Feb 16, 2011 at 6:50 PM, Roland McGrath <roland@redhat.com> wrote:
> Stefan was referring to #4 in your taxonomy.
>
> It's indeed the case that what UST uses today is an always-there normal
> C code sequence that loads global variables to decide whether to make
> indirect function calls.  I don't recall off hand how many layers of
> function calls to the libust DSO and such there are in either the
> disabled or enabled cases.  At best, there is always the overhead of
> several instructions and at least one load in the hot code path, and the
> i-cache pollution that goes with that.
>
> It's indeed the case that what Systemtap uses today is a
> sometimes-inserted normal breakpoint instruction, which is indeed a
> software interrupt that requires kernel mediation.  When disabled, there
> is as close to zero overhead as you can have, being a tiny placeholder
> instruction sequence (currently just one nop), so the runtime overhead
> is under a cycle and the i-cache pollution is the smallest possible unit
> (one instruction, being just one byte on x86).

Thanks for the explanations everyone.

I remember that DTrace also uses the software breakpoint method for
userspace probes.  I think the key reason they chose this method is
that it is the least invasive and does not require target process
cooperation.

Stefan


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 20:55         ` [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks Stefan Hajnoczi
@ 2011-02-16 21:05           ` Mathieu Desnoyers
  2011-02-16 21:16           ` Mark Wielaard
  1 sibling, 0 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2011-02-16 21:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Roland McGrath, Mark Wielaard, Frank Ch. Eigler,
	Julien Desfossez, dominique.toupin, ltt-dev, systemtap

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Wed, Feb 16, 2011 at 6:50 PM, Roland McGrath <roland@redhat.com> wrote:
> > Stefan was referring to #4 in your taxonomy.
> >
> > It's indeed the case that what UST uses today is an always-there normal
> > C code sequence that loads global variables to decide whether to make
> > indirect function calls.  I don't recall off hand how many layers of
> > function calls to the libust DSO and such there are in either the
> > disabled or enabled cases.  At best, there is always the overhead of
> > several instructions and at least one load in the hot code path, and the
> > i-cache pollution that goes with that.
> >
> > It's indeed the case that what Systemtap uses today is a
> > sometimes-inserted normal breakpoint instruction, which is indeed a
> > software interrupt that requires kernel mediation.  When disabled, there
> > is as close to zero overhead as you can have, being a tiny placeholder
> > instruction sequence (currently just one nop), so the runtime overhead
> > is under a cycle and the i-cache pollution is the smallest possible unit
> > (one instruction, being just one byte on x86).
> 
> Thanks for the explanations everyone.
> 
> I remember that DTrace also uses the software breakpoint method for
> userspace probes.  I think the key reason they chose this method is
> that it is the least invasive and does not require target process
> cooperation.

Yeah, but it's slow. :)

By the way, if the target process refuses to cooperate at some point (e.g.
it crashes), UST still keeps a handle on the shared memory map, so we can still
extract the buffers.

Another future direction we're looking into for UST is to add the ability to do
dynamic probing in userspace with the equivalent of the "fast tracepoints"
currently available in gdb and in the kernel: whenever possible, replace
instructions with a jump rather than a breakpoint to dynamically instrument
applications.

So with LD_PRELOAD and dynamic instrumentation, we won't require much
collaboration from the application.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 20:55         ` [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks Stefan Hajnoczi
  2011-02-16 21:05           ` Mathieu Desnoyers
@ 2011-02-16 21:16           ` Mark Wielaard
  1 sibling, 0 replies; 23+ messages in thread
From: Mark Wielaard @ 2011-02-16 21:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Roland McGrath, Frank Ch. Eigler, Julien Desfossez,
	dominique.toupin, ltt-dev, Mathieu Desnoyers, systemtap

On Wed, 2011-02-16 at 20:55 +0000, Stefan Hajnoczi wrote:
> On Wed, Feb 16, 2011 at 6:50 PM, Roland McGrath <roland@redhat.com> wrote:
> > It's indeed the case that what Systemtap uses today is a
> > sometimes-inserted normal breakpoint instruction, which is indeed a
> > software interrupt that requires kernel mediation.  When disabled, there
> > is as close to zero overhead as you can have, being a tiny placeholder
> > instruction sequence (currently just one nop), so the runtime overhead
> > is under a cycle and the i-cache pollution is the smallest possible unit
> > (one instruction, being just one byte on x86).
> 
> Thanks for the explanations everyone.
> 
> I remember that DTrace also uses the software breakpoint method for
> userspace probes.  I think the key reason they chose this method is
> that it is the least invasive and does not require target process
> cooperation.

Yes, it prevents slowing down the target process if no probes are ever
triggered (the normal case). Then it is as if no instrumentation was
ever inserted. And the software interrupt is just the implementation
that is currently used. So you could also look at dynamically patching
the marker probe location to insert a jump instruction to a user-space
handler function (this might need a small tweak to the sdt.h markers so
they leave enough space for that). Then there would also be minimal
overhead when the probes are enabled.

Cheers,

Mark


* Re: Porting "jump labels" to userspace
       [not found]                   ` <1297892529.5226.543.camel@laptop>
@ 2011-02-17 13:58                     ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2011-02-17 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mathieu Desnoyers, David Daney, Roland McGrath,
	Mark Wielaard, Stefan Hajnoczi, Frank Ch. Eigler,
	Julien Desfossez, dominique.toupin, ltt-dev, systemtap,
	linux-kernel, Jason Baron, hpa, rostedt, andi, rth,
	masami.hiramatsu.pt, fweisbec, avi, davem, sam, michael


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2011-02-16 at 21:45 +0100, Thomas Gleixner wrote:
> 
> > We talk about 500 lines of code, where half of it is module-specific
> > and the whole thing is full of kernelisms. IMNSHO, that's faster
> > reimplemented from scratch than writing all the mails and getting the
> > authors to sign off on the license change.
> 
> That and I'm not going to consent to an LGPL license, I'm an avid
> GPLv2 fan as various people already know ;-)

Ditto for me.

Thanks,

	Ingo


* Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-15 16:25 ` [ltt-dev] " William Cohen
@ 2011-02-17 16:21   ` Julien Desfossez
  0 siblings, 0 replies; 23+ messages in thread
From: Julien Desfossez @ 2011-02-17 16:21 UTC (permalink / raw)
  To: William Cohen; +Cc: ltt-dev, systemtap, dominique.toupin, Mathieu Desnoyers

On 02/15/2011 11:25 AM, William Cohen wrote:
> On 02/15/2011 10:53 AM, Julien Desfossez wrote:
>> LTTng-UST vs SystemTap userspace tracing benchmarks
>>
>> February 15th, 2011
>>
>> Authors: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>          Julien Desfossez <julien.desfossez@polymtl.ca>
> 
>> * SystemTap probe (stap testutrace.stp -F) :
>> probe process("./.libs/tracepoint_benchmark").mark("single_trace") {
>>     printf("%d : %s\n", gettimeofday_ns(), $arg1);
>> }

Hi William,
> How much of the SystemTap overhead is due to the printf() statement in the probe? What is the run time for the following:
> 
> probe process("./.libs/tracepoint_benchmark").mark("single_trace") {}
Apart from the fact that it produces a warning because the probe is empty,
the results differ a little, but not as much as I expected. I also tested
(in flight recorder mode) with only the gettimeofday_ns() call (and its
printing) removed:

# of threads   With printf    Without gtod_ns()   Without printf
1              0:58.36        0:52.27             0:46.45
2              1:49.94        1:37.61             1:27.33
4              2:38.49        2:35.13             2:50.87
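
To put these times in per-event terms, here is a small sketch that divides
a run time by the number of events.  It assumes 10 million events per
thread, as in the benchmark description, and that each time in the table
covers the events of all threads together; both function names are
hypothetical.

```c
#define EVENTS_PER_THREAD 10000000.0   /* from the benchmark description */

/* Convert a time printed as m:ss.ss in the table into seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Average microseconds per event, assuming the time covers every thread's events. */
double usec_per_event(double total_seconds, int threads)
{
    return total_seconds * 1e6 / (EVENTS_PER_THREAD * threads);
}
```

Under those assumptions, the one-thread row works out to about 5.8
microseconds per event with printf and about 4.6 without it.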

> Is the code for the benchmarks available, so we can take a look at reducing the overhead of SystemTap?
For those who want to play with the benchmark, we set up a git repository
here: git://git.lttng.org/benchmarks.git

If you have any suggestions or ideas to make these tests better, we'll
be happy to integrate them.

Thanks,

Julien


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-16 15:30 ` Tom Tromey
  2011-02-16 18:19   ` Roland McGrath
@ 2011-02-17 17:33   ` Julien Desfossez
  2011-02-17 19:11     ` Josh Stone
  1 sibling, 1 reply; 23+ messages in thread
From: Julien Desfossez @ 2011-02-17 17:33 UTC (permalink / raw)
  To: Tom Tromey; +Cc: ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

Hi,

On 02/16/2011 10:30 AM, Tom Tromey wrote:
> Julien> 0) Baseline : running the program without any instrumentation
> Julien> 1) Flight recorder tracing comparison UST vs SystemTap
> 
> I'd be interested to also see the numbers when the probes are in place
> in the source, but not enabled.  That is, what is the overhead of a
> disabled probe?
I disabled the probe by undefining HAVE_SYSTEMTAP, but I get the same
results in flight recorder mode. Of course, if the module is not loaded
we have no overhead at all. This means that the module is responsible for
all the overhead, regardless of whether the probe is called or not.
I would be really interested to know why this happens (and how to fix it).
This last test was done on Fedora Core 14 (kernel
2.6.35.10-74.fc14.x86_64 with SystemTap 1.3-3).

If you want to test, the benchmark code is here :
git://git.lttng.org/benchmarks.git

Thanks,

Julien


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-17 17:33   ` Julien Desfossez
@ 2011-02-17 19:11     ` Josh Stone
  2011-02-17 19:33       ` Julien Desfossez
  0 siblings, 1 reply; 23+ messages in thread
From: Josh Stone @ 2011-02-17 19:11 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Tom Tromey, ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

On 02/17/2011 09:33 AM, Julien Desfossez wrote:
> Hi,
> 
> On 02/16/2011 10:30 AM, Tom Tromey wrote:
>> Julien> 0) Baseline : running the program without any instrumentation
>> Julien> 1) Flight recorder tracing comparison UST vs SystemTap
>>
>> I'd be interested to also see the numbers when the probes are in place
>> in the source, but not enabled.  That is, what is the overhead of a
>> disabled probe?
> I disabled the probe by undefining HAVE_SYSTEMTAP, but I have the same
> results in flight recorder mode. Of course if the module is not loaded
> we have no overhead at all. It means that the module is responsible for
> all the overhead regardless of whether the probe is called or not.
> I would be really interested if you know why it happens (and how to fix it).
> This last test was done on a Fedora Core 14 (kernel
> 2.6.35.10-74.fc14.x86_64 with SystemTap 1.3-3).
> 
> If you want to test, the benchmark code is here :
> git://git.lttng.org/benchmarks.git

Your "testutrace.stp" is probing with process.function, which means
you're not using the compiled tracepoint at all, but rather a function
probe based on dwarf debuginfo.  So compiling !HAVE_SYSTEMTAP in this
case doesn't matter, the function still exists for the module to probe.
The correct form for SDT probes is a process.mark probe, as you quoted
in your original mail, in which case stap would fail to compile the
module for the !HAVE_SYSTEMTAP case as the marks don't exist.

In the general use case, a script can be conditional on the presence of
different probe types, as described in "man stapprobes".  For the
purpose of benchmarking I would avoid this, so we can be absolutely sure
of what's being probed.  But for reference, it can look like:
  probe process("foo").mark("myfn")!,
        process("foo").function("myfn")
  { ... }

Note also that there's about twice the overhead for process.function
versus process.mark.  With .mark, a NOP instruction is inserted for us
to place the debug breakpoint on.  As of the uprobes in stap 1.3, we can
skip the singlestep of probes on a NOP.  But for function probes, the
debug breakpoint is placed near the beginning of the function, likely on
a significant instruction, so it must be singlestepped.  Having a
singlestep means there's basically two traps per probe hit, so it really
is a big win to use process.mark instead.

Getting back to Tom's request, I think these are the variations that we
need to see for a fuller picture:

1) Baseline with NO instrumentation compiled in at all.  You may need
something like an asm("") in single_trace() to keep gcc from compiling
the loop away altogether.
1a) Same binary w/ probe process.function (showing that stap can probe
unmodified binaries, though I expect this to be slowest of all)

2) UST baseline: UST compiled in, but not active.
2a) Same binary w/ tracing activated, UST w/ TC
2b) Same binary w/ tracing activated, UST w/o TC
2c) etc. any other UST variant
* if the UST variations require different compilation, then split this up
and report active/inactive numbers each time.

3) SDT baseline: stap SDT compiled in, but not active.
3a) Same binary w/ active probe process.mark

4) SDT-semaphore baseline: SDT compiled in and using a semaphore, not
active.  The semaphore is TRACEPOINT_BENCHMARK_SINGLE_TRACE_ENABLED(),
so you could put  if (..._ENABLED()) TRACE(...);
4a) Same binary w/ active probe process.mark
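
For variations 1 and 4, the source could look roughly like the sketch
below.  The semaphore variable and the two macros are hypothetical
stand-ins for what a dtrace-generated probe header would provide (the
short macro names are placeholders too), and the sketch assumes that the
tracer makes the semaphore nonzero while a probe is attached.  The empty
asm and __builtin_expect are GCC/clang extensions.

```c
/* Hypothetical stand-ins for what a dtrace-generated probe header would provide. */
volatile unsigned short single_trace_semaphore = 0;
unsigned long probe_hits = 0;

#define SINGLE_TRACE_ENABLED() __builtin_expect(single_trace_semaphore, 0)
#define SINGLE_TRACE(v)        ((void)(v), probe_hits++)

/* Variation 1: no instrumentation; the empty asm keeps gcc from deleting the loop. */
void single_trace_baseline(int v)
{
    (void)v;
    __asm__ volatile("");
}

/* Variation 4: the probe site is skipped entirely unless the semaphore is nonzero. */
void single_trace_guarded(int v)
{
    if (SINGLE_TRACE_ENABLED())
        SINGLE_TRACE(v);
}

/* Benchmark-style loops; each returns a count that can be checked afterwards. */
unsigned long run_baseline(unsigned long n)
{
    unsigned long i;

    for (i = 0; i < n; i++)
        single_trace_baseline((int)i);
    return i;
}

unsigned long run_guarded(unsigned long n)
{
    unsigned long i;

    for (i = 0; i < n; i++)
        single_trace_guarded((int)i);
    return probe_hits;
}
```

The guarded form trades the one-nop disabled cost for a load and a branch,
which is why it is worth measuring as a separate baseline.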


Thanks,

Josh


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-17 19:11     ` Josh Stone
@ 2011-02-17 19:33       ` Julien Desfossez
  2011-02-17 19:47         ` Josh Stone
  0 siblings, 1 reply; 23+ messages in thread
From: Julien Desfossez @ 2011-02-17 19:33 UTC (permalink / raw)
  To: Josh Stone
  Cc: Tom Tromey, ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

On 02/17/2011 02:11 PM, Josh Stone wrote:
> On 02/17/2011 09:33 AM, Julien Desfossez wrote:
>> Hi,
>>
>> On 02/16/2011 10:30 AM, Tom Tromey wrote:
>>> Julien> 0) Baseline : running the program without any instrumentation
>>> Julien> 1) Flight recorder tracing comparison UST vs SystemTap
>>>
>>> I'd be interested to also see the numbers when the probes are in place
>>> in the source, but not enabled.  That is, what is the overhead of a
>>> disabled probe?
>> I disabled the probe by undefining HAVE_SYSTEMTAP, but I have the same
>> results in flight recorder mode. Of course if the module is not loaded
>> we have no overhead at all. It means that the module is responsible for
>> all the overhead regardless of whether the probe is called or not.
>> I would be really interested if you know why it happens (and how to fix it).
>> This last test was done on a Fedora Core 14 (kernel
>> 2.6.35.10-74.fc14.x86_64 with SystemTap 1.3-3).
>>
>> If you want to test, the benchmark code is here :
>> git://git.lttng.org/benchmarks.git
> 
> Your "testutrace.stp" is probing with process.function, which means
> you're not using the compiled tracepoint at all, but rather a function
> probe based on dwarf debuginfo.  So compiling !HAVE_SYSTEMTAP in this
> case doesn't matter, the function still exists for the module to probe.
> The correct form for SDT probes is a process.mark probe, as you quoted
> in your original mail, in which case stap would fail to compile the
> module for the !HAVE_SYSTEMTAP case as the marks don't exist.

Ok, my bad, I copied an older version of the test when I set up the
repository; it's fixed now. It doesn't change the earlier results, which
were done with the mark, just the one I posted today.
Now the empty probe is actually much faster and the overhead when the
probe is disabled is now zero, as expected :)

> In the general use case, a script can be conditional on the presence of
> different probe types, as described in "man stapprobes".  For the
> purpose of benchmarking I would avoid this, so we can be absolutely sure
> of what's being probed.  But for reference, it can look like:
>   probe process("foo").mark("myfn")!,
>         process("foo").function("myfn")
>   { ... }

Good to know, thanks.

> Note also that there's about twice the overhead for process.function
> versus process.mark.  With .mark, a NOP instruction is inserted for us
> to place the debug breakpoint on.  As of the uprobes in stap 1.3, we can
> skip the singlestep of probes on a NOP.  But for function probes, the
> debug breakpoint is placed near the beginning of the function, likely on
> a significant instruction, so it must be singlestepped.  Having a
> singlestep means there's basically two traps per probe hit, so it really
> is a big win to use process.mark instead.
> 
> Getting back to Tom's request, I think these are the variations that we
> need to see for a fuller picture:
> 
> 1) Baseline with NO instrumentation compiled in at all.  You may need
> something like an asm("") in single_trace() to keep gcc from compiling
> the loop away altogether.
> 1a) Same binary w/ probe process.function (showing that stap can probe
> unmodified binaries, though I expect this to be slowest of all)
> 
> 2) UST baseline: UST compiled in, but not active.
> 2a) Same binary w/ tracing activated, UST w/ TC
> 2b) Same binary w/ tracing activated, UST w/o TC
> 2c) etc. any other UST variant
> * if the UST variations require different compilation, then split this up
> and report active/inactive numbers each time.
> 
> 3) SDT baseline: stap SDT compiled in, but not active.
> 3a) Same binary w/ active probe process.mark
> 
> 4) SDT-semaphore baseline: SDT compiled in and using a semaphore, not
> active.  The semaphore is TRACEPOINT_BENCHMARK_SINGLE_TRACE_ENABLED(),
> so you could put  if (..._ENABLED()) TRACE(...);
> 4a) Same binary w/ active probe process.mark

Ok, I will do these tests on the same machine and post the results soon.

Thanks,

Julien


* Re: LTTng-UST vs SystemTap userspace tracing benchmarks
  2011-02-17 19:33       ` Julien Desfossez
@ 2011-02-17 19:47         ` Josh Stone
  0 siblings, 0 replies; 23+ messages in thread
From: Josh Stone @ 2011-02-17 19:47 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Tom Tromey, ltt-dev, systemtap, Mathieu Desnoyers, dominique.toupin

On 02/17/2011 11:33 AM, Julien Desfossez wrote:
> Ok, my bad, I copied an older version of the test when I set up the
> repository; it's fixed now. It doesn't change the earlier results, which
> were done with the mark, just the one I posted today.

Your original results were also using stap 1.2, which doesn't have the
NOP-singlestep optimization I described.  Please use our latest 1.4 for
the best comparison.  (and be sure the old uprobes.ko is unloaded)

> Ok, I will do these tests on the same machine and post the results soon.

Great, thanks.

Josh


Thread overview: 23+ messages
2011-02-15 15:53 LTTng-UST vs SystemTap userspace tracing benchmarks Julien Desfossez
2011-02-15 16:25 ` [ltt-dev] " William Cohen
2011-02-17 16:21   ` Julien Desfossez
2011-02-15 16:26 ` Frank Ch. Eigler
2011-02-15 16:54   ` Mathieu Desnoyers
2011-02-15 17:39     ` Frank Ch. Eigler
2011-02-15 22:26       ` Mathieu Desnoyers
2011-02-15 17:00   ` [ltt-dev] " Stefan Hajnoczi
2011-02-15 17:04     ` Mathieu Desnoyers
2011-02-16 10:56     ` Mark Wielaard
2011-02-16 18:51       ` Roland McGrath
     [not found]         ` <20110216200034.GA6066@Krystal>
2011-02-16 20:04           ` Porting "jump labels" to userspace (was: Re: [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks) Roland McGrath
     [not found]             ` <4D5C30D8.8070107@caviumnetworks.com>
     [not found]               ` <20110216203953.GB2015@Krystal>
     [not found]                 ` <alpine.LFD.2.00.1102162143430.2701@localhost6.localdomain6>
     [not found]                   ` <1297892529.5226.543.camel@laptop>
2011-02-17 13:58                     ` Porting "jump labels" to userspace Ingo Molnar
2011-02-16 20:55         ` [ltt-dev] LTTng-UST vs SystemTap userspace tracing benchmarks Stefan Hajnoczi
2011-02-16 21:05           ` Mathieu Desnoyers
2011-02-16 21:16           ` Mark Wielaard
2011-02-15 17:03 ` Mark Wielaard
2011-02-16 15:30 ` Tom Tromey
2011-02-16 18:19   ` Roland McGrath
2011-02-17 17:33   ` Julien Desfossez
2011-02-17 19:11     ` Josh Stone
2011-02-17 19:33       ` Julien Desfossez
2011-02-17 19:47         ` Josh Stone
