Systemtap userspace performance

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* Systemtap userspace performance
@ 2011-02-07 18:04 Julien Desfossez
  2011-02-07 22:30 ` Frank Ch. Eigler
  0 siblings, 1 reply; 2+ messages in thread
From: Julien Desfossez @ 2011-02-07 18:04 UTC (permalink / raw)
  To: systemtap

Hi,

I'm evaluating the performance of systemtap userspace instrumentation
(with utrace) and I'd like to make sure the procedure I use is the right
one to get the best results.

I'm particularly interested in the scalability of the solution to
multi-core and multi-processor systems.

My test program creates threads which have for only purpose to call a
function which calls 10 million times an other function designed only to
be instrumented (does nothing but is not inlined). The important parts
of the program are at the bottom of this message.

Right now I'm only working with 1 thread, my baseline (running the
program without instrumentation) gives me 0:00.27 of elapsed time (0.27
second).
I'm on Fedora 14 with a 2.6.35.10-74.fc14.x86_64 kernel with
systemtap-1.3-3.fc14.x86_64 on a Core 2 Duo P8600 with 4G of RAM.

I first instrumented the application using the following script :

probe process("./benchmark").function("single_trace")
{
	printf("%d : %d\n", gettimeofday_ns(), $v);
}

It works and gives me the result I expect, but running it in flight
recorder mode without outputting anything (just stap test.stp -F) gave
me : 0:51.82 elapsed time.

I then came across the embedded markers way to instrument an application
(with the dtrace probe points).
So I added the required code (following the documentation) and updated
my probe to :

probe process("./benchmark").mark("single_trace")
{
	printf("%d : %d\n", gettimeofday_ns(), $arg1);
}

Now I get 0:27.60 elapsed time which is better but is still a huge
overhead (especially when considering I'm not writing to disk anything
is recorded in the buffer).

I mesured the gettimeofday_ns() is responsible for 8.4 seconds for each
test.

So is there a way to get better results or is it the optimal way to
instrument an application with userspace systemtap ?

Thanks,

Julien

#define NR_EVENTS 10000000

void single_trace(unsigned int v)
{
    TRACE(TRACEPOINT_BENCHMARK_SINGLE_TRACE(v));
}

void do_trace(void)
{
    long i;

    for (i = 0; i < NR_EVENTS; i++)
        single_trace(42);
}

void *thr1(void *arg)
{
    do_trace();
    return ((void*)1);
}

int main(int argc, char **argv)
{
[...snip the init stuff...]
    for (i = 0; i < nr_threads; i++) {
        err = pthread_create(&tid[i], NULL, thr1, NULL);
        if (err != 0)
            exit(1);
    }

    for (i = 0; i < nr_threads; i++) {
        err = pthread_join(tid[i], &tret);
        if (err != 0)
            exit(1);
    }
    return 0;
}

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Systemtap userspace performance
  2011-02-07 18:04 Systemtap userspace performance Julien Desfossez
@ 2011-02-07 22:30 ` Frank Ch. Eigler
  0 siblings, 0 replies; 2+ messages in thread
From: Frank Ch. Eigler @ 2011-02-07 22:30 UTC (permalink / raw)
  To: Julien Desfossez; +Cc: systemtap

ju wrote:

> [...]
> I'm evaluating the performance of systemtap userspace instrumentation
> (with utrace) and I'd like to make sure the procedure I use is the right
> one to get the best results.

OK.

> I'm particularly interested in the scalability of the solution to
> multi-core and multi-processor systems.

OK.  I recall there is at least one known scalability-related obstacle
in the current implementation of uprobes/utrace.
http://sourceware.org/bugzilla/show_bug.cgi?id=5660

Note that the rest of your message does not deal with scalability per
se (how the performance varies with number of processors), just
single-processor overhead.

> My test program creates threads which have for only purpose to call a
> function which calls 10 million times an other function [...]
> Right now I'm only working with 1 thread, my baseline (running the
> program without instrumentation) gives me 0:00.27 of elapsed time (0.27
> second).

OK.

> probe process("./benchmark").function("single_trace") [...]
> 0:51.82 elapsed time.
> [...]
> probe process("./benchmark").mark("single_trace") [...]
> I get 0:27.60 elapsed time 

These numbers are on the order of 2-5 microseconds per hit.  They are
not grossly out of whack with what we've seen before.  What are you
comparing this amount to?

> [...]  I mesured the gettimeofday_ns() is responsible for 8.4
> seconds for each test.

We should be able to improve on that.  Try get_cycles() instead.

Also be aware that the formatted printing is also of considerable load
(and is distinct from whether you ultimately store the data on disk or
not).  You might try measuring just printf()'s contribution.  We have
plans to speed that up.
http://sourceware.org/bugzilla/show_bug.cgi?id=3217

- FChE

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2011-02-07 22:30 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-07 18:04 Systemtap userspace performance Julien Desfossez
2011-02-07 22:30 ` Frank Ch. Eigler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).