public inbox for systemtap@sourceware.org
* reducing cost of user-space probes
@ 2017-04-24 11:59 O Mahony, Billy
  2017-04-24 12:17 ` Arkady
  2017-04-24 15:50 ` David Smith
  0 siblings, 2 replies; 9+ messages in thread
From: O Mahony, Billy @ 2017-04-24 11:59 UTC (permalink / raw)
  To: systemtap

Hi,

I'm new to systemtap and I am using it to add some probes into a user space application.

The probe is pretty simple - it collects one integer argument and presents a histogram every 3 seconds.
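
For illustration only, a probe of this kind might look roughly like the sketch below. The actual script was not posted, so the binary path, the function name and the parameter name are placeholders (assumptions), not details of the real application.

```systemtap
# Hypothetical sketch: the path, function and parameter names are
# placeholders, not the real application's.
global batch_size

probe process("/path/to/app").function("rx_batch")
{
    # $nb_pkts stands in for the integer argument being traced.
    # <<< adds one sample to a statistics aggregate.
    batch_size <<< $nb_pkts
}

probe timer.s(3)
{
    # Guard against an interval in which no samples arrived.
    if (@count(batch_size) > 0)
        print(@hist_linear(batch_size, 0, 32, 1))
    delete batch_size
}
```

The bucket range (0 to 32, width 1) is only a guess based on a typical batch size of about 20 and would need adjusting to the real data.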

The probe is working fine and I'm getting sensible results. The application is a packet-processing application that uses a user-space I/O library (DPDK) to read batches of network packets directly into user space. The probe is called about 750K times per second: I have a 10Gb link with 64B packets, which generates 14.8M packets per second, but the batch size (that's the stat I'm tracing) is about 20, so that comes to roughly 750K probe hits per second.

When the probe is in use I see less performance from the packet-processing application - it starts losing packets at about 90% of its non-probed throughput.

However, when I run stap I see:

> Pass 4: compiled C into "stap_13723.ko" in 9020usr/980sys/10638real ms

Does this mean that each time the probe is hit, a system call is made to this new .ko module? That would surely mean quite a lot of overhead. If this is correct, can this overhead be avoided for user-space probes?

Alternatively is there a way to only execute the script every n times the probe is hit?

Maybe there is a compile time macro that does this or some .stap command that does an early return from the script X% of the time. I searched for 'sample/sampling' in the lang ref but I didn't see anything.

Thanks for any help you can give.

Billy


* Re: reducing cost of user-space probes
  2017-04-24 11:59 reducing cost of user-space probes O Mahony, Billy
@ 2017-04-24 12:17 ` Arkady
  2017-04-24 13:08   ` O Mahony, Billy
  2017-04-24 15:50 ` David Smith
  1 sibling, 1 reply; 9+ messages in thread
From: Arkady @ 2017-04-24 12:17 UTC (permalink / raw)
  To: O Mahony, Billy; +Cc: systemtap

Hi,

An 8-10% performance hit when handling 0.5M-1M events/s is in line with
what I experience. Some ways to improve the performance:

* find (or add) a different probe point which is called less frequently
* when running stap, remove built-in checks, for example with
--suppress-time-limits
* examine the source code generated by stap (command line switch -k).
Some things are more expensive than others. For example, nesting in the
STAP script, strings, and associative arrays all come at some cost. I
discovered that using inline C and an array makes sense in some cases.
You can access the array through /proc and process the data offline.



* RE: reducing cost of user-space probes
  2017-04-24 12:17 ` Arkady
@ 2017-04-24 13:08   ` O Mahony, Billy
  2017-04-24 13:22     ` Arkady
  0 siblings, 1 reply; 9+ messages in thread
From: O Mahony, Billy @ 2017-04-24 13:08 UTC (permalink / raw)
  To: Arkady; +Cc: systemtap

Hi Arkady,

Thanks for the fast reply. Some great tips there.

Can you point me to a sample of using inline C? That sounds really interesting.

Thanks,
Billy. 


* Re: reducing cost of user-space probes
  2017-04-24 13:08   ` O Mahony, Billy
@ 2017-04-24 13:22     ` Arkady
  0 siblings, 0 replies; 9+ messages in thread
From: Arkady @ 2017-04-24 13:22 UTC (permalink / raw)
  To: O Mahony, Billy; +Cc: systemtap

You can do something like

probe timer.jiffies(100)
{
%{DEBUG_COUNTERS_BUMP(DEBUG_JIFFIES)%}
}

where DEBUG_COUNTERS_BUMP is a macro which increments the entry at
offset DEBUG_JIFFIES in an array.

I would not expect a lot of performance gain.
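
A fuller sketch of the same idea follows. It is an assumption about how
such a macro and array could be defined, not the actual code behind the
snippet above: the array size, the helper function names and the procfs
file name are all made up for illustration. Embedded C needs guru mode,
so a script like this would be run with "stap -g".

```systemtap
# Hypothetical sketch; run with: stap -g counters.stp
%{
/* Illustrative definitions, not taken from any real script. */
#define DEBUG_JIFFIES 0
#define DEBUG_COUNTERS_MAX 16
static unsigned long debug_counters[DEBUG_COUNTERS_MAX];
/* Plain increment: not atomic, so concurrent hits on several CPUs
 * may occasionally lose an update. */
#define DEBUG_COUNTERS_BUMP(idx) (debug_counters[(idx)]++)
%}

# Embedded-C helper that bumps one slot of the array.
function bump_jiffies() %{
    DEBUG_COUNTERS_BUMP(DEBUG_JIFFIES);
%}

# Embedded-C helper that returns the current value of that slot.
function read_jiffies:long() %{
    STAP_RETVALUE = debug_counters[DEBUG_JIFFIES];
%}

probe timer.jiffies(100)
{
    bump_jiffies()
}

# Expose the counter through /proc so it can be read while the
# script is running.
probe procfs("jiffies_hits").read
{
    $value = sprintf("%d\n", read_jiffies())
}

probe end
{
    printf("timer probe hits: %d\n", read_jiffies())
}
```

With a script along these lines the counter file should appear under
/proc/systemtap/<module name>/ and can be read with cat while the
script runs.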


* Re: reducing cost of user-space probes
  2017-04-24 11:59 reducing cost of user-space probes O Mahony, Billy
  2017-04-24 12:17 ` Arkady
@ 2017-04-24 15:50 ` David Smith
  2017-04-24 18:37   ` Josh Stone
  1 sibling, 1 reply; 9+ messages in thread
From: David Smith @ 2017-04-24 15:50 UTC (permalink / raw)
  To: O Mahony, Billy; +Cc: systemtap

On Mon, Apr 24, 2017 at 6:58 AM, O Mahony, Billy
<billy.o.mahony@intel.com> wrote:
> Hi,
>
> I'm new to systemtap and I am using it to add some probes into a user space application.
>
> The probe is pretty simple - it collects one integer argument and presents a histogram every 3 seconds.
>
> The probe is working fine and I'm getting sensible results. The application is a packet-processing application that uses a user-space I/O library (DPDK) to read batches of network packets directly into user space. The probe is called about 750K times per second: I have a 10Gb link with 64B packets, which generates 14.8M packets per second, but the batch size (that's the stat I'm tracing) is about 20, so that comes to roughly 750K probe hits per second.
>
> When the probe is in use I see less performance from the packet-processing application - it starts losing packets at about 90% of its non-probed throughput.
>
> However, when I run stap I see:
>
>> Pass 4: compiled C into "stap_13723.ko" in 9020usr/980sys/10638real ms
>
> Does this mean that each time the probe is hit, a system call is made to this new .ko module? That would surely mean quite a lot of overhead. If this is correct, can this overhead be avoided for user-space probes?

The default "linux" runtime generates source for a kernel module,
compiles and installs it behind the scenes. That's how the default
runtime works. A system call is not made to the kernel module every
time the probe is hit (even if it wanted to, kernel modules can't call
system calls). Systemtap uses a kernel feature called 'uprobes' to
handle user-space probe hits.

> Alternatively is there a way to only execute the script every n times the probe is hit?
>
> Maybe there is a compile time macro that does this or some .stap command that does an early return from the script X% of the time. I searched for 'sample/sampling' in the lang ref but I didn't see anything.

Sure, if you want to exit early, just call "next". It could look
something like the following, if you only want to look at every 10th
function hit:

====
global iterations

probe process("/usr/bin/foo").function("bar")
{
    iterations++
    if (iterations % 10 != 0)
        next

    # ... data collection here
}
====
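
A variation on the same idea, offered as a sketch rather than a tested
recommendation, is to sample at random with the tapset function
randint(), which avoids updating a shared global counter on every hit.
The process path, function name and $cnt parameter below are
placeholders.

```systemtap
# Hypothetical sketch: path, function and $cnt are placeholders.
global batch_size

probe process("/usr/bin/foo").function("bar")
{
    # randint(10) returns a pseudo-random integer in [0, 10), so
    # roughly 9 hits out of 10 leave the handler here.
    if (randint(10) != 0)
        next

    # Only the sampled hits reach the data collection.
    batch_size <<< $cnt
}

probe end
{
    if (@count(batch_size) > 0)
        printf("sampled %d hits, average %d\n",
               @count(batch_size), @avg(batch_size))
}
```

Either way, sampling only skips the body of the handler; the probe
itself still fires on every hit.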

If you'd like help in making your script run faster, you'll need to
show it to us so that we can make suggestions.

-- 
David Smith
Principal Software Engineer
Red Hat


* Re: reducing cost of user-space probes
  2017-04-24 15:50 ` David Smith
@ 2017-04-24 18:37   ` Josh Stone
  2017-04-25  9:11     ` O Mahony, Billy
  0 siblings, 1 reply; 9+ messages in thread
From: Josh Stone @ 2017-04-24 18:37 UTC (permalink / raw)
  To: David Smith, O Mahony, Billy; +Cc: systemtap

On 04/24/2017 08:49 AM, David Smith wrote:
>> Does this mean that each time the probe is hit, a system call is made to this new .ko module? That would surely mean quite a lot of overhead. If this is correct, can this overhead be avoided for user-space probes?
> 
> The default "linux" runtime generates source for a kernel module,
> compiles and installs it behind the scenes. That's how the default
> runtime works. A system call is not made to the kernel module every
> time the probe is hit (even if it wanted to, kernel modules can't call
> system calls). Systemtap uses a kernel feature called 'uprobes' to
> handle user-space probe hits.

It's not a syscall, rather an int3 trap, but the overhead is roughly the
same.  You pay the cost of a transition to and from ring0 *every* time,
even if you otherwise short-circuit the probe with 'next' or similar.

This is where "stap --runtime=dyninst" might shine, as the probe
executes entirely in-process.


* RE: reducing cost of user-space probes
  2017-04-24 18:37   ` Josh Stone
@ 2017-04-25  9:11     ` O Mahony, Billy
  2017-05-08 16:20       ` O Mahony, Billy
  0 siblings, 1 reply; 9+ messages in thread
From: O Mahony, Billy @ 2017-04-25  9:11 UTC (permalink / raw)
  To: Josh Stone, David Smith; +Cc: systemtap

Hi David,

Thanks for that tip. I'll read up on it.

Cheers,
Billy


* RE: reducing cost of user-space probes
  2017-04-25  9:11     ` O Mahony, Billy
@ 2017-05-08 16:20       ` O Mahony, Billy
  2017-05-10 20:08         ` David Smith
  0 siblings, 1 reply; 9+ messages in thread
From: O Mahony, Billy @ 2017-05-08 16:20 UTC (permalink / raw)
  To: 'Josh Stone', 'David Smith'
  Cc: 'systemtap@sourceware.org'

Hi David,

My Ubuntu-packaged systemtap doesn't have dyninst support:

   $ sudo stap -v --dyninst ./vswitch.stp
   ERROR: --runtime=dyninst unavailable; this build lacks DYNINST feature
   Systemtap translator/driver (version 2.9/0.165, Debian version 2.9-2ubuntu2 (xenial))
   Copyright (C) 2005-2015 Red Hat, Inc. and others
   This is free software; see the source for copying conditions.
   enabled features: AVAHI LIBSQLITE3 NLS NSS TR1_UNORDERED_MAP

Before I get stuck into building from source - is there a dyninst-devel package for Ubuntu available? I see there are dyninst 8.x packages for Ubuntu 12, but I'm not sure I want to install them directly.

Thanks,
Billy


* Re: reducing cost of user-space probes
  2017-05-08 16:20       ` O Mahony, Billy
@ 2017-05-10 20:08         ` David Smith
  0 siblings, 0 replies; 9+ messages in thread
From: David Smith @ 2017-05-10 20:08 UTC (permalink / raw)
  To: O Mahony, Billy; +Cc: Josh Stone, systemtap

On Mon, May 8, 2017 at 11:20 AM, O Mahony, Billy
<billy.o.mahony@intel.com> wrote:
> Hi David,
>
> My Ubuntu-packaged systemtap doesn't have dyninst support:
>
>    $ sudo stap -v --dyninst ./vswitch.stp
>    ERROR: --runtime=dyninst unavailable; this build lacks DYNINST feature
>    Systemtap translator/driver (version 2.9/0.165, Debian version 2.9-2ubuntu2 (xenial))
>    Copyright (C) 2005-2015 Red Hat, Inc. and others
>    This is free software; see the source for copying conditions.
>    enabled features: AVAHI LIBSQLITE3 NLS NSS TR1_UNORDERED_MAP

You are right, for whatever reason the Ubuntu maintainer didn't enable
dyninst support. If it were supported, you'd see "DYNINST" in the above
list of enabled features.

> Before I get stuck into building from source - is there a dyninst-devel package for Ubuntu available? I see there are dyninst 8.x packages for Ubuntu 12, but I'm not sure I want to install them directly.

I'm afraid I don't have easy access to an Ubuntu 12 system, so unless
someone else pipes up here I'd suggest going ahead and giving those
dyninst packages a shot.


end of thread, other threads:[~2017-05-10 20:08 UTC | newest]

Thread overview: 9+ messages
2017-04-24 11:59 reducing cost of user-space probes O Mahony, Billy
2017-04-24 12:17 ` Arkady
2017-04-24 13:08   ` O Mahony, Billy
2017-04-24 13:22     ` Arkady
2017-04-24 15:50 ` David Smith
2017-04-24 18:37   ` Josh Stone
2017-04-25  9:11     ` O Mahony, Billy
2017-05-08 16:20       ` O Mahony, Billy
2017-05-10 20:08         ` David Smith
