* Hashtable
From: Arkady @ 2017-07-05 16:47 UTC (permalink / raw)
To: systemtap

Hi,

I have a CPU bottleneck in some situations on heavily loaded servers.

From the tests it appears that associative maps contribute a significant
part of the overhead. I was thinking about replacing the maps with a
simple hashtable like this one (this is not production code):

https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9

Is there something like this in the STAP API which I could use?

Thank you, Arkady.

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hashtable
From: David Smith @ 2017-07-06 16:36 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

On Wed, Jul 5, 2017 at 11:46 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> Hi,
>
> I have a CPU bottleneck in some situations on heavy loaded servers.
>
> From the tests it appears that associative maps contribute significant
> part of the overhead.

... stuff deleted ...

Can you show us your script (or the associative-map portion) that
illustrates the performance problem? Perhaps we can make some
suggestions.

(Also note that I've started a background personal task to reduce the
use of locks in systemtap. I don't have much to show for it yet.)

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-06 17:26 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

On Thu, Jul 6, 2017 at 7:36 PM, David Smith <dsmith@redhat.com> wrote:
> Can you show us your script (or the associate map portion) that
> illustrates the performance problem? Perhaps we can make some
> suggestions.

My test is a tight loop:

file=echo_file_`date +%s%N`; echo $file; echo > $file; counter=1;
end=$((SECONDS+10)); while [ $SECONDS -lt $end ]; do echo $counter >> $file;
counter=$((counter+1)); done; tail -n 1 $file; rm -f $file

I run a number of these - one per core.

My stap script is something like this (8 probes; note the original posting
listed syscall.dup2.return twice, corrected here to the dup2 call/return
pair):

stap -g -e '%{
long long counter;
u8 shm[256];
static void* w_shm(void);
static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        {%{ {counter++; w_shm();} %}}
probe syscall.close.return {%{ {counter++; w_shm();} %}}
probe syscall.open         {%{ {counter++; w_shm();} %}}
probe syscall.open.return  {%{ {counter++; w_shm();} %}}
probe syscall.dup2         {%{ {counter++; w_shm();} %}}
probe syscall.dup2.return  {%{ {counter++; w_shm();} %}}
probe syscall.read         {%{ {counter++; w_shm();} %}}
probe syscall.read.return  {%{ {counter++; w_shm();} %}}
probe end { %{ {printk("\n%lli\n", counter);} %}}'

w_shm() simulates writes to the shared memory.

The performance impact is ~15% for 4 cores.

Then I add a map (global ar%):

stap -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '
global ar%;
function w_ar() { ar[tid()] = tid(); }
%{
long long counter;
u8 shm[256];
static void* w_shm(void);
static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.close.return {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.open         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.open.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.dup2         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.dup2.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.read         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.read.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe end { %{ {printk("\n%lli\n", counter);} %}}'

With the map I get a ~35% hit, and the overhead grows with the number of
cores.

The scripts roughly reflect what I am doing in the actual code. I have
1-3 associative arrays per syscall type; for example, I keep separate
arrays for probe syscall.read and probe syscall.write. I have ~30
probes - I/O, networking, thread life cycle.

> (Also note that I've started a background personal task to reduce the
> use of locks in systemtap. I don't have much to show for it yet.)

It looks like the performance of the probes does not scale well with
the number of cores. I suspect that the spin locks at the beginning of
every probe are to blame.

Thank you, Arkady.
* Re: Hashtable
From: Arkady @ 2017-07-06 17:45 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

P.S. The performance is very sensitive to TRYLOCKDELAY, which is expected.

On Thu, Jul 6, 2017 at 8:25 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: Arkady @ 2017-07-06 17:51 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

P.S.2 Convenient links for copy & paste:

https://gist.githubusercontent.com/larytet/10ceddea609d2da17aa09558ed0e04bc/raw/05037d536e5edf0e2f5a45282c41b8fa46d1fd55/SystemTap_tests.sh

https://gist.githubusercontent.com/larytet/fc147587e9dfecfe99ab6bac2ba4aaa0/raw/670e385cb76798b526de9f4265046cf576c42f4e/SystemTap_tests

On Thu, Jul 6, 2017 at 8:45 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: David Smith @ 2017-07-06 18:24 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

OK. The first thing to do is run your script with one addition: add
'-DSTP_ALIBI' to the stap command line. This will compile/run the
script, but when a probe is hit it will return immediately. This will
give you an idea of the overhead of kprobes itself.

Another option would be to remove '-DSTP_ALIBI' from the command line
and add '-t' instead. Here's a description of what that option does:

-t  Collect timing information on the number of times probe executes
    and average amount of time spent in each probe-point. Also shows
    the derivation for each probe-point.

Once you've done that, we'll know more about what is going on.

On Thu, Jul 6, 2017 at 12:50 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-07 3:26 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

On Thu, Jul 6, 2017 at 9:24 PM, David Smith <dsmith@redhat.com> wrote:
> OK, The first thing to do is run your script with 1 addition. Add
> '-DSTP_ALIBI' to the stap command line. This will compile/run the
> script, but when a probe is hit it will return immediately.

I have been through this routine, including the -t flag.
10 empty probes introduce ~8% overhead.

> Once you've done that, we'll know more of what is going on.

The heaviest probes are thread-creation related. The following output
is from the production code; the lines are ordered by min time.

index state   hits   min*    avg    max  name
  168 on         1  79908  79908  79908  begin(-9223372036854775808)
    0 on         1  47496  47496  47496  begin
  114 on        50  16065  70322 528451  kernel.function("SyS_execve@/build/linux-EO9xOi/linux-4.4.0/fs/exec.c:1722").call
  171 on         1  14337  14337  14337  begin(-9223372036854775808)
  172 on         1  13149  13149  13149  begin(-9223372036854775808)
  132 on         1   9525   9525   9525  kernel.function("SyS_accept4@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1425").call
  121 on         1   8628   8628   8628  kernel.function("SyS_kill@/build/linux-EO9xOi/linux-4.4.0/kernel/signal.c:2847").call
  165 on       708   8247  18919 247540  timer.sec(1)
  145 on         2   6372   8148   9924  kernel.function("SyS_bind@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1363").call?
  170 on         1   5055   5055   5055  begin(-9223372036854775808)
  117 on        50   4731  11318  28385  kernel.function("SyS_execve@/build/linux-EO9xOi/linux-4.4.0/fs/exec.c:1722").return
  169 on         1   4416   4416   4416  begin(-9223372036854775808)
  124 on         5   4096   5494   7470  kernel.function("SyS_getcwd@/build/linux-EO9xOi/linux-4.4.0/fs/dcache.c:3244").return
   39 on       122   3088  11082  39507  kernel.function("SyS_sendto@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1619").call?

The following gist is a good representation of what I am doing in the
exec probe:

https://gist.github.com/larytet/b8155e9d01942984888e90c88c23c206

When I look into the C code generated by stap, there are quite a number
of strlcpy() calls for the filename and args. Some of them are not
strictly necessary, but I understand that a generic framework has costs.

I figure that the arrays are a low-hanging fruit. I hope that by
limiting the hashtable API I can build a lockless hashtable along the
lines of
https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9.
I expect a 15%-30% reduction in the performance overhead. The
improvement will come from two things:

* A single copy of the filename - I am going to copy the filename/args
  from user space directly to the driver's shared memory instead of
  into the map
* No spin locks are required for the hashmap

Thank you, Arkady.
* Re: Hashtable
From: Arkady @ 2017-07-07 16:42 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

This simple script demonstrates the problem when running on a multicore
machine:

#!/bin/bash
function echo_loop()
{
    file=echo_file_`date +%s%N`
    echo $file
    echo > $file
    counter=1
    end=$((SECONDS+10))
    while [ $SECONDS -lt $end ]; do
        echo $counter >> $file
        counter=$((counter+1))
    done
    tail -n 1 $file
    rm -f $file
}

CPUS=`grep -c ^processor /proc/cpuinfo`
pids=""
counter=0
while [ $counter -lt $CPUS ]; do
    echo_loop &
    pids="$pids $!"
    counter=$(($counter+1))
done
wait $pids

This short script generates more than 10M system calls/s on a 40-core
machine. I do not know the exact rate because the performance overhead of
my STAP script exceeds 30%. This rate represents a challenge for the
system I develop, which collects system calls and copies the arguments to
shared memory.

On Fri, Jul 7, 2017 at 6:26 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: Frank Ch. Eigler @ 2017-07-10 15:20 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

arkady.miasnikov wrote:

> [...]
> From the tests it appears that associative maps contribute significant
> part of the overhead. I was thinking about replacing the maps with a
> simple hashtable like this one (this is not production code)
> https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9

The systemtap maps are already hash tables. The complication here is
that you're using them in a highly contended way, so locking effects
dominate algorithmic ones. PR10796 is probably more relevant to your
case.

- FChE
* Re: Hashtable
From: Arkady @ 2017-07-13 16:52 UTC (permalink / raw)
To: systemtap

For the sake of web searches: this is a reply to the thread "25400 by:
Arkady".

The following code demonstrates about a 20% improvement in latency:

https://github.com/larytet/lockfree_hashtable
* Re: Hashtable
From: David Smith @ 2017-07-14 15:35 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

Arkady,

If you think the existing systemtap maps have too much overhead, we'd
welcome you to take a shot at reworking them. This could benefit all
systemtap users. We welcome all contributors. You and/or your employer
will need to be OK with you contributing code under the GPLv2 license.

Take the existing systemtap source code and start replacing the
existing map functionality with your improved map code. When redoing
code like this, you'd want to make sure the existing map tests have
the same results with your code as with the original code. We'd be
happy to consult with you along the way.

Let us know if you have questions or need pointers on where to look first.

On Thu, Jul 13, 2017 at 11:51 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-14 17:16 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

David,

There are numerous problems with my hashtable approach. Beyond raw
performance, it is not in any way better than the maps: a custom chunk
of code tailored for a task will beat a generic library in a specific
test. I do not have anything to replace the maps with, and nothing
generic enough to be worth adding to SystemTap - not yet, anyway. A
"map" fine-tuned for a thread-ID key is an idea I am considering.

I am playing with something I call "STAP bypass" - allowing a pure C
implementation of a probe to be registered. This gives about a 1% gain,
part of which comes from removing calls to _stp_print_flush(). A define
symbol like STP_TRACE_OFF would be nice.

Unfortunately, I doubt my employer is willing to spend my time on
extending STAP functionality - the very opposite is true. I am trying
to minimize any STAP hacking unless I encounter an issue I cannot solve
otherwise (the begin and end probes are such an issue).

The only goal of my mail below was to help anyone who is trying to
tackle a very specific corner case.

On Fri, Jul 14, 2017 at 6:35 PM, David Smith <dsmith@redhat.com> wrote:
> [...]
* Re: Hashtable
From: Frank Ch. Eigler @ 2017-07-16 16:09 UTC (permalink / raw)
To: Arkady; +Cc: David Smith, systemtap

arkady.miasnikov wrote:

> [...]
> I am playing with something I call "STAP bypass" - allow to register a
> pure C implementation of a probe. This gives about 1% gain part of
> which comes from removing calls to _stp_print_flush(). A define symbol
> like STP_TRACE_OFF would be nice. [...]

That _stp_print_flush() bit could be elided by analysis of the staptree
associated with the probe; if no print*s are found there (or in any
function transitively called), then there is no need to flush. (For
embedded-C functions, we could add a /* print */ pragma to handle that
in the analysis.)

- FChE
* Re: Hashtable
From: Arkady @ 2017-07-16 16:19 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: David Smith, systemtap

Eliding the trace would be cool.

On Sun, Jul 16, 2017 at 7:09 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
> [...]
* Re: Hashtable
From: Arkady @ 2017-07-17 8:58 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: David Smith, systemtap

I am trying to implement a full bypass of the code generated by
SystemTap in some special cases. My goal is to be able to write code
like this:

probe syscall.dup2
{ /* bypass */ /* arg: oldfd */
    printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd);
    // Unmanaged C code
}

I have not finished the code yet. I added two new pragma comments:

/* bypass */
/* arg: <argument name> */

When generating the code I add the relevant dwarf_tvar_get_* calls and
initialize "local" variables in the context structure. Because the
probes I am targeting are very short and simple and do not involve
nested calls, I can shortcut a lot of the variable copying.

Is there a chance for something similar in SystemTap's future?

Thank you, Arkady.

On Sun, Jul 16, 2017 at 7:19 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable 2017-07-17 8:58 ` Hashtable Arkady @ 2017-07-18 14:13 ` Frank Ch. Eigler 2017-07-18 16:35 ` Hashtable Arkady 0 siblings, 1 reply; 17+ messages in thread From: Frank Ch. Eigler @ 2017-07-18 14:13 UTC (permalink / raw) To: Arkady; +Cc: David Smith, systemtap arkady.miasnikov wrote: > I am trying to implement full bypass of the code generated by the > SystemTap in some special cases. [...] OK, interesting. > My goal is to be able to write code like this: > probe syscall.dup2 > { /* bypass */ /* arg: oldfd */ > printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd); > // Unmanaged C code > } So like an embedded-C probe handler? > [...] When generating the code I add the relevant dwarf_tvar_get_ > calls and initialize "local" variables in the context > structure. Because the probes I am targeting are very short and simple > and do not involve nested calls I shortcut lot of variables copying. How much more efficient can this get, beyond function foo (l) %{ STAP_PRINTF ("tid=%d oldfd=%d\n", current->pid, STAP_ARG_l); %} probe syscall.dup2 { foo ($oldfd) } ? Which parts of the latter can we elide? Could we improve the translator so that this elision can be done for general systemtap scripts, not just your special case? - FChE ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hashtable 2017-07-18 14:13 ` Hashtable Frank Ch. Eigler @ 2017-07-18 16:35 ` Arkady 0 siblings, 0 replies; 17+ messages in thread From: Arkady @ 2017-07-18 16:35 UTC (permalink / raw) To: Frank Ch. Eigler; +Cc: David Smith, systemtap

Frank, David, I appreciate your patience with me, my crazy ideas, and often very naive questions.

On Tue, Jul 18, 2017 at 5:13 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
>
> arkady.miasnikov wrote:
>
>> I am trying to implement full bypass of the code generated by the
>> SystemTap in some special cases. [...]
>
> OK, interesting.
>
>> My goal is to be able to write code like this:
>> probe syscall.dup2
>> { /* bypass */ /* arg: oldfd */
>> printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd);
>> // Unmanaged C code
>> }
>
> So like an embedded-C probe handler?

I am still working on it. I have changed the goal somewhat: the syntax I am trying to implement is something like

cprobe syscall.dup2 %{
    printk("%d", STAP_ARG_oldfd);
%}

where "cprobe" is a new keyword. This is a patch (not production code):
https://github.com/larytet/SystemTap/commit/aa3de76dc11a94dcd0456d493e381bc69bcddb16

>
>> [...] When generating the code I add the relevant dwarf_tvar_get_
>> calls and initialize "local" variables in the context
>> structure. Because the probes I am targeting are very short and simple
>> and do not involve nested calls I shortcut lot of variables copying.
>
> How much more efficient can this get, beyond
>
> function foo (l) %{
>     STAP_PRINTF ("tid=%d oldfd=%d\n", current->pid, STAP_ARG_l);
> %}
> probe syscall.dup2 { foo ($oldfd) }
>
> ? Which parts of the latter can we elide? Could we improve the
> translator so that this elision can be done for general systemtap
> scripts, not just your special case?

I have a very specific system test which I have to pass. The test looks like

bash "while [ 1 ]; do echo >> filename; done"

running on at least 4 cores.
With empty probes (STAP_ALIBI, for example) the performance impact of the 10 open/read/write/close + return probes is ~8%. The performance impact of a single probe is negligible. The problems arise when there are many probes executed concurrently on multicore machines. Ten simple probes which each call a function adding a value to a map have a performance overhead in the 30-35% range. I replaced the maps with a custom hashtable and that helped to reduce the impact by a third, to the 20-25% range. This is better, but my target is the 10-15% range (double the bare minimum of 8%).

This is what a typical probe looks like:
https://github.com/larytet/lockfree_hashtable/blob/master/dup_probe.stp#L429

probe syscall.dup {
    tid = tid()
    hashtable_u32_insert(tid, oldfd)
}
probe syscall.dup.return {
    tid = tid()
    oldfd = hashtable_u32_remove(tid)
    write_event_to_shared_memory(tid, oldfd)
}

I checked the C code generated by translate.cxx and the assembler generated by GCC. The end result is very good considering the non-trivial task of handling a scripting language. In my specific case it can be improved. For example, I do not need a lot of stack memory and I can remove all c->locals[c->nesting]. I cannot do that for all my probes, but for system call probes I probably can. I can replace dwarf_tvar_get() with inline functions and help GCC optimize out temporary variables, I can drop atomic_read/atomic_write, and so on. In a perfect world I would like to see a probe with 10 lines of C and under 200 hot (in-cache) opcodes.

I have stepped through elaborate.cxx and specifically semantic_pass(). It is very well written. I am very impressed. This is a non-trivial parser implemented in C++ with high-level code optimization. This is outstanding work. The code is very clean. Unfortunately I do not see quick-and-dirty ways to improve the code besides a special syntax for "unmanaged" probes in C, which brought me to the "cprobe syscall.dup2" syntax.
I have a very specific task: roughly 100 probes which cover most of the system calls. Most of the probes are very short. The system has to run smoothly and stay under the radar on multicore machines. Because it collects a lot of information across many different kernels, any custom Linux driver would be a pain. SystemTap solves 95% of what I need. I need just one small step forward.

>
> - FChE
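[Editorial note: the hashtable_u32_insert/remove helpers used in the probe example above can be sketched as a fixed-size open-addressing table keyed by tid. This is a minimal single-threaded illustration with assumed names and sizes, not the author's code; the real version (the linked gist) would need atomics/compare-and-swap for SMP safety.]

```c
#include <stdint.h>

#define HT_SIZE  1024u          /* power of two, so masking works as modulo */
#define HT_EMPTY 0u             /* tid 0 marks a free slot (never a real probe key here) */

struct ht_slot { uint32_t tid; uint32_t val; };
static struct ht_slot ht[HT_SIZE];   /* zero-initialized: all slots empty */

/* Knuth multiplicative hash, masked to the table size. */
static uint32_t ht_hash(uint32_t tid) { return (tid * 2654435761u) & (HT_SIZE - 1); }

/* Store val under tid with linear probing; 0 on success, -1 if the table is full. */
static int hashtable_u32_insert(uint32_t tid, uint32_t val)
{
    uint32_t i = ht_hash(tid);
    for (uint32_t n = 0; n < HT_SIZE; n++, i = (i + 1) & (HT_SIZE - 1)) {
        if (ht[i].tid == HT_EMPTY) {
            ht[i].tid = tid;
            ht[i].val = val;
            return 0;
        }
    }
    return -1;
}

/* Return the value stored under tid and free the slot, or (uint32_t)-1 if absent.
 * The scan does not stop at empty slots, so removals cannot break probe chains. */
static uint32_t hashtable_u32_remove(uint32_t tid)
{
    uint32_t i = ht_hash(tid);
    for (uint32_t n = 0; n < HT_SIZE; n++, i = (i + 1) & (HT_SIZE - 1)) {
        if (ht[i].tid == tid) {
            ht[i].tid = HT_EMPTY;
            return ht[i].val;
        }
    }
    return (uint32_t)-1;
}
```

A probe-entry handler would call hashtable_u32_insert(tid(), oldfd) and the matching return probe hashtable_u32_remove(tid()), exactly as in the script above; the fixed array avoids the locking and allocation of a systemtap associative map.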
end of thread, other threads:[~2017-07-18 16:35 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-05 16:47 Hashtable Arkady 2017-07-06 16:36 ` Hashtable David Smith 2017-07-06 17:26 ` Hashtable Arkady 2017-07-06 17:45 ` Hashtable Arkady 2017-07-06 17:51 ` Hashtable Arkady 2017-07-06 18:24 ` Hashtable David Smith 2017-07-07 3:26 ` Hashtable Arkady 2017-07-07 16:42 ` Hashtable Arkady 2017-07-10 15:20 ` Hashtable Frank Ch. Eigler 2017-07-13 16:52 Hashtable Arkady 2017-07-14 15:35 ` Hashtable David Smith 2017-07-14 17:16 ` Hashtable Arkady 2017-07-16 16:09 ` Hashtable Frank Ch. Eigler 2017-07-16 16:19 ` Hashtable Arkady 2017-07-17 8:58 ` Hashtable Arkady 2017-07-18 14:13 ` Hashtable Frank Ch. Eigler 2017-07-18 16:35 ` Hashtable Arkady