* Hashtable
From: Arkady @ 2017-07-05 16:47 UTC (permalink / raw)
To: systemtap

Hi,

I have a CPU bottleneck in some situations on heavily loaded servers.

From the tests it appears that associative maps contribute a significant
part of the overhead. I was thinking about replacing the maps with a
simple hashtable like this one (this is not production code):

https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9

Is there something like this in the STAP API which I could use?

Thank you, Arkady.

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hashtable
From: David Smith @ 2017-07-06 16:36 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

On Wed, Jul 5, 2017 at 11:46 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> Hi,
>
> I have a CPU bottleneck in some situations on heavy loaded servers.
>
> From the tests it appears that associative maps contribute significant
> part of the overhead.

... stuff deleted ...

Can you show us your script (or the associative-map portion) that
illustrates the performance problem? Perhaps we can make some
suggestions.

(Also note that I've started a background personal task to reduce the
use of locks in systemtap. I don't have much to show for it yet.)

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-06 17:26 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

On Thu, Jul 6, 2017 at 7:36 PM, David Smith <dsmith@redhat.com> wrote:
> Can you show us your script (or the associate map portion) that
> illustrates the performance problem? Perhaps we can make some
> suggestions.

My test is a tight loop:

file=echo_file_`date +%s%N`; echo $file; echo > $file; counter=1;
end=$((SECONDS+10)); while [ $SECONDS -lt $end ]; do echo $counter >> $file;
counter=$((counter+1)); done; tail -n 1 $file; rm -f $file

I run a number of these - one per core.

My stap script is something like this (8 probes; note the original posting
listed syscall.dup2.return twice, corrected here to the dup2 call/return
pair):

stap -g -e '%{
long long counter;
u8 shm[256];
static void* w_shm(void);
static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        {%{ {counter++; w_shm();} %}}
probe syscall.close.return {%{ {counter++; w_shm();} %}}
probe syscall.open         {%{ {counter++; w_shm();} %}}
probe syscall.open.return  {%{ {counter++; w_shm();} %}}
probe syscall.dup2         {%{ {counter++; w_shm();} %}}
probe syscall.dup2.return  {%{ {counter++; w_shm();} %}}
probe syscall.read         {%{ {counter++; w_shm();} %}}
probe syscall.read.return  {%{ {counter++; w_shm();} %}}
probe end { %{ {printk("\n%lli\n", counter);} %}}'

w_shm() simulates writes to the shared memory.

The performance impact is ~15% for 4 cores.

Then I add a map (global ar%):

stap -D MAXSKIPPED=0 -D MAXTRYLOCK=1000000 -D TRYLOCKDELAY=10 -g -e '
global ar%;
function w_ar() { ar[tid()] = tid(); }
%{
long long counter;
u8 shm[256];
static void* w_shm(void);
static void* w_shm() { memset(shm, 0, sizeof(shm)); return shm; }
%}
probe syscall.close        {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.close.return {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.open         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.open.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.dup2         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.dup2.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.read         {w_ar(); %{ {counter++; w_shm();} %}}
probe syscall.read.return  {w_ar(); %{ {counter++; w_shm();} %}}
probe end { %{ {printk("\n%lli\n", counter);} %}}'

With the map I get a ~35% hit, and the overhead grows with the number of
cores.

The scripts roughly reflect what I am doing in the actual code. I have
1-3 associative arrays per syscall type; for example, I keep separate
arrays for probe syscall.read and probe syscall.write. I have ~30
probes - I/O, networking, thread life cycle.

> (Also note that I've started a background personal task to reduce the
> use of locks in systemtap. I don't have much to show for it yet.)

It looks like the performance of the probes does not scale well with
the number of cores. I suspect that the spin locks at the beginning of
every probe are to blame.

Thank you, Arkady.
* Re: Hashtable
From: Arkady @ 2017-07-06 17:45 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

P.S. The performance is very sensitive to TRYLOCKDELAY, which is expected.

On Thu, Jul 6, 2017 at 8:25 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: Arkady @ 2017-07-06 17:51 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

P.S.2 Convenient links for copy & paste:

https://gist.githubusercontent.com/larytet/10ceddea609d2da17aa09558ed0e04bc/raw/05037d536e5edf0e2f5a45282c41b8fa46d1fd55/SystemTap_tests.sh

https://gist.githubusercontent.com/larytet/fc147587e9dfecfe99ab6bac2ba4aaa0/raw/670e385cb76798b526de9f4265046cf576c42f4e/SystemTap_tests

On Thu, Jul 6, 2017 at 8:45 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: David Smith @ 2017-07-06 18:24 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

OK. The first thing to do is run your script with one addition: add
'-DSTP_ALIBI' to the stap command line. This will compile/run the
script, but when a probe is hit it will return immediately. This will
give you an idea of the overhead of kprobes itself.

Another option would be to remove '-DSTP_ALIBI' from the command line
and add '-t' instead. Here's a description of what that option does:

-t  Collect timing information on the number of times probe executes
    and average amount of time spent in each probe-point. Also shows
    the derivation for each probe-point.

Once you've done that, we'll know more about what is going on.

On Thu, Jul 6, 2017 at 12:50 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-07 3:26 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

On Thu, Jul 6, 2017 at 9:24 PM, David Smith <dsmith@redhat.com> wrote:
> OK, The first thing to do is run your script with 1 addition. Add
> '-DSTP_ALIBI' to the stap command line. This will compile/run the
> script, but when a probe is hit it will return immediately.

I have been through this routine, including the -t flag.
10 empty probes introduce ~8% overhead.

> Once you've done that, we'll know more of what is going on.

The heaviest probes are thread-creation related. The following output
is from the production code; the lines are ordered by min time.

index state   hits   min*    avg    max  name
  168 on         1  79908  79908  79908  begin(-9223372036854775808)
    0 on         1  47496  47496  47496  begin
  114 on        50  16065  70322 528451  kernel.function("SyS_execve@/build/linux-EO9xOi/linux-4.4.0/fs/exec.c:1722").call
  171 on         1  14337  14337  14337  begin(-9223372036854775808)
  172 on         1  13149  13149  13149  begin(-9223372036854775808)
  132 on         1   9525   9525   9525  kernel.function("SyS_accept4@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1425").call
  121 on         1   8628   8628   8628  kernel.function("SyS_kill@/build/linux-EO9xOi/linux-4.4.0/kernel/signal.c:2847").call
  165 on       708   8247  18919 247540  timer.sec(1)
  145 on         2   6372   8148   9924  kernel.function("SyS_bind@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1363").call?
  170 on         1   5055   5055   5055  begin(-9223372036854775808)
  117 on        50   4731  11318  28385  kernel.function("SyS_execve@/build/linux-EO9xOi/linux-4.4.0/fs/exec.c:1722").return
  169 on         1   4416   4416   4416  begin(-9223372036854775808)
  124 on         5   4096   5494   7470  kernel.function("SyS_getcwd@/build/linux-EO9xOi/linux-4.4.0/fs/dcache.c:3244").return
   39 on       122   3088  11082  39507  kernel.function("SyS_sendto@/build/linux-EO9xOi/linux-4.4.0/net/socket.c:1619").call?

The following gist is a good representation of what I am doing in the
exec probe:

https://gist.github.com/larytet/b8155e9d01942984888e90c88c23c206

When I look into the C code generated by stap, there are quite a number
of strlcpy() calls for the filename and args. Some of them are not
strictly necessary, but I understand that a generic framework has costs.

I figure that the arrays are a low-hanging fruit. I hope that by
limiting the hashtable API I can build a lockless hashtable along the
lines of
https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9.
I expect a 15%-30% reduction in the performance overhead. The
improvement will come from two things:

* A single copy of the filename - I am going to copy the filename/args
  from user space directly to the driver's shared memory instead of
  into the map
* No spin locks are required for the hashmap

Thank you, Arkady.
* Re: Hashtable
From: Arkady @ 2017-07-07 16:42 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

This simple script demonstrates the problem when running on a multicore
machine:

#!/bin/bash
function echo_loop()
{
    file=echo_file_`date +%s%N`
    echo $file
    echo > $file
    counter=1
    end=$((SECONDS+10))
    while [ $SECONDS -lt $end ]; do
        echo $counter >> $file
        counter=$((counter+1))
    done
    tail -n 1 $file
    rm -f $file
}

CPUS=`grep -c ^processor /proc/cpuinfo`
pids=""
counter=0
while [ $counter -lt $CPUS ]; do
    echo_loop &
    pids="$pids $!"
    counter=$(($counter+1))
done
wait $pids

This short script generates more than 10M system calls/s on a 40-core
machine. I do not know the exact rate because the performance overhead of
my STAP script exceeds 30%. This rate represents a challenge for the
system I develop, which collects system calls and copies the arguments to
shared memory.

On Fri, Jul 7, 2017 at 6:26 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable
From: Frank Ch. Eigler @ 2017-07-10 15:20 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

arkady.miasnikov wrote:

> [...]
> From the tests it appears that associative maps contribute significant
> part of the overhead. I was thinking about replacing the maps with a
> simple hashtable like this one (this is not production code)
> https://gist.github.com/larytet/c306d470f7b032c796efad15dcd609a9

The systemtap maps are already hash tables. The complication here is
that you're using them in a highly contended way, so locking effects
dominate algorithmic ones. PR10796 is probably more relevant to your
case.

- FChE
* Re: Hashtable
From: Arkady @ 2017-07-13 16:52 UTC (permalink / raw)
To: systemtap

For the sake of web searches: this is a reply to the thread "25400 by:
Arkady".

The following code demonstrates about a 20% improvement in latency:

https://github.com/larytet/lockfree_hashtable
* Re: Hashtable
From: David Smith @ 2017-07-14 15:35 UTC (permalink / raw)
To: Arkady; +Cc: systemtap

Arkady,

If you think the existing systemtap maps have too much overhead, we'd
welcome you to take a shot at reworking them. This could benefit all
systemtap users. We welcome all contributors. You and/or your employer
will need to be OK with you contributing code under the GPLv2 license.

Take the existing systemtap source code and start replacing the
existing map functionality with your improved map code. When redoing
code like this, you'd want to make sure the existing map tests have
the same results with your code as with the original code. We'd be
happy to consult with you along the way.

Let us know if you have questions or need pointers on where to look first.

On Thu, Jul 13, 2017 at 11:51 AM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]

-- 
David Smith
Principal Software Engineer
Red Hat
* Re: Hashtable
From: Arkady @ 2017-07-14 17:16 UTC (permalink / raw)
To: David Smith; +Cc: systemtap

David,

There are numerous problems with my hashtable approach. Beyond raw
performance, it is not in any way better than the maps: a custom chunk
of code tailored for a task will beat a generic library in a specific
test. I do not have anything to replace the maps with, and nothing
generic enough to be worth adding to SystemTap - not yet, anyway. A
"map" fine-tuned for a thread-ID key is an idea I am considering.

I am playing with something I call "STAP bypass" - allowing a pure C
implementation of a probe to be registered. This gives about a 1% gain,
part of which comes from removing calls to _stp_print_flush(). A define
symbol like STP_TRACE_OFF would be nice.

Unfortunately, I doubt my employer is willing to spend my time on
extending STAP functionality - the very opposite is true. I am trying
to minimize any STAP hacking unless I encounter an issue I cannot solve
otherwise (the begin and end probes are such an issue).

The only goal of my mail below was to help anyone who is trying to
tackle a very specific corner case.

On Fri, Jul 14, 2017 at 6:35 PM, David Smith <dsmith@redhat.com> wrote:
> [...]
* Re: Hashtable
From: Frank Ch. Eigler @ 2017-07-16 16:09 UTC (permalink / raw)
To: Arkady; +Cc: David Smith, systemtap

arkady.miasnikov wrote:

> [...]
> I am playing with something I call "STAP bypass" - allow to register a
> pure C implementation of a probe. This gives about 1% gain part of
> which comes from removing calls to _stp_print_flush(). A define symbol
> like STP_TRACE_OFF would be nice. [...]

That _stp_print_flush() bit could be elided by analysis of the staptree
associated with the probe; if no print*s are found there (or in any
function transitively called), then there is no need to flush. (For
embedded-C functions, we could add a /* print */ pragma to handle that
in the analysis.)

- FChE
* Re: Hashtable
From: Arkady @ 2017-07-16 16:19 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: David Smith, systemtap

Eliding the trace would be cool.

On Sun, Jul 16, 2017 at 7:09 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
> [...]
* Re: Hashtable
From: Arkady @ 2017-07-17 8:58 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: David Smith, systemtap

I am trying to implement a full bypass of the code generated by
SystemTap in some special cases. My goal is to be able to write code
like this:

probe syscall.dup2
{ /* bypass */ /* arg: oldfd */
    printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd);
    // Unmanaged C code
}

I have not finished the code yet. I added two new pragma comments:

/* bypass */
/* arg: <argument name> */

When generating the code I add the relevant dwarf_tvar_get_* calls and
initialize "local" variables in the context structure. Because the
probes I am targeting are very short and simple and do not involve
nested calls, I can shortcut a lot of the variable copying.

Is there a chance for something similar in SystemTap's future?

Thank you, Arkady.

On Sun, Jul 16, 2017 at 7:19 PM, Arkady <arkady.miasnikov@gmail.com> wrote:
> [...]
* Re: Hashtable 2017-07-17 8:58 ` Hashtable Arkady @ 2017-07-18 14:13 ` Frank Ch. Eigler 2017-07-18 16:35 ` Hashtable Arkady 0 siblings, 1 reply; 17+ messages in thread From: Frank Ch. Eigler @ 2017-07-18 14:13 UTC (permalink / raw) To: Arkady; +Cc: David Smith, systemtap arkady.miasnikov wrote: > I am trying to implement full bypass of the code generated by the > SystemTap in some special cases. [...] OK, interesting. > My goal is to be able to write code like this: > probe syscall.dup2 > { /* bypass */ /* arg: oldfd */ > printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd); > // Unmanaged C code > } So like an embedded-C probe handler? > [...] When generating the code I add the relevant dwarf_tvar_get_ > calls and initialize "local" variables in the context > structure. Because the probes I am targeting are very short and simple > and do not involve nested calls I shortcut lot of variables copying. How much more efficient can this get, beyond function foo (l) %{ STAP_PRINTF ("tid=%d oldfd=%d\n", current->pid, STAP_ARG_l); %} probe syscall.dup2 { foo ($oldfd) } ? Which parts of the latter can we elide? Could we improve the translator so that this elision can be done for general systemtap scripts, not just your special case? - FChE ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Hashtable 2017-07-18 14:13 ` Hashtable Frank Ch. Eigler @ 2017-07-18 16:35 ` Arkady 0 siblings, 0 replies; 17+ messages in thread From: Arkady @ 2017-07-18 16:35 UTC (permalink / raw) To: Frank Ch. Eigler; +Cc: David Smith, systemtap

Frank, David, I appreciate your patience with me, my crazy ideas, and often very naive questions.

On Tue, Jul 18, 2017 at 5:13 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
>
> arkady.miasnikov wrote:
>
>> I am trying to implement full bypass of the code generated by the
>> SystemTap in some special cases. [...]
>
> OK, interesting.
>
>> My goal is to be able to write code like this:
>> probe syscall.dup2
>> { /* bypass */ /* arg: oldfd */
>> printk(KERN_ALERT "tid=%d oldfd=%d\n", current->pid, l->l_oldfd);
>> // Unmanaged C code
>> }
>
> So like an embedded-C probe handler?

I am still working on it. I have changed the goal somewhat: the syntax I am trying to implement is something like

cprobe syscall.dup2 %{
    printk("%d", STAP_ARG_oldfd);
%}

where "cprobe" is a new keyword. This is a patch (not production code):
https://github.com/larytet/SystemTap/commit/aa3de76dc11a94dcd0456d493e381bc69bcddb16

>
>> [...] When generating the code I add the relevant dwarf_tvar_get_
>> calls and initialize "local" variables in the context
>> structure. Because the probes I am targeting are very short and simple
>> and do not involve nested calls I shortcut lot of variables copying.
>
> How much more efficient can this get, beyond
>
> function foo (l) %{
>     STAP_PRINTF ("tid=%d oldfd=%d\n", current->pid, STAP_ARG_l);
> %}
> probe syscall.dup2 { foo ($oldfd) }
>
> ? Which parts of the latter can we elide? Could we improve the
> translator so that this elision can be done for general systemtap
> scripts, not just your special case?

I have a very specific system test which I have to pass. The test looks like

bash "while [ 1 ]; do echo >> filename; done"

running on at least 4 cores.
With empty probes (STAP_ALIBI, for example) the performance impact of the 10 open/read/write/close + return probes is ~8%. The performance impact of a single probe is negligible. The problems arise when there are many probes executed concurrently on multicore machines. Ten simple probes which each call a function adding a value to a map have a performance overhead in the 30-35% range. I replaced the maps with a custom hashtable and that helped to reduce the impact by a third, to the 20-25% range. This is better, but my target is the 10-15% range (double the bare minimum of 8%).

This is what a typical probe looks like:
https://github.com/larytet/lockfree_hashtable/blob/master/dup_probe.stp#L429

probe syscall.dup {
    tid = tid()
    hashtable_u32_insert(tid, oldfd)
}
probe syscall.dup.return {
    tid = tid()
    oldfd = hashtable_u32_remove(tid)
    write_event_to_shared_memory(tid, oldfd)
}

I checked the C code generated by translate.cxx and the assembler generated by GCC. The end result is very good considering the non-trivial task of handling a scripting language. In my specific case it can be improved. For example, I do not need a lot of stack memory and I can remove all c->locals[c->nesting]. I cannot do that for all my probes, but for system call probes I probably can. I can replace dwarf_tvar_get() with inline functions and help GCC optimize out temporary variables, I can drop atomic_read/atomic_write, and so on. In a perfect world I would like to see a probe with 10 lines of C and under 200 hot (in-cache) opcodes.

I have stepped through elaborate.cxx and specifically semantic_pass(). It is very well written. I am very impressed. This is a non-trivial parser implemented in C++ with high-level code optimization. This is outstanding work. The code is very clean. Unfortunately I do not see quick-and-dirty ways to improve the code besides a special syntax for "unmanaged" probes in C, which brought me to the "cprobe syscall.dup2" syntax.
I have a very specific task: roughly 100 probes which cover most of the system calls. Most of the probes are very short. The system has to run smoothly and stay under the radar on multicore machines. Because it collects a lot of information across many different kernels, any custom Linux driver would be a pain. SystemTap solves 95% of what I need. I need just one small step forward.

>
> - FChE
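[Editorial note: the hashtable_u32_insert/remove helpers used in the probe example above can be sketched as a fixed-size open-addressing table keyed by tid. This is a minimal single-threaded illustration with assumed names and sizes, not the author's code; the real version (the linked gist) would need atomics/compare-and-swap for SMP safety.]

```c
#include <stdint.h>

#define HT_SIZE  1024u          /* power of two, so masking works as modulo */
#define HT_EMPTY 0u             /* tid 0 marks a free slot (never a real probe key here) */

struct ht_slot { uint32_t tid; uint32_t val; };
static struct ht_slot ht[HT_SIZE];   /* zero-initialized: all slots empty */

/* Knuth multiplicative hash, masked to the table size. */
static uint32_t ht_hash(uint32_t tid) { return (tid * 2654435761u) & (HT_SIZE - 1); }

/* Store val under tid with linear probing; 0 on success, -1 if the table is full. */
static int hashtable_u32_insert(uint32_t tid, uint32_t val)
{
    uint32_t i = ht_hash(tid);
    for (uint32_t n = 0; n < HT_SIZE; n++, i = (i + 1) & (HT_SIZE - 1)) {
        if (ht[i].tid == HT_EMPTY) {
            ht[i].tid = tid;
            ht[i].val = val;
            return 0;
        }
    }
    return -1;
}

/* Return the value stored under tid and free the slot, or (uint32_t)-1 if absent.
 * The scan does not stop at empty slots, so removals cannot break probe chains. */
static uint32_t hashtable_u32_remove(uint32_t tid)
{
    uint32_t i = ht_hash(tid);
    for (uint32_t n = 0; n < HT_SIZE; n++, i = (i + 1) & (HT_SIZE - 1)) {
        if (ht[i].tid == tid) {
            ht[i].tid = HT_EMPTY;
            return ht[i].val;
        }
    }
    return (uint32_t)-1;
}
```

A probe-entry handler would call hashtable_u32_insert(tid(), oldfd) and the matching return probe hashtable_u32_remove(tid()), exactly as in the script above; the fixed array avoids the locking and allocation of a systemtap associative map.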
end of thread, other threads:[~2017-07-18 16:35 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-05 16:47 Hashtable Arkady 2017-07-06 16:36 ` Hashtable David Smith 2017-07-06 17:26 ` Hashtable Arkady 2017-07-06 17:45 ` Hashtable Arkady 2017-07-06 17:51 ` Hashtable Arkady 2017-07-06 18:24 ` Hashtable David Smith 2017-07-07 3:26 ` Hashtable Arkady 2017-07-07 16:42 ` Hashtable Arkady 2017-07-10 15:20 ` Hashtable Frank Ch. Eigler 2017-07-13 16:52 Hashtable Arkady 2017-07-14 15:35 ` Hashtable David Smith 2017-07-14 17:16 ` Hashtable Arkady 2017-07-16 16:09 ` Hashtable Frank Ch. Eigler 2017-07-16 16:19 ` Hashtable Arkady 2017-07-17 8:58 ` Hashtable Arkady 2017-07-18 14:13 ` Hashtable Frank Ch. Eigler 2017-07-18 16:35 ` Hashtable Arkady