* kernel function probe overhead
From: Aubrey Li @ 2016-11-25 16:07 UTC
To: systemtap

Hi list,

I'm trying to profile some Linux kernel subsystems with systemtap, so I
want to know the overhead of the various kinds of systemtap probes.
Basically I sampled two kernel function probes:

probe kernel.function("arch_cpu_idle_enter").call
probe kernel.function("local_touch_nmi").call

========
A snippet of the objdump of the code:

ffffffff81035740 <arch_cpu_idle_enter>:
ffffffff81035740:  e8 4b 76 55 00   callq  ffffffff8158cd90 <__fentry__>
ffffffff81035745:  e8 f6 96 ff ff   callq  ffffffff8102ee40 <local_touch_nmi>
========

Because local irq is disabled, I expected the duration between these two
probes to be essentially ZERO. However, when I run the attached stp
script, here is what I get:

$ stap test.stp
Press ^C to stop
Total time: 10106 milliseconds
t_end @count=2908 @min=676 @max=88152 @sum=15005802 @avg=5160

The duration varies from 676 to 88152. I understand that every tracing
mechanism has overhead, and 676 ns is acceptable, but 88152 ns is not -
and I have no idea why the range is so big. Please let me know if I can
provide more information on my side.

Thanks,
-Aubrey

$ stap -V
Systemtap translator/driver (version 3.1/0.167, commit release-3.0-273-g7d27ff5084d6)
Copyright (C) 2005-2016 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
tested kernel versions: 2.6.18 ... 4.8
enabled features: NLS

$ uname -a
Linux aubrey-hp 4.8.10+ #5 SMP Fri Nov 25 02:11:32 CST 2016 x86_64 GNU/Linux

[-- Attachment #2: test.stp --]

#!/usr/local/bin/stap

global init_time
global t_start
global t_end

probe begin {
    printf("Press ^C to stop\n")
    init_time = gettimeofday_ms()
}

probe kernel.function("arch_cpu_idle_enter").call {
    t_start[cpu()] = gettimeofday_ns()
}

probe kernel.function("local_touch_nmi").call {
    if (t_start[cpu()]) {
        t_end <<< gettimeofday_ns() - t_start[cpu()]
        t_start[cpu()] = 0
    }
}

probe timer.s(10) {
    exit()
}

probe end {
    printf("Total time: %d milliseconds\n", gettimeofday_ms() - init_time)
}
* Re: kernel function probe overhead
From: Frank Ch. Eigler @ 2016-11-25 17:22 UTC
To: Aubrey Li; +Cc: systemtap

aubreylee wrote:

> [...]
> Basically I sampled two kernel function probes:
>
> probe kernel.function("arch_cpu_idle_enter").call
> probe kernel.function("local_touch_nmi").call

... where one directly calls the other. Good test!

> Because local irq is disabled, I expected the duration between these two
> probes to be essentially ZERO.

Yes, ideally.

> However, when I run the attached stp script, here is what I get:
> [...]
> Total time: 10106 milliseconds
> t_end @count=2908 @min=676 @max=88152 @sum=15005802 @avg=5160
>
> The duration varies from 676 to 88152. I understand that every tracing
> mechanism has overhead, and 676 ns is acceptable, but 88152 ns is not.
>
> probe kernel.function("arch_cpu_idle_enter").call
>     t_start[cpu()] = gettimeofday_ns()
> probe kernel.function("local_touch_nmi").call
>     t_end <<< gettimeofday_ns() - t_start[cpu()]

A couple of things come to mind.

The gettimeofday_ns() function uses a software-estimated real-time
clock, which may not be steady enough for fine-grained measurements like
this. Consider using get_cycles() or the functions in the
[man tapset::timestamp_monotonic] tapset.

Another possibility is jitter, such as caching effects within the kernel
due to activity elsewhere. The average is pretty low. Consider printing
a histogram of t_end, or (if using git systemtap) its @variance.

Another possibility is contention over the t_start[] variable, if
multiple processors are running this region of code and so trigger the
systemtap probes concurrently. systemtap automatically locks such global
variables, especially if they contain non-aggregate values like the
scalar timestamps.

Consider using "stap -t FOO.stp" to have the tool print internal
estimates of its own probes' runtimes. That would include lock
contention, but exclude kernel-side overheads like the code involved in
the kprobes machinery.

- FChE
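For concreteness, a minimal variant of test.stp along those lines could
look like the following sketch. It is only a sketch: it swaps in
local_clock_ns() from the timestamp_monotonic tapset and prints a log2
histogram of t_end at exit, and leaves everything else unchanged from
the original script.

#!/usr/local/bin/stap
# Sketch: same measurement as test.stp, but on a monotonic clock,
# with a histogram of the samples printed at exit.

global t_start
global t_end

probe kernel.function("arch_cpu_idle_enter").call {
    t_start[cpu()] = local_clock_ns()       # monotonic per-CPU timestamp
}

probe kernel.function("local_touch_nmi").call {
    if (t_start[cpu()]) {
        t_end <<< local_clock_ns() - t_start[cpu()]
        t_start[cpu()] = 0
    }
}

probe timer.s(10) { exit() }

probe end {
    print(@hist_log(t_end))                 # log2 histogram of the durations
}

The histogram output from @hist_log() is what the follow-up below
reports for each of the three timestamp functions.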
* Re: kernel function probe overhead
From: Aubrey Li @ 2016-11-26 6:18 UTC
To: Frank Ch. Eigler; +Cc: systemtap

On Sat, Nov 26, 2016 at 1:22 AM, Frank Ch. Eigler <fche@redhat.com> wrote:

Thanks for your reply, Frank.

> A couple of things come to mind. The gettimeofday_ns() function uses
> a software-estimated real-time clock, which may not be steady enough
> for fine-grained measurements like this.

Timestamp function: gettimeofday_ns()
value  |-------------------------------------------------- count
    32 |                                                       0
    64 |                                                       0
   128 |                                                       6
   256 |@                                                     62
   512 |@@@@@@@@@@                                           434
  1024 |@@@@@@@@@@@                                          487
  2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2014
  4096 |                                                      35
  8192 |@                                                     44
 16384 |                                                      28
 32768 |                                                      20
 65536 |                                                       0
131072 |                                                       0

Total time: 10096 milliseconds

> Consider using get_cycles()

Timestamp function: get_cycles()
value  |-------------------------------------------------- count
    64 |                                                       0
   128 |                                                       0
   256 |                                                       9
   512 |@@@                                                   91
  1024 |@@@@@@@@@@                                           313
  2048 |@@@@@@@@@@@@@@                                       420
  4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1495
  8192 |@@@@@@@@@@@                                          350
 16384 |@@                                                    83
 32768 |                                                      29
 65536 |@                                                     31
131072 |                                                       1
262144 |                                                       0
524288 |                                                       0

> or the functions in the [man tapset::timestamp_monotonic] tapset.

Timestamp function: local_clock_ns()
value  |-------------------------------------------------- count
    32 |                                                       0
    64 |                                                       0
   128 |                                                       6
   256 |@                                                     57
   512 |@@@@@@@@@@                                           377
  1024 |@@@@@@@@@@@                                          426
  2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1792
  4096 |                                                      29
  8192 |@                                                     50
 16384 |                                                      23
 32768 |                                                      23
 65536 |                                                       0
131072 |                                                       0

> Another possibility is jitter, such as caching effects within the kernel
> due to activity elsewhere.

I doubt caching effects cause such a big variance.

> The average is pretty low. Consider printing a histogram of t_end, or
> (if using git systemtap) its @variance.
>
> Another possibility is contention over the t_start[] variable, if
> multiple processors are running this region of code and so trigger the
> systemtap probes concurrently. systemtap automatically locks such global
> variables, especially if they contain non-aggregate values like the
> scalar timestamps.

Do we have an approach to probe the function on CPU0 only?

> Consider using "stap -t FOO.stp" to have the tool print internal
> estimates of its own probes' runtimes. That would include lock
> contention, but exclude kernel-side overheads like the code involved in
> the kprobes machinery.

The "-t" option output:
----- probe hit report:
begin, (test2.stp:8:1), hits: 1, cycles: 3280min/3280avg/3280max, from: begin, index: 0
kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call, (test2.stp:13:1), hits: 2801, cycles: 560min/34290avg/230611max, from: kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call from: kernel.function("arch_cpu_idle_enter").call, index: 2
kernel.function("local_touch_nmi@arch/x86/kernel/nmi.c:561").call, (test2.stp:18:1), hits: 2801, cycles: 396min/5118avg/160096max, from: kernel.function("local_touch_nmi@arch/x86/kernel/nmi.c:561").call from: kernel.function("local_touch_nmi").call, index: 3
timer.s(10), (test2.stp:26:1), hits: 1, cycles: 126258min/126258avg/126258max, from: timer.s(10), index: 4
end, (test2.stp:32:1), hits: 1, cycles: 6240min/6240avg/6240max, from: end, index: 5
----- refresh report:

Thanks,
-Aubrey
* Re: kernel function probe overhead
From: Frank Ch. Eigler @ 2016-11-28 19:19 UTC
To: Aubrey Li; +Cc: systemtap

aubreylee wrote:

>> A couple of things come to mind. The gettimeofday_ns() function uses
>> a software-estimated real-time clock, which may not be steady enough
>> for fine-grained measurements like this.
>
> Timestamp function: gettimeofday_ns()
> [...]
> Timestamp function: get_cycles()
> [...]
> Timestamp function: local_clock_ns()
> [...]

Thanks for testing! The distributions of those three seem remarkably
similar - a nice central peak, and a second small peak at 4-8x the
first. It suggests we can probably rule out clock problems.

>> The average is pretty low. Consider printing a histogram of t_end, or
>> (if using git systemtap) its @variance.
>>
>> Another possibility is contention over the t_start[] variable, if
>> multiple processors are running this region of code and so trigger the
>> systemtap probes concurrently. systemtap automatically locks such global
>> variables, especially if they contain non-aggregate values like the
>> scalar timestamps.
>
> Do we have an approach to probe the function on CPU0 only?

Not directly; the kernel does not give us such control. The probe could
include a

    if (cpu() != 0) next;

at the top of the probe handler body. That way locks are still taken,
but released quickly on the other CPUs.

Another solution would be to have per-thread global variables. Since
the kernel doesn't help us out with this, we'd have to emulate it
somehow, and for this to be worthwhile, it'd have to work faster than
ordinary locked/shared array[tid()] accesses.

>> Consider using "stap -t FOO.stp" to have the tool print internal
>> estimates of its own probes' runtimes. That would include lock
>> contention, but exclude kernel-side overheads like the code involved in
>> the kprobes machinery.
>
> The "-t" option output:

Ah, I note that we don't track occurrences of lock contention, even
though we could. Opened https://sourceware.org/bugzilla/show_bug.cgi?id=20879

> [...]
> kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call,
> (test2.stp:13:1), hits: 2801, cycles: 560min/34290avg/230611max, from:
> [...]
> kernel.function("local_touch_nmi@arch/x86/kernel/nmi.c:561").call,
> (test2.stp:18:1), hits: 2801, cycles: 396min/5118avg/160096max, from:
> [...]

That inconsistency suggests dramatic contention, caching, or some other
source of jitter. Having the variance figure in there would be useful,
though. Would anyone like to give that a try? (translate.cxx line
2283ish and the runtime/linux/common_session_state.h _stp_stat_init
area.)

- FChE
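Applied to test.stp, that guard would look something like the following
sketch. Only the two function probes change; the globals and the other
probes stay exactly as posted at the top of the thread.

probe kernel.function("arch_cpu_idle_enter").call {
    if (cpu() != 0) next;           # bail out early on every CPU except 0
    t_start[cpu()] = gettimeofday_ns()
}

probe kernel.function("local_touch_nmi").call {
    if (cpu() != 0) next;
    if (t_start[cpu()]) {
        t_end <<< gettimeofday_ns() - t_start[cpu()]
        t_start[cpu()] = 0
    }
}

Note that the probes still fire, and briefly take the lock, on the other
CPUs; the guard only filters which samples are recorded.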
* Re: kernel function probe overhead
From: Aubrey Li @ 2016-11-29 3:31 UTC
To: Frank Ch. Eigler; +Cc: systemtap

On Tue, Nov 29, 2016 at 3:19 AM, Frank Ch. Eigler <fche@redhat.com> wrote:

>> Do we have an approach to probe the function on CPU0 only?
>
> Not directly; the kernel does not give us such control. The probe could
> include a
>
>     if (cpu() != 0) next;
>
> at the top of the probe handler body. That way locks are still taken,
> but released quickly on the other CPUs.

Thanks for the suggestion.

Original result:
t_end @count=4800 @min=740 @max=300281 @sum=90589877 @avg=18872

Adding "if (cpu() != 0) next;" in the probe handler:
t_end @count=161 @min=653 @max=114856 @sum=1780522 @avg=11059

And if I add "maxcpus=1" to the kernel cmdline:
t_end @count=1190 @min=573 @max=14529 @sum=1406437 @avg=1181

The last one is much better, so it looks like we have a scalability
opportunity here.

> Another solution would be to have per-thread global variables. Since
> the kernel doesn't help us out with this, we'd have to emulate it
> somehow, and for this to be worthwhile, it'd have to work faster than
> ordinary locked/shared array[tid()] accesses.

Is this a suggestion for me? How would I do this?

>> The "-t" option output:
>
> Ah, I note that we don't track occurrences of lock contention, even
> though we could. Opened https://sourceware.org/bugzilla/show_bug.cgi?id=20879

Okay, thanks, I'll keep an eye on this ticket.

> That inconsistency suggests dramatic contention, caching, or some other
> source of jitter. Having the variance figure in there would be useful,
> though. Would anyone like to give that a try? (translate.cxx line
> 2283ish and the runtime/linux/common_session_state.h _stp_stat_init
> area.)
> If there is a patch delivered against git head, I can apply it and paste the output here. Thanks, -Aubrey > - FChE ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel function probe overhead
From: Josh Stone @ 2016-11-29 19:21 UTC
To: Aubrey Li, Frank Ch. Eigler; +Cc: systemtap

On 11/28/2016 07:30 PM, Aubrey Li wrote:
>> Another solution would be to have per-thread global variables. Since
>> the kernel doesn't help us out with this, we'd have to emulate it
>> somehow, and for this to be worthwhile, it'd have to work faster than
>> ordinary locked/shared array[tid()] accesses.
>
> Is this a suggestion for me? How would I do this?

There are relevant RFEs for stap already:

per-cpu:    https://sourceware.org/bugzilla/show_bug.cgi?id=10795
per-thread: https://sourceware.org/bugzilla/show_bug.cgi?id=10796

The latter was closed on the hunch that it wouldn't help much, but that
could still be explored if someone is interested.
* Re: kernel function probe overhead
From: Cody Santing @ 2016-12-01 18:47 UTC
To: Aubrey Li, Frank Ch. Eigler, Josh Stone; +Cc: systemtap

Hello Aubrey,

>> Ah, I note that we don't track occurrences of lock contention, even
>> though we could. Opened https://sourceware.org/bugzilla/show_bug.cgi?id=20879
>
> Okay, thanks, I'll keep an eye on this ticket.

I have yet to update the ticket, but I have added tracking of lock
contention. The number of lock contention occurrences on global
variables will be printed following the probe hit report from the '-t'
option.

>> That inconsistency suggests dramatic contention, caching, or some other
>> source of jitter. Having the variance figure in there would be useful,
>> though. Would anyone like to give that a try? (translate.cxx line
>> 2283ish and the runtime/linux/common_session_state.h _stp_stat_init
>> area.)

I have also added variance to the '-t' option. It immediately follows
the min/avg/max of each probe.

> If there is a patch delivered against git head, I can apply it and
> paste the output here.

You should find the patch in git head now. Give it a shot; if you have
any feedback I would like to hear it.

Thank you,
Cody Santing
[parent not found: <CAGPKeUKm9EZZxL=MHZX9_0N5SUYX291mJMTCExYSspRceJgrxg@mail.gmail.com>]
* Fwd: kernel function probe overhead
From: Cody Santing @ 2016-12-12 17:07 UTC
To: systemtap

---------- Forwarded message ----------
From: Aubrey Li <aubreylee@gmail.com>
Date: Thu, Dec 8, 2016 at 8:44 AM
Subject: Re: kernel function probe overhead
To: Cody Santing <csanting@redhat.com>

On Fri, Dec 2, 2016 at 2:47 AM, Cody Santing <csanting@redhat.com> wrote:
>
> I have yet to update the ticket, but I have added tracking of lock
> contention. The number of lock contention occurrences on global
> variables will be printed following the probe hit report from the '-t'
> option.
>
> I have also added variance to the '-t' option. It immediately follows
> the min/avg/max of each probe.
>
> You should find the patch in git head now. Give it a shot; if you have
> any feedback I would like to hear it.

Thanks Cody. Here is what I got:

============================================================
$ sudo stap -t test.stp
-----------------------------------------
Timestamp function: gettimeofday_ns()
Total time: 10092 milliseconds
t_end @count=2962 @min=268 @max=81878 @sum=10843684 @avg=3660
----- probe hit report:
begin, (./test.stp:8:1), hits: 1, cycles: 3752min/3752avg/3752max, variance: 0, from: begin, index: 0
kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call, (./test.stp:14:1), hits: 2962, cycles: 400min/30951avg/233043max, variance: 824817112, from: kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call from: kernel.function("arch_cpu_idle_enter").call, index: 2

May I know what the variance value means here?

kernel.function("local_touch_nmi@arch/x86/kernel/nmi.c:561").call, (./test.stp:19:1), hits: 2962, cycles: 288min/4483avg/177541max, variance: 36900201, from: kernel.function("local_touch_nmi@arch/x86/kernel/nmi.c:561").call from: kernel.function("local_touch_nmi").call, index: 3
timer.s(10), (./test.stp:27:1), hits: 1, cycles: 6974min/6974avg/6974max, variance: 0, from: timer.s(10), index: 4
end, (./test.stp:32:1), hits: 1, cycles: 6928min/6928avg/6928max, variance: 0, from: end, index: 5
end, (./test.stp:1:1), hits: 1, cycles: 9956min/9956avg/9956max, variance: 0, from: end, index: 6
----- refresh report:
'__global_t_start' lock contention occurred 2853 times
============================================================

Would it be nice to report the time cost of lock contention, too?

Anyway, instead of just reporting lock contention, it would be nice if
we could reduce it. :)

Thanks,
-Aubrey
* Re: Fwd: kernel function probe overhead
From: Frank Ch. Eigler @ 2016-12-12 17:43 UTC
To: Aubrey Li, Cody Santing; +Cc: systemtap

aubreylee wrote:

> Thanks Cody. Here is what I got:
>
> ============================================================
> $ sudo stap -t test.stp
> -----------------------------------------
> [...]
> kernel.function("arch_cpu_idle_enter@arch/x86/kernel/process.c:277").call,
> (./test.stp:14:1), hits: 2962, cycles: 400min/30951avg/233043max,
> variance: 824817112, from: [...]
>
> May I know what the variance value means here?

This is normal statistical variance, i.e., the square of the standard
deviation. It means that the bulk of the probe execution times were
within sqrt(824817112), about 28720 cycles, of the mean, if the
distribution were roughly normal.

> ----- refresh report:
> '__global_t_start' lock contention occurred 2853 times
>
> Would it be nice to report the time cost of lock contention, too?

That would be roughly #-contentions * TRYLOCKDELAY (default 10us), so
about 29 ms here.

> Anyway, instead of just reporting lock contention, it would be nice if
> we could reduce it. :)

Sure, but a good first step is to know the magnitude of the problem.

- FChE
end of thread, other threads: [~2016-12-12 17:43 UTC | newest]

Thread overview: 9+ messages
2016-11-25 16:07 kernel function probe overhead Aubrey Li
2016-11-25 17:22 ` Frank Ch. Eigler
2016-11-26  6:18   ` Aubrey Li
2016-11-28 19:19     ` Frank Ch. Eigler
2016-11-29  3:31       ` Aubrey Li
2016-11-29 19:21         ` Josh Stone
2016-12-01 18:47         ` Cody Santing
     [not found]           ` <CAGPKeUKm9EZZxL=MHZX9_0N5SUYX291mJMTCExYSspRceJgrxg@mail.gmail.com>
2016-12-12 17:07             ` Fwd: kernel function probe overhead Cody Santing
2016-12-12 17:43               ` Frank Ch. Eigler