* [Bug runtime/10234] clean up aggregate hard-coded logic
[not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
@ 2016-09-13 15:26 ` mcermak at redhat dot com
2016-09-13 18:53 ` fche at redhat dot com
` (8 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: mcermak at redhat dot com @ 2016-09-13 15:26 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
Martin Cermak <mcermak at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |mcermak at redhat dot com
--- Comment #3 from Martin Cermak <mcermak at redhat dot com> ---
(In reply to Frank Ch. Eigler from comment #0)
> the runtime makes a number of unfortunate assumptions about aggregate
> objects:
>
> - that all of count/min/max/sum should be tracked
The @variance() operator, introduced per PR11308, comes with a mechanism that
tracks the stat operators used with a given global within the script. This
mechanism is used to skip the variance calculations when they are not needed.
Since variance is significantly more expensive than min or max, the most
beneficial optimization is already in place. Looking at min, max, sum, and
count, I think the effect of optimizing them out is close to unmeasurable.
I've been repeatedly running the following command on my workstation:
=======
rm -rf ~/.systemtap; for k in `seq 1 500`; do stap -t -g --suppress-time-limits -e 'global x probe oneshot{ for(i; i<40000; i++) x<<<i println(@count(x)) }' | sed 's/avg//g' | awk -F\/ '/^begin/ {print $2/1e3}'; done | awk '{ total += $1; count++ } END { print total/count }'
=======
and comparing the results with and without the following patch:
=======
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..70a46ea 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -306,11 +306,13 @@ static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
sd->_M2 = 0;
} else {
sd->count++;
+ /*
sd->sum += val;
if (val > sd->max)
sd->max = val;
if (val < sd->min)
sd->min = val;
+ */
/*
* Following is an optimization that improves performance
* in case @variance() isn't used with given global.
$
=======
The times I got with these lines commented out: 879, 815, 820, 830, 925,
741, 834, 890.
The times with the lines left in: 1049, 1010, 1047, 953, 976.
Based on the above, I'd say it might make sense to conditionally optimize this
whole block out when the respective stat ops are not in use. My gut feeling is
that more targeted optimizations would not be beneficial. Thoughts?
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Bug runtime/10234] clean up aggregate hard-coded logic
[not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
2016-09-13 15:26 ` mcermak at redhat dot com
2016-09-13 18:53 ` fche at redhat dot com
@ 2016-09-16 7:59 ` mcermak at redhat dot com
2016-09-20 14:34 ` mcermak at redhat dot com
` (6 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: mcermak at redhat dot com @ 2016-09-16 7:59 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
--- Comment #5 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9513
--> https://sourceware.org/bugzilla/attachment.cgi?id=9513&action=edit
test.stp
After attempts to use a more complex scenario with multiple CPUs (which gave
obscure results), I fell back to the following simple one:
=======
$ git diff
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..5a9be70 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -305,11 +305,36 @@ static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
sd->avg_s = val << sd->shift;
sd->_M2 = 0;
} else {
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->count++;
+ #else
sd->count++;
+ #endif
+
+ #if defined(OPT2) || defined(OPT3)
+ if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->sum += val;
+ #else
sd->sum += val;
+ #endif
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MAX && val > sd->max)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MAX && val > sd->max))
+ #else
if (val > sd->max)
+ #endif
sd->max = val;
+
+ #ifdef OPT2
+ if (sd->stat_ops & STAT_OP_MIN && val < sd->min)
+ #elif defined(OPT3)
+ if (unlikely(sd->stat_ops & STAT_OP_MIN && val < sd->min))
+ #else
if (val < sd->min)
+ #endif
sd->min = val;
/*
* Following is an optimization that improves performance
$
=======
I've repeatedly used the following loop for the testing:
=======
for j in ' ' \-DOPT2 \-DOPT3; do
for i in `seq 1 20`; do
STAPOUT=$(mktemp)
stap -vtg --poison-cache --suppress-time-limits $j ./test.stp >& $STAPOUT
cat $STAPOUT | awk -F/ '/^begin/ {print $3}' | sed 's/avg//'
done | awk '{ total += $1; count++ } END { printf total/(count * 1000) }'
echo " $k $j"
done
=======
where test.stp (attached) uses many '<<<' aggregations over a fixed set of
pre-generated random values, and all of @count, @sum, @min, @max, @avg, and
@variance are explicitly printed at the end of the script. The results are
relative (in 'time units'). Each run of the above loop produces one set of
results in the table below. Each test was run on a single CPU.
The unlikely() macro relates to branch prediction. Tested on various
architectures. Single-user mode was used to reduce interference from
unrelated userspace processes. The kernel itself seemed to be performing some
adjustments during the first few runs (see below), so I've discarded the
related measurements.
=======
[ 294.297412] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 329.709509] perf: interrupt took too long (3139 > 3136), lowering kernel.perf_event_max_sample_rate to 63000
[ 652.383749] perf: interrupt took too long (3927 > 3923), lowering kernel.perf_event_max_sample_rate to 50000
=======
Results:
------------------------------------------------------------------|
ppc64 s390x x86_64
------------------| ------------------| ------------------|
518.857 412.962 2513.75
529.423 -DOPT2 423.081 -DOPT2 2555.29 -DOPT3
759.425 -DOPT3 442.206 -DOPT3 2578.75 -DOPT2
------------------| ------------------| ------------------|
580.696 389.635 2530.95
592.576 -DOPT2 415.82 -DOPT2 2545.6 -DOPT3
642.577 -DOPT3 421.315 -DOPT3 2571.62 -DOPT2
------------------| ------------------| ------------------|
552.094 388.975 2517.83
610.116 -DOPT2 416.48 -DOPT2 2550.84 -DOPT3
759.274 -DOPT3 420.661 -DOPT3 2581.71 -DOPT2
------------------| ------------------| ------------------|
595.439 385.798 2515.89
610.511 -DOPT2 421.387 -DOPT3 2562.03 -DOPT3
679.916 -DOPT3 424.299 -DOPT2 2579.27 -DOPT2
------------------| ------------------| ------------------|
532.466 389.769
576.825 -DOPT2 415.853 -DOPT2
739.139 -DOPT3 422.055 -DOPT3
------------------| ------------------|
562.744 -DOPT2 387.642
594.826 416.685 -DOPT2
978.162 -DOPT3 422.846 -DOPT3
------------------| ------------------|
638.113 -DOPT3 387.403
763.521 421.357 -DOPT2
780.33 -DOPT2 421.584 -DOPT3
------------------| ------------------|
596.114 1383.85
608.686 -DOPT2 419.845 -DOPT3
677.906 -DOPT3 449.496 -DOPT2
------------------| ------------------|
517.915 388.702
606.76 -DOPT2 421.449 -DOPT3
758.362 -DOPT3 499.101 -DOPT2
------------------| ------------------|
561.234 -DOPT2 418.358 -DOPT2
592.586 435.673 -DOPT3
640.741 -DOPT3 481.296
------------------| ------------------|
590.273 -DOPT2
595.477
760.43 -DOPT3
------------------|
533.675
546.956 -DOPT2
740.934 -DOPT3
------------------|
592.894 -DOPT2
595.04
739.344 -DOPT3
------------------|
529.315 -DOPT2
533.694
760.528 -DOPT3
------------------|
595.766
608.028 -DOPT2
699.56 -DOPT3
------------------|
546.428 -DOPT2
548.962
760.877 -DOPT3
------------------|
579.024
605.618 -DOPT2
680.475 -DOPT3
------------------|
518.279
591.874 -DOPT2
759.458 -DOPT3
------------------|
480.887 -DOPT2
578.895
740.364 -DOPT3
------------------|
459.348
469.135 -DOPT2
761.488 -DOPT3
------------------|
457.565
607.04 -DOPT2
759.243 -DOPT3
------------------|
514.944 -DOPT2
563.888
739.681 -DOPT3
------------------|
581.474
593.549 -DOPT2
659.873 -DOPT3
------------------|
532.542
589.381 -DOPT2
720.753 -DOPT3
------------------|
My conclusion is similar to the one in Comment #3: I think the benefit of
optimizing the trivial @count, @sum, @min, and @max online computations by
wrapping them in run-time conditions is moot. Actually, the results could
even be read as "both -DOPT2 and -DOPT3 slightly worsened the tested
performance in most cases". I might be wrong, though :-)
* [Bug runtime/10234] clean up aggregate hard-coded logic
[not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
` (3 preceding siblings ...)
2016-09-20 14:34 ` mcermak at redhat dot com
@ 2016-09-23 14:35 ` mcermak at redhat dot com
2016-09-27 8:44 ` mcermak at redhat dot com
` (4 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: mcermak at redhat dot com @ 2016-09-23 14:35 UTC (permalink / raw)
To: systemtap
https://sourceware.org/bugzilla/show_bug.cgi?id=10234
--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I can verify the functionality of the above patch, for example, in the
following way: with the patch applied, when only @count() is printed but not
@sum(), the @sum() computations get skipped during module preprocessing as
intended and do not show up in the module disassembly:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot {x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
$
=======
Once the @sum() computations are needed, they appear in the module
disassembly as expected:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot {x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 78 6f js 691 <__stp_stat_add_4+0x121>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 73 74 jae 696 <.LC17+0x496>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
61f: 00 24 00 add %ah,(%rax,%rax,1)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
61f: 00 00 add %al,(%rax)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 65 6c gs insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
sd->sum += val;
620: 65 5f gs pop %rdi
$
=======
Indeed, these computations all live in stat-add.c, which the patch from
Comment #6 introduces. That's all good.
Now trying another approach based on function inlining and compiler
optimizations. First, bringing back the run-time conditions similar to
Comment #5 and inlining __stp_stat_add():
=======
$ git diff runtime/stat-common.c
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..3a0dbdf 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -288,7 +288,7 @@ static void _stp_stat_print_histogram(Hist st, stat_data *sd)
_stp_print_flush();
}
-static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
+static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
{
int n;
int delta = 0;
@@ -305,11 +305,13 @@ static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
sd->avg_s = val << sd->shift;
sd->_M2 = 0;
} else {
- sd->count++;
- sd->sum += val;
- if (val > sd->max)
+ if (sd->stat_ops & (STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->count++;
+ if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+ sd->sum += val;
+ if ((sd->stat_ops & STAT_OP_MAX) && (val > sd->max))
sd->max = val;
- if (val < sd->min)
+ if ((sd->stat_ops & STAT_OP_MIN) && (val < sd->min))
sd->min = val;
/*
* Following is an optimization that improves performance
$
=======
Now, testing the above patch with the module optimized at -O3 (I added -O3 at
buildrun.cxx:477). The compiler optimization is in effect (the second
specification overrides the first):
=======
$ eu-readelf --debug-dump=info blah.ko | grep produce | grep '\-O' | sort -u
producer (strp) "GNU C89 6.2.1 20160901 (Red Hat
6.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387
-mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic
-mno-red-zone -mcmodel=kernel -maccumulate-outgoing-args -mfentry -march=x86-64
-g -O2 -O3 -std=gnu90 -p -fno-strict-aliasing -fno-common -falign-jumps=1
-falign-loops=1 -funit-at-a-time -fno-delete-null-pointer-checks
-fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls
-fvar-tracking-assignments -fno-strict-overflow -fconserve-stack
-freorder-blocks -fasynchronous-unwind-tables -fno-ipa-icf --param
allow-store-data-races=0"
$
=======
That looks fine, so now the real check. When @sum() is needed (println'd),
the related computations should be present. They are:
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot {x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4caf: b8 01 00 00 00 mov $0x1,%eax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5249: 4c 8b 56 28 mov 0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f4: 4c 8b 56 28 mov 0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cac: 66 5a pop %dx
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5246: d7 xlat %ds:(%rbx)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f1: fa cli
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cac: 50 push %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5248: 5f pop %rdi
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
52f1: 6c insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4cad: 50 push %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
5247: 30 9f 00 00 00 00 xor %bl,0x0(%rdi)
$
=======
But now, when @sum() becomes unneeded (not println'd), the related
computations do *not* disappear (as I was hoping they would):
=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot {x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
sd->sum += val;
4caf: b8 01 00 00 00 mov $0x1,%eax
--
[ ... stuff deleted, the objdump output is very close to the one right
above ... ]
=======
Here, the highest optimization level (-O3) was used with a relatively modern
gcc-6.2.1-1.fc26.x86_64. So this doesn't work. It also doesn't improve
performance. A brief performance comparison of the approaches:
=======
$ # The above, inlined __stp_stat_add() plus run-time conditions:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 1194232min/1194232avg/1194232max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1432136min/1432136avg/1432136max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 644615min/644615avg/644615max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 801981min/801981avg/801981max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 784444min/784444avg/784444max, from: begin, index: 0
$
$
$ # GCC pre-processing based patch from Comment #6 (this looks like a real improvement):
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 606954min/606954avg/606954max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1360988min/1360988avg/1360988max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 584298min/584298avg/584298max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 758953min/758953avg/758953max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 589420min/589420avg/589420max, from: begin, index: 0
$
$
$ # Clean fresh git master stap without any optimization attempt:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 726612min/726612avg/726612max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 771579min/771579avg/771579max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 819065min/819065avg/819065max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 640921min/640921avg/640921max, from: begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 725201min/725201avg/725201max, from: begin, index: 0
$
=======
What can I do to make the "inlining" approach work better?