[Bug runtime/10234] clean up aggregate hard-coded logic

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
@ 2016-09-13 15:26 ` mcermak at redhat dot com
  2016-09-13 18:53 ` fche at redhat dot com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-13 15:26 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

Martin Cermak <mcermak at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mcermak at redhat dot com

--- Comment #3 from Martin Cermak <mcermak at redhat dot com> ---
(In reply to Frank Ch. Eigler from comment #0)
> the runtime makes a number of unfortunate assumptions about aggregate
> objects:
> 
> - that all of count/min/max/sum should be tracked

The @variance() operator, introduced per PR11308, comes with a mechanism
tracking the stat operators used with given global within the script.  This
mechanism is being used to skip variance calculations if they are not needed. 
Variance is significantly more expensive than min or max, so the most
beneficial optimization already is in place.  Looking at min, max, sum and
count, I think the effect of optimizing them out is close to unmeasurable:

I've been repeatedly running following command on my workstation:

=======
rm -rf ~/.systemtap; for k in `seq 1 500`; do stap -t -g --suppress-time-limits
-e 'global x probe oneshot{ for(i; i<40000; i++) x<<<i println(@count(x)) }' |
sed 's/avg//g' | awk -F\/ '/^begin/ {print $2/1e3}'; done | awk '{ total += $1;
count++ } END { print total/count }'
=======

and comparing results related to presence versus absence of the following
patch:

=======
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..70a46ea 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -306,11 +306,13 @@ static void __stp_stat_add(Hist st, stat_data *sd,
int64_t val)
                sd->_M2 = 0;
        } else {
                sd->count++;
+               /*
                sd->sum += val;
                if (val > sd->max)
                        sd->max = val;
                if (val < sd->min)
                        sd->min = val;
+               */
                /*
                 * Following is an optimization that improves performance
                 * in case @variance() isn't used with given global.
$
=======

The times I got having these lines commented out were: 879, 815, 820, 830, 925,
741, 834, 890.
The times I got not having these lines commented out: 1049, 1010, 1047, 953,
976.

Based on the above I'd say, it might make sense to conditionally optimize this
whole block out if respective stat ops are not in use.  My nose tells me that
doing more targeted optimizations would not be beneficial.  Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
  2016-09-13 15:26 ` [Bug runtime/10234] clean up aggregate hard-coded logic mcermak at redhat dot com
@ 2016-09-13 18:53 ` fche at redhat dot com
  2016-09-16  7:59 ` mcermak at redhat dot com
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: fche at redhat dot com @ 2016-09-13 18:53 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #4 from Frank Ch. Eigler <fche at redhat dot com> ---
One could measure some of the combinations.
In particular, compare the three options

 if (val > sd->max)  sd->max = val;
 if (sd->hist_ops & STAT_OP_MAX && val > sd->max)  sd->max = val;
 if (unlikely(sd->hist_ops & STAT_OP_MAX && val > sd->max))  sd->max = val;

for one or all of the aggregate variables.  That's a number of combinations to
test, but maybe with some #ifdef's and some shell script stap -D'ing, one can
get the computer to explore the whole space.

(Since it's unlikely we'll have a compile-time-only method ... like your #if 0
... unless we were to go to generated-code templates kind of like map-gen* etc.
for the various cartesian-product alternatives.)

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
  2016-09-13 15:26 ` [Bug runtime/10234] clean up aggregate hard-coded logic mcermak at redhat dot com
  2016-09-13 18:53 ` fche at redhat dot com
@ 2016-09-16  7:59 ` mcermak at redhat dot com
  2016-09-20 14:34 ` mcermak at redhat dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-16  7:59 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #5 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9513
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9513&action=edit
test.stp

After attempts to use more complex scenario with multiple CPUs (and getting
obscure results) I fell back to following simple one:

=======
$ git diff
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..5a9be70 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -305,11 +305,36 @@ static void __stp_stat_add(Hist st, stat_data *sd,
int64_t val)
                sd->avg_s = val << sd->shift;
                sd->_M2 = 0;
        } else {
+               #if defined(OPT2) || defined(OPT3)
+               if (sd->stat_ops &
(STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+                   sd->count++;
+               #else
                sd->count++;
+               #endif
+
+               #if defined(OPT2) || defined(OPT3)
+               if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+                   sd->sum += val;
+               #else
                sd->sum += val;
+               #endif
+
+               #ifdef OPT2
+               if (sd->stat_ops & STAT_OP_MAX && val > sd->max)
+               #elif defined(OPT3)
+               if (unlikely(sd->stat_ops & STAT_OP_MAX && val > sd->max))
+               #else
                if (val > sd->max)
+               #endif
                        sd->max = val;
+
+               #ifdef OPT2
+               if (sd->stat_ops & STAT_OP_MIN && val < sd->min)
+               #elif defined(OPT3)
+               if (unlikely(sd->stat_ops & STAT_OP_MIN && val < sd->min))
+               #else
                if (val < sd->min)
+               #endif
                        sd->min = val;
                /*
                 * Following is an optimization that improves performance
$ 
=======

I've repeatedly used following loop for the testing:

=======
for j in ' ' \-DOPT2 \-DOPT3; do
            for i in `seq 1 20`; do
                STAPOUT=$(mktemp)
                stap -vtg --poison-cache --suppress-time-limits $j ./test.stp
>& $STAPOUT
                cat $STAPOUT | awk -F/ '/^begin/ {print $3}' | sed 's/avg//'
            done | awk '{ total += $1; count++ } END { printf total/(count *
1000) }'
            echo " $k $j"
done
=======

where test.stp (attached) uses many '<<<' aggregations over fixed set of once
generated random values, and all of @count, @sum, @min, @max, @avg, and
@variance are explicitly printed at the end of the script.  The results are
relative (in 'time units').  Each run of the above loop produces one set of
results in the below table.  Each test was running on a single CPU.

The unlikely() function relates to the branch prediction.  Tested on various
architectures.  Single user mode was used to lower down the unwanted
interference with unrelated userspace processes.  The kernel itself seemed to
be performing some adjustments during the first few runs, see below, so I've
thrown related measurements away.

=======
[  294.297412] perf: interrupt took too long (2509 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[  329.709509] perf: interrupt took too long (3139 > 3136), lowering
kernel.perf_event_max_sample_rate to 63000
[  652.383749] perf: interrupt took too long (3927 > 3923), lowering
kernel.perf_event_max_sample_rate to 50000
=======



Results:
------------------------------------------------------------------|
      ppc64                    s390x                  x86_64
------------------|     ------------------|     ------------------|
518.857                 412.962                 2513.75
529.423   -DOPT2        423.081   -DOPT2        2555.29   -DOPT3
759.425   -DOPT3        442.206   -DOPT3        2578.75   -DOPT2
------------------|     ------------------|     ------------------|
580.696                 389.635                 2530.95
592.576   -DOPT2        415.82   -DOPT2         2545.6   -DOPT3
642.577   -DOPT3        421.315   -DOPT3        2571.62   -DOPT2
------------------|     ------------------|     ------------------|
552.094                 388.975                 2517.83
610.116   -DOPT2        416.48   -DOPT2         2550.84   -DOPT3
759.274   -DOPT3        420.661   -DOPT3        2581.71   -DOPT2
------------------|     ------------------|     ------------------|
595.439                 385.798                 2515.89
610.511   -DOPT2        421.387   -DOPT3        2562.03   -DOPT3
679.916   -DOPT3        424.299   -DOPT2        2579.27   -DOPT2
------------------|     ------------------|     ------------------|
532.466                 389.769
576.825   -DOPT2        415.853   -DOPT2
739.139   -DOPT3        422.055   -DOPT3
------------------|     ------------------|
562.744   -DOPT2        387.642
594.826                 416.685   -DOPT2
978.162   -DOPT3        422.846   -DOPT3
------------------|     ------------------|
638.113   -DOPT3        387.403
763.521                 421.357   -DOPT2
780.33   -DOPT2         421.584   -DOPT3
------------------|     ------------------|
596.114                 1383.85
608.686   -DOPT2        419.845   -DOPT3
677.906   -DOPT3        449.496   -DOPT2
------------------|     ------------------|
517.915                 388.702
606.76   -DOPT2         421.449   -DOPT3
758.362   -DOPT3        499.101   -DOPT2
------------------|     ------------------|
561.234   -DOPT2        418.358   -DOPT2
592.586                 435.673   -DOPT3
640.741   -DOPT3        481.296
------------------|     ------------------|
590.273   -DOPT2
595.477
760.43   -DOPT3
------------------|
533.675
546.956   -DOPT2
740.934   -DOPT3
------------------|
592.894   -DOPT2
595.04
739.344   -DOPT3
------------------|
529.315   -DOPT2
533.694
760.528   -DOPT3
------------------|
595.766
608.028   -DOPT2
699.56   -DOPT3
------------------|
546.428   -DOPT2
548.962
760.877   -DOPT3
------------------|
579.024
605.618   -DOPT2
680.475   -DOPT3
------------------|
518.279
591.874   -DOPT2
759.458   -DOPT3
------------------|
480.887   -DOPT2
578.895
740.364   -DOPT3
------------------|
459.348
469.135   -DOPT2
761.488   -DOPT3
------------------|
457.565
607.04   -DOPT2
759.243   -DOPT3
------------------|
514.944   -DOPT2
563.888
739.681   -DOPT3
------------------|
581.474
593.549   -DOPT2
659.873   -DOPT3
------------------|
532.542
589.381   -DOPT2
720.753   -DOPT3
------------------|


My conclusion is similar to the one in Comment#3: I think the benefits of
optimizing the trivial @count, @sum, @min, and @max online computations by
wrapping them into run-time conditions, is moot.  Hmm, actually, maybe we could
interpret the results as "both of -DOPT2 and -DOPT3 slightly worsened the
tested performance in most cases".  I might be wrong, though :-)

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2016-09-16  7:59 ` mcermak at redhat dot com
@ 2016-09-20 14:34 ` mcermak at redhat dot com
  2016-09-23 14:35 ` mcermak at redhat dot com
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-20 14:34 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #6 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9517
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9517&action=edit
proposed prototype patch

Not fully tested yet, but my impression is that the attached prototype patch
improves the performance in a measurable way.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2016-09-20 14:34 ` mcermak at redhat dot com
@ 2016-09-23 14:35 ` mcermak at redhat dot com
  2016-09-27  8:44 ` mcermak at redhat dot com
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-23 14:35 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #7 from Martin Cermak <mcermak at redhat dot com> ---
I can verify the above patch functionality e.g. the following way:  When using
the above patch, I can see that e.g. when printing only @count(), but not the
@sum(), the @sum() computations get skipped during the module preprocessing as
intended and they do not show in the module disassembly:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
$ 
=======

Once the @sum() computations are needed, they expectedly also appear in the
module disassembly:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       78 6f                   js     691 <__stp_stat_add_4+0x121>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
 620:   73 74                   jae    696 <.LC17+0x496>
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
 61f:   00 24 00                add    %ah,(%rax,%rax,1)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     61f:       00 00                   add    %al,(%rax)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       65 6c                   gs insb (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-add.c:24
                sd->sum += val;
     620:       65 5f                   gs pop %rdi
$
=======

Indeed, these computations live all in the stat-add.c, which is what the patch
from Comment #6 introduces.  That's all good.






Now trying another approach based on function inlining and compiler
optimizations. First, bringing back the runtime conditions similar to Comment
#5, and inlining __stp_stat_add():

=======
$ git diff runtime/stat-common.c
diff --git a/runtime/stat-common.c b/runtime/stat-common.c
index e58b1c2..3a0dbdf 100644
--- a/runtime/stat-common.c
+++ b/runtime/stat-common.c
@@ -288,7 +288,7 @@ static void _stp_stat_print_histogram(Hist st, stat_data
*sd)
        _stp_print_flush();
 }

-static void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
+static inline void __stp_stat_add(Hist st, stat_data *sd, int64_t val)
 {
        int n;
        int delta = 0;
@@ -305,11 +305,13 @@ static void __stp_stat_add(Hist st, stat_data *sd,
int64_t val)
                sd->avg_s = val << sd->shift;
                sd->_M2 = 0;
        } else {
-               sd->count++;
-               sd->sum += val;
-               if (val > sd->max)
+               if (sd->stat_ops &
(STAT_OP_COUNT|STAT_OP_AVG|STAT_OP_VARIANCE))
+                       sd->count++;
+               if (sd->stat_ops & (STAT_OP_SUM|STAT_OP_AVG|STAT_OP_VARIANCE))
+                       sd->sum += val;
+               if ((sd->stat_ops & STAT_OP_MAX) && (val > sd->max))
                        sd->max = val;
-               if (val < sd->min)
+               if ((sd->stat_ops & STAT_OP_MIN) && (val < sd->min))
                        sd->min = val;
                /*
                 * Following is an optimization that improves performance
$
=======

Now, testing the above patch, optimizing the module with -O3 (added -O3 to
buildrun.cxx:477).  The compiler optimization is there (the second
specification overrides the first):

=======
$ eu-readelf --debug-dump=info blah.ko | grep produce | grep '\-O' | sort -u
           producer             (strp) "GNU C89 6.2.1 20160901 (Red Hat
6.2.1-1) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387
-mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic
-mno-red-zone -mcmodel=kernel -maccumulate-outgoing-args -mfentry -march=x86-64
-g -O2 -O3 -std=gnu90 -p -fno-strict-aliasing -fno-common -falign-jumps=1
-falign-loops=1 -funit-at-a-time -fno-delete-null-pointer-checks
-fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls
-fvar-tracking-assignments -fno-strict-overflow -fconserve-stack
-freorder-blocks -fasynchronous-unwind-tables -fno-ipa-icf --param
allow-store-data-races=0"
$
=======

That looks fine, so now the real check. When @sum() is needed (println'd),
related computations should be there. They are:

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@sum(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4caf:       b8 01 00 00 00          mov    $0x1,%eax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5249:       4c 8b 56 28             mov    0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f4:       4c 8b 56 28             mov    0x28(%rsi),%r10
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cac:       66 5a                   pop    %dx
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5246:       d7                      xlat   %ds:(%rbx)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f1:       fa                      cli    
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cac:       50                      push   %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5248:       5f                      pop    %rdi
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    52f1:       6c                      insb   (%dx),%es:(%rdi)
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4cad:       50                      push   %rax
--
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    5247:       30 9f 00 00 00 00       xor    %bl,0x0(%rdi)
$
=======

But now, when @sum() becomes unneeded (not println'd), the related computations
do *not* disappear (where I was hoping for them to disappear):

=======
$ stap -m blah -B CONFIG_DEBUG_INFO=y --poison-cache -e 'global x probe oneshot
{x<<<1 x<<<2 x<<<3 println(@count(x))}' -p4
blah.ko
$ objdump -lSD blah.ko | grep -B 1 -A 1 'sd->sum += val;'
/usr/local/share/systemtap/runtime/stat-common.c:311
                        sd->sum += val;
    4caf:       b8 01 00 00 00          mov    $0x1,%eax
--
[ ... stuff deleted, the objdump output is very close to the one right above
one ... ]
=======

Here, the highest -O3 has been used with relatively modern
gcc-6.2.1-1.fc26.x86_64.  So this doesn't work.  It also doesn't improve the
performance: A brief performance comparison of the approaches:

=======
$ # The above, inlined __stp_stat_add() plus run-time conditions:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 1194232min/1194232avg/1194232max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1432136min/1432136avg/1432136max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 644615min/644615avg/644615max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 801981min/801981avg/801981max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 784444min/784444avg/784444max, from:
begin, index: 0
$ 
$ 
$ # GCC pre-processing based patch from Comment #6 (this looks like a real
improvement):
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 606954min/606954avg/606954max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 1360988min/1360988avg/1360988max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 584298min/584298avg/584298max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 758953min/758953avg/758953max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 589420min/589420avg/589420max, from:
begin, index: 0
$ 
$ 
$ #  Clean fresh git master stap without any optimization attempt:
$ for i in `seq 1 5`; do stap -vtg --poison-cache --suppress-time-limits
test.stp |& grep ^begin; done
begin, (test.stp:5:1), hits: 1, cycles: 726612min/726612avg/726612max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 771579min/771579avg/771579max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 819065min/819065avg/819065max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 640921min/640921avg/640921max, from:
begin, index: 0
begin, (test.stp:5:1), hits: 1, cycles: 725201min/725201avg/725201max, from:
begin, index: 0
$
=======

What can I do to make the "inlining" approach work better?

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2016-09-23 14:35 ` mcermak at redhat dot com
@ 2016-09-27  8:44 ` mcermak at redhat dot com
  2016-09-30 10:34 ` mcermak at redhat dot com
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-27  8:44 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #8 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9525
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9525&action=edit
proposed prototype patch

The attached patch based on inlining the __stp_stat_add(), and adding a few
integer parameters denoting relevant stat ops, seems to help GCC to effectively
optimize this function.

Very rough testing shows about 17% shrinkage, which is roughly about the same
as with the patch from Comment #6 (based on preprocessing phase of the runtime
compilation).  The results were acquired using gcc-6.2.1-1.fc26.x86_64 and -O2
for the module compilation.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2016-09-27  8:44 ` mcermak at redhat dot com
@ 2016-09-30 10:34 ` mcermak at redhat dot com
  2016-10-03  9:01 ` mcermak at redhat dot com
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-09-30 10:34 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #9 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9546
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9546&action=edit
proposed patch

Improved version with a testcase.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2016-09-30 10:34 ` mcermak at redhat dot com
@ 2016-10-03  9:01 ` mcermak at redhat dot com
  2016-10-05  7:36 ` mcermak at redhat dot com
  2016-10-06 22:05 ` mcermak at redhat dot com
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-10-03  9:01 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #10 from Martin Cermak <mcermak at redhat dot com> ---
Created attachment 9552
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9552&action=edit
proposed patch

The above patch extended for PMAPs.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2016-10-03  9:01 ` mcermak at redhat dot com
@ 2016-10-05  7:36 ` mcermak at redhat dot com
  2016-10-06 22:05 ` mcermak at redhat dot com
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-10-05  7:36 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

--- Comment #11 from Martin Cermak <mcermak at redhat dot com> ---
Checked in commit 26382d613f4d266f74ddb3d3f3f36aceab171e14 .

The optimization effect significantly depends on compiler version,
platform architecture, and the stat operators being used for given
global.  At the moment, the available stat operators are @count,
@sum, @min, @max, @avg, and @variance. The most computionally
expensive is @variance.  The effect of optimizing @variance is
significant.  Other stat operators are computionally chap and so
the effect of their optimizations is relatively low.

Using gcc-6.2.1-1.fc26.x86_64, the @count, @sum, @min, and @max
optimizations brings approximately 8% run time shrinkage. The
@variance optimization shrinkage is up to 70% using this compiler.
For other architectures, namely for power, the optimization is less
effective.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
       [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2016-10-05  7:36 ` mcermak at redhat dot com
@ 2016-10-06 22:05 ` mcermak at redhat dot com
  9 siblings, 0 replies; 13+ messages in thread
From: mcermak at redhat dot com @ 2016-10-06 22:05 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=10234

Martin Cermak <mcermak at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Martin Cermak <mcermak at redhat dot com> ---
Closing per the aforementioned commit.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
  2009-06-03 17:45 [Bug runtime/10234] New: " fche at redhat dot com
  2009-06-03 17:49 ` [Bug runtime/10234] " fche at redhat dot com
  2009-06-03 18:01 ` leitao at linux dot vnet dot ibm dot com
@ 2009-06-03 18:18 ` jkenisto at us dot ibm dot com
  2 siblings, 0 replies; 13+ messages in thread
From: jkenisto at us dot ibm dot com @ 2009-06-03 18:18 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From jkenisto at us dot ibm dot com  2009-06-03 18:17 -------
(In reply to comment #0)
> 
> ... and without tracking sums/maxima/whatever.  This is an optimization.

And we could add support for variance and/or standard deviation, since we'd
accumulate the sum of squares (incurring a 64-bit multiply for each observation)
only if it were needed.



-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=10234

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
  2009-06-03 17:45 [Bug runtime/10234] New: " fche at redhat dot com
  2009-06-03 17:49 ` [Bug runtime/10234] " fche at redhat dot com
@ 2009-06-03 18:01 ` leitao at linux dot vnet dot ibm dot com
  2009-06-03 18:18 ` jkenisto at us dot ibm dot com
  2 siblings, 0 replies; 13+ messages in thread
From: leitao at linux dot vnet dot ibm dot com @ 2009-06-03 18:01 UTC (permalink / raw)
  To: systemtap



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |leitao at linux dot vnet dot
                   |                            |ibm dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=10234

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug runtime/10234] clean up aggregate hard-coded logic
  2009-06-03 17:45 [Bug runtime/10234] New: " fche at redhat dot com
@ 2009-06-03 17:49 ` fche at redhat dot com
  2009-06-03 18:01 ` leitao at linux dot vnet dot ibm dot com
  2009-06-03 18:18 ` jkenisto at us dot ibm dot com
  2 siblings, 0 replies; 13+ messages in thread
From: fche at redhat dot com @ 2009-06-03 17:49 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From fche at redhat dot com  2009-06-03 17:49 -------
It'd be nice to remove the STP_MAX_BUCKETS and similar hard-coded limits too.

The stat_data struct is too rigid.  Perhaps it should be templatized
like the map-* stuff.


-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=10234

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-10-06 22:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-10234-6586@http.sourceware.org/bugzilla/>
2016-09-13 15:26 ` [Bug runtime/10234] clean up aggregate hard-coded logic mcermak at redhat dot com
2016-09-13 18:53 ` fche at redhat dot com
2016-09-16  7:59 ` mcermak at redhat dot com
2016-09-20 14:34 ` mcermak at redhat dot com
2016-09-23 14:35 ` mcermak at redhat dot com
2016-09-27  8:44 ` mcermak at redhat dot com
2016-09-30 10:34 ` mcermak at redhat dot com
2016-10-03  9:01 ` mcermak at redhat dot com
2016-10-05  7:36 ` mcermak at redhat dot com
2016-10-06 22:05 ` mcermak at redhat dot com
2009-06-03 17:45 [Bug runtime/10234] New: " fche at redhat dot com
2009-06-03 17:49 ` [Bug runtime/10234] " fche at redhat dot com
2009-06-03 18:01 ` leitao at linux dot vnet dot ibm dot com
2009-06-03 18:18 ` jkenisto at us dot ibm dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).