* Pass 4 (non-)optimization speedup
@ 2009-07-11 18:32 Mark Wielaard
  2009-07-12  1:51 ` Frank Ch. Eigler
  2009-07-15  2:17 ` Josh Stone
  0 siblings, 2 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-11 18:32 UTC (permalink / raw)
To: systemtap

Hi,

To make pass 4 a bit more flexible I added -O[0123s] as arguments to
stap (commit 5a5732). This allows you to pass the option through to
gcc, so users can decide whether they want fast compilation, or code
optimized for speed or size. The default is -O0, which makes pass 4 a
lot faster, so I think this is a good default. But maybe the default
could be tuned, for example when using the compile server, which can
be assumed to generate systemtap script modules that will often be
reused.

The -O0 default speeds up some things a lot.

$ stap -v -k -p4 -e 'probe syscall.* { log(argstr); }'
Before: Pass 4: compiled C [...] in 7520usr/900sys/8415real ms.
After:  Pass 4: compiled C [...] in 4980usr/730sys/5782real ms.

make installcheck RUNTESTFLAGS=buildok.exp is 15 seconds faster: it
was 4 minutes 10 seconds, now 3 minutes 55 seconds (both with clean
caches).

Cheers,

Mark

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Pass 4 (non-)optimization speedup
  2009-07-11 18:32 Pass 4 (non-)optimization speedup Mark Wielaard
@ 2009-07-12  1:51 ` Frank Ch. Eigler
  2009-07-12 14:26   ` Mark Wielaard
  2009-07-15  2:17 ` Josh Stone
  1 sibling, 1 reply; 8+ messages in thread

From: Frank Ch. Eigler @ 2009-07-12 1:51 UTC (permalink / raw)
To: Mark Wielaard; +Cc: systemtap

Mark Wielaard <mjw@redhat.com> writes:

> To make pass 4 a bit more flexible I added -O[0123s] as arguments to
> stap (commit 5a5732). [...]
> The default is -O0 which makes pass 4 a lot faster, so I think this is a
> good default. [...]

No, it is unlikely to be a good default. Frankly, I'm surprised it
even compiles, since some kernel code is unbuildable without
optimization. Anyway, the code generated by systemtap is complex
enough that with optimization disabled, it is bound to run measurably
slower. Please test some nontrivial probes with -t before & after.

- FChE
* Re: Pass 4 (non-)optimization speedup
  2009-07-12  1:51 ` Frank Ch. Eigler
@ 2009-07-12 14:26   ` Mark Wielaard
  2009-07-13 22:35     ` Mark Wielaard
  0 siblings, 1 reply; 8+ messages in thread

From: Mark Wielaard @ 2009-07-12 14:26 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

Hi Frank,

On Sat, 2009-07-11 at 21:51 -0400, Frank Ch. Eigler wrote:
> Mark Wielaard <mjw@redhat.com> writes:
>
> > To make pass 4 a bit more flexible I added -O[0123s] as arguments to
> > stap (commit 5a5732). [...]
> > The default is -O0 which makes pass 4 a lot faster, so I think this is a
> > good default. [...]
>
> No, it is unlikely to be a good default.

Maybe; like I said: "But maybe the default could be tuned...". It
definitely makes a difference when running stap interactively. When
trying out some probes to pinpoint some good stuff to measure, it
really matters whether a quick try takes 6 or 3 seconds. It just feels
that much snappier. But for non-interactive work, it might be good if
stap defaulted to some higher optimization level. Anyway, the user has
a choice now, and we can always tweak the default depending on the
workload.

> Frankly, I'm surprised it
> even compiles, since some kernel code is unbuildable without
> optimization.

And that surprises me. Obviously I tried it out on some different
kernels/architectures (i386/x86_64/fedora/2.6.29/rhel/2.6.18) before
implementing it. And afterwards of course I ran make installcheck to
make sure there were no regressions (there weren't, it was just
minutes faster!). If you have any examples of things that won't
compile with the new default, please do let me know (and preferably
add them to the testsuite).

> Anyway, the code generated by systemtap is complex
> enough that with optimization disabled, it is bound to run measurably
> slower. Please test some nontrivial probes with -t before & after.

Indeed, I should have measured that also.
Here are some measurements for the topsys.stp example, modified to run
only 12 times (add a global count = 12; and an if (--count == 0)
exit() to the timer probe). gcc (GCC) 4.4.0 20090506 (Red Hat
4.4.0-4), 2.6.29.5-191.fc11.i586.

$ stap -O0 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 6850usr/1160sys/8130real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 21048, cycles: 1045min/2736avg/10021max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 147235min/153708avg/171006max
Pass 5: run completed in 40usr/80sys/60163real ms.

$ stap -O1 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 8090usr/980sys/10815real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 252713, cycles: 726min/1561avg/537218max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 68442min/96426avg/128172max
Pass 5: run completed in 40usr/80sys/60197real ms.

$ stap -O2 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 10120usr/1170sys/11433real ms.
Pass 5: starting run.
probe syscall.* (topsys.stp:10:1), hits: 79132, cycles: 671min/1669avg/5962max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 82896min/90516avg/117381max
Pass 5: run completed in 30usr/100sys/60151real ms.

So each optimization level adds about 2 seconds of extra compile time
in pass 4, but the average number of cycles per probe is definitely
lower with each optimization level. And the win between -O1 and -O2
isn't as large as between -O0 and -O1.

What would be some good other scenarios to try out? I would really
like -O0 to be the default for interactive (-e <script>) usage, since
compiling in pass 4 is really so much faster. But you are right that
it significantly impacts the per-probe overhead at runtime in pass 5.
Maybe we could make -O1 the default when stap is given a script file,
and -O2 when the compile server is used?

Cheers,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-12 14:26 ` Mark Wielaard
@ 2009-07-13 22:35   ` Mark Wielaard
  2009-07-14  9:56     ` Mark Wielaard
  2009-07-15  2:17     ` Josh Stone
  0 siblings, 2 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-13 22:35 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

On Sun, 2009-07-12 at 16:25 +0200, Mark Wielaard wrote:
> On Sat, 2009-07-11 at 21:51 -0400, Frank Ch. Eigler wrote:
> > Frankly, I'm surprised it
> > even compiles, since some kernel code is unbuildable without
> > optimization.
>
> And that surprises me. Obviously I tried it out on some different
> kernels/architectures (i386/x86_64/fedora/2.6.29/rhel/2.6.18) before
> implementing it. And afterwards of course I ran make installcheck to
> make sure there were no regressions (there weren't, it just was minutes
> faster!). If you have any examples of things that won't compile with the
> new default please do let me know (and preferably add them to the
> testsuite).

We discussed this a bit on IRC. The issue is that some code that we
compile together with a module might depend on kernel source that in
turn depends on being optimized in a particular way — locking code,
for example. Other code might only produce warnings when optimized
(uninitialized variables, for example) and so confuse a user when
something suddenly doesn't compile (because of -Werror) on higher
opt-levels than the default. Will actually found some ia64 code that
relies on being optimized so that gcc doesn't emit references to
unresolvable symbols (in a complicated if-then-else switch statement).

So, I made the default the kernel opt-level again (which is -O2 or -Os
depending on CONFIG_CC_OPTIMIZE_FOR_SIZE).

This is really a shame; gcc is really a lot slower compiling with -O2.
It would be good to find some other way to make pass 4 quicker by
default.

Cheers,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-13 22:35 ` Mark Wielaard
@ 2009-07-14  9:56   ` Mark Wielaard
  0 siblings, 0 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-14 9:56 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

On Tue, 2009-07-14 at 00:35 +0200, Mark Wielaard wrote:
> So, I made the default the kernel opt-level one again (which is -O2 or
> -Os depending on CONFIG_CC_OPTIMIZE_FOR_SIZE).
>
> This is really a shame, gcc is really a lot slower compiling with -O2.
> It would be good to find some other way to make pass 4 quicker by
> default.

Actually, I have now removed all the support for changing the
opt-level again. If we cannot guarantee that the code compiles and/or
generates correctly functioning code, we are better off not officially
supporting it :{ I left some comments in buildrun.cxx for those who
want to edit the sources and experiment with it anyway.

Sniff,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-13 22:35 ` Mark Wielaard
  2009-07-14  9:56   ` Mark Wielaard
@ 2009-07-15  2:17   ` Josh Stone
  2009-07-15 17:42     ` Frank Ch. Eigler
  1 sibling, 1 reply; 8+ messages in thread

From: Josh Stone @ 2009-07-15 2:17 UTC (permalink / raw)
To: Mark Wielaard; +Cc: Frank Ch. Eigler, systemtap

[-- Attachment #1: Type: text/plain, Size: 917 bytes --]

On 07/13/2009 03:35 PM, Mark Wielaard wrote:
> This is really a shame, gcc is really a lot slower compiling with -O2.
> It would be good to find some other way to make pass 4 quicker by
> default.

Another approach mentioned on IRC is trying to simplify the code that
we present to GCC. I took a stab at consolidating the probe locking
code; patch attached. It makes each probe body contain just a static
array of the locks needed, and then calls a single function that
actually loops over the locks.

This will only be a win for scripts with many distinct probe bodies,
like topsys.stp. For that script, I see pass-4 time go from 5 seconds
to 4.4 seconds. At runtime it's hard to get consistent measurements,
but the average probe time is about 2-3% slower.

So, is that overhead worth 1/2 second of compile time? Can anyone see
ways to improve this patch, or suggest other areas that could benefit
from focus?

Josh

[-- Attachment #2: data-driven-locking.patch --]
[-- Type: text/x-patch, Size: 7409 bytes --]

commit d8a8b06d763d006be5efe0ec53c7b5a41df30223
Author: Josh Stone <jistone@redhat.com>
Date:   Mon Jul 13 13:22:11 2009 -0700

    make probe locking data-driven

diff --git a/translate.cxx b/translate.cxx
index 9c90106..a35be58 100644
--- a/translate.cxx
+++ b/translate.cxx
@@ -1590,14 +1590,7 @@ c_unparser::emit_probe (derived_probe* v)
   // initialize frame pointer
   o->newline() << "struct " << v->name << "_locals * __restrict__ l =";
   o->newline(1) << "& c->locals[0]." << v->name << ";";
-  o->newline(-1) << "(void) l;"; // make sure "l" is marked used
-
-  o->newline() << "#ifdef STP_TIMING";
-  o->newline() << "c->statp = & time_" << v->basest()->name << ";";
-  o->newline() << "#endif";
-
-  // emit probe local initialization block
-  v->emit_probe_local_init(o);
+  o->indent (-1);

   // emit all read/write locks for global variables
   varuse_collecting_visitor vut;
@@ -1607,6 +1600,15 @@ c_unparser::emit_probe (derived_probe* v)
       emit_locks (vut);
     }

+  o->newline() << "(void) l;"; // make sure "l" is marked used
+
+  o->newline() << "#ifdef STP_TIMING";
+  o->newline() << "c->statp = & time_" << v->basest()->name << ";";
+  o->newline() << "#endif";
+
+  // emit probe local initialization block
+  v->emit_probe_local_init(o);
+
   // initialize locals
   for (unsigned j=0; j<v->locals.size(); j++)
     {
@@ -1658,11 +1660,8 @@ c_unparser::emit_probe (derived_probe* v)
 void
 c_unparser::emit_locks(const varuse_collecting_visitor& vut)
 {
-  o->newline() << "{";
-  o->newline(1) << "unsigned numtrylock = 0;";
-  o->newline() << "(void) numtrylock;";
+  bool has_locks = false;

-  string last_locked_var;
   for (unsigned i = 0; i < session->globals.size(); i++)
     {
       vardecl* v = session->globals[i];
@@ -1694,37 +1693,36 @@ c_unparser::emit_locks(const varuse_collecting_visitor& vut)
           continue;
         }

-      string lockcall =
-        string (write_p ? "write" : "read") +
-        "_trylock (& global.s_" + v->name + "_lock)";
+      if (!has_locks)
+        {
+          o->newline() << "static const probe_lock_t locks[] = {";
+          o->indent(1);
+          has_locks = true;
+        }

-      o->newline() << "while (! " << lockcall
-                   << "&& (++numtrylock < MAXTRYLOCK))";
-      o->newline(1) << "ndelay (TRYLOCKDELAY);";
-      o->newline(-1) << "if (unlikely (numtrylock >= MAXTRYLOCK)) {";
-      o->newline(1) << "atomic_inc (& skipped_count);";
+      o->newline() << "{";
+      o->newline(1) << ".lock = & global.s_" << c_varname (v->name) << "_lock,";
       o->newline() << "#ifdef STP_TIMING";
-      o->newline() << "atomic_inc (& global.s_" << c_varname (v->name) << "_lock_skip_count);";
+      o->newline() << ".skip = & global.s_" << c_varname (v->name) << "_lock_skip_count,";
       o->newline() << "#endif";
-      // The following works even if i==0.  Note that using
-      // globals[i-1]->name is wrong since that global may not have
-      // been lockworthy by this probe.
-      o->newline() << "goto unlock_" << last_locked_var << ";";
-      o->newline(-1) << "}";
-
-      last_locked_var = v->name;
+      o->newline() << ".write_p = " << (write_p ? 1 : 0) << ",";
+      o->newline(-1) << "},";
     }

-  o->newline() << "if (0) goto unlock_;";
-
-  o->newline(-1) << "}";
+  if (has_locks)
+    {
+      o->newline(-1) << "};";
+      o->newline() << "if (!probe_trylock(locks, ARRAY_SIZE(locks)))";
+      o->newline(1) << "goto unlocked;";
+      o->indent(-1);
+    }
 }

 void
 c_unparser::emit_unlocks(const varuse_collecting_visitor& vut)
 {
-  unsigned numvars = 0;
+  bool has_locks = false;

   if (session->verbose>1)
     clog << current_probe->name << " locks ";
@@ -1751,32 +1749,24 @@ c_unparser::emit_unlocks(const varuse_collecting_visitor& vut)
           continue;
         }

-      numvars ++;
-      o->newline(-1) << "unlock_" << v->name << ":";
-      o->indent(1);
-
+      has_locks = true;
       if (session->verbose>1)
         clog << v->name << "[" << (read_p ? "r" : "")
             << (write_p ? "w" : "") << "] ";
-
-      if (write_p) // emit write lock
-        o->newline() << "write_unlock (& global.s_" << v->name << "_lock);";
-      else // (read_p && !write_p) : emit read lock
-        o->newline() << "read_unlock (& global.s_" << v->name << "_lock);";
-
-      // fall through to next variable; thus the reverse ordering
     }

-  // emit plain "unlock" label, used if the very first lock failed.
-  o->newline(-1) << "unlock_: ;";
-  o->indent(1);
-
-  if (numvars) // is there a chance that any lock attempt failed?
+  if (has_locks) // is there a chance that any lock attempt failed?
     {
       // Formerly, we checked skipped_count > MAXSKIPPED here, and set
       // SYSTEMTAP_SESSION_ERROR if so.  But now, this check is shared
       // via common_probe_entryfn_epilogue().
+      o->newline() << "probe_unlock(locks, ARRAY_SIZE(locks));";
+
+      // emit plain "unlocked" label, used if any locks failed
+      o->newline(-1) << "unlocked: ;";
+      o->indent(1);
+
       if (session->verbose>1)
         clog << endl;
     }
@@ -5223,6 +5213,46 @@ translate_pass (systemtap_session& s)
     }
   s.op->assert_0_indent();

+  s.op->newline() << "typedef struct {";
+  s.op->newline(1) << "rwlock_t *lock;";
+  s.op->newline() << "#ifdef STP_TIMING";
+  s.op->newline() << "atomic_t *skip;";
+  s.op->newline() << "#endif";
+  s.op->newline() << "int write_p;";
+  s.op->newline(-1) << "} probe_lock_t;";
+
+  s.op->newline() << "static void";
+  s.op->newline() << "probe_unlock(const probe_lock_t *locks, unsigned n)";
+  s.op->newline() << "{";
+  s.op->newline(1) << "while (n--)";
+  s.op->newline(1) << "if (locks[n].write_p)";
+  s.op->newline(1) << "write_unlock(locks[n].lock);";
+  s.op->newline(-1) << "else";
+  s.op->newline(1) << "read_unlock(locks[n].lock);";
+  s.op->newline(-3) << "}";
+
+  s.op->newline() << "static int";
+  s.op->newline() << "probe_trylock(const probe_lock_t *locks, unsigned n)";
+  s.op->newline() << "{";
+  s.op->newline(1) << "unsigned i, numtrylock = 0;";
+  s.op->newline(0) << "for (i = 0; i < n; ++i) {";
+  s.op->newline(1) << "if (locks[i].write_p)";
+  s.op->newline(1) << "while (!write_trylock(locks[i].lock) && (++numtrylock < MAXTRYLOCK))";
+  s.op->newline(1) << "ndelay (TRYLOCKDELAY);";
+  s.op->newline(-2) << "else";
+  s.op->newline(1) << "while (!read_trylock(locks[i].lock) && (++numtrylock < MAXTRYLOCK))";
+  s.op->newline(1) << "ndelay (TRYLOCKDELAY);";
+  s.op->newline(-2) << "if (unlikely (numtrylock >= MAXTRYLOCK)) {";
+  s.op->newline(1) << "atomic_inc (& skipped_count);";
+  s.op->newline() << "#ifdef STP_TIMING";
+  s.op->newline() << "atomic_inc (locks[i].skip);";
+  s.op->newline() << "#endif";
+  s.op->newline() << "probe_unlock(locks, i);";
+  s.op->newline() << "return 0;";
+  s.op->newline(-1) << "}";
+  s.op->newline(-1) << "}";
+  s.op->newline() << "return 1;";
+  s.op->newline(-1) << "}";
+  s.op->assert_0_indent();
+
   // Run a varuse_collecting_visitor over probes that need global
   // variable locks.  We'll use this information later in
   // emit_locks()/emit_unlocks().
* Re: Pass 4 (non-)optimization speedup
  2009-07-15  2:17 ` Josh Stone
@ 2009-07-15 17:42   ` Frank Ch. Eigler
  0 siblings, 0 replies; 8+ messages in thread

From: Frank Ch. Eigler @ 2009-07-15 17:42 UTC (permalink / raw)
To: Josh Stone; +Cc: Mark Wielaard, systemtap

Hi -

On Tue, Jul 14, 2009 at 07:16:54PM -0700, Josh Stone wrote:
> [...]
> Another approach mentioned on IRC is trying to simplify the code that we
> present to GCC.

Right, and to check again whether we may have some larger
duplicate-elimination opportunities. We have gone to some effort to
reuse translated functions and probe bodies, but maybe it's not
working well enough.

> [...] So, is that 2-3% overhead worth 1/2 second compile time?

Not IMO. Run-time performance is more important.

> Can anyone see ways to improve this patch, or suggest other areas
> that could benefit from focus?

The runtime? Profile gcc slightly? (--vp 0009 indicates considerable
lex/parse time for some tests.) Reduce the number of #include'd kernel
files if possible? Improve the translate.cxx parts to copy-propagate
better and eliminate more temporary variables?

- FChE
* Re: Pass 4 (non-)optimization speedup
  2009-07-11 18:32 Pass 4 (non-)optimization speedup Mark Wielaard
  2009-07-12  1:51 ` Frank Ch. Eigler
@ 2009-07-15  2:17 ` Josh Stone
  1 sibling, 0 replies; 8+ messages in thread

From: Josh Stone @ 2009-07-15 2:17 UTC (permalink / raw)
To: Mark Wielaard; +Cc: systemtap

On 07/11/2009 11:32 AM, Mark Wielaard wrote:
> The -O0 default speeds up some things a lot.
> $ stap -v -k -p4 -e 'probe syscall.* { log(argstr); }'
> Before: Pass 4: compiled C [...] in 7520usr/900sys/8415real ms.
> After: Pass 4: compiled C [...] in 4980usr/730sys/5782real ms.

By the way, -k will bypass ALL caching, including the saved autoconf
header. Generating that header costs a few seconds that you wouldn't
normally pay when iterating on a script, since the autoconf results
are cached independently of the script contents.

It might be nice to split session.use_cache so that the tracepoint and
autoconf caches can be used even when the script-specific caches are
not (due to -k or -m).

Josh