* Pass 4 (non-)optimization speedup
@ 2009-07-11 18:32 Mark Wielaard
  2009-07-12  1:51 ` Frank Ch. Eigler
  2009-07-15  2:17 ` Josh Stone
  0 siblings, 2 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-11 18:32 UTC (permalink / raw)
To: systemtap

Hi,

To make pass 4 a bit more flexible I added -O[0123s] as arguments to
stap (commit 5a5732). This allows you to pass the option through to
gcc, so users can decide whether they want fast compilation, or code
optimized for speed or size. The default is -O0, which makes pass 4 a
lot faster, so I think this is a good default. But maybe the default
could be tuned, for example when using the compile server, which can
be assumed to generate systemtap script modules that will often be
reused.

The -O0 default speeds up some things a lot.

$ stap -v -k -p4 -e 'probe syscall.* { log(argstr); }'
Before: Pass 4: compiled C [...] in 7520usr/900sys/8415real ms.
After:  Pass 4: compiled C [...] in 4980usr/730sys/5782real ms.

make installcheck RUNTESTFLAGS=buildok.exp is 15 seconds faster: it
was 4 minutes 10 seconds, now 3 minutes 55 seconds (both with clean
caches).

Cheers,

Mark

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Pass 4 (non-)optimization speedup
  2009-07-11 18:32 Pass 4 (non-)optimization speedup Mark Wielaard
@ 2009-07-12  1:51 ` Frank Ch. Eigler
  2009-07-12 14:26   ` Mark Wielaard
  2009-07-15  2:17 ` Josh Stone
  1 sibling, 1 reply; 8+ messages in thread

From: Frank Ch. Eigler @ 2009-07-12 1:51 UTC (permalink / raw)
To: Mark Wielaard; +Cc: systemtap

Mark Wielaard <mjw@redhat.com> writes:

> To make pass 4 a bit more flexible I added -O[0123s] as arguments to
> stap (commit 5a5732). [...]
> The default is -O0 which makes pass 4 a lot faster, so I think this is a
> good default. [...]

No, it is unlikely to be a good default. Frankly, I'm surprised it
even compiles, since some kernel code is unbuildable without
optimization. Anyway, the code generated by systemtap is complex
enough that with optimization disabled, it is bound to run measurably
slower. Please test some nontrivial probes with -t before & after.

- FChE
* Re: Pass 4 (non-)optimization speedup
  2009-07-12  1:51 ` Frank Ch. Eigler
@ 2009-07-12 14:26   ` Mark Wielaard
  2009-07-13 22:35     ` Mark Wielaard
  0 siblings, 1 reply; 8+ messages in thread

From: Mark Wielaard @ 2009-07-12 14:26 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

Hi Frank,

On Sat, 2009-07-11 at 21:51 -0400, Frank Ch. Eigler wrote:
> Mark Wielaard <mjw@redhat.com> writes:
>
> > To make pass 4 a bit more flexible I added -O[0123s] as arguments to
> > stap (commit 5a5732). [...]
> > The default is -O0 which makes pass 4 a lot faster, so I think this is a
> > good default. [...]
>
> No, it is unlikely to be a good default.

Maybe; like I said: "But maybe the default could be tuned...". It
definitely makes a difference when running stap interactively. When
trying out some probes to pinpoint some good stuff to measure, it
really matters whether a quick try takes 6 or 3 seconds. It just feels
that much snappier. But for non-interactive work, it might be good if
stap defaulted to some higher optimization level. Anyway, the user has
a choice now, and we can always tweak the default depending on the
workload.

> Frankly, I'm surprised it
> even compiles, since some kernel code is unbuildable without
> optimization.

And that surprises me. Obviously I tried it out on some different
kernels/architectures (i386/x86_64/fedora/2.6.29/rhel/2.6.18) before
implementing it. And afterwards of course I ran make installcheck to
make sure there were no regressions (there weren't, it was just
minutes faster!). If you have any examples of things that won't
compile with the new default, please do let me know (and preferably
add them to the testsuite).

> Anyway, the code generated by systemtap is complex
> enough that with optimization disabled, it is bound to run measurably
> slower. Please test some nontrivial probes with -t before & after.

Indeed, I should have measured that also.
Here are some measurements for the topsys.stp example, modified to run
only 12 times (add a global count = 12; and an if (--count == 0)
exit() to the timer probe). gcc (GCC) 4.4.0 20090506 (Red Hat
4.4.0-4), 2.6.29.5-191.fc11.i586.

$ stap -O0 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 6850usr/1160sys/8130real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 21048, cycles: 1045min/2736avg/10021max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 147235min/153708avg/171006max
Pass 5: run completed in 40usr/80sys/60163real ms.

$ stap -O1 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 8090usr/980sys/10815real ms.
Pass 5: starting run.
[...]
probe syscall.* (topsys.stp:10:1), hits: 252713, cycles: 726min/1561avg/537218max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 68442min/96426avg/128172max
Pass 5: run completed in 40usr/80sys/60197real ms.

$ stap -O2 -v -t topsys.stp
[...]
Pass 4: compiled C into [...] in 10120usr/1170sys/11433real ms.
Pass 5: starting run.
probe syscall.* (topsys.stp:10:1), hits: 79132, cycles: 671min/1669avg/5962max
probe timer.s(5) (topsys.stp:22:1), hits: 12, cycles: 82896min/90516avg/117381max
Pass 5: run completed in 30usr/100sys/60151real ms.

So each optimization level adds about 2 seconds of extra compile time
in pass 4, but the average number of cycles per probe is definitely
lower with each optimization level. And the win between -O1 and -O2
isn't as large as between -O0 and -O1.

What would be some good other scenarios to try out? I would really
like -O0 to be the default for interactive (-e <script>) usage, since
compiling in pass 4 is really so much faster. But you are right that
it significantly impacts the per-probe overhead at runtime in pass 5.
Maybe we could make -O1 the default when stap is given a script file,
and -O2 when the compile server is used?

Cheers,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-12 14:26 ` Mark Wielaard
@ 2009-07-13 22:35   ` Mark Wielaard
  2009-07-14  9:56     ` Mark Wielaard
  2009-07-15  2:17     ` Josh Stone
  0 siblings, 2 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-13 22:35 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

On Sun, 2009-07-12 at 16:25 +0200, Mark Wielaard wrote:
> On Sat, 2009-07-11 at 21:51 -0400, Frank Ch. Eigler wrote:
> > Frankly, I'm surprised it
> > even compiles, since some kernel code is unbuildable without
> > optimization.
>
> And that surprises me. Obviously I tried it out on some different
> kernels/architectures (i386/x86_64/fedora/2.6.29/rhel/2.6.18) before
> implementing it. And afterwards of course I ran make installcheck to
> make sure there were no regressions (there weren't, it just was minutes
> faster!). If you have any examples of things that won't compile with the
> new default please do let me know (and preferably add them to the
> testsuite).

We discussed this a bit on IRC. The issue is that some code that we
compile together with a module might depend on kernel source that in
turn depends on being optimized in a particular way — locking code,
for example. Other code might only produce warnings when optimized
(uninitialized variables, for example) and so confuse a user when
something suddenly doesn't compile (because of -Werror) on higher
opt-levels than the default. Will actually found some ia64 code that
relies on being optimized so that gcc doesn't emit references to
unresolvable symbols (in a complicated if-then-else switch statement).

So, I made the default the kernel opt-level again (which is -O2 or -Os
depending on CONFIG_CC_OPTIMIZE_FOR_SIZE).

This is really a shame; gcc is really a lot slower compiling with -O2.
It would be good to find some other way to make pass 4 quicker by
default.

Cheers,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-13 22:35 ` Mark Wielaard
@ 2009-07-14  9:56   ` Mark Wielaard
  0 siblings, 0 replies; 8+ messages in thread

From: Mark Wielaard @ 2009-07-14 9:56 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap

On Tue, 2009-07-14 at 00:35 +0200, Mark Wielaard wrote:
> So, I made the default the kernel opt-level one again (which is -O2 or
> -Os depending on CONFIG_CC_OPTIMIZE_FOR_SIZE).
>
> This is really a shame, gcc is really a lot slower compiling with -O2.
> It would be good to find some other way to make pass 4 quicker by
> default.

Actually, I have now removed all the support for changing the
opt-level again. If we cannot guarantee that the code compiles and/or
generates correctly functioning code, we are better off not officially
supporting it :{ I left some comments in buildrun.cxx for those who
want to edit the sources and experiment with it anyway.

Sniff,

Mark
* Re: Pass 4 (non-)optimization speedup
  2009-07-13 22:35 ` Mark Wielaard
  2009-07-14  9:56   ` Mark Wielaard
@ 2009-07-15  2:17   ` Josh Stone
  2009-07-15 17:42     ` Frank Ch. Eigler
  1 sibling, 1 reply; 8+ messages in thread

From: Josh Stone @ 2009-07-15 2:17 UTC (permalink / raw)
To: Mark Wielaard; +Cc: Frank Ch. Eigler, systemtap

[-- Attachment #1: Type: text/plain, Size: 917 bytes --]

On 07/13/2009 03:35 PM, Mark Wielaard wrote:
> This is really a shame, gcc is really a lot slower compiling with -O2.
> It would be good to find some other way to make pass 4 quicker by
> default.

Another approach mentioned on IRC is trying to simplify the code that
we present to GCC. I took a stab at consolidating the probe locking
code; patch attached. It makes each probe body contain just a static
array of the locks needed, and then calls a single function that
actually loops over the locks.

This will only be a win for scripts with many distinct probe bodies,
like topsys.stp. For that script, I see pass-4 time go from 5 seconds
to 4.4 seconds. At runtime it's hard to get consistent measurements,
but the average probe time is about 2-3% slower.

So, is that overhead worth 1/2 second of compile time? Can anyone see
ways to improve this patch, or suggest other areas that could benefit
from focus?

Josh

[-- Attachment #2: data-driven-locking.patch --]
[-- Type: text/x-patch, Size: 7409 bytes --]

commit d8a8b06d763d006be5efe0ec53c7b5a41df30223
Author: Josh Stone <jistone@redhat.com>
Date:   Mon Jul 13 13:22:11 2009 -0700

    make probe locking data-driven

diff --git a/translate.cxx b/translate.cxx
index 9c90106..a35be58 100644
--- a/translate.cxx
+++ b/translate.cxx
@@ -1590,14 +1590,7 @@ c_unparser::emit_probe (derived_probe* v)
   // initialize frame pointer
   o->newline() << "struct " << v->name << "_locals * __restrict__ l =";
   o->newline(1) << "& c->locals[0]." << v->name << ";";
-  o->newline(-1) << "(void) l;"; // make sure "l" is marked used
-
-  o->newline() << "#ifdef STP_TIMING";
-  o->newline() << "c->statp = & time_" << v->basest()->name << ";";
-  o->newline() << "#endif";
-
-  // emit probe local initialization block
-  v->emit_probe_local_init(o);
+  o->indent (-1);

   // emit all read/write locks for global variables
   varuse_collecting_visitor vut;
@@ -1607,6 +1600,15 @@ c_unparser::emit_probe (derived_probe* v)
       emit_locks (vut);
     }

+  o->newline() << "(void) l;"; // make sure "l" is marked used
+
+  o->newline() << "#ifdef STP_TIMING";
+  o->newline() << "c->statp = & time_" << v->basest()->name << ";";
+  o->newline() << "#endif";
+
+  // emit probe local initialization block
+  v->emit_probe_local_init(o);
+
   // initialize locals
   for (unsigned j=0; j<v->locals.size(); j++)
     {
@@ -1658,11 +1660,8 @@ c_unparser::emit_probe (derived_probe* v)
 void
 c_unparser::emit_locks(const varuse_collecting_visitor& vut)
 {
-  o->newline() << "{";
-  o->newline(1) << "unsigned numtrylock = 0;";
-  o->newline() << "(void) numtrylock;";
+  bool has_locks = false;

-  string last_locked_var;
   for (unsigned i = 0; i < session->globals.size(); i++)
     {
       vardecl* v = session->globals[i];
@@ -1694,37 +1693,36 @@ c_unparser::emit_locks(const varuse_collecting_visitor& vut)
           continue;
         }

-      string lockcall =
-        string (write_p ? "write" : "read") +
-        "_trylock (& global.s_" + v->name + "_lock)";
+      if (!has_locks)
+        {
+          o->newline() << "static const probe_lock_t locks[] = {";
+          o->indent(1);
+          has_locks = true;
+        }

-      o->newline() << "while (! " << lockcall
-                   << "&& (++numtrylock < MAXTRYLOCK))";
-      o->newline(1) << "ndelay (TRYLOCKDELAY);";
-      o->newline(-1) << "if (unlikely (numtrylock >= MAXTRYLOCK)) {";
-      o->newline(1) << "atomic_inc (& skipped_count);";
+      o->newline() << "{";
+      o->newline(1) << ".lock = & global.s_" << c_varname (v->name) << "_lock,";
       o->newline() << "#ifdef STP_TIMING";
-      o->newline() << "atomic_inc (& global.s_" << c_varname (v->name) << "_lock_skip_count);";
+      o->newline() << ".skip = & global.s_" << c_varname (v->name) << "_lock_skip_count,";
       o->newline() << "#endif";
-      // The following works even if i==0.  Note that using
-      // globals[i-1]->name is wrong since that global may not have
-      // been lockworthy by this probe.
-      o->newline() << "goto unlock_" << last_locked_var << ";";
-      o->newline(-1) << "}";
-
-      last_locked_var = v->name;
+      o->newline() << ".write_p = " << (write_p ? 1 : 0) << ",";
+      o->newline(-1) << "},";
     }

-  o->newline() << "if (0) goto unlock_;";
-
-  o->newline(-1) << "}";
+  if (has_locks)
+    {
+      o->newline(-1) << "};";
+      o->newline() << "if (!probe_trylock(locks, ARRAY_SIZE(locks)))";
+      o->newline(1) << "goto unlocked;";
+      o->indent(-1);
+    }
 }

 void
 c_unparser::emit_unlocks(const varuse_collecting_visitor& vut)
 {
-  unsigned numvars = 0;
+  bool has_locks = false;

   if (session->verbose>1)
     clog << current_probe->name << " locks ";
@@ -1751,32 +1749,24 @@ c_unparser::emit_unlocks(const varuse_collecting_visitor& vut)
           continue;
         }

-      numvars ++;
-      o->newline(-1) << "unlock_" << v->name << ":";
-      o->indent(1);
-
+      has_locks = true;
       if (session->verbose>1)
         clog << v->name << "[" << (read_p ? "r" : "")
             << (write_p ? "w" : "") << "] ";
-
-      if (write_p) // emit write lock
-        o->newline() << "write_unlock (& global.s_" << v->name << "_lock);";
-      else // (read_p && !write_p) : emit read lock
-        o->newline() << "read_unlock (& global.s_" << v->name << "_lock);";
-
-      // fall through to next variable; thus the reverse ordering
     }

-  // emit plain "unlock" label, used if the very first lock failed.
-  o->newline(-1) << "unlock_: ;";
-  o->indent(1);
-
-  if (numvars) // is there a chance that any lock attempt failed?
+  if (has_locks) // is there a chance that any lock attempt failed?
     {
       // Formerly, we checked skipped_count > MAXSKIPPED here, and set
       // SYSTEMTAP_SESSION_ERROR if so.  But now, this check is shared
       // via common_probe_entryfn_epilogue().
+      o->newline() << "probe_unlock(locks, ARRAY_SIZE(locks));";
+
+      // emit plain "unlocked" label, used if any locks failed
+      o->newline(-1) << "unlocked: ;";
+      o->indent(1);
+
       if (session->verbose>1)
         clog << endl;
     }
@@ -5223,6 +5213,46 @@ translate_pass (systemtap_session& s)
     }
   s.op->assert_0_indent();

+  s.op->newline() << "typedef struct {";
+  s.op->newline(1) << "rwlock_t *lock;";
+  s.op->newline() << "#ifdef STP_TIMING";
+  s.op->newline() << "atomic_t *skip;";
+  s.op->newline() << "#endif";
+  s.op->newline() << "int write_p;";
+  s.op->newline(-1) << "} probe_lock_t;";
+
+  s.op->newline() << "static void";
+  s.op->newline() << "probe_unlock(const probe_lock_t *locks, unsigned n)";
+  s.op->newline() << "{";
+  s.op->newline(1) << "while (n--)";
+  s.op->newline(1) << "if (locks[n].write_p)";
+  s.op->newline(1) << "write_unlock(locks[n].lock);";
+  s.op->newline(-1) << "else";
+  s.op->newline(1) << "read_unlock(locks[n].lock);";
+  s.op->newline(-3) << "}";
+
+  s.op->newline() << "static int";
+  s.op->newline() << "probe_trylock(const probe_lock_t *locks, unsigned n)";
+  s.op->newline() << "{";
+  s.op->newline(1) << "unsigned i, numtrylock = 0;";
+  s.op->newline(0) << "for (i = 0; i < n; ++i) {";
+  s.op->newline(1) << "if (locks[i].write_p)";
+  s.op->newline(1) << "while (!write_trylock(locks[i].lock) && (++numtrylock < MAXTRYLOCK))";
+  s.op->newline(1) << "ndelay (TRYLOCKDELAY);";
+  s.op->newline(-2) << "else";
+  s.op->newline(1) << "while (!read_trylock(locks[i].lock) && (++numtrylock < MAXTRYLOCK))";
+  s.op->newline(1) << "ndelay (TRYLOCKDELAY);";
+  s.op->newline(-2) << "if (unlikely (numtrylock >= MAXTRYLOCK)) {";
+  s.op->newline(1) << "atomic_inc (& skipped_count);";
+  s.op->newline() << "#ifdef STP_TIMING";
+  s.op->newline() << "atomic_inc (locks[i].skip);";
+  s.op->newline() << "#endif";
+  s.op->newline() << "probe_unlock(locks, i);";
+  s.op->newline() << "return 0;";
+  s.op->newline(-1) << "}";
+  s.op->newline(-1) << "}";
+  s.op->newline() << "return 1;";
+  s.op->newline(-1) << "}";
+  s.op->assert_0_indent();
+
   // Run a varuse_collecting_visitor over probes that need global
   // variable locks.  We'll use this information later in
   // emit_locks()/emit_unlocks().
* Re: Pass 4 (non-)optimization speedup
  2009-07-15  2:17 ` Josh Stone
@ 2009-07-15 17:42   ` Frank Ch. Eigler
  0 siblings, 0 replies; 8+ messages in thread

From: Frank Ch. Eigler @ 2009-07-15 17:42 UTC (permalink / raw)
To: Josh Stone; +Cc: Mark Wielaard, systemtap

Hi -

On Tue, Jul 14, 2009 at 07:16:54PM -0700, Josh Stone wrote:
> [...]
> Another approach mentioned on IRC is trying to simplify the code that we
> present to GCC.

Right, and to check again whether we may have some larger
duplicate-elimination opportunities. We have gone to some effort to
reuse translated functions and probe bodies, but maybe it's not
working well enough.

> [...] So, is that 2-3% overhead worth 1/2 second compile time?

Not IMO. Run-time performance is more important.

> Can anyone see ways to improve this patch, or suggest other areas
> that could benefit from focus?

The runtime? Profile gcc slightly? (--vp 0009 indicates considerable
lex/parse time for some tests.) Reduce the number of #include'd kernel
files if possible? Improve the translate.cxx parts to copy-propagate
better and eliminate more temporary variables?

- FChE
* Re: Pass 4 (non-)optimization speedup
  2009-07-11 18:32 Pass 4 (non-)optimization speedup Mark Wielaard
  2009-07-12  1:51 ` Frank Ch. Eigler
@ 2009-07-15  2:17 ` Josh Stone
  1 sibling, 0 replies; 8+ messages in thread

From: Josh Stone @ 2009-07-15 2:17 UTC (permalink / raw)
To: Mark Wielaard; +Cc: systemtap

On 07/11/2009 11:32 AM, Mark Wielaard wrote:
> The -O0 default speeds up some things a lot.
> $ stap -v -k -p4 -e 'probe syscall.* { log(argstr); }'
> Before: Pass 4: compiled C [...] in 7520usr/900sys/8415real ms.
> After: Pass 4: compiled C [...] in 4980usr/730sys/5782real ms.

By the way, -k will bypass ALL caching, including the saved autoconf
header. Generating that header costs a few seconds that you wouldn't
normally pay when iterating on a script, since the autoconf results
are cached independently of the script contents.

It might be nice to split session.use_cache so that the tracepoint and
autoconf caches can be used even when the script-specific caches are
not (due to -k or -m).

Josh