* Where stap spends its time producing instrumentation for testsuite/systemtap.syscall/sys.stp
@ 2007-10-03 19:02 William Cohen
2007-10-04 3:31 ` Frank Ch. Eigler
0 siblings, 1 reply; 3+ messages in thread
From: William Cohen @ 2007-10-03 19:02 UTC (permalink / raw)
To: SystemTAP
One of the complaints has been about the amount of time and space
required for the creation of systemtap instrumentation. I decided to
get a concrete data point on this for the
testsuite/systemtap.syscall/sys.stp. This is not an exhaustive
analysis, but a quick experiment using current a current SystemTap
snapshot with locally built copy of elfutils 0.129. This is on 2GHz
86-64 machine with 1GB ram running 2.6.23-0.214.rc8.git2.fc8 kernel
and snapshot of fedora in development. Hopefully, it will give us some
idea of where improvements can be made.
-Will
Set up oprofile to measure things:
$ sudo opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
--separate=library
Below is the the experiment to see how long it takes to build the
instrumentation with oprofile collecting data. The analysis pass (pass
2) and compilation pass take a over 30 seconds each.
$ sudo opcontrol --start; ../../../install/bin/stap -p4 -v sys.stp -m xyz -k;
sudo opcontrol --shutdown
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
Warning: using '-m' disables cache support.
Pass 1: parsed user script and 37 library script(s) in 530usr/20sys/626real ms.
Pass 2: analyzed script: 768 probe(s), 108 function(s), 14 embed(s), 4 global(s)
in 39690usr/360sys/40385real ms.
Pass 3: translated to C into "/tmp/stapnoy9IG/xyz.c" in 300usr/110sys/462real ms.
Pass 4: compiled C into "xyz.ko" in 34080usr/1500sys/36972real ms.
Keeping temporary directory "/tmp/stapnoy9IG"
Stopping profiling.
Killing daemon.
Quick look at the size of the generated module source code and object code.
Size of generated code, about 120K lines of code generated by
translator, which results in 1.13MB text and 170KB of data.
$ wc /tmp/stapnoy9IG/xyz.c
120522 352489 4295506 /tmp/stapnoy9IG/xyz.c
$ size /tmp/stapnoy9IG/xyz.ko
text data bss dec hex filename
1134094 169408 4816 1308318 13f69e /tmp/stapnoy9IG/xyz.ko
Overall look at the data collected by oprofile below. Each sample of
CPU_CLK_UNHALTED is approximately 50ms (1,000,000 cycle
cycles/2GHz). This agrees time reported by stap for the phases 79459
samples for stap, about 39 seconds. cc1 and as have about 33 seconds
of samples between them. For stap much of the time is spent in libdw.
$ opreport -t1
CPU: AMD64 processors, speed 800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask
of 0x00 (No unit mask) count 1000000
Counted RETIRED_INSTRUCTIONS events (Retired instructions (includes exceptions,
interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
79469 51.6932 90998 58.4463 stap
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
31962 40.2195 38307 42.0965 libdw.so.1
25285 31.8174 30188 33.1744 stap
12849 16.1686 12348 13.5695 libc-2.6.90.so
8149 10.2543 8118 8.9211 libstdc++.so.6.0.8
1165 1.4660 2021 2.2209 libelf.so.1
38 0.0478 12 0.0132 ld-2.6.90.so
19 0.0239 2 0.0022 libpthread-2.6.90.so
2 0.0025 2 0.0022 libdl-2.6.90.so
63810 41.5073 57978 37.2382 cc1
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
58801 92.1501 53243 91.8331 cc1
5006 7.8452 4734 8.1652 libc-2.6.90.so
3 0.0047 1 0.0017 ld-2.6.90.so
6847 4.4539 3672 2.3585 vmlinux
2779 1.8077 2424 1.5569 as
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
1988 71.5365 1568 64.6865 as
731 26.3044 806 33.2508 libc-2.6.90.so
52 1.8712 49 2.0215 libbfd-2.17.50.0.18-1.so
8 0.2879 1 0.0413 ld-2.6.90.so
Targetted look where which functions stap is spending it's time
below. See that __libdw_find_attr and __libdw_form_val_len are the hotspot and
consume almost 1/3 of the stap's runtime.
$ opreport -l -t1 ../../../install/bin/stap
CPU: AMD64 processors, speed 800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask
of 0x00 (No unit mask) count 1000000
Counted RETIRED_INSTRUCTIONS events (Retired instructions (includes exceptions,
interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
samples % samples % image name symbol name
13131 16.5234 13807 15.1729 libdw.so.1 __libdw_find_attr
12639 15.9043 17175 18.8740 libdw.so.1 __libdw_form_val_len
8149 10.2543 8118 8.9211 libstdc++.so.6.0.8 (no symbols)
4545 5.7192 4278 4.7012 libc-2.6.90.so strlen
2550 3.2088 2561 2.8143 stap
query_dwarf_func(Dwarf_Die*, void*)
2052 2.5821 1354 1.4879 libc-2.6.90.so malloc
1703 2.1430 3527 3.8759 libdw.so.1 dwfl_module_getsym
1676 2.1090 1534 1.6858 stap
dwflpp::focus_on_function(Dwarf_Die*)
1575 1.9819 1965 2.1594 stap
__gnu_cxx::__normal_iterator<Dwarf_Die const*, std::vector<Dwarf_Die,
std::allocator<Dwarf_Die> > >::__normal_iterator(Dwarf_Die const* const&)
1551 1.9517 1920 2.1099 stap
std::vector<Dwarf_Die, std::allocator<Dwarf_Die> >::size() const
1527 1.9215 1831 2.0121 stap
dwflpp::iterate_over_functions(int (*)(Dwarf_Die*, void*), void*)
1243 1.5641 1788 1.9649 libc-2.6.90.so memcpy
1190 1.4974 971 1.0671 libdw.so.1 dwarf_attr
1156 1.4547 1560 1.7143 stap
lookup_symbol_address(Dwfl_Module*, char const*)
1135 1.4282 2015 2.2143 libelf.so.1 gelf_getsymshndx
1057 1.3301 904 0.9934 libc-2.6.90.so _int_free
1038 1.3062 876 0.9627 libc-2.6.90.so _int_malloc
1029 1.2948 998 1.0967 stap .plt
990 1.2458 1360 1.4945 stap
__gnu_cxx::__normal_iterator<Dwarf_Die const*, std::vector<Dwarf_Die,
std::allocator<Dwarf_Die> > >::difference_type __gnu_cxx::operator-<Dwarf_Die
const*, Dwarf_Die const*, std::vector<Dwarf_Die, std::allocator<Dwarf_Die> >
>(__gnu_cxx::__normal_iterator<Dwarf_Die const*, std::vector<Dwarf_Die,
std::allocator<Dwarf_Die> > > const&, __gnu_cxx::__normal_iterator<Dwarf_Die
const*, std::vector<Dwarf_Die, std::allocator<Dwarf_Die> > > const&)
927 1.1665 1443 1.5857 stap
dwflpp::func_is_inline()
914 1.1501 1093 1.2011 libc-2.6.90.so internal_fnmatch
897 1.1287 957 1.0517 stap
std::vector<Dwarf_Die, std::allocator<Dwarf_Die> >::at(unsigned long)
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Where stap spends its time producing instrumentation for testsuite/systemtap.syscall/sys.stp
2007-10-03 19:02 Where stap spends its time producing instrumentation for testsuite/systemtap.syscall/sys.stp William Cohen
@ 2007-10-04 3:31 ` Frank Ch. Eigler
2007-10-04 14:17 ` William Cohen
0 siblings, 1 reply; 3+ messages in thread
From: Frank Ch. Eigler @ 2007-10-04 3:31 UTC (permalink / raw)
To: systemtap
wcohen wrote:
> [...] Below is the the experiment to see how long it takes to build the
> [systemtap.syscall/sys.stp] instrumentation with oprofile collecting
> data.
Thanks.
> The analysis pass (pass 2) and compilation pass take a over 30
> seconds each.
Yeah, that's far too long.
> Pass 1: parsed user script and 37 library script(s) in 530usr/20sys/626real ms.
> Pass 2: analyzed script: 768 probe(s), 108 function(s), 14 embed(s), 4
> global(s) in 39690usr/360sys/40385real ms.
This part is likely to improve with the better elfutils-dwfl caching
widget I'm working on for the user-space probing support.
> Pass 3: translated to C into "/tmp/stapnoy9IG/xyz.c" in 300usr/110sys/462real ms.
> Pass 4: compiled C into "xyz.ko" in 34080usr/1500sys/36972real ms. [...]
> $ wc /tmp/stapnoy9IG/xyz.c
> 120522 352489 4295506 /tmp/stapnoy9IG/xyz.c
> $ size /tmp/stapnoy9IG/xyz.ko
> text data bss dec hex filename
> 1134094 169408 4816 1308318 13f69e /tmp/stapnoy9IG/xyz.ko
Both compilation time & result size numbers will have improved greatly
since the time you took these measurements due to some code generation
improvements (see bug #5096 and #5102).
There is a chance that future implementation of bug #5075 could create
many more code elision opportunities as $argstr computation is moved
out of explicit script code.
- FChE
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Where stap spends its time producing instrumentation for testsuite/systemtap.syscall/sys.stp
2007-10-04 3:31 ` Frank Ch. Eigler
@ 2007-10-04 14:17 ` William Cohen
0 siblings, 0 replies; 3+ messages in thread
From: William Cohen @ 2007-10-04 14:17 UTC (permalink / raw)
To: Frank Ch. Eigler; +Cc: systemtap
Hi Frank,
Thanks for the improvements. I reran the same experiment from
Wednesday on sys.stp with a fresh cvs checkout of systemtap. There are
improvements in the pass 2 and pass 4 times. Pass 2 dropped form 39.7
seconds to 29.7 seconds user time. Pass 4 times dropped from 34.1
seconds to 26.4 seconds. Definitely, a trend in ht ecorrect direction.
The __libdw_find_attr and __libdw_form_val_len functions still
dominate the runtimes. The elfutils-dwfl caching is going to improve
that situation?
-Will
$ sudo opcontrol --start; /home/wcohen/systemtap_write/install/bin/stap -p4 -v
/home/wcohen/systemtap_write/src/testsuite/systemtap.syscall/sys.stp -m xyz -k;
sudo opcontrol --shutdown
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
Warning: using '-m' disables cache support.
Pass 1: parsed user script and 37 library script(s) in 360usr/40sys/423real ms.
Pass 2: analyzed script: 768 probe(s), 108 function(s), 14 embed(s), 4 global(s)
in 29740usr/1590sys/38374real ms.
Pass 3: translated to C into "/tmp/stapJKUmBF/xyz.c" in 150usr/110sys/256real ms.
Pass 4: compiled C into "xyz.ko" in 26350usr/1710sys/30719real ms.
Keeping temporary directory "/tmp/stapJKUmBF"
Stopping profiling.
Killing daemon.
The size of generated code also dropped significantly from 120K lines
to 91K lines of code. The resulting module is roughly half the size
for the text region.
$ wc /tmp/stapJKUmBF/xyz.c
91259 227560 2586615 /tmp/stapJKUmBF/xyz.c
$ size /tmp/stapJKUmBF/xyz.ko
text data bss dec hex filename
529327 169408 4816 703551 abc3f /tmp/stapJKUmBF/xyz.ko
The number of samples for stap dropped quite a bit. Most of these
appear to be areduction in the stap code itself. For the component
breakdown of stap, the stap code was about 32% of the total samples
for stap. It is now around 11% fo the samples. The compiler cc1 does
less work also.
$ opreport -t1
CPU: AMD64 processors, speed 800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask
of 0x00 (No unit mask) count 1000000
Counted RETIRED_INSTRUCTIONS events (Retired instructions (includes exceptions,
interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
56465 48.0868 64383 53.5976 stap
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
32131 56.9043 37883 58.8401 libdw.so.1
11497 20.3613 12154 18.8777 libc-2.6.90.so
6431 11.3894 5939 9.2245 stap
5233 9.2677 6436 9.9964 libstdc++.so.6.0.8
1129 1.9995 1961 3.0458 libelf.so.1
32 0.0567 6 0.0093 ld-2.6.90.so
9 0.0159 4 0.0062 libpthread-2.6.90.so
2 0.0035 0 0 libebl_x86_64.so
1 0.0018 0 0 libdl-2.6.90.so
48520 41.3207 47267 39.3488 cc1
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
45011 92.7679 43950 92.9824 cc1
3502 7.2176 3313 7.0091 libc-2.6.90.so
7 0.0144 4 0.0085 ld-2.6.90.so
9456 8.0529 6286 5.2330 vmlinux
1869 1.5917 1635 1.3611 as
CPU_CLK_UNHALT...|RETIRED_INSTRU...|
samples| %| samples| %|
------------------------------------
1325 70.8935 1062 64.9541 as
501 26.8058 539 32.9664 libc-2.6.90.so
34 1.8192 31 1.8960 libbfd-2.17.50.0.18-1.so
8 0.4280 3 0.1835 ld-2.6.90.so
1 0.0535 0 0 libopcodes-2.17.50.0.18-1.so
The two hotspot functions still exist. Due to the reduction in other
parts of of stap now have a larger portion of the samples.
$ opreport -l -t1 /home/wcohen/systemtap_write/install/bin/stap
CPU: AMD64 processors, speed 800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask
of 0x00 (No unit mask) count 1000000
Counted RETIRED_INSTRUCTIONS events (Retired instructions (includes exceptions,
interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
samples % samples % image name symbol name
13316 23.5828 13444 20.8813 libdw.so.1 __libdw_find_attr
12501 22.1394 17166 26.6623 libdw.so.1 __libdw_form_val_len
4770 8.4477 3806 5.9115 libc-2.6.90.so strlen
1773 3.1400 2930 4.5509 libdw.so.1 dwfl_module_getsym
1581 2.8000 1722 2.6746 stap
query_dwarf_func(Dwarf_Die*, void*)
1413 2.5024 1350 2.0968 libdw.so.1 dwarf_attr
1361 2.4103 2306 3.5817 libstdc++.so.6.0.8
std::string::assign(std::string const&)
1154 2.0437 247 0.3836 stap
std::_Rb_tree<std::string, std::pair<std::string const, std::vector<Dwarf_Die,
std::allocator<Dwarf_Die> >*>, std::_Select1st<std::pair<std::string const,
std::vector<Dwarf_Die, std::allocator<Dwarf_Die> >*> >, std::less<std::string>,
std::allocator<std::pair<std::string const, std::vector<Dwarf_Die,
std::allocator<Dwarf_Die> >*> > >::lower_bound(std::string const&)
1143 2.0243 1640 2.5473 libc-2.6.90.so memcpy
1106 1.9587 1957 3.0396 libelf.so.1 gelf_getsymshndx
1039 1.8401 977 1.5175 libc-2.6.90.so _int_free
1038 1.8383 1201 1.8654 stap
query_cu(Dwarf_Die*, void*)
1027 1.8188 1160 1.8017 libc-2.6.90.so _int_malloc
974 1.7250 1117 1.7349 libc-2.6.90.so internal_fnmatch
733 1.2981 1118 1.7365 libc-2.6.90.so strcmp
718 1.2716 166 0.2578 stap
symresolution_info::find_function(std::string const&, unsigned int)
683 1.2096 1195 1.8561 stap
lookup_symbol_address(Dwfl_Module*, char const*)
644 1.1405 532 0.8263 libc-2.6.90.so free
644 1.1405 581 0.9024 libdw.so.1 dwarf_func_inline
612 1.0839 1145 1.7784 libc-2.6.90.so malloc
598 1.0591 370 0.5747 libdw.so.1 lookup
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2007-10-04 14:17 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-03 19:02 Where stap spends its time producing instrumentation for testsuite/systemtap.syscall/sys.stp William Cohen
2007-10-04 3:31 ` Frank Ch. Eigler
2007-10-04 14:17 ` William Cohen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).