On 11/26/19 6:59 PM, Tom de Vries wrote:
> Hi,
>
> I've been working on performance improvements for dwz, using a cc1
> binary as my optimization vehicle.
>
> Comparing the situation:
> - before (commit 04a676d Add --devel-partition-dups-opt), and
> - after (current master, commit e405c62 Add --devel-die-count-method
>   {none,estimate})
> I get the following results.
>
> When avoiding running into the low-mem die-limit using -lnone, we get
> ~25% performance improvement, due to an improved hash function and an
> improved hash table allocation strategy (without increasing peak memory
> usage):
> ...
> real: mean: 7378.10 100.00% stddev: 45.31
>       mean: 5558.80  75.34% stddev: 35.18
> user: mean: 7106.30 100.00% stddev: 41.53
>       mean: 5328.10  74.98% stddev: 22.33
> sys:  mean:  271.60 100.00% stddev: 39.57
>       mean:  230.00  84.68% stddev: 40.45
> ...
>
> And if we don't avoid running into the low-mem die-limit, we get ~38%
> performance improvement:
> ...
> real: mean: 15084.80 100.00% stddev: 44.53
>       mean:  9232.90  61.21% stddev: 41.80
> user: mean: 14759.40 100.00% stddev: 30.62
>       mean:  9100.10  61.66% stddev: 41.75
> sys:  mean:   324.00 100.00% stddev: 39.51
>       mean:   132.00  40.74% stddev: 27.26
> ...
> which is also paired with a reduction in peak memory usage of ~34%, from
> 0.95GB to 0.63GB, due to running into the low-mem die-limit in a more
> efficient manner.

Hi.

That sounds very promising! I would like to see it being used in our
openSUSE package. Are you planning to use it?

Thanks,
Martin

>
> Thanks,
> - Tom
>
On 27-11-2019 13:52, Martin Liška wrote:
> On 11/26/19 6:59 PM, Tom de Vries wrote:
>> Hi,
>>
>> I've been working on performance improvements for dwz, using a cc1
>> binary as my optimization vehicle.
>>
>> Comparing the situation:
>> - before (commit 04a676d Add --devel-partition-dups-opt), and
>> - after (current master, commit e405c62 Add --devel-die-count-method
>>   {none,estimate})
>> I get the following results.
>>
>> When avoiding running into the low-mem die-limit using -lnone, we get
>> ~25% performance improvement, due to an improved hash function and an
>> improved hash table allocation strategy (without increasing peak memory
>> usage):
>> ...
>> real: mean: 7378.10 100.00% stddev: 45.31
>>       mean: 5558.80  75.34% stddev: 35.18
>> user: mean: 7106.30 100.00% stddev: 41.53
>>       mean: 5328.10  74.98% stddev: 22.33
>> sys:  mean:  271.60 100.00% stddev: 39.57
>>       mean:  230.00  84.68% stddev: 40.45
>> ...
>>
>> And if we don't avoid running into the low-mem die-limit, we get ~38%
>> performance improvement:
>> ...
>> real: mean: 15084.80 100.00% stddev: 44.53
>>       mean:  9232.90  61.21% stddev: 41.80
>> user: mean: 14759.40 100.00% stddev: 30.62
>>       mean:  9100.10  61.66% stddev: 41.75
>> sys:  mean:   324.00 100.00% stddev: 39.51
>>       mean:   132.00  40.74% stddev: 27.26
>> ...
>> which is also paired with a reduction in peak memory usage of ~34%, from
>> 0.95GB to 0.63GB, due to running into the low-mem die-limit in a more
>> efficient manner.
>
> Hi.
>
> That sounds very promising! I would like to see it being used in our
> openSUSE package. Are you planning to use it?
>
For the dwz openSUSE package I follow the usual strategy: backport
bugfixes and upgrade to newer releases, once available.
So, the intention is that this lands in openSUSE with the next release.
I'm currently working on a dwz bug fix; once that is done, and if I
manage to finalize the odr stuff as well, I think it'll be time for a
new release.
Thanks,
- Tom
Hi,

I've been working on performance improvements for dwz, using a cc1
binary as my optimization vehicle.

Comparing the situation:
- before (commit 04a676d Add --devel-partition-dups-opt), and
- after (current master, commit e405c62 Add --devel-die-count-method
  {none,estimate})
I get the following results.

When avoiding running into the low-mem die-limit using -lnone, we get
~25% performance improvement, due to an improved hash function and an
improved hash table allocation strategy (without increasing peak memory
usage):
...
real: mean: 7378.10 100.00% stddev: 45.31
      mean: 5558.80  75.34% stddev: 35.18
user: mean: 7106.30 100.00% stddev: 41.53
      mean: 5328.10  74.98% stddev: 22.33
sys:  mean:  271.60 100.00% stddev: 39.57
      mean:  230.00  84.68% stddev: 40.45
...

And if we don't avoid running into the low-mem die-limit, we get ~38%
performance improvement:
...
real: mean: 15084.80 100.00% stddev: 44.53
      mean:  9232.90  61.21% stddev: 41.80
user: mean: 14759.40 100.00% stddev: 30.62
      mean:  9100.10  61.66% stddev: 41.75
sys:  mean:   324.00 100.00% stddev: 39.51
      mean:   132.00  40.74% stddev: 27.26
...
which is also paired with a reduction in peak memory usage of ~34%, from
0.95GB to 0.63GB, due to running into the low-mem die-limit in a more
efficient manner.

Thanks,
- Tom
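The percentage column in the two tables above is the "after" mean divided by the "before" mean, and the ~34% memory figure follows from the two peak sizes. The sketch below only re-derives those ratios from the numbers quoted in the message. The function and variable names are made up for illustration, and the script that produced the original measurements is not shown in the thread.

```python
# Re-derive the relative figures from the means quoted in the message above.
# All names here are illustrative; only the numbers come from the thread.

def percent_of_baseline(after_mean, before_mean):
    """Return the 'after' mean as a percentage of the 'before' mean."""
    return 100.0 * after_mean / before_mean

# (before_mean, after_mean) pairs copied from the two tables.
WITH_LNONE = {
    "real": (7378.10, 5558.80),
    "user": (7106.30, 5328.10),
    "sys": (271.60, 230.00),
}
WITHOUT_LNONE = {
    "real": (15084.80, 9232.90),
    "user": (14759.40, 9100.10),
    "sys": (324.00, 132.00),
}

for title, table in (("-lnone", WITH_LNONE), ("default limit", WITHOUT_LNONE)):
    print(title)
    for kind, (before, after) in table.items():
        print(f"  {kind:<5} {percent_of_baseline(after, before):6.2f}%")

# Peak memory reduction, from the 0.95GB and 0.63GB figures.
peak_before_gb, peak_after_gb = 0.95, 0.63
mem_reduction = 100.0 * (1.0 - peak_after_gb / peak_before_gb)
print(f"peak memory reduction: {mem_reduction:.1f}%")
```

The improvement quoted in the prose is 100% minus the "real" percentage, so 75.34% corresponds to the ~25% figure.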
Hello.

I've made a couple of experiments with dwz speed. I've taken the following packages:
gcc, krita, libetonyek, rtags, sysdig and ran dwz -m x ... for them.

Here are the numbers I collected for the following configurations:
dwz (system package, built with LTO and -O2), dwz-O2_lto is supposed
to be the same (built from source), then I experimented with -O3 and PGO
(based on tramp3d copies 4 times). And the final run is an experimental patch
I have that replaces the iterative_hash with xxhash:
https://github.com/Cyan4973/xxHash

# 1/5: sysdig (60M)
dwz                   : 10.0
dwz                   : 9.8 (98.7%)
dwz-O2_lto            : 9.5 (95.6%)
dwz-O3_lto            : 9.2 (91.9%)
dwz-O3_lto_pgo        : 8.1 (81.3%)
dwz-O3_lto_pgo_xxhash : 7.3 (72.9%)
# 2/5: rtags (148M)
dwz                   : 19.6
dwz                   : 19.6 (99.9%)
dwz-O2_lto            : 17.4 (89.0%)
dwz-O3_lto            : 16.7 (85.4%)
dwz-O3_lto_pgo        : 14.4 (73.6%)
dwz-O3_lto_pgo_xxhash : 13.2 (67.6%)
# 3/5: libetonyek (112M)
dwz                   : 10.5
dwz                   : 10.5 (100.6%)
dwz-O2_lto            : 10.8 (102.8%)
dwz-O3_lto            : 10.1 (96.7%)
dwz-O3_lto_pgo        : 9.1 (87.4%)
dwz-O3_lto_pgo_xxhash : 8.1 (77.1%)
# 4/5: krita (685M)
dwz                   : 133.7
dwz                   : 134.3 (100.5%)
dwz-O2_lto            : 95.3 (71.3%)
dwz-O3_lto            : 91.2 (68.2%)
dwz-O3_lto_pgo        : 78.9 (59.0%)
dwz-O3_lto_pgo_xxhash : 71.6 (53.5%)
# 5/5: gcc (1.2G)
dwz                   : 61.9
dwz                   : 61.9 (99.9%)
dwz-O2_lto            : 58.5 (94.5%)
dwz-O3_lto            : 56.6 (91.3%)
dwz-O3_lto_pgo        : 54.1 (87.4%)
dwz-O3_lto_pgo_xxhash : 51.7 (83.4%)

So as seen, using -O3 really helps; one gets a bigger binary, but as dwz is small
it's negligible:

bloaty dwz-O3_lto -- dwz-O2_lto
    FILE SIZE        VM SIZE
 --------------  --------------
   +28% +50.3Ki  [ = ]       0    .debug_loclists
   +18% +25.3Ki   +18% +25.3Ki    .text
   +12% +24.6Ki  [ = ]       0    .debug_info
   +16% +17.3Ki  [ = ]       0    .debug_line
   +31% +6.19Ki  [ = ]       0    .debug_rnglists
   +11%    +689  [ = ]       0    .debug_abbrev
  +7.1%    +633  [ = ]       0    .strtab
  +5.5%    +504  +5.5%    +504    .eh_frame
  +1.3%    +453  [ = ]       0    .debug_str
  +0.8%    +375  +0.8%    +375    .rodata
  +2.8%    +336  [ = ]       0    .symtab
   +11%     +64  [ = ]       0    .debug_aranges
  +4.2%     +64  +4.4%     +64    .eh_frame_hdr
  [ = ]       0  +1.8%     +32    .bss
  -3.1%     -21  -3.1%     -21    [LOAD #2 [RX]]
 -61.0% -2.20Ki  [ = ]       0    [Unmapped]
   +16%  +124Ki   +13% +26.2Ki    TOTAL

Then, PGO also helps significantly. And finally, using xxhash one can get a 5-10%
improvement.

For now I'm suggesting using -O3 and PGO for our openSUSE package:
https://build.opensuse.org/request/show/942235

Upstream questions I have:
- What about replacing -O2 with -O3 by default?
- Are you interested in the xxhash patch? Do you want it as a conditional build
  or may I replace the currently existing hash function?

Cheers,
Martin
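The percentages in the results above are all relative to the first system-dwz run, so the contribution of each individual change (-O3, PGO, xxhash) has to be read off by comparing neighbouring rows. A minimal sketch of that arithmetic follows, using only the from-source timings quoted in the message. The dictionary layout and the function name are assumptions made for illustration, not part of any dwz tooling.

```python
# Timings copied from the results above (from-source builds only).
# Each step in STEPS adds one change on top of the previous configuration.
STEPS = ["O2_lto", "O3_lto", "O3_lto_pgo", "O3_lto_pgo_xxhash"]

TIMES = {
    "sysdig":     dict(zip(STEPS, [9.5, 9.2, 8.1, 7.3])),
    "rtags":      dict(zip(STEPS, [17.4, 16.7, 14.4, 13.2])),
    "libetonyek": dict(zip(STEPS, [10.8, 10.1, 9.1, 8.1])),
    "krita":      dict(zip(STEPS, [95.3, 91.2, 78.9, 71.6])),
    "gcc":        dict(zip(STEPS, [58.5, 56.6, 54.1, 51.7])),
}

def step_gains(times):
    """Percent of run time saved by each step relative to the previous step."""
    gains = {}
    for prev, cur in zip(STEPS, STEPS[1:]):
        gains[cur] = 100.0 * (1.0 - times[cur] / times[prev])
    return gains

for package, times in TIMES.items():
    gains = step_gains(times)
    summary = ", ".join(f"{step}: {gain:4.1f}%" for step, gain in gains.items())
    print(f"{package:<11} {summary}")
```

Computed this way, the xxhash step alone saves between roughly 4% (gcc) and 11% (libetonyek) on top of the PGO build, which is in line with the 5-10% estimate in the message.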
Hi Martin,

I noticed that this is a reply to a thread from 2 years ago. Is it
related to the work mentioned by Tom in that thread?

On Thu, Dec 23, 2021 at 12:57:48PM +0100, Martin Liška wrote:
> I've made a couple of experiments with dwz speed. I've taken the following packages:
> gcc, krita, libetonyek, rtags, sysdig and ran dwz -m x ... for them.
>
> Here are the numbers I collected for the following configurations:
> dwz (system package, built with LTO and -O2), dwz-O2_lto is supposed
> to be the same (built from source), then I experimented with -O3 and PGO
> (based on tramp3d copies 4 times). And the final run is an experimental patch
> I have that replaces the iterative_hash with xxhash:
> https://github.com/Cyan4973/xxHash
>
> # 1/5: sysdig (60M)
> dwz                   : 10.0
> dwz                   : 9.8 (98.7%)
> dwz-O2_lto            : 9.5 (95.6%)
> dwz-O3_lto            : 9.2 (91.9%)
> dwz-O3_lto_pgo        : 8.1 (81.3%)
> dwz-O3_lto_pgo_xxhash : 7.3 (72.9%)
> # 2/5: rtags (148M)
> dwz                   : 19.6
> dwz                   : 19.6 (99.9%)
> dwz-O2_lto            : 17.4 (89.0%)
> dwz-O3_lto            : 16.7 (85.4%)
> dwz-O3_lto_pgo        : 14.4 (73.6%)
> dwz-O3_lto_pgo_xxhash : 13.2 (67.6%)
> # 3/5: libetonyek (112M)
> dwz                   : 10.5
> dwz                   : 10.5 (100.6%)
> dwz-O2_lto            : 10.8 (102.8%)
> dwz-O3_lto            : 10.1 (96.7%)
> dwz-O3_lto_pgo        : 9.1 (87.4%)
> dwz-O3_lto_pgo_xxhash : 8.1 (77.1%)
> # 4/5: krita (685M)
> dwz                   : 133.7
> dwz                   : 134.3 (100.5%)
> dwz-O2_lto            : 95.3 (71.3%)
> dwz-O3_lto            : 91.2 (68.2%)
> dwz-O3_lto_pgo        : 78.9 (59.0%)
> dwz-O3_lto_pgo_xxhash : 71.6 (53.5%)
> # 5/5: gcc (1.2G)
> dwz                   : 61.9
> dwz                   : 61.9 (99.9%)
> dwz-O2_lto            : 58.5 (94.5%)
> dwz-O3_lto            : 56.6 (91.3%)
> dwz-O3_lto_pgo        : 54.1 (87.4%)
> dwz-O3_lto_pgo_xxhash : 51.7 (83.4%)
>
> So as seen, using -O3 really helps; one gets a bigger binary, but as dwz is small
> it's negligible:
>
> bloaty dwz-O3_lto -- dwz-O2_lto
>     FILE SIZE        VM SIZE
>  --------------  --------------
>    +28% +50.3Ki  [ = ]       0    .debug_loclists
>    +18% +25.3Ki   +18% +25.3Ki    .text
>    +12% +24.6Ki  [ = ]       0    .debug_info
>    +16% +17.3Ki  [ = ]       0    .debug_line
>    +31% +6.19Ki  [ = ]       0    .debug_rnglists
>    +11%    +689  [ = ]       0    .debug_abbrev
>   +7.1%    +633  [ = ]       0    .strtab
>   +5.5%    +504  +5.5%    +504    .eh_frame
>   +1.3%    +453  [ = ]       0    .debug_str
>   +0.8%    +375  +0.8%    +375    .rodata
>   +2.8%    +336  [ = ]       0    .symtab
>    +11%     +64  [ = ]       0    .debug_aranges
>   +4.2%     +64  +4.4%     +64    .eh_frame_hdr
>   [ = ]       0  +1.8%     +32    .bss
>   -3.1%     -21  -3.1%     -21    [LOAD #2 [RX]]
>  -61.0% -2.20Ki  [ = ]       0    [Unmapped]
>    +16%  +124Ki   +13% +26.2Ki    TOTAL
>
> Then, PGO also helps significantly. And finally, using xxhash one can get a 5-10%
> improvement.
>
> For now I'm suggesting using -O3 and PGO for our openSUSE package:
> https://build.opensuse.org/request/show/942235
>
> Upstream questions I have:
> - What about replacing -O2 with -O3 by default?

Did you test that without -flto? If it still gets a ~5% speedup then I
like that idea. Or maybe we should also include -flto by default?

> - Are you interested in the xxhash patch? Do you want it as a conditional build
>   or may I replace the currently existing hash function?

I think it is best to simply replace the existing hash function
instead of making it a conditional thing.

Does it rely on having the libxxhash dynamic library available or
would we simply embed a copy (replacing the hashtab.[ch] files)?

Cheers,

Mark
On 1/3/22 23:06, Mark Wielaard wrote:
> Hi Martin,
>
> I noticed that this is a reply to a thread from 2 years ago. Is it
> related to the work mentioned by Tom in that thread?

Hello.

It's related only a bit as it's also connected to Performance improvements :)

>
> On Thu, Dec 23, 2021 at 12:57:48PM +0100, Martin Liška wrote:
>> I've made a couple of experiments with dwz speed. I've taken the following packages:
>> gcc, krita, libetonyek, rtags, sysdig and ran dwz -m x ... for them.
>>
>> Here are the numbers I collected for the following configurations:
>> dwz (system package, built with LTO and -O2), dwz-O2_lto is supposed
>> to be the same (built from source), then I experimented with -O3 and PGO
>> (based on tramp3d copies 4 times). And the final run is an experimental patch
>> I have that replaces the iterative_hash with xxhash:
>> https://github.com/Cyan4973/xxHash
>>
>> # 1/5: sysdig (60M)
>> dwz                   : 10.0
>> dwz                   : 9.8 (98.7%)
>> dwz-O2_lto            : 9.5 (95.6%)
>> dwz-O3_lto            : 9.2 (91.9%)
>> dwz-O3_lto_pgo        : 8.1 (81.3%)
>> dwz-O3_lto_pgo_xxhash : 7.3 (72.9%)
>> # 2/5: rtags (148M)
>> dwz                   : 19.6
>> dwz                   : 19.6 (99.9%)
>> dwz-O2_lto            : 17.4 (89.0%)
>> dwz-O3_lto            : 16.7 (85.4%)
>> dwz-O3_lto_pgo        : 14.4 (73.6%)
>> dwz-O3_lto_pgo_xxhash : 13.2 (67.6%)
>> # 3/5: libetonyek (112M)
>> dwz                   : 10.5
>> dwz                   : 10.5 (100.6%)
>> dwz-O2_lto            : 10.8 (102.8%)
>> dwz-O3_lto            : 10.1 (96.7%)
>> dwz-O3_lto_pgo        : 9.1 (87.4%)
>> dwz-O3_lto_pgo_xxhash : 8.1 (77.1%)
>> # 4/5: krita (685M)
>> dwz                   : 133.7
>> dwz                   : 134.3 (100.5%)
>> dwz-O2_lto            : 95.3 (71.3%)
>> dwz-O3_lto            : 91.2 (68.2%)
>> dwz-O3_lto_pgo        : 78.9 (59.0%)
>> dwz-O3_lto_pgo_xxhash : 71.6 (53.5%)
>> # 5/5: gcc (1.2G)
>> dwz                   : 61.9
>> dwz                   : 61.9 (99.9%)
>> dwz-O2_lto            : 58.5 (94.5%)
>> dwz-O3_lto            : 56.6 (91.3%)
>> dwz-O3_lto_pgo        : 54.1 (87.4%)
>> dwz-O3_lto_pgo_xxhash : 51.7 (83.4%)
>>
>> So as seen, using -O3 really helps; one gets a bigger binary, but as dwz is small
>> it's negligible:
>>
>> bloaty dwz-O3_lto -- dwz-O2_lto
>>     FILE SIZE        VM SIZE
>>  --------------  --------------
>>    +28% +50.3Ki  [ = ]       0    .debug_loclists
>>    +18% +25.3Ki   +18% +25.3Ki    .text
>>    +12% +24.6Ki  [ = ]       0    .debug_info
>>    +16% +17.3Ki  [ = ]       0    .debug_line
>>    +31% +6.19Ki  [ = ]       0    .debug_rnglists
>>    +11%    +689  [ = ]       0    .debug_abbrev
>>   +7.1%    +633  [ = ]       0    .strtab
>>   +5.5%    +504  +5.5%    +504    .eh_frame
>>   +1.3%    +453  [ = ]       0    .debug_str
>>   +0.8%    +375  +0.8%    +375    .rodata
>>   +2.8%    +336  [ = ]       0    .symtab
>>    +11%     +64  [ = ]       0    .debug_aranges
>>   +4.2%     +64  +4.4%     +64    .eh_frame_hdr
>>   [ = ]       0  +1.8%     +32    .bss
>>   -3.1%     -21  -3.1%     -21    [LOAD #2 [RX]]
>>  -61.0% -2.20Ki  [ = ]       0    [Unmapped]
>>    +16%  +124Ki   +13% +26.2Ki    TOTAL
>>
>> Then, PGO also helps significantly. And finally, using xxhash one can get a 5-10%
>> improvement.
>>
>> For now I'm suggesting using -O3 and PGO for our openSUSE package:
>> https://build.opensuse.org/request/show/942235
>>
>> Upstream questions I have:
>> - What about replacing -O2 with -O3 by default?
>
> Did you test that without -flto? If it still gets a ~5% speedup then I

Yep:

# 1/5: sysdig (60M)
dwz_O2        : 9.7
dwz_O2_xxhash : 8.5 (87.7%)
# 2/5: rtags (58M)
dwz_O2        : 17.6
dwz_O2_xxhash : 15.8 (89.5%)
# 3/5: libetonyek (91M)
dwz_O2        : 10.8
dwz_O2_xxhash : 9.4 (86.7%)
# 4/5: krita (685M)
dwz_O2        : 96.0
dwz_O2_xxhash : 85.6 (89.1%)
# 5/5: gcc (1.2G)
dwz_O2        : 58.6
dwz_O2_xxhash : 54.1 (92.4%)

> like that idea. Or maybe we should also include -flto by default?

Well, it's probably something that can be decided by distributions.
Maybe we can add a default dwz.spec file?

>
>> - Are you interested in the xxhash patch? Do you want it as a conditional build
>>   or may I replace the currently existing hash function?
>
> I think it is best to simply replace the existing hash function
> instead of making it a conditional thing.

Fine, I'm going to prepare a patch.

>
> Does it rely on having the libxxhash dynamic library available or
> would we simply embed a copy (replacing the hashtab.[ch] files)?

I would not do that as it may become obsolete quite fast. I would rather
use a standard shared library (similarly to libelf).

Martin

>
> Cheers,
>
> Mark
>