* Re: minor patch to improve gprofng performance (re: Bug 30898)
  [not found] <8a8c5ffa-f15d-c7a6-ea64-9afe3d42bdb1@gnu.org>
From: Vladimir Mezentsev @ 2023-09-28 2:47 UTC
To: Simon Sobisch; +Cc: binutils

hi Simon,
Thank you for your report.
See comments below.

On 9/26/23 04:25, Simon Sobisch wrote:
> Inspecting bug #30898 [1] showed that there is an issue when using the
> disassembly option with huge (generated) functions.
>
> I gave this a test and found, via
>
>   perf record -o perf.data.gpdisplay --call-graph dwarf,38192 --aio -z \
>     --sample-cpu --mmap-pages 16M \
>     gprofng display text -name short:soname -metrics e.%totalcpu:name \
>     -disasm prog_ test.1.er > /dev/null
>
> That the problem is the disassembly handling.
> Checking the generated perf recording shows that the **burning hot**
> place is DbeInstr::mapPCtoLine(SourceFile*), called by
> Module::set_dis_data(Function*, int, int, int, bool, bool, int);
> taking more than 93.3% of all instructions.

This is a bit of a surprise to me.
It's likely that gcc inlines functions and generates Dwarf that gprofng
interprets poorly.

If you run:
   gprofng display src -dis prog_ <YOUR_EXECUTION_OR_LIBRARY_WHERE_FUNC_IS_LOCATED>
Do you see the same performance problem ?
If yes, may I get this binary ?

>
> Running that took around 5 minutes. Redirecting the output to a file
> leads to a file with 4,124,497 lines, so: this _really_ is about huge
> disassembly.

I generated the big function (~ 1000000 lines). The disassembly is
10000037 lines. This took 38 sec.
But my test is trivial and gcc generates a trivial Dwarf.

>
> I've tinkered a bit with the burning hot function, the result is a
> minor decrease when using C++2017 invalid code, you find it in the
> attached patch.
>
> Also attached is the recorded output for the hot function,
> interestingly the patched version showed quite clearly that over 60 %
> of the complete run's cpu instructions goes to Hist_data.cc line 1380:
>
>   if (p->level == 0)
>
>
> For huge (GnuCOBOL) generated functions the attached patch drops the
> perf stat reported counters by 10%.
> Reported counters (median of 3 runs - code generated with default
> options -O2 -g using g++ (GCC) 11.3) - are as follows:
>
> Original version:
>
>          270,060.53 msec task-clock     #    0.999 CPUs utilized
>   1,023,551,049,245      cycles         #    3.790 GHz
>   2,160,049,675,779      instructions   #    2.11  insn per cycle
>
> adjusted version using the C++2017 removed "register" storage class
> specifier for the pointer (there is possibly a better way), decreasing
> everything:
>
>          260,284.41 msec task-clock     #    0.999 CPUs utilized
>     986,393,903,158      cycles         #    3.790 GHz
>   1,815,443,360,713      instructions   #    1.84  insn per cycle
>
> adjusted version that abides to C++2017, only instructions decreased:
>
>          280,430.13 msec task-clock     #    0.999 CPUs utilized
>   1,062,479,713,621      cycles         #    3.789 GHz
>   1,815,698,269,043      instructions   #    1.71  insn per cycle
>
>
>
> Along to this change a short-term _option_ to drop most of those 60%
> (and, if it drops the amount of entries in there, a good portion of
> walking the pointers) could be to have a copy of func->inlinedSubr
> _once_ that _only_ contains level 0 entries.
>
>
> But in the long-term it seems more reasonable to recheck if that
> function should be rewritten/replaced for better supporting "huge
> disassembly".
>
>
>
> Another note: the reserved memory use for gp-display-text topped 1.6
> GB, there may be a way to improve that, too.

It is not normal.
It looks like gprofng always generates a new DbeLine in
DbeInstr::mapPCtoLine().

-Vladimir

>
>
> [1]: https://sourceware.org/bugzilla/show_bug.cgi?id=30898
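
The short-term idea quoted above (keep a one-time copy of func->inlinedSubr that contains only level 0 entries) could look roughly like the C++ sketch below. This is not gprofng code: only the `level` field and the `p->level == 0` test come from the thread, while `InlinedSubr`, `Func`, `low_pc`, `high_pc`, `levelZero()` and `findLevelZero()` are hypothetical names made up for illustration. It assumes the inline records do not change after the first lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one inlined-subroutine record.
// Only `level` is taken from the thread; the PC range is an assumption.
struct InlinedSubr {
    std::uint64_t low_pc;   // first address covered (assumed field)
    std::uint64_t high_pc;  // one past the last address covered (assumed field)
    int level;              // inlining depth; 0 = outermost
};

// Hypothetical stand-in for a function owning its inline records.
class Func {
public:
    explicit Func(std::vector<InlinedSubr> subrs)
        : inlinedSubr_(std::move(subrs)) {}

    // Builds the list of level-0 indices on first use, then reuses it,
    // so the `level == 0` test runs once per record instead of once
    // per record per lookup.
    const std::vector<std::size_t>& levelZero() {
        if (!cached_) {
            for (std::size_t i = 0; i < inlinedSubr_.size(); ++i)
                if (inlinedSubr_[i].level == 0)
                    levelZero_.push_back(i);
            cached_ = true;
        }
        return levelZero_;
    }

    // Looks up the level-0 record covering `pc`, walking only the
    // pre-filtered entries. Returns nullptr if none covers it.
    const InlinedSubr* findLevelZero(std::uint64_t pc) {
        for (std::size_t i : levelZero()) {
            const InlinedSubr& s = inlinedSubr_[i];
            if (pc >= s.low_pc && pc < s.high_pc)
                return &s;
        }
        return nullptr;
    }

private:
    std::vector<InlinedSubr> inlinedSubr_;  // all records, every level
    std::vector<std::size_t> levelZero_;    // indices of level-0 records
    bool cached_ = false;                   // true once levelZero_ is built
};
```

Storing indices rather than pointers keeps the cache valid if a `Func` object is copied. Whether this helps in practice depends on how many entries have a non-zero level, which the thread does not state.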
* Re: minor patch to improve gprofng performance (re: Bug 30898)
From: Simon Sobisch @ 2024-01-12 22:31 UTC
To: Vladimir Mezentsev; +Cc: binutils

On 28.09.2023 at 04:47, Vladimir Mezentsev wrote:
> hi Simon,
> Thank you for your report.

Sure, just checking back now.

> See comments below.
>
> On 9/26/23 04:25, Simon Sobisch wrote:
>> Inspecting bug #30898 [1] showed that there is an issue when using the
>> disassembly option with huge (generated) functions.
>>
>> I gave this a test and found, via
>>
>>   perf record -o perf.data.gpdisplay --call-graph dwarf,38192 --aio -z \
>>     --sample-cpu --mmap-pages 16M \
>>     gprofng display text -name short:soname -metrics e.%totalcpu:name \
>>     -disasm prog_ test.1.er > /dev/null
>>
>> That the problem is the disassembly handling.
>> Checking the generated perf recording shows that the **burning hot**
>> place is DbeInstr::mapPCtoLine(SourceFile*), called by
>> Module::set_dis_data(Function*, int, int, int, bool, bool, int);
>> taking more than 93.3% of all instructions.
>
> This is a little surprise for me.
> It's likely that gcc inlines functions and generates Dwarf that gprofng
> interprets poorly.

Is there a bug open on this? If not, could you please open one?

>
> If you run:
>    gprofng display src -dis prog_
> <YOUR_EXECUTION_OR_LIBRARY_WHERE_FUNC_IS_LOCATED>
> Do you see the same performance problem ?
> If yes, may I get this binary ?
>
>>
>> Running that took around 5 minutes. Redirecting the output to a file
>> leads to a file with 4,124,497 lines, so: this _really_ is about huge
>> disassembly.
>
>
> I generated the big function (~ 1000000 lines). The disassembly is
> 10,000,037 lines. This took 38 sec.
> But my test is trivial and gcc generates a trivial Dwarf.

Sadly I no longer have access to that environment or to the exact setup,
but I could try with a similar one if this is likely to help.
But this would likely be more reasonable if the two points below are
inspected first.

>> I've tinkered a bit with the burning hot function, the result is a
>> minor decrease when using C++2017 invalid code, you find it in the
>> attached patch.
>>
>> Also attached is the recorded output for the hot function,
>> interestingly the patched version showed quite clearly that over 60 %
>> of the complete run's cpu instructions goes to Hist_data.cc line 1380:
>>
>>   if (p->level == 0)
>>
>>
>> For huge (GnuCOBOL) generated functions the attached patch drops the
>> perf stat reported counters by 10%.
>>
>> [...]

Is anything going to happen with that patch?

>>
>> Along to this change a short-term _option_ to drop most of those 60%
>> (and, if it drops the amount of entries in there, a good portion of
>> walking the pointers) could be to have a copy of func->inlinedSubr
>> _once_ that _only_ contains level 0 entries.

At least from reading this again nearly 4 months later, that sounds like
an easy-to-implement change with a huge benefit, no?

>> But in the long-term it seems more reasonable to recheck if that
>> function should be rewritten/replaced for better supporting "huge
>> disassembly".
>>
>>
>> Another note: the reserved memory use for gp-display-text topped 1.6
>> GB, there may be a way to improve that, too.
>
> It is not normal.
> It looks like gprofng generates always a new DbeLine in
> DbeInstr::mapPCtoLine().

Is this tracked in a separate bug, or possibly already solved?

> -Vladimir

Thank you for working on gprofng,
Simon
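
On the memory remark quoted above (a new DbeLine apparently being generated on every DbeInstr::mapPCtoLine() call): one common way to avoid repeated allocation is to memoize the objects by key, so that a second lookup for the same source line returns the object created the first time. The C++ sketch below only illustrates that general technique. It is not the gprofng implementation or a proposed patch, and `LineRecord`, `LineCache`, `sourceId` and `lineno` are hypothetical names, not gprofng types or fields.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for a per-line object such as DbeLine.
struct LineRecord {
    int sourceId;  // which source file (assumed key component)
    int lineno;    // line number within that file (assumed key component)
};

// Hands out one LineRecord per (sourceId, lineno) pair and reuses it
// on later requests instead of allocating a fresh object each time.
class LineCache {
public:
    const LineRecord* get(int sourceId, int lineno) {
        const std::pair<int, int> key(sourceId, lineno);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // First request for this line: allocate once and remember it.
            it = cache_.emplace(
                     key,
                     std::make_unique<LineRecord>(LineRecord{sourceId, lineno}))
                     .first;
        }
        return it->second.get();
    }

    // Number of distinct lines allocated so far.
    std::size_t size() const { return cache_.size(); }

private:
    std::map<std::pair<int, int>, std::unique_ptr<LineRecord>> cache_;
};
```

With such a cache, memory grows with the number of distinct source lines rather than with the number of disassembled instructions. Whether that matches the actual cause of the 1.6 GB figure is not established in the thread.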