* Re: minor patch to improve gprofng performance (re: Bug 30898)
[not found] <8a8c5ffa-f15d-c7a6-ea64-9afe3d42bdb1@gnu.org>
@ 2023-09-28 2:47 ` Vladimir Mezentsev
2024-01-12 22:31 ` Simon Sobisch
0 siblings, 1 reply; 2+ messages in thread
From: Vladimir Mezentsev @ 2023-09-28 2:47 UTC (permalink / raw)
To: Simon Sobisch; +Cc: binutils
hi Simon,
Thank you for your report.
See comments below.
On 9/26/23 04:25, Simon Sobisch wrote:
> Inspecting bug #30898 [1] showed that there is an issue when using the
> disassembly option with huge (generated) functions.
>
> I gave this a test and found, via
>
> perf record -o perf.data.gpdisplay --call-graph dwarf,38192 --aio -z \
> --sample-cpu --mmap-pages 16M \
> gprofng display text -name short:soname -metrics e.%totalcpu:name \
> -disasm prog_ test.1.er > /dev/null
>
> That the problem is the disassembly handling.
> Checking the generated perf recording shows that the **burning hot**
> place is DbeInstr::mapPCtoLine(SourceFile*), called by
> Module::set_dis_data(Function*, int, int, int, bool, bool, int);
> taking more than 93.3% of all instructions.
This is a little surprise for me.
It's likely that gcc inlines functions and generates Dwarf that gprofng
interprets poorly.
If you run:
gprofng display src -dis prog_
<YOUR_EXECUTION_OR_LIBRARY_WHERE_FUNC_IS_LOCATED>
Do you see the same performance problem ?
If yes, may I get this binary ?
>
> Running that took around 5 minutes. Redirecting the output to a file
> leads to a file with 4,124,497 lines, so: this _really_ is about huge
> disassembly.
I generated the big function (~ 1000000 lines). The disassembly is
10000037 lines. This took 38 sec.
But my test is trivial and gcc generates a trivial Dwarf.
>
> I've tinkered a bit with the burning hot function, the result is a
> minor decrease when using C++2017 invalid code, you find it in the
> attached patch.
>
> Also attached is the recorded output for the hot function,
> interestingly the patched version showed quite clearly that over 60 %
> of the complete run's cpu instructions goes to Hist_data.cc line 1380:
>
> if (p->level == 0)
>
>
> For huge (GnuCOBOL) generated functions the attached patch drops the
> perf stat reported counters by 10%.
> Reported counters (median of 3 runs - code generated with default
> options -O2 -g using g++ (GCC) 11.3) - are as follows:
>
> Original version:
>
> 270,060.53 msec task-clock # 0.999 CPUs utilized
> 1,023,551,049,245 cycles # 3.790 GHz
> 2,160,049,675,779 instructions # 2.11 insn per cycle
>
> adjusted version using the C++2017 removed "register" storage class
> specifier for the pointer (there is possibly a better way), decreasing
> everything:
>
> 260,284.41 msec task-clock # 0.999 CPUs utilized
> 986,393,903,158 cycles # 3.790 GHz
> 1,815,443,360,713 instructions # 1.84 insn per cycle
>
> adjusted version that abides to C++2017, only instructions decreased:
>
> 280,430.13 msec task-clock # 0.999 CPUs utilized
> 1,062,479,713,621 cycles # 3.789 GHz
> 1,815,698,269,043 instructions # 1.71 insn per cycle
>
>
>
> Along to this change a short-term _option_ to drop most of those 60%
> (and, if it drops the amount of entries in there, a good portion of
> walking the pointers) could be to have a copy of func->inlinedSubr
> _once_ that _only_ contains level 0 entries.
>
>
> But in the long-term it seems more reasonable to recheck if that
> function should be rewritten/replaced for better supporting "huge
> disassembly".
>
>
>
> Another note: the reserved memory use for gp-display-text topped 1.6
> GB, there may be a way to improve that, too.
It is not normal.
It looks like gprofng generates always a new DbeLine in
DbeInstr::mapPCtoLine().
-Vladimir
>
>
> [1]: https://sourceware.org/bugzilla/show_bug.cgi?id=30898
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: minor patch to improve gprofng performance (re: Bug 30898)
2023-09-28 2:47 ` minor patch to improve gprofng performance (re: Bug 30898) Vladimir Mezentsev
@ 2024-01-12 22:31 ` Simon Sobisch
0 siblings, 0 replies; 2+ messages in thread
From: Simon Sobisch @ 2024-01-12 22:31 UTC (permalink / raw)
To: Vladimir Mezentsev; +Cc: binutils
Am 28.09.2023 um 04:47 schrieb Vladimir Mezentsev:
> hi Simon,
> Thank you for your report.
Sure, just checking back now.
> See comments below.
>
> On 9/26/23 04:25, Simon Sobisch wrote:
>> Inspecting bug #30898 [1] showed that there is an issue when using the
>> disassembly option with huge (generated) functions.
>>
>> I gave this a test and found, via
>>
>> perf record -o perf.data.gpdisplay --call-graph dwarf,38192 --aio -z \
>> --sample-cpu --mmap-pages 16M \
>> gprofng display text -name short:soname -metrics e.%totalcpu:name \
>> -disasm prog_ test.1.er > /dev/null
>>
>> That the problem is the disassembly handling.
>> Checking the generated perf recording shows that the **burning hot**
>> place is DbeInstr::mapPCtoLine(SourceFile*), called by
>> Module::set_dis_data(Function*, int, int, int, bool, bool, int);
>> taking more than 93.3% of all instructions.
>
> This is a little surprise for me.
> It's likely that gcc inlines functions and generates Dwarf that gprofng
> interprets poorly.
Is there a bug open on this?
If not can you please do so?
>
> If you run:
> gprofng display src -dis prog_
> <YOUR_EXECUTION_OR_LIBRARY_WHERE_FUNC_IS_LOCATED>
> Do you see the same performance problem ?
> If yes, may I get this binary ?
>
>>
>> Running that took around 5 minutes. Redirecting the output to a file
>> leads to a file with 4,124,497 lines, so: this _really_ is about huge
>> disassembly.
>
>
> I generated the big function (~ 1000000 lines). The disassembly is
> 10,000,037 lines. This took 38 sec.
> But my test is trivial and gcc generates a trivial Dwarf.
Sadly I don't have access to that environment any more and also not to
the exact setup, but I could try with a similar one if this is likely to
help. But this would likely to be more reasonable if the two points
below are inspected first.
>> I've tinkered a bit with the burning hot function, the result is a
>> minor decrease when using C++2017 invalid code, you find it in the
>> attached patch.
>>
>> Also attached is the recorded output for the hot function,
>> interestingly the patched version showed quite clearly that over 60 %
>> of the complete run's cpu instructions goes to Hist_data.cc line 1380:
>>
>> if (p->level == 0)
>>
>>
>> For huge (GnuCOBOL) generated functions the attached patch drops the
>> perf stat reported counters by 10%.
>>
>> [...]
Is there anything to be happen with that patch?
>>
>> Along to this change a short-term _option_ to drop most of those 60%
>> (and, if it drops the amount of entries in there, a good portion of
>> walking the pointers) could be to have a copy of func->inlinedSubr
>> _once_ that _only_ contains level 0 entries.
At least from reading that nearly 4 months later again, that sounds like
an easy to implement change with a huge benefit, no?
>> But in the long-term it seems more reasonable to recheck if that
>> function should be rewritten/replaced for better supporting "huge
>> disassembly".
>>
>>
>> Another note: the reserved memory use for gp-display-text topped 1.6
>> GB, there may be a way to improve that, too.
>
> It is not normal.
> It looks like gprofng generates always a new DbeLine in
> DbeInstr::mapPCtoLine().
Is this happen to be tracked with a separate bug, or possibly already
solved?
> -Vladimir
Thank you for working on gprofng,
Simon
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2024-01-12 22:31 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <8a8c5ffa-f15d-c7a6-ea64-9afe3d42bdb1@gnu.org>
2023-09-28 2:47 ` minor patch to improve gprofng performance (re: Bug 30898) Vladimir Mezentsev
2024-01-12 22:31 ` Simon Sobisch
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).