* [Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing
From: eggert at cs dot ucla.edu @ 2023-08-24 20:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #1 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55789
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55789&action=edit
asm code generated by gcc -O2 -S
^ permalink raw reply [flat|nested] 8+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-08-24 20:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #2 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55790
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55790&action=edit
asm code that's 38% faster on my platform
From: pinskia at gcc dot gnu.org @ 2023-08-24 20:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
  _22 = *iter_57;
  if (_22 >= 0)
    goto <bb 4>; [90.00%]
  else
    goto <bb 5>; [10.00%]

  <bb 4> [local count: 860067200]:
  _76 = (long long unsigned int) _22;
  _15 = sum_31 + _76;
  goto <bb 7>; [100.00%]

  ...

  <bb 7> [local count: 955630226]:
  # prephitmp_42 = PHI <1(4), 1(5), len_29(6)>
  # prephitmp_35 = PHI <_15(4), sum_31(5), _34(6)>
  mbs ={v} {CLOBBER(eol)};
  ch ={v} {CLOBBER(eol)};
  iter_21 = iter_57 + prephitmp_42;
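A hedged C reconstruction of the loop the GIMPLE above appears to come from: read a signed byte, take the likely fast path for ASCII (byte >= 0), and advance the iterator. The names (`sum_ascii`, `iter`, `sum`) are illustrative, not taken from diffutils, and the multibyte slow path is elided:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the hot loop: ASCII bytes (>= 0) take the fast path and
   advance by 1; a non-ASCII byte would take a multibyte slow path
   (elided here) that sets len and updates sum differently.  */
unsigned long long
sum_ascii (const signed char *iter, size_t n)
{
  const signed char *end = iter + n;
  unsigned long long sum = 0;
  while (iter < end)
    {
      signed char ch = *iter;            /* _22 = *iter_57;           */
      size_t len = 1;                    /* prephitmp_42 in the PHI   */
      if (ch >= 0)                       /* likely branch: [90.00%]   */
        sum += (unsigned long long) ch;  /* _15 = sum_31 + _76;       */
      /* else: multibyte slow path would set len and update sum       */
      iter += len;                       /* iter_21 = iter_57 + ...   */
    }
  return sum;
}
```

The per-iteration `iter += len` is the addition whose code generation the bug report is about.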
From: amonakov at gcc dot gnu.org @ 2023-08-25 6:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's Intel
> i5-1335U.
This is a mischaracterization. The modified loop has one uop fewer, because you
are replacing 'mov eax, 1; add rbx, rax' with 'add rbx, 1'.
To evaluate the scheduling aspect alone, keep 'mov eax, 1' while changing
'add rbx, rax' to 'add rbx, 1'.
There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).
From: eggert at cs dot ucla.edu @ 2023-08-26 3:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #5 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #4)
> To evaluate scheduling aspect, keep 'mov eax, 1' while changing 'add rbx,
> rax' to 'add rbx, 1'.
Adding the (unnecessary) 'mov eax, 1' doesn't affect the timing much, which is
what I would expect on a newer processor.
When I reran the benchmark on the same laptop (Intel i5-1335U), I got 3.289 s
for the GCC-generated code, 2.256 s for the "38% faster" code (now it's 46%
faster; I don't know why), and 2.260 s for the faster code with the unnecessary
'mov eax, 1' inserted.
From: amonakov at gcc dot gnu.org @ 2023-08-26 8:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thanks.
i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architectures. Are you binding the
benchmark to some core in particular?
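A minimal sketch of one way to do that binding on Linux, using `sched_setaffinity`; the CPU number is machine-specific (the P-core/E-core layout can be checked with `lscpu -e`), and the function name here is illustrative:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU so a hybrid P-core/E-core
   machine measures one micro-architecture at a time.  */
int
pin_to_cpu (int cpu)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  /* pid 0 = the calling thread; returns 0 on success, -1 on failure */
  return sched_setaffinity (0, sizeof set, &set);
}
```

The same effect can be had from the shell with `taskset -c <cpu> ./benchmark`.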
On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency), this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g.
it works for 64-bit additions but not 32-bit, the addend must be an immediate
less than 1024).
Of course, it is better to have 'add rbx, 1' than 'add rbx, rax' in this loop
on any CPU ('mov eax, 1' competes for ALU ports with other instructions, so
when it is delayed due to contention, the dependent 'add rbx, rax; movsx rax,
[rbx]' pair gets delayed too), but ascribing the difference to compiler
scheduling on a CPU that does out-of-order dynamic scheduling is strange.
From: eggert at cs dot ucla.edu @ 2023-08-26 16:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #7 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #6)
> Are you binding the benchmark to some core in particular?
I did the benchmark on performance cores, which was my original use case. On
efficiency cores, adding the (unnecessary) 'mov eax, 1' doesn't change timing
much (0.9% speedup on one test).
> it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this loop on any CPU
Somewhat counterintuitively, that doesn't seem to be the case for the
efficiency cores on this platform, as the "38% faster" code is 7% slower on
E-cores. However, the use cases I'm concerned about are typically run on
performance cores.