* [Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing
From: eggert at cs dot ucla.edu @ 2023-08-24 20:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #1 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55789
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55789&action=edit
asm code generated by gcc -O2 -S
^ permalink raw reply [flat|nested] 8+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-08-24 20:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #2 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55790
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55790&action=edit
asm code that's 38% faster on my platform
From: pinskia at gcc dot gnu.org @ 2023-08-24 20:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
  _22 = *iter_57;
  if (_22 >= 0)
    goto <bb 4>; [90.00%]
  else
    goto <bb 5>; [10.00%]

  <bb 4> [local count: 860067200]:
  _76 = (long long unsigned int) _22;
  _15 = sum_31 + _76;
  goto <bb 7>; [100.00%]

  ...

  <bb 7> [local count: 955630226]:
  # prephitmp_42 = PHI <1(4), 1(5), len_29(6)>
  # prephitmp_35 = PHI <_15(4), sum_31(5), _34(6)>
  mbs ={v} {CLOBBER(eol)};
  ch ={v} {CLOBBER(eol)};
  iter_21 = iter_57 + prephitmp_42;
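A hedged C reconstruction of the loop the GIMPLE above appears to come from: read a signed byte, take the likely fast path for ASCII (byte >= 0), and advance the iterator. The names (`sum_ascii`, `iter`, `sum`) are illustrative, not taken from diffutils, and the multibyte slow path is elided:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the hot loop: ASCII bytes (>= 0) take the fast path and
   advance by 1; a non-ASCII byte would take a multibyte slow path
   (elided here) that sets len and updates sum differently.  */
unsigned long long
sum_ascii (const signed char *iter, size_t n)
{
  const signed char *end = iter + n;
  unsigned long long sum = 0;
  while (iter < end)
    {
      signed char ch = *iter;            /* _22 = *iter_57;           */
      size_t len = 1;                    /* prephitmp_42 in the PHI   */
      if (ch >= 0)                       /* likely branch: [90.00%]   */
        sum += (unsigned long long) ch;  /* _15 = sum_31 + _76;       */
      /* else: multibyte slow path would set len and update sum       */
      iter += len;                       /* iter_21 = iter_57 + ...   */
    }
  return sum;
}
```

The per-iteration `iter += len` is the addition whose code generation the bug report is about.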
From: amonakov at gcc dot gnu.org @ 2023-08-25 6:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's Intel
> i5-1335U.
This is a mischaracterization. The modified loop has one uop fewer, because you
are replacing 'mov eax, 1; add rbx, rax' with 'add rbx, 1'.
To evaluate the scheduling aspect alone, keep 'mov eax, 1' while changing
'add rbx, rax' to 'add rbx, 1'.
There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).
From: eggert at cs dot ucla.edu @ 2023-08-26 3:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #5 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #4)
> To evaluate scheduling aspect, keep 'mov eax, 1' while changing 'add rbx,
> rax' to 'add rbx, 1'.
Adding the (unnecessary) 'mov eax, 1' doesn't affect the timing much, which is
what I would expect on a newer processor.
When I reran the benchmark on the same laptop (Intel i5-1335U), I got 3.289 s
for the GCC-generated code, 2.256 s for the "38% faster" code (now it's 46%
faster; I don't know why), and 2.260 s for the faster code with the unnecessary
'mov eax, 1' inserted.
From: amonakov at gcc dot gnu.org @ 2023-08-26 8:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thanks.
i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architectures. Are you binding the
benchmark to some core in particular?
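A minimal sketch of one way to do that binding on Linux, using `sched_setaffinity`; the CPU number is machine-specific (the P-core/E-core layout can be checked with `lscpu -e`), and the function name here is illustrative:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU so a hybrid P-core/E-core
   machine measures one micro-architecture at a time.  */
int
pin_to_cpu (int cpu)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  /* pid 0 = the calling thread; returns 0 on success, -1 on failure */
  return sched_setaffinity (0, sizeof set, &set);
}
```

The same effect can be had from the shell with `taskset -c <cpu> ./benchmark`.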
On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency), this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g.
it works for 64-bit additions but not 32-bit, the addend must be an immediate
less than 1024).
Of course, it is better to have 'add rbx, 1' than 'add rbx, rax' in this loop
on any CPU ('mov eax, 1' competes for ALU ports with other instructions, so
when it is delayed due to contention, the dependent 'add rbx, rax; movsx rax,
[rbx]' pair gets delayed too), but ascribing the difference to compiler
scheduling on a CPU that does out-of-order dynamic scheduling is strange.
From: eggert at cs dot ucla.edu @ 2023-08-26 16:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #7 from Paul Eggert <eggert at cs dot ucla.edu> ---
(In reply to Alexander Monakov from comment #6)
> Are you binding the benchmark to some core in particular?
I did the benchmark on performance cores, which was my original use case. On
efficiency cores, adding the (unnecessary) 'mov eax, 1' doesn't change timing
much (0.9% speedup on one test).
> it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this loop on any CPU
Somewhat counterintuitively, that doesn't seem to be the case for the
efficiency cores on this platform, as the "38% faster" code is 7% slower on
E-cores. However, the use cases I'm concerned about are typically run on
performance cores.