public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
@ 2023-07-26 19:37 eggert at cs dot ucla.edu
  2023-07-26 19:38 ` [Bug rtl-optimization/110823] " eggert at cs dot ucla.edu
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-07-26 19:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

            Bug ID: 110823
           Summary: [missed optimization] >50% speedup for x86-64 ASCII
                    processing a la GNU diffutils
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: eggert at cs dot ucla.edu
  Target Milestone: ---

Created attachment 55643
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55643&action=edit
proprocessed source code inspired by GNU diffutils

This is GCC 13.1.1 20230614 (Red Hat 13.1.1-4) on x86-64.

While tuning GNU diffutils I noticed that its loops to process mostly-ASCII
text were not compiled well by GCC on x86-64. For a stripped-down example of
the problem, compile the attached program with:

gcc -O2 -S code-mbcel1.i

The result is in the attached file code-mbcel1.s. Its loop kernel assuming
ASCII text (starting on line 212) looks like this:

  .L33:
        testb   %al, %al
        js      .L30
        movl    $1, %edx
  .L31:
        movl    %eax, %eax
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jne     .L33

As I understand it the "movl %eax, %eax" is unnecessary, as all code that
reaches .L31 guarantees that %rax's top 32 bits are zero.

Also, the loop body executes "testb %al, %al" twice when once would suffice.
(As a minor point, since the compiler can easily tell that %al is positive when
the loop is entered, it can omit the first testb.)

Suppose we change the above code to the following, as is done in the attached
file code-mbcel1-opt.s:

  .L33:
        movl    $1, %edx
  .L31:
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jg      .L33
        js      .L30

This small change improves performance significantly: for me, the test program
runs 55% faster on a circa-2021 Intel Xeon W-1350, and 74% faster on a
circa-2010 AMD Phenom II x4 910e, using the following commands to benchmark:

gcc -O2 code-mbcel1.i -o code-mbcel1
gcc -O2 code-mbcel1-opt.s -o code-mbcel1-opt
time ./code-mbcel1
time ./code-mbcel1-opt

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-08-25  0:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
2023-07-26 19:38 ` [Bug rtl-optimization/110823] " eggert at cs dot ucla.edu
2023-07-26 19:38 ` pinskia at gcc dot gnu.org
2023-07-26 19:39 ` eggert at cs dot ucla.edu
2023-07-26 19:54 ` pinskia at gcc dot gnu.org
2023-07-30 11:38 ` amonakov at gcc dot gnu.org
2023-08-25  0:40 ` eggert at cs dot ucla.edu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).