[Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils

public inbox for gcc-bugs@sourceware.org

* [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
@ 2023-07-26 19:37 eggert at cs dot ucla.edu
  2023-07-26 19:38 ` [Bug rtl-optimization/110823] " eggert at cs dot ucla.edu
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-07-26 19:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

            Bug ID: 110823
           Summary: [missed optimization] >50% speedup for x86-64 ASCII
                    processing a la GNU diffutils
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: eggert at cs dot ucla.edu
  Target Milestone: ---

Created attachment 55643
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55643&action=edit
preprocessed source code inspired by GNU diffutils

This is GCC 13.1.1 20230614 (Red Hat 13.1.1-4) on x86-64.

While tuning GNU diffutils I noticed that its loops to process mostly-ASCII
text were not compiled well by GCC on x86-64. For a stripped-down example of
the problem, compile the attached program with:

gcc -O2 -S code-mbcel1.i

The result is in the attached file code-mbcel1.s. Its loop kernel assuming
ASCII text (starting on line 212) looks like this:

  .L33:
        testb   %al, %al
        js      .L30
        movl    $1, %edx
  .L31:
        movl    %eax, %eax
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jne     .L33

As I understand it, the "movl %eax, %eax" is unnecessary, as all code that
reaches .L31 guarantees that %rax's top 32 bits are zero.

Also, the loop body executes "testb %al, %al" twice when once would suffice.
(As a minor point, since the compiler can easily tell that %al is positive when
the loop is entered, it can omit the first testb.)

Suppose we change the above code to the following, as is done in the attached
file code-mbcel1-opt.s:

  .L33:
        movl    $1, %edx
  .L31:
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jg      .L33
        js      .L30

This small change improves performance significantly: for me, the test program
runs 55% faster on a circa-2021 Intel Xeon W-1350, and 74% faster on a
circa-2010 AMD Phenom II x4 910e, using the following commands to benchmark:

gcc -O2 code-mbcel1.i -o code-mbcel1
gcc -O2 code-mbcel1-opt.s -o code-mbcel1-opt
time ./code-mbcel1
time ./code-mbcel1-opt
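
The hand-edited kernel works because one signed test of the byte just loaded
is enough to pick among three outcomes: positive means ASCII (keep looping),
negative means the lead byte of a multibyte sequence (take the slow path at
.L30), and zero means end of text. A minimal C sketch of that three-way
split follows. The names byte_class, classify and ascii_run are made up for
illustration and do not come from the attachment, and the sketch assumes an
8-bit two's-complement signed char, as on x86-64.

```c
#include <stddef.h>

/* Hypothetical classification of one byte of mostly-ASCII text. */
enum byte_class { BYTE_END, BYTE_ASCII, BYTE_MULTIBYTE };

/* One signed comparison against zero decides all three cases,
   which is what "testb; jg; js" does in the edited assembly. */
static enum byte_class classify(signed char c)
{
    if (c > 0)
        return BYTE_ASCII;      /* jg .L33: stay in the fast loop */
    if (c < 0)
        return BYTE_MULTIBYTE;  /* js .L30: multibyte slow path */
    return BYTE_END;            /* fall through: terminating NUL */
}

/* Length of the leading run of nonzero ASCII bytes in s. */
size_t ascii_run(const char *s)
{
    size_t n = 0;
    while (classify((signed char)s[n]) == BYTE_ASCII)
        n++;
    return n;
}
```

For example, ascii_run applied to "abc" counts all three bytes, while a
string whose third byte is 0xC3 stops the run after two.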


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
@ 2023-07-26 19:38 ` eggert at cs dot ucla.edu
  2023-07-26 19:38 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-07-26 19:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

--- Comment #1 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55644
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55644&action=edit
gcc -O2 -S output (from code-mbcel1.i)


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
  2023-07-26 19:38 ` [Bug rtl-optimization/110823] " eggert at cs dot ucla.edu
@ 2023-07-26 19:38 ` pinskia at gcc dot gnu.org
  2023-07-26 19:39 ` eggert at cs dot ucla.edu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-26 19:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Severity|normal                      |enhancement


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
  2023-07-26 19:38 ` [Bug rtl-optimization/110823] " eggert at cs dot ucla.edu
  2023-07-26 19:38 ` pinskia at gcc dot gnu.org
@ 2023-07-26 19:39 ` eggert at cs dot ucla.edu
  2023-07-26 19:54 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-07-26 19:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

--- Comment #2 from Paul Eggert <eggert at cs dot ucla.edu> ---
Created attachment 55645
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55645&action=edit
code-mbcel1.s with the optimization suggested in the bug report


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
                   ` (2 preceding siblings ...)
  2023-07-26 19:39 ` eggert at cs dot ucla.edu
@ 2023-07-26 19:54 ` pinskia at gcc dot gnu.org
  2023-07-30 11:38 ` amonakov at gcc dot gnu.org
  2023-08-25  0:40 ` eggert at cs dot ucla.edu
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-26 19:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The gimple level looks like:
```
  if (_54 >= 0)
    goto <bb 4>; [90.00%]
  else
    goto <bb 5>; [10.00%]

  <bb 4> [local count: 63261141172]:
  _18 = (unsigned int) _54;
  goto <bb 8>; [100.00%]
...
  len_37 = mbrtoc32 (&ch, iter_39, _36, &mbs);
  len.0_38 = (signed long) len_37;
  if (len.0_38 < 0)
    goto <bb 7>; [10.00%]
  else
    goto <bb 6>; [90.00%]

  <bb 6> [local count: 632611429]:
  ch.1_42 = ch; // Note this is a local variable

  <bb 7> [local count: 7029015815]:
  # SR.45_12 = PHI <ch.1_42(6), 0(5)>
  # SR.46_46 = PHI <len_37(6), 1(5)>
  mbs ={v} {CLOBBER(eol)};
  ch ={v} {CLOBBER(eol)};

  <bb 8> [local count: 70290156974]:
  # SR.41_16 = PHI <_18(4), SR.45_12(7)>
  # SR.42_47 = PHI <1(4), SR.46_46(7)>
  _6 = (long long unsigned int) SR.41_16;
```

Maybe we should have a type promotion pass on the gimple level that promotes
_54 to `long unsigned int`.
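
Read back into C, the GIMPLE above corresponds roughly to a loop of the
following shape. This is only a sketch: the attachment is not reproduced in
this thread, so the names `struct cel`, `next_cel` and `sum_chars` are made
up, and details such as how the limit pointer is computed are assumptions.
It shows the ASCII fast path (the `_54 >= 0` test), the `mbrtoc32` fallback
with the substitute values the PHI nodes suggest, and the 32-to-64-bit
widening (`_6`) that appears to surface as "movl %eax, %eax".

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical result type: one decoded character and its byte length. */
struct cel {
    char32_t ch;
    size_t len;
};

/* Decode the character at p; lim points just past the end of the buffer. */
static struct cel next_cel(const char *p, const char *lim)
{
    signed char c = *p;                  /* plays the role of _54 */
    if (c >= 0)                          /* ASCII fast path */
        return (struct cel){ (unsigned int)c, 1 };

    mbstate_t mbs = { 0 };
    char32_t ch;
    size_t len = mbrtoc32(&ch, p, (size_t)(lim - p), &mbs);
    if ((ptrdiff_t)len < 0)              /* error returns such as (size_t)-1 */
        return (struct cel){ 0, 1 };     /* assumed fallback: ch = 0, len = 1 */
    return (struct cel){ ch, len };
}

/* Sum the code points of a NUL-terminated string. */
unsigned long long sum_chars(const char *s)
{
    const char *lim = s + strlen(s);
    unsigned long long sum = 0;
    for (const char *p = s; *p != 0; ) {
        struct cel g = next_cel(p, lim);
        sum += g.ch;                     /* 32-bit value widened to 64 bits */
        p += g.len;
    }
    return sum;
}
```

On pure ASCII input only the fast path runs, so the cost of the loop is
dominated by the byte test, the widening, and the two additions.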


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
                   ` (3 preceding siblings ...)
  2023-07-26 19:54 ` pinskia at gcc dot gnu.org
@ 2023-07-30 11:38 ` amonakov at gcc dot gnu.org
  2023-08-25  0:40 ` eggert at cs dot ucla.edu
  5 siblings, 0 replies; 7+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-07-30 11:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
It's a weakness in the REE pass. AFAICT normally it would handle this, but here
there are two elimination candidates in 'main', the first is eliminated
successfully, and then REE punts on the second because one of its reaching
definitions is the first redundant extension:

      /* If def_insn is already scheduled to be deleted, don't attempt
         to modify it.  */
      if (state->modified[INSN_UID (def_insn)].deleted)
        return false;

While looking into this I noticed that the fix for PR 61094 introduced a
write-only bitfield 'do_not_reextend' (the Changelog wrongly claimed it was
used).
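
The order dependence described above can be illustrated with a toy model.
This is not GCC's REE code: `struct insn`, `struct candidate` and
`try_eliminate` are invented for the illustration, and only the "punt if a
reaching definition is already marked deleted" check is taken from the
snippet quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEFS 4

/* Toy instruction: only tracks whether it is scheduled for deletion. */
struct insn {
    bool deleted;
};

/* Toy elimination candidate: an extension plus its reaching definitions,
   all given as indices into an array of struct insn. */
struct candidate {
    size_t self;
    size_t defs[MAX_DEFS];
    size_t ndefs;
};

/* Mark the candidate's extension as deleted, unless one of its reaching
   definitions has itself already been marked deleted. */
bool try_eliminate(struct insn *insns, const struct candidate *cand)
{
    for (size_t i = 0; i < cand->ndefs; i++)
        if (insns[cand->defs[i]].deleted)
            return false;   /* punt, mirroring the quoted check */
    insns[cand->self].deleted = true;
    return true;
}
```

With two candidates where the second's only reaching definition is the first
extension, the first call succeeds and the second returns false, which is
the situation reported for 'main'.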


* [Bug rtl-optimization/110823] [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils
  2023-07-26 19:37 [Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils eggert at cs dot ucla.edu
                   ` (4 preceding siblings ...)
  2023-07-30 11:38 ` amonakov at gcc dot gnu.org
@ 2023-08-25  0:40 ` eggert at cs dot ucla.edu
  5 siblings, 0 replies; 7+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-08-25  0:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823

--- Comment #5 from Paul Eggert <eggert at cs dot ucla.edu> ---
Also see bug 111143 for a related performance issue, which is perhaps more
important given the current state of bleeding-edge GNU diffutils.

