[Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment

* [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment
@ 2022-04-02  8:53 andre.schackier at gmail dot com
  2022-04-04  7:17 ` [Bug middle-end/105135] " rguenth at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: andre.schackier at gmail dot com @ 2022-04-02  8:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

            Bug ID: 105135
           Summary: [11/12 Regression] Optimization regression for
                    handrolled branchless assignment
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andre.schackier at gmail dot com
  Target Milestone: ---

Given the following source code [godbolt](https://godbolt.org/z/rrP3bqGW7):

```cpp
char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }

char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }

char to_lower_3(const char c) {
    if (c >= 'A' && c <= 'Z') {
        return c + 32;
    }
    return c;
}
```

Compiling with `-O3` produces the following assembly:

```asm
to_lower_1(char):
        lea     eax, [rdi-65]
        cmp     al, 25
        setbe   al
        sal     eax, 5
        add     eax, edi
        ret
to_lower_2(char):
        lea     eax, [rdi-65]
        cmp     al, 25
        setbe   al
        sal     eax, 5
        add     eax, edi
        ret
to_lower_3(char):
        lea     edx, [rdi-65]
        lea     eax, [rdi+32]
        cmp     dl, 26
        cmovnb  eax, edi
        ret
```

Note that gcc-10.3 produced the same assembly for all three functions, while
gcc-11 and trunk do not.
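
As a sanity check that the three variants really are interchangeable, so that any difference in the generated code is purely an optimization matter, a minimal sketch along the following lines can compare them over every `char` value. The helper name `check_variants_agree` is hypothetical and not part of the report; the three functions are copied from above.

```c
#include <assert.h>
#include <limits.h>

/* The three variants from the report, copied unchanged. */
char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }

char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }

char to_lower_3(const char c) {
    if (c >= 'A' && c <= 'Z') {
        return c + 32;
    }
    return c;
}

/* Hypothetical helper (not from the report): asserts that all three
   variants return the same value for every possible char input. */
void check_variants_agree(void) {
    for (int i = CHAR_MIN; i <= CHAR_MAX; i++) {
        const char c = (char)i;
        assert(to_lower_1(c) == to_lower_2(c));
        assert(to_lower_2(c) == to_lower_3(c));
    }
}
```

Calling `check_variants_agree()` from a small `main` and building without `-DNDEBUG` should finish silently if the variants agree.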

* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
@ 2022-04-04  7:17 ` rguenth at gcc dot gnu.org
  2022-04-04  9:47 ` [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32 marxin at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-04  7:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.3

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 produced cmovnb for all three functions. I think setbe is going to be
cheaper, since cmov is an odd beast, so I believe this is a progression.

* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
  2022-04-04  7:17 ` [Bug middle-end/105135] " rguenth at gcc dot gnu.org
@ 2022-04-04  9:47 ` marxin at gcc dot gnu.org
  2022-04-04 11:49 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: marxin at gcc dot gnu.org @ 2022-04-04  9:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |marxin at gcc dot gnu.org
            Summary|[11/12 Regression]          |[11/12 Regression]
                   |Optimization regression for |Optimization regression for
                   |handrolled branchless       |handrolled branchless
                   |assignment                  |assignment since
                   |                            |r11-4717-g3e190757fa332d32
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-04-04

--- Comment #2 from Martin Liška <marxin at gcc dot gnu.org> ---
Started with r11-4717-g3e190757fa332d32.

* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
  2022-04-04  7:17 ` [Bug middle-end/105135] " rguenth at gcc dot gnu.org
  2022-04-04  9:47 ` [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32 marxin at gcc dot gnu.org
@ 2022-04-04 11:49 ` jakub at gcc dot gnu.org
  2022-04-21  7:51 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-04-04 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Microbenchmarking this in a tight loop on an i9-7960X suggests that cmov is
probably better in this case (but cmov is really a lottery on x86):
$ cat pr105135.c
__attribute__((noipa)) char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }
__attribute__((noipa)) char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }
__attribute__((noipa)) char to_lower_3(const char c) { if (c >= 'A' && c <= 'Z') return c + 32; return c; }
$ cat pr105135-2.c
__attribute__((noipa)) char to_lower_1(const char c);
__attribute__((noipa)) char to_lower_2(const char c);
__attribute__((noipa)) char to_lower_3(const char c);
#define N 1000000000

int
main ()
{
  unsigned long long r = 0;
#ifdef Aa
  for (long long i = 0; i < N; i++)
    r += to_lower ((i & 1) ? 'A' : 'a');
#else
  for (long long i = 0; i < N; i++)
    r += to_lower ('A');
#endif
  asm volatile ("" : : "r" (r));
}
$ for i in "./cc1 -quiet" "gcc -S"; do for j in 1 2 3; do for k in "" -DAa; do \
    eval $i -O3 pr105135.c; \
    gcc -Dto_lower=to_lower_$j $k -O3 -o pr105135{,.s} pr105135-2.c; \
    echo $i $j $k; time ./pr105135; done; done; done
./cc1 -quiet 1

real    0m1.230s
user    0m1.228s
sys     0m0.001s
./cc1 -quiet 1 -DAa

real    0m1.706s
user    0m1.703s
sys     0m0.001s
./cc1 -quiet 2

real    0m1.222s
user    0m1.221s
sys     0m0.000s
./cc1 -quiet 2 -DAa

real    0m1.686s
user    0m1.683s
sys     0m0.001s
./cc1 -quiet 3

real    0m1.232s
user    0m1.230s
sys     0m0.000s
./cc1 -quiet 3 -DAa

real    0m1.450s
user    0m1.447s
sys     0m0.001s
gcc -S 1

real    0m1.232s
user    0m1.229s
sys     0m0.001s
gcc -S 1 -DAa

real    0m1.391s
user    0m1.389s
sys     0m0.001s
gcc -S 2

real    0m1.233s
user    0m1.230s
sys     0m0.001s
gcc -S 2 -DAa

real    0m1.398s
user    0m1.397s
sys     0m0.000s
gcc -S 3

real    0m1.232s
user    0m1.229s
sys     0m0.001s
gcc -S 3 -DAa

real    0m1.430s
user    0m1.428s
sys     0m0.000s
Here gcc is GCC 10.x and ./cc1 is current trunk.
For the constant 'A' case it seems to be a wash, but with alternating 'A'/'a'
input cmov is better.
For the first two functions clang seems to emit code very similar to GCC's;
the only difference is that the shift left and the addition are performed
using 8-bit rather than 32-bit instructions:
        leal    -65(%rdi), %eax
        cmpb    $26, %al
        setb    %al
        shlb    $5, %al
        addb    %dil, %al
and that seems to perform better.
I have tried to use
        leal    -65(%rdi), %ecx
        xorl    %eax, %eax
        cmpb    $25, %cl
        setbe   %al
        sall    $5, %eax
        addl    %edi, %eax
to do manually what our peephole2 tries to do for setXX instructions, but fails
to do in this case because %eax is live in the comparison before it. That
helped a little, but not as much as the 8-bit instructions do.
But when I disable the
 ;; Avoid redundant prefixes by splitting HImode arithmetic to SImode.
 ;; Do not split instructions with mask registers.
 (define_split
   [(set (match_operand 0 "general_reg_operand")
        (match_operator 3 "promotable_binary_operator"
           [(match_operand 1 "general_reg_operand")
            (match_operand 2 "aligned_operand")]))
    (clobber (reg:CC FLAGS_REG))]
   "! TARGET_PARTIAL_REG_STALL && reload_completed
    && ((GET_MODE (operands[0]) == HImode
        && ((optimize_function_for_speed_p (cfun) && !TARGET_FAST_PREFIX)
             /* ??? next two lines just !satisfies_constraint_K (...) */
            || !CONST_INT_P (operands[2])
            || satisfies_constraint_K (operands[2])))
        || (GET_MODE (operands[0]) == QImode
-          && (TARGET_PROMOTE_QImode || optimize_function_for_size_p (cfun))))"
+          && (0 || optimize_function_for_size_p (cfun))))"
   [(parallel [(set (match_dup 0)
                   (match_op_dup 3 [(match_dup 1) (match_dup 2)]))
              (clobber (reg:CC FLAGS_REG))])]
 {
   operands[0] = gen_lowpart (SImode, operands[0]);
   operands[1] = gen_lowpart (SImode, operands[1]);
   if (GET_CODE (operands[3]) != ASHIFT)
     operands[2] = gen_lowpart (SImode, operands[2]);
   operands[3] = shallow_copy_rtx (operands[3]);
   PUT_MODE (operands[3], SImode);
 })
splitter, so that the code is basically the same as clang's, it is still
slower than the clang version, which is just weird.
Anyway, the GIMPLE optimization is IMNSHO sound; it is all about how exactly
the backend handles it.

* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (2 preceding siblings ...)
  2022-04-04 11:49 ` jakub at gcc dot gnu.org
@ 2022-04-21  7:51 ` rguenth at gcc dot gnu.org
  2022-07-26 12:37 ` [Bug target/105135] [11/12/13 " rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-21  7:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.3                        |11.4

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.3 is being released, retargeting bugs to GCC 11.4.

* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (3 preceding siblings ...)
  2022-04-21  7:51 ` rguenth at gcc dot gnu.org
@ 2022-07-26 12:37 ` rguenth at gcc dot gnu.org
  2022-07-26 13:36 ` amonakov at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-26 12:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
             Target|                            |x86_64-*-*
           Priority|P3                          |P2

* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (4 preceding siblings ...)
  2022-07-26 12:37 ` [Bug target/105135] [11/12/13 " rguenth at gcc dot gnu.org
@ 2022-07-26 13:36 ` amonakov at gcc dot gnu.org
  2023-03-11  5:29 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-07-26 13:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Regarding Clang's code, the key part is not the use of 8-bit operations, but
setbe (2 uops) vs. setb (1 uop):

        cmpb    $25, %cl
        setbe   %al

vs

        cmpb    $26, %al
        setb    %al

(note comparison against 25 or 26).
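
Both sequences compute the same range test; only the comparison constant and the condition differ. A minimal C sketch of that equivalence, assuming an ASCII execution character set ('A' == 65), is given here. The function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Range test in the form the setbe sequence computes:
   (c - 65) <= 25, evaluated as an unsigned 8-bit comparison. */
int is_upper_le(unsigned char c) { return (unsigned char)(c - 'A') <= 25; }

/* The same range test in the form the setb sequence computes:
   (c - 65) < 26, evaluated as an unsigned 8-bit comparison. */
int is_upper_lt(unsigned char c) { return (unsigned char)(c - 'A') < 26; }

/* Hypothetical helper: both forms must match the plain two-sided test
   for every 8-bit input. */
void check_range_forms_agree(void) {
    for (int i = 0; i <= 255; i++) {
        unsigned char c = (unsigned char)i;
        int expected = (c >= 'A' && c <= 'Z');
        assert(is_upper_le(c) == expected);
        assert(is_upper_lt(c) == expected);
    }
}
```

Since the two forms are interchangeable for every input, choosing `< 26` over `<= 25` is purely an instruction-selection decision.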

---

Regarding cmov being a lottery: unless you mean Pentium 4, not really. It's
just 1 or 2 uops, each with latency 1 or 2. uops.info has very nice summaries:

https://uops.info/html-instr/CMOVB_R32_R32.html
https://uops.info/html-instr/CMOVBE_R32_R32.html

* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (5 preceding siblings ...)
  2022-07-26 13:36 ` amonakov at gcc dot gnu.org
@ 2023-03-11  5:29 ` pinskia at gcc dot gnu.org
  2023-03-11  5:32 ` [Bug tree-optimization/105135] " pinskia at gcc dot gnu.org
  2023-05-29 10:06 ` [Bug tree-optimization/105135] [11/12/13/14 " jakub at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-11  5:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
If we change every char to "unsigned char", the code is still different, and
for to_lower_3 we get at the tree level:

  if (_1 <= 25)
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]

  <bb 3> [local count: 365072224]:
  _4 = c_3(D) + 32;

  <bb 4> [local count: 1073741824]:
  # _2 = PHI <_4(3), c_3(D)(2)>

There is another bug for the above, which asks for this to be transformed into:
  if (_1 <= 25)
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]

  <bb 3> [local count: 365072224]:
  _4 = 32;

  <bb 4> [local count: 1073741824]:
  # _t = PHI <_4(3), 0(2)>
  _2 = c_3(D) + _t;

which would then make it similar to the other two ...
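
At the source level, the two GIMPLE shapes above correspond roughly to the following C sketch, assuming the "unsigned char" variant of to_lower_3. The function names and the check helper are hypothetical; this only illustrates the shape of the transformation and is not GCC's implementation of it.

```c
#include <assert.h>

/* Current shape: the PHI selects between c + 32 and c. */
unsigned char lower_select_result(unsigned char c) {
    unsigned char t1 = (unsigned char)(c - 65);
    unsigned char r = c;
    if (t1 <= 25)
        r = (unsigned char)(c + 32);
    return r;
}

/* Requested shape: the PHI selects between 32 and 0, and the addition
   is performed once after the join. */
unsigned char lower_select_addend(unsigned char c) {
    unsigned char t1 = (unsigned char)(c - 65);
    unsigned char t = 0;
    if (t1 <= 25)
        t = 32;
    return (unsigned char)(c + t);
}

/* Hypothetical helper: the two shapes agree on every 8-bit input. */
void check_phi_forms_agree(void) {
    for (int i = 0; i <= 255; i++) {
        unsigned char c = (unsigned char)i;
        assert(lower_select_result(c) == lower_select_addend(c));
    }
}
```

In the second form the conditional value is just 0 or 32, which is the same structure as the `(cond) * 32` addend in to_lower_1 and to_lower_2.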

* [Bug tree-optimization/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (6 preceding siblings ...)
  2023-03-11  5:29 ` pinskia at gcc dot gnu.org
@ 2023-03-11  5:32 ` pinskia at gcc dot gnu.org
  2023-05-29 10:06 ` [Bug tree-optimization/105135] [11/12/13/14 " jakub at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-11  5:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On aarch64, the code for to_lower_1/to_lower_2 is better:

        and     w0, w0, 255
        sub     w1, w0, #65
        and     w1, w1, 255
        cmp     w1, 25
        cset    w1, ls
        add     w0, w0, w1, lsl 5
        ret

vs
        and     w0, w0, 255
        sub     w2, w0, #65
        add     w1, w0, 32
        and     w2, w2, 255
        and     w1, w1, 255
        cmp     w2, 25
        csel    w0, w0, w1, hi
        ret

* [Bug tree-optimization/105135] [11/12/13/14 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
  2022-04-02  8:53 [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment andre.schackier at gmail dot com
                   ` (7 preceding siblings ...)
  2023-03-11  5:32 ` [Bug tree-optimization/105135] " pinskia at gcc dot gnu.org
@ 2023-05-29 10:06 ` jakub at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-29 10:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.4                        |11.5

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
