public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/105135] New: [11/12 Regression] Optimization regression for handrolled branchless assignment
From: andre.schackier at gmail dot com @ 2022-04-02 8:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Bug ID: 105135
Summary: [11/12 Regression] Optimization regression for
handrolled branchless assignment
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: andre.schackier at gmail dot com
Target Milestone: ---
Given the following source code [godbolt](https://godbolt.org/z/rrP3bqGW7):
```cpp
char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }
char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }
char to_lower_3(const char c) {
    if (c >= 'A' && c <= 'Z') {
        return c + 32;
    }
    return c;
}
```
compiling with `-O3` produces the following assembly:
```asm
to_lower_1(char):
        lea     eax, [rdi-65]
        cmp     al, 25
        setbe   al
        sal     eax, 5
        add     eax, edi
        ret
to_lower_2(char):
        lea     eax, [rdi-65]
        cmp     al, 25
        setbe   al
        sal     eax, 5
        add     eax, edi
        ret
to_lower_3(char):
        lea     edx, [rdi-65]
        lea     eax, [rdi+32]
        cmp     dl, 26
        cmovnb  eax, edi
        ret
```
Note that gcc-10.3 did produce the same assembly for all 3 functions while
gcc-11 and trunk do not.
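
As a sanity check that the three variants are interchangeable at the source level, here is a minimal sketch that compares them over every `char` value. The helper name `check_variants_agree` is made up for illustration and is not part of the report. If it passes, any difference in generated code is purely an optimization matter, not a semantic one.

```c
#include <assert.h>
#include <limits.h>

/* The three variants from the report, unchanged apart from layout. */
char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }
char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }
char to_lower_3(const char c) {
    if (c >= 'A' && c <= 'Z') {
        return c + 32;
    }
    return c;
}

/* Hypothetical helper (not from the report): asserts that all three
   variants return the same result for every possible char value. */
void check_variants_agree(void) {
    for (int i = CHAR_MIN; i <= CHAR_MAX; i++) {
        const char c = (char)i;
        assert(to_lower_1(c) == to_lower_2(c));
        assert(to_lower_2(c) == to_lower_3(c));
    }
}
```

Calling `check_variants_agree()` once from a test driver is enough; the loop counter is an `int` so that it can step past `CHAR_MAX` and terminate.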
* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment
From: rguenth at gcc dot gnu.org @ 2022-04-04 7:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |11.3
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 produced cmovnb for all functions. I think setbe is going to be cheaper,
since cmov is an odd beast, so I believe this is a progression.
* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: marxin at gcc dot gnu.org @ 2022-04-04 9:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org,
| |marxin at gcc dot gnu.org
Summary|[11/12 Regression] |[11/12 Regression]
|Optimization regression for |Optimization regression for
|handrolled branchless |handrolled branchless
|assignment |assignment since
| |r11-4717-g3e190757fa332d32
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
Last reconfirmed| |2022-04-04
--- Comment #2 from Martin Liška <marxin at gcc dot gnu.org> ---
Started with r11-4717-g3e190757fa332d32.
* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: jakub at gcc dot gnu.org @ 2022-04-04 11:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Trying to microbenchmark this in a tight loop on an i9-7960X shows that in this
case cmov is probably better (but cmov is really a lottery on x86):
$ cat pr105135.c
__attribute__((noipa)) char to_lower_1(const char c) { return c + ((c >= 'A' && c <= 'Z') * 32); }
__attribute__((noipa)) char to_lower_2(const char c) { return c + (((c >= 'A') & (c <= 'Z')) * 32); }
__attribute__((noipa)) char to_lower_3(const char c) { if (c >= 'A' && c <= 'Z') return c + 32; return c; }
$ cat pr105135-2.c
__attribute__((noipa)) char to_lower_1(const char c);
__attribute__((noipa)) char to_lower_2(const char c);
__attribute__((noipa)) char to_lower_3(const char c);
#define N 1000000000
int
main ()
{
  unsigned long long r = 0;
#ifdef Aa
  for (long long i = 0; i < N; i++)
    r += to_lower ((i & 1) ? 'A' : 'a');
#else
  for (long long i = 0; i < N; i++)
    r += to_lower ('A');
#endif
  asm volatile ("" : : "r" (r));
}
$ for i in "./cc1 -quiet" "gcc -S"; do for j in 1 2 3; do for k in "" -DAa; do
eval $i -O3 pr105135.c;
gcc -Dto_lower=to_lower_$j $k -O3 -o pr105135{,.s} pr105135-2.c;
echo $i $j $k; time ./pr105135; done; done; done
./cc1 -quiet 1
real 0m1.230s
user 0m1.228s
sys 0m0.001s
./cc1 -quiet 1 -DAa
real 0m1.706s
user 0m1.703s
sys 0m0.001s
./cc1 -quiet 2
real 0m1.222s
user 0m1.221s
sys 0m0.000s
./cc1 -quiet 2 -DAa
real 0m1.686s
user 0m1.683s
sys 0m0.001s
./cc1 -quiet 3
real 0m1.232s
user 0m1.230s
sys 0m0.000s
./cc1 -quiet 3 -DAa
real 0m1.450s
user 0m1.447s
sys 0m0.001s
gcc -S 1
real 0m1.232s
user 0m1.229s
sys 0m0.001s
gcc -S 1 -DAa
real 0m1.391s
user 0m1.389s
sys 0m0.001s
gcc -S 2
real 0m1.233s
user 0m1.230s
sys 0m0.001s
gcc -S 2 -DAa
real 0m1.398s
user 0m1.397s
sys 0m0.000s
gcc -S 3
real 0m1.232s
user 0m1.229s
sys 0m0.001s
gcc -S 3 -DAa
real 0m1.430s
user 0m1.428s
sys 0m0.000s
where gcc is GCC 10.x and ./cc1 is current trunk.
It seems that for the constant 'A' case it is actually a wash, but with
alternating 'A'/'a' cmov is better.
clang seems to emit very similar code to gcc for the first 2 functions; the
only difference is that the shift left and addition are performed using 8-bit
rather than 32-bit instructions, so:
leal -65(%rdi), %eax
cmpb $26, %al
setb %al
shlb $5, %al
addb %dil, %al
and that seems to perform better.
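
For readers who do not read AT&T syntax fluently, here is a rough C model of what that 8-bit sequence computes, one statement per instruction. The names `to_lower_model` and `check_model_boundaries` are illustrative only; this is a sketch of the arithmetic, not a claim about what any compiler emits for it.

```c
#include <assert.h>
#include <stdint.h>

/* Step-by-step model of the 8-bit sequence (names are hypothetical). */
uint8_t to_lower_model(uint8_t c) {
    uint8_t t = (uint8_t)(c - 65);     /* leal -65(%rdi), %eax (low byte only) */
    uint8_t flag = (uint8_t)(t < 26);  /* cmpb $26, %al ; setb %al  -> 0 or 1  */
    flag = (uint8_t)(flag << 5);       /* shlb $5, %al              -> 0 or 32 */
    return (uint8_t)(c + flag);        /* addb %dil, %al                       */
}

/* Checks the edges of the 'A'..'Z' range and their neighbours. */
void check_model_boundaries(void) {
    assert(to_lower_model('A') == 'a');
    assert(to_lower_model('Z') == 'z');
    assert(to_lower_model('@') == '@');  /* one below 'A' */
    assert(to_lower_model('[') == '[');  /* one above 'Z' */
}
```

The subtraction wraps modulo 256, so any byte below 'A' becomes a large unsigned value and fails the single unsigned comparison; that is how one compare covers both ends of the range.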
I have tried to use
leal -65(%rdi), %ecx
xorl %eax, %eax
cmpb $25, %cl
setbe %al
sall $5, %eax
addl %edi, %eax
to perform manually what our peephole2 tries to do for setXX instructions, but
in this case fails to do because %eax is live in the comparison before it.
That helped a little bit, but not as much as the 8-bit instructions do.
But when I disable the
;; Avoid redundant prefixes by splitting HImode arithmetic to SImode.
;; Do not split instructions with mask registers.
(define_split
[(set (match_operand 0 "general_reg_operand")
(match_operator 3 "promotable_binary_operator"
[(match_operand 1 "general_reg_operand")
(match_operand 2 "aligned_operand")]))
(clobber (reg:CC FLAGS_REG))]
"! TARGET_PARTIAL_REG_STALL && reload_completed
&& ((GET_MODE (operands[0]) == HImode
&& ((optimize_function_for_speed_p (cfun) && !TARGET_FAST_PREFIX)
/* ??? next two lines just !satisfies_constraint_K (...) */
|| !CONST_INT_P (operands[2])
|| satisfies_constraint_K (operands[2])))
|| (GET_MODE (operands[0]) == QImode
- && (TARGET_PROMOTE_QImode || optimize_function_for_size_p (cfun))))"
+ && (0 || optimize_function_for_size_p (cfun))))"
[(parallel [(set (match_dup 0)
(match_op_dup 3 [(match_dup 1) (match_dup 2)]))
(clobber (reg:CC FLAGS_REG))])]
{
operands[0] = gen_lowpart (SImode, operands[0]);
operands[1] = gen_lowpart (SImode, operands[1]);
if (GET_CODE (operands[3]) != ASHIFT)
operands[2] = gen_lowpart (SImode, operands[2]);
operands[3] = shallow_copy_rtx (operands[3]);
PUT_MODE (operands[3], SImode);
})
splitter so that the code is basically the same as clang's, it is still
slower than the clang version, which is just weird.
Anyway, the GIMPLE optimization is IMNSHO sound, it is all about how exactly
the backend handles it.
* [Bug middle-end/105135] [11/12 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: rguenth at gcc dot gnu.org @ 2022-04-21 7:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.3 |11.4
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: rguenth at gcc dot gnu.org @ 2022-07-26 12:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|middle-end |target
Target| |x86_64-*-*
Priority|P3 |P2
* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: amonakov at gcc dot gnu.org @ 2022-07-26 13:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Regarding Clang's code, the key part is not the use of 8-bit operations, but
setbe (2 uops) vs. setb (1 uop):
cmpb $25, %cl
setbe %al
vs
cmpb $26, %al
setb %al
(note comparison against 25 or 26).
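
Both forms test the same condition, which is why either constant can be used. A minimal sketch that checks this exhaustively for 8-bit values follows; the function name `check_compare_forms` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical self-check: for every 8-bit unsigned value, "below or
   equal 25" (setbe after cmp $25) and "below 26" (setb after cmp $26)
   accept exactly the same inputs. */
void check_compare_forms(void) {
    for (unsigned v = 0; v <= UINT8_MAX; v++) {
        uint8_t t = (uint8_t)v;
        assert((t <= 25) == (t < 26));
    }
}
```

Since the two predicates are identical on integers, the choice between them is purely an instruction-selection question.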
---
Regarding cmov being a lottery: unless you mean Pentium 4, not really. It's
just 1 or 2 uops, each latency 1 or 2. uops.info has very nice summaries:
https://uops.info/html-instr/CMOVB_R32_R32.html
https://uops.info/html-instr/CMOVBE_R32_R32.html
* [Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: pinskia at gcc dot gnu.org @ 2023-03-11 5:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
If we change all char to "unsigned char", it is still different, and for
to_lower_3 we get at the tree level:
  if (_1 <= 25)
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]
  <bb 3> [local count: 365072224]:
  _4 = c_3(D) + 32;
  <bb 4> [local count: 1073741824]:
  # _2 = PHI <_4(3), c_3(D)(2)>
There is another bug for the above which asks to transform this to:
  if (_1 <= 25)
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]
  <bb 3> [local count: 365072224]:
  _4 = 32;
  <bb 4> [local count: 1073741824]:
  # _t = PHI <_4(3), 0>
  _2 = c_3(D) + _t;
Which then would get it similar to the other two ...
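
At the source level, the two GIMPLE shapes correspond roughly to the following pair of functions. This is only a sketch: the names `select_sum`, `sum_select` and `check_phi_forms` are hypothetical, and it assumes `_1` in the dump is `c - 65` truncated to unsigned char, which the excerpt does not show.

```c
#include <assert.h>

/* Shape of to_lower_3 as dumped: the PHI selects between c + 32 and c. */
unsigned char select_sum(unsigned char c) {
    unsigned char t = (unsigned char)(c - 'A');  /* assumed definition of _1 */
    return t <= 25 ? (unsigned char)(c + 32) : c;
}

/* Shape after the proposed transformation: the PHI selects 32 or 0,
   and the addition happens once, after the join. */
unsigned char sum_select(unsigned char c) {
    unsigned char t = (unsigned char)(c - 'A');  /* assumed definition of _1 */
    unsigned char add = t <= 25 ? 32 : 0;
    return (unsigned char)(c + add);
}

/* Hypothetical self-check: both shapes agree for every unsigned char. */
void check_phi_forms(void) {
    for (unsigned v = 0; v <= 255; v++)
        assert(select_sum((unsigned char)v) == sum_select((unsigned char)v));
}
```

The second shape is the one that lines up with to_lower_1 and to_lower_2, where the condition only picks the addend.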
* [Bug tree-optimization/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: pinskia at gcc dot gnu.org @ 2023-03-11 5:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|target |tree-optimization
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For aarch64 to_lower_1/to_lower_2 is better:
        and     w0, w0, 255
        sub     w1, w0, #65
        and     w1, w1, 255
        cmp     w1, 25
        cset    w1, ls
        add     w0, w0, w1, lsl 5
        ret
vs
        and     w0, w0, 255
        sub     w2, w0, #65
        add     w1, w0, 32
        and     w2, w2, 255
        and     w1, w1, 255
        cmp     w2, 25
        csel    w0, w0, w1, hi
        ret
* [Bug tree-optimization/105135] [11/12/13/14 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32
From: jakub at gcc dot gnu.org @ 2023-05-29 10:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.4 |11.5
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.