public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
@ 2021-01-27 14:28 marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
` (49 more replies)
0 siblings, 50 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 14:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Bug ID: 98856
Summary: [11 Regression] botan AES-128/XTS is slower by ~17%
since
r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: marxin at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Since the revision the following is slower:
$ make clean && ./configure.py --cxxflags="-Ofast -march=znver2 -fno-checking" && make -j16 && ./botan speed AES-128/XTS
as seen here:
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=226.721.1&plot.1=14.721.1&
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
@ 2021-01-27 14:29 ` marxin at gcc dot gnu.org
2021-01-27 14:44 ` rguenth at gcc dot gnu.org
` (48 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 14:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Known to fail| |11.0
Last reconfirmed| |2021-01-27
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Known to work| |10.2.0
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
@ 2021-01-27 14:44 ` rguenth at gcc dot gnu.org
2021-01-28 7:47 ` rguenth at gcc dot gnu.org
` (47 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-27 14:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
Target Milestone|--- |11.0
Keywords| |missed-optimization
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
2021-01-27 14:44 ` rguenth at gcc dot gnu.org
@ 2021-01-28 7:47 ` rguenth at gcc dot gnu.org
2021-01-28 8:44 ` marxin at gcc dot gnu.org
` (46 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 7:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The cxx bench Botan doesn't know --cxxflags; what Botan version are you looking at?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (2 preceding siblings ...)
2021-01-28 7:47 ` rguenth at gcc dot gnu.org
@ 2021-01-28 8:44 ` marxin at gcc dot gnu.org
2021-01-28 9:40 ` rguenth at gcc dot gnu.org
` (45 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-28 8:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> The cxx bench Botan doesn't know --cxxflags; what Botan version are you
> looking at?
I used this fixed version:
https://gitlab.suse.de/marxin/cpp-benchmarks/-/tree/master/botan
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (3 preceding siblings ...)
2021-01-28 8:44 ` marxin at gcc dot gnu.org
@ 2021-01-28 9:40 ` rguenth at gcc dot gnu.org
2021-01-28 11:03 ` rguenth at gcc dot gnu.org
` (44 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 9:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Slow:
Samples: 4K of event 'cycles:u', Event count (approx.): 4565667242
Overhead Samples Command Shared Object Symbol
30.88% 1252 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan
30.24% 1235 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan
26.04% 1055 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
Fast:
Samples: 4K of event 'cycles:u', Event count (approx.): 4427277434
Overhead Samples Command Shared Object Symbol
33.59% 1372 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Bo
33.16% 1356 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Bo
18.71% 765 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
It is also fast on trunk when not vectorizing, so the revision does what it was intended to do (more vectorization). I'll look into what we do to poly_double_n_le.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (4 preceding siblings ...)
2021-01-28 9:40 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:03 ` rguenth at gcc dot gnu.org
2021-01-28 11:19 ` rguenth at gcc dot gnu.org
` (43 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like STLF (store-to-load forwarding) issues. There's an ls_stlf counter; with SLP vectorization disabled I see
34.39% 1417 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
32.27% 1333 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
7.31% 306 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
while with SLP vectorization enabled there's
Samples: 4K of event 'ls_stlf:u', Event count (approx.): 723886942
Overhead Samples Command Shared Object Symbol
32.41% 1320 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
27.23% 1114 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
27.06% 1107 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
The register docs suggest that the unnamed cpu/event=0x24,umask=0x2/u event is supposed to count forwarding failures due to incomplete/misaligned data.
Unvectorized:
Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 1024347253
Overhead Samples Command Shared Object Symbol
33.56% 1382 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
30.32% 1246 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
23.18% 953 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
Vectorized:
Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 489384781
Overhead Samples Command Shared Object Symbol
30.17% 1229 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
29.40% 1203 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
28.09% 1147 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
but the masking doesn't work as expected since I get hits for either bit on
 4.05 | vmovdqa %xmm4,0x10(%rsp)
      | const uint64_t carry = POLY * (W[LIMBS-1] >> 63);
12.24 | mov 0x18(%rsp),%rdx
      | W[0] = (W[0] << 1) ^ carry;
24.00 | vmovdqa 0x10(%rsp),%xmm5
which should only happen for bit 2 (data not ready). Of course this code-gen is weird since 0x10(%rsp) is available in %xmm4.
Well, changing the above doesn't make a difference. I guess the event hit is just quite delayed - that makes perf quite useless here.
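For reference, the stalling pattern those counters point at boils down to a wide vector store reloaded in narrower pieces; a minimal illustrative sketch (not the Botan source, names made up):

#include <emmintrin.h>
#include <stdint.h>

uint64_t stlf_pattern (__m128i v)
{
  uint64_t w[2];
  /* 16-byte vector store to the stack ...  */
  _mm_storeu_si128 ((__m128i *) w, v);
  /* ... immediately followed by 8-byte scalar reloads of its halves,
     the store-to-load-forwarding situation the events above measure.  */
  return (w[0] >> 63) + (w[1] >> 63);
}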
As a general optimization remark we fail to scalarize 'W' in poly_double_le
for the larger sizes, but the relevant differences likely appear for the
cases we expand the memcpy inline on GIMPLE, specifically
<bb 10> [local count: 1431655747]:
_60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
_61 = BIT_FIELD_REF <_60, 64, 64>;
_62 = _61 >> 63;
carry_63 = _62 * 135;
_308 = _61 << 1;
_228 = (long unsigned int) _60;
_310 = _228 >> 63;
_311 = _308 ^ _310;
_71 = _228 << 1;
_72 = carry_63 ^ _71;
MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;
this is turned into
<bb 10> [local count: 1431655747]:
_60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
_114 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_60);
vect__71.335_298 = _114 << 1;
_61 = BIT_FIELD_REF <_60, 64, 64>;
_62 = _61 >> 63;
carry_63 = _62 * 135;
_228 = (long unsigned int) _60;
_310 = _228 >> 63;
_147 = {carry_63, _310};
vect__72.336_173 = _147 ^ vect__71.335_298;
MEM <vector(2) long unsigned int> [(char * {ref-all})out_5(D)] = vect__72.336_173;
after the patch. The SLP analysis reports:
build/include/botan/mem_ops.h:148:15: note: Basic block will be vectorized using SLP
build/include/botan/mem_ops.h:148:15: note: Vectorizing SLP tree:
build/include/botan/mem_ops.h:148:15: note: node 0x275d8e8 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note: stmt 0 MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note: stmt 1 MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;
build/include/botan/mem_ops.h:148:15: note: children 0x275d960
build/include/botan/mem_ops.h:148:15: note: node 0x275d960 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note: stmt 0 _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _311 = _308 ^ _310;
build/include/botan/mem_ops.h:148:15: note: children 0x275d9d8 0x275da50
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275d9d8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { carry_63, _310 }
build/include/botan/mem_ops.h:148:15: note: node 0x275da50 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note: stmt 0 _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _308 = _61 << 1;
build/include/botan/mem_ops.h:148:15: note: children 0x275dac8 0x275dbb8
build/include/botan/mem_ops.h:148:15: note: node 0x275dac8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op: VEC_PERM_EXPR
build/include/botan/mem_ops.h:148:15: note: stmt 0 _228 = BIT_FIELD_REF <_60, 64, 0>;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _61 = BIT_FIELD_REF <_60, 64, 64>;
build/include/botan/mem_ops.h:148:15: note: lane permutation { 0[0] 0[1] }
build/include/botan/mem_ops.h:148:15: note: children 0x275db40
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275db40 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { }
build/include/botan/mem_ops.h:148:15: note: node (constant) 0x275dbb8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { 1, 1 }
with costs
build/include/botan/mem_ops.h:148:15: note: Cost model analysis:
Vector inside of basic block cost: 24
Vector prologue cost: 8
Vector epilogue cost: 8
Scalar cost of basic block: 52
the vectorization isn't too bad I think, it turns into
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 16(%rsp), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
instead of
.L56:
.cfi_restore_state
movq 8(%rsi), %rdx
movq (%rsi), %rdi
movq %rdx, %rcx
leaq (%rdi,%rdi), %rsi
addq %rdx, %rdx
shrq $63, %rdi
shrq $63, %rcx
xorq %rdi, %rdx
imulq $135, %rcx, %rcx
movq %rdx, 8(%rax)
xorq %rsi, %rcx
movq %rcx, (%rax)
jmp .L53
but we see the 128-bit move split when using GPRs, possibly avoiding the STLF issue. I don't understand why we spill to extract the high part though.
I'll try to create a small testcase for the above kernel.
With the vectorization disabled for just this kernel I get
AES-128/XTS 280780 key schedule/sec; 0.00 ms/op 12122 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 852.401 MiB/sec 4.14 cycles/byte (426.20 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 854.461 MiB/sec 4.13 cycles/byte (426.20 MiB in 498.80 ms)
compared to
AES-128/XTS 286409 key schedule/sec; 0.00 ms/op 11761 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.736 MiB/sec 4.62 cycles/byte (382.87 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 766.612 MiB/sec 4.61 cycles/byte (382.87 MiB in 499.43 ms)
so that seems to be it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (5 preceding siblings ...)
2021-01-28 11:03 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:19 ` rguenth at gcc dot gnu.org
2021-01-28 11:57 ` rguenth at gcc dot gnu.org
` (42 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
The following testcase reproduces the assembly:
typedef __UINT64_TYPE__ uint64_t;

/* GF(2^128) doubling as used for the XTS tweak: shift the 128-bit
   little-endian value left by one bit and fold the carry back in
   with the reduction constant 135 (0x87).  */
void poly_double_le2 (unsigned char *out, const unsigned char *in)
{
  uint64_t W[2];
  __builtin_memcpy (&W, in, 16);
  uint64_t carry = (W[1] >> 63) * 135;
  W[1] = (W[1] << 1) ^ (W[0] >> 63);
  W[0] = (W[0] << 1) ^ carry;
  __builtin_memcpy (out, &W[0], 8);
  __builtin_memcpy (out + 8, &W[1], 8);
}
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (6 preceding siblings ...)
2021-01-28 11:19 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:57 ` rguenth at gcc dot gnu.org
2021-02-05 10:18 ` rguenth at gcc dot gnu.org
` (41 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, and the spill is likely because we expand as
(insn 7 6 0 (set (reg:TI 84 [ _9 ])
(mem:TI (reg/v/f:DI 93 [ in ]) [0 MEM <__int128 unsigned> [(char * {ref-all})in_8(D)]+0 S16 A8])) -1
(nil))
(insn 8 7 9 (parallel [
(set (reg:DI 95)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":7:26 -1
(nil))
^^^ (subreg:DI (reg:TI 84 [ _9 ]) 8)
...
(insn 12 11 13 (set (reg:V2DI 98 [ vect__5.3 ])
(ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
(const_int 1 [0x1]))) "t.c":9:16 -1
(nil))
^^^ (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
LRA then does
Choosing alt 4 in insn 7: (0) v (1) vm {*movti_internal}
Creating newreg=103 from oldreg=84, assigning class ALL_SSE_REGS to r103
7: r103:TI=[r101:DI]
REG_DEAD r101:DI
Inserting insn reload after:
20: r84:TI=r103:TI
Choosing alt 0 in insn 8: (0) =rm (1) 0 (2) cJ {*lshrdi3_1}
Creating newreg=104 from oldreg=95, assigning class GENERAL_REGS to r104
Inserting insn reload before:
21: r104:DI=r84:TI#8
but somehow this means reload 20 is used for reload 21 instead of avoiding reload 20 and doing a movhlps / movq combo? (I guess there's no high-part xmm-to-GPR extract.)
As said the assembly is a bit weird:
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax
ok, well ...
vmovdqa -24(%rsp), %xmm3
???
shrq $63, %rax
imulq $135, %rax, %rax
vmovq %rax, %xmm0
movq -24(%rsp), %rax
??? movq %xmm2/3, %rax
vpsllq $1, %xmm3, %xmm1
shrq $63, %rax
vpinsrq $1, %rax, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
Note that even with -march=core-avx2 (and thus inter-unit moves not pessimized) we get
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax
vmovdqa -24(%rsp), %xmm3
shrq $63, %rax
vpsllq $1, %xmm3, %xmm1
imulq $135, %rax, %rax
vmovq %rax, %xmm0
movq -24(%rsp), %rax
shrq $63, %rax
vpinsrq $1, %rax, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
with
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 8(%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
we arrive at
AES-128/XTS 672043 key schedule/sec; 0.00 ms/op 4978.00 cycles/op (2 ops in 0.00 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 843.310 MiB/sec 4.18 cycles/byte (421.66 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 847.215 MiB/sec 4.16 cycles/byte (421.66 MiB in 497.70 ms)
A variant using movhlps isn't any faster than spilling, unfortunately :/
I guess re-materializing from a load is too much to ask of LRA.
On the vectorizer side the costing is 52 scalar vs. 40 vector (as usual the vectorized store alone leads to a big boost).
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (7 preceding siblings ...)
2021-01-28 11:57 ` rguenth at gcc dot gnu.org
@ 2021-02-05 10:18 ` rguenth at gcc dot gnu.org
2021-02-05 11:52 ` jakub at gcc dot gnu.org
` (40 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-05 10:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Exploring more options I noticed there's no arithmetic vector V2DI right shift, so vectorizing
uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
W[1] = (W[1] << 1) ^ ((uint64_t)(((int64_t)W[0]) >> 63) & (uint64_t)1);
W[0] = (W[0] << 1) ^ carry;
didn't work out. But V2DI >> CST with CST > 31 can be implemented with VPSRAD, then shuffling the shifted high parts into the low positions and sign-extending with PMOVSXDQ.
Maybe there's something more clever for the special case of >> 63 even.
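A minimal intrinsics sketch of that lane dance for one fixed CST (48 here; pmovsxdq needs SSE4.1, and the function name is made up):

#include <smmintrin.h>

/* Arithmetic V2DI >> 48 without a native 64-bit arithmetic shift:
   psrad shifts the high 32-bit halves, the shuffle moves them into
   the two low lanes, and pmovsxdq sign-extends back to 64 bits.  */
__m128i sar_v2di_48 (__m128i v)
{
  __m128i hi = _mm_srai_epi32 (v, 48 - 32);                    /* vpsrad $16 */
  __m128i lo = _mm_shuffle_epi32 (hi, _MM_SHUFFLE (3, 2, 3, 1));
  return _mm_cvtepi32_epi64 (lo);                              /* vpmovsxdq */
}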
As said, I'm just trying to see whether "optimal" vectorization of the kernel would solve the issue. But I guess pipelines are wide enough that the original scalar code effectively executes "vectorized".
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (8 preceding siblings ...)
2021-02-05 10:18 ` rguenth at gcc dot gnu.org
@ 2021-02-05 11:52 ` jakub at gcc dot gnu.org
2021-02-05 12:52 ` rguenth at gcc dot gnu.org
` (39 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 11:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For arithmetic >> (element_precision - 1) one can just use {,v}pxor + {,v}pcmpgtq: instead of return vec >> 63; do return vec < 0; (in a C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }.
For other arithmetic shifts by a scalar constant, perhaps one can replace return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0) << (64 - 17));
It will actually work even for non-constant scalar shift amounts because {,v}psllq treats shift counts > 63 as 0.
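In GNU C vector extensions the two tricks look roughly like this (a sketch; the typedef and function names are made up):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* >> 63: each lane becomes 0 or -1, i.e. the signed comparison
   vec < 0, which maps to pxor + pcmpgtq.  */
v2di sar63 (v2di v) { return v < 0; }

/* >> n for 0 < n < 64: logical shift right, then OR in the sign
   mask shifted left into the vacated high bits.  */
v2di sarn (v2di v, int n)
{
  v2di sign = v < 0;
  return (v2di) ((v2du) v >> n) | (sign << (64 - n));
}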
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (9 preceding siblings ...)
2021-02-05 11:52 ` jakub at gcc dot gnu.org
@ 2021-02-05 12:52 ` rguenth at gcc dot gnu.org
2021-02-05 13:43 ` jakub at gcc dot gnu.org
` (38 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-05 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, as in instead of return vec >> 63; do return vec < 0;
> (in C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }
> For other arithmetic shifts by scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.
OK, so that yields
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm0
vpxor %xmm1, %xmm1, %xmm1
vpalignr $8, %xmm0, %xmm0, %xmm2
vpcmpgtq %xmm2, %xmm1, %xmm1
vpand .LC0(%rip), %xmm1, %xmm1
vpsllq $1, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
ret
when I feed the following to SLP2 directly:
void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
long unsigned int carry;
long unsigned int _1;
long unsigned int _2;
long unsigned int _3;
long unsigned int _4;
long unsigned int _5;
long unsigned int _6;
__int128 unsigned _9;
long unsigned int _14;
long unsigned int _15;
long int _18;
long int _19;
long unsigned int _20;
__BB(2,guessed_local(1073741824)):
_9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
_14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
_18 = (long int) _14;
_1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
carry_10 = _1 & 135ul;
_2 = _14 << 1;
_15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
_19 = (long int) _15;
_20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
_3 = _20 & 1ul;
_4 = _2 ^ _3;
_5 = _15 << 1;
_6 = _5 ^ carry_10;
__MEM <long unsigned int, 8> ((char *)out_11(D)) = _6;
__MEM <long unsigned int, 8> ((char *)out_11(D) + _Literal (char *) 8) = _4;
return;
}
with
<bb 2> [local count: 1073741824]:
_9 = MEM <__int128 unsigned> [(char *)in_8(D)];
_12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
_7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 },
108);
vect__5.0_13 = _12 << 1;
vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
return;
in .optimized
The latency of the data is at least 7 instructions that way, compared to 4 in the non-vectorized code (guess I could try Intel IACA on it).
So if that's indeed the best we can do then it's not profitable (btw, with the above the vectorizer's conclusion is "not profitable", but that is due to excessive costing of constants for the condition vectorization).
Simple asm replacement of the kernel results in
AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte (382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte (382.79 MiB in 499.03 ms)
compared to
AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte (384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte (384.22 MiB in 499.45 ms)
so that's indeed no improvement. Bigger block sizes also contain vector
code but that's not exercised by the botan speed measurement.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (10 preceding siblings ...)
2021-02-05 12:52 ` rguenth at gcc dot gnu.org
@ 2021-02-05 13:43 ` jakub at gcc dot gnu.org
2021-02-05 14:36 ` jakub at gcc dot gnu.org
` (37 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 13:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |uros at gcc dot gnu.org
--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For V2DImode arithmetic right shift, I think it would be something like:
--- gcc/config/i386/sse.md.jj 2021-01-27 11:50:09.168981297 +0100
+++ gcc/config/i386/sse.md 2021-02-05 14:32:44.175463716 +0100
@@ -20313,10 +20313,55 @@ (define_expand "ashrv2di3"
(ashiftrt:V2DI
(match_operand:V2DI 1 "register_operand")
(match_operand:DI 2 "nonmemory_operand")))]
- "TARGET_XOP || TARGET_AVX512VL"
+ "TARGET_SSE4_2"
{
if (!TARGET_AVX512VL)
{
+ if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ {
+ rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
+ emit_insn (gen_sse4_2_gtv2di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+ if (!TARGET_XOP)
+ {
+ rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V2DImode);
+ emit_insn (gen_sse4_2_gtv2di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V2DImode);
+ emit_insn (gen_lshrv2di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V2DImode);
+ rtx amount;
+ if (CONST_INT_P (operands[2]))
+ amount = GEN_INT (64 - INTVAL (operands[2]));
+ else if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ if (!CONST_INT_P (operands[2]))
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv2di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv2di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+
rtx reg = gen_reg_rtx (V2DImode);
rtx par;
bool negate = false;
plus adjusting the cost computation to hint that at least the non-63 arithmetic right V2DImode shifts are more expensive.
Even if in the end the V2DImode arithmetic right shifts turn out to be more expensive than scalar code (though that would surprise me, at least for the >> 63 case), I think V4DImode for TARGET_AVX2 should always be beneficial (haven't tried to adjust the expander for that yet).
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (11 preceding siblings ...)
2021-02-05 13:43 ` jakub at gcc dot gnu.org
@ 2021-02-05 14:36 ` jakub at gcc dot gnu.org
2021-02-05 16:29 ` jakub at gcc dot gnu.org
` (36 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 14:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
V4DImode arithmetic right shifts would be (untested):
--- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100
+++ gcc/config/i386/sse.md 2021-02-05 15:24:37.942026401 +0100
@@ -12458,7 +12458,7 @@
(set_attr "prefix" "orig,vex")
(set_attr "mode" "<sseinsnmode>")])
-(define_insn "ashr<mode>3<mask_name>"
+(define_insn "<mask_codefor>ashr<mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
(ashiftrt:VI248_AVX512BW_AVX512VL
(match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand"
"v,vm")
@@ -12472,6 +12472,67 @@
(const_string "0")))
(set_attr "mode" "<sseinsnmode>")])
+(define_expand "ashr<mode>3"
+ [(set (match_operand:VI248_AVX512BW 0 "register_operand")
+ (ashiftrt:VI248_AVX512BW
+ (match_operand:VI248_AVX512BW 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX512F")
+
+(define_expand "ashrv4di3"
+ [(set (match_operand:V4DI 0 "register_operand")
+ (ashiftrt:V4DI
+ (match_operand:V4DI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX2"
+{
+ if (!TARGET_AVX512VL)
+ {
+ if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ {
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ emit_insn (gen_avx2_gtv4di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V4DImode);
+ emit_insn (gen_avx2_gtv4di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V4DImode);
+ emit_insn (gen_lshrv4di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V4DImode);
+ rtx amount;
+ if (CONST_INT_P (operands[2]))
+ amount = GEN_INT (64 - INTVAL (operands[2]));
+ else if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ if (!CONST_INT_P (operands[2]))
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv4di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv4di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+})
+
(define_insn "<mask_codefor><insn><mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_2 0 "register_operand" "=v,v")
(any_lshift:VI248_AVX512BW_2
Trying 3 different routines, one returning >> 63 of a V4DImode vector, another
one >> 17 and another one >> var, the differences with -mavx2 are:
- vextracti128 $0x1, %ymm0, %xmm1
- vmovq %xmm0, %rax
- vpextrq $1, %xmm0, %rcx
- cqto
- vmovq %xmm1, %rax
- sarq $63, %rcx
- sarq $63, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rsi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rcx, %xmm3, %xmm0
- sarq $63, %rax
- vmovq %rsi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vmovdqa %ymm0, %ymm1
+ vpxor %xmm0, %xmm0, %xmm0
+ vpcmpgtq %ymm1, %ymm0, %ymm0
- vmovq %xmm0, %rax
- vextracti128 $0x1, %ymm0, %xmm1
- vpextrq $1, %xmm0, %rcx
- sarq $17, %rax
- sarq $17, %rcx
- movq %rax, %rdx
- vmovq %xmm1, %rax
- sarq $17, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rsi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rcx, %xmm3, %xmm0
- sarq $17, %rax
- vmovq %rsi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vpxor %xmm1, %xmm1, %xmm1
+ vpcmpgtq %ymm0, %ymm1, %ymm1
+ vpsrlq $17, %ymm0, %ymm0
+ vpsllq $47, %ymm1, %ymm1
+ vpor %ymm1, %ymm0, %ymm0
and
- movl %edi, %ecx
- vmovq %xmm0, %rax
- vextracti128 $0x1, %ymm0, %xmm1
- sarq %cl, %rax
- vpextrq $1, %xmm0, %rsi
- movq %rax, %rdx
- vmovq %xmm1, %rax
- sarq %cl, %rsi
- sarq %cl, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rdi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rsi, %xmm3, %xmm0
- sarq %cl, %rax
+ vpxor %xmm1, %xmm1, %xmm1
+ movslq %edi, %rdi
+ movl $64, %eax
+ vpcmpgtq %ymm0, %ymm1, %ymm1
+ subq %rdi, %rax
vmovq %rdi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vmovq %rax, %xmm3
+ vpsrlq %xmm2, %ymm0, %ymm0
+ vpsllq %xmm3, %ymm1, %ymm1
+ vpor %ymm1, %ymm0, %ymm0
so at least size-wise it's much smaller.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (12 preceding siblings ...)
2021-02-05 14:36 ` jakub at gcc dot gnu.org
@ 2021-02-05 16:29 ` jakub at gcc dot gnu.org
2021-02-05 17:55 ` jakub at gcc dot gnu.org
` (35 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 16:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Looking at what other compilers emit for this: ICC seems to be completely broken, as it emits logical right shifts instead of arithmetic right shifts, and LLVM trunk emits for >> 63 what this patch emits. For >> 17 it emits
vpsrad $17, %xmm0, %xmm1
vpsrlq $17, %xmm0, %xmm0
vpblendd $10, %xmm1, %xmm0, %xmm0
instead of
vpxor %xmm1, %xmm1, %xmm1
vpcmpgtq %xmm0, %xmm1, %xmm1
vpsrlq $17, %xmm0, %xmm0
vpsllq $47, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
the patch emits. For >> 47 it emits:
vpsrad $31, %xmm0, %xmm1
vpsrad $15, %xmm0, %xmm0
vpshufd $245, %xmm0, %xmm0
vpblendd $10, %xmm1, %xmm0, %xmm0
etc.
So, in summary: for >> 63 with SSE4.2 I think what the patch does looks best; for >> 63 with only SSE2 we can emit psrad $31 instead and permute the odd elements into the even ones (i.e. __builtin_shuffle ((v4si) x >> 31, { 1, 1, 3, 3 })), as written out below.
For >> cst where cst < 32, do a psrad and a psrlq by that cst and permute such that we get the even SI elts from the psrlq result and the odd ones from the psrad result.
For >> 32, do a psrad $31 and permute to get the even SI elts from the odd elts of the source and the odd SI elts from the odd results of psrad $31.
For >> cst where cst > 32, do psrad $31 and psrad $(cst-32) and permute such that the even SI elts come from the odd elts of the latter and the odd elts come from the odd elts of the former.
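The SSE2 >> 63 variant in GNU C, for reference (a sketch; the function name is made up):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef int v4si __attribute__ ((vector_size (16)));

/* psrad $31 replicates each 64-bit sign bit across the odd 32-bit
   lanes; duplicating lane 1 into lane 0 and lane 3 into lane 2 then
   yields 0 or all-ones per 64-bit element.  */
v2di sar63_sse2 (v2di x)
{
  v4si t = (v4si) x >> 31;
  return (v2di) __builtin_shuffle (t, (v4si) { 1, 1, 3, 3 });
}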
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (13 preceding siblings ...)
2021-02-05 16:29 ` jakub at gcc dot gnu.org
@ 2021-02-05 17:55 ` jakub at gcc dot gnu.org
2021-02-05 19:48 ` jakub at gcc dot gnu.org
` (34 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 17:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Here's a WIP patch that implements that, except that we still need some permutation expansion improvements, both for the SSE2 V4SImode permutation cases and for the AVX2 V8SImode permutation cases.
--- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100
+++ gcc/config/i386/sse.md 2021-02-05 18:49:29.621590903 +0100
@@ -12458,7 +12458,7 @@
(set_attr "prefix" "orig,vex")
(set_attr "mode" "<sseinsnmode>")])
-(define_insn "ashr<mode>3<mask_name>"
+(define_insn "<mask_codefor>ashr<mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
(ashiftrt:VI248_AVX512BW_AVX512VL
(match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand"
"v,vm")
@@ -12472,6 +12472,125 @@
(const_string "0")))
(set_attr "mode" "<sseinsnmode>")])
+(define_expand "ashr<mode>3"
+ [(set (match_operand:VI248_AVX512BW 0 "register_operand")
+ (ashiftrt:VI248_AVX512BW
+ (match_operand:VI248_AVX512BW 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX512F")
+
+(define_expand "ashrv4di3"
+ [(set (match_operand:V4DI 0 "register_operand")
+ (ashiftrt:V4DI
+ (match_operand:V4DI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX2"
+{
+ if (!TARGET_AVX512VL)
+ {
+ if (CONST_INT_P (operands[2]) && UINTVAL (operands[2]) >= 63)
+ {
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ emit_insn (gen_avx2_gtv4di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+ if (CONST_INT_P (operands[2]))
+ {
+ vec_perm_builder sel (8, 8, 1);
+ sel.quick_grow (8);
+ rtx arg0, arg1;
+ rtx op1 = lowpart_subreg (V8SImode, operands[1], V4DImode);
+ rtx target = gen_reg_rtx (V8SImode);
+ if (INTVAL (operands[2]) > 32)
+ {
+ arg0 = gen_reg_rtx (V8SImode);
+ arg1 = gen_reg_rtx (V8SImode);
+ emit_insn (gen_ashrv8si3 (arg1, op1, GEN_INT (31)));
+ emit_insn (gen_ashrv8si3 (arg0, op1,
+ GEN_INT (INTVAL (operands[2]) - 32)));
+ sel[0] = 1;
+ sel[1] = 9;
+ sel[2] = 3;
+ sel[3] = 11;
+ sel[4] = 5;
+ sel[5] = 13;
+ sel[6] = 7;
+ sel[7] = 15;
+ }
+ else if (INTVAL (operands[2]) == 32)
+ {
+ arg0 = op1;
+ arg1 = gen_reg_rtx (V8SImode);
+ emit_insn (gen_ashrv8si3 (arg1, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 9;
+ sel[2] = 3;
+ sel[3] = 11;
+ sel[4] = 5;
+ sel[5] = 13;
+ sel[6] = 7;
+ sel[7] = 15;
+ }
+ else
+ {
+ arg0 = gen_reg_rtx (V2DImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_lshrv2di3 (arg0, operands[1], operands[2]));
+ emit_insn (gen_ashrv4si3 (arg1, op1, operands[2]));
+ arg0 = lowpart_subreg (V4SImode, arg0, V2DImode);
+ sel[0] = 0;
+ sel[1] = 9;
+ sel[2] = 2;
+ sel[3] = 11;
+ sel[4] = 4;
+ sel[5] = 13;
+ sel[6] = 6;
+ sel[7] = 15;
+ }
+ vec_perm_indices indices (sel, 2, 8);
+ bool ok = targetm.vectorize.vec_perm_const (V8SImode, target,
+ arg0, arg1, indices);
+ gcc_assert (ok);
+ emit_move_insn (operands[0],
+ lowpart_subreg (V4DImode, target, V8SImode));
+ DONE;
+ }
+
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V4DImode);
+ emit_insn (gen_avx2_gtv4di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V4DImode);
+ emit_insn (gen_lshrv4di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V4DImode);
+ rtx amount;
+ if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv4di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv4di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+})
+
(define_insn "<mask_codefor><insn><mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_2 0 "register_operand" "=v,v")
(any_lshift:VI248_AVX512BW_2
@@ -20313,11 +20432,13 @@
(ashiftrt:V2DI
(match_operand:V2DI 1 "register_operand")
(match_operand:DI 2 "nonmemory_operand")))]
- "TARGET_SSE4_2"
+ "TARGET_SSE2"
{
if (!TARGET_AVX512VL)
{
- if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ if (TARGET_SSE4_2
+ && CONST_INT_P (operands[2])
+ && UINTVAL (operands[2]) >= 63)
{
rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
emit_insn (gen_sse4_2_gtv2di3 (operands[0], zero, operands[1]));
@@ -20328,6 +20449,65 @@
emit_move_insn (operands[0], operands[1]);
DONE;
}
+ if (CONST_INT_P (operands[2])
+ && (!TARGET_XOP || UINTVAL (operands[2]) >= 63))
+ {
+ vec_perm_builder sel (4, 4, 1);
+ sel.quick_grow (4);
+ rtx arg0, arg1;
+ rtx op1 = lowpart_subreg (V4SImode, operands[1], V2DImode);
+ rtx target = gen_reg_rtx (V4SImode);
+ if (UINTVAL (operands[2]) >= 63)
+ {
+ arg0 = arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg0, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 1;
+ sel[2] = 3;
+ sel[3] = 3;
+ }
+ else if (INTVAL (operands[2]) > 32)
+ {
+ arg0 = gen_reg_rtx (V4SImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg1, op1, GEN_INT (31)));
+ emit_insn (gen_ashrv4si3 (arg0, op1,
+ GEN_INT (INTVAL (operands[2]) - 32)));
+ sel[0] = 1;
+ sel[1] = 5;
+ sel[2] = 3;
+ sel[3] = 7;
+ }
+ else if (INTVAL (operands[2]) == 32)
+ {
+ arg0 = op1;
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg1, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 5;
+ sel[2] = 3;
+ sel[3] = 7;
+ }
+ else
+ {
+ arg0 = gen_reg_rtx (V2DImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_lshrv2di3 (arg0, operands[1], operands[2]));
+ emit_insn (gen_ashrv4si3 (arg1, op1, operands[2]));
+ arg0 = lowpart_subreg (V4SImode, arg0, V2DImode);
+ sel[0] = 0;
+ sel[1] = 5;
+ sel[2] = 2;
+ sel[3] = 7;
+ }
+ vec_perm_indices indices (sel, arg0 != arg1 ? 2 : 1, 4);
+ bool ok = targetm.vectorize.vec_perm_const (V4SImode, target,
+ arg0, arg1, indices);
+ gcc_assert (ok);
+ emit_move_insn (operands[0],
+ lowpart_subreg (V2DImode, target, V4SImode));
+ DONE;
+ }
if (!TARGET_XOP)
{
rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
@@ -20337,9 +20517,7 @@
emit_insn (gen_lshrv2di3 (lshr_res, operands[1], operands[2]));
rtx ashl_res = gen_reg_rtx (V2DImode);
rtx amount;
- if (CONST_INT_P (operands[2]))
- amount = GEN_INT (64 - INTVAL (operands[2]));
- else if (TARGET_64BIT)
+ if (TARGET_64BIT)
{
amount = gen_reg_rtx (DImode);
emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
@@ -20355,8 +20533,7 @@
emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
temp));
}
- if (!CONST_INT_P (operands[2]))
- amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
emit_insn (gen_ashlv2di3 (ashl_res, zero_or_all_ones, amount));
emit_insn (gen_iorv2di3 (operands[0], lshr_res, ashl_res));
DONE;
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (14 preceding siblings ...)
2021-02-05 17:55 ` jakub at gcc dot gnu.org
@ 2021-02-05 19:48 ` jakub at gcc dot gnu.org
2021-02-08 15:14 ` jakub at gcc dot gnu.org
` (33 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 19:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The needed permutations for this boil down to
typedef int V __attribute__((vector_size (16)));
typedef int W __attribute__((vector_size (32)));
#ifdef __clang__
V f1 (V x) { return __builtin_shufflevector (x, x, 1, 1, 3, 3); }
V f2 (V x, V y) { return __builtin_shufflevector (x, y, 1, 5, 3, 7); }
V f3 (V x, V y) { return __builtin_shufflevector (x, y, 0, 5, 2, 7); }
#ifdef __AVX2__
W f4 (W x, W y) { return __builtin_shufflevector (x, y, 1, 9, 3, 11, 5, 13, 7, 15); }
W f5 (W x, W y) { return __builtin_shufflevector (x, y, 0, 9, 2, 11, 4, 13, 6, 15); }
W f6 (W x) { return __builtin_shufflevector (x, x, 1, 1, 3, 3, 5, 5, 7, 7); }
#endif
V f7 (V x) { return __builtin_shufflevector (x, x, 1, 3, 2, 3); }
V f8 (V x) { return __builtin_shufflevector (x, x, 0, 2, 2, 3); }
V f9 (V x, V y) { return __builtin_shufflevector (x, y, 0, 4, 1, 5); }
#else
V f1 (V x) { return __builtin_shuffle (x, (V) { 1, 1, 3, 3 }); }
V f2 (V x, V y) { return __builtin_shuffle (x, y, (V) { 1, 5, 3, 7 }); }
V f3 (V x, V y) { return __builtin_shuffle (x, y, (V) { 0, 5, 2, 7 }); }
#ifdef __AVX2__
W f4 (W x, W y) { return __builtin_shuffle (x, y, (W) { 1, 9, 3, 11, 5, 13, 7, 15 }); }
W f5 (W x, W y) { return __builtin_shuffle (x, y, (W) { 0, 9, 2, 11, 4, 13, 6, 15 }); }
W f6 (W x) { return __builtin_shuffle (x, (W) { 1, 1, 3, 3, 5, 5, 7, 7 }); }
#endif
V f7 (V x) { return __builtin_shuffle (x, (V) { 1, 3, 2, 3 }); }
V f8 (V x) { return __builtin_shuffle (x, (V) { 0, 2, 2, 3 }); }
V f9 (V x, V y) { return __builtin_shuffle (x, y, (V) { 0, 4, 1, 5 }); }
#endif
With -msse2, LLVM emits 2 x pshufd $237 + punpckldq for f2, and pshufd $237 + pshufd $232 + punpckldq; we give up or emit very large code.
With -msse4, we handle everything, and f1/f3 are the same/comparable, but for f2 we emit 2 x pshufb (with memory operands) + por while LLVM emits pshufd $245 + pblendw $204.
With -mavx2, the f2 inefficiency remains, and for f4 we emit 2 x vpshufb with memory operands + vpor while LLVM emits vpermilps $245 + vblendps $170.
f6-f9 are all permutations that we handle through a single insn, and those plus f3 are the roadblocks to building the f2 and f4 permutations more efficiently.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (15 preceding siblings ...)
2021-02-05 19:48 ` jakub at gcc dot gnu.org
@ 2021-02-08 15:14 ` jakub at gcc dot gnu.org
2021-03-04 12:14 ` rguenth at gcc dot gnu.org
` (32 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-08 15:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 50142
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50142&action=edit
gcc11-pr98856.patch
Full patch.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (16 preceding siblings ...)
2021-02-08 15:14 ` jakub at gcc dot gnu.org
@ 2021-03-04 12:14 ` rguenth at gcc dot gnu.org
2021-03-04 15:36 ` rguenth at gcc dot gnu.org
` (31 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 12:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |vmakarov at gcc dot gnu.org
Keywords| |ra
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
So coming back here. We're presenting RA with a quite hard problem given we
have
(insn 7 4 8 2 (set (reg:TI 84 [ _9 ])
(mem:TI (reg:DI 101) [0 MEM <__int128 unsigned> [(char * {ref-all})in_8(D)]+0 S16 A8])) 73 {*movti_internal}
(expr_list:REG_DEAD (reg:DI 101)
(nil)))
(insn 8 7 9 2 (parallel [
(set (reg:DI 95)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":7:26 703 {*lshrdi3_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
..
(insn 10 9 11 2 (parallel [
(set (reg:DI 97)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 0)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":8:30 703 {*lshrdi3_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
..
(insn 12 11 13 2 (set (reg:V2DI 98 [ vect__5.3 ])
(ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
(const_int 1 [0x1]))) "t.c":9:16 3611 {ashlv2di3}
(expr_list:REG_DEAD (reg:TI 84 [ _9 ])
(nil)))
where I wonder why we keep the (subreg:DI (reg:TI 84 ...) 8) around for so long. Probably the subreg pass gives up because of the V2DImode subreg of that reg.
That said, RA chooses xmm for reg:84 but then spills it immediately to fulfil the subregs, even though there are mov and pextrd that could be used, or the reload could use the original mem. That we reload even the xmm use is another odd thing.
Vlad, I'm not sure about the possibilities LRA has here, but maybe you can have a look at the testcase in comment #6 (use -O3 -march=znver2 or -march=core-avx2). For one, I expected
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax (2a)
vmovdqa -24(%rsp), %xmm4 (1)
...
movq -24(%rsp), %rdx (2b)
(1) to not be there (not sure how that even survives postreload optimizations...)
(2a/b) to be 'inherited' by instead loading from (%rsi) and 8(%rsi), which is maybe too much to ask because it requires aliasing considerations.
That is, even if we don't consider using
movq %xmm2, %rax (2a)
pextrd %xmm2, %rdx, 1 (2b)
I expected us to not spill.
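For reference, the spill-free extraction above as intrinsics (a sketch; the high-part extract needs SSE4.1's pextrq, function names made up):

#include <smmintrin.h>
#include <stdint.h>

uint64_t lo_qword (__m128i v) { return (uint64_t) _mm_cvtsi128_si64 (v); }    /* movq   */
uint64_t hi_qword (__m128i v) { return (uint64_t) _mm_extract_epi64 (v, 1); } /* pextrq */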
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (17 preceding siblings ...)
2021-03-04 12:14 ` rguenth at gcc dot gnu.org
@ 2021-03-04 15:36 ` rguenth at gcc dot gnu.org
2021-03-04 16:12 ` rguenth at gcc dot gnu.org
` (30 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 15:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's another thing: we end up with
vmovq %rax, %xmm3
vpinsrq $1, %rdx, %xmm3, %xmm0
but that has much worse latency than the alternative you'd get without SSE4.1:
vmovq %rax, %xmm3
vmovq %rdx, %xmm7
punpcklqdq %xmm7, %xmm3
For example, on Zen3 vmovq and vpinsrq have latencies of 3 while punpck has a latency of only one, so the second variant should have 2 cycles less latency.
Testcase:
typedef long v2di __attribute__((vector_size(16)));
v2di foo (long a, long b)
{
return (v2di){a, b};
}
Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not sure if we should somehow do this late (peephole or splitter) since it requires one more %xmm register.
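The two lowerings as intrinsics, for reference (a sketch; function names are made up):

#include <smmintrin.h>

/* What we currently emit: movq + pinsrq, serializing two slow
   GPR-to-xmm transfers.  */
__m128i concat_pinsr (long long a, long long b)
{
  return _mm_insert_epi64 (_mm_cvtsi64_si128 (a), b, 1);
}

/* The lower-latency variant: two independent movq's followed by a
   cheap punpcklqdq, at the cost of one more %xmm register.  */
__m128i concat_punpck (long long a, long long b)
{
  return _mm_unpacklo_epi64 (_mm_cvtsi64_si128 (a), _mm_cvtsi64_si128 (b));
}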
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (18 preceding siblings ...)
2021-03-04 15:36 ` rguenth at gcc dot gnu.org
@ 2021-03-04 16:12 ` rguenth at gcc dot gnu.org
2021-03-04 17:56 ` ubizjak at gmail dot com
` (29 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 16:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
So to recover performance we need both: avoiding the latency on the vector side plus avoiding the spilling. This variant is fast:
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq (%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vmovq %rdx, %xmm5
vpunpcklqdq %xmm5, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
compared to the original:
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 16(%rsp), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (19 preceding siblings ...)
2021-03-04 16:12 ` rguenth at gcc dot gnu.org
@ 2021-03-04 17:56 ` ubizjak at gmail dot com
2021-03-04 18:12 ` ubizjak at gmail dot com
` (28 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-04 17:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #20 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #18)
> Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> sure whether we should do this late somehow (peephole or splitter) since
> it requires one more %xmm register.
What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (20 preceding siblings ...)
2021-03-04 17:56 ` ubizjak at gmail dot com
@ 2021-03-04 18:12 ` ubizjak at gmail dot com
2021-03-05 7:44 ` rguenth at gcc dot gnu.org
` (27 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-04 18:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #21 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #20)
> (In reply to Richard Biener from comment #18)
> > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> > sure whether we should do this late somehow (peephole or splitter) since
> > it requires one more %xmm register.
> What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
Please try this:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..edf7b1a3074 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -16043,7 +16043,12 @@
(const_string "maybe_evex")
]
(const_string "orig")))
- (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
+ (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
+ (set (attr "preferred_for_speed")
+ (cond [(eq_attr "alternative" "0,1,2,3")
+ (symbol_ref "false")
+ ]
+ (symbol_ref "true")))])
(define_insn "*vec_concatv2di_0"
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (21 preceding siblings ...)
2021-03-04 18:12 ` ubizjak at gmail dot com
@ 2021-03-05 7:44 ` rguenth at gcc dot gnu.org
2021-03-05 7:46 ` rguenth at gcc dot gnu.org
` (26 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 7:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> > > sure whether we should do this late somehow (peephole or splitter) since
> > > it requires one more %xmm register.
> > What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
>
> Please try this:
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..edf7b1a3074 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -16043,7 +16043,12 @@
> (const_string "maybe_evex")
> ]
> (const_string "orig")))
> - (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
> + (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
> + (set (attr "preferred_for_speed")
> + (cond [(eq_attr "alternative" "0,1,2,3")
> + (symbol_ref "false")
> + ]
> + (symbol_ref "true")))])
>
> (define_insn "*vec_concatv2di_0"
That works to avoid the vpinsrq. I guess the case of a mem operand
behaves similarly to a GPR (plus the load uop); at least I don't have any
contrary evidence (but I didn't do any microbenchmarks either).
I'm not sure IRA/LRA will optimally handle the situation with register
pressure causing spilling in case it needs to reload both gpr operands.
At least for
typedef long v2di __attribute__((vector_size(16)));
v2di foo (long a, long b)
{
  return (v2di){a, b};
}
with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
-ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9 -ffixed-xmm10
-ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14 -ffixed-xmm15 I get
with the
patch
foo:
.LFB0:
.cfi_startproc
movq %rsi, -16(%rsp)
movq %rdi, %xmm0
pinsrq $1, -16(%rsp), %xmm0
ret
while without it's
movq %rdi, %xmm0
pinsrq $1, %rsi, %xmm0
As far as I understand the LRA dumps, the new attribute is a hard one,
applying even when other alternatives are worse. In this case we choose
alt 7. Covering also alts 7 and 8 with the optimize-for-speed attribute
causes reload failures - which is expected if there's no way for LRA to
choose alt 1. The following seems to work for the small testcase above
but not for the important case in the benchmark (meh).
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..e393a0d823b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15992,7 +15992,7 @@
(match_operand:DI 1 "register_operand"
" 0, 0,x ,Yv,0,Yv,0,0,v")
(match_operand:DI 2 "nonimmediate_operand"
- " rm,rm,rm,rm,x,Yv,x,m,m")))]
+ " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
"TARGET_SSE"
"@
pinsrq\t{$1, %2, %0|%0, %2, 1}
I guess the idea of this insn setup was exactly to get IRA/LRA to choose
the optimal instruction sequence - otherwise exposing the reload so
late is probably suboptimal.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (22 preceding siblings ...)
2021-03-05 7:44 ` rguenth at gcc dot gnu.org
@ 2021-03-05 7:46 ` rguenth at gcc dot gnu.org
2021-03-05 8:29 ` ubizjak at gmail dot com
` (25 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 7:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50300
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit
preprocessed source of the important Botan TU
This is the full preprocessed source of the TU. When compiled with -Ofast
-march=znver2 look for poly_double_n_le in the assembly, in the prologue the
function jumps based on kernel size - size 16 is the important one:
cmpq $16, %rdx
je .L54
...
.L54:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rcx
movq 16(%rsp), %rdx
vmovq %rcx, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
leaq -16(%rbp), %rsp
popq %r12
popq %r13
popq %rbp
.cfi_remember_state
.cfi_def_cfa 7, 8
ret
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (23 preceding siblings ...)
2021-03-05 7:46 ` rguenth at gcc dot gnu.org
@ 2021-03-05 8:29 ` ubizjak at gmail dot com
2021-03-05 10:04 ` rguenther at suse dot de
` (24 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 8:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #22)
> That works to avoid the vpinsrq. I guess the case of a mem operand
> behaves similarly to a GPR (plus the load uop); at least I don't have any
> contrary evidence (but I didn't do any microbenchmarks either).
>
> I'm not sure IRA/LRA will optimally handle the situation with register
> pressure causing spilling in case it needs to reload both gpr operands.
> At least for
>
> typedef long v2di __attribute__((vector_size(16)));
>
> v2di foo (long a, long b)
> {
>   return (v2di){a, b};
> }
>
> with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
> -ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9
> -ffixed-xmm10 -ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14
> -ffixed-xmm15 I get with the
> patch
>
> foo:
> .LFB0:
> .cfi_startproc
> movq %rsi, -16(%rsp)
> movq %rdi, %xmm0
> pinsrq $1, -16(%rsp), %xmm0
> ret
>
> while without it's
>
> movq %rdi, %xmm0
> pinsrq $1, %rsi, %xmm0
This is expected; my patch is based on the assumption that punpcklqdq is cheap
compared to pinsrq, and that inter-unit moves are cheap. This way, IRA will
reload the GP register to an XMM register and use the cheaper instruction.
> As far as I understand the LRA dumps, the new attribute is a hard one,
> applying even when other alternatives are worse. In this case we choose
> alt 7. Covering also alts 7 and 8 with the optimize-for-speed attribute
> causes reload failures - which is expected if there's no way for LRA to
> choose alt 1. The following seems to work for the small testcase above
> but not for the important case in the benchmark (meh).
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..e393a0d823b 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15992,7 +15992,7 @@
> (match_operand:DI 1 "register_operand"
> " 0, 0,x ,Yv,0,Yv,0,0,v")
> (match_operand:DI 2 "nonimmediate_operand"
> - " rm,rm,rm,rm,x,Yv,x,m,m")))]
> + " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
> "TARGET_SSE"
> "@
> pinsrq\t{$1, %2, %0|%0, %2, 1}
The above means that GP will still be used, since it fits without reloading.
> I guess the idea of this insn setup was exactly to get IRA/LRA choose
> the optimal instruction sequence - otherwise exposing the reload so
> late is probably suboptimal.
There is one more tool in the toolbox. A peephole2 pattern can be
conditionalized on an available XMM register. So, if an XMM reg is available,
the GPR->XMM move can be emitted in front of the insn. So, if there is XMM
register pressure, pinsrd will be used, but if an XMM register is available,
it will be reused to emit punpcklqdq.
The peephole2 pattern can also be conditionalized for targets where GPR->XMM
moves are fast.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (24 preceding siblings ...)
2021-03-05 8:29 ` ubizjak at gmail dot com
@ 2021-03-05 10:04 ` rguenther at suse dot de
2021-03-05 10:43 ` rguenth at gcc dot gnu.org
` (23 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenther at suse dot de @ 2021-03-05 10:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #22)
> > I guess the idea of this insn setup was exactly to get IRA/LRA to choose
> > the optimal instruction sequence - otherwise exposing the reload so
> > late is probably suboptimal.
>
> There is one more tool in the toolbox. A peephole2 pattern can be
> conditionalized on an available XMM register. So, if an XMM reg is available,
> the GPR->XMM move can be emitted in front of the insn. So, if there is XMM
> register pressure, pinsrd will be used, but if an XMM register is available,
> it will be reused to emit punpcklqdq.
>
> The peephole2 pattern can also be conditionalized for targets where GPR->XMM
> moves are fast.
Note the trick is esp. important when GPR->XMM moves are _slow_. But only
in the case we originally combine two GPR operands. Doing two
GPR->XMM moves and then one punpcklqdq hides half of the latency of the
slow moves since they have no data dependence on each other. So for the
peephole we should try to match this - a reloaded operand and a GPR
operand. When the %xmm operand results from a SSE computation there's
no point in splitting out a GPR->XMM move.
So in the end a peephole2 sounds like it could better match the condition
the transform is profitable on.
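As a rough model with the Zen 3 numbers from comment #18 (movq GPR->XMM: 3,
pinsrq from a GPR: 3, punpcklqdq: 1): movq feeding pinsrq is a serial chain
of 3 + 3 = 6 cycles, while the two independent movq overlap and feed the
merge, giving roughly max(3, 3) + 1 = 4 cycles.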
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (25 preceding siblings ...)
2021-03-05 10:04 ` rguenther at suse dot de
@ 2021-03-05 10:43 ` rguenth at gcc dot gnu.org
2021-03-05 11:56 ` ubizjak at gmail dot com
` (22 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 10:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> >
> > --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> > (In reply to Richard Biener from comment #22)
> > > I guess the idea of this insn setup was exactly to get IRA/LRA to choose
> > > the optimal instruction sequence - otherwise exposing the reload so
> > > late is probably suboptimal.
> >
> > There is one more tool in the toolbox. A peephole2 pattern can be
> > conditionalized on an available XMM register. So, if an XMM reg is
> > available, the GPR->XMM move can be emitted in front of the insn. So, if
> > there is XMM register pressure, pinsrd will be used, but if an XMM
> > register is available, it will be reused to emit punpcklqdq.
> >
> > The peephole2 pattern can also be conditionalized for targets where GPR->XMM
> > moves are fast.
>
> Note the trick is esp. important when GPR->XMM moves are _slow_. But only
> in the case we originally combine two GPR operands. Doing two
> GPR->XMM moves and then one punpcklqdq hides half of the latency of the
> slow moves since they have no data dependence on each other. So for the
> peephole we should try to match this - a reloaded operand and a GPR
> operand. When the %xmm operand results from a SSE computation there's
> no point in splitting out a GPR->XMM move.
>
> So in the end a peephole2 sounds like it could better match the condition
> the transform is profitable on.
I tried
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..8d0d3077cf8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1419,6 +1419,23 @@
DONE;
})
+(define_peephole2
+ [(set (match_operand:DI 0 "sse_reg_operand")
+ (match_operand:DI 1 "general_gr_operand"))
+ (match_scratch:DI 2 "sse_reg_operand")
+ (set (match_operand:V2DI 2 "sse_reg_operand")
+ (vec_concat:V2DI (match_dup:DI 0)
+ (match_operand:DI 3 "general_gr_operand")))]
+ "reload_completed"
+ [(set (match_dup 0)
+ (match_dup 1))
+ (set (match_dup 2)
+ (match_dup 3))
+ (set (match_dup 2)
+ (vec_concat:V2DI (match_dup 0)
+ (match_dup 2)))]
+ "")
+
;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
(define_peephole2
[(set (match_operand:V2DF 0 "sse_reg_operand")
but that doesn't seem to match for some unknown reason.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (26 preceding siblings ...)
2021-03-05 10:43 ` rguenth at gcc dot gnu.org
@ 2021-03-05 11:56 ` ubizjak at gmail dot com
2021-03-05 12:25 ` ubizjak at gmail dot com
` (21 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 11:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #27 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #26)
> but that doesn't seem to match for some unknown reason.
Try this:
(define_peephole2
[(match_scratch:DI 5 "Yv")
(set (match_operand:DI 0 "sse_reg_operand")
(match_operand:DI 1 "general_reg_operand"))
(set (match_operand:V2DI 2 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
(match_operand:DI 4 "nonimmediate_gr_operand")))]
""
[(set (match_dup 0)
(match_dup 1))
(set (match_dup 5)
(match_dup 4))
(set (match_dup 2)
(vec_concat:V2DI (match_dup 3)
(match_dup 5)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (27 preceding siblings ...)
2021-03-05 11:56 ` ubizjak at gmail dot com
@ 2021-03-05 12:25 ` ubizjak at gmail dot com
2021-03-05 12:27 ` rguenth at gcc dot gnu.org
` (20 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 12:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #28 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
> Try this:
The latency problem with the original testcase is solved with:
(define_peephole2
[(match_scratch:DI 3 "Yv")
(set (match_operand:V2DI 0 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
(match_operand:DI 2 "nonimmediate_gr_operand")))]
""
[(set (match_dup 3) (match_dup 2))
(set (match_dup 0)
(vec_concat:V2DI (match_dup 1) (match_dup 3)))])
but I don't know if this transformation applies universally to all x86 targets.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (28 preceding siblings ...)
2021-03-05 12:25 ` ubizjak at gmail dot com
@ 2021-03-05 12:27 ` rguenth at gcc dot gnu.org
2021-03-05 12:49 ` jakub at gcc dot gnu.org
` (19 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 12:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:
>
> (define_peephole2
> [(match_scratch:DI 5 "Yv")
> (set (match_operand:DI 0 "sse_reg_operand")
> (match_operand:DI 1 "general_reg_operand"))
> (set (match_operand:V2DI 2 "sse_reg_operand")
> (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
> (match_operand:DI 4 "nonimmediate_gr_operand")))]
> ""
> [(set (match_dup 0)
> (match_dup 1))
> (set (match_dup 5)
> (match_dup 4))
> (set (match_dup 2)
> (vec_concat:V2DI (match_dup 3)
> (match_dup 5)))])
Ah, I messed up operands. The following works (the above position of
match_scratch happily chooses an operand matching operand 0):
;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
[(set (match_operand:DI 0 "sse_reg_operand")
(match_operand:DI 1 "general_reg_operand"))
(match_scratch:DI 2 "Yv")
(set (match_operand:V2DI 3 "sse_reg_operand")
(vec_concat:V2DI (match_dup 0)
(match_operand:DI 4 "nonimmediate_gr_operand")))]
"reload_completed && optimize_insn_for_speed_p ()"
[(set (match_dup 0)
(match_dup 1))
(set (match_dup 2)
(match_dup 4))
(set (match_dup 3)
(vec_concat:V2DI (match_dup 0)
(match_dup 2)))])
but for some reason it again doesn't work for the important loop. There
we have
389: xmm0:DI=cx:DI
REG_DEAD cx:DI
390: dx:DI=[sp:DI+0x10]
56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
REG_UNUSED flags:CC
57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)
I suppose the reason is that there are two unrelated insns between the
xmm0 = cx:DI and the vec_concat. That would hint that we need to not
match this GPR->XMM move in the peephole pattern but instead check for
it in the condition (can we use DF there?)
The simplified variant below works but IMHO matches cases we do not
want to transform. I can't find any example on how to achieve that
though.
;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
[(match_scratch:DI 3 "Yv")
(set (match_operand:V2DI 0 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
(match_operand:DI 2 "nonimmediate_gr_operand")))]
"reload_completed && optimize_insn_for_speed_p ()"
[(set (match_dup 3)
(match_dup 2))
(set (match_dup 0)
(vec_concat:V2DI (match_dup 1)
(match_dup 3)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (29 preceding siblings ...)
2021-03-05 12:27 ` rguenth at gcc dot gnu.org
@ 2021-03-05 12:49 ` jakub at gcc dot gnu.org
2021-03-05 12:52 ` ubizjak at gmail dot com
` (18 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-05 12:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #30 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #29)
> I suppose the reason is that there are two unrelated insns between the
> xmm0 = cx:DI and the vec_concat. That would hint that we need to not
> match this GPR->XMM move in the peephole pattern but instead check for
> it in the condition (can we use DF there?)
peephole2 patterns are run in a pass that does:
df_set_flags (DF_LR_RUN_DCE);
df_note_add_problem ();
df_analyze ();
so, DF that uses the note or default problems is ok, but e.g.
DF_UD_CHAIN/DF_DU_CHAIN is not available.
But the condition can e.g. walk some number of previous instructions (with
some reasonably small upper bound) etc.
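Such a bounded backwards walk could look roughly like the sketch below
(illustrative only - the helper name is made up and the predicates are
simplified, e.g. basic-block boundaries are ignored; the RTL accessors used
do exist in the GCC/i386 backend context):
/* Return true if one of the MAX_WALK preceding insns moved a GPR into
   an SSE register.  */
static bool
recent_gpr_to_xmm_move_p (rtx_insn *insn, int max_walk)
{
  for (rtx_insn *prev = prev_nonnote_nondebug_insn (insn);
       prev && max_walk-- > 0;
       prev = prev_nonnote_nondebug_insn (prev))
    {
      rtx set = single_set (prev);
      if (set
          && REG_P (SET_DEST (set))
          && SSE_REGNO_P (REGNO (SET_DEST (set)))
          && REG_P (SET_SRC (set))
          && GENERAL_REGNO_P (REGNO (SET_SRC (set))))
        return true;
    }
  return false;
}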
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (30 preceding siblings ...)
2021-03-05 12:49 ` jakub at gcc dot gnu.org
@ 2021-03-05 12:52 ` ubizjak at gmail dot com
2021-03-05 12:55 ` rguenther at suse dot de
` (17 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #31 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #29)
> The simplified variant below works but IMHO matches cases we do not
> want to transform. I can't find any example on how to achieve that
> though.
I think that pinsrd should be transformed to punpcklqdq irrespective of its
first input operand. The insn scheduler should move insns around to mask their
latencies.
> ;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
> ;; one already reloaded, to hide the latency of one GPR->XMM transition.
> (define_peephole2
> [(match_scratch:DI 3 "Yv")
> (set (match_operand:V2DI 0 "sse_reg_operand")
> (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> (match_operand:DI 2 "nonimmediate_gr_operand")))]
> "reload_completed && optimize_insn_for_speed_p ()"
Please use
"TARGET_64BIT && TARGET_SSE4_1
&& !optimize_insn_for_size_p ()"
here.
> [(set (match_dup 3)
> (match_dup 2))
> (set (match_dup 0)
> (vec_concat:V2DI (match_dup 1)
> (match_dup 3)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (31 preceding siblings ...)
2021-03-05 12:52 ` ubizjak at gmail dot com
@ 2021-03-05 12:55 ` rguenther at suse dot de
2021-03-05 13:06 ` rguenth at gcc dot gnu.org
` (16 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenther at suse dot de @ 2021-03-05 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #32 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #31 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #29)
> > The simplified variant below works but IMHO matches cases we do not
> > want to transform. I can't find any example on how to achieve that
> > though.
>
> I think that pinsrd should be transformed to punpcklqdq irrespective of its
> first input operand. The insn scheduler should move insns around to mask their
> latencies.
>
> > ;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
> > ;; one already reloaded, to hide the latency of one GPR->XMM transitions.
> > (define_peephole2
> > [(match_scratch:DI 3 "Yv")
> > (set (match_operand:V2DI 0 "sse_reg_operand")
> > (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> > (match_operand:DI 2 "nonimmediate_gr_operand")))]
> > "reload_completed && optimize_insn_for_speed_p ()"
>
> Please use
>
> "TARGET_64BIT && TARGET_SSE4_1
> && !optimize_insn_for_size_p ()"
>
> here.
What about reload_completed? We really only want to do this after RA.
Will test the patch then and add the reduced testcase.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (32 preceding siblings ...)
2021-03-05 12:55 ` rguenther at suse dot de
@ 2021-03-05 13:06 ` rguenth at gcc dot gnu.org
2021-03-05 13:08 ` ubizjak at gmail dot com
` (15 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 13:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50308
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit
patch
I am testing the following.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (33 preceding siblings ...)
2021-03-05 13:06 ` rguenth at gcc dot gnu.org
@ 2021-03-05 13:08 ` ubizjak at gmail dot com
2021-03-05 14:35 ` rguenth at gcc dot gnu.org
` (14 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 13:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #34 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to rguenther@suse.de from comment #32)
> what about reload_completed? We really only want to do this after RA.
No need for it; this is the peephole2 pass, which *always* runs after reload.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (34 preceding siblings ...)
2021-03-05 13:08 ` ubizjak at gmail dot com
@ 2021-03-05 14:35 ` rguenth at gcc dot gnu.org
2021-03-08 10:41 ` rguenth at gcc dot gnu.org
` (13 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 14:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #35 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #33)
> Created attachment 50308 [details]
> patch
>
> I am testing the following.
It FAILs
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\
\\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\\\\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
I'll see how to update those next week.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (35 preceding siblings ...)
2021-03-05 14:35 ` rguenth at gcc dot gnu.org
@ 2021-03-08 10:41 ` rguenth at gcc dot gnu.org
2021-03-08 13:20 ` rguenth at gcc dot gnu.org
` (12 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 10:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #35)
> (In reply to Richard Biener from comment #33)
> > Created attachment 50308 [details]
> > patch
> >
> > I am testing the following.
>
> It FAILs
>
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\
> \\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
That's exactly the case we're looking for: a V2DI concat from two GPRs.
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\\\\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
This is, like below, a MEM case.
> FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
> vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
This one is because nonimmediate_gr_operand also matches a MEM, in this case
we apply the peephole to
(insn 12 11 13 2 (set (reg/v:V2DI 55 xmm19 [ c ])
(vec_concat:V2DI (reg:DI 54 xmm18 [91])
(mem:DI (reg/v/f:DI 4 si [orig:86 y ] [86]) [1 *y_8(D)+0 S8 A64])))
latency-wise memory isn't any better than a GPR so the decision to split
is reasonable.
> I'll see how to update those next week.
So I updated the above to check for vpunpcklqdq instead.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (36 preceding siblings ...)
2021-03-08 10:41 ` rguenth at gcc dot gnu.org
@ 2021-03-08 13:20 ` rguenth at gcc dot gnu.org
2021-03-08 15:46 ` amonakov at gcc dot gnu.org
` (11 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 13:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the
benchmark, only the spilling is.
Note that the other idea of disparaging vector CTORs more, like with
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2603333f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum
vect_cost_for_stmt type_of_cost,
case vec_construct:
{
- /* N element inserts into SSE vectors. */
+ /* N-element inserts into SSE vectors. */
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+ /* We cannot insert from GPRs directly but there's always a
+ GPR->XMM uop involved. Account for that.
+ ??? Note that loads are already costed separately so this
+ eventually double-counts them. */
+ if (!fp)
+ cost += (TYPE_VECTOR_SUBPARTS (vectype)
+ * ix86_cost->hard_register.integer_to_sse);
/* One vinserti128 for combining two SSE vectors for AVX256. */
if (GET_MODE_BITSIZE (mode) == 256)
cost += ix86_vec_cost (mode, ix86_cost->addss);
helps for generic and core-avx2 tuning:
t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 <unknown> 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 <unknown> 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
Vector cost: 48
Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.
but not for znver2:
t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 <unknown> 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 <unknown> 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
Vector cost: 52
Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP
Apparently for znver{1,2,3} we choose a slightly higher load/store cost.
We could also try mitigating vectorization by decomposing the __int128
load in forwprop where we have
else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
&& TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
&& gimple_assign_load_p (stmt)
&& !gimple_has_volatile_ops (stmt)
&& (TREE_CODE (gimple_assign_rhs1 (stmt))
!= TARGET_MEM_REF)
&& !stmt_can_throw_internal (cfun, stmt))
{
/* Rewrite loads used only in BIT_FIELD_REF extractions to
component-wise loads. */
this was tailored to decompose GCC vector extension loads that are not
supported on the HW early. Here we have
_9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
_14 = BIT_FIELD_REF <_9, 64, 64>;
_15 = BIT_FIELD_REF <_9, 64, 0>;
where the HW doesn't have any __int128 GPRs. If we do not vectorize then
the RTL pipeline will eventually split the load. If vectorization is
profitable then the vectorizer should be able to vectorize the resulting
split loads as well. In this case this would cause actual costing of the
load (the re-use of the __int128 to-be-in-SSE reg is instead free) and also
cost the live lane extract for the retained integer code. But that moves
the cost even more towards vectorizing since now a vector load (cost 12)
plus two live lane extracts (when fixed to cost sse_to_integer that's 2 * 6)
is used in place of two scalar loads (cost 2 * 12). On the code generation
side this improves things, avoiding the spilling but using vmovq/vpextrq,
which is not enough to fully recover, though it does help a bit (~5%):
vmovdqu (%rsi), %xmm1
vpextrq $1, %xmm1, %rax
shrq $63, %rax
imulq $135, %rax, %rax
vmovq %rax, %xmm0
vmovq %xmm1, %rax
vpsllq $1, %xmm1, %xmm1
shrq $63, %rax
vmovq %rax, %xmm2
vpunpcklqdq %xmm2, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
ret
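The presumed shape of the __int128 access in t.c that forwprop would
decompose (a reconstruction from the GIMPLE above, not the actual testcase):
#include <stdint.h>
#include <string.h>
void f (uint8_t *out, const uint8_t *in)
{
  unsigned __int128 x;                 /* _9 = MEM <__int128 unsigned> [...] */
  memcpy (&x, in, 16);
  uint64_t lo = (uint64_t) x;          /* BIT_FIELD_REF <_9, 64, 0>  */
  uint64_t hi = (uint64_t) (x >> 64);  /* BIT_FIELD_REF <_9, 64, 64> */
  uint64_t carry = hi >> 63;
  uint64_t r0 = (lo << 1) ^ (carry * 135);
  uint64_t r1 = (hi << 1) ^ (lo >> 63);
  memcpy (out, &r0, 8);
  memcpy (out + 8, &r1, 8);
}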
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (37 preceding siblings ...)
2021-03-08 13:20 ` rguenth at gcc dot gnu.org
@ 2021-03-08 15:46 ` amonakov at gcc dot gnu.org
2021-04-27 11:40 ` [Bug tree-optimization/98856] [11/12 " jakub at gcc dot gnu.org
` (10 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-03-08 15:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #38 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Late to the party, but the latency analysis of vpinsrq starting from comment
#18 is incorrect: its latency differs with respect to its operands.
For example, on Zen 2 the latency with respect to the GPR operand is long (6
cycles, one more than the gpr->xmm move latency), while the latency with
respect to the XMM operand is just one cycle, same as punpcklqdq. See
uops.info, which also shows that
vpinsrq involves 2 uops, and it's easy to guess what they are: first uop is for
gpr->xmm inter-unit move (latency 5), and the second is SSE merge:
https://uops.info/html-instr/VPINSRQ_XMM_XMM_R64_I8.html
https://uops.info/html-instr/VMOVD_XMM_R32.html
So in the CPU backend there's not much difference between
movq
pinsrq
and
movq
movq
punpcklqdq
both have the same uops and overall latency (1 + movq latency).
(Though on Intel, starting from Haswell, pinsrq oddly has latency 2 w.r.t.
the xmm operand, but on Ice Lake it is again 1 cycle.)
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (38 preceding siblings ...)
2021-03-08 15:46 ` amonakov at gcc dot gnu.org
@ 2021-04-27 11:40 ` jakub at gcc dot gnu.org
2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
` (9 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-04-27 11:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.0 |11.2
--- Comment #39 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (39 preceding siblings ...)
2021-04-27 11:40 ` [Bug tree-optimization/98856] [11/12 " jakub at gcc dot gnu.org
@ 2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
2021-07-28 7:05 ` rguenth at gcc dot gnu.org
` (8 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-05-13 10:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #40 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:
https://gcc.gnu.org/g:829c4bea06600ea4201462f91ce6d76ca21fdb35
commit r12-769-g829c4bea06600ea4201462f91ce6d76ca21fdb35
Author: Jakub Jelinek <jakub@redhat.com>
Date: Thu May 13 12:14:14 2021 +0200
ix86: Support V{2, 4}DImode arithmetic right shifts for SSE2+ [PR98856]
As mentioned in the PR, we don't support arithmetic right V2DImode or
V4DImode on x86 without -mavx512vl or -mxop. The ISAs indeed don't have
{,v}psraq instructions until AVX512VL, but we actually can emulate it quite
easily.
One case is arithmetic >> 63, we can just emit {,v}pxor; {,v}pcmpgt for
that for SSE4.2+, or for SSE2 psrad $31; pshufd $0xf5.
Then arithmetic >> by constant > 32, that can be done with {,v}psrad $31
and {,v}psrad $(cst-32) and two operand permutation,
arithmetic >> 32 can be done as {,v}psrad $31 and permutation of that
and the original operand. Arithmetic >> by constant < 32 can be done
as {,v}psrad $cst and {,v}psrlq $cst and two operand permutation.
And arithmetic >> by variable scalar amount can be done as
arithmetic >> 63, logical >> by the amount, << by (64 - amount) of the
>> 63 result (note that the vector << 64 results in 0) and oring together.
I had to improve the permutation generation so that it actually handles
the needed permutations (or handles them better).
2021-05-13 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/98856
* config/i386/i386.c (ix86_shift_rotate_cost): Add CODE argument.
Expect V2DI and V4DI arithmetic right shifts to be emulated.
(ix86_rtx_costs, ix86_add_stmt_cost): Adjust ix86_shift_rotate_cost
caller.
* config/i386/i386-expand.c (expand_vec_perm_2perm_interleave,
expand_vec_perm_2perm_pblendv): New functions.
(ix86_expand_vec_perm_const_1): Use them.
* config/i386/sse.md (ashr<mode>3<mask_name>): Rename to ...
(<mask_codefor>ashr<mode>3<mask_name>): ... this.
(ashr<mode>3): New define_expand with VI248_AVX512BW iterator.
(ashrv4di3): New define_expand.
(ashrv2di3): Change condition to TARGET_SSE2, handle !TARGET_XOP
and !TARGET_AVX512VL expansion.
* gcc.target/i386/sse2-psraq-1.c: New test.
* gcc.target/i386/sse4_2-psraq-1.c: New test.
* gcc.target/i386/avx-psraq-1.c: New test.
* gcc.target/i386/avx2-psraq-1.c: New test.
* gcc.target/i386/avx-pr82370.c: Adjust expected number of vpsrad
instructions.
* gcc.target/i386/avx2-pr82370.c: Likewise.
* gcc.target/i386/avx512f-pr82370.c: Likewise.
* gcc.target/i386/avx512bw-pr82370.c: Likewise.
* gcc.dg/torture/vshuf-4.inc: Add two further permutations.
* gcc.dg/torture/vshuf-8.inc: Likewise.
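For illustration, the two simplest of these emulations expressed as
intrinsics (a sketch of the idea for SSE4.2+, not the actual expander code;
the function names are made up):
#include <immintrin.h>
/* Arithmetic >> 63 per 64-bit lane without AVX512VL's vpsraq:
   0 > x is all-ones exactly for negative lanes, i.e. x >> 63.  */
__m128i sra63_epi64 (__m128i x)
{
  return _mm_cmpgt_epi64 (_mm_setzero_si128 (), x);  /* pxor; pcmpgtq */
}
/* Variable scalar amount n (0..63) in the low 64 bits of AMT: logical
   >> n, then OR in the sign bits via << (64 - n); a shift count of 64
   yields 0, so n == 0 degenerates to a no-op OR.  */
__m128i sra_epi64 (__m128i x, __m128i amt)
{
  __m128i sign = sra63_epi64 (x);
  __m128i lo = _mm_srl_epi64 (x, amt);
  __m128i hi = _mm_sll_epi64 (sign,
                              _mm_sub_epi64 (_mm_set1_epi64x (64), amt));
  return _mm_or_si128 (lo, hi);
}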
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (40 preceding siblings ...)
2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
@ 2021-07-28 7:05 ` rguenth at gcc dot gnu.org
2022-01-21 13:20 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-28 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.2 |11.3
--- Comment #41 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.2 is being released, retargeting bugs to GCC 11.3
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (41 preceding siblings ...)
2021-07-28 7:05 ` rguenth at gcc dot gnu.org
@ 2022-01-21 13:20 ` rguenth at gcc dot gnu.org
2022-04-21 7:48 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-21 13:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P2
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (42 preceding siblings ...)
2022-01-21 13:20 ` rguenth at gcc dot gnu.org
@ 2022-04-21 7:48 ` rguenth at gcc dot gnu.org
2023-04-17 21:43 ` [Bug tree-optimization/98856] [11/12/13/14 " lukebenes at hotmail dot com
` (5 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-21 7:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.3 |11.4
--- Comment #42 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (43 preceding siblings ...)
2022-04-21 7:48 ` rguenth at gcc dot gnu.org
@ 2023-04-17 21:43 ` lukebenes at hotmail dot com
2023-04-18 9:07 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: lukebenes at hotmail dot com @ 2023-04-17 21:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #43 from Luke <lukebenes at hotmail dot com> ---
@Richard Biener
Polite ping. Are you still working on this regression?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (44 preceding siblings ...)
2023-04-17 21:43 ` [Bug tree-optimization/98856] [11/12/13/14 " lukebenes at hotmail dot com
@ 2023-04-18 9:07 ` rguenth at gcc dot gnu.org
2023-05-29 10:04 ` jakub at gcc dot gnu.org
` (3 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-04-18 9:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
Status|ASSIGNED |NEW
--- Comment #44 from Richard Biener <rguenth at gcc dot gnu.org> ---
The original description is for znver1 which we stopped benchmarking.
https://lnt.opensuse.org/db_default/v4/CPP/graph?highlight_run=39959&plot.0=171.721.1
is for znver2, still showing the regression, and the following is for znver3,
which doesn't date back to the revision that regressed:
https://lnt.opensuse.org/db_default/v4/CPP/graph?highlight_run=39969&plot.721=283.721.1
So the issue is still there but I am no longer actively working on it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (45 preceding siblings ...)
2023-04-18 9:07 ` rguenth at gcc dot gnu.org
@ 2023-05-29 10:04 ` jakub at gcc dot gnu.org
2024-07-19 13:10 ` [Bug tree-optimization/98856] [12/13/14/15 " rguenth at gcc dot gnu.org
` (2 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-29 10:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.4 |11.5
--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (46 preceding siblings ...)
2023-05-29 10:04 ` jakub at gcc dot gnu.org
@ 2024-07-19 13:10 ` rguenth at gcc dot gnu.org
2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-07-19 13:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.5 |12.5
--- Comment #46 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11 branch is being closed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (47 preceding siblings ...)
2024-07-19 13:10 ` [Bug tree-optimization/98856] [12/13/14/15 " rguenth at gcc dot gnu.org
@ 2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-24 5:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |liuhongt at gcc dot gnu.org
--- Comment #47 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Created attachment 58746
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58746&action=edit
Allocate v2di with GPR
The attached patch can allocate V2DI in GPRs to avoid the spill.
poly_double_le2:
.LFB0:
.cfi_startproc
movq %rdi, %rdx
movq 8(%rsi), %rdi
movq (%rsi), %rsi
movq %rdi, %rax
movq %rsi, %rcx
vmovq %rsi, %xmm4
sarq $63, %rax
shrq $63, %rcx
vpinsrq $1, %rdi, %xmm4, %xmm3
andl $135, %eax
vpsllq $1, %xmm3, %xmm1
vmovq %rax, %xmm2
vpinsrq $1, %rcx, %xmm2, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdx)
ret
.cfi_endproc
But when there's a (subreg:V (reg:TI 0)) for other vector modes, the issue
could still be there.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (48 preceding siblings ...)
2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
@ 2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-24 5:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #48 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #47)
> Created attachment 58746 [details]
> Allocate v2di with GPR
>
> The attached patch can allocate V2DI in GPRs to avoid the spill.
>
@Uros Is it a good idea to make GPRs available for all 128-bit vectors, with
1) extend *movti_internal to all 128-bit vectors, extend the related splitter
to handle movement between GPR and SSE_REG, extend split_double_mode to handle
movement between GPR and GPR
2) adjust ix86_hard_regno_mode_ok to make GPRs available for all 128-bit
vectors
3) adjust inline_secondary_memory_needed, since we would now support
movement between GPR and SSE for 16-byte vectors.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-07-24 5:48 UTC | newest]