public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
@ 2021-01-27 14:28 marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
` (49 more replies)
0 siblings, 50 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 14:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Bug ID: 98856
Summary: [11 Regression] botan AES-128/XTS is slower by ~17%
since
r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: marxin at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Since the revision the following is slower:
$ make clean && ./configure.py --cxxflags="-Ofast -march=znver2 -fno-checking" && make -j16 && ./botan speed AES-128/XTS
as seen here:
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=226.721.1&plot.1=14.721.1&
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
@ 2021-01-27 14:29 ` marxin at gcc dot gnu.org
2021-01-27 14:44 ` rguenth at gcc dot gnu.org
` (48 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 14:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Known to fail| |11.0
Last reconfirmed| |2021-01-27
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Known to work| |10.2.0
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
@ 2021-01-27 14:44 ` rguenth at gcc dot gnu.org
2021-01-28 7:47 ` rguenth at gcc dot gnu.org
` (47 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-27 14:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
Target Milestone|--- |11.0
Keywords| |missed-optimization
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
2021-01-27 14:29 ` [Bug tree-optimization/98856] " marxin at gcc dot gnu.org
2021-01-27 14:44 ` rguenth at gcc dot gnu.org
@ 2021-01-28 7:47 ` rguenth at gcc dot gnu.org
2021-01-28 8:44 ` marxin at gcc dot gnu.org
` (46 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 7:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The cxx bench Botan doesn't know --cxxflags; what Botan version are you looking at?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (2 preceding siblings ...)
2021-01-28 7:47 ` rguenth at gcc dot gnu.org
@ 2021-01-28 8:44 ` marxin at gcc dot gnu.org
2021-01-28 9:40 ` rguenth at gcc dot gnu.org
` (45 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-28 8:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> The cxx bench Botan doesn't know --cxxflags; what Botan version are you
> looking at?
I used this fixed version:
https://gitlab.suse.de/marxin/cpp-benchmarks/-/tree/master/botan
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (3 preceding siblings ...)
2021-01-28 8:44 ` marxin at gcc dot gnu.org
@ 2021-01-28 9:40 ` rguenth at gcc dot gnu.org
2021-01-28 11:03 ` rguenth at gcc dot gnu.org
` (44 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 9:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Slow:
Samples: 4K of event 'cycles:u', Event count (approx.): 4565667242
Overhead Samples Command Shared Object Symbol
30.88% 1252 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan
30.24% 1235 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan
26.04% 1055 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
Fast:
Samples: 4K of event 'cycles:u', Event count (approx.): 4427277434
Overhead Samples Command Shared Object Symbol
33.59% 1372 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Bo
33.16% 1356 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Bo
18.71% 765 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
It is also fast on trunk when not vectorizing, so the revision does what it was intended to do (more vectorization). I'll look into what we do to poly_double_n_le.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (4 preceding siblings ...)
2021-01-28 9:40 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:03 ` rguenth at gcc dot gnu.org
2021-01-28 11:19 ` rguenth at gcc dot gnu.org
` (43 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like STLF (store-to-load forwarding) issues. There's an ls_stlf counter; with SLP vectorization disabled I see
34.39% 1417 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
32.27% 1333 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
7.31% 306 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
while with SLP vectorization enabled there's
Samples: 4K of event 'ls_stlf:u', Event count (approx.): 723886942
Overhead Samples Command Shared Object Symbol
32.41% 1320 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
27.23% 1114 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
27.06% 1107 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
The register docs suggest that the unnamed cpu/event=0x24,umask=0x2/u event is supposed to count forwarding failures due to incomplete/misaligned data.
Unvectorized:
Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 1024347253
Overhead Samples Command Shared Object Symbol
33.56% 1382 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
30.32% 1246 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
23.18% 953 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
Vectorized:
Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 489384781
Overhead Samples Command Shared Object Symbol
30.17% 1229 botan libbotan-2.so.17 [.] Botan::poly_double_n_le
29.40% 1203 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
28.09% 1147 botan libbotan-2.so.17 [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
but the masking doesn't work as expected since I get hits for either bit on
 4.05 | vmovdqa %xmm4,0x10(%rsp)
      | const uint64_t carry = POLY * (W[LIMBS-1] >> 63);
12.24 | mov 0x18(%rsp),%rdx
      | W[0] = (W[0] << 1) ^ carry;
24.00 | vmovdqa 0x10(%rsp),%xmm5
which should only happen for bit 2 (data not ready). Of course this code-gen is weird since 0x10(%rsp) is available in %xmm4.
Well, changing the above doesn't make a difference. I guess the event hit is just quite delayed - that makes perf quite useless here.
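For reference, the stalling pattern those counters point at boils down to a wide vector store reloaded in narrower pieces; a minimal illustrative sketch (not the Botan source, names made up):

#include <emmintrin.h>
#include <stdint.h>

uint64_t stlf_pattern (__m128i v)
{
  uint64_t w[2];
  /* 16-byte vector store to the stack ...  */
  _mm_storeu_si128 ((__m128i *) w, v);
  /* ... immediately followed by 8-byte scalar reloads of its halves,
     the store-to-load-forwarding situation the events above measure.  */
  return (w[0] >> 63) + (w[1] >> 63);
}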
As a general optimization remark we fail to scalarize 'W' in poly_double_le
for the larger sizes, but the relevant differences likely appear for the
cases we expand the memcpy inline on GIMPLE, specifically
<bb 10> [local count: 1431655747]:
_60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
_61 = BIT_FIELD_REF <_60, 64, 64>;
_62 = _61 >> 63;
carry_63 = _62 * 135;
_308 = _61 << 1;
_228 = (long unsigned int) _60;
_310 = _228 >> 63;
_311 = _308 ^ _310;
_71 = _228 << 1;
_72 = carry_63 ^ _71;
MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;
this is turned into
<bb 10> [local count: 1431655747]:
_60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
_114 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_60);
vect__71.335_298 = _114 << 1;
_61 = BIT_FIELD_REF <_60, 64, 64>;
_62 = _61 >> 63;
carry_63 = _62 * 135;
_228 = (long unsigned int) _60;
_310 = _228 >> 63;
_147 = {carry_63, _310};
vect__72.336_173 = _147 ^ vect__71.335_298;
MEM <vector(2) long unsigned int> [(char * {ref-all})out_5(D)] = vect__72.336_173;
after the patch. The SLP analysis reports:
build/include/botan/mem_ops.h:148:15: note: Basic block will be vectorized using SLP
build/include/botan/mem_ops.h:148:15: note: Vectorizing SLP tree:
build/include/botan/mem_ops.h:148:15: note: node 0x275d8e8 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note: stmt 0 MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note: stmt 1 MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;
build/include/botan/mem_ops.h:148:15: note: children 0x275d960
build/include/botan/mem_ops.h:148:15: note: node 0x275d960 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note: stmt 0 _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _311 = _308 ^ _310;
build/include/botan/mem_ops.h:148:15: note: children 0x275d9d8 0x275da50
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275d9d8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { carry_63, _310 }
build/include/botan/mem_ops.h:148:15: note: node 0x275da50 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note: stmt 0 _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _308 = _61 << 1;
build/include/botan/mem_ops.h:148:15: note: children 0x275dac8 0x275dbb8
build/include/botan/mem_ops.h:148:15: note: node 0x275dac8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op: VEC_PERM_EXPR
build/include/botan/mem_ops.h:148:15: note: stmt 0 _228 = BIT_FIELD_REF <_60, 64, 0>;
build/include/botan/mem_ops.h:148:15: note: stmt 1 _61 = BIT_FIELD_REF <_60, 64, 64>;
build/include/botan/mem_ops.h:148:15: note: lane permutation { 0[0] 0[1] }
build/include/botan/mem_ops.h:148:15: note: children 0x275db40
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275db40 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { }
build/include/botan/mem_ops.h:148:15: note: node (constant) 0x275dbb8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: { 1, 1 }
with costs
build/include/botan/mem_ops.h:148:15: note: Cost model analysis:
Vector inside of basic block cost: 24
Vector prologue cost: 8
Vector epilogue cost: 8
Scalar cost of basic block: 52
the vectorization isn't too bad I think, it turns into
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 16(%rsp), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
instead of
.L56:
.cfi_restore_state
movq 8(%rsi), %rdx
movq (%rsi), %rdi
movq %rdx, %rcx
leaq (%rdi,%rdi), %rsi
addq %rdx, %rdx
shrq $63, %rdi
shrq $63, %rcx
xorq %rdi, %rdx
imulq $135, %rcx, %rcx
movq %rdx, 8(%rax)
xorq %rsi, %rcx
movq %rcx, (%rax)
jmp .L53
but we see the 128-bit move split when using GPRs, possibly avoiding the STLF issue. I don't understand why we spill to extract the high part though.
I'll try to create a small testcase for the above kernel.
With the vectorization disabled for just this kernel I get
AES-128/XTS 280780 key schedule/sec; 0.00 ms/op 12122 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 852.401 MiB/sec 4.14 cycles/byte (426.20 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 854.461 MiB/sec 4.13 cycles/byte (426.20 MiB in 498.80 ms)
compared to
AES-128/XTS 286409 key schedule/sec; 0.00 ms/op 11761 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.736 MiB/sec 4.62 cycles/byte (382.87 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 766.612 MiB/sec 4.61 cycles/byte (382.87 MiB in 499.43 ms)
so that seems to be it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (5 preceding siblings ...)
2021-01-28 11:03 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:19 ` rguenth at gcc dot gnu.org
2021-01-28 11:57 ` rguenth at gcc dot gnu.org
` (42 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
The following testcase reproduces the assembly:
typedef __UINT64_TYPE__ uint64_t;

/* GF(2^128) doubling as used for the XTS tweak: shift the 128-bit
   little-endian value left by one bit and fold the carry back in
   with the reduction constant 135 (0x87).  */
void poly_double_le2 (unsigned char *out, const unsigned char *in)
{
  uint64_t W[2];
  __builtin_memcpy (&W, in, 16);
  uint64_t carry = (W[1] >> 63) * 135;
  W[1] = (W[1] << 1) ^ (W[0] >> 63);
  W[0] = (W[0] << 1) ^ carry;
  __builtin_memcpy (out, &W[0], 8);
  __builtin_memcpy (out + 8, &W[1], 8);
}
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (6 preceding siblings ...)
2021-01-28 11:19 ` rguenth at gcc dot gnu.org
@ 2021-01-28 11:57 ` rguenth at gcc dot gnu.org
2021-02-05 10:18 ` rguenth at gcc dot gnu.org
` (41 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-28 11:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, and the spill is likely because we expand as
(insn 7 6 0 (set (reg:TI 84 [ _9 ])
(mem:TI (reg/v/f:DI 93 [ in ]) [0 MEM <__int128 unsigned> [(char * {ref-all})in_8(D)]+0 S16 A8])) -1
(nil))
(insn 8 7 9 (parallel [
(set (reg:DI 95)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":7:26 -1
(nil))
^^^ (subreg:DI (reg:TI 84 [ _9 ]) 8)
...
(insn 12 11 13 (set (reg:V2DI 98 [ vect__5.3 ])
(ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
(const_int 1 [0x1]))) "t.c":9:16 -1
(nil))
^^^ (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
LRA then does
Choosing alt 4 in insn 7: (0) v (1) vm {*movti_internal}
Creating newreg=103 from oldreg=84, assigning class ALL_SSE_REGS to r103
7: r103:TI=[r101:DI]
REG_DEAD r101:DI
Inserting insn reload after:
20: r84:TI=r103:TI
Choosing alt 0 in insn 8: (0) =rm (1) 0 (2) cJ {*lshrdi3_1}
Creating newreg=104 from oldreg=95, assigning class GENERAL_REGS to r104
Inserting insn reload before:
21: r104:DI=r84:TI#8
but somehow this means reload 20 is used for reload 21 instead of avoiding reload 20 and doing a movhlps / movq combo? (I guess there's no high-part xmm-to-GPR extract.)
As said the assembly is a bit weird:
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax
ok, well ...
vmovdqa -24(%rsp), %xmm3
???
shrq $63, %rax
imulq $135, %rax, %rax
vmovq %rax, %xmm0
movq -24(%rsp), %rax
??? movq %xmm2/3, %rax
vpsllq $1, %xmm3, %xmm1
shrq $63, %rax
vpinsrq $1, %rax, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
Note that even with -march=core-avx2 (and thus inter-unit moves not pessimized) we get
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax
vmovdqa -24(%rsp), %xmm3
shrq $63, %rax
vpsllq $1, %xmm3, %xmm1
imulq $135, %rax, %rax
vmovq %rax, %xmm0
movq -24(%rsp), %rax
shrq $63, %rax
vpinsrq $1, %rax, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
with
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 8(%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
we arrive at
AES-128/XTS 672043 key schedule/sec; 0.00 ms/op 4978.00 cycles/op (2 ops in 0.00 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 843.310 MiB/sec 4.18 cycles/byte (421.66 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 847.215 MiB/sec 4.16 cycles/byte (421.66 MiB in 497.70 ms)
A variant using movhlps isn't any faster than spilling, unfortunately :/
I guess re-materializing from a load is too much to ask of LRA.
On the vectorizer side the costing is 52 scalar vs. 40 vector (as usual the vectorized store alone leads to a big boost).
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (7 preceding siblings ...)
2021-01-28 11:57 ` rguenth at gcc dot gnu.org
@ 2021-02-05 10:18 ` rguenth at gcc dot gnu.org
2021-02-05 11:52 ` jakub at gcc dot gnu.org
` (40 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-05 10:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Exploring more options I noticed there's no arithmetic vector V2DI right shift, so vectorizing
uint64_t carry = (uint64_t)(((int64_t)W[1]) >> 63) & (uint64_t)135;
W[1] = (W[1] << 1) ^ ((uint64_t)(((int64_t)W[0]) >> 63) & (uint64_t)1);
W[0] = (W[0] << 1) ^ carry;
didn't work out. But V2DI >> CST with CST > 31 can be implemented with VPSRAD, then shuffling the shifted high parts into the low positions and sign-extending with PMOVSXDQ.
Maybe there's something more clever for the special case of >> 63 even.
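A minimal intrinsics sketch of that lane dance for one fixed CST (48 here; pmovsxdq needs SSE4.1, and the function name is made up):

#include <smmintrin.h>

/* Arithmetic V2DI >> 48 without a native 64-bit arithmetic shift:
   psrad shifts the high 32-bit halves, the shuffle moves them into
   the two low lanes, and pmovsxdq sign-extends back to 64 bits.  */
__m128i sar_v2di_48 (__m128i v)
{
  __m128i hi = _mm_srai_epi32 (v, 48 - 32);                    /* vpsrad $16 */
  __m128i lo = _mm_shuffle_epi32 (hi, _MM_SHUFFLE (3, 2, 3, 1));
  return _mm_cvtepi32_epi64 (lo);                              /* vpmovsxdq */
}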
As said, I'm just trying to see whether "optimal" vectorization of the kernel would solve the issue. But I guess pipelines are wide enough that the original scalar code effectively executes "vectorized".
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (8 preceding siblings ...)
2021-02-05 10:18 ` rguenth at gcc dot gnu.org
@ 2021-02-05 11:52 ` jakub at gcc dot gnu.org
2021-02-05 12:52 ` rguenth at gcc dot gnu.org
` (39 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 11:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For arithmetic >> (element_precision - 1) one can just use {,v}pxor + {,v}pcmpgtq: instead of return vec >> 63; do return vec < 0; (in a C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }.
For other arithmetic shifts by a scalar constant, perhaps one can replace return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0) << (64 - 17));
It will actually work even for non-constant scalar shift amounts because {,v}psllq treats shift counts > 63 as 0.
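In GNU C vector extensions the two tricks look roughly like this (a sketch; the typedef and function names are made up):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* >> 63: each lane becomes 0 or -1, i.e. the signed comparison
   vec < 0, which maps to pxor + pcmpgtq.  */
v2di sar63 (v2di v) { return v < 0; }

/* >> n for 0 < n < 64: logical shift right, then OR in the sign
   mask shifted left into the vacated high bits.  */
v2di sarn (v2di v, int n)
{
  v2di sign = v < 0;
  return (v2di) ((v2du) v >> n) | (sign << (64 - n));
}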
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (9 preceding siblings ...)
2021-02-05 11:52 ` jakub at gcc dot gnu.org
@ 2021-02-05 12:52 ` rguenth at gcc dot gnu.org
2021-02-05 13:43 ` jakub at gcc dot gnu.org
` (38 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-05 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, as in instead of return vec >> 63; do return vec < 0;
> (in C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }
> For other arithmetic shifts by scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.
OK, so that yields
poly_double_le2:
.LFB0:
.cfi_startproc
vmovdqu (%rsi), %xmm0
vpxor %xmm1, %xmm1, %xmm1
vpalignr $8, %xmm0, %xmm0, %xmm2
vpcmpgtq %xmm2, %xmm1, %xmm1
vpand .LC0(%rip), %xmm1, %xmm1
vpsllq $1, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
ret
when I feed the following to SLP2 directly:
void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
long unsigned int carry;
long unsigned int _1;
long unsigned int _2;
long unsigned int _3;
long unsigned int _4;
long unsigned int _5;
long unsigned int _6;
__int128 unsigned _9;
long unsigned int _14;
long unsigned int _15;
long int _18;
long int _19;
long unsigned int _20;
__BB(2,guessed_local(1073741824)):
_9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
_14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
_18 = (long int) _14;
_1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
carry_10 = _1 & 135ul;
_2 = _14 << 1;
_15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
_19 = (long int) _15;
_20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
_3 = _20 & 1ul;
_4 = _2 ^ _3;
_5 = _15 << 1;
_6 = _5 ^ carry_10;
__MEM <long unsigned int, 8> ((char *)out_11(D)) = _6;
__MEM <long unsigned int, 8> ((char *)out_11(D) + _Literal (char *) 8) = _4;
return;
}
with
<bb 2> [local count: 1073741824]:
_9 = MEM <__int128 unsigned> [(char *)in_8(D)];
_12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
_7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 },
108);
vect__5.0_13 = _12 << 1;
vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
return;
in .optimized
The latency of the data is at least 7 instructions that way, compared to 4 in the non-vectorized code (guess I could try Intel IACA on it).
So if that's indeed the best we can do then it's not profitable (btw, with the above the vectorizer's conclusion is "not profitable", but that is due to excessive costing of constants for the condition vectorization).
Simple asm replacement of the kernel results in
AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte (382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte (382.79 MiB in 499.03 ms)
compared to
AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte (384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte (384.22 MiB in 499.45 ms)
so that's indeed no improvement. Bigger block sizes also contain vector
code but that's not exercised by the botan speed measurement.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (10 preceding siblings ...)
2021-02-05 12:52 ` rguenth at gcc dot gnu.org
@ 2021-02-05 13:43 ` jakub at gcc dot gnu.org
2021-02-05 14:36 ` jakub at gcc dot gnu.org
` (37 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 13:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |uros at gcc dot gnu.org
--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For V2DImode arithmetic right shift, I think it would be something like:
--- gcc/config/i386/sse.md.jj 2021-01-27 11:50:09.168981297 +0100
+++ gcc/config/i386/sse.md 2021-02-05 14:32:44.175463716 +0100
@@ -20313,10 +20313,55 @@ (define_expand "ashrv2di3"
(ashiftrt:V2DI
(match_operand:V2DI 1 "register_operand")
(match_operand:DI 2 "nonmemory_operand")))]
- "TARGET_XOP || TARGET_AVX512VL"
+ "TARGET_SSE4_2"
{
if (!TARGET_AVX512VL)
{
+ if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ {
+ rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
+ emit_insn (gen_sse4_2_gtv2di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+ if (!TARGET_XOP)
+ {
+ rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V2DImode);
+ emit_insn (gen_sse4_2_gtv2di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V2DImode);
+ emit_insn (gen_lshrv2di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V2DImode);
+ rtx amount;
+ if (CONST_INT_P (operands[2]))
+ amount = GEN_INT (64 - INTVAL (operands[2]));
+ else if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ if (!CONST_INT_P (operands[2]))
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv2di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv2di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+
rtx reg = gen_reg_rtx (V2DImode);
rtx par;
bool negate = false;
plus adjusting the cost computation to hint that at least the non-63 arithmetic right V2DImode shifts are more expensive.
Even if in the end the V2DImode arithmetic right shifts turn out to be more expensive than scalar code (though that would surprise me, at least for the >> 63 case), I think V4DImode for TARGET_AVX2 should always be beneficial (haven't tried to adjust the expander for that yet).
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (11 preceding siblings ...)
2021-02-05 13:43 ` jakub at gcc dot gnu.org
@ 2021-02-05 14:36 ` jakub at gcc dot gnu.org
2021-02-05 16:29 ` jakub at gcc dot gnu.org
` (36 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 14:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
V4DImode arithmetic right shifts would be (untested):
--- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100
+++ gcc/config/i386/sse.md 2021-02-05 15:24:37.942026401 +0100
@@ -12458,7 +12458,7 @@
(set_attr "prefix" "orig,vex")
(set_attr "mode" "<sseinsnmode>")])
-(define_insn "ashr<mode>3<mask_name>"
+(define_insn "<mask_codefor>ashr<mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
(ashiftrt:VI248_AVX512BW_AVX512VL
(match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand"
"v,vm")
@@ -12472,6 +12472,67 @@
(const_string "0")))
(set_attr "mode" "<sseinsnmode>")])
+(define_expand "ashr<mode>3"
+ [(set (match_operand:VI248_AVX512BW 0 "register_operand")
+ (ashiftrt:VI248_AVX512BW
+ (match_operand:VI248_AVX512BW 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX512F")
+
+(define_expand "ashrv4di3"
+ [(set (match_operand:V4DI 0 "register_operand")
+ (ashiftrt:V4DI
+ (match_operand:V4DI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX2"
+{
+ if (!TARGET_AVX512VL)
+ {
+ if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ {
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ emit_insn (gen_avx2_gtv4di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V4DImode);
+ emit_insn (gen_avx2_gtv4di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V4DImode);
+ emit_insn (gen_lshrv4di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V4DImode);
+ rtx amount;
+ if (CONST_INT_P (operands[2]))
+ amount = GEN_INT (64 - INTVAL (operands[2]));
+ else if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ if (!CONST_INT_P (operands[2]))
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv4di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv4di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+})
+
(define_insn "<mask_codefor><insn><mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_2 0 "register_operand" "=v,v")
(any_lshift:VI248_AVX512BW_2
Trying 3 different routines, one returning >> 63 of a V4DImode vector, another
one >> 17 and another one >> var, the differences with -mavx2 are:
- vextracti128 $0x1, %ymm0, %xmm1
- vmovq %xmm0, %rax
- vpextrq $1, %xmm0, %rcx
- cqto
- vmovq %xmm1, %rax
- sarq $63, %rcx
- sarq $63, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rsi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rcx, %xmm3, %xmm0
- sarq $63, %rax
- vmovq %rsi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vmovdqa %ymm0, %ymm1
+ vpxor %xmm0, %xmm0, %xmm0
+ vpcmpgtq %ymm1, %ymm0, %ymm0
- vmovq %xmm0, %rax
- vextracti128 $0x1, %ymm0, %xmm1
- vpextrq $1, %xmm0, %rcx
- sarq $17, %rax
- sarq $17, %rcx
- movq %rax, %rdx
- vmovq %xmm1, %rax
- sarq $17, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rsi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rcx, %xmm3, %xmm0
- sarq $17, %rax
- vmovq %rsi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vpxor %xmm1, %xmm1, %xmm1
+ vpcmpgtq %ymm0, %ymm1, %ymm1
+ vpsrlq $17, %ymm0, %ymm0
+ vpsllq $47, %ymm1, %ymm1
+ vpor %ymm1, %ymm0, %ymm0
and
- movl %edi, %ecx
- vmovq %xmm0, %rax
- vextracti128 $0x1, %ymm0, %xmm1
- sarq %cl, %rax
- vpextrq $1, %xmm0, %rsi
- movq %rax, %rdx
- vmovq %xmm1, %rax
- sarq %cl, %rsi
- sarq %cl, %rax
- vmovq %rdx, %xmm3
- movq %rax, %rdi
- vpextrq $1, %xmm1, %rax
- vpinsrq $1, %rsi, %xmm3, %xmm0
- sarq %cl, %rax
+ vpxor %xmm1, %xmm1, %xmm1
+ movslq %edi, %rdi
+ movl $64, %eax
+ vpcmpgtq %ymm0, %ymm1, %ymm1
+ subq %rdi, %rax
vmovq %rdi, %xmm2
- vpinsrq $1, %rax, %xmm2, %xmm1
- vinserti128 $0x1, %xmm1, %ymm0, %ymm0
+ vmovq %rax, %xmm3
+ vpsrlq %xmm2, %ymm0, %ymm0
+ vpsllq %xmm3, %ymm1, %ymm1
+ vpor %ymm1, %ymm0, %ymm0
so at least size-wise it's much smaller.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (12 preceding siblings ...)
2021-02-05 14:36 ` jakub at gcc dot gnu.org
@ 2021-02-05 16:29 ` jakub at gcc dot gnu.org
2021-02-05 17:55 ` jakub at gcc dot gnu.org
` (35 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 16:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Looking at what other compilers emit for this: ICC seems to be completely broken, as it emits logical right shifts instead of arithmetic right shifts, and LLVM trunk emits for >> 63 what this patch emits. For >> 17 it emits
vpsrad $17, %xmm0, %xmm1
vpsrlq $17, %xmm0, %xmm0
vpblendd $10, %xmm1, %xmm0, %xmm0
instead of
vpxor %xmm1, %xmm1, %xmm1
vpcmpgtq %xmm0, %xmm1, %xmm1
vpsrlq $17, %xmm0, %xmm0
vpsllq $47, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
the patch emits. For >> 47 it emits:
vpsrad $31, %xmm0, %xmm1
vpsrad $15, %xmm0, %xmm0
vpshufd $245, %xmm0, %xmm0
vpblendd $10, %xmm1, %xmm0, %xmm0
etc.
So, in summary: for >> 63 with SSE4.2 I think what the patch does looks best; for >> 63 with only SSE2 we can emit psrad $31 instead and permute the odd elements into the even ones (i.e. __builtin_shuffle ((v4si) x >> 31, { 1, 1, 3, 3 })), as written out below.
For >> cst where cst < 32, do a psrad and a psrlq by that cst and permute such that we get the even SI elts from the psrlq result and the odd ones from the psrad result.
For >> 32, do a psrad $31 and permute to get the even SI elts from the odd elts of the source and the odd SI elts from the odd results of psrad $31.
For >> cst where cst > 32, do psrad $31 and psrad $(cst-32) and permute such that the even SI elts come from the odd elts of the latter and the odd elts come from the odd elts of the former.
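The SSE2 >> 63 variant in GNU C, for reference (a sketch; the function name is made up):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef int v4si __attribute__ ((vector_size (16)));

/* psrad $31 replicates each 64-bit sign bit across the odd 32-bit
   lanes; duplicating lane 1 into lane 0 and lane 3 into lane 2 then
   yields 0 or all-ones per 64-bit element.  */
v2di sar63_sse2 (v2di x)
{
  v4si t = (v4si) x >> 31;
  return (v2di) __builtin_shuffle (t, (v4si) { 1, 1, 3, 3 });
}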
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (13 preceding siblings ...)
2021-02-05 16:29 ` jakub at gcc dot gnu.org
@ 2021-02-05 17:55 ` jakub at gcc dot gnu.org
2021-02-05 19:48 ` jakub at gcc dot gnu.org
` (34 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 17:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Here's a WIP patch that implements that, except that we still need some permutation expansion improvements, both for the SSE2 V4SImode permutation cases and for the AVX2 V8SImode permutation cases.
--- gcc/config/i386/sse.md.jj 2021-02-05 14:32:44.175463716 +0100
+++ gcc/config/i386/sse.md 2021-02-05 18:49:29.621590903 +0100
@@ -12458,7 +12458,7 @@
(set_attr "prefix" "orig,vex")
(set_attr "mode" "<sseinsnmode>")])
-(define_insn "ashr<mode>3<mask_name>"
+(define_insn "<mask_codefor>ashr<mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
(ashiftrt:VI248_AVX512BW_AVX512VL
(match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand"
"v,vm")
@@ -12472,6 +12472,125 @@
(const_string "0")))
(set_attr "mode" "<sseinsnmode>")])
+(define_expand "ashr<mode>3"
+ [(set (match_operand:VI248_AVX512BW 0 "register_operand")
+ (ashiftrt:VI248_AVX512BW
+ (match_operand:VI248_AVX512BW 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX512F")
+
+(define_expand "ashrv4di3"
+ [(set (match_operand:V4DI 0 "register_operand")
+ (ashiftrt:V4DI
+ (match_operand:V4DI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+ "TARGET_AVX2"
+{
+ if (!TARGET_AVX512VL)
+ {
+ if (CONST_INT_P (operands[2]) && UINTVAL (operands[2]) >= 63)
+ {
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ emit_insn (gen_avx2_gtv4di3 (operands[0], zero, operands[1]));
+ DONE;
+ }
+ if (operands[2] == const0_rtx)
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+ if (CONST_INT_P (operands[2]))
+ {
+ vec_perm_builder sel (8, 8, 1);
+ sel.quick_grow (8);
+ rtx arg0, arg1;
+ rtx op1 = lowpart_subreg (V8SImode, operands[1], V4DImode);
+ rtx target = gen_reg_rtx (V8SImode);
+ if (INTVAL (operands[2]) > 32)
+ {
+ arg0 = gen_reg_rtx (V8SImode);
+ arg1 = gen_reg_rtx (V8SImode);
+ emit_insn (gen_ashrv8si3 (arg1, op1, GEN_INT (31)));
+ emit_insn (gen_ashrv8si3 (arg0, op1,
+ GEN_INT (INTVAL (operands[2]) - 32)));
+ sel[0] = 1;
+ sel[1] = 9;
+ sel[2] = 3;
+ sel[3] = 11;
+ sel[4] = 5;
+ sel[5] = 13;
+ sel[6] = 7;
+ sel[7] = 15;
+ }
+ else if (INTVAL (operands[2]) == 32)
+ {
+ arg0 = op1;
+ arg1 = gen_reg_rtx (V8SImode);
+ emit_insn (gen_ashrv8si3 (arg1, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 9;
+ sel[2] = 3;
+ sel[3] = 11;
+ sel[4] = 5;
+ sel[5] = 13;
+ sel[6] = 7;
+ sel[7] = 15;
+ }
+ else
+ {
+ arg0 = gen_reg_rtx (V2DImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_lshrv2di3 (arg0, operands[1], operands[2]));
+ emit_insn (gen_ashrv4si3 (arg1, op1, operands[2]));
+ arg0 = lowpart_subreg (V4SImode, arg0, V2DImode);
+ sel[0] = 0;
+ sel[1] = 9;
+ sel[2] = 2;
+ sel[3] = 11;
+ sel[4] = 4;
+ sel[5] = 13;
+ sel[6] = 6;
+ sel[7] = 15;
+ }
+ vec_perm_indices indices (sel, 2, 8);
+ bool ok = targetm.vectorize.vec_perm_const (V8SImode, target,
+ arg0, arg1, indices);
+ gcc_assert (ok);
+ emit_move_insn (operands[0],
+ lowpart_subreg (V4DImode, target, V8SImode));
+ DONE;
+ }
+
+ rtx zero = force_reg (V4DImode, CONST0_RTX (V4DImode));
+ rtx zero_or_all_ones = gen_reg_rtx (V4DImode);
+ emit_insn (gen_avx2_gtv4di3 (zero_or_all_ones, zero, operands[1]));
+ rtx lshr_res = gen_reg_rtx (V4DImode);
+ emit_insn (gen_lshrv4di3 (lshr_res, operands[1], operands[2]));
+ rtx ashl_res = gen_reg_rtx (V4DImode);
+ rtx amount;
+ if (TARGET_64BIT)
+ {
+ amount = gen_reg_rtx (DImode);
+ emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
+ operands[2]));
+ }
+ else
+ {
+ rtx temp = gen_reg_rtx (SImode);
+ emit_insn (gen_subsi3 (temp, force_reg (SImode, GEN_INT (64)),
+ lowpart_subreg (SImode, operands[2],
+ DImode)));
+ amount = gen_reg_rtx (V4SImode);
+ emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
+ temp));
+ }
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ emit_insn (gen_ashlv4di3 (ashl_res, zero_or_all_ones, amount));
+ emit_insn (gen_iorv4di3 (operands[0], lshr_res, ashl_res));
+ DONE;
+ }
+})
+
(define_insn "<mask_codefor><insn><mode>3<mask_name>"
[(set (match_operand:VI248_AVX512BW_2 0 "register_operand" "=v,v")
(any_lshift:VI248_AVX512BW_2
@@ -20313,11 +20432,13 @@
(ashiftrt:V2DI
(match_operand:V2DI 1 "register_operand")
(match_operand:DI 2 "nonmemory_operand")))]
- "TARGET_SSE4_2"
+ "TARGET_SSE2"
{
if (!TARGET_AVX512VL)
{
- if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 63)
+ if (TARGET_SSE4_2
+ && CONST_INT_P (operands[2])
+ && UINTVAL (operands[2]) >= 63)
{
rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
emit_insn (gen_sse4_2_gtv2di3 (operands[0], zero, operands[1]));
@@ -20328,6 +20449,65 @@
emit_move_insn (operands[0], operands[1]);
DONE;
}
+ if (CONST_INT_P (operands[2])
+ && (!TARGET_XOP || UINTVAL (operands[2]) >= 63))
+ {
+ vec_perm_builder sel (4, 4, 1);
+ sel.quick_grow (4);
+ rtx arg0, arg1;
+ rtx op1 = lowpart_subreg (V4SImode, operands[1], V2DImode);
+ rtx target = gen_reg_rtx (V4SImode);
+ if (UINTVAL (operands[2]) >= 63)
+ {
+ arg0 = arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg0, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 1;
+ sel[2] = 3;
+ sel[3] = 3;
+ }
+ else if (INTVAL (operands[2]) > 32)
+ {
+ arg0 = gen_reg_rtx (V4SImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg1, op1, GEN_INT (31)));
+ emit_insn (gen_ashrv4si3 (arg0, op1,
+ GEN_INT (INTVAL (operands[2]) - 32)));
+ sel[0] = 1;
+ sel[1] = 5;
+ sel[2] = 3;
+ sel[3] = 7;
+ }
+ else if (INTVAL (operands[2]) == 32)
+ {
+ arg0 = op1;
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_ashrv4si3 (arg1, op1, GEN_INT (31)));
+ sel[0] = 1;
+ sel[1] = 5;
+ sel[2] = 3;
+ sel[3] = 7;
+ }
+ else
+ {
+ arg0 = gen_reg_rtx (V2DImode);
+ arg1 = gen_reg_rtx (V4SImode);
+ emit_insn (gen_lshrv2di3 (arg0, operands[1], operands[2]));
+ emit_insn (gen_ashrv4si3 (arg1, op1, operands[2]));
+ arg0 = lowpart_subreg (V4SImode, arg0, V2DImode);
+ sel[0] = 0;
+ sel[1] = 5;
+ sel[2] = 2;
+ sel[3] = 7;
+ }
+ vec_perm_indices indices (sel, arg0 != arg1 ? 2 : 1, 4);
+ bool ok = targetm.vectorize.vec_perm_const (V4SImode, target,
+ arg0, arg1, indices);
+ gcc_assert (ok);
+ emit_move_insn (operands[0],
+ lowpart_subreg (V2DImode, target, V4SImode));
+ DONE;
+ }
if (!TARGET_XOP)
{
rtx zero = force_reg (V2DImode, CONST0_RTX (V2DImode));
@@ -20337,9 +20517,7 @@
emit_insn (gen_lshrv2di3 (lshr_res, operands[1], operands[2]));
rtx ashl_res = gen_reg_rtx (V2DImode);
rtx amount;
- if (CONST_INT_P (operands[2]))
- amount = GEN_INT (64 - INTVAL (operands[2]));
- else if (TARGET_64BIT)
+ if (TARGET_64BIT)
{
amount = gen_reg_rtx (DImode);
emit_insn (gen_subdi3 (amount, force_reg (DImode, GEN_INT (64)),
@@ -20355,8 +20533,7 @@
emit_insn (gen_vec_setv4si_0 (amount, CONST0_RTX (V4SImode),
temp));
}
- if (!CONST_INT_P (operands[2]))
- amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
+ amount = lowpart_subreg (DImode, amount, GET_MODE (amount));
emit_insn (gen_ashlv2di3 (ashl_res, zero_or_all_ones, amount));
emit_insn (gen_iorv2di3 (operands[0], lshr_res, ashl_res));
DONE;
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (14 preceding siblings ...)
2021-02-05 17:55 ` jakub at gcc dot gnu.org
@ 2021-02-05 19:48 ` jakub at gcc dot gnu.org
2021-02-08 15:14 ` jakub at gcc dot gnu.org
` (33 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-05 19:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The needed permutations for this boil down to
typedef int V __attribute__((vector_size (16)));
typedef int W __attribute__((vector_size (32)));
#ifdef __clang__
V f1 (V x) { return __builtin_shufflevector (x, x, 1, 1, 3, 3); }
V f2 (V x, V y) { return __builtin_shufflevector (x, y, 1, 5, 3, 7); }
V f3 (V x, V y) { return __builtin_shufflevector (x, y, 0, 5, 2, 7); }
#ifdef __AVX2__
W f4 (W x, W y) { return __builtin_shufflevector (x, y, 1, 9, 3, 11, 5, 13, 7, 15); }
W f5 (W x, W y) { return __builtin_shufflevector (x, y, 0, 9, 2, 11, 4, 13, 6, 15); }
W f6 (W x) { return __builtin_shufflevector (x, x, 1, 1, 3, 3, 5, 5, 7, 7); }
#endif
V f7 (V x) { return __builtin_shufflevector (x, x, 1, 3, 2, 3); }
V f8 (V x) { return __builtin_shufflevector (x, x, 0, 2, 2, 3); }
V f9 (V x, V y) { return __builtin_shufflevector (x, y, 0, 4, 1, 5); }
#else
V f1 (V x) { return __builtin_shuffle (x, (V) { 1, 1, 3, 3 }); }
V f2 (V x, V y) { return __builtin_shuffle (x, y, (V) { 1, 5, 3, 7 }); }
V f3 (V x, V y) { return __builtin_shuffle (x, y, (V) { 0, 5, 2, 7 }); }
#ifdef __AVX2__
W f4 (W x, W y) { return __builtin_shuffle (x, y, (W) { 1, 9, 3, 11, 5, 13, 7, 15 }); }
W f5 (W x, W y) { return __builtin_shuffle (x, y, (W) { 0, 9, 2, 11, 4, 13, 6, 15 }); }
W f6 (W x) { return __builtin_shuffle (x, (W) { 1, 1, 3, 3, 5, 5, 7, 7 }); }
#endif
V f7 (V x) { return __builtin_shuffle (x, (V) { 1, 3, 2, 3 }); }
V f8 (V x) { return __builtin_shuffle (x, (V) { 0, 2, 2, 3 }); }
V f9 (V x, V y) { return __builtin_shuffle (x, y, (V) { 0, 4, 1, 5 }); }
#endif
With -msse2, LLVM emits 2 x pshufd $237 + punpckldq for f2, and pshufd $237 + pshufd $232 + punpckldq; we give up or emit very large code.
With -msse4, we handle everything, and f1/f3 are the same/comparable, but for f2 we emit 2 x pshufb (with memory operands) + por while LLVM emits pshufd $245 + pblendw $204.
With -mavx2, the f2 inefficiency remains, and for f4 we emit 2 x vpshufb with memory operands + vpor while LLVM emits vpermilps $245 + vblendps $170.
f6-f9 are all permutations that we handle through a single insn, and those plus f3 are the roadblocks to building the f2 and f4 permutations more efficiently.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (15 preceding siblings ...)
2021-02-05 19:48 ` jakub at gcc dot gnu.org
@ 2021-02-08 15:14 ` jakub at gcc dot gnu.org
2021-03-04 12:14 ` rguenth at gcc dot gnu.org
` (32 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-08 15:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 50142
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50142&action=edit
gcc11-pr98856.patch
Full patch.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (16 preceding siblings ...)
2021-02-08 15:14 ` jakub at gcc dot gnu.org
@ 2021-03-04 12:14 ` rguenth at gcc dot gnu.org
2021-03-04 15:36 ` rguenth at gcc dot gnu.org
` (31 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 12:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |vmakarov at gcc dot gnu.org
Keywords| |ra
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
So coming back here. We're presenting RA with a quite hard problem given we
have
(insn 7 4 8 2 (set (reg:TI 84 [ _9 ])
(mem:TI (reg:DI 101) [0 MEM <__int128 unsigned> [(char * {ref-all})in_8(D)]+0 S16 A8])) 73 {*movti_internal}
(expr_list:REG_DEAD (reg:DI 101)
(nil)))
(insn 8 7 9 2 (parallel [
(set (reg:DI 95)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":7:26 703 {*lshrdi3_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
..
(insn 10 9 11 2 (parallel [
(set (reg:DI 97)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 0)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":8:30 703 {*lshrdi3_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
..
(insn 12 11 13 2 (set (reg:V2DI 98 [ vect__5.3 ])
(ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
(const_int 1 [0x1]))) "t.c":9:16 3611 {ashlv2di3}
(expr_list:REG_DEAD (reg:TI 84 [ _9 ])
(nil)))
where I wonder why we keep the (subreg:DI (reg:TI 84 ...) 8) around for so long. Probably the subreg pass gives up because of the V2DImode subreg of that reg.
That said, RA chooses xmm for reg:84 but then spills it immediately to fulfil the subregs, even though there are mov and pextrd that could be used, or the reload could use the original mem. That we reload even the xmm use is another odd thing.
Vlad, I'm not sure about the possibilities LRA has here, but maybe you can have a look at the testcase in comment #6 (use -O3 -march=znver2 or -march=core-avx2). For one, I expected
vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq -16(%rsp), %rax (2a)
vmovdqa -24(%rsp), %xmm4 (1)
...
movq -24(%rsp), %rdx (2b)
(1) to not be there (not sure how that even survives postreload optimizations...)
(2a/b) to be 'inherited' by instead loading from (%rsi) and 8(%rsi), which is maybe too much to ask because it requires aliasing considerations.
That is, even if we don't consider using
movq %xmm2, %rax (2a)
pextrd %xmm2, %rdx, 1 (2b)
I expected us to not spill.
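For reference, the spill-free extraction above as intrinsics (a sketch; the high-part extract needs SSE4.1's pextrq, function names made up):

#include <smmintrin.h>
#include <stdint.h>

uint64_t lo_qword (__m128i v) { return (uint64_t) _mm_cvtsi128_si64 (v); }    /* movq   */
uint64_t hi_qword (__m128i v) { return (uint64_t) _mm_extract_epi64 (v, 1); } /* pextrq */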
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (17 preceding siblings ...)
2021-03-04 12:14 ` rguenth at gcc dot gnu.org
@ 2021-03-04 15:36 ` rguenth at gcc dot gnu.org
2021-03-04 16:12 ` rguenth at gcc dot gnu.org
` (30 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 15:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's another thing: we end up with
vmovq %rax, %xmm3
vpinsrq $1, %rdx, %xmm3, %xmm0
but that has much worse latency than the alternative you'd get without SSE4.1:
vmovq %rax, %xmm3
vmovq %rdx, %xmm7
punpcklqdq %xmm7, %xmm3
For example, on Zen3 vmovq and vpinsrq have latencies of 3 while punpck has a latency of only one, so the second variant should have 2 cycles less latency.
Testcase:
typedef long v2di __attribute__((vector_size(16)));
v2di foo (long a, long b)
{
return (v2di){a, b};
}
Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not sure if we should somehow do this late (peephole or splitter) since it requires one more %xmm register.
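The two lowerings as intrinsics, for reference (a sketch; function names are made up):

#include <smmintrin.h>

/* What we currently emit: movq + pinsrq, serializing two slow
   GPR-to-xmm transfers.  */
__m128i concat_pinsr (long long a, long long b)
{
  return _mm_insert_epi64 (_mm_cvtsi64_si128 (a), b, 1);
}

/* The lower-latency variant: two independent movq's followed by a
   cheap punpcklqdq, at the cost of one more %xmm register.  */
__m128i concat_punpck (long long a, long long b)
{
  return _mm_unpacklo_epi64 (_mm_cvtsi64_si128 (a), _mm_cvtsi64_si128 (b));
}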
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (18 preceding siblings ...)
2021-03-04 15:36 ` rguenth at gcc dot gnu.org
@ 2021-03-04 16:12 ` rguenth at gcc dot gnu.org
2021-03-04 17:56 ` ubizjak at gmail dot com
` (29 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-04 16:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
So to recover performance we need both: avoiding the latency on the vector side plus avoiding the spilling. This variant is fast:
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq (%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vmovq %rdx, %xmm5
vpunpcklqdq %xmm5, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
compared to the original:
.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 16(%rsp), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (19 preceding siblings ...)
2021-03-04 16:12 ` rguenth at gcc dot gnu.org
@ 2021-03-04 17:56 ` ubizjak at gmail dot com
2021-03-04 18:12 ` ubizjak at gmail dot com
` (28 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-04 17:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #20 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #18)
> Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> sure whether we should do this late somehow (peephole or splitter) since
> it requires one more %xmm register.
What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (20 preceding siblings ...)
2021-03-04 17:56 ` ubizjak at gmail dot com
@ 2021-03-04 18:12 ` ubizjak at gmail dot com
2021-03-05 7:44 ` rguenth at gcc dot gnu.org
` (27 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-04 18:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #21 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #20)
> (In reply to Richard Biener from comment #18)
> > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> > sure whether we should do this late somehow (peephole or splitter) since
> > it requires one more %xmm register.
> What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
Please try this:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..edf7b1a3074 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -16043,7 +16043,12 @@
(const_string "maybe_evex")
]
(const_string "orig")))
- (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
+ (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
+ (set (attr "preferred_for_speed")
+ (cond [(eq_attr "alternative" "0,1,2,3")
+ (symbol_ref "false")
+ ]
+ (symbol_ref "true")))])
(define_insn "*vec_concatv2di_0"
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (21 preceding siblings ...)
2021-03-04 18:12 ` ubizjak at gmail dot com
@ 2021-03-05 7:44 ` rguenth at gcc dot gnu.org
2021-03-05 7:46 ` rguenth at gcc dot gnu.org
` (26 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 7:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3. Not
> > > sure whether we should do this late somehow (peephole or splitter) since
> > > it requires one more %xmm register.
> > What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
>
> Please try this:
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..edf7b1a3074 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -16043,7 +16043,12 @@
> (const_string "maybe_evex")
> ]
> (const_string "orig")))
> - (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
> + (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
> + (set (attr "preferred_for_speed")
> + (cond [(eq_attr "alternative" "0,1,2,3")
> + (symbol_ref "false")
> + ]
> + (symbol_ref "true")))])
>
> (define_insn "*vec_concatv2di_0"
That works to avoid the vpinsrq. I guess the case of a mem operand
behaves similarly to a GPR (plus the load uop); at least I don't have any
contrary evidence (but I didn't do any microbenchmarks either).
I'm not sure IRA/LRA will optimally handle the situation with register
pressure causing spilling in case it needs to reload both gpr operands.
At least for
typedef long v2di __attribute__((vector_size(16)));
v2di foo (long a, long b)
{
  return (v2di){a, b};
}
with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
-ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9 -ffixed-xmm10
-ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14 -ffixed-xmm15 I get
with the
patch
foo:
.LFB0:
.cfi_startproc
movq %rsi, -16(%rsp)
movq %rdi, %xmm0
pinsrq $1, -16(%rsp), %xmm0
ret
while without it's
movq %rdi, %xmm0
pinsrq $1, %rsi, %xmm0
As far as I understand the LRA dumps, the new attribute is a hard one,
applying even when other alternatives are worse. In this case we choose
alt 7. Covering also alts 7 and 8 with the optimize-for-speed attribute
causes reload failures - which is expected if there's no way for LRA to
choose alt 1. The following seems to work for the small testcase above
but not for the important case in the benchmark (meh).
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..e393a0d823b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15992,7 +15992,7 @@
(match_operand:DI 1 "register_operand"
" 0, 0,x ,Yv,0,Yv,0,0,v")
(match_operand:DI 2 "nonimmediate_operand"
- " rm,rm,rm,rm,x,Yv,x,m,m")))]
+ " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
"TARGET_SSE"
"@
pinsrq\t{$1, %2, %0|%0, %2, 1}
I guess the idea of this insn setup was exactly to get IRA/LRA to choose
the optimal instruction sequence - otherwise exposing the reload so
late is probably suboptimal.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (22 preceding siblings ...)
2021-03-05 7:44 ` rguenth at gcc dot gnu.org
@ 2021-03-05 7:46 ` rguenth at gcc dot gnu.org
2021-03-05 8:29 ` ubizjak at gmail dot com
` (25 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 7:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50300
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit
preprocessed source of the important Botan TU
This is the full preprocessed source of the TU. When compiled with -Ofast
-march=znver2 look for poly_double_n_le in the assembly, in the prologue the
function jumps based on kernel size - size 16 is the important one:
cmpq $16, %rdx
je .L54
...
.L54:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rcx
movq 16(%rsp), %rdx
vmovq %rcx, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
leaq -16(%rbp), %rsp
popq %r12
popq %r13
popq %rbp
.cfi_remember_state
.cfi_def_cfa 7, 8
ret
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (23 preceding siblings ...)
2021-03-05 7:46 ` rguenth at gcc dot gnu.org
@ 2021-03-05 8:29 ` ubizjak at gmail dot com
2021-03-05 10:04 ` rguenther at suse dot de
` (24 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 8:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #22)
> That works to avoid the vpinsrq. I guess the case of a mem operand
> behaves similarly to a GPR (plus the load uop); at least I don't have any
> contrary evidence (but I didn't do any microbenchmarks either).
>
> I'm not sure IRA/LRA will optimally handle the situation with register
> pressure causing spilling in case it needs to reload both gpr operands.
> At least for
>
> typedef long v2di __attribute__((vector_size(16)));
>
> v2di foo (long a, long b)
> {
>   return (v2di){a, b};
> }
>
> with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
> -ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9
> -ffixed-xmm10 -ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14
> -ffixed-xmm15 I get with the
> patch
>
> foo:
> .LFB0:
> .cfi_startproc
> movq %rsi, -16(%rsp)
> movq %rdi, %xmm0
> pinsrq $1, -16(%rsp), %xmm0
> ret
>
> while without it's
>
> movq %rdi, %xmm0
> pinsrq $1, %rsi, %xmm0
This is expected; my patch is based on the assumption that punpcklqdq is cheap
compared to pinsrq, and that inter-unit moves are cheap. This way, IRA will
reload the GP register to an XMM register and use the cheaper instruction.
> As far as I understand the LRA dumps, the new attribute is a hard one,
> applying even when other alternatives are worse. In this case we choose
> alt 7. Covering also alts 7 and 8 with the optimize-for-speed attribute
> causes reload failures - which is expected if there's no way for LRA to
> choose alt 1. The following seems to work for the small testcase above
> but not for the important case in the benchmark (meh).
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..e393a0d823b 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15992,7 +15992,7 @@
> (match_operand:DI 1 "register_operand"
> " 0, 0,x ,Yv,0,Yv,0,0,v")
> (match_operand:DI 2 "nonimmediate_operand"
> - " rm,rm,rm,rm,x,Yv,x,m,m")))]
> + " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
> "TARGET_SSE"
> "@
> pinsrq\t{$1, %2, %0|%0, %2, 1}
The above means that GP will still be used, since it fits without reloading.
> I guess the idea of this insn setup was exactly to get IRA/LRA choose
> the optimal instruction sequence - otherwise exposing the reload so
> late is probably suboptimal.
There is one more tool in the toolbox. A peephole2 pattern can be
conditionalized on an available XMM register. So, if an XMM reg is available,
the GPR->XMM move can be emitted in front of the insn. So, if there is XMM
register pressure, pinsrd will be used, but if an XMM register is available,
it will be reused to emit punpcklqdq.
The peephole2 pattern can also be conditionalized for targets where GPR->XMM
moves are fast.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (24 preceding siblings ...)
2021-03-05 8:29 ` ubizjak at gmail dot com
@ 2021-03-05 10:04 ` rguenther at suse dot de
2021-03-05 10:43 ` rguenth at gcc dot gnu.org
` (23 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenther at suse dot de @ 2021-03-05 10:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #25 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #22)
> > I guess the idea of this insn setup was exactly to get IRA/LRA to choose
> > the optimal instruction sequence - otherwise exposing the reload so
> > late is probably suboptimal.
>
> There is one more tool in the toolbox. A peephole2 pattern can be
> conditionalized on an available XMM register. So, if an XMM reg is available,
> the GPR->XMM move can be emitted in front of the insn. So, if there is XMM
> register pressure, pinsrd will be used, but if an XMM register is available,
> it will be reused to emit punpcklqdq.
>
> The peephole2 pattern can also be conditionalized for targets where GPR->XMM
> moves are fast.
Note the trick is esp. important when GPR->XMM moves are _slow_. But only
in the case we originally combine two GPR operands. Doing two
GPR->XMM moves and then one punpcklqdq hides half of the latency of the
slow moves since they have no data dependence on each other. So for the
peephole we should try to match this - a reloaded operand and a GPR
operand. When the %xmm operand results from a SSE computation there's
no point in splitting out a GPR->XMM move.
So in the end a peephole2 sounds like it could better match the condition
the transform is profitable on.
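As a rough model with the Zen 3 numbers from comment #18 (movq GPR->XMM: 3,
pinsrq from a GPR: 3, punpcklqdq: 1): movq feeding pinsrq is a serial chain
of 3 + 3 = 6 cycles, while the two independent movq overlap and feed the
merge, giving roughly max(3, 3) + 1 = 4 cycles.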
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (25 preceding siblings ...)
2021-03-05 10:04 ` rguenther at suse dot de
@ 2021-03-05 10:43 ` rguenth at gcc dot gnu.org
2021-03-05 11:56 ` ubizjak at gmail dot com
` (22 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 10:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> >
> > --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> > (In reply to Richard Biener from comment #22)
> > > I guess the idea of this insn setup was exactly to get IRA/LRA to choose
> > > the optimal instruction sequence - otherwise exposing the reload so
> > > late is probably suboptimal.
> >
> > There is one more tool in the toolbox. A peephole2 pattern can be
> > conditionalized on an available XMM register. So, if an XMM reg is
> > available, the GPR->XMM move can be emitted in front of the insn. So, if
> > there is XMM register pressure, pinsrd will be used, but if an XMM
> > register is available, it will be reused to emit punpcklqdq.
> >
> > The peephole2 pattern can also be conditionalized for targets where GPR->XMM
> > moves are fast.
>
> Note the trick is esp. important when GPR->XMM moves are _slow_. But only
> in the case we originally combine two GPR operands. Doing two
> GPR->XMM moves and then one punpcklqdq hides half of the latency of the
> slow moves since they have no data dependence on each other. So for the
> peephole we should try to match this - a reloaded operand and a GPR
> operand. When the %xmm operand results from a SSE computation there's
> no point in splitting out a GPR->XMM move.
>
> So in the end a peephole2 sounds like it could better match the condition
> the transform is profitable on.
I tried
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..8d0d3077cf8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1419,6 +1419,23 @@
DONE;
})
+(define_peephole2
+ [(set (match_operand:DI 0 "sse_reg_operand")
+ (match_operand:DI 1 "general_gr_operand"))
+ (match_scratch:DI 2 "sse_reg_operand")
+ (set (match_operand:V2DI 2 "sse_reg_operand")
+ (vec_concat:V2DI (match_dup:DI 0)
+ (match_operand:DI 3 "general_gr_operand")))]
+ "reload_completed"
+ [(set (match_dup 0)
+ (match_dup 1))
+ (set (match_dup 2)
+ (match_dup 3))
+ (set (match_dup 2)
+ (vec_concat:V2DI (match_dup 0)
+ (match_dup 2)))]
+ "")
+
;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
(define_peephole2
[(set (match_operand:V2DF 0 "sse_reg_operand")
but that doesn't seem to match for some unknown reason.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (26 preceding siblings ...)
2021-03-05 10:43 ` rguenth at gcc dot gnu.org
@ 2021-03-05 11:56 ` ubizjak at gmail dot com
2021-03-05 12:25 ` ubizjak at gmail dot com
` (21 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 11:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #27 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #26)
> but that doesn't seem to match for some unknown reason.
Try this:
(define_peephole2
[(match_scratch:DI 5 "Yv")
(set (match_operand:DI 0 "sse_reg_operand")
(match_operand:DI 1 "general_reg_operand"))
(set (match_operand:V2DI 2 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
(match_operand:DI 4 "nonimmediate_gr_operand")))]
""
[(set (match_dup 0)
(match_dup 1))
(set (match_dup 5)
(match_dup 4))
(set (match_dup 2)
(vec_concat:V2DI (match_dup 3)
(match_dup 5)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (27 preceding siblings ...)
2021-03-05 11:56 ` ubizjak at gmail dot com
@ 2021-03-05 12:25 ` ubizjak at gmail dot com
2021-03-05 12:27 ` rguenth at gcc dot gnu.org
` (20 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 12:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #28 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
> Try this:
The latency problem with the original testcase is solved with:
(define_peephole2
[(match_scratch:DI 3 "Yv")
(set (match_operand:V2DI 0 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
(match_operand:DI 2 "nonimmediate_gr_operand")))]
""
[(set (match_dup 3) (match_dup 2))
(set (match_dup 0)
(vec_concat:V2DI (match_dup 1) (match_dup 3)))])
but I don't know if this transformation applies universally to all x86 targets.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (28 preceding siblings ...)
2021-03-05 12:25 ` ubizjak at gmail dot com
@ 2021-03-05 12:27 ` rguenth at gcc dot gnu.org
2021-03-05 12:49 ` jakub at gcc dot gnu.org
` (19 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 12:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
>
> Try this:
>
> (define_peephole2
> [(match_scratch:DI 5 "Yv")
> (set (match_operand:DI 0 "sse_reg_operand")
> (match_operand:DI 1 "general_reg_operand"))
> (set (match_operand:V2DI 2 "sse_reg_operand")
> (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
> (match_operand:DI 4 "nonimmediate_gr_operand")))]
> ""
> [(set (match_dup 0)
> (match_dup 1))
> (set (match_dup 5)
> (match_dup 4))
> (set (match_dup 2)
> (vec_concat:V2DI (match_dup 3)
> (match_dup 5)))])
Ah, I messed up operands. The following works (the above position of
match_scratch happily chooses an operand matching operand 0):
;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
[(set (match_operand:DI 0 "sse_reg_operand")
(match_operand:DI 1 "general_reg_operand"))
(match_scratch:DI 2 "Yv")
(set (match_operand:V2DI 3 "sse_reg_operand")
(vec_concat:V2DI (match_dup 0)
(match_operand:DI 4 "nonimmediate_gr_operand")))]
"reload_completed && optimize_insn_for_speed_p ()"
[(set (match_dup 0)
(match_dup 1))
(set (match_dup 2)
(match_dup 4))
(set (match_dup 3)
(vec_concat:V2DI (match_dup 0)
(match_dup 2)))])
but for some reason it again doesn't work for the important loop. There
we have
389: xmm0:DI=cx:DI
REG_DEAD cx:DI
390: dx:DI=[sp:DI+0x10]
56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
REG_UNUSED flags:CC
57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)
I suppose the reason is that there are two unrelated insns between the
xmm0 = cx:DI and the vec_concat. That would hint that we need to not
match this GPR->XMM move in the peephole pattern but instead check for
it in the condition (can we use DF there?)
The simplified variant below works but IMHO matches cases we do not
want to transform. I can't find any example on how to achieve that
though.
;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
[(match_scratch:DI 3 "Yv")
(set (match_operand:V2DI 0 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
(match_operand:DI 2 "nonimmediate_gr_operand")))]
"reload_completed && optimize_insn_for_speed_p ()"
[(set (match_dup 3)
(match_dup 2))
(set (match_dup 0)
(vec_concat:V2DI (match_dup 1)
(match_dup 3)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (29 preceding siblings ...)
2021-03-05 12:27 ` rguenth at gcc dot gnu.org
@ 2021-03-05 12:49 ` jakub at gcc dot gnu.org
2021-03-05 12:52 ` ubizjak at gmail dot com
` (18 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-05 12:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #30 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #29)
> I suppose the reason is that there are two unrelated insns between the
> xmm0 = cx:DI and the vec_concat. That would hint that we need to not
> match this GPR->XMM move in the peephole pattern but instead check for
> it in the condition (can we use DF there?)
peephole2 patterns are run in a pass that does:
df_set_flags (DF_LR_RUN_DCE);
df_note_add_problem ();
df_analyze ();
so, DF that uses the note or default problems is ok, but e.g.
DF_UD_CHAIN/DF_DU_CHAIN is not available.
But the condition can e.g. walk some number of previous instructions (with
some reasonably small upper bound) etc.
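Such a bounded backwards walk could look roughly like the sketch below
(illustrative only - the helper name is made up and the predicates are
simplified, e.g. basic-block boundaries are ignored; the RTL accessors used
do exist in the GCC/i386 backend context):
/* Return true if one of the MAX_WALK preceding insns moved a GPR into
   an SSE register.  */
static bool
recent_gpr_to_xmm_move_p (rtx_insn *insn, int max_walk)
{
  for (rtx_insn *prev = prev_nonnote_nondebug_insn (insn);
       prev && max_walk-- > 0;
       prev = prev_nonnote_nondebug_insn (prev))
    {
      rtx set = single_set (prev);
      if (set
          && REG_P (SET_DEST (set))
          && SSE_REGNO_P (REGNO (SET_DEST (set)))
          && REG_P (SET_SRC (set))
          && GENERAL_REGNO_P (REGNO (SET_SRC (set))))
        return true;
    }
  return false;
}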
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (30 preceding siblings ...)
2021-03-05 12:49 ` jakub at gcc dot gnu.org
@ 2021-03-05 12:52 ` ubizjak at gmail dot com
2021-03-05 12:55 ` rguenther at suse dot de
` (17 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #31 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #29)
> The simplified variant below works but IMHO matches cases we do not
> want to transform. I can't find any example on how to achieve that
> though.
I think that pinsrd should be transformed to punpcklqdq irrespective of its
first input operand. The insn scheduler should move insns around to mask their
latencies.
> ;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
> ;; one already reloaded, to hide the latency of one GPR->XMM transition.
> (define_peephole2
> [(match_scratch:DI 3 "Yv")
> (set (match_operand:V2DI 0 "sse_reg_operand")
> (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> (match_operand:DI 2 "nonimmediate_gr_operand")))]
> "reload_completed && optimize_insn_for_speed_p ()"
Please use
"TARGET_64BIT && TARGET_SSE4_1
&& !optimize_insn_for_size_p ()"
here.
> [(set (match_dup 3)
> (match_dup 2))
> (set (match_dup 0)
> (vec_concat:V2DI (match_dup 1)
> (match_dup 3)))])
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (31 preceding siblings ...)
2021-03-05 12:52 ` ubizjak at gmail dot com
@ 2021-03-05 12:55 ` rguenther at suse dot de
2021-03-05 13:06 ` rguenth at gcc dot gnu.org
` (16 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenther at suse dot de @ 2021-03-05 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #32 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
>
> --- Comment #31 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #29)
> > The simplified variant below works but IMHO matches cases we do not
> > want to transform. I can't find any example on how to achieve that
> > though.
>
> I think that pinsrd should be transformed to punpcklqdq irrespective of its
> first input operand. The insn scheduler should move insns around to mask their
> latencies.
>
> > ;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
> > ;; one already reloaded, to hide the latency of one GPR->XMM transitions.
> > (define_peephole2
> > [(match_scratch:DI 3 "Yv")
> > (set (match_operand:V2DI 0 "sse_reg_operand")
> > (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> > (match_operand:DI 2 "nonimmediate_gr_operand")))]
> > "reload_completed && optimize_insn_for_speed_p ()"
>
> Please use
>
> "TARGET_64BIT && TARGET_SSE4_1
> && !optimize_insn_for_size_p ()"
>
> here.
What about reload_completed? We really only want to do this after RA.
Will test the patch then and add the reduced testcase.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (32 preceding siblings ...)
2021-03-05 12:55 ` rguenther at suse dot de
@ 2021-03-05 13:06 ` rguenth at gcc dot gnu.org
2021-03-05 13:08 ` ubizjak at gmail dot com
` (15 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 13:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50308
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit
patch
I am testing the following.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (33 preceding siblings ...)
2021-03-05 13:06 ` rguenth at gcc dot gnu.org
@ 2021-03-05 13:08 ` ubizjak at gmail dot com
2021-03-05 14:35 ` rguenth at gcc dot gnu.org
` (14 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: ubizjak at gmail dot com @ 2021-03-05 13:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #34 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to rguenther@suse.de from comment #32)
> what about reload_completed? We really only want to do this after RA.
No need for it; this is the peephole2 pass, which *always* runs after reload.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (34 preceding siblings ...)
2021-03-05 13:08 ` ubizjak at gmail dot com
@ 2021-03-05 14:35 ` rguenth at gcc dot gnu.org
2021-03-08 10:41 ` rguenth at gcc dot gnu.org
` (13 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05 14:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #35 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #33)
> Created attachment 50308 [details]
> patch
>
> I am testing the following.
It FAILs
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\
\\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\\\\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
I'll see how to update those next week.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (35 preceding siblings ...)
2021-03-05 14:35 ` rguenth at gcc dot gnu.org
@ 2021-03-08 10:41 ` rguenth at gcc dot gnu.org
2021-03-08 13:20 ` rguenth at gcc dot gnu.org
` (12 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 10:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #35)
> (In reply to Richard Biener from comment #33)
> > Created attachment 50308 [details]
> > patch
> >
> > I am testing the following.
>
> It FAILs
>
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\
> \\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
That's exactly the case we're looking for: a V2DI concat from two GPRs.
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\\\\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
This is, like below, a MEM case.
> FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
> vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
This one is because nonimmediate_gr_operand also matches a MEM, in this case
we apply the peephole to
(insn 12 11 13 2 (set (reg/v:V2DI 55 xmm19 [ c ])
(vec_concat:V2DI (reg:DI 54 xmm18 [91])
(mem:DI (reg/v/f:DI 4 si [orig:86 y ] [86]) [1 *y_8(D)+0 S8 A64])))
latency-wise memory isn't any better than a GPR so the decision to split
is reasonable.
> I'll see how to update those next week.
So I updated the above to check for vpunpcklqdq instead.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (36 preceding siblings ...)
2021-03-08 10:41 ` rguenth at gcc dot gnu.org
@ 2021-03-08 13:20 ` rguenth at gcc dot gnu.org
2021-03-08 15:46 ` amonakov at gcc dot gnu.org
` (11 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 13:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the
benchmark, only the spilling is.
Note that the other idea of disparaging vector CTORs more, like with
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2603333f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum
vect_cost_for_stmt type_of_cost,
case vec_construct:
{
- /* N element inserts into SSE vectors. */
+ /* N-element inserts into SSE vectors. */
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+ /* We cannot insert from GPRs directly but there's always a
+ GPR->XMM uop involved. Account for that.
+ ??? Note that loads are already costed separately so this
+ eventually double-counts them. */
+ if (!fp)
+ cost += (TYPE_VECTOR_SUBPARTS (vectype)
+ * ix86_cost->hard_register.integer_to_sse);
/* One vinserti128 for combining two SSE vectors for AVX256. */
if (GET_MODE_BITSIZE (mode) == 256)
cost += ix86_vec_cost (mode, ix86_cost->addss);
helps for generic and core-avx2 tuning:
t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 <unknown> 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 <unknown> 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
Vector cost: 48
Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.
but not for znver2:
t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 <unknown> 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 <unknown> 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
Vector cost: 52
Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP
Apparently for znver{1,2,3} we choose a slightly higher load/store cost.
We could also try mitigating vectorization by decomposing the __int128
load in forwprop where we have
else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
&& TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
&& gimple_assign_load_p (stmt)
&& !gimple_has_volatile_ops (stmt)
&& (TREE_CODE (gimple_assign_rhs1 (stmt))
!= TARGET_MEM_REF)
&& !stmt_can_throw_internal (cfun, stmt))
{
/* Rewrite loads used only in BIT_FIELD_REF extractions to
component-wise loads. */
this was tailored to decompose GCC vector extension loads that are not
supported on the HW early. Here we have
_9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
_14 = BIT_FIELD_REF <_9, 64, 64>;
_15 = BIT_FIELD_REF <_9, 64, 0>;
where the HW doesn't have any __int128 GPRs. If we do not vectorize then
the RTL pipeline will eventually split the load. If vectorization is
profitable then the vectorizer should be able to vectorize the resulting
split loads as well. In this case this would cause actual costing of the
load (the re-use of the __int128 to-be-in-SSE reg is instead free) and also
cost the live lane extract for the retained integer code. But that moves
the cost even more towards vectorizing since now a vector load (cost 12)
plus two live lane extracts (when fixed to cost sse_to_integer that's 2 * 6)
is used in place of two scalar loads (cost 2 * 12). On the code generation
side this improves things, avoiding the spilling but using vmovq/vpextrq,
which is not enough to fully recover, though it does help a bit (~5%):
vmovdqu (%rsi), %xmm1
vpextrq $1, %xmm1, %rax
shrq $63, %rax
imulq $135, %rax, %rax
vmovq %rax, %xmm0
vmovq %xmm1, %rax
vpsllq $1, %xmm1, %xmm1
shrq $63, %rax
vmovq %rax, %xmm2
vpunpcklqdq %xmm2, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
ret
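The presumed shape of the __int128 access in t.c that forwprop would
decompose (a reconstruction from the GIMPLE above, not the actual testcase):
#include <stdint.h>
#include <string.h>
void f (uint8_t *out, const uint8_t *in)
{
  unsigned __int128 x;                 /* _9 = MEM <__int128 unsigned> [...] */
  memcpy (&x, in, 16);
  uint64_t lo = (uint64_t) x;          /* BIT_FIELD_REF <_9, 64, 0>  */
  uint64_t hi = (uint64_t) (x >> 64);  /* BIT_FIELD_REF <_9, 64, 64> */
  uint64_t carry = hi >> 63;
  uint64_t r0 = (lo << 1) ^ (carry * 135);
  uint64_t r1 = (hi << 1) ^ (lo >> 63);
  memcpy (out, &r0, 8);
  memcpy (out + 8, &r1, 8);
}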
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (37 preceding siblings ...)
2021-03-08 13:20 ` rguenth at gcc dot gnu.org
@ 2021-03-08 15:46 ` amonakov at gcc dot gnu.org
2021-04-27 11:40 ` [Bug tree-optimization/98856] [11/12 " jakub at gcc dot gnu.org
` (10 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-03-08 15:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #38 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Late to the party, but the latency analysis of vpinsrq starting from comment
#18 is incorrect: its latency differs with respect to its operands.
For example, on Zen 2 the latency with respect to the GPR operand is long (6
cycles, one more than the gpr->xmm move latency), while the latency with
respect to the XMM operand is just one cycle, same as punpcklqdq. See
uops.info, which also shows that
vpinsrq involves 2 uops, and it's easy to guess what they are: first uop is for
gpr->xmm inter-unit move (latency 5), and the second is SSE merge:
https://uops.info/html-instr/VPINSRQ_XMM_XMM_R64_I8.html
https://uops.info/html-instr/VMOVD_XMM_R32.html
So in the CPU backend there's not much difference between
movq
pinsrq
and
movq
movq
punpcklqdq
both have the same uops and overall latency (1 + movq latency).
(Though on Intel, starting from Haswell, pinsrq oddly has latency 2 w.r.t.
the xmm operand, but on Ice Lake it is again 1 cycle.)
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (38 preceding siblings ...)
2021-03-08 15:46 ` amonakov at gcc dot gnu.org
@ 2021-04-27 11:40 ` jakub at gcc dot gnu.org
2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
` (9 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-04-27 11:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.0 |11.2
--- Comment #39 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (39 preceding siblings ...)
2021-04-27 11:40 ` [Bug tree-optimization/98856] [11/12 " jakub at gcc dot gnu.org
@ 2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
2021-07-28 7:05 ` rguenth at gcc dot gnu.org
` (8 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-05-13 10:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #40 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:
https://gcc.gnu.org/g:829c4bea06600ea4201462f91ce6d76ca21fdb35
commit r12-769-g829c4bea06600ea4201462f91ce6d76ca21fdb35
Author: Jakub Jelinek <jakub@redhat.com>
Date: Thu May 13 12:14:14 2021 +0200
ix86: Support V{2, 4}DImode arithmetic right shifts for SSE2+ [PR98856]
As mentioned in the PR, we don't support arithmetic right V2DImode or
V4DImode on x86 without -mavx512vl or -mxop. The ISAs indeed don't have
{,v}psraq instructions until AVX512VL, but we actually can emulate it quite
easily.
One case is arithmetic >> 63, we can just emit {,v}pxor; {,v}pcmpgt for
that for SSE4.2+, or for SSE2 psrad $31; pshufd $0xf5.
Then arithmetic >> by constant > 32, that can be done with {,v}psrad $31
and {,v}psrad $(cst-32) and two operand permutation,
arithmetic >> 32 can be done as {,v}psrad $31 and permutation of that
and the original operand. Arithmetic >> by constant < 32 can be done
as {,v}psrad $cst and {,v}psrlq $cst and two operand permutation.
And arithmetic >> by variable scalar amount can be done as
arithmetic >> 63, logical >> by the amount, << by (64 - amount) of the
>> 63 result (note that the vector << 64 results in 0) and oring together.
I had to improve the permutation generation so that it actually handles
the needed permutations (or handles them better).
2021-05-13 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/98856
* config/i386/i386.c (ix86_shift_rotate_cost): Add CODE argument.
Expect V2DI and V4DI arithmetic right shifts to be emulated.
(ix86_rtx_costs, ix86_add_stmt_cost): Adjust ix86_shift_rotate_cost
caller.
* config/i386/i386-expand.c (expand_vec_perm_2perm_interleave,
expand_vec_perm_2perm_pblendv): New functions.
(ix86_expand_vec_perm_const_1): Use them.
* config/i386/sse.md (ashr<mode>3<mask_name>): Rename to ...
(<mask_codefor>ashr<mode>3<mask_name>): ... this.
(ashr<mode>3): New define_expand with VI248_AVX512BW iterator.
(ashrv4di3): New define_expand.
(ashrv2di3): Change condition to TARGET_SSE2, handle !TARGET_XOP
and !TARGET_AVX512VL expansion.
* gcc.target/i386/sse2-psraq-1.c: New test.
* gcc.target/i386/sse4_2-psraq-1.c: New test.
* gcc.target/i386/avx-psraq-1.c: New test.
* gcc.target/i386/avx2-psraq-1.c: New test.
* gcc.target/i386/avx-pr82370.c: Adjust expected number of vpsrad
instructions.
* gcc.target/i386/avx2-pr82370.c: Likewise.
* gcc.target/i386/avx512f-pr82370.c: Likewise.
* gcc.target/i386/avx512bw-pr82370.c: Likewise.
* gcc.dg/torture/vshuf-4.inc: Add two further permutations.
* gcc.dg/torture/vshuf-8.inc: Likewise.
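For illustration, the two simplest of these emulations expressed as
intrinsics (a sketch of the idea for SSE4.2+, not the actual expander code;
the function names are made up):
#include <immintrin.h>
/* Arithmetic >> 63 per 64-bit lane without AVX512VL's vpsraq:
   0 > x is all-ones exactly for negative lanes, i.e. x >> 63.  */
__m128i sra63_epi64 (__m128i x)
{
  return _mm_cmpgt_epi64 (_mm_setzero_si128 (), x);  /* pxor; pcmpgtq */
}
/* Variable scalar amount n (0..63) in the low 64 bits of AMT: logical
   >> n, then OR in the sign bits via << (64 - n); a shift count of 64
   yields 0, so n == 0 degenerates to a no-op OR.  */
__m128i sra_epi64 (__m128i x, __m128i amt)
{
  __m128i sign = sra63_epi64 (x);
  __m128i lo = _mm_srl_epi64 (x, amt);
  __m128i hi = _mm_sll_epi64 (sign,
                              _mm_sub_epi64 (_mm_set1_epi64x (64), amt));
  return _mm_or_si128 (lo, hi);
}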
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (40 preceding siblings ...)
2021-05-13 10:17 ` cvs-commit at gcc dot gnu.org
@ 2021-07-28 7:05 ` rguenth at gcc dot gnu.org
2022-01-21 13:20 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-28 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.2 |11.3
--- Comment #41 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.2 is being released, retargeting bugs to GCC 11.3
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (41 preceding siblings ...)
2021-07-28 7:05 ` rguenth at gcc dot gnu.org
@ 2022-01-21 13:20 ` rguenth at gcc dot gnu.org
2022-04-21 7:48 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-21 13:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P2
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (42 preceding siblings ...)
2022-01-21 13:20 ` rguenth at gcc dot gnu.org
@ 2022-04-21 7:48 ` rguenth at gcc dot gnu.org
2023-04-17 21:43 ` [Bug tree-optimization/98856] [11/12/13/14 " lukebenes at hotmail dot com
` (5 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-21 7:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.3 |11.4
--- Comment #42 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (43 preceding siblings ...)
2022-04-21 7:48 ` rguenth at gcc dot gnu.org
@ 2023-04-17 21:43 ` lukebenes at hotmail dot com
2023-04-18 9:07 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: lukebenes at hotmail dot com @ 2023-04-17 21:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #43 from Luke <lukebenes at hotmail dot com> ---
@Richard Biener
Polite ping. Are you still working on this regression?
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (44 preceding siblings ...)
2023-04-17 21:43 ` [Bug tree-optimization/98856] [11/12/13/14 " lukebenes at hotmail dot com
@ 2023-04-18 9:07 ` rguenth at gcc dot gnu.org
2023-05-29 10:04 ` jakub at gcc dot gnu.org
` (3 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-04-18 9:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
Status|ASSIGNED |NEW
--- Comment #44 from Richard Biener <rguenth at gcc dot gnu.org> ---
The original description is for znver1 which we stopped benchmarking.
https://lnt.opensuse.org/db_default/v4/CPP/graph?highlight_run=39959&plot.0=171.721.1
is for znver2, still showing the regression, and the following is for znver3,
which doesn't date back to the revision that regressed:
https://lnt.opensuse.org/db_default/v4/CPP/graph?highlight_run=39969&plot.721=283.721.1
So the issue is still there but I am no longer actively working on it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [11/12/13/14 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (45 preceding siblings ...)
2023-04-18 9:07 ` rguenth at gcc dot gnu.org
@ 2023-05-29 10:04 ` jakub at gcc dot gnu.org
2024-07-19 13:10 ` [Bug tree-optimization/98856] [12/13/14/15 " rguenth at gcc dot gnu.org
` (2 subsequent siblings)
49 siblings, 0 replies; 51+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-29 10:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.4 |11.5
--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (46 preceding siblings ...)
2023-05-29 10:04 ` jakub at gcc dot gnu.org
@ 2024-07-19 13:10 ` rguenth at gcc dot gnu.org
2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-07-19 13:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|11.5 |12.5
--- Comment #46 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11 branch is being closed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (47 preceding siblings ...)
2024-07-19 13:10 ` [Bug tree-optimization/98856] [12/13/14/15 " rguenth at gcc dot gnu.org
@ 2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-24 5:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |liuhongt at gcc dot gnu.org
--- Comment #47 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Created attachment 58746
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58746&action=edit
Allocate v2di with GPR
The attached patch can allocate V2DI in GPRs to avoid the spill.
poly_double_le2:
.LFB0:
.cfi_startproc
movq %rdi, %rdx
movq 8(%rsi), %rdi
movq (%rsi), %rsi
movq %rdi, %rax
movq %rsi, %rcx
vmovq %rsi, %xmm4
sarq $63, %rax
shrq $63, %rcx
vpinsrq $1, %rdi, %xmm4, %xmm3
andl $135, %eax
vpsllq $1, %xmm3, %xmm1
vmovq %rax, %xmm2
vpinsrq $1, %rcx, %xmm2, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdx)
ret
.cfi_endproc
But when there's a (subreg:V (reg:TI 0)) for other vector modes, the issue
could still be there.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
2021-01-27 14:28 [Bug tree-optimization/98856] New: [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af marxin at gcc dot gnu.org
` (48 preceding siblings ...)
2024-07-24 5:32 ` liuhongt at gcc dot gnu.org
@ 2024-07-24 5:48 ` liuhongt at gcc dot gnu.org
49 siblings, 0 replies; 51+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-07-24 5:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #48 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #47)
> Created attachment 58746 [details]
> Allocate v2di with GPR
>
> The attached patch can allocate V2DI in GPRs to avoid the spill.
>
@Uros Is it a good idea to make GPRs available for all 128-bit vectors, with
1) extend *movti_internal to all 128-bit vectors, extend the related splitter
to handle movement between GPR and SSE_REG, extend split_double_mode to handle
movement between GPR and GPR
2) adjust ix86_hard_regno_mode_ok to make GPRs available for all 128-bit
vectors
3) adjust inline_secondary_memory_needed, since we would now support
movement between GPR and SSE for 16-byte vectors.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-07-24 5:48 UTC | newest]