[Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
@ 2020-07-11 16:01 nok.raven at gmail dot com
  2020-07-13  8:05 ` [Bug target/96166] " rguenth at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: nok.raven at gmail dot com @ 2020-07-11 16:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

            Bug ID: 96166
           Summary: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL
                    into a mess
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nok.raven at gmail dot com
  Target Milestone: ---

inline void swap(int &x, int &y)
{
  int tmp = x;
  x = y;
  y = tmp;
}

void bar(int (&x)[2])
{
  int y[2];
  __builtin_memcpy(&y, &x, sizeof x);
  swap(y[0], y[1]);
  __builtin_memcpy(&x, &y, sizeof x);
}


GCC 9 (-Os/O2/O3) produces:
  rolq $32, (%rdi)

GCC 10/trunk (-O3/-ftree-slp-vectorize) produces:
  movq (%rdi), %rax
  movd (%rdi), %xmm1
  sarq $32, %rax
  movq %rax, %xmm0
  punpckldq %xmm1, %xmm0
  movq %xmm0, (%rdi)


https://godbolt.org/z/5h3bW8
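
Editorial sketch, not part of the original report: the single `rolq $32` is valid because swapping the two 32-bit halves of an 8-byte object is the same as rotating the 64-bit value by 32. The helper names `swap_halves`, `rotl64` and `swap_matches_rotate` are made up for illustration, and the code assumes a 32-bit `int`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(int) * 2 == sizeof(std::uint64_t),
              "sketch assumes 32-bit int");

// Hypothetical helper with the same shape as bar() in the report.
inline void swap_halves(int (&x)[2])
{
  int y[2];
  std::memcpy(&y, &x, sizeof x);
  int tmp = y[0];
  y[0] = y[1];
  y[1] = tmp;
  std::memcpy(&x, &y, sizeof x);
}

// Hypothetical helper: 64-bit rotate left, n must be in 1..63.
inline std::uint64_t rotl64(std::uint64_t v, unsigned n)
{
  assert(n > 0 && n < 64);
  return (v << n) | (v >> (64 - n));
}

// Returns true if swapping the halves of v matches a rotate by 32.
inline bool swap_matches_rotate(std::uint64_t v)
{
  int x[2];
  std::memcpy(&x, &v, sizeof v);
  swap_halves(x);
  std::uint64_t swapped;
  std::memcpy(&swapped, &x, sizeof swapped);
  return swapped == rotl64(v, 32);
}
```

The comparison does not depend on byte order, since both halves are moved as whole 4-byte units.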


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
@ 2020-07-13  8:05 ` rguenth at gcc dot gnu.org
  2020-07-23  6:51 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-07-13  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |10.2
             Blocks|                            |53947
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-07-13

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.

0x3c02500 _10 1 times vector_store costs 12 in body
0x3c02500 <unknown> 1 times vec_construct costs 8 in prologue
0x3a3e900 _10 1 times scalar_store costs 12 in body
0x3a3e900 _9 1 times scalar_store costs 12 in body
t.i:14:1: note:  Cost model analysis:
  Vector inside of basic block cost: 12
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar cost of basic block: 24
t.i:14:1: note:  Basic block will be vectorized using SLP

and we end up with

  <bb 2> [local count: 1073741824]:
  _3 = MEM <long unsigned int> [(char * {ref-all})x_2(D)];
  _9 = (int) _3;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  _11 = {_10, _9};
  _7 = VIEW_CONVERT_EXPR<long unsigned int>(_11);
  MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _7;

the IL we feed into the vectorizer and the earlier bswap pass is

  _3 = MEM <long unsigned int> [(char * {ref-all})x_2(D)];
  _9 = (int) _3;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  y = _10;
  MEM[(int &)&y + 4] = _9;
  _4 = MEM <long unsigned int> [(char * {ref-all})&y];
  MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;

I guess fixing the vectorizer to handle the "grouped load" would
eventually allow fixing this.  I don't think there's anything to
do from the costing side...
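
Editorial sketch, not from the thread: the numbers in the dump above can be tallied by hand. The struct and function below are hypothetical and only model the comparison "vector body + prologue + epilogue versus scalar cost"; the real vectorizer check has more inputs.

```cpp
#include <cassert>

// Costs as printed in the cost model dump (x86 vectorizer cost units).
struct SlpCosts {
  int vector_body;      // cost inside the basic block
  int vector_prologue;  // one-off setup such as vec_construct
  int vector_epilogue;
  int scalar_body;      // cost of the original scalar statements
};

// Hypothetical, simplified profitability test: vectorize when the
// summed vector cost is strictly lower than the scalar cost.
inline bool slp_looks_profitable(const SlpCosts &c)
{
  assert(c.vector_body >= 0 && c.scalar_body >= 0);
  int vector_total = c.vector_body + c.vector_prologue + c.vector_epilogue;
  return vector_total < c.scalar_body;
}
```

With the values from this dump, `SlpCosts{12, 8, 0, 24}`, the vector side sums to 12 + 8 + 0 = 20, which is below the scalar 24, so SLP is chosen even though the scalar form would later collapse into one rotate.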


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
  2020-07-13  8:05 ` [Bug target/96166] " rguenth at gcc dot gnu.org
@ 2020-07-23  6:51 ` rguenth at gcc dot gnu.org
  2020-10-12 12:47 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-07-23  6:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.2                        |10.3

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10.2 is released, adjusting target milestone.


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
  2020-07-13  8:05 ` [Bug target/96166] " rguenth at gcc dot gnu.org
  2020-07-23  6:51 ` rguenth at gcc dot gnu.org
@ 2020-10-12 12:47 ` rguenth at gcc dot gnu.org
  2021-02-11 15:00 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-10-12 12:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (2 preceding siblings ...)
  2020-10-12 12:47 ` rguenth at gcc dot gnu.org
@ 2021-02-11 15:00 ` jakub at gcc dot gnu.org
  2021-02-12 10:11 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-11 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Started with r10-1361-g9f962469cabc7fdc2ee830125a5cb4e61e1632e4


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (3 preceding siblings ...)
  2021-02-11 15:00 ` jakub at gcc dot gnu.org
@ 2021-02-12 10:11 ` jakub at gcc dot gnu.org
  2021-02-12 11:18 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-12 10:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note that the rotate isn't something created by the bswap pass; it isn't really
a byteswap, just a swap of the two halves of the long long.
It comes from expansion and combine.  Expanding
  _9 = (int) _3;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  MEM[(int &)&y] = _10;
  MEM[(int &)&y + 4] = _9;
  _4 = MEM <long unsigned int> [(char * {ref-all})&y];
  MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
results in
(insn 7 6 8 (parallel [
            (set (reg:DI 88)
                (ashiftrt:DI (reg:DI 82 [ _3 ])
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 8 7 9 (set (reg:DI 89)
        (zero_extend:DI (subreg:SI (reg:DI 88) 0))) "pr96166.c":4:5 -1
     (nil))

(insn 9 8 10 (set (reg:DI 91)
        (const_int -4294967296 [0xffffffff00000000])) "pr96166.c":4:5 -1
     (nil))

(insn 10 9 11 (parallel [
            (set (reg:DI 90)
                (and:DI (reg/v:DI 86 [ y ])
                    (reg:DI 91)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 11 10 12 (parallel [
            (set (reg:DI 92)
                (ior:DI (reg:DI 90)
                    (reg:DI 89)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 12 11 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 92)) "pr96166.c":4:5 -1
     (nil))

(insn 13 12 14 (set (reg:DI 93)
        (zero_extend:DI (subreg:SI (reg:DI 82 [ _3 ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 14 13 15 (parallel [
            (set (reg:DI 94)
                (ashift:DI (reg:DI 93)
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 15 14 16 (set (reg:DI 95)
        (zero_extend:DI (subreg:SI (reg/v:DI 86 [ y ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 16 15 17 (parallel [
            (set (reg:DI 96)
                (ior:DI (reg:DI 95)
                    (reg:DI 94)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 17 16 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 96)) "pr96166.c":5:5 -1
     (nil))

(insn 18 17 0 (set (mem:DI (reg/v/f:DI 87 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (reg/v:DI 86 [ y ])) "pr96166.c":13:19 -1
     (nil))

(I must say I'm surprised y hasn't been forced into stack even when it is
stored in parts) and then combine matches a rotate out of that.
While with SLP vectorization, we end up with:
   _9 = (int) _3;
   _10 = BIT_FIELD_REF <_3, 32, 32>;
-  MEM[(int &)&y] = _10;
-  MEM[(int &)&y + 4] = _9;
+  _11 = {_10, _9};
+  MEM <vector(2) int> [(int &)&y] = _11;
   _4 = MEM <long unsigned int> [(char * {ref-all})&y];
   MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
and aren't able to undo the vectorization during the RTL optimizations.
I'm surprised costs suggest such vectorization is beneficial, constructing a
vector just to store it into memory seems more expensive than just doing two
stores, isn't it?
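
Editorial sketch, not from the thread: to see why combine can match a rotate, the scalar expansion (insns 7 to 17 above) can be replayed one statement per insn. The function names are hypothetical; the register names in the comments follow the dump.

```cpp
#include <cassert>
#include <cstdint>

// r82 is the loaded 64-bit value (_3); y_in is whatever y held before,
// which the sequence overwrites completely.
inline std::uint64_t expand_sequence(std::uint64_t r82, std::uint64_t y_in)
{
  const std::uint64_t low_mask = 0xffffffffULL;
  std::uint64_t y = y_in;
  // insn 7 is an arithmetic shift, but only the low 32 bits of its
  // result are used (insn 8), so a logical shift gives the same value.
  std::uint64_t r88 = r82 >> 32;
  std::uint64_t r89 = r88 & low_mask;          // insn 8: zero_extend
  std::uint64_t r91 = 0xffffffff00000000ULL;   // insn 9
  std::uint64_t r90 = y & r91;                 // insn 10
  std::uint64_t r92 = r90 | r89;               // insn 11
  y = r92;                                     // insn 12: low half of y set
  std::uint64_t r93 = r82 & low_mask;          // insn 13: zero_extend
  std::uint64_t r94 = r93 << 32;               // insn 14
  std::uint64_t r95 = y & low_mask;            // insn 15: zero_extend
  std::uint64_t r96 = r95 | r94;               // insn 16
  y = r96;                                     // insn 17: high half of y set
  return y;                                    // value stored by insn 18
}

// Self-check: the result is a rotate by 32, independent of y_in.
inline void check_against_rotate(std::uint64_t v)
{
  std::uint64_t rot = (v << 32) | (v >> 32);
  assert(expand_sequence(v, 0) == rot);
  assert(expand_sequence(v, ~0ULL) == rot);
}
```

Because every bit of y is replaced by the two partial stores, the whole chain depends only on r82, which is what lets combine reduce it to a single rotate of the memory operand.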


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (4 preceding siblings ...)
  2021-02-12 10:11 ` jakub at gcc dot gnu.org
@ 2021-02-12 11:18 ` jakub at gcc dot gnu.org
  2021-02-12 12:17 ` pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-12 11:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So perhaps a peephole2 that matches
(insn 15 6 9 2 (set (reg:V2SI 21 xmm1 [91])
        (mem:V2SI (reg/v/f:DI 5 di [orig:86 x ] [86]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])) "pr96166.c":13:19 1288 {*movv2si_internal}
     (nil))
(insn 9 15 10 2 (set (reg:V2SI 20 xmm0 [88])
        (vec_select:V2SI (reg:V2SI 21 xmm1 [91])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "pr96166.c":13:19 1410 {*mmx_pshufd_1}
     (expr_list:REG_DEAD (reg:V2SI 21 xmm1 [91])
        (expr_list:REG_EQUIV (mem:V2SI (reg/v/f:DI 5 di [orig:86 x ] [86]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
            (nil))))
(insn 10 9 17 2 (set (mem:V2SI (reg/v/f:DI 5 di [orig:86 x ] [86]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (reg:V2SI 20 xmm0 [88])) "pr96166.c":13:19 1288 {*movv2si_internal}
     (expr_list:REG_DEAD (reg:V2SI 20 xmm0 [88])
        (expr_list:REG_DEAD (reg/v/f:DI 5 di [orig:86 x ] [86])
            (nil))))
back into the rotate of the MEM?
No other ideas on my side :(


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (5 preceding siblings ...)
  2021-02-12 11:18 ` jakub at gcc dot gnu.org
@ 2021-02-12 12:17 ` pinskia at gcc dot gnu.org
  2021-02-12 12:21 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-02-12 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Hmm,
Shouldn't that really just become a perm swapping the two halves?


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (6 preceding siblings ...)
  2021-02-12 12:17 ` pinskia at gcc dot gnu.org
@ 2021-02-12 12:21 ` jakub at gcc dot gnu.org
  2021-02-12 13:53 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-12 12:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
That is what happens on the trunk (the revision that introduced the regression
didn't do that yet).  But even that permutation is more expensive than the rotate,
        rolq    $32, (%rdi)
vs.
        movq    (%rdi), %xmm1
        pshufd  $225, %xmm1, %xmm0
        movq    %xmm0, (%rdi)
At least for code size...
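
Editorial sketch, not from the thread: the immediate 225 is 0xE1, or binary 11 10 00 01. pshufd uses two immediate bits per destination lane, so lane 0 takes source lane 1, lane 1 takes source lane 0, and lanes 2 and 3 stay put. The scalar model below is a hypothetical illustration of that decoding, not an intrinsic.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4SI = std::array<std::uint32_t, 4>;

// Scalar model of pshufd: bits [2*lane+1 : 2*lane] of the immediate
// select which source lane is copied into destination lane `lane`.
inline V4SI pshufd_model(const V4SI &src, unsigned imm8)
{
  assert(imm8 < 256);
  V4SI dst{};
  for (unsigned lane = 0; lane < 4; ++lane)
    dst[lane] = src[(imm8 >> (2 * lane)) & 3];
  return dst;
}
```

Since the surrounding movq instructions only load and store the low 64 bits, just the swap of lanes 0 and 1 is observable, which is the same effect as the rotate by 32.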


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (7 preceding siblings ...)
  2021-02-12 12:21 ` jakub at gcc dot gnu.org
@ 2021-02-12 13:53 ` jakub at gcc dot gnu.org
  2021-02-12 14:03 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-12 13:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The slightly better code (i.e. just one load + permutation + store) started
with
r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af.


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (8 preceding siblings ...)
  2021-02-12 13:53 ` jakub at gcc dot gnu.org
@ 2021-02-12 14:03 ` rguenth at gcc dot gnu.org
  2021-02-12 14:40 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-12 14:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #4)
> While with SLP vectorization, we end up with:
>    _9 = (int) _3;
>    _10 = BIT_FIELD_REF <_3, 32, 32>;
> -  MEM[(int &)&y] = _10;
> -  MEM[(int &)&y + 4] = _9;
> +  _11 = {_10, _9};
> +  MEM <vector(2) int> [(int &)&y] = _11;
>    _4 = MEM <long unsigned int> [(char * {ref-all})&y];
>    MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
> and aren't able to undo the vectorization during the RTL optimizations.
> I'm surprised costs suggest such vectorization is beneficial, constructing a
> vector just to store it into memory seems more expensive than just doing two
> stores, isn't it?

In general yes (esp. with the components in GPRs).  Of course x86
vectorizer costing assigns 12 + 12 to the scalar stores and
just 12 for the vector store and the CTOR isn't even close to 12.

We're doing

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;

so what we miss here is costing GPR -> xmm moves required where they
are not "free" (IIRC there are some AVX GPR->xmm insert instructions?).
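
Editorial sketch, not from the thread: the vec_construct formula quoted above can be modelled in a few lines. `CostTable` and `vec_construct_cost` are hypothetical names, the table has only the two fields the excerpt uses, and `ix86_vec_cost (mode, c)` is assumed to return plain `c`, which is a simplification.

```cpp
#include <cassert>

// Hypothetical, cut-down cost table; the real tables have many more fields.
struct CostTable {
  int sse_op;  // cost of a simple SSE operation
  int addss;   // cost used for combining narrower vectors
};

// Simplified model of the vec_construct case from the excerpt.
inline int vec_construct_cost(const CostTable &t, int nelts, int vector_bits)
{
  assert(nelts > 0);
  int cost = nelts * t.sse_op;   // N element inserts into SSE vectors
  if (vector_bits == 256)
    cost += t.addss;             // one combine step for AVX256
  else if (vector_bits == 512)
    cost += 3 * t.addss;         // three combine steps for AVX512
  return cost;
}
```

With an assumed `sse_op` of 4, a two-element vector costs 2 * 4 = 8, which is consistent with the "vec_construct costs 8" line in comment #1. Nothing in the formula accounts for moving the elements out of general-purpose registers first.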

Then we generally want larger (vector) stores because they are
more likely subject to STLF (store-to-load forwarding) than smaller
stores (the other way around for loads!).

So say for two scalar SFmode stores doing movlhps + movps mem is
clearly beneficial compared to two movss.

Now for the testcase the IL before vectorization is

  _3 = MEM <long unsigned int> [(char * {ref-all})x_2(D)];
  _9 = BIT_FIELD_REF <_3, 32, 0>;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  y = _10;
  MEM[(int &)&y + 4] = _9;

and the vectorizer simply "reloads" _3 to a vector mode, swaps it
and then vectorizes the store.  But it considers the BIT_FIELD_REFs
to come at a cost here.

t.c:4:5: note: Cost model analysis:
0x3ef0a60 _10 1 times scalar_store costs 12 in body
0x3ef0a60 _9 1 times scalar_store costs 12 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 32> 1 times scalar_stmt costs 4 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 0> 1 times scalar_stmt costs 4 in body
0x3ef0a60 <unknown> 1 times vec_perm costs 4 in body
0x3ef0a60 _10 1 times vector_store costs 12 in body
t.c:4:5: note: Cost model analysis for part in loop 0:
  Vector cost: 16
  Scalar cost: 32
t.c:4:5: note: Basic block will be vectorized using SLP

so the issue is really the vectorizer doesn't see the scalar code can
be implemented with a simple

  rolq    $32, (%rdi)

because that's not what the GIMPLE looks like (of course GIMPLE would
have a wide load, a bswap and a wide store - exactly the same as
the vector code has).

That's

(define_insn "*<insn><mode>3_1"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r")
        (any_rotate:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0,rm")
          (match_operand:QI 2 "nonmemory_operand" "c<S>,<S>")))

and combine even tries

Trying 6, 9 -> 10:
    6: r87:DI=[r86:DI]
    9: r88:V2SI=vec_select(r87:DI#0,parallel)
      REG_DEAD r87:DI
   10: [r86:DI]=r88:V2SI
      REG_DEAD r88:V2SI
      REG_DEAD r86:DI
Failed to match this instruction:
(set (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
    (vec_select:V2SI (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (parallel [
                (const_int 1 [0x1])
                (const_int 0 [0])
            ])))

but simplification fails to consider doing this with DImode.  So maybe
a combine helper pattern does the trick?


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (9 preceding siblings ...)
  2021-02-12 14:03 ` rguenth at gcc dot gnu.org
@ 2021-02-12 14:40 ` jakub at gcc dot gnu.org
  2021-02-13  9:33 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-12 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |jakub at gcc dot gnu.org

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 50172
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50172&action=edit
gcc11-pr96166.patch

Combine splitter seems to work nicely.


* [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (10 preceding siblings ...)
  2021-02-12 14:40 ` jakub at gcc dot gnu.org
@ 2021-02-13  9:33 ` cvs-commit at gcc dot gnu.org
  2021-02-13  9:34 ` [Bug target/96166] [10 " jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-02-13  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:0f3a743b688f4845e1798eed9b2e2284e891da11

commit r11-7233-g0f3a743b688f4845e1798eed9b2e2284e891da11
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Sat Feb 13 10:32:16 2021 +0100

    i386: Add combiner splitter to optimize V2SImode memory rotation [PR96166]

    Since the x86 backend enabled V2SImode vectorization (with
    TARGET_MMX_WITH_SSE), slp vectorization can kick in and emit
            movq    (%rdi), %xmm1
            pshufd  $225, %xmm1, %xmm0
            movq    %xmm0, (%rdi)
    instead of
            rolq    $32, (%rdi)
    we used to emit (or emit when slp vectorization is disabled).
    I think the rotate is both smaller and faster, so this patch adds
    a combiner splitter to optimize that back.

    2021-02-13  Jakub Jelinek  <jakub@redhat.com>

            PR target/96166
            * config/i386/mmx.md (*mmx_pshufd_1): Add a combine splitter for
            swap of V2SImode elements in memory into DImode memory rotate by
            32.

            * gcc.target/i386/pr96166.c: New test.


* [Bug target/96166] [10 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (11 preceding siblings ...)
  2021-02-13  9:33 ` cvs-commit at gcc dot gnu.org
@ 2021-02-13  9:34 ` jakub at gcc dot gnu.org
  2021-04-08 12:02 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-02-13  9:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[10/11 Regression]          |[10 Regression]
                   |-O3/-ftree-slp-vectorize    |-O3/-ftree-slp-vectorize
                   |turns ROL into a mess       |turns ROL into a mess

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Fixed on the trunk so far, most likely undesirable for backporting.


* [Bug target/96166] [10 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (12 preceding siblings ...)
  2021-02-13  9:34 ` [Bug target/96166] [10 " jakub at gcc dot gnu.org
@ 2021-04-08 12:02 ` rguenth at gcc dot gnu.org
  2022-06-28 10:41 ` jakub at gcc dot gnu.org
  2023-07-07  8:55 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-08 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.3                        |10.4

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10.3 is being released, retargeting bugs to GCC 10.4.


* [Bug target/96166] [10 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (13 preceding siblings ...)
  2021-04-08 12:02 ` rguenth at gcc dot gnu.org
@ 2022-06-28 10:41 ` jakub at gcc dot gnu.org
  2023-07-07  8:55 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.


* [Bug target/96166] [10 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
  2020-07-11 16:01 [Bug tree-optimization/96166] New: [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess nok.raven at gmail dot com
                   ` (14 preceding siblings ...)
  2022-06-28 10:41 ` jakub at gcc dot gnu.org
@ 2023-07-07  8:55 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07  8:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|10.5                        |11.0
      Known to fail|                            |10.5.0
             Status|ASSIGNED                    |RESOLVED
      Known to work|                            |11.0

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 11.

