From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 7879D3857809; Fri, 12 Feb 2021 10:11:12 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7879D3857809
From: "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns
 ROL into a mess
Date: Fri, 12 Feb 2021 10:11:12 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 10.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jakub at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 10.3
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-96166-4-yWU8y0iOjS@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-96166-4@http.gcc.gnu.org/bugzilla/>
References: <bug-96166-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 12 Feb 2021 10:11:12 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166
--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note that the rotate isn't something created by the bswap pass; it isn't really
a byteswap, just a swap of the two halves of the long long.
It comes from expansion and combine.  Expanding
  _9 = (int) _3;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  MEM[(int &)&y] = _10;
  MEM[(int &)&y + 4] = _9;
  _4 = MEM <long unsigned int> [(char * {ref-all})&y];
  MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
results in
(insn 7 6 8 (parallel [
            (set (reg:DI 88)
                (ashiftrt:DI (reg:DI 82 [ _3 ])
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 8 7 9 (set (reg:DI 89)
        (zero_extend:DI (subreg:SI (reg:DI 88) 0))) "pr96166.c":4:5 -1
     (nil))

(insn 9 8 10 (set (reg:DI 91)
        (const_int -4294967296 [0xffffffff00000000])) "pr96166.c":4:5 -1
     (nil))

(insn 10 9 11 (parallel [
            (set (reg:DI 90)
                (and:DI (reg/v:DI 86 [ y ])
                    (reg:DI 91)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 11 10 12 (parallel [
            (set (reg:DI 92)
                (ior:DI (reg:DI 90)
                    (reg:DI 89)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 12 11 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 92)) "pr96166.c":4:5 -1
     (nil))

(insn 13 12 14 (set (reg:DI 93)
        (zero_extend:DI (subreg:SI (reg:DI 82 [ _3 ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 14 13 15 (parallel [
            (set (reg:DI 94)
                (ashift:DI (reg:DI 93)
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 15 14 16 (set (reg:DI 95)
        (zero_extend:DI (subreg:SI (reg/v:DI 86 [ y ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 16 15 17 (parallel [
            (set (reg:DI 96)
                (ior:DI (reg:DI 95)
                    (reg:DI 94)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 17 16 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 96)) "pr96166.c":5:5 -1
     (nil))

(insn 18 17 0 (set (mem:DI (reg/v/f:DI 87 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (reg/v:DI 86 [ y ])) "pr96166.c":13:19 -1
     (nil))

(I must say I'm surprised y hasn't been forced onto the stack even though it is
stored in parts), and then combine matches a rotate out of that.
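
To make it easier to see why combine can recognize a rotate here, the RTL
above can be modelled in plain C.  This is only a sketch: the function names
are made up for illustration, the arithmetic shift of insn 7 is written as an
unsigned shift (only the low 32 bits of its result are used, so the outcome is
the same), and the initial contents of y are passed in as a parameter to show
that they do not affect the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a 64-bit value by 32 bits, i.e. swap its two halves. */
static uint64_t rotl32(uint64_t v)
{
    return (v << 32) | (v >> 32);
}

/* Hypothetical C model of insns 7-17; v plays the role of _3 (reg 82),
   y the role of the pseudo holding y (reg 86). */
static uint64_t expanded_sequence(uint64_t v, uint64_t y)
{
    /* insns 7-8: shift right by 32, then zero-extend the low 32 bits. */
    uint64_t hi = (uint32_t)(v >> 32);

    /* insns 9-12: keep the upper half of y, insert hi into the lower half. */
    y = (y & UINT64_C(0xffffffff00000000)) | hi;

    /* insns 13-14: zero-extend the low half of v and shift it up by 32. */
    uint64_t lo_shifted = (uint64_t)(uint32_t)v << 32;

    /* insns 15-17: keep the lower half of y, insert lo_shifted above it. */
    y = (uint64_t)(uint32_t)y | lo_shifted;

    return y;
}

/* Check the model against the rotate for a few sample values. */
static void check_expanded_sequence(void)
{
    const uint64_t samples[] = {
        0, 1, UINT64_C(0x0123456789abcdef), UINT64_C(0xffffffff00000000),
        UINT64_MAX
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(expanded_sequence(samples[i], ~samples[i]) == rotl32(samples[i]));
}
```

Calling check_expanded_sequence() passes for these samples, which is consistent
with combine being able to collapse the whole sequence into a single 32-bit
rotate.
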
With SLP vectorization, however, we end up with:
   _9 = (int) _3;
   _10 = BIT_FIELD_REF <_3, 32, 32>;
-  MEM[(int &)&y] = _10;
-  MEM[(int &)&y + 4] = _9;
+  _11 = {_10, _9};
+  MEM <vector(2) int> [(int &)&y] = _11;
   _4 = MEM <long unsigned int> [(char * {ref-all})&y];
   MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
and we aren't able to undo the vectorization during the RTL optimizations.
I'm surprised the costs suggest such vectorization is beneficial; constructing a
vector just to store it into memory seems more expensive than just doing two
stores, doesn't it?
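
For reference, the two forms of the store can be written out at the source
level roughly as follows.  This is a hedged sketch rather than the testcase
from the PR: the function names are invented, the vector form relies on the
GCC/Clang vector_size extension, and unsigned 32-bit elements are used instead
of the int seen in the dumps to keep the conversions well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two-element vector of 32-bit lanes (GCC/Clang vector extension). */
typedef uint32_t v2si __attribute__((vector_size(8)));

/* Scalar form: two separate 32-bit stores into y, then one 64-bit load. */
static uint64_t halves_two_stores(uint64_t v)
{
    uint32_t y[2];
    y[0] = (uint32_t)(v >> 32);   /* MEM[(int &)&y]     = _10 */
    y[1] = (uint32_t)v;           /* MEM[(int &)&y + 4] = _9  */

    uint64_t out;
    memcpy(&out, y, sizeof out);  /* _4 = MEM <long unsigned int> [&y] */
    return out;
}

/* SLP-like form: build a two-element vector, store it, then load 64 bits. */
static uint64_t halves_vector_store(uint64_t v)
{
    v2si tmp = { (uint32_t)(v >> 32), (uint32_t)v };  /* _11 = {_10, _9} */

    uint32_t y[2];
    memcpy(y, &tmp, sizeof y);    /* MEM <vector(2) int> [&y] = _11 */

    uint64_t out;
    memcpy(&out, y, sizeof out);
    return out;
}

/* Both forms must produce the same 64-bit value. */
static void check_store_forms(void)
{
    const uint64_t samples[] = { 0, 1, UINT64_C(0x0123456789abcdef), UINT64_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(halves_two_stores(samples[i]) == halves_vector_store(samples[i]));
}
```

Both functions compute the same value, so the only difference is how the two
halves reach memory, which is the cost question raised above.
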