From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/96166] [10/11 Regression] -O3/-ftree-slp-vectorize turns ROL into a mess
Date: Fri, 12 Feb 2021 14:03:01 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #9 from Richard Biener ---
(In reply to Jakub Jelinek from comment #4)
> While with SLP vectorization, we end up with:
>    _9 = (int) _3;
>    _10 = BIT_FIELD_REF <_3, 32, 32>;
> -  MEM[(int &)&y] = _10;
> -  MEM[(int &)&y + 4] = _9;
> +  _11 = {_10, _9};
> +  MEM <vector(2) int> [(int &)&y] = _11;
>    _4 = MEM <long unsigned int> [(char * {ref-all})&y];
>    MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
> and aren't able to undo the vectorization during the RTL optimizations.
> I'm surprised costs suggest such vectorization is beneficial, constructing
> a vector just to store it into memory seems more expensive than just doing
> two stores, isn't it?

In general yes (esp. with the components in GPRs).  Of course the x86
vectorizer costing assigns 12 + 12 to the scalar stores and just 12 to the
vector store, and the CTOR isn't even close to 12.  We're doing

      case vec_construct:
	{
	  /* N element inserts into SSE vectors.  */
	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
	  if (GET_MODE_BITSIZE (mode) == 256)
	    cost += ix86_vec_cost (mode, ix86_cost->addss);
	  /* One vinserti64x4 and two vinserti128 for combining SSE
	     and AVX256 vectors to AVX512.  */
	  else if (GET_MODE_BITSIZE (mode) == 512)
	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
	  return cost;

so what we miss here is costing the GPR -> xmm moves required where they
are not "free" (IIRC there are some AVX gpr->xmm insert instructions?).
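To illustrate where that could be accounted for (a sketch only, not actual
code - "integer_to_sse" is a made-up cost field here, standing in for
whatever knob ends up modeling the inter-unit move):

      case vec_construct:
	{
	  /* N element inserts into SSE vectors.  */
	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
	  /* Sketch: elements of integral type usually live in GPRs, so
	     each insert additionally needs a gpr->xmm transfer
	     (vmovd/vpinsrd and friends) unless that is free on the
	     target.  */
	  if (INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
	    cost += (TYPE_VECTOR_SUBPARTS (vectype)
		     * ix86_cost->integer_to_sse);
	  ...

That would at least make the CTOR show up in the cost comparison below.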
Then we generally want larger (vector) stores because they are more likely
subject to STLF than smaller stores (the other way around for loads!).  So
say for two scalar SFmode stores doing movlhps + movps mem is clearly
beneficial to two movss.

Now for the testcase the IL before vectorization is

  _3 = MEM <long unsigned int> [(char * {ref-all})x_2(D)];
  _9 = BIT_FIELD_REF <_3, 32, 0>;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  y = _10;
  MEM[(int &)&y + 4] = _9;

and the vectorizer simply "reloads" _3 in a vector mode, swaps it and then
vectorizes the store.  But it considers the BIT_FIELD_REFs to come at a
cost here:

t.c:4:5: note: Cost model analysis:
0x3ef0a60 _10 1 times scalar_store costs 12 in body
0x3ef0a60 _9 1 times scalar_store costs 12 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 32> 1 times scalar_stmt costs 4 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 0> 1 times scalar_stmt costs 4 in body
0x3ef0a60 <unknown> 1 times vec_perm costs 4 in body
0x3ef0a60 _10 1 times vector_store costs 12 in body
t.c:4:5: note: Cost model analysis for part in loop 0:
  Vector cost: 16
  Scalar cost: 32
t.c:4:5: note: Basic block will be vectorized using SLP

so the issue is really that the vectorizer doesn't see the scalar code can
be implemented with a simple

  rolq $32, (%rdi)

because that's not how the GIMPLE looks (of course GIMPLE would have a
wide load, a bswap and a wide store - exactly the same as the vector code
has).  That's

(define_insn "*<rotate_insn><mode>3_1"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r")
	(any_rotate:SWI48
	  (match_operand:SWI48 1 "nonimmediate_operand" "0,rm")
	  (match_operand:QI 2 "nonmemory_operand" "c<S>,<S>")))

and combine even tries

Trying 6, 9 -> 10:
    6: r87:DI=[r86:DI]
    9: r88:V2SI=vec_select(r87:DI#0,parallel)
      REG_DEAD r87:DI
   10: [r86:DI]=r88:V2SI
      REG_DEAD r88:V2SI
      REG_DEAD r86:DI
Failed to match this instruction:
(set (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
    (vec_select:V2SI (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (parallel [
                (const_int 1 [0x1])
                (const_int 0 [0])
            ])))

but simplification fails to consider doing this with DImode.  So maybe a
combine helper pattern does the trick?
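Something like the following, maybe (a completely untested sketch: the
insn name is made up, the match_dup relies on combine presenting literally
identical MEMs as in the dump above, and the target condition would need
refinement):

;; Combine helper: a store of the element-swapped vector re-loaded from
;; the same address is a DImode rotate by 32.  Split to the rotate,
;; which needs the usual flags clobber.
(define_insn_and_split "*v2si_swap_halves_mem"
  [(set (match_operand:V2SI 0 "memory_operand" "+m")
	(vec_select:V2SI (match_dup 0)
			 (parallel [(const_int 1) (const_int 0)])))]
  "TARGET_64BIT"
  "#"
  "&& 1"
  [(parallel
     [(set (match_dup 1) (rotate:DI (match_dup 1) (const_int 32)))
      (clobber (reg:CC FLAGS_REG))])]
{
  /* Re-interpret the V2SI memory as DImode for the rotate.  */
  operands[1] = adjust_address (operands[0], DImode, 0);
})

The point being that the same-address requirement isn't expressible in the
plain rotate pattern across modes, so an explicit helper that matches the
vec_select form and rewrites it looks necessary if combine is to catch it.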