From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Mon, 08 Mar 2021 13:20:00 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the
benchmark, only the spilling is.

Note that the other idea, disparaging vector CTORs more, as with the
following patch:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2603333f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case vec_construct:
        {
-         /* N element inserts into SSE vectors.  */
+         /* N-element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         /* We cannot insert from GPRs directly but there's always a
+            GPR->XMM uop involved.  Account for that.
+            ??? Note that loads are already costed separately so this
+            eventually double-counts them.  */
+         if (!fp)
+           cost += (TYPE_VECTOR_SUBPARTS (vectype)
+                    * ix86_cost->hard_register.integer_to_sse);
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
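To make the new integer_to_sse term concrete, here is a minimal
illustration (not from the bug report; the function name is made up):
a two-element integer vector CTOR whose elements live in GPRs.  With
-O2 on x86-64 this typically compiles to a vmovq for the first element
plus a vpinsrq for the second, i.e. one GPR->XMM uop per element on
top of the SSE insert itself.

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Build a V2DImode vector from two GPR values; this is the kind of
   vec_construct the hunk above charges extra for when !fp.  */
v2di
ctor_from_gprs (unsigned long long a, unsigned long long b)
{
  return (v2di) { a, b };
}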
The patch helps for generic and core-avx2 tuning:

t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 <unknown> 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 <unknown> 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.

but not for znver2:

t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 <unknown> 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 <unknown> 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 52
  Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP

Apparently for znver{1,2,3} we choose a slightly higher load/store cost.

We could also try mitigating the vectorization by decomposing the
__int128 load in forwprop, where we have

      else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
               && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
               && gimple_assign_load_p (stmt)
               && !gimple_has_volatile_ops (stmt)
               && (TREE_CODE (gimple_assign_rhs1 (stmt)) != TARGET_MEM_REF)
               && !stmt_can_throw_internal (cfun, stmt))
        {
          /* Rewrite loads used only in BIT_FIELD_REF extractions to
             component-wise loads.  */

This was tailored to decompose, early, GCC vector extension loads that
are not supported by the HW.  Here we have

  _9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
  _14 = BIT_FIELD_REF <_9, 64, 64>;
  _15 = BIT_FIELD_REF <_9, 64, 0>;

where the HW doesn't have any __int128 GPRs.  If we do not vectorize,
the RTL pipeline will eventually split the load; if vectorization is
profitable, the vectorizer should be able to vectorize the resulting
split loads just as well.  In this case the decomposition would cause
actual costing of the loads (re-using the __int128 to-be-in-SSE reg is
free, in contrast) and would also cost the live lane extracts for the
retained integer code.
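A hand-written sketch (not compiler output) of what that
component-wise rewrite would produce here: the __int128 load is gone
and the former BIT_FIELD_REF uses become two plain DImode loads.

  _14 = MEM <long long unsigned int> [(char * {ref-all})in_8(D) + 8B];
  _15 = MEM <long long unsigned int> [(char * {ref-all})in_8(D)];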
But that moves the cost even more towards vectorizing, since now a
vector load (cost 12) plus two live lane extracts (when fixed to cost
sse_to_integer, that's 2 * 6) are used in place of two scalar loads
(cost 2 * 12).

On the code generation side this improves things, avoiding the
spilling in favor of vmovq/vpextrq.  That is not good enough to
recover the regression, but it does help a bit (~5%):

        vmovdqu (%rsi), %xmm1
        vpextrq $1, %xmm1, %rax
        shrq    $63, %rax
        imulq   $135, %rax, %rax
        vmovq   %rax, %xmm0
        vmovq   %xmm1, %rax
        vpsllq  $1, %xmm1, %xmm1
        shrq    $63, %rax
        vmovq   %rax, %xmm2
        vpunpcklqdq %xmm2, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret
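For reference, the reduced t.c is not quoted in this comment; judging
from the GIMPLE and the asm above it should be something along these
lines (a guessed reconstruction; the function and variable names are
made up): the XTS tweak update, i.e. a doubling in GF(2^128) with
reduction by the low polynomial 0x87 (hence the imulq $135).

#include <stdint.h>
#include <string.h>

void
poly_double (unsigned char *out, const unsigned char *in)
{
  unsigned __int128 x;
  memcpy (&x, in, 16);                 /* MEM <__int128 unsigned>    */
  uint64_t lo = (uint64_t) x;          /* BIT_FIELD_REF <_9, 64, 0>  */
  uint64_t hi = (uint64_t) (x >> 64);  /* BIT_FIELD_REF <_9, 64, 64> */
  uint64_t carry = (hi >> 63) * 135;   /* 0x87 = x^7 + x^2 + x + 1   */
  uint64_t r0 = (lo << 1) ^ carry;
  uint64_t r1 = (hi << 1) ^ (lo >> 63);
  memcpy (out, &r0, 8);
  memcpy (out + 8, &r1, 8);
}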