From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Mon, 08 Mar 2021 13:20:00 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the
benchmark, only the spilling is.

Note that the other idea, disparaging vector CTORs more, as with the
following patch:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2603333f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case vec_construct:
        {
-         /* N element inserts into SSE vectors.  */
+         /* N-element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         /* We cannot insert from GPRs directly but there's always a
+            GPR->XMM uop involved.  Account for that.
+            ??? Note that loads are already costed separately so this
+            eventually double-counts them.  */
+         if (!fp)
+           cost += (TYPE_VECTOR_SUBPARTS (vectype)
+                    * ix86_cost->hard_register.integer_to_sse);
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
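To make the new integer_to_sse term concrete, here is a minimal
illustration (not from the bug report; the function name is made up):
a two-element integer vector CTOR whose elements live in GPRs.  With
-O2 on x86-64 this typically compiles to a vmovq for the first element
plus a vpinsrq for the second, i.e. one GPR->XMM uop per element on
top of the SSE insert itself.

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Build a V2DImode vector from two GPR values; this is the kind of
   vec_construct the hunk above charges extra for when !fp.  */
v2di
ctor_from_gprs (unsigned long long a, unsigned long long b)
{
  return (v2di) { a, b };
}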
The patch helps for generic and core-avx2 tuning:

t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 <unknown> 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 <unknown> 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.

but not for znver2:

t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 <unknown> 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 <unknown> 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 52
  Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP

Apparently for znver{1,2,3} we choose a slightly higher load/store cost.

We could also try mitigating the vectorization by decomposing the
__int128 load in forwprop, where we have

      else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
               && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
               && gimple_assign_load_p (stmt)
               && !gimple_has_volatile_ops (stmt)
               && (TREE_CODE (gimple_assign_rhs1 (stmt)) != TARGET_MEM_REF)
               && !stmt_can_throw_internal (cfun, stmt))
        {
          /* Rewrite loads used only in BIT_FIELD_REF extractions to
             component-wise loads.  */

This was tailored to decompose, early, GCC vector extension loads that
are not supported by the HW.  Here we have

  _9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
  _14 = BIT_FIELD_REF <_9, 64, 64>;
  _15 = BIT_FIELD_REF <_9, 64, 0>;

where the HW doesn't have any __int128 GPRs.  If we do not vectorize,
the RTL pipeline will eventually split the load; if vectorization is
profitable, the vectorizer should be able to vectorize the resulting
split loads just as well.  In this case the decomposition would cause
actual costing of the loads (re-using the __int128 to-be-in-SSE reg is
free, in contrast) and would also cost the live lane extracts for the
retained integer code.
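A hand-written sketch (not compiler output) of what that
component-wise rewrite would produce here: the __int128 load is gone
and the former BIT_FIELD_REF uses become two plain DImode loads.

  _14 = MEM <long long unsigned int> [(char * {ref-all})in_8(D) + 8B];
  _15 = MEM <long long unsigned int> [(char * {ref-all})in_8(D)];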
But that moves the cost even more towards vectorizing, since now a
vector load (cost 12) plus two live lane extracts (when fixed to cost
sse_to_integer, that's 2 * 6) are used in place of two scalar loads
(cost 2 * 12).

On the code generation side this improves things, avoiding the
spilling in favor of vmovq/vpextrq.  That is not good enough to
recover the regression, but it does help a bit (~5%):

        vmovdqu (%rsi), %xmm1
        vpextrq $1, %xmm1, %rax
        shrq    $63, %rax
        imulq   $135, %rax, %rax
        vmovq   %rax, %xmm0
        vmovq   %xmm1, %rax
        vpsllq  $1, %xmm1, %xmm1
        shrq    $63, %rax
        vmovq   %rax, %xmm2
        vpunpcklqdq %xmm2, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret
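For reference, the reduced t.c is not quoted in this comment; judging
from the GIMPLE and the asm above it should be something along these
lines (a guessed reconstruction; the function and variable names are
made up): the XTS tweak update, i.e. a doubling in GF(2^128) with
reduction by the low polynomial 0x87 (hence the imulq $135).

#include <stdint.h>
#include <string.h>

void
poly_double (unsigned char *out, const unsigned char *in)
{
  unsigned __int128 x;
  memcpy (&x, in, 16);                 /* MEM <__int128 unsigned>    */
  uint64_t lo = (uint64_t) x;          /* BIT_FIELD_REF <_9, 64, 0>  */
  uint64_t hi = (uint64_t) (x >> 64);  /* BIT_FIELD_REF <_9, 64, 64> */
  uint64_t carry = (hi >> 63) * 135;   /* 0x87 = x^7 + x^2 + x + 1   */
  uint64_t r0 = (lo << 1) ^ carry;
  uint64_t r1 = (hi << 1) ^ (lo >> 63);
  memcpy (out, &r0, 8);
  memcpy (out + 8, &r1, 8);
}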