public inbox for gcc-bugs@sourceware.org
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Mon, 08 Mar 2021 13:20:00 +0000	[thread overview]
Message-ID: <bug-98856-4-CtZHEyHEux@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-98856-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
So my analysis was partly wrong: the vpinsrq is not an issue for the
benchmark; only the spilling is.

Note that the other idea, disparaging vector CTORs more, as with

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2603333f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,

       case vec_construct:
        {
-         /* N element inserts into SSE vectors.  */
+         /* N-element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         /* We cannot insert from GPRs directly but there's always a
+            GPR->XMM uop involved.  Account for that.
+            ???  Note that loads are already costed separately so this
+            eventually double-counts them.  */
+         if (!fp)
+           cost += (TYPE_VECTOR_SUBPARTS (vectype)
+                    * ix86_cost->hard_register.integer_to_sse);
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);

helps for generic and core-avx2 tuning:

t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0 <unknown> 0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0 <unknown> 1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.

but not for znver2:

t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790 <unknown> 0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790 <unknown> 1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 52
  Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP

Apparently for znver{1,2,3} we choose slightly higher load/store costs.

We could also try to mitigate this by decomposing the __int128
load in forwprop, where we have

          else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
                   && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
                   && gimple_assign_load_p (stmt)
                   && !gimple_has_volatile_ops (stmt)
                   && (TREE_CODE (gimple_assign_rhs1 (stmt))
                       != TARGET_MEM_REF)
                   && !stmt_can_throw_internal (cfun, stmt))
            {
              /* Rewrite loads used only in BIT_FIELD_REF extractions to
                 component-wise loads.  */

This was tailored to decompose, early, loads of GCC vector extension types
that are not supported by the HW.  Here we have

  _9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
  _14 = BIT_FIELD_REF <_9, 64, 64>;
  _15 = BIT_FIELD_REF <_9, 64, 0>;

where the HW doesn't have any __int128 GPRs.  If we do not vectorize then
the RTL pipeline will eventually split the load.  If vectorization is
profitable then the vectorizer should be able to vectorize the resulting
split loads as well.  In this case that would cause actual costing of the
load (re-using the __int128 in the to-be SSE reg is instead free) and would
also cost the live-lane extracts for the retained integer code.  But that
moves the cost even further towards vectorizing, since now a vector load
(cost 12) plus two live-lane extracts (when fixed to cost sse_to_integer,
that's 2 * 6) is used in place of two scalar loads (cost 2 * 12).  On the
code-generation side this improves things, avoiding the spilling in favor
of vmovq/vpextrq; that is not enough to recover the regression fully, but
it does help a bit (~5%):

        vmovdqu (%rsi), %xmm1
        vpextrq $1, %xmm1, %rax
        shrq    $63, %rax
        imulq   $135, %rax, %rax
        vmovq   %rax, %xmm0
        vmovq   %xmm1, %rax
        vpsllq  $1, %xmm1, %xmm1
        shrq    $63, %rax
        vmovq   %rax, %xmm2
        vpunpcklqdq     %xmm2, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret
