[Bug tree-optimization/101929] New: r12-2549 regress x264

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX.
@ 2021-08-16  5:08 crazylht at gmail dot com
  2021-08-16  5:10 ` [Bug tree-optimization/101929] " crazylht at gmail dot com
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-08-16  5:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

            Bug ID: 101929
           Summary: r12-2549 regress x264_r by 4% on CLX.
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
                CC: hjl.tools at gmail dot com, wwwhhhyyy333 at gmail dot com
  Target Milestone: ---
              Host: x86_64-pc-linux-gnu
            Target: x86_64-*-* i?86-*-*

The regression is in x264_pixel_satd_8x4

typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
typedef unsigned short uint16_t;

// in: a pseudo-simd number of the form x+(y<<16)
// return: abs(x)+(abs(y)<<16)
static inline
uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

int
x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]
);
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}

after increase cost of vector CTOR, slp1 won't vector for below
git diff my.slp1 original.slp1

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;
-  vect__192.68_825 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_824);
   t3_190 = (int) _189;
   _191 = t0_184 + t2_188;
   _192 = (unsigned int) _191;
+  tmp[0][0] = _192;
   _194 = t0_184 - t2_188;
   _195 = (unsigned int) _194;
+  tmp[0][2] = _195;
   _197 = t1_186 + t3_190;
   _198 = (unsigned int) _197;
+  tmp[0][1] = _198;
   _200 = t1_186 - t3_190;
   _201 = (unsigned int) _200;
-  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] = vect__192.68_825;
+  tmp[0][3] = _201;

but the vectorized version can somehow help fre to eliminate redundant vector
load and then got even better performace.

git diff dump.veclower21 dump.fre5

   MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B] = vect__54.89_852;
-  vect__63.9_482 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
-  vect__64.12_478 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
-  vect__65.13_477 = vect__63.9_482 + vect__64.12_478;
+  vect__65.13_477 = vect__192.68_825 + vect__273.75_834;
   vect_t0_100.14_476 = VIEW_CONVERT_EXPR<vector(4) int>(vect__65.13_477);
-  vect__67.15_475 = vect__63.9_482 - vect__64.12_478;
+  vect__67.15_475 = vect__192.68_825 - vect__273.75_834;
   vect_t1_101.16_474 = VIEW_CONVERT_EXPR<vector(4) int>(vect__67.15_475);
-  vect__68.19_470 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
-  vect__69.22_466 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];
-  vect__70.23_465 = vect__68.19_470 + vect__69.22_466;
+  vect__70.23_465 = vect__354.82_843 + vect__54.89_852;

If slp1 can realize this and add the upper part to comparison of scalar cost vs
vector cost, gcc should do vectorization, but currently it doesn't.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug tree-optimization/101929] r12-2549 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
@ 2021-08-16  5:10 ` crazylht at gmail dot com
  2021-08-16  5:21 ` [Bug target/101929] " crazylht at gmail dot com
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-08-16  5:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
Considering this, I'm debating whether to revert my patch.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] r12-2549 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
  2021-08-16  5:10 ` [Bug tree-optimization/101929] " crazylht at gmail dot com
@ 2021-08-16  5:21 ` crazylht at gmail dot com
  2021-08-16  9:43 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2021-08-16  5:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
W/o accurate info provided by vectorizer, the backend can do nothing about this
regression except reverting the patch, that's why i marked the bugzilla ad
tree-optimization component.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] r12-2549 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
  2021-08-16  5:10 ` [Bug tree-optimization/101929] " crazylht at gmail dot com
  2021-08-16  5:21 ` [Bug target/101929] " crazylht at gmail dot com
@ 2021-08-16  9:43 ` rguenth at gcc dot gnu.org
  2021-08-19  2:29 ` cvs-commit at gcc dot gnu.org
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-16  9:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's interesting to note that in

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR<vector(4) int>(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR <vect__191.66_822, vect__200.67_823, { 0, 1, 6, 7 }>;

we only need parts of the CTOR for the add/sub parts (because we ignore
some lanes with the blend).  That might even allow to elide the final
compose of the low/high part and expose some more insn parallelism.

Of course that looks quite difficult to achieve.

--

Note your CTOR cost estimates might be off given the CTORs are mostly
regular like

{ _181, _181, _181, _181, _262, _262, _262, _262, _343, _343, _343, _343, _48,
_48, _48, _48 }

thus could use 4 splats to xmm and 4 inserts?  For the V4SI vectorization
we unfortunately decide to do

t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Building parent vector operands from scalars instead

and thus end up with { _49, _50, _49, _50 }.  That said, I don't think
the backend gets easy access to the actual CTOR layout yet to improve costing
(similar as with permutes and the actual permute mask).

--

It's difficult (if not impossible) for the vectorizer to second-guess
the followup FRE, we're a long way from doing loop + SLP vectorization
in one go and discover we can elide the vector store.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] r12-2549 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (2 preceding siblings ...)
  2021-08-16  9:43 ` rguenth at gcc dot gnu.org
@ 2021-08-19  2:29 ` cvs-commit at gcc dot gnu.org
  2022-02-08 13:36 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-08-19  2:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:1db70e61a92978377a648bbd90e383859fc0126b

commit r12-3011-g1db70e61a92978377a648bbd90e383859fc0126b
Author: liuhongt <hongtao.liu@intel.com>
Date:   Tue Aug 17 17:29:06 2021 +0800

    Revert "Add the member integer_to_sse to processor_cost as a cost
simulation for movd/pinsrd. It will be used to calculate the cost of
vec_construct."

    This reverts commit 872da9a6f664a06d73c987aa0cb2e5b830158a10.

    PR target/101936
    PR target/101929

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] r12-2549 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (3 preceding siblings ...)
  2021-08-19  2:29 ` cvs-commit at gcc dot gnu.org
@ 2022-02-08 13:36 ` rguenth at gcc dot gnu.org
  2022-02-22  8:01 ` [Bug target/101929] [12 Regression] r12-7319 " rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-08 13:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
The rev. was reverted, see PR98138 for vectorization of this kernel.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (4 preceding siblings ...)
  2022-02-08 13:36 ` rguenth at gcc dot gnu.org
@ 2022-02-22  8:01 ` rguenth at gcc dot gnu.org
  2022-03-07  7:18 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-22  8:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Blocks|                            |26163
   Target Milestone|---                         |12.0
   Last reconfirmed|                            |2022-02-22
             Status|RESOLVED                    |ASSIGNED
         Resolution|FIXED                       |---
            Summary|r12-2549 regress x264_r by  |[12 Regression] r12-7319
                   |4% on CLX.                  |regress x264_r by 4% on
                   |                            |CLX.
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Proactively re-opening.  I will see whether something can be done.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (5 preceding siblings ...)
  2022-02-22  8:01 ` [Bug target/101929] [12 Regression] r12-7319 " rguenth at gcc dot gnu.org
@ 2022-03-07  7:18 ` rguenth at gcc dot gnu.org
  2022-03-07  8:22 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-07  7:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 9188d727e33..7f1f12fb6c6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2374,7 +2375,7 @@ fail:
                n_vector_builds++;
            }
        }
-      if (all_uniform_p
+      if ((all_uniform_p && !two_operators)
          || n_vector_builds > 1
          || (n_vector_builds == children.length ()
              && is_a <gphi *> (stmt_info->stmt)))


will re-enable the vectorization - it evades the vect_construct cost bump
by instead using scalar_to_vec (aka splat) which has not yet been fixed to
account for a possible gpr to xmm move (so it would be a temporary "solution"
at best).

Another change to mute the effect somewhat (but not fixing x264) that was
mentioned is

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..acf2cc977b4 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
vect_cost_for_stmt type_of_cost,
       case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);

which makes sense as the cost of the initial value of the xmm destination is
now costed separately when from GPR and free when from xmm.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (6 preceding siblings ...)
  2022-03-07  7:18 ` rguenth at gcc dot gnu.org
@ 2022-03-07  8:22 ` crazylht at gmail dot com
  2022-03-07  9:03 ` rguenther at suse dot de
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2022-03-07  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #7)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9188d727e33..7f1f12fb6c6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2374,7 +2375,7 @@ fail:
>                 n_vector_builds++;
>             }
>         }
> -      if (all_uniform_p
> +      if ((all_uniform_p && !two_operators)
>           || n_vector_builds > 1
>           || (n_vector_builds == children.length ()
>               && is_a <gphi *> (stmt_info->stmt)))
> 
> 
> will re-enable the vectorization - it evades the vect_construct cost bump
> by instead using scalar_to_vec (aka splat) which has not yet been fixed to
> account for a possible gpr to xmm move (so it would be a temporary "solution"
> at best).
> 
> Another change to mute the effect somewhat (but not fixing x264) that was
> mentioned is
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> ix86_cost->sse_op;

(In reply to Richard Biener from comment #7)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9188d727e33..7f1f12fb6c6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2374,7 +2375,7 @@ fail:
>                 n_vector_builds++;
>             }
>         }
> -      if (all_uniform_p
> +      if ((all_uniform_p && !two_operators)
>           || n_vector_builds > 1
>           || (n_vector_builds == children.length ()
>               && is_a <gphi *> (stmt_info->stmt)))
> 
> 
> will re-enable the vectorization - it evades the vect_construct cost bump
> by instead using scalar_to_vec (aka splat) which has not yet been fixed to
> account for a possible gpr to xmm move (so it would be a temporary "solution"
> at best).
> 
> Another change to mute the effect somewhat (but not fixing x264) that was
> mentioned is
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> ix86_cost->sse_op;
n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n -
2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (7 preceding siblings ...)
  2022-03-07  8:22 ` crazylht at gmail dot com
@ 2022-03-07  9:03 ` rguenther at suse dot de
  2022-03-11 14:05 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenther at suse dot de @ 2022-03-07  9:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 7 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929
> 
> --- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #7)
> > Another change to mute the effect somewhat (but not fixing x264) that was
> > mentioned is
> > 
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index b2bf90576d5..acf2cc977b4 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum
> > vect_cost_for_stmt type_of_cost,
> >        case vec_construct:
> >         {
> >           /* N element inserts into SSE vectors.  */
> > -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) *
> > ix86_cost->sse_op;
> n - 1 is right for 128-bit vector, but for 256-bit vector, shouldn't it be n -
> 2, since we have a separate cost for vinserti128, and n - 4 for 512-bit one.

True!  Note that without SLP the gpr->xmm move cost is not yet accounted
for (for loops the cases where we will need an actual gpr->xmm move
will be restricted to CTORs emitted in the prologue - in-loop cases
will always come from memory, so it might not be too important to get
that correct for the non-SLP case).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (8 preceding siblings ...)
  2022-03-07  9:03 ` rguenther at suse dot de
@ 2022-03-11 14:05 ` rguenth at gcc dot gnu.org
  2022-05-20  3:56 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-11 14:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Should be fixed with r12-7612.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (9 preceding siblings ...)
  2022-03-11 14:05 ` rguenth at gcc dot gnu.org
@ 2022-05-20  3:56 ` crazylht at gmail dot com
  2022-05-20  4:05 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2022-05-20  3:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
105493

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (10 preceding siblings ...)
  2022-05-20  3:56 ` crazylht at gmail dot com
@ 2022-05-20  4:05 ` crazylht at gmail dot com
  2022-05-20  6:17 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: crazylht at gmail dot com @ 2022-05-20  4:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---

> It's difficult (if not impossible) for the vectorizer to second-guess
> the followup FRE, we're a long way from doing loop + SLP vectorization
> in one go and discover we can elide the vector store.

I'm thinking of adding some detect in the vectorizer to find the "fre pair" of
the new vectorized store and existed vector load, then eliminate vector_store
cost in add_stmt_cost since it's probably be eliminated.

New vector store:  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] =
vect__192.68_825;

Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int>
[(unsigned int *)&tmp]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (11 preceding siblings ...)
  2022-05-20  4:05 ` crazylht at gmail dot com
@ 2022-05-20  6:17 ` rguenth at gcc dot gnu.org
  2022-05-20  6:18 ` [Bug target/101929] [12/13 " rguenth at gcc dot gnu.org
  2023-12-11 20:24 ` [Bug target/101929] [12/13/14 " pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-20  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #12)
>  
> > It's difficult (if not impossible) for the vectorizer to second-guess
> > the followup FRE, we're a long way from doing loop + SLP vectorization
> > in one go and discover we can elide the vector store.
> 
> I'm thinking of adding some detect in the vectorizer to find the "fre pair"
> of the new vectorized store and existed vector load, then eliminate
> vector_store cost in add_stmt_cost since it's probably be eliminated.
> 
> New vector store:  MEM <vector(4) unsigned int> [(unsigned int *)&tmp] =
> vect__192.68_825;
> 
> Existed vector load below: vect__63.9_482 = MEM <vector(4) unsigned int>
> [(unsigned int *)&tmp]

I think it would be more useful to explore whether we can find a special
kind of "SLP seed" by looking for vector loads that load from a store
group previously identified.  We can then have the vector _load_ be the
SLP seed similar as to how we handle vector CTORs.  The only special
sauce would be that we need to perform dependence analysis and make the
"store" cheaper (but only if it is really dead afterwards which would
mean doing some kind of DSE analysis - but of course we can offset
the vector load cost that goes away, maybe that alone helps).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12/13 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (12 preceding siblings ...)
  2022-05-20  6:17 ` rguenth at gcc dot gnu.org
@ 2022-05-20  6:18 ` rguenth at gcc dot gnu.org
  2023-12-11 20:24 ` [Bug target/101929] [12/13/14 " pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-20  6:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
            Summary|[12 Regression] r12-7319    |[12/13 Regression] r12-7319
                   |regress x264_r by 4% on     |regress x264_r by 4% on
                   |CLX.                        |CLX.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug target/101929] [12/13/14 Regression] r12-7319 regress x264_r by 4% on CLX.
  2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
                   ` (13 preceding siblings ...)
  2022-05-20  6:18 ` [Bug target/101929] [12/13 " rguenth at gcc dot gnu.org
@ 2023-12-11 20:24 ` pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-11 20:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |98138

--- Comment #14 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
PR 98138  is exactly the same loop ...


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
[Bug 98138] BB vect fail to SLP one case

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-12-11 20:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16  5:08 [Bug tree-optimization/101929] New: r12-2549 regress x264_r by 4% on CLX crazylht at gmail dot com
2021-08-16  5:10 ` [Bug tree-optimization/101929] " crazylht at gmail dot com
2021-08-16  5:21 ` [Bug target/101929] " crazylht at gmail dot com
2021-08-16  9:43 ` rguenth at gcc dot gnu.org
2021-08-19  2:29 ` cvs-commit at gcc dot gnu.org
2022-02-08 13:36 ` rguenth at gcc dot gnu.org
2022-02-22  8:01 ` [Bug target/101929] [12 Regression] r12-7319 " rguenth at gcc dot gnu.org
2022-03-07  7:18 ` rguenth at gcc dot gnu.org
2022-03-07  8:22 ` crazylht at gmail dot com
2022-03-07  9:03 ` rguenther at suse dot de
2022-03-11 14:05 ` rguenth at gcc dot gnu.org
2022-05-20  3:56 ` crazylht at gmail dot com
2022-05-20  4:05 ` crazylht at gmail dot com
2022-05-20  6:17 ` rguenth at gcc dot gnu.org
2022-05-20  6:18 ` [Bug target/101929] [12/13 " rguenth at gcc dot gnu.org
2023-12-11 20:24 ` [Bug target/101929] [12/13/14 " pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).