public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled
@ 2020-08-26  3:19 linkw at gcc dot gnu.org
  2020-08-26  6:53 ` [Bug target/96789] " rguenth at gcc dot gnu.org
                   ` (36 more replies)
  0 siblings, 37 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-08-26  3:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

            Bug ID: 96789
           Summary: x264: sub4x4_dct() improves when vectorization is
                    disabled
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

One of my colleagues found that disabling vectorization for the SPEC2017
525.x264_r function sub4x4_dct in source file x264_src/common/dct.c, via the
explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
speeds it up by 4%.

The options used are: -O3 -mcpu=power9 -fcommon -fno-strict-aliasing
-fgnu89-inline

I confirmed this finding and narrowed it down further to SLP vectorization,
using the attribute __attribute__((optimize("no-tree-slp-vectorize"))).

I also checked this particular file against commit r11-0: the performance
stays unchanged with or without the vectorization attribute. So I think this
is a trunk regression, probably exposing an SLP flaw or a cost modeling issue.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
@ 2020-08-26  6:53 ` rguenth at gcc dot gnu.org
  2020-08-26  7:13 ` linkw at gcc dot gnu.org
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-08-26  6:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
           Keywords|                            |missed-optimization
             Target|                            |powerpc
             Blocks|                            |53947

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Can you attach the relevant SLP vectorizer dump part (just for the function in
question)?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
  2020-08-26  6:53 ` [Bug target/96789] " rguenth at gcc dot gnu.org
@ 2020-08-26  7:13 ` linkw at gcc dot gnu.org
  2020-08-27  3:28 ` linkw at gcc dot gnu.org
                   ` (34 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-08-26  7:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 49124
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49124&action=edit
sub4x4_dct SLP dumping


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
  2020-08-26  6:53 ` [Bug target/96789] " rguenth at gcc dot gnu.org
  2020-08-26  7:13 ` linkw at gcc dot gnu.org
@ 2020-08-27  3:28 ` linkw at gcc dot gnu.org
  2020-08-27  6:40 ` rguenth at gcc dot gnu.org
                   ` (33 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-08-27  3:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #3 from Kewen Lin <linkw at gcc dot gnu.org> ---
Bisection shows it started to fail from r11-205.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-08-27  3:28 ` linkw at gcc dot gnu.org
@ 2020-08-27  6:40 ` rguenth at gcc dot gnu.org
  2020-08-27 11:16 ` rguenth at gcc dot gnu.org
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-08-27  6:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
This delays some checks to eventually support part of the BB vectorization
which is what succeeds here.  I suspect that w/o vectorization we manage
to elide the tmp[] array but with the part vectorization that occurs we
fail to do that.

On the cost side there would be a lot needed to make the vectorization
not profitable:

  Vector inside of basic block cost: 8
  Vector prologue cost: 36
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

the thing to double-check is

0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue

that is the cost of the V16QI construct

 _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
_576, _125, _143, _161, _179}; 

maybe you can extract a testcase for the function as well?


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-08-27  6:40 ` rguenth at gcc dot gnu.org
@ 2020-08-27 11:16 ` rguenth at gcc dot gnu.org
  2020-08-31  4:05 ` linkw at gcc dot gnu.org
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-08-27 11:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
testcase from https://github.com/mirror/x264/blob/master/common/dct.c

where FENC_STRIDE is 16 and FDEC_STRIDE 32

pixel is unsigned char, dctcoef is unsigned short

static inline void pixel_sub_wxh( dctcoef *diff, int i_size,
                                  pixel *pix1, int i_pix1,
                                  pixel *pix2, int i_pix2 )
{
    for( int y = 0; y < i_size; y++ )
    {
        for( int x = 0; x < i_size; x++ )
            diff[x + y*i_size] = pix1[x] - pix2[x];
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

static void sub4x4_dct( dctcoef dct[16], pixel *pix1, pixel *pix2 )
{
    dctcoef d[16];
    dctcoef tmp[16];

    pixel_sub_wxh( d, 4, pix1, FENC_STRIDE, pix2, FDEC_STRIDE );

    for( int i = 0; i < 4; i++ )
    {
        int s03 = d[i*4+0] + d[i*4+3];
        int s12 = d[i*4+1] + d[i*4+2];
        int d03 = d[i*4+0] - d[i*4+3];
        int d12 = d[i*4+1] - d[i*4+2];

        tmp[0*4+i] =   s03 +   s12;
        tmp[1*4+i] = 2*d03 +   d12;
        tmp[2*4+i] =   s03 -   s12;
        tmp[3*4+i] =   d03 - 2*d12;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s03 = tmp[i*4+0] + tmp[i*4+3];
        int s12 = tmp[i*4+1] + tmp[i*4+2];
        int d03 = tmp[i*4+0] - tmp[i*4+3];
        int d12 = tmp[i*4+1] - tmp[i*4+2];

        dct[i*4+0] =   s03 +   s12;
        dct[i*4+1] = 2*d03 +   d12;
        dct[i*4+2] =   s03 -   s12;
        dct[i*4+3] =   d03 - 2*d12;
    }
}

one can see vectorization is complicated by the non-homogeneous computations.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-08-27 11:16 ` rguenth at gcc dot gnu.org
@ 2020-08-31  4:05 ` linkw at gcc dot gnu.org
  2020-09-16 10:03 ` linkw at gcc dot gnu.org
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-08-31  4:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |linkw at gcc dot gnu.org

--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> This delays some checks to eventually support part of the BB vectorization
> which is what succeeds here.  I suspect that w/o vectorization we manage
> to elide the tmp[] array but with the part vectorization that occurs we
> fail to do that.
> 
> On the cost side there would be a lot needed to make the vectorization
> not profitable:
> 
>   Vector inside of basic block cost: 8
>   Vector prologue cost: 36
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 64
> 
> the thing to double-check is
> 
> 0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue
> 
> that is the cost of the V16QI construct
> 
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 
> 

Thanks Richard!  I did some cost adjustment experiments last year and the cost
for v16qi does indeed look off, but at that time the SPEC performance didn't
change with the cost tweaked for this; I guessed we just happened not to have
this kind of case to trip on.  I'll take a look and re-evaluate it for this.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-08-31  4:05 ` linkw at gcc dot gnu.org
@ 2020-09-16 10:03 ` linkw at gcc dot gnu.org
  2020-09-16 11:17 ` rguenth at gcc dot gnu.org
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-16 10:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-09-16
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
Two questions in mind; I need to dig into them further:
  1) From the assembly of the scalar/vector code, I don't see any stores
needed into the temp array d (array diff in pixel_sub_wxh), but when modeling
we consider the stores. On Power, two vector stores cost 2 while 16 scalar
stores cost 16; it seems wrong to cost-model something useless. Later, the
vector version needs 16 halfword extractions from these two halfword vectors,
while in the scalar version the values simply live in GPRs, so the vector
version looks inefficient.
  2) On Power, the conversion from unsigned char to unsigned short is a nop
conversion. When counting the scalar cost it is counted, adding 32 in total
onto the scalar cost. Meanwhile, the conversion from unsigned short to signed
short should be counted but isn't (I need to check why further).  The nop
conversion costing looks like something we can handle in the function
rs6000_adjust_vect_cost_per_stmt; I tried the generic function
tree_nop_conversion_p, but it only covers same-mode/precision conversions.
I will look for something else.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2020-09-16 10:03 ` linkw at gcc dot gnu.org
@ 2020-09-16 11:17 ` rguenth at gcc dot gnu.org
  2020-09-16 12:25 ` linkw at gcc dot gnu.org
                   ` (28 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-16 11:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #7)
> Two questions in mind, need to dig into it further:
>   1) from the assembly of scalar/vector code, I don't see any stores needed
> into temp array d (array diff in pixel_sub_wxh), but when modeling we
> consider the stores.

Because when modeling they are still there.  There's no good way around this.

> On Power two vector stores take cost 2 while 16 scalar
> stores takes cost 16, it seems wrong to cost model something useless. Later,
> for the vector version we need 16 vector halfword extractions from these two
> halfword vectors, while scalar version the values are just in GPR register,
> vector version looks inefficient.
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 
> The nop conversion costing looks something we can handle in function
> rs6000_adjust_vect_cost_per_stmt, I tried to use the generic function
> tree_nop_conversion_p, but it's only for same mode/precision conversion.
> Will find/check something else.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2020-09-16 11:17 ` rguenth at gcc dot gnu.org
@ 2020-09-16 12:25 ` linkw at gcc dot gnu.org
  2020-09-16 13:04 ` rguenth at gcc dot gnu.org
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-16 12:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> (In reply to Kewen Lin from comment #7)
> > Two questions in mind, need to dig into it further:
> >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > consider the stores.
> 
> Because when modeling they are still there.  There's no good way around this.
> 

I noticed the stores get eliminated during FRE.  Could we consider running FRE
once just before SLP, or is that a bad idea due to compilation time?


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2020-09-16 12:25 ` linkw at gcc dot gnu.org
@ 2020-09-16 13:04 ` rguenth at gcc dot gnu.org
  2020-09-17  2:50 ` linkw at gcc dot gnu.org
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-16 13:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Kewen Lin from comment #7)
> > > Two questions in mind, need to dig into it further:
> > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > consider the stores.
> > 
> > Because when modeling they are still there.  There's no good way around this.
> > 
> 
> I noticed the stores get eliminated during FRE.  Can we consider running FRE
> once just before SLP? a bad idea due to compilation time?

Yeah, we already run FRE a lot and it is one of the more expensive passes.

Note there's one point we could do better which is the embedded SESE FRE
run from cunroll which is only run before we consider peeling an outer loop
and thus not for the outermost unrolled/peeled code (but the question would
be from where / up to what to apply FRE to).  On x86_64 this would apply to
the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
matters for costing though).

I think I looked at this or a related testcase some time ago and split out
some PRs (can't find those right now).  For example we are not considering
to simplify

  _318 = {_4, _14, _293, _30, _49, _251, _225, _248, _52, _70, _260, _284,
_100, _117, _134, _151};
  vect__5.47_319 = (vector(16) short unsigned int) _318;
  _154 = MEM[(pixel *)pix2_58(D) + 99B];
  _320 = {_6, _16, _22, _32, _51, _255, _231, _243, _54, _68, _276, _286, _103,
_120, _137, _154};
  vect__7.48_321 = (vector(16) short unsigned int) _320;
  vect__12.49_322 = vect__5.47_319 - vect__7.48_321;
  _317 = BIT_FIELD_REF <vect__12.49_322, 64, 0>;
  _315 = BIT_FIELD_REF <vect__12.49_322, 64, 64>;
  _313 = BIT_FIELD_REF <vect__12.49_322, 64, 128>;
  _311 = BIT_FIELD_REF <vect__12.49_322, 64, 192>;
  vect_perm_even_165 = VEC_PERM_EXPR <_317, _315, { 0, 2, 4, 6 }>;
  vect_perm_odd_164 = VEC_PERM_EXPR <_317, _315, { 1, 3, 5, 7 }>;
  vect_perm_even_163 = VEC_PERM_EXPR <_313, _311, { 0, 2, 4, 6 }>;
  vect_perm_odd_156 = VEC_PERM_EXPR <_313, _311, { 1, 3, 5, 7 }>;

down to smaller vectors.  Also apparently the two vector CTORs are not
re-shuffled to a vector load + shuffle.  In the SLP analysis we end up with

t2.c:12:32: note:   Final SLP tree for instance:
t2.c:12:32: note:   node 0x436e3c0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 *_11 = _12;
t2.c:12:32: note:       stmt 1 *_21 = _71;
...
t2.c:12:32: note:       stmt 15 *_160 = _161;
t2.c:12:32: note:       children 0x436de70
t2.c:12:32: note:   node 0x436de70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _12 = _5 - _7;
t2.c:12:32: note:       stmt 1 _71 = _15 - _17;
...
t2.c:12:32: note:       stmt 15 _161 = _152 - _155;
t2.c:12:32: note:       children 0x436ebb0 0x4360b70
t2.c:12:32: note:   node 0x436ebb0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _5 = (short unsigned int) _4;
...
t2.c:12:32: note:       stmt 15 _152 = (short unsigned int) _151;
t2.c:12:32: note:       children 0x42f1740
t2.c:12:32: note:   node 0x42f1740 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _4 = *pix1_57(D);
...
t2.c:12:32: note:       stmt 15 _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:12:32: note:       load permutation { 0 1 2 3 16 17 18 19 32 33 34 35 48
49 50 51 }
t2.c:12:32: note:   node 0x4360b70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _7 = (short unsigned int) _6;
...
t2.c:12:32: note:       stmt 15 _155 = (short unsigned int) _154;
t2.c:12:32: note:       children 0x4360be0
t2.c:12:32: note:   node 0x4360be0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:       stmt 0 _6 = *pix2_58(D);
...
t2.c:12:32: note:       stmt 15 _154 = MEM[(pixel *)pix2_296 + 3B];
t2.c:12:32: note:       load permutation { 0 1 2 3 32 33 34 35 64 65 66 67 96
97 98 99 }

the load permutations suggest that splitting the group into 4-lane pieces
would avoid doing permutes but then that would require target support
for V4QI and V4HI vectors.  At least the loads could be considered
to be vectorized with strided-SLP, yielding 'int' loads and a vector
build from 4 ints.  I'd need to analyze why we do not consider this.

t2.c:50:1: note:   Detected interleaving load of size 52
t2.c:50:1: note:        _4 = *pix1_57(D);
t2.c:50:1: note:        _14 = MEM[(pixel *)pix1_57(D) + 1B];
t2.c:50:1: note:        _293 = MEM[(pixel *)pix1_57(D) + 2B];
t2.c:50:1: note:        _30 = MEM[(pixel *)pix1_57(D) + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _49 = *pix1_40;
t2.c:50:1: note:        _251 = MEM[(pixel *)pix1_40 + 1B];
t2.c:50:1: note:        _225 = MEM[(pixel *)pix1_40 + 2B];
t2.c:50:1: note:        _248 = MEM[(pixel *)pix1_40 + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _52 = *pix1_264;
t2.c:50:1: note:        _70 = MEM[(pixel *)pix1_264 + 1B];
t2.c:50:1: note:        _260 = MEM[(pixel *)pix1_264 + 2B];
t2.c:50:1: note:        _284 = MEM[(pixel *)pix1_264 + 3B];
t2.c:50:1: note:        <gap of 12 elements>
t2.c:50:1: note:        _100 = *pix1_295;
t2.c:50:1: note:        _117 = MEM[(pixel *)pix1_295 + 1B];
t2.c:50:1: note:        _134 = MEM[(pixel *)pix1_295 + 2B];
t2.c:50:1: note:        _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:50:1: note:   Detected interleaving load of size 100
t2.c:50:1: note:        _6 = *pix2_58(D);
t2.c:50:1: note:        _16 = MEM[(pixel *)pix2_58(D) + 1B];
t2.c:50:1: note:        _22 = MEM[(pixel *)pix2_58(D) + 2B];
t2.c:50:1: note:        _32 = MEM[(pixel *)pix2_58(D) + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _51 = *pix2_41;
t2.c:50:1: note:        _255 = MEM[(pixel *)pix2_41 + 1B];
t2.c:50:1: note:        _231 = MEM[(pixel *)pix2_41 + 2B];
t2.c:50:1: note:        _243 = MEM[(pixel *)pix2_41 + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _54 = *pix2_272;
t2.c:50:1: note:        _68 = MEM[(pixel *)pix2_272 + 1B];
t2.c:50:1: note:        _276 = MEM[(pixel *)pix2_272 + 2B];
t2.c:50:1: note:        _286 = MEM[(pixel *)pix2_272 + 3B];
t2.c:50:1: note:        <gap of 28 elements>
t2.c:50:1: note:        _103 = *pix2_296;
t2.c:50:1: note:        _120 = MEM[(pixel *)pix2_296 + 1B];
t2.c:50:1: note:        _137 = MEM[(pixel *)pix2_296 + 2B];
t2.c:50:1: note:        _154 = MEM[(pixel *)pix2_296 + 3B];


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2020-09-16 13:04 ` rguenth at gcc dot gnu.org
@ 2020-09-17  2:50 ` linkw at gcc dot gnu.org
  2020-09-17  5:06 ` linkw at gcc dot gnu.org
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-17  2:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #11 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Thanks for the explanation! I'll look at it after checking 2). IIUC, the
advantage of eliminating the stores here is that it bundles what feeds the
stores together with the stores' consumers, so more things can get SLP-ed
where possible?

> 
> I think I looked at this or a related testcase some time ago and split out
> some PRs (can't find those right now).  For example we are not considering
> to simplify
> 
....
> 
> the load permutations suggest that splitting the group into 4-lane pieces
> would avoid doing permutes but then that would require target support
> for V4QI and V4HI vectors.  At least the loads could be considered
> to be vectorized with strided-SLP, yielding 'int' loads and a vector
> build from 4 ints.  I'd need to analyze why we do not consider this.

Good idea! Out of curiosity, is there some port where an int load cannot work
on a 1-byte-aligned address, e.g. where it would trap?


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2020-09-17  2:50 ` linkw at gcc dot gnu.org
@ 2020-09-17  5:06 ` linkw at gcc dot gnu.org
  2020-09-18  9:11 ` linkw at gcc dot gnu.org
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-17  5:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #12 from Kewen Lin <linkw at gcc dot gnu.org> ---

> Thanks for the explanation! I'll look at it after checking 2). IIUC, the
> advantage to eliminate stores here looks able to get those things which is
> fed to stores and stores' consumers bundled, then get more things SLP-ed if
> available?

Hmm, I think I was wrong: if both the feeding chain and the consuming chain of
the stores are SLP-ed, a later FRE run would be able to fuse them.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2020-09-17  5:06 ` linkw at gcc dot gnu.org
@ 2020-09-18  9:11 ` linkw at gcc dot gnu.org
  2020-09-18 10:30 ` rguenther at suse dot de
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-18  9:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 

vect_nop_conversion_p returns true for the UH-to-SH conversion, so it's not
even put into the cost vector.

tree_nop_conversion_p's comment says:

/* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
   no instruction.  */

I may be missing something here, but the UH-to-SH conversion does need one
explicit extend instruction (*extsh*); the precision/mode equality check looks
wrong for this conversion.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2020-09-18  9:11 ` linkw at gcc dot gnu.org
@ 2020-09-18 10:30 ` rguenther at suse dot de
  2020-09-18 13:40 ` linkw at gcc dot gnu.org
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenther at suse dot de @ 2020-09-18 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > conversion, when we counting scalar cost, it's counted, then add costs 32
> > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > signed short should be counted but it's not (need to check why further). 
> 
> UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> even put into the cost vector. 
> 
> tree_nop_conversion_p's comments saying:
> 
> /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
>    no instruction.  */
> 
> I may miss something here, but UH to SH conversion does need one explicit
> extend instruction *extsh*, the precision/mode equality check looks wrong for
> this conversion.

Well, it isn't a RTL predicate and it only needs extension because
there's never a HImode pseudo but always SImode subregs.


* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2020-09-18 10:30 ` rguenther at suse dot de
@ 2020-09-18 13:40 ` linkw at gcc dot gnu.org
  2020-09-19 15:45 ` crazylht at gmail dot com
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-18 13:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #14)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > signed short should be counted but it's not (need to check why further). 
> > 
> > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > even put into the cost vector. 
> > 
> > tree_nop_conversion_p's comments saying:
> > 
> > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> >    no instruction.  */
> > 
> > I may miss something here, but UH to SH conversion does need one explicit
> > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > this conversion.
> 
> Well, it isn't a RTL predicate and it only needs extension because
> there's never a HImode pseudo but always SImode subregs.

Thanks Richi! Should we take care of this case? or neglect this kind of
extension as "no instruction"? I was intent to handle it in target specific
code, but it isn't recorded into cost vector while it seems too heavy to do the
bb_info slp_instances revisits in finish_cost.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2020-09-18 13:40 ` linkw at gcc dot gnu.org
@ 2020-09-19 15:45 ` crazylht at gmail dot com
  2020-09-21  7:14 ` rguenther at suse dot de
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: crazylht at gmail dot com @ 2020-09-19 15:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #16 from Hongtao.liu <crazylht at gmail dot com> ---
I notice

0x5561dc0 _36 * 2 1 times scalar_stmt costs 16 in body
0x5561dc0 _38 * 2 1 times scalar_stmt costs 16 in body
0x5562df0 _36 * 2 1 times vector_stmt costs 16 in body
0x5562df0 _38 * 2 1 times vector_stmt costs 16 in body

ix86_multiplication_cost is called for the cost estimation, but in
pass_expand, synth_mult will transform the multiplication into a shift. So
the shift cost, not the mult cost, should be used in this case.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2020-09-19 15:45 ` crazylht at gmail dot com
@ 2020-09-21  7:14 ` rguenther at suse dot de
  2020-09-25 12:46 ` linkw at gcc dot gnu.org
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenther at suse dot de @ 2020-09-21  7:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> 
> --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #14)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > signed short should be counted but it's not (need to check why further). 
> > > 
> > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > even put into the cost vector. 
> > > 
> > > tree_nop_conversion_p's comments saying:
> > > 
> > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > >    no instruction.  */
> > > 
> > > I may miss something here, but UH to SH conversion does need one explicit
> > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > this conversion.
> > 
> > Well, it isn't a RTL predicate and it only needs extension because
> > there's never a HImode pseudo but always SImode subregs.
> 
> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do the
> bb_info slp_instances revisits in finish_cost.

I think it's not something we should handle on GIMPLE.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2020-09-21  7:14 ` rguenther at suse dot de
@ 2020-09-25 12:46 ` linkw at gcc dot gnu.org
  2020-09-25 12:52 ` linkw at gcc dot gnu.org
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-25 12:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #18 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

By following this idea, to release the restriction on loop_outer (loop_father)
when setting the father_bbs, I can see FRE works as expectedly.  But it
actually does the rpo_vn from cfun's entry to its exit. If it's taken as
costly, we probably can guard it to take effects only when all its inner loops
are unrolled, for this case, all of its three loops are unrolled.
Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
for scalar codes it needs one more time dse run after cunroll to get array d
eliminated. I guess it's not costly? Can one pass be run or not controlled by
something in another pass? via global variable and add one parameter in
passes.def seems weird. If it's costly, probably we can go by factoring out one
routine to be called instead of running a pass, like do_rpo_vn?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2020-09-25 12:46 ` linkw at gcc dot gnu.org
@ 2020-09-25 12:52 ` linkw at gcc dot gnu.org
  2020-09-25 12:57 ` rguenth at gcc dot gnu.org
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-25 12:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #19 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #17)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #14)
> > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > 
> > > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > > signed short should be counted but it's not (need to check why further). 
> > > > 
> > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > > even put into the cost vector. 
> > > > 
> > > > tree_nop_conversion_p's comments saying:
> > > > 
> > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > >    no instruction.  */
> > > > 
> > > > I may miss something here, but UH to SH conversion does need one explicit
> > > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > > this conversion.
> > > 
> > > Well, it isn't a RTL predicate and it only needs extension because
> > > there's never a HImode pseudo but always SImode subregs.
> > 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do the
> > bb_info slp_instances revisits in finish_cost.
> 
> I think it's not something we should handle on GIMPLE.

Got it! For 

          else if (vect_nop_conversion_p (stmt_info))
            continue;

Is it a good idea to change it to call record_stmt_cost like the others? 
  1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
  2) builtin_vectorization_cost return zero for it by default as before.
  3) targets can adjust the cost according to its need

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2020-09-25 12:52 ` linkw at gcc dot gnu.org
@ 2020-09-25 12:57 ` rguenth at gcc dot gnu.org
  2020-09-25 13:05 ` rguenth at gcc dot gnu.org
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-25 12:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #19)
> (In reply to rguenther@suse.de from comment #17)
> > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > 
> > > --- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > (In reply to rguenther@suse.de from comment #14)
> > > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > > 
> > > > > --- Comment #13 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > > > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > > > > signed short should be counted but it's not (need to check why further). 
> > > > > 
> > > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > > > > even put into the cost vector. 
> > > > > 
> > > > > tree_nop_conversion_p's comments saying:
> > > > > 
> > > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > > >    no instruction.  */
> > > > > 
> > > > > I may miss something here, but UH to SH conversion does need one explicit
> > > > > extend instruction *extsh*, the precision/mode equality check looks wrong for
> > > > > this conversion.
> > > > 
> > > > Well, it isn't a RTL predicate and it only needs extension because
> > > > there's never a HImode pseudo but always SImode subregs.
> > > 
> > > Thanks Richi! Should we take care of this case? or neglect this kind of
> > > extension as "no instruction"? I was intent to handle it in target specific
> > > code, but it isn't recorded into cost vector while it seems too heavy to do the
> > > bb_info slp_instances revisits in finish_cost.
> > 
> > I think it's not something we should handle on GIMPLE.
> 
> Got it! For 
> 
> 	  else if (vect_nop_conversion_p (stmt_info))
> 	    continue;
> 
> Is it a good idea to change it to call record_stmt_cost like the others? 
>   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
>   2) builtin_vectorization_cost return zero for it by default as before.
>   3) targets can adjust the cost according to its need

I think this early-out was added for the case where there was no cost but
the target costed it.  So at least go back and look what target that was
and see if it can be adjusted.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2020-09-25 12:57 ` rguenth at gcc dot gnu.org
@ 2020-09-25 13:05 ` rguenth at gcc dot gnu.org
  2020-09-27  2:56 ` crazylht at gmail dot com
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-25 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #18)
> (In reply to Richard Biener from comment #10)
> > (In reply to Kewen Lin from comment #9)
> > > (In reply to Richard Biener from comment #8)
> > > > (In reply to Kewen Lin from comment #7)
> > > > > Two questions in mind, need to dig into it further:
> > > > >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > > consider the stores.
> > > > 
> > > > Because when modeling they are still there.  There's no good way around this.
> > > > 
> > > 
> > > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > > once just before SLP? a bad idea due to compilation time?
> > 
> > Yeah, we already run FRE a lot and it is one of the more expensive passes.
> > 
> > Note there's one point we could do better which is the embedded SESE FRE
> > run from cunroll which is only run before we consider peeling an outer loop
> > and thus not for the outermost unrolled/peeled code (but the question would
> > be from where / up to what to apply FRE to).  On x86_64 this would apply to
> > the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> > quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> > matters for costing though).
> 
> By following this idea, to release the restriction on loop_outer
> (loop_father) when setting the father_bbs, I can see FRE works as
> expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.

Yeah, that's the reason we do not do it.  We could possibly restrict it
to a containing loop, or if the containing loop is the whole function,
restrict it to the original preheader block to the loop exits (which are
no longer there, we'd need to pre-record those I think)

> If it's taken as costly, we probably can guard it to take effects only when
> all its inner loops are unrolled, for this case, all of its three loops are
> unrolled.

The original reason VN is done on unrolled bodies is to improve cost
estimates when unrolling nests - since we do not unroll the non-existing
outer loop of the outermost loop, applying VN there for that reason is
pointless, and the reasoning is to simply wait for the next scheduled VN
pass (which is too late for SLP).

> Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> for scalar codes it needs one more time dse run after cunroll to get array d
> eliminated. I guess it's not costly? Can one pass be run or not controlled
> by something in another pass? via global variable and add one parameter in
> passes.def seems weird. If it's costly, probably we can go by factoring out
> one routine to be called instead of running a pass, like do_rpo_vn?

No, we don't have a good way to schedule passes from other passes.  And yes,
the way forward is to support key transforms on regions.  Oh, and every
pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.

What we could eventually do is move the non-loop SLP pass later and
run the loop SLP pass only on loop bodies (so overall every BB is just
SLP vectorized once).  The

      PUSH_INSERT_PASSES_WITHIN (pass_tree_no_loop)
          NEXT_PASS (pass_slp_vectorize);
      POP_INSERT_PASSES ()

pass would then be unconditional and only run on non-loop BBs.  Note
that SLP still leaves us with FRE opportunities so this will probably
not solve the issue fully.  Pass reorderings are also always tricky...
(adding passes is easy but piling these up over the time is bad).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2020-09-25 13:05 ` rguenth at gcc dot gnu.org
@ 2020-09-27  2:56 ` crazylht at gmail dot com
  2020-09-27  3:07 ` crazylht at gmail dot com
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: crazylht at gmail dot com @ 2020-09-27  2:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> One of my workmates found that if we disable vectorization for SPEC2017
> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> it can speed up by 4%.

For CLX, if we disable slp vectorization in sub4x4_dct by 
__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by 4%.

> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do
> the bb_info slp_instances revisits in finish_cost.

For i386 backend unsigned char --> unsigned short is no "no instruction", but
in this case
---
1033  _134 = MEM[(pixel *)pix1_295 + 2B];                                       
1034  _135 = (short unsigned int) _134;
---

It could be combined and optimized to 
---
movzbl  19(%rcx), %r8d
---

So, if an "unsigned char" variable is loaded from memory, then the
conversion would also be "no instruction"; I'm not sure if the backend
cost model could handle such a situation.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2020-09-27  2:56 ` crazylht at gmail dot com
@ 2020-09-27  3:07 ` crazylht at gmail dot com
  2020-09-27  6:48 ` rguenther at suse dot de
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: crazylht at gmail dot com @ 2020-09-27  3:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #23 from Hongtao.liu <crazylht at gmail dot com> ---
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 

The cost of vec_construct in i386 backend is 64, calculated as 16 x 4

cut from i386.c
---
/* N element inserts into SSE vectors.  */ 
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---

From the perspective of pipeline latency, it seems ok, but from the
perspective of rtx_cost it seems inaccurate, since the vector would be
initialized as
---
        vmovd   %eax, %xmm0
        vpinsrb $1, 1(%rsi), %xmm0, %xmm0
        vmovd   %eax, %xmm7
        vpinsrb $1, 3(%rsi), %xmm7, %xmm7
        vmovd   %eax, %xmm3
        vpinsrb $1, 17(%rsi), %xmm3, %xmm3
        vmovd   %eax, %xmm6
        vpinsrb $1, 19(%rsi), %xmm6, %xmm6
        vmovd   %eax, %xmm1
        vpinsrb $1, 33(%rsi), %xmm1, %xmm1
        vmovd   %eax, %xmm5
        vpinsrb $1, 35(%rsi), %xmm5, %xmm5
        vmovd   %eax, %xmm2
        vpinsrb $1, 49(%rsi), %xmm2, %xmm2
        vmovd   %eax, %xmm4
        vpinsrb $1, 51(%rsi), %xmm4, %xmm4
        vpunpcklwd      %xmm6, %xmm3, %xmm3
        vpunpcklwd      %xmm4, %xmm2, %xmm2
        vpunpcklwd      %xmm7, %xmm0, %xmm0
        vpunpcklwd      %xmm5, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm1, %xmm1
        vpunpckldq      %xmm3, %xmm0, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
---

it's 16 "vector insert" operations + (4 + 2 + 1) "vector
concat/permutation" operations, so the cost should be 92 (23 * 4).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2020-09-27  3:07 ` crazylht at gmail dot com
@ 2020-09-27  6:48 ` rguenther at suse dot de
  2020-09-27 10:20 ` linkw at gcc dot gnu.org
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenther at suse dot de @ 2020-09-27  6:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
>
>--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
>>One of my workmates found that if we disable vectorization for
>SPEC2017 >525.x264_r function sub4x4_dct in source file
>x264_src/common/dct.c with ?>explicit function attribute
>__attribute__((optimize("no-tree-vectorize"))), it >can speed up by 4%.
>
>For CLX, if we disable slp vectorization in sub4x4_dct by 
>__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
>up by 4%.
>
>> Thanks Richi! Should we take care of this case? or neglect this kind
>of
>> extension as "no instruction"? I was intent to handle it in target
>specific
>> code, but it isn't recorded into cost vector while it seems too heavy
>to do
>> the bb_info slp_instances revisits in finish_cost.
>
>For i386 backend unsigned char --> unsigned short is no "no
>instruction", but
>in this case
>---
>1033  _134 = MEM[(pixel *)pix1_295 + 2B];                              
>        
>1034  _135 = (short unsigned int) _134;
>---
>
>It could be combined and optimized to 
>---
>movzbl  19(%rcx), %r8d
>---
>
>So, if an "unsigned char" variable is loaded from memory, then the
>conversion would also be "no instruction"; I'm not sure if the backend
>cost model could handle such a situation.

I think all attempts to address this from the side of the scalar cost are
going to be difficult and fragile.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2020-09-27  6:48 ` rguenther at suse dot de
@ 2020-09-27 10:20 ` linkw at gcc dot gnu.org
  2020-09-27 10:36 ` linkw at gcc dot gnu.org
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-27 10:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #25 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > 
> > Got it! For 
> > 
> > 	  else if (vect_nop_conversion_p (stmt_info))
> > 	    continue;
> > 
> > Is it a good idea to change it to call record_stmt_cost like the others? 
> >   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
> >   2) builtin_vectorization_cost return zero for it by default as before.
> >   3) targets can adjust the cost according to its need
> 
> I think this early-out was added for the case where there was no cost but
> the target costed it.  So at least go back and look what target that was
> and see if it can be adjusted.

OK, thanks. The change is from commit r10-4592. I think most of the
objects it handles have no cost on any target, like VIEW_CONVERT_EXPR,
UL/SL bi-directional conversions etc., so it's good to differentiate them
from scalar_stmt, but the "continue" way makes it hard for a target to
tweak the cost of some tree_nop_conversion_p stmts.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2020-09-27 10:20 ` linkw at gcc dot gnu.org
@ 2020-09-27 10:36 ` linkw at gcc dot gnu.org
  2020-09-27 10:42 ` linkw at gcc dot gnu.org
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-27 10:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #26 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > By following this idea, to release the restriction on loop_outer
> > (loop_father) when setting the father_bbs, I can see FRE works as
> > expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.
> 
> Yeah, that's the reason we do not do it.  We could possibly restrict it
> to a containing loop, or if the containing loop is the whole function,
> restrict it to the original preheader block to the loop exits (which are
> no longer there, we'd need to pre-record those I think)

Thanks for the suggestion! 

I tried the idea of restricting it to run from the original preheader
block to the loop exits (pre-recording both as you said), but that can't
get array d eliminated in the end; unfortunately this case requires VN to
run across the boundary between the original loops.  Now I ended up
running whole-function VN once when there aren't any loops left after
unrolling.  I guess that if there are no loops, the CFG should be simple
in most cases and therefore not so costly?

> > Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> > for scalar codes it needs one more time dse run after cunroll to get array d
> > eliminated. I guess it's not costly? Can one pass be run or not controlled
> > by something in another pass? via global variable and add one parameter in
> > passes.def seems weird. If it's costly, probably we can go by factoring out
> > one routine to be called instead of running a pass, like do_rpo_vn?
> 
> No, we don't have a good way to schedule passes from other passes.  And yes,
> the way forward is to support key transforms on regions.  Oh, and every
> pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.
> 

OK, I'll have a look at DSE and try to get it to support running on a
region, although that may not help this case since it needs to operate
across the loop boundary.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2020-09-27 10:36 ` linkw at gcc dot gnu.org
@ 2020-09-27 10:42 ` linkw at gcc dot gnu.org
  2020-09-28  5:45 ` crazylht at gmail dot com
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-27 10:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #27 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #22)
> > One of my workmates found that if we disable vectorization for SPEC2017
> > 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> > explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> > it can speed up by 4%.
> 
> For CLX, if we disable slp vectorization in sub4x4_dct by 
> __attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by
> 4%.
> 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do
> > the bb_info slp_instances revisits in finish_cost.
> 
> For i386 backend unsigned char --> unsigned short is no "no instruction",

Thanks for the information, it means it's target specific.

> but in this case
> ---
> 1033  _134 = MEM[(pixel *)pix1_295 + 2B];                                   
> 
> 1034  _135 = (short unsigned int) _134;
> ---
> 
> It could be combined and optimized to 
> ---
> movzbl  19(%rcx), %r8d
> ---
> 
> So, if an "unsigned char" variable is loaded from memory, then the
> conversion would also be "no instruction"; I'm not sure if the backend
> cost model could handle such a situation.

Probably you can try to tweak it in ix86_add_stmt_cost? when the statement is
UB to UH conversion statement, further check if the def of the input UB is MEM.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2020-09-27 10:42 ` linkw at gcc dot gnu.org
@ 2020-09-28  5:45 ` crazylht at gmail dot com
  2020-09-28  6:40 ` linkw at gcc dot gnu.org
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: crazylht at gmail dot com @ 2020-09-28  5:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #28 from Hongtao.liu <crazylht at gmail dot com> ---

> Probably you can try to tweak it in ix86_add_stmt_cost? when the statement

Yes, it's the place.

> is UB to UH conversion statement, further check if the def of the input UB
> is MEM.

Only if there's no multi-use for UB. More generally, it's quite difficult to
guess later optimizations for the purpose of more accurate vectorization cost
model, :(.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2020-09-28  5:45 ` crazylht at gmail dot com
@ 2020-09-28  6:40 ` linkw at gcc dot gnu.org
  2020-09-28  6:48 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-28  6:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #29 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #28)
> > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> 
> Yes, it's the place.
> 
> > is UB to UH conversion statement, further check if the def of the input UB
> > is MEM.
> 
> Only if there's no multi-use for UB. More generally, it's quite difficult to
> guess later optimizations for the purpose of more accurate vectorization
> cost model, :(.

Yeah, it's hard sadly. The generic cost modeling is rough, and ix86_add_stmt_cost
is more fine-grained (at least than what we have on Power :)). If you want to
check it more, it seems doable in the target-specific hook finish_cost, where
you can get the whole vinfo object, but it could end up as very heavy analysis
and might not be worthwhile.

Do you mind checking whether running FRE and DSE just after cunroll can also
fix this degradation on x86? I found it worked for Power and hoped it can help
there too.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2020-09-28  6:40 ` linkw at gcc dot gnu.org
@ 2020-09-28  6:48 ` rguenth at gcc dot gnu.org
  2020-09-28  6:59 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-28  6:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #23)
> >  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> > _576, _125, _143, _161, _179}; 
> 
> The cost of vec_construct in i386 backend is 64, calculated as 16 x 4
> 
> cut from i386.c
> ---
> /* N element inserts into SSE vectors.  */ 
> int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> ---
> 
> From perspective of pipeline latency, is seems ok, but from perspective of
> rtx_cost, it seems inaccurate since it would be initialized as
> ---
>         vmovd   %eax, %xmm0
>         vpinsrb $1, 1(%rsi), %xmm0, %xmm0
>         vmovd   %eax, %xmm7
>         vpinsrb $1, 3(%rsi), %xmm7, %xmm7
>         vmovd   %eax, %xmm3
>         vpinsrb $1, 17(%rsi), %xmm3, %xmm3
>         vmovd   %eax, %xmm6
>         vpinsrb $1, 19(%rsi), %xmm6, %xmm6
>         vmovd   %eax, %xmm1
>         vpinsrb $1, 33(%rsi), %xmm1, %xmm1
>         vmovd   %eax, %xmm5
>         vpinsrb $1, 35(%rsi), %xmm5, %xmm5
>         vmovd   %eax, %xmm2
>         vpinsrb $1, 49(%rsi), %xmm2, %xmm2
>         vmovd   %eax, %xmm4
>         vpinsrb $1, 51(%rsi), %xmm4, %xmm4
>         vpunpcklwd      %xmm6, %xmm3, %xmm3
>         vpunpcklwd      %xmm4, %xmm2, %xmm2
>         vpunpcklwd      %xmm7, %xmm0, %xmm0
>         vpunpcklwd      %xmm5, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm1, %xmm1
>         vpunpckldq      %xmm3, %xmm0, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> ---
> 
> it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost
> should be 92(23 * 4).

So the important part for any target is that it makes the scalar and
vector costs apples and apples because they end up being compared
against each other.  For loops the most important metric tends to be
latency which is also the only thing that can be reasonably costed
when looking at a single statement at a time.  For all other factors
coming in there's (in theory) the finish_cost hook where, after
gathering individual stmt data from add_stmt_cost, a target hook can
apply adjustments based on say functional unit allocation (IIRC
the powerpc backend looks whether there are "many" shifts and
disparages vectorization in that case).

For the vector construction the x86 backend does a reasonable job
in costing - the only thing that's not very well modeled is the
extra cost of constructing from values in GPRs compared to
values in XMM regs (on some CPU archs that even has extra penalties).
But as seen above, "GPR" values can also come from memory, where
the difference vanishes (for AVX, not for SSE).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2020-09-28  6:48 ` rguenth at gcc dot gnu.org
@ 2020-09-28  6:59 ` rguenth at gcc dot gnu.org
  2020-09-28 12:54 ` linkw at gcc dot gnu.org
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-28  6:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #31 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
returned from passes that see cleanup opportunities and have the pass
manager queue that up, looking for a specially marked pass and enabling
that so we could have

          NEXT_PASS (pass_predcom);
          NEXT_PASS (pass_complete_unroll);
          NEXT_PASS (pass_scalar_cleanup);
          PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
            NEXT_PASS (pass_fre, false /* may_iterate */);
            NEXT_PASS (pass_dse);
          POP_INSERT_PASSES ();

with pass_scalar_cleanup gate() returning false otherwise.  Eventually
pass properties would match this better, or sth else.

That said, running a cleanup on the whole function should be done via
a separate pass - running a cleanup on a sub-CFG can be done from
within another pass.  But mind that sub-CFG cleanup really has to be
of O(size-of-sub-CFG), otherwise it doesn't help.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2020-09-28  6:59 ` rguenth at gcc dot gnu.org
@ 2020-09-28 12:54 ` linkw at gcc dot gnu.org
  2020-09-28 13:27 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-28 12:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #32 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #31)
> (In reply to Kewen Lin from comment #29)
> > (In reply to Hongtao.liu from comment #28)
> > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > > 
> > > Yes, it's the place.
> > > 
> > > > is UB to UH conversion statement, further check if the def of the input UB
> > > > is MEM.
> > > 
> > > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > > guess later optimizations for the purpose of more accurate vectorization
> > > cost model, :(.
> > 
> > Yeah, it's hard sadly. The generic cost modeling is rough,
> > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > :)), if you want to check it more, it seems doable in target specific hook
> > finish_cost where you can get the whole vinfo object, but it could end up
> > with very heavy analysis and might not be worthy.
> > 
> > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > there too.
> 
> Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> returned from passes that see cleanup opportunities and have the pass
> manager queue that up, looking for a special marked pass and enabling
> that so we could have
> 
>           NEXT_PASS (pass_predcom);
>           NEXT_PASS (pass_complete_unroll);
>           NEXT_PASS (pass_scalar_cleanup);
>           PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
>             NEXT_PASS (pass_fre, false /* may_iterate */);
>             NEXT_PASS (pass_dse);
>           POP_INSERT_PASSES ();
> 
> with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> pass properties would match this better, or sth else.
> 

Thanks for the suggestion! Before cooking the patch, I have one question: it
looks like only updating a function property would be enough, e.g. some pass
sets the property PROP_ok_for_cleanup and the later pass_scalar_cleanup only
runs for a function with this property (checked in its gate). I'm not quite
sure of the reason for the TODO flag TODO_force_next_scalar_cleanup.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2020-09-28 12:54 ` linkw at gcc dot gnu.org
@ 2020-09-28 13:27 ` rguenth at gcc dot gnu.org
  2020-09-29  2:05 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-28 13:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #32)
> (In reply to Richard Biener from comment #31)
> > (In reply to Kewen Lin from comment #29)
> > > (In reply to Hongtao.liu from comment #28)
> > > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > > > 
> > > > Yes, it's the place.
> > > > 
> > > > > is UB to UH conversion statement, further check if the def of the input UB
> > > > > is MEM.
> > > > 
> > > > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > > > guess later optimizations for the purpose of more accurate vectorization
> > > > cost model, :(.
> > > 
> > > Yeah, it's hard sadly. The generic cost modeling is rough,
> > > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > > :)), if you want to check it more, it seems doable in target specific hook
> > > finish_cost where you can get the whole vinfo object, but it could end up
> > > with very heavy analysis and might not be worthy.
> > > 
> > > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > > there too.
> > 
> > Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> > returned from passes that see cleanup opportunities and have the pass
> > manager queue that up, looking for a special marked pass and enabling
> > that so we could have
> > 
> >           NEXT_PASS (pass_predcom);
> >           NEXT_PASS (pass_complete_unroll);
> >           NEXT_PASS (pass_scalar_cleanup);
> >           PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> >             NEXT_PASS (pass_fre, false /* may_iterate */);
> >             NEXT_PASS (pass_dse);
> >           POP_INSERT_PASSES ();
> > 
> > with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> > pass properties would match this better, or sth else.
> > 
> 
> Thanks for the suggestion! Before cooking the patch, I have one question
> that it looks to only update function property is enough, eg: some pass sets
> property PROP_ok_for_cleanup and later pass_scalar_cleanup only goes for the
> func with this property (checking in gate), I'm not quite sure the reason
> for the TODO_flag TODO_force_next_scalar_cleanup.

properties are not an easy fit since they are static in the pass
description while we want to trigger the cleanup only if we unrolled
an outermost loop for example.  Returning TODO_force_next_scalar_cleanup
from cunroll is the natural way to signal this.

The hackish way then would be to "queue" this TODO_force_next_scalar_cleanup
inside the pass manager itself until it runs into a
"scalar cleanup pass" (either somehow marked or just hard-matched), forcing
its gate() to evaluate to true (just skipping its evaluation).

As said, there's no "nice" way for the information flow at the moment.

One could expose the "pending TODO" (TODO not handled by the pass manager
itself) in a global variable (like current_pass) so the cleanup pass
gate() could check

  gate () { return pending_todo & TODO_force_next_scalar_cleanup; }

and in its execute clear this bit from pending_todo.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2020-09-28 13:27 ` rguenth at gcc dot gnu.org
@ 2020-09-29  2:05 ` crazylht at gmail dot com
  2020-09-29 12:27 ` rsandifo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 38+ messages in thread
From: crazylht at gmail dot com @ 2020-09-29  2:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #34 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Kewen Lin from comment #29)
> (In reply to Hongtao.liu from comment #28)
> > > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> > 
> > Yes, it's the place.
> > 
> > > is UB to UH conversion statement, further check if the def of the input UB
> > > is MEM.
> > 
> > Only if there's no multi-use for UB. More generally, it's quite difficult to
> > guess later optimizations for the purpose of more accurate vectorization
> > cost model, :(.
> 
> Yeah, it's hard sadly. The generic cost modeling is rough,
> ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> :)), if you want to check it more, it seems doable in target specific hook
> finish_cost where you can get the whole vinfo object, but it could end up
> with very heavy analysis and might not be worthy.
> 
> Do you mind to check if it can also fix this degradation on x86 to run FRE
> and DSE just after cunroll? I found it worked for Power, hoped it can help
> there too.

No, it's not working for CLX; the problem in the i386 backend is a bit different.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2020-09-29  2:05 ` crazylht at gmail dot com
@ 2020-09-29 12:27 ` rsandifo at gcc dot gnu.org
  2020-11-03  6:28 ` cvs-commit at gcc dot gnu.org
  2020-11-05  2:25 ` linkw at gcc dot gnu.org
  36 siblings, 0 replies; 38+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2020-09-29 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #35 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #24)
> On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com
> <gcc-bugzilla@gcc.gnu.org> wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> >
> >--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> >>One of my workmates found that if we disable vectorization for
> >>SPEC2017 525.x264_r function sub4x4_dct in source file
> >>x264_src/common/dct.c with explicit function attribute
> >>__attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.
> >
> >For CLX, if we disable slp vectorization in sub4x4_dct by 
> >__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
> >up by 4%.
> >
> >> Thanks Richi! Should we take care of this case? or neglect this kind of
> >> extension as "no instruction"? I was intent to handle it in target specific
> >> code, but it isn't recorded into cost vector while it seems too heavy to do
> >> the bb_info slp_instances revisits in finish_cost.
> >
> >For i386 backend unsigned char --> unsigned short is no "no instruction",
> >but in this case
> >---
> >1033  _134 = MEM[(pixel *)pix1_295 + 2B];
> >
> >1034  _135 = (short unsigned int) _134;
> >---
> >
> >It could be combined and optimized to
> >---
> >movzbl  19(%rcx), %r8d
> >---
> >
> >So, if an "unsigned char" variable is loaded from memory, then the conversion
> >would also be "no instruction". I'm not sure if the backend cost model could
> >handle such a situation.
> 
> I think all attempts to address this from the side of the scalar cost are
> going to be difficult and fragile.
Agreed FWIW.  Even in rtl, the kinds of conversion we're talking
about could be removed, such as by proving that the upper bits are
already correct, by combining the extension with other instructions
so that it becomes “free” again, or by ree.  Proving that the upper
bits are already correct isn't uncommon: gimple has to make a choice
between signed and unsigned types even if both choices would be
correct, whereas rtl is sign-agnostic for storage.

So it's not obvious to me that trying to model things at this level is
going to be right more often than it's wrong.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2020-09-29 12:27 ` rsandifo at gcc dot gnu.org
@ 2020-11-03  6:28 ` cvs-commit at gcc dot gnu.org
  2020-11-05  2:25 ` linkw at gcc dot gnu.org
  36 siblings, 0 replies; 38+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-11-03  6:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #36 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:

https://gcc.gnu.org/g:f5e18dd9c7dacc9671044fc669bd5c1b26b6bdba

commit r11-4637-gf5e18dd9c7dacc9671044fc669bd5c1b26b6bdba
Author: Kewen Lin <linkw@gcc.gnu.org>
Date:   Tue Nov 3 02:51:47 2020 +0000

    pass: Run cleanup passes before SLP [PR96789]

    As the discussion in PR96789, we found that some scalar stmts
    which can be eliminated by some passes after SLP, but we still
    modeled their costs when trying to SLP, it could impact
    vectorizer's decision.  One typical case is the case in PR96789
    on target Power.

    As Richard suggested there, this patch is to introduce one pass
    called pre_slp_scalar_cleanup which has some secondary clean up
    passes, for now they are FRE and DSE.  It introduces one new
    TODO flags group called pending TODO flags, unlike normal TODO
    flags, the pending TODO flags are passed down in the pipeline
    until one of its consumers can perform the requested action.
    Consumers should then clear the flags for the actions that they
    have taken.

    Some compilation time statistics on all SPEC2017 INT bmks were
    collected on one Power9 machine for several option sets below:
      A1: -Ofast -funroll-loops
      A2: -O1
      A3: -O1 -funroll-loops
      A4: -O2
      A5: -O2 -funroll-loops

    the corresponding increment rate is trivial:
      A1       A2       A3        A4        A5
      0.08%    0.00%    -0.38%    -0.10%    -0.05%

    Bootstrapped/regtested on powerpc64le-linux-gnu P8.

    gcc/ChangeLog:

            PR tree-optimization/96789
            * function.h (struct function): New member unsigned pending_TODOs.
            * passes.c (class pass_pre_slp_scalar_cleanup): New class.
            (make_pass_pre_slp_scalar_cleanup): New function.
            (pass_data_pre_slp_scalar_cleanup): New pass data.
            * passes.def: (pass_pre_slp_scalar_cleanup): New pass, add
            pass_fre and pass_dse as its children.
            * timevar.def (TV_SCALAR_CLEANUP): New timevar.
            * tree-pass.h (PENDING_TODO_force_next_scalar_cleanup): New
            pending TODO flag.
            (make_pass_pre_slp_scalar_cleanup): New declare.
            * tree-ssa-loop-ivcanon.c (tree_unroll_loops_completely_1):
            Once any outermost loop gets unrolled, flag cfun pending_TODOs
            PENDING_TODO_force_next_scalar_cleanup on.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/96789
            * gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
            * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
            * gcc.dg/vect/bb-slp-41.c: Likewise.
            * gcc.dg/tree-ssa/pr96789.c: New test.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
  2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2020-11-03  6:28 ` cvs-commit at gcc dot gnu.org
@ 2020-11-05  2:25 ` linkw at gcc dot gnu.org
  36 siblings, 0 replies; 38+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-11-05  2:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #37 from Kewen Lin <linkw at gcc dot gnu.org> ---
The degradation should be gone on Power now.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2020-11-05  2:25 UTC | newest]

Thread overview: 38+ messages
2020-08-26  3:19 [Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled linkw at gcc dot gnu.org
2020-08-26  6:53 ` [Bug target/96789] " rguenth at gcc dot gnu.org
2020-08-26  7:13 ` linkw at gcc dot gnu.org
2020-08-27  3:28 ` linkw at gcc dot gnu.org
2020-08-27  6:40 ` rguenth at gcc dot gnu.org
2020-08-27 11:16 ` rguenth at gcc dot gnu.org
2020-08-31  4:05 ` linkw at gcc dot gnu.org
2020-09-16 10:03 ` linkw at gcc dot gnu.org
2020-09-16 11:17 ` rguenth at gcc dot gnu.org
2020-09-16 12:25 ` linkw at gcc dot gnu.org
2020-09-16 13:04 ` rguenth at gcc dot gnu.org
2020-09-17  2:50 ` linkw at gcc dot gnu.org
2020-09-17  5:06 ` linkw at gcc dot gnu.org
2020-09-18  9:11 ` linkw at gcc dot gnu.org
2020-09-18 10:30 ` rguenther at suse dot de
2020-09-18 13:40 ` linkw at gcc dot gnu.org
2020-09-19 15:45 ` crazylht at gmail dot com
2020-09-21  7:14 ` rguenther at suse dot de
2020-09-25 12:46 ` linkw at gcc dot gnu.org
2020-09-25 12:52 ` linkw at gcc dot gnu.org
2020-09-25 12:57 ` rguenth at gcc dot gnu.org
2020-09-25 13:05 ` rguenth at gcc dot gnu.org
2020-09-27  2:56 ` crazylht at gmail dot com
2020-09-27  3:07 ` crazylht at gmail dot com
2020-09-27  6:48 ` rguenther at suse dot de
2020-09-27 10:20 ` linkw at gcc dot gnu.org
2020-09-27 10:36 ` linkw at gcc dot gnu.org
2020-09-27 10:42 ` linkw at gcc dot gnu.org
2020-09-28  5:45 ` crazylht at gmail dot com
2020-09-28  6:40 ` linkw at gcc dot gnu.org
2020-09-28  6:48 ` rguenth at gcc dot gnu.org
2020-09-28  6:59 ` rguenth at gcc dot gnu.org
2020-09-28 12:54 ` linkw at gcc dot gnu.org
2020-09-28 13:27 ` rguenth at gcc dot gnu.org
2020-09-29  2:05 ` crazylht at gmail dot com
2020-09-29 12:27 ` rsandifo at gcc dot gnu.org
2020-11-03  6:28 ` cvs-commit at gcc dot gnu.org
2020-11-05  2:25 ` linkw at gcc dot gnu.org
