* [Bug tree-optimization/98138] New: BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2020-12-04 10:46 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

            Bug ID: 98138
           Summary: BB vect fail to SLP one case
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

Test case:

  extern void test(unsigned int t[4][4]);

  void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
    unsigned int tmp[4][4];
    unsigned int a0, a1, a2, a3;

    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
      a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
      a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
      a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

      int t0 = a0 + a1;
      int t1 = a0 - a1;
      int t2 = a2 + a3;
      int t3 = a2 - a3;

      tmp[i][0] = t0 + t2;
      tmp[i][2] = t0 - t2;
      tmp[i][1] = t1 + t3;
      tmp[i][3] = t1 - t3;
    }
    test(tmp);
  }

The expected code on ppc64le can look like:

  // p1 byte 0 to byte 7 
  d1_0_7 = load_dword(p1)
  // p1+i1 b0 to b7, rename it as 8 to 15       
  d1_8_15 = load_dword(p1 + i1)
  d1_16_23 = load_dword(p1 + 2*i1) 
  d1_24_31 = load_dword(p1 + 3*i1)

  V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
  V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
  V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
                      {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
  V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
                      {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

  // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all

  // Do the subtraction together (all 4x4 bytes)
  V_sub1 = V_d1_0_3_all - V_d2_0_3_all
  V_sub2 = V_d1_4_7_all - V_d2_4_7_all

  // Do some unpack and get the promoted vector int
  V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
  V_a0_1 = V_a0_tmp << 16
  V_a0_0 = vec_promote(V_sub1, {0 1 2 3})   // vector int {b0 b8 b16 b24}
  // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
  V_a0 = V_a0_0 + V_a0_1

  // Get the similar for V_a1, V_a2, V_a3

  // Compute t0/t1/t2/t3
  // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
  V_t0 = V_a0 + V_a1  
  V_t1 = V_a0 - V_a1
  V_t2 = V_a2 + V_a3
  V_t3 = V_a2 - V_a3

  // Compute tmps
  // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
  V_tmp0 = V_t0 + V_t2
  V_tmp2 = V_t0 - V_t2
  V_tmp1 = V_t1 + V_t3
  V_tmp3 = V_t1 - V_t3

  // Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
  // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
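
For reference, the dword loads plus the first byte-gathering permute can be
written with GCC's generic vector extensions.  This is a minimal sketch
following the pseudocode's naming; gather_cols_0_3 and the memcpy-based
dword loads are illustrative assumptions, not what the vectorizer would
actually emit:

  typedef unsigned char v16qi __attribute__((vector_size(16)));

  /* Gather byte k (k=0..3) of rows 0-3 of p into consecutive lanes,
     i.e. build V_d1_0_3_all from the pseudocode above.  */
  static v16qi gather_cols_0_3(const unsigned char *p, int stride)
  {
    v16qi rows01 = { 0 }, rows23 = { 0 };
    __builtin_memcpy(&rows01, p, 8);                          /* d1_0_7   */
    __builtin_memcpy((char *)&rows01 + 8, p + stride, 8);     /* d1_8_15  */
    __builtin_memcpy(&rows23, p + 2 * stride, 8);             /* d1_16_23 */
    __builtin_memcpy((char *)&rows23 + 8, p + 3 * stride, 8); /* d1_24_31 */
    v16qi mask = { 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27 };
    return __builtin_shuffle(rows01, rows23, mask);           /* vperm */
  }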


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2020-12-04 10:52 UTC

--- Comment #1 from Kewen Lin <linkw at gcc dot gnu.org> ---
A similar case is x264_pixel_satd_8x4 in x264:
https://github.com/mirror/x264/blob/4121277b40a667665d4eea1726aefdc55d12d110/common/pixel.c#L288


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: rguenth at gcc dot gnu.org @ 2020-12-04 12:19 UTC

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the expected vectorization builds vectors

 { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }

that's not SLP, SLP tries to build the

 { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }

vector and "succeeds" - the SLP tree turns out to be
highly inefficient though.  So for the stores your desire
is to see an interleaving scheme with VF 4 (the number of
iterations).  But interleaving fails because it would require
a VF of 16 and there are not enough iterations in the loop.

The classical SLP scheme degenerates (also due to the plus/minus
mixed ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3}
expression.

Starting SLP discovery from the grouped loads would get things going
up to the above same expression.

So I'm not sure what the best approach to this case is.  The testcase
can be simplified while still showing the SLP discovery issue:

extern void test(unsigned int t[4][4]);

void foo(int *p1, int i1, int *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      a0 = (p1[0] - p2[0]);
      a1 = (p1[1] - p2[1]);
      a2 = (p1[2] - p2[2]);
      a3 = (p1[3] - p2[3]);

      int t0 = a0 + a1;
      int t1 = a0 - a1;
      int t2 = a2 + a3;
      int t3 = a2 - a3;

      tmp[i][0] = t0 + t2;
      tmp[i][2] = t0 - t2;
      tmp[i][1] = t1 + t3;
      tmp[i][3] = t1 - t3;
  }
  test(tmp);
}

So it's basically SLP discovery degenerating to an interleaving scheme
on the load side but not actually "implementing" it.


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2020-12-07  3:10 UTC

--- Comment #3 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> [...]
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in the current implementation we get four grouped stores:
  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently.

When all these attempts fail, could we retry on the groups
  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
with one extra intermediate layer covering those original groups, then start
from these newly adjusted groups?  The built operands should be isomorphic
then.  Maybe too hackish?


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2021-01-05  8:42 UTC

--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #3)
> When all these attempts fail, could we retry on the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then
> start from these newly adjusted groups?  The built operands should be
> isomorphic then.  Maybe too hackish?

This thought seems impractical in the current framework.  I was wondering
whether we can use another vectorization scheme which is sub-optimal but
fits the current framework better.  It only focuses on the things in one
loop iteration and starts from the grouped stores { tmp[i][0], tmp[i][1],
tmp[i][2], tmp[i][3] }.  It can be:

  w1_0123 = load_word(p1)
  V1_0123 = scalar_to_vec(w1_0123) // also with promotion
  w1_4567 = load_word(p1+4)
  V1_4567 = scalar_to_vec(w1_4567)

  w2_0123 = load_word(p2)
  V2_0123 = scalar_to_vec(w2_0123)
  w2_4567 = load_word(p2+4)
  V2_4567 = scalar_to_vec(w2_4567)

  V_a0123 = (V1_0123-V2_0123) + ((V1_4567-V2_4567) << 16)

  V1 = V_a0123                  // {a0, a1, a2, a3}
  V2 = vperm (V1, <1, 0, 3, 2>) // {a1, a0, a3, a2}

  V3 = V1 - V2  // {t1, -t1, t3, -t3}
  V4 = V1 + V2  // {t0,  t0, t2,  t2}

  V5 = vperm (V3, V4, <4, 0, 6, 2>) // {t0, t1, t2, t3}
  V6 = vperm (V3, V4, <6, 2, 4, 0>) // {t2, t3, t0, t1}

  V7 = V5 - V6 // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
  V8 = V5 + V6 // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}

  V9 = vperm (V7, V8, <4, 5, 0, 1>)

  vec_store (tmp[i], V9)
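
For reference, this per-iteration scheme maps almost directly onto GCC's
generic vector extensions.  A minimal sketch (hadamard4_row is an
illustrative name; the shuffles correspond to V2/V5/V6/V9 above):

  typedef unsigned int v4si __attribute__((vector_size(16)));

  /* a = {a0, a1, a2, a3}; returns {tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3]}.  */
  static v4si hadamard4_row(v4si a)
  {
    v4si b = __builtin_shuffle(a, (v4si){ 1, 0, 3, 2 });  /* {a1, a0, a3, a2} */
    v4si sub = a - b;                                     /* {t1, -t1, t3, -t3} */
    v4si add = a + b;                                     /* {t0,  t0, t2,  t2} */
    v4si v5 = __builtin_shuffle(sub, add, (v4si){ 4, 0, 6, 2 }); /* {t0, t1, t2, t3} */
    v4si v6 = __builtin_shuffle(sub, add, (v4si){ 6, 2, 4, 0 }); /* {t2, t3, t0, t1} */
    return __builtin_shuffle(v5 - v6, v5 + v6, (v4si){ 4, 5, 0, 1 });
  }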

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2) {
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize this simple case.  SLP discovery doesn't
go deep enough to expose the grouped-load chance on the bottom nodes loading
from p1/p2.  It ends early with the nodes { _13, _14, _13, _14 } and
{ _15, _16, _15, _16 } because operands like {a0_28, a0_28, a0_28, a0_28}
etc. satisfy the all_uniform_p condition, so we get "Building parent vector
operands from scalars instead".

One rough idea is:
  1) Relax the all_uniform_p condition somehow to get SLP instance building
to go deeper and get those p1/p2 loads as SLP nodes.
  2) Introduce one more vect_pattern recognizer to catch this kind of
pattern and transform the SLP instance as we expect.  I assume once we know
the whole SLP instance we can transform it as we want here.  Probably need
some costing condition to gate this pattern matching.
  3) If 2) fails, trim the SLP instance at those nodes which satisfy the
all_uniform_p condition to ensure it's the same as before.

Does it sound reasonable?


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2021-01-06  3:29 UTC

--- Comment #5 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #4)
> One rough idea is:
>   1) Relax the all_uniform_p condition somehow to get SLP instance
> building to go deeper and get those p1/p2 loads as SLP nodes.
>   2) Introduce one more vect_pattern recognizer to catch this kind of
> pattern and transform the SLP instance as we expect.  I assume once we
> know the whole SLP instance we can transform it as we want here.
> Probably need some costing condition to gate this pattern matching.
>   3) If 2) fails, trim the SLP instance at those nodes which satisfy the
> all_uniform_p condition to ensure it's the same as before.

For 2), instead of a vect_pattern with an IFN, the appropriate place seems
to be vect_optimize_slp.

But after more thought, building the SLP instance starting from the grouped
loads instead of the grouped stores looks more straightforward.

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

Building the vector <a0, a1, a2, a3> looks more natural; then we can check
the uses of all its lanes for special patterns to get the vector
<t0, t1, t2, t3>, and repeat similarly.

Hi Richi,

Is this a good example to justify building SLP instances starting from
grouped loads?


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: rguenth at gcc dot gnu.org @ 2021-01-06  9:48 UTC

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Starting from the loads is not how SLP discovery works, so there will be
zero re-use of code.  Sure - the only important thing is that you end up
with a valid SLP graph.

But going back to the original testcase and the proposed vectorization
for power - is that faster in the end?

For the "rewrite" of the vectorizer into all-SLP we do have to address
that "interleaving scheme not carried out as interleaving" at some point,
but that's usually for loop vectorization - for BB vectorization all
we have is optimize_slp.  I have patches that would build the vector load
SLP node (you still have to kill that 'build from scalars' thing to make
it trigger ).  But then we end up with a shared vector load node and
N extract/splat operations at the 'scalar' points.  It's not entirely
clear to me how to re-arrange the SLP graph at that point.

Btw, on current trunk the simplified testcase no longer runs into the
'scalar operand' build case, but of course vectorization is deemed not
profitable.  Pattern recognition of the plus/minus subgraphs may help
(not sure if ppc has those as instructions; x86 does).

That said, "failure" to identify the common (vector) load is known
and I do have experimental patches trying to address that but did
not yet arrive at a conclusive "best" approach.


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2021-01-12  7:23 UTC

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> But going back to the original testcase and the proposed vectorization
> for power - is that faster in the end?

Good question.  I coded foo2 and foo3 with altivec built-in functions
following the proposals: foo2 is for the original version (#c0) and foo3 is
for the sub-optimal version (#c4).  They have been verified with some
random inputs to ensure correct functionality.

I evaluated the timings on Power8/Power9/Power10 (pre-release) with option
-Ofast and the corresponding -mcpu=power{8,9,10} for the build.  Using foo1
(the scalar version) as a baseline of 100 and normalizing, the speed-ups
look like below (bigger is better):
    foo1   foo2     foo3
P8  100   163.26   164.97
P9  100   101.15   101.60
P10 100   172.42   159.10

The evaluation shows the vectorized versions can have big gains on Power8
and Power10, but the gain is trivial on Power9.  I tried running the P8
executable on P9; it didn't show any remarkable difference.  There are some
known issues on P9 for fp/vect-intensive workloads; it's probably due to
that, since the vectorized versions are full of vector code.  Sigh, we might
have to use different costs for P9, but that's another story.  Btw, I tried
running the P9 executable on P10, and it gains about as much as the P10 row
shows.  So it's profitable to vectorize this, as the P8 and P10 testing
shows.

> I have patches that would build the vector load
> SLP node (you still have to kill that 'build from scalars' thing to make
> it trigger).  But then we end up with a shared vector load node and
> N extract/splat operations at the 'scalar' points.  It's not entirely
> clear to me how to re-arrange the SLP graph at that point.

Great, it's good to know we can end up with a shared vector load node.  It
looks like we can further simplify the graph on top of that.  For the #c4
simplified case, assume we have two shared vector load nodes:
  N1 = <p1[0], p1[1], p1[2], p1[3]>
  N2 = <p2[0], p2[1], p2[2], p2[3]>

1) If all lanes of one shared vector load node only work with the
corresponding lanes of another shared vector load node, we can move up
(postpone) the extract/splat operations.  For this case, we need 4
extract/splats for p1's lanes, 4 extract/splats for p2's lanes, and 4
minuses for {T,T,T,T} /T=a{0,1,2,3}/; this can be simplified to one shared
node N3 = N1 - N2 (1 minus) and 4 extract/splats (see the sketch after this
list).

2) Then we have the shared vector node <a0,a1,a2,a3>.  Before it is
extracted/splatted to {T,T,T,T} /T=a{0,1,2,3}/, we can check all the uses
of the extracted/splatted lanes, try to detect some pattern, and simplify
further.  For this case, it can be permuted to <a1,a0,a3,a2>; doing the
minus/plus and the permutations yields the nodes N4 <_13, _14, _13, _14>
and N5 <_15, _16, _15, _16>, saving the need for extract/splat.

3) Further, we can detect the pattern on the SLP graph and skip generating
N4 and N5, instead generating <_13, _14, _15, _16> and repeating the
minus/plus and permutations for the required t0/t1/t2/t3 computation.
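
A minimal sketch of 1) with GCC's generic vector extensions (the names
N1/N2/N3 follow the text; splat_diffs is an assumed helper for
illustration, not vectorizer output):

  typedef unsigned int v4si __attribute__((vector_size(16)));

  /* One vector minus on the shared load nodes, then extract/splat each
     lane of the single difference vector N3 = <a0, a1, a2, a3>.  */
  static void splat_diffs(v4si n1, v4si n2, v4si t[4])
  {
    v4si n3 = n1 - n2;                             /* 1 minus instead of 4 */
    for (int i = 0; i < 4; i++)
      t[i] = (v4si){ n3[i], n3[i], n3[i], n3[i] }; /* extract + splat */
  }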

> Btw, on current trunk the simplified testcase no longer runs into the
> 'scalar operand' build case, but of course vectorization is deemed not
> profitable.  Pattern recognition of the plus/minus subgraphs may help
> (not sure if ppc has those as instructions; x86 does).

Did you mean the simplified case in #c4?  I rebased onto the latest trunk
(r11-6601) but still ran into the 'scalar operand' build.  Unfortunately
ppc doesn't support that kind of instruction, sigh.

> 
> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

Thanks for the information!


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: linkw at gcc dot gnu.org @ 2021-01-12  7:25 UTC

--- Comment #8 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 49942
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49942&action=edit
vectorized with altivec built-in functions


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: rguenth at gcc dot gnu.org @ 2021-08-04 10:31 UTC

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
The full satd_8x4 looks like the following; the 2nd loop isn't to be
disregarded:

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: ktkachov at gcc dot gnu.org @ 2022-07-06 10:39 UTC

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-07-06
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #10 from ktkachov at gcc dot gnu.org ---
Note that current clang does a pretty decent job on this now on aarch64 (in
case it gives some inspiration on the approach):
https://godbolt.org/z/EPvqMhh7v


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: manolis.tsamis at vrull dot eu @ 2023-02-01  8:19 UTC

manolis.tsamis at vrull dot eu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |manolis.tsamis at vrull dot eu

--- Comment #11 from manolis.tsamis at vrull dot eu ---
> The full satd_8x4 looks like the following; the 2nd loop isn't to be
> disregarded

Regarding the second loop, this patch
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608827.html should
result in improved vectorization and performance.

Currently, ((a>>15)&0x10001)*0xffff from abs2 is calculated using 4 vector
operations (shift, bitand, shift+sub for the multiplication), whereas with
the proposed patch this will be a single vector compare operation.
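
For reference, the identity the patch exploits, sketched with GCC's generic
vector extensions (an illustration under the assumption that the uint32
lanes are viewed as pairs of 16-bit halves; abs2_mask is a made-up name):

  typedef short v8hi __attribute__((vector_size(16)));
  typedef unsigned int v4si __attribute__((vector_size(16)));

  /* Per 16-bit lane x: ((x >> 15) & 1) * 0xffff == (x < 0 ? 0xffff : 0),
     so the shift/and/multiply chain is just a signed compare with zero.  */
  static v4si abs2_mask(v4si a)
  {
    v8hi h = (v8hi)a;            /* view each uint32 as two 16-bit halves */
    v8hi s = h < (v8hi){ 0 };    /* all-ones (0xffff) where negative */
    return (v4si)s;              /* == ((a>>15)&0x10001)*0xffff per lane */
  }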


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: jiangning.liu at amperecomputing dot com @ 2023-10-04 22:37 UTC

--- Comment #12 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
Hi Richi,

> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

It was a long time ago, so do you have the "best" approach now?

Thanks,
-Jiangning


* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: rguenth at gcc dot gnu.org @ 2023-10-05  6:58 UTC

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jiangning Liu from comment #12)
> Hi Richi,
> 
> > That said, "failure" to identify the common (vector) load is known
> > and I do have experimental patches trying to address that but did
> > not yet arrive at a conclusive "best" approach.
> 
> It was a long time ago, so do you have the "best" approach now?

I do have ideas, but how it will play out is unknown.  Also, the current
priority is to improve maintainability of the code by getting rid of the
paths in the vectorizer that use the non-SLP data structures.

Next would be to redo loop SLP discovery from scratch, which would possibly
include fixing this issue.



* [Bug tree-optimization/98138] BB vect fail to SLP one case
From: rguenth at gcc dot gnu.org @ 2023-10-09  7:38 UTC

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, previous work is at refs/users/rguenth/heads/load-perm

