public inbox for gcc@gcc.gnu.org
* How to extend SLP to support this case
@ 2020-03-10  6:52 Kewen.Lin
  2020-03-10 11:12 ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Kewen.Lin @ 2020-03-10  6:52 UTC (permalink / raw)
  To: GCC Development
  Cc: Richard Sandiford, Richard Biener, Segher Boessenkool, Bill Schmidt

Hi all,

I'm investigating whether GCC can vectorize the below case on ppc64le.

  extern void test(unsigned int t[4][4]);

  void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
    unsigned int tmp[4][4];
    unsigned int a0, a1, a2, a3;

    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
      a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
      a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
      a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

      int t0 = a0 + a1;
      int t1 = a0 - a1;
      int t2 = a2 + a3;
      int t3 = a2 - a3;

      tmp[i][0] = t0 + t2;
      tmp[i][2] = t0 - t2;
      tmp[i][1] = t1 + t3;
      tmp[i][3] = t1 - t3;
    }
    test(tmp);
  }

With unlimited costs, I saw loop-aware SLP can vectorize it, but with very
inefficient code.  It builds the SLP instance from the store group {tmp[i][0]
tmp[i][1] tmp[i][2] tmp[i][3]} and builds nodes {a0, a0, a0, a0},
{a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing the operands
for tmp* and t*.  That means it's unable to form one isomorphic group for
a0, a1, a2, a3, although they appear isomorphic enough to merge.  Even if it
could recognize the over_widening pattern and process two a0 values from two
iterations in parallel, it would still be inefficient (high cost).

In this context, it looks better to build <a0, a1, a2, a3> first by
leveraging the isomorphic computation trees that construct them, e.g.:
  w1_0123 = load_word(p1)
  V1_0123 = construct_vec(w1_0123)
  w1_4567 = load_word(p1 + 4)
  V1_4567 = construct_vec(w1_4567)
  w2_0123 = load_word(p2)
  V2_0123 = construct_vec(w2_0123)
  w2_4567 = load_word(p2 + 4)
  V2_4567 = construct_vec(w2_4567)
  V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16

But how can we teach it to be aware of this?  Currently the processing is
bottom-up (starting from the stores); can we do some analysis on the SLP
instance, detect some pattern, and update the whole instance?

Besides, I also tried whether the standalone SLP pass can handle it, but
it failed: it stops at the a* statements due to incompatible vector types.

The optimal vectorized SLP codes can be:
  // p1 byte 0 to byte 7 
  d1_0_7 = load_dword(p1)
  // p1+i1 bytes 0 to 7, renamed as bytes 8 to 15
  d1_8_15 = load_dword(p1 + i1)
  d1_16_23 = load_dword(p1 + 2*i1) 
  d1_24_31 = load_dword(p1 + 3*i1)

  V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
  V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
  V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
                      {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
  V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
                      {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

  // Do the same for p2 with i2, getting V_d2_0_3_all, V_d2_4_7_all

  // Do the subtraction together (all 4x4 bytes)
  V_sub1 = V_d1_0_3_all - V_d2_0_3_all
  V_sub2 = V_d1_4_7_all - V_d2_4_7_all
  
  // Do some unpack and get the promoted vector int
  V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
  V_a0_1 = V_a0_tmp << 16
  V_a0_0 = vec_promote(V_sub1, {0 1 2 3}) // vector int {b0 b8 b16 b24}
  // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
  V_a0 = V_a0_0 + V_a0_1
  
  // Do the same for V_a1, V_a2, V_a3

  // Compute t0/t1/t2/t3
  // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
  V_t0 = V_a0 + V_a1  
  V_t1 = V_a0 - V_a1
  V_t2 = V_a2 + V_a3
  V_t3 = V_a2 - V_a3

  // Compute tmps
  // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
  V_tmp0 = V_t0 + V_t2
  V_tmp2 = V_t0 - V_t2
  V_tmp1 = V_t1 + V_t3
  V_tmp3 = V_t1 - V_t3
    
  // Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
  // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3

From the above, the key thing is to group tmp[i][j], i=/0,1,2,3/, together, e.g.:
  tmp[i][0] i=/0,1,2,3/ (one group)
  tmp[i][1] i=/0,1,2,3/ (one group)
  tmp[i][2] i=/0,1,2,3/ (one group)
  tmp[i][3] i=/0,1,2,3/ (one group)

where each tmp[i][j] group has the same isomorphic computations.  But
currently SLP is unable to divide the group this way (call it A-way for now).

It's understandable, since it prefers the adjacent store groups:
  tmp[0][i] i=/0,1,2,3/ (one group)
  tmp[1][i] i=/0,1,2,3/ (one group)
  tmp[2][i] i=/0,1,2,3/ (one group)
  tmp[3][i] i=/0,1,2,3/ (one group)

A-way requires some additional vector permutations.  However, I thought
that if the existing scheme can't find any SLP chances, it looks reasonable
to extend it to consider this A-way grouping.  Does that make sense?

Another question: even if we can go with A-way grouping, it can only
handle packing one byte (four iterations -> 4 lanes).  Where is a
reasonable place to extend it to pack more?  How about scanning all leaf
nodes and considering/transforming them together?  Is that too hacky?

Any comments are highly appreciated.  Thanks in advance!


BR,
Kewen


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to extend SLP to support this case
  2020-03-10  6:52 How to extend SLP to support this case Kewen.Lin
@ 2020-03-10 11:12 ` Richard Biener
  2020-03-10 11:14   ` Richard Biener
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Richard Biener @ 2020-03-10 11:12 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Development, Richard Sandiford, Segher Boessenkool, Bill Schmidt

On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi all,
>
> I'm investigating whether GCC can vectorize the below case on ppc64le.
>
>   extern void test(unsigned int t[4][4]);
>
>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>   {
>     unsigned int tmp[4][4];
>     unsigned int a0, a1, a2, a3;
>
>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>
>       int t0 = a0 + a1;
>       int t1 = a0 - a1;
>       int t2 = a2 + a3;
>       int t3 = a2 - a3;
>
>       tmp[i][0] = t0 + t2;
>       tmp[i][2] = t0 - t2;
>       tmp[i][1] = t1 + t3;
>       tmp[i][3] = t1 - t3;
>     }
>     test(tmp);
>   }
>
> With unlimited costs, I saw loop aware SLP can vectorize it but with very
> inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
> tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0},
> {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
> for tmp* and t*.  It means it's unable to make the isomorphic group for
> a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
> recognize over_widening pattern and do some parallel for two a0 from two
> iterations, but it's still inefficient (high cost).
>
> In this context, it looks better to build <a0, a1, a2, a3> first by
> leveraging isomorphic computation trees constructing them, eg:
>   w1_0123 = load_word(p1)
>   V1_0123 = construct_vec(w1_0123)
>   w1_4567 = load_word(p1 + 4)
>   V1_4567 = construct_vec(w1_4567)
>   w2_0123 = load_word(p2)
>   V2_0123 = construct_vec(w2_0123)
>   w2_4567 = load_word(p2 + 4)
>   V2_4567 = construct_vec(w2_4567)
>   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
>
> But how to teach it to be aware of this? Currently the processing starts
> from bottom to up (from stores), can we do some analysis on the SLP
> instance, detect some pattern and update the whole instance?

In theory yes (Tamar had something like that for AARCH64 complex
rotations IIRC).  And yes, the issue boils down to how we handle
SLP discovery.  I'd like to improve SLP discovery but it's on my list
only after I managed to get rid of the non-SLP code paths.  I have
played with some ideas (even produced hackish patches) to find
"seeds" to form SLP groups from, using multi-level hashing of stmts.
My plan is to rewrite SLP discovery completely, starting from a
SLP graph that 1:1 reflects the SSA use-def graph (no groups
formed yet) and then form groups from seeds and insert
"connection" nodes (merging two subgraphs with less lanes
into one with more lanes, splitting lanes, permuting lanes, etc.).

Currently I'm working on doing exactly this but only for SLP loads
(because that's where it's most difficult...).

> Besides, I also tried whether the standalone SLP pass can handle it,
> it failed to. It stops at a* due to incompatible vector types.
>
> The optimal vectorized SLP codes can be:
>   // p1 byte 0 to byte 7
>   d1_0_7 = load_dword(p1)
>   // p1+i1 b0 to b7, rename it as 8 to 15
>   d1_8_15 = load_dword(p1 + i1)
>   d1_16_23 = load_dword(p1 + 2*i1)
>   d1_24_31 = load_dword(p1 + 3*i1)
>
>   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
>   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
>   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
>                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
>   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
>                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
>
>   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
>
>   // Do the subtraction together (all 4x4 bytes)
>   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
>   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
>
>   // Do some unpack and get the promoted vector int
>   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
>   V_a0_1 = V_a0_tmp << 16
>   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
>   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
>   V_a0 = V_a0_0 + V_a0_1
>
>   // Get the similar for V_a1, V_a2, V_a3
>
>   // Compute t0/t1/t2/t3
>   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
>   V_t0 = V_a0 + V_a1
>   V_t1 = V_a0 - V_a1
>   V_t2 = V_a2 + V_a3
>   V_t3 = V_a2 - V_a3
>
>   // Compute tmps
>   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
>   V_tmp0 = V_t0 + V_t2
>   V_tmp2 = V_t0 - V_t2
>   V_tmp1 = V_t1 + V_t3
>   V_tmp3 = V_t1 - V_t3
>
>   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
>   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
>
> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
>   tmp[i][0] i=/0,1,2,3/ (one group)
>   tmp[i][1] i=/0,1,2,3/ (one group)
>   tmp[i][2] i=/0,1,2,3/ (one group)
>   tmp[i][3] i=/0,1,2,3/ (one group)
>
> which tmp[i][j] group have the same isomorphic computations.  But currently
> SLP is unable to divide group like this way. (call it as A-way for now)
>
> It's understandable since it has better adjacent store groups like,
>   tmp[0][i] i=/0,1,2,3/ (one group)
>   tmp[1][i] i=/0,1,2,3/ (one group)
>   tmp[2][i] i=/0,1,2,3/ (one group)
>   tmp[3][i] i=/0,1,2,3/ (one group)
>
> A-way requires some additional vector permutations.  However, I thought
> if the existing scheme can't get any SLP chances, it looks reasonable to
> extend it to consider this A-way grouping.  Does it make sense?
>
> Another question is that even if we can go with A-way grouping, it can
> only handle packing one byte (four iteration -> 4), what place is
> reasonable to extend it to pack more?  How about scaning all leaf
> nodes and consider/transform them together?  too hacky?

Well, not sure - in the end it will all be heuristics since otherwise
the exploration space is too big.  But surely concentrating on
load/store groups is good.

The SLP discovery code is already quite complicated btw., I'd
hate to add "unstructured" hacks on top of it right now without
future design goals in mind.

Richard.

> Any comments are highly appreciated.  Thanks in advance!
>
>
> BR,
> Kewen
>


* Re: How to extend SLP to support this case
  2020-03-10 11:12 ` Richard Biener
@ 2020-03-10 11:14   ` Richard Biener
  2020-03-11  5:34     ` Kewen.Lin
  2020-03-10 11:31   ` Tamar Christina
  2020-03-11  1:56   ` Kewen.Lin
  2 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2020-03-10 11:14 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Development, Richard Sandiford, Segher Boessenkool, Bill Schmidt

On Tue, Mar 10, 2020 at 12:12 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> >   extern void test(unsigned int t[4][4]);
> >
> >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> >   {
> >     unsigned int tmp[4][4];
> >     unsigned int a0, a1, a2, a3;
> >
> >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >       int t0 = a0 + a1;
> >       int t1 = a0 - a1;
> >       int t2 = a2 + a3;
> >       int t3 = a2 - a3;
> >
> >       tmp[i][0] = t0 + t2;
> >       tmp[i][2] = t0 - t2;
> >       tmp[i][1] = t1 + t3;
> >       tmp[i][3] = t1 - t3;
> >     }
> >     test(tmp);
> >   }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with very
> > inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
> > tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0},
> > {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
> > for tmp* and t*.  It means it's unable to make the isomorphic group for
> > a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
> > recognize over_widening pattern and do some parallel for two a0 from two
> > iterations, but it's still inefficient (high cost).
> >
> > In this context, it looks better to build <a0, a1, a2, a3> first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this? Currently the processing starts
> > from bottom to up (from stores), can we do some analysis on the SLP
> > instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
>
> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).
>
> > Besides, I also tried whether the standalone SLP pass can handle it,
> > it failed to. It stops at a* due to incompatible vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
> >   d1_16_23 = load_dword(p1 + 2*i1)
> >   d1_24_31 = load_dword(p1 + 3*i1)
> >
> >   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
> >   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
> >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> >
> >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> >
> >   // Do the subtraction together (all 4x4 bytes)
> >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> >
> >   // Do some unpack and get the promoted vector int
> >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
> >   V_a0_1 = V_a0_tmp << 16
> >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
> >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> >   V_a0 = V_a0_0 + V_a0_1
> >
> >   // Get the similar for V_a1, V_a2, V_a3
> >
> >   // Compute t0/t1/t2/t3
> >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> >   V_t0 = V_a0 + V_a1
> >   V_t1 = V_a0 - V_a1
> >   V_t2 = V_a2 + V_a3
> >   V_t3 = V_a2 - V_a3
> >
> >   // Compute tmps
> >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> >   V_tmp0 = V_t0 + V_t2
> >   V_tmp2 = V_t0 - V_t2
> >   V_tmp1 = V_t1 + V_t3
> >   V_tmp3 = V_t1 - V_t3
> >
> >   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> >   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> >
> > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
> >   tmp[i][0] i=/0,1,2,3/ (one group)
> >   tmp[i][1] i=/0,1,2,3/ (one group)
> >   tmp[i][2] i=/0,1,2,3/ (one group)
> >   tmp[i][3] i=/0,1,2,3/ (one group)
> >
> > which tmp[i][j] group have the same isomorphic computations.  But currently
> > SLP is unable to divide group like this way. (call it as A-way for now)
> >
> > It's understandable since it has better adjacent store groups like,
> >   tmp[0][i] i=/0,1,2,3/ (one group)
> >   tmp[1][i] i=/0,1,2,3/ (one group)
> >   tmp[2][i] i=/0,1,2,3/ (one group)
> >   tmp[3][i] i=/0,1,2,3/ (one group)

Note this is how the non-SLP path will (try to) vectorize the loop.

> > A-way requires some additional vector permutations.  However, I thought
> > if the existing scheme can't get any SLP chances, it looks reasonable to
> > extend it to consider this A-way grouping.  Does it make sense?
> >
> > Another question is that even if we can go with A-way grouping, it can
> > only handle packing one byte (four iteration -> 4), what place is
> > reasonable to extend it to pack more?  How about scaning all leaf
> > nodes and consider/transform them together?  too hacky?
>
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big.  But surely concentrating on
> load/store groups is good.
>
> The SLP discovery code is already quite complicated btw., I'd
> hate to add "unstructured" hacks ontop of it right now without
> future design goals in mind.
>
> Richard.
>
> > Any comments are highly appreciated.  Thanks in advance!
> >
> >
> > BR,
> > Kewen
> >


* RE: How to extend SLP to support this case
  2020-03-10 11:12 ` Richard Biener
  2020-03-10 11:14   ` Richard Biener
@ 2020-03-10 11:31   ` Tamar Christina
  2020-03-11  3:58     ` Kewen.Lin
  2020-03-13 11:57     ` Richard Biener
  2020-03-11  1:56   ` Kewen.Lin
  2 siblings, 2 replies; 9+ messages in thread
From: Tamar Christina @ 2020-03-10 11:31 UTC (permalink / raw)
  To: Richard Biener, Kewen.Lin; +Cc: GCC Development, Segher Boessenkool


> -----Original Message-----
> From: Gcc <gcc-bounces@gcc.gnu.org> On Behalf Of Richard Biener
> Sent: Tuesday, March 10, 2020 11:12 AM
> To: Kewen.Lin <linkw@linux.ibm.com>
> Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
> <segher@kernel.crashing.org>
> Subject: Re: How to extend SLP to support this case
> 
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> >   extern void test(unsigned int t[4][4]);
> >
> >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> >   {
> >     unsigned int tmp[4][4];
> >     unsigned int a0, a1, a2, a3;
> >
> >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >       int t0 = a0 + a1;
> >       int t1 = a0 - a1;
> >       int t2 = a2 + a3;
> >       int t3 = a2 - a3;
> >
> >       tmp[i][0] = t0 + t2;
> >       tmp[i][2] = t0 - t2;
> >       tmp[i][1] = t1 + t3;
> >       tmp[i][3] = t1 - t3;
> >     }
> >     test(tmp);
> >   }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with
> > very inefficient codes.  It builds the SLP instance from store group
> > {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0,
> > a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after
> > parsing operands for tmp* and t*.  It means it's unable to make the
> > isomorphic group for a0, a1, a2, a3, although they appears isomorphic
> > to merge.  Even if it can recognize over_widening pattern and do some
> > parallel for two a0 from two iterations, but it's still inefficient (high cost).
> >
> > In this context, it looks better to build <a0, a1, a2, a3> first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this? Currently the processing
> > starts from bottom to up (from stores), can we do some analysis on the
> > SLP instance, detect some pattern and update the whole instance?
> 
> In theory yes (Tamar had something like that for AARCH64 complex rotations
> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd like
> to improve SLP discovery but it's on my list only after I managed to get rid of
> the non-SLP code paths.  I have played with some ideas (even produced
> hackish patches) to find "seeds" to form SLP groups from using multi-level
> hashing of stmts.

I still have this but missed the stage-1 deadline after doing the rewriting to C++ 😊

We've also been looking at this, and the approach I'm investigating now is trying to get
the SLP codepath to handle this after it's been fully unrolled.  I'm looking into whether
build-slp can be improved to work for the group size == 16 case that it currently tries
but fails on.

My intention is to see if doing so would make it simpler to recognize this as just 4 linear
loads and two permutes.  I think loop-aware SLP will have a much harder time with this,
given the load permutations it thinks it needs because of the permutes caused by the +/-
pattern.

One idea I had before, from your comment on the complex number patch, is to try to
move up TWO_OPERATORS and always undo the permute when doing +/-.  This would simplify
the load permute handling, and if a target doesn't have an instruction to support this it would
just fall back to doing an explicit permute after the loads.  But I wasn't sure this approach
would get me the results I wanted.

In the end you don't want a loop here at all.  And in order to do the above with TWO_OPERATORS I
would have to let the SLP pattern matcher reduce the group size and increase the number of
iterations during the matching, otherwise the matching itself becomes quite difficult in certain cases.

Tamar

> My plan is to rewrite SLP discovery completely, starting from a SLP graph that
> 1:1 reflects the SSA use-def graph (no groups formed yet) and then form
> groups from seeds and insert "connection" nodes (merging two subgraphs
> with less lanes into one with more lanes, splitting lanes, permuting lanes,
> etc.).
> 
> Currently I'm working on doing exactly this but only for SLP loads (because
> that's where it's most difficult...).
> 
> > Besides, I also tried whether the standalone SLP pass can handle it,
> > it failed to. It stops at a* due to incompatible vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
> >   d1_16_23 = load_dword(p1 + 2*i1)
> >   d1_24_31 = load_dword(p1 + 3*i1)
> >
> >   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
> >   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
> >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
> >                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> >
> >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> >
> >   // Do the subtraction together (all 4x4 bytes)
> >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> >
> >   // Do some unpack and get the promoted vector int
> >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
> >   V_a0_1 = V_a0_tmp << 16
> >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
> >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> >   V_a0 = V_a0_0 + V_a0_1
> >
> >   // Get the similar for V_a1, V_a2, V_a3
> >
> >   // Compute t0/t1/t2/t3
> >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> >   V_t0 = V_a0 + V_a1
> >   V_t1 = V_a0 - V_a1
> >   V_t2 = V_a2 + V_a3
> >   V_t3 = V_a2 - V_a3
> >
> >   // Compute tmps
> >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> >   V_tmp0 = V_t0 + V_t2
> >   V_tmp2 = V_t0 - V_t2
> >   V_tmp1 = V_t1 + V_t3
> >   V_tmp3 = V_t1 - V_t3
> >
> >   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> >   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> >
> > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
> >   tmp[i][0] i=/0,1,2,3/ (one group)
> >   tmp[i][1] i=/0,1,2,3/ (one group)
> >   tmp[i][2] i=/0,1,2,3/ (one group)
> >   tmp[i][3] i=/0,1,2,3/ (one group)
> >
> > which tmp[i][j] group have the same isomorphic computations.  But
> > currently SLP is unable to divide group like this way. (call it as
> > A-way for now)
> >
> > It's understandable since it has better adjacent store groups like,
> >   tmp[0][i] i=/0,1,2,3/ (one group)
> >   tmp[1][i] i=/0,1,2,3/ (one group)
> >   tmp[2][i] i=/0,1,2,3/ (one group)
> >   tmp[3][i] i=/0,1,2,3/ (one group)
> >
> > A-way requires some additional vector permutations.  However, I
> > thought if the existing scheme can't get any SLP chances, it looks
> > reasonable to extend it to consider this A-way grouping.  Does it make
> sense?
> >
> > Another question is that even if we can go with A-way grouping, it can
> > only handle packing one byte (four iteration -> 4), what place is
> > reasonable to extend it to pack more?  How about scaning all leaf
> > nodes and consider/transform them together?  too hacky?
> 
> Well, not sure - in the end it will all be heuristics since otherwise the
> exploration space is too big.  But surely concentrating on load/store groups is
> good.
> 
> The SLP discovery code is already quite complicated btw., I'd hate to add
> "unstructured" hacks ontop of it right now without future design goals in
> mind.
> 
> Richard.
> 
> > Any comments are highly appreciated.  Thanks in advance!
> >
> >
> > BR,
> > Kewen
> >


* Re: How to extend SLP to support this case
  2020-03-10 11:12 ` Richard Biener
  2020-03-10 11:14   ` Richard Biener
  2020-03-10 11:31   ` Tamar Christina
@ 2020-03-11  1:56   ` Kewen.Lin
  2 siblings, 0 replies; 9+ messages in thread
From: Kewen.Lin @ 2020-03-11  1:56 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Development, Richard Sandiford, Segher Boessenkool, Bill Schmidt

Hi Richi,

on 2020/3/10 7:12 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi all,
>>
>> But how to teach it to be aware of this? Currently the processing starts
>> from bottom to up (from stores), can we do some analysis on the SLP
>> instance, detect some pattern and update the whole instance?
> 
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
> 

Nice!  Thanks for the information!  This improvement sounds big and
promising!  If we can discover SLP opportunities originating from loads,
this case won't be a big deal.

> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).

Could this case be an input for your work, since the isomorphic
computations are easy to detect from the loads?

>> A-way requires some additional vector permutations.  However, I thought
>> if the existing scheme can't get any SLP chances, it looks reasonable to
>> extend it to consider this A-way grouping.  Does it make sense?
>>
>> Another question is that even if we can go with A-way grouping, it can
>> only handle packing one byte (four iteration -> 4), what place is
>> reasonable to extend it to pack more?  How about scaning all leaf
>> nodes and consider/transform them together?  too hacky?
> 
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big.  But surely concentrating on
> load/store groups is good.
> 

Totally agreed.  This hacky idea originated from the existing code; if SLP
discovery improves, I think it will be unnecessary.

> The SLP discovery code is already quite complicated btw., I'd
> hate to add "unstructured" hacks ontop of it right now without
> future design goals in mind.

OK.  Looking forward to its landing!

BR,
Kewen



* Re: How to extend SLP to support this case
  2020-03-10 11:31   ` Tamar Christina
@ 2020-03-11  3:58     ` Kewen.Lin
  2020-03-13 11:57     ` Richard Biener
  1 sibling, 0 replies; 9+ messages in thread
From: Kewen.Lin @ 2020-03-11  3:58 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Richard Biener, GCC Development, Segher Boessenkool

Hi Tamar,

on 2020/3/10 7:31 PM, Tamar Christina wrote:
> 
>> -----Original Message-----
>> From: Gcc <gcc-bounces@gcc.gnu.org> On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin <linkw@linux.ibm.com>
>> Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
>> <segher@kernel.crashing.org>
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this? Currently the processing
>>> starts from bottom to up (from stores), can we do some analysis on the
>>> SLP instance, detect some pattern and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AARCH64 complex rotations
>> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd like
>> to improve SLP discovery but it's on my list only after I managed to get rid of
>> the non-SLP code paths.  I have played with some ideas (even produced
>> hackish patches) to find "seeds" to form SLP groups from using multi-level
>> hashing of stmts.
> 
> I still have this but missed the stage-1 deadline after doing the rewriting to C++ 😊
> 
> We've also been looking at this and the approach I'm investigating now is trying to get
> the SLP codepath to handle this after it's been fully unrolled. I'm looking into whether
> the build-slp can be improved to work for the group size == 16 case that it tries but fails
> on. 
> 

Thanks!  Glad to know you have been working on this!

Yes, I saw the standalone SLP pass eventually split the group (16 store stmts).

> My intention is to see if doing so would make it simpler to recognize this as just 4 linear
> loads and two permutes. I think the loop aware SLP will have a much harder time with this
> seeing the load permutations it thinks it needs because of the permutes caused by the +/-
> pattern.

I may be missing something; just to double confirm, do you mean 4 linear loads
for each of p1/p2?  In the optimal vectorized version, p1 and p2 each have 4
separate loads, then vector construction and further permutations.

> 
> One Idea I had before was from your comment on the complex number patch, which is to try
> and move up TWO_OPERATORS and undo the permute always when doing +/-. This would simplify
> the load permute handling and if a target doesn't have an instruction to support this it would just
> fall back to doing an explicit permute after the loads.  But I wasn't sure this approach would get me the
> results I wanted.
> 

IIUC, we have to seek either <a0, a1, a2, a3> or <a0_iter0, a0_iter1, a0_iter2, a0_iter3> ...,
since either form can leverage the isomorphic byte loads, subtractions, shifts and additions.

I was thinking the SLP pattern matcher could detect the pattern with two levels of
TWO_OPERATORS, one level for t/0,1,2,3/ and the other for a/0,1,2,3/, together with the
dependent isomorphic computations for a/0,1,2,3/, and transform it into isomorphic
subtraction, promotion to int, shift and addition.
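To make the shape concrete, here is a minimal scalar model (illustration only, not
vectorizer code; the names a0..a3/t0..t3 and the store order follow the example at
the top of this thread) of the two nested TWO_OPERATORS levels the matcher would
need to recognize:

```cpp
#include <array>
#include <cassert>

// Scalar model of the two nested TWO_OPERATORS levels: a +/- butterfly
// over a0..a3 producing t0..t3, then a second +/- butterfly producing
// one row tmp[i][0..3].  Note the stores blend the second level so the
// output lane order is t0+t2, t1+t3, t0-t2, t1-t3.
std::array<int, 4> two_level_butterfly (const std::array<int, 4> &a)
{
  int t0 = a[0] + a[1];  // first level: alternating add/sub pairs
  int t1 = a[0] - a[1];
  int t2 = a[2] + a[3];
  int t3 = a[2] - a[3];
  // second level, in store order tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3]
  return { t0 + t2, t1 + t3, t0 - t2, t1 - t3 };
}
```

Each output lane is what one SLP lane of the store group computes; the pattern
matcher would have to see past the +/- blend at both levels.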

> In the end you don't want a loop here at all. And in order to do the above with TWO_OPERATORS I would
> have to let the SLP pattern matcher be able to reduce the group size and increase the no# iterations during
> the matching otherwise the matching itself becomes quite difficult in certain cases..
> 

OK, it sounds like that is unable to reach the optimal version, which requires all
16 bytes (bytes 0-3 or 4-7 across 4 iterations).

BR,
Kewen


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to extend SLP to support this case
  2020-03-10 11:14   ` Richard Biener
@ 2020-03-11  5:34     ` Kewen.Lin
  0 siblings, 0 replies; 9+ messages in thread
From: Kewen.Lin @ 2020-03-11  5:34 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Development, Richard Sandiford, Segher Boessenkool, Bill Schmidt

Hi Richi,

on 2020/3/10 7:14 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 12:12 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi all,
>>>
>>> I'm investigating whether GCC can vectorize the below case on ppc64le.
>>>
>>>   extern void test(unsigned int t[4][4]);
>>>
>>>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>>>   {
>>>     unsigned int tmp[4][4];
>>>     unsigned int a0, a1, a2, a3;
>>>
>>>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>>>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>>>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>>>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>>>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>>>
>>>       int t0 = a0 + a1;
>>>       int t1 = a0 - a1;
>>>       int t2 = a2 + a3;
>>>       int t3 = a2 - a3;
>>>
>>>       tmp[i][0] = t0 + t2;
>>>       tmp[i][2] = t0 - t2;
>>>       tmp[i][1] = t1 + t3;
>>>       tmp[i][3] = t1 - t3;
>>>     }
>>>     test(tmp);
>>>   }
>>>
...
>>> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
>>>   tmp[i][0] i=/0,1,2,3/ (one group)
>>>   tmp[i][1] i=/0,1,2,3/ (one group)
>>>   tmp[i][2] i=/0,1,2,3/ (one group)
>>>   tmp[i][3] i=/0,1,2,3/ (one group)
>>>
>>> which tmp[i][j] group have the same isomorphic computations.  But currently
>>> SLP is unable to divide group like this way. (call it as A-way for now)
>>>
>>> It's understandable since it has better adjacent store groups like,
>>>   tmp[0][i] i=/0,1,2,3/ (one group)
>>>   tmp[1][i] i=/0,1,2,3/ (one group)
>>>   tmp[2][i] i=/0,1,2,3/ (one group)
>>>   tmp[3][i] i=/0,1,2,3/ (one group)
> 
> Note this is how the non-SLP path will (try to) vectorize the loop.
> 

Oops, sorry for the confusing writing; it was intended to show how the
current SLP groups those 16 stores tmp[i][j] i,j=/0,1,2,3/ when the loop
is completely unrolled.  I saw it eventually split the 16 stmts into 4
groups this way.

BR,
Kewen


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to extend SLP to support this case
  2020-03-10 11:31   ` Tamar Christina
  2020-03-11  3:58     ` Kewen.Lin
@ 2020-03-13 11:57     ` Richard Biener
  2020-03-18 11:37       ` Tamar Christina
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Biener @ 2020-03-13 11:57 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Kewen.Lin, GCC Development, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 9944 bytes --]

On Tue, Mar 10, 2020 at 12:32 PM Tamar Christina <Tamar.Christina@arm.com>
wrote:

>
> > -----Original Message-----
> > From: Gcc <gcc-bounces@gcc.gnu.org> On Behalf Of Richard Biener
> > Sent: Tuesday, March 10, 2020 11:12 AM
> > To: Kewen.Lin <linkw@linux.ibm.com>
> > Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
> > <segher@kernel.crashing.org>
> > Subject: Re: How to extend SLP to support this case
> >
> > On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> > >
> > > Hi all,
> > >
> > > I'm investigating whether GCC can vectorize the below case on ppc64le.
> > >
> > >   extern void test(unsigned int t[4][4]);
> > >
> > >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> > >   {
> > >     unsigned int tmp[4][4];
> > >     unsigned int a0, a1, a2, a3;
> > >
> > >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> > >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> > >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> > >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> > >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> > >
> > >       int t0 = a0 + a1;
> > >       int t1 = a0 - a1;
> > >       int t2 = a2 + a3;
> > >       int t3 = a2 - a3;
> > >
> > >       tmp[i][0] = t0 + t2;
> > >       tmp[i][2] = t0 - t2;
> > >       tmp[i][1] = t1 + t3;
> > >       tmp[i][3] = t1 - t3;
> > >     }
> > >     test(tmp);
> > >   }
> > >
> > > With unlimited costs, I saw loop aware SLP can vectorize it but with
> > > very inefficient codes.  It builds the SLP instance from store group
> > > {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0,
> > > a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after
> > > parsing operands for tmp* and t*.  It means it's unable to make the
> > > isomorphic group for a0, a1, a2, a3, although they appears isomorphic
> > > to merge.  Even if it can recognize over_widening pattern and do some
> > > parallel for two a0 from two iterations, but it's still
> > > inefficient (high cost).
> > >
> > > In this context, it looks better to build <a0, a1, a2, a3> first by
> > > leveraging isomorphic computation trees constructing them, eg:
> > >   w1_0123 = load_word(p1)
> > >   V1_0123 = construct_vec(w1_0123)
> > >   w1_4567 = load_word(p1 + 4)
> > >   V1_4567 = construct_vec(w1_4567)
> > >   w2_0123 = load_word(p2)
> > >   V2_0123 = construct_vec(w2_0123)
> > >   w2_4567 = load_word(p2 + 4)
> > >   V2_4567 = construct_vec(w2_4567)
> > >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> > >
> > > But how to teach it to be aware of this? Currently the processing
> > > starts from bottom to up (from stores), can we do some analysis on the
> > > SLP instance, detect some pattern and update the whole instance?
> >
> > In theory yes (Tamar had something like that for AARCH64 complex
> > rotations IIRC).  And yes, the issue boils down to how we handle SLP
> > discovery.  I'd like to improve SLP discovery but it's on my list
> > only after I managed to get rid of the non-SLP code paths.  I have
> > played with some ideas (even produced hackish patches) to find
> > "seeds" to form SLP groups from using multi-level hashing of stmts.
>
> I still have this but missed the stage-1 deadline after doing the
> rewriting to C++ 😊
>
> We've also been looking at this and the approach I'm investigating
> now is trying to get the SLP codepath to handle this after it's been
> fully unrolled. I'm looking into whether the build-slp can be
> improved to work for the group size == 16 case that it tries but
> fails on.
>
> My intention is to see if doing so would make it simpler to
> recognize this as just 4 linear loads and two permutes. I think the
> loop aware SLP will have a much harder time with this seeing the
> load permutations it thinks it needs because of the permutes caused
> by the +/- pattern.
>
> One Idea I had before was from your comment on the complex number
> patch, which is to try and move up TWO_OPERATORS and undo the
> permute always when doing +/-. This would simplify the load permute
> handling and if a target doesn't have an instruction to support this
> it would just fall back to doing an explicit permute after the
> loads.  But I wasn't sure this approach would get me the results I
> wanted.
>
> In the end you don't want a loop here at all. And in order to do the
> above with TWO_OPERATORS I would have to let the SLP pattern matcher
> be able to reduce the group size and increase the no# iterations
> during the matching otherwise the matching itself becomes quite
> difficult in certain cases.
>

Just to show where I'm heading, I'm attaching current work-in-progress
that introduces an explicit SLP merge node, implementing
SLP_TREE_TWO_OPERATORS that way:

   v1 + v2      v1 - v2
       \              /
          merge

The SLP merge operation concatenates operands in order and then applies
a lane permutation mask (basically a select from the input lanes).  The
actual patch depends on earlier cleanups, and more preparatory changes
are in order to make it "nice".  With such an SLP merge operation in
place it should be straightforward to represent smaller and larger
vectorizations in a single graph (generating optimal code is of course
an entirely different problem).  I'm working on some of the preparatory
refactorings right now, after which I'll post the series up to
SLP_TREE_TWO_OPERATORS.  The next step is then to represent load
permutations that way.
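For readers following along, here is a hypothetical scalar model (not code
from the patch) of what the merge node computes: the children's lanes are
conceptually concatenated in order, and the output is a select over them
driven by (operand index, lane index) pairs, mirroring what
SLP_TREE_LANE_PERMUTATION holds:

```cpp
#include <utility>
#include <vector>

// Scalar model of the SLP merge node: select each output lane from one
// of the children according to an (operand, lane) permutation mask.
std::vector<int>
slp_merge (const std::vector<std::vector<int>> &children,
           const std::vector<std::pair<unsigned, unsigned>> &perm)
{
  std::vector<int> out;
  out.reserve (perm.size ());
  for (const auto &p : perm)
    out.push_back (children[p.first][p.second]);  // pick input lane
  return out;
}
```

With one child holding the v1 + v2 lanes, the other the v1 - v2 lanes, and a
mask {0,0},{1,1},{0,2},{1,3}, this reproduces the alternating blend that
SLP_TREE_TWO_OPERATORS used to describe implicitly.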

Richard.



> Tamar
>
> > My plan is to rewrite SLP discovery completely, starting from a SLP
> > graph that 1:1 reflects the SSA use-def graph (no groups formed yet)
> > and then form groups from seeds and insert "connection" nodes
> > (merging two subgraphs with less lanes into one with more lanes,
> > splitting lanes, permuting lanes, etc.).
> >
> > Currently I'm working on doing exactly this but only for SLP loads
> > (because that's where it's most difficult...).
> >
> > > Besides, I also tried whether the standalone SLP pass can handle it,
> > > it failed to. It stops at a* due to incompatible vector types.
> > >
> > > The optimal vectorized SLP codes can be:
> > >   // p1 byte 0 to byte 7
> > >   d1_0_7 = load_dword(p1)
> > >   // p1+i1 b0 to b7, rename it as 8 to 15
> > >   d1_8_15 = load_dword(p1 + i1)
> > >   d1_16_23 = load_dword(p1 + 2*i1)
> > >   d1_24_31 = load_dword(p1 + 3*i1)
> > >
> > >   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
> > >   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
> > >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
> > >                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> > >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
> > >                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> > >
> > >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> > >
> > >   // Do the subtraction together (all 4x4 bytes)
> > >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> > >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> > >
> > >   // Do some unpack and get the promoted vector int
> > >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
> > >   V_a0_1 = V_a0_tmp << 16
> > >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
> > >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> > >   V_a0 = V_a0_0 + V_a0_1
> > >
> > >   // Get the similar for V_a1, V_a2, V_a3
> > >
> > >   // Compute t0/t1/t2/t3
> > >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> > >   V_t0 = V_a0 + V_a1
> > >   V_t1 = V_a0 - V_a1
> > >   V_t2 = V_a2 + V_a3
> > >   V_t3 = V_a2 - V_a3
> > >
> > >   // Compute tmps
> > >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> > >   V_tmp0 = V_t0 + V_t2
> > >   V_tmp2 = V_t0 - V_t2
> > >   V_tmp1 = V_t1 + V_t3
> > >   V_tmp3 = V_t1 - V_t3
> > >
> > >   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> > >   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> > >
> > > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
> > >   tmp[i][0] i=/0,1,2,3/ (one group)
> > >   tmp[i][1] i=/0,1,2,3/ (one group)
> > >   tmp[i][2] i=/0,1,2,3/ (one group)
> > >   tmp[i][3] i=/0,1,2,3/ (one group)
> > >
> > > which tmp[i][j] group have the same isomorphic computations.  But
> > > currently SLP is unable to divide group like this way. (call it as
> > > A-way for now)
> > >
> > > It's understandable since it has better adjacent store groups like,
> > >   tmp[0][i] i=/0,1,2,3/ (one group)
> > >   tmp[1][i] i=/0,1,2,3/ (one group)
> > >   tmp[2][i] i=/0,1,2,3/ (one group)
> > >   tmp[3][i] i=/0,1,2,3/ (one group)
> > >
> > > A-way requires some additional vector permutations.  However, I
> > > thought if the existing scheme can't get any SLP chances, it looks
> > > reasonable to extend it to consider this A-way grouping.  Does it
> > > make sense?
> > >
> > > Another question is that even if we can go with A-way grouping, it can
> > > only handle packing one byte (four iteration -> 4), what place is
> > > reasonable to extend it to pack more?  How about scanning all leaf
> > > nodes and consider/transform them together?  too hacky?
> >
> > Well, not sure - in the end it will all be heuristics since otherwise the
> > exploration space is too big.  But surely concentrating on
> > load/store groups is good.
> >
> > The SLP discovery code is already quite complicated btw., I'd hate to add
> > "unstructured" hacks ontop of it right now without future design goals in
> > mind.
> >
> > Richard.
> >
> > > Any comments are highly appreciated.  Thanks in advance!
> > >
> > >
> > > BR,
> > > Kewen
> > >
>

[-- Attachment #2: p --]
[-- Type: application/octet-stream, Size: 25790 bytes --]

commit 5953cae56aa7ec4cbf1f8febf848c7273fff3796
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Mar 13 12:37:03 2020 +0100

    remove SLP_TREE_TWO_OPERATORS
    
    This removes the SLP_TREE_TWO_OPERATORS hack in favor of having
    explicit SLP nodes for both computations and the blend operation
    thereby introducing a generic merge + select + permute SLP node
    (with implementation limits).
    
    TODO refactoring:
     - make SLP node sharing work on ->ops[] rather than ->stmts[],
       make SLP_TREE_TWO_OPERATORS lowering nodes properly shared then
     - make use of SLP_TREE_CODE instead of relying on ->stmts[0]
     - add stmt_vec_info_type to the SLP node
     - populate SLP_TREE_TYPE where we otherwise compute the vectype
     - pass down vec_info to more functions, avoiding
       STMT_VINFO_{LOOP,BB}_VINFO (or make more functions member of vinfo)
    
    2020-03-13  Richard Biener  <rguenther@suse.de>
    
            * tree-vectorizer.h (_slp_tree::_slp_tree): Add CTOR.
            (_slp_tree::~_slp_tree): Add DTOR.
            (_slp_tree::classify): New method.
            (_slp_tree::n_lanes): Likewise.
            (_slp_tree::lane_permutation): New member.
            (_slp_tree::vectype): Likewise.
            (_slp_tree::code): Likewise.
            (_slp_tree::two_operators): Remove.
            (SLP_TREE_TWO_OPERATORS): Remove.
            (SLP_TREE_LANE_PERMUTATION): New.
            (SLP_TREE_CODE): Likewise.
            (SLP_TREE_VECTYPE): Likewise.
            * tree-vect-slp.c (_slp_tree::_slp_tree): Implement.
            (_slp_tree::~_slp_tree): Likewise.
            (vect_free_slp_tree): Use delete instead of free.
            (vect_create_new_slp_node): Use new instead of XNEW.
            (_slp_tree::classify): Implement.
            (_slp_tree::n_lanes): Likewise.
            (vect_build_slp_tree_2): When we have two different operators
            build two computation SLP nodes and a blend.
            (...): Adjustments for NULL ->stmts[] hack.
            (vect_find_last_scalar_stmt_in_slp): Check children for
            placement.
            (vectorizable_slp_permutation): New function.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 32e6a0beaed..12eaff0a3ca 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -46,6 +46,39 @@ along with GCC; see the file COPYING3.  If not see
 #include "internal-fn.h"
 
 
+static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
+					  slp_tree, stmt_vector_for_cost *);
+
+/* Initialize a SLP node.  */
+
+_slp_tree::_slp_tree ()
+{
+  SLP_TREE_SCALAR_STMTS (this) = vNULL;
+  SLP_TREE_SCALAR_OPS (this) = vNULL;
+  SLP_TREE_VEC_STMTS (this).create (0);
+  SLP_TREE_NUMBER_OF_VEC_STMTS (this) = 0;
+  SLP_TREE_CHILDREN (this) = vNULL;
+  SLP_TREE_LOAD_PERMUTATION (this) = vNULL;
+  SLP_TREE_LANE_PERMUTATION (this) = vNULL;
+  SLP_TREE_DEF_TYPE (this) = vect_internal_def;
+  SLP_TREE_CODE (this) = ERROR_MARK;
+  SLP_TREE_VECTYPE (this) = NULL_TREE;
+  this->refcnt = 1;
+  this->max_nunits = 1;
+}
+
+/* Tear down a SLP node.  */
+
+_slp_tree::~_slp_tree ()
+{
+  SLP_TREE_CHILDREN (this).release ();
+  SLP_TREE_SCALAR_STMTS (this).release ();
+  SLP_TREE_SCALAR_OPS (this).release ();
+  SLP_TREE_VEC_STMTS (this).release ();
+  SLP_TREE_LOAD_PERMUTATION (this).release ();
+  SLP_TREE_LANE_PERMUTATION (this).release ();
+}
+
 /* Recursively free the memory allocated for the SLP tree rooted at NODE.
    FINAL_P is true if we have vectorized the instance or if we have
    made a final decision not to vectorize the statements in any way.  */
@@ -76,13 +109,7 @@ vect_free_slp_tree (slp_tree node, bool final_p)
 	}
     }
 
-  SLP_TREE_CHILDREN (node).release ();
-  SLP_TREE_SCALAR_STMTS (node).release ();
-  SLP_TREE_SCALAR_OPS (node).release ();
-  SLP_TREE_VEC_STMTS (node).release ();
-  SLP_TREE_LOAD_PERMUTATION (node).release ();
-
-  free (node);
+  delete node;
 }
 
 /* Free the memory allocated for the SLP instance.  FINAL_P is true if we
@@ -120,17 +147,10 @@ vect_create_new_slp_node (vec<stmt_vec_info> scalar_stmts)
   else
     return NULL;
 
-  node = XNEW (struct _slp_tree);
+  node = new _slp_tree;
   SLP_TREE_SCALAR_STMTS (node) = scalar_stmts;
-  SLP_TREE_SCALAR_OPS (node) = vNULL;
-  SLP_TREE_VEC_STMTS (node).create (0);
-  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = 0;
   SLP_TREE_CHILDREN (node).create (nops);
-  SLP_TREE_LOAD_PERMUTATION (node) = vNULL;
-  SLP_TREE_TWO_OPERATORS (node) = false;
   SLP_TREE_DEF_TYPE (node) = vect_internal_def;
-  node->refcnt = 1;
-  node->max_nunits = 1;
 
   unsigned i;
   FOR_EACH_VEC_ELT (scalar_stmts, i, stmt_info)
@@ -146,21 +166,41 @@ vect_create_new_slp_node (vec<tree> ops)
 {
   slp_tree node;
 
-  node = XNEW (struct _slp_tree);
-  SLP_TREE_SCALAR_STMTS (node) = vNULL;
+  node = new _slp_tree;
   SLP_TREE_SCALAR_OPS (node) = ops;
-  SLP_TREE_VEC_STMTS (node).create (0);
-  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = 0;
-  SLP_TREE_CHILDREN (node) = vNULL;
-  SLP_TREE_LOAD_PERMUTATION (node) = vNULL;
-  SLP_TREE_TWO_OPERATORS (node) = false;
   SLP_TREE_DEF_TYPE (node) = vect_external_def;
-  node->refcnt = 1;
-  node->max_nunits = 1;
 
   return node;
 }
 
+/* While in transient representation state provide a helper to
+   classify SLP node kinds.  */
+
+_slp_tree::kind
+_slp_tree::classify ()
+{
+  if (lane_permutation.exists ())
+    /* concat children and apply permute mask.  */
+    return SLP_CPERM;
+
+  return SLP_OTHER;
+}
+
+/* Returns the number of lanes in the node.  */
+
+unsigned
+_slp_tree::n_lanes ()
+{
+  if (stmts.exists ())
+    return stmts.length ();
+  if (ops.exists ())
+    return ops.length ();
+  if (lane_permutation.exists ())
+    return lane_permutation.length ();
+  /* ???  If we introduce "pure" SLP non-permute ops we might want to
+     have an explicit number of lanes member.  */
+  __builtin_unreachable ();
+}
 
 /* This structure is used in creation of an SLP tree.  Each instance
    corresponds to the same operand in a group of scalar stmts in an SLP
@@ -1629,8 +1669,56 @@ fail:
   *tree_size += this_tree_size + 1;
   *max_nunits = this_max_nunits;
 
+  if (two_operators)
+    {
+      /* ???  We'd likely want to either cache in bst_map sth like
+	 { a+b, NULL, a+b, NULL } and { NULL, a-b, NULL, a-b } or
+	 the true { a+b, a+b, a+b, a+b } ... but there we don't have
+	 explicit stmts to put in so the keying on 'stmts' doesn't
+	 work (but we have the same issue with nodes that use 'ops').  */
+      slp_tree one = new _slp_tree;
+      slp_tree two = new _slp_tree;
+      SLP_TREE_CHILDREN (one).safe_splice (children);
+      SLP_TREE_CHILDREN (two).safe_splice (children);
+      slp_tree child;
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child)
+	child->refcnt++;
+
+      node = new _slp_tree;
+      /* Here we record the original stmts since this
+	 node represents the final lane configuration.  */
+      SLP_TREE_SCALAR_STMTS (node) = stmts;
+      SLP_TREE_DEF_TYPE (node) = vect_internal_def;
+      SLP_TREE_CHILDREN (node).safe_push (one);
+      SLP_TREE_CHILDREN (node).safe_push (two);
+      gassign *stmt = as_a <gassign *> (stmts[0]->stmt);
+      enum tree_code code0 = gimple_assign_rhs_code (stmt);
+      enum tree_code ocode = ERROR_MARK;
+      stmt_vec_info ostmt_info;
+      unsigned j;
+      FOR_EACH_VEC_ELT (stmts, i, ostmt_info)
+	{
+	  gassign *ostmt = as_a <gassign *> (ostmt_info->stmt);
+	  if (gimple_assign_rhs_code (ostmt) != code0)
+	    {
+	      SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1, i));;
+	      ocode = gimple_assign_rhs_code (ostmt);
+	      j = i;
+	    }
+	  else
+	    SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (0, i));
+	}
+      SLP_TREE_CODE (one) = code0;
+      SLP_TREE_CODE (two) = ocode;
+      /* Fake some stmts for now.  */
+      SLP_TREE_SCALAR_STMTS (one).safe_grow_cleared (stmts.length ());
+      SLP_TREE_SCALAR_STMTS (two).safe_grow_cleared (stmts.length ());
+      SLP_TREE_SCALAR_STMTS (one)[0] = stmts[0];
+      SLP_TREE_SCALAR_STMTS (two)[0] = stmts[j];
+      return node;
+    }
+
   node = vect_create_new_slp_node (stmts);
-  SLP_TREE_TWO_OPERATORS (node) = two_operators;
   SLP_TREE_CHILDREN (node).splice (children);
   return node;
 }
@@ -1660,7 +1748,10 @@ vect_print_slp_tree (dump_flags_t dump_kind, dump_location_t loc,
 		   estimated_poly_value (node->max_nunits), node->refcnt);
   if (SLP_TREE_SCALAR_STMTS (node).exists ())
     FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-      dump_printf_loc (metadata, user_loc, "\tstmt %u %G", i, stmt_info->stmt);
+      if (!stmt_info)
+	dump_printf_loc (metadata, user_loc, "\tNULL\n");
+      else
+	dump_printf_loc (metadata, user_loc, "\tstmt %u %G", i, stmt_info->stmt);
   else
     {
       dump_printf_loc (metadata, user_loc, "\t{ ");
@@ -1710,7 +1801,7 @@ vect_mark_slp_stmts (slp_tree node, hash_set<slp_tree> &visited)
     return;
 
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-    STMT_SLP_TYPE (stmt_info) = pure_slp;
+    if (stmt_info) STMT_SLP_TYPE (stmt_info) = pure_slp;
 
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
     vect_mark_slp_stmts (child, visited);
@@ -1969,11 +2060,24 @@ vect_find_last_scalar_stmt_in_slp (slp_tree node)
   stmt_vec_info last = NULL;
   stmt_vec_info stmt_vinfo;
 
+  bool check_ops = false;
   for (int i = 0; SLP_TREE_SCALAR_STMTS (node).iterate (i, &stmt_vinfo); i++)
     {
+      if (!stmt_vinfo) { check_ops = true; break; }
       stmt_vinfo = vect_orig_stmt (stmt_vinfo);
       last = last ? get_later_stmt (stmt_vinfo, last) : stmt_vinfo;
     }
+  if (check_ops)
+    {
+      last = NULL;
+      int i;
+      slp_tree child;
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+	{
+	  stmt_vinfo = vect_find_last_scalar_stmt_in_slp (child);
+	  last = last ? get_later_stmt (stmt_vinfo, last) : stmt_vinfo;
+	}
+    }
 
   return last;
 }
@@ -2547,9 +2651,6 @@ vect_detect_hybrid_slp_stmts (slp_tree node, unsigned i, slp_vect_type stype,
   imm_use_iterator imm_iter;
   gimple *use_stmt;
   stmt_vec_info use_vinfo;
-  slp_tree child;
-  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
-  int j;
 
   /* We need to union stype over the incoming graph edges but we still
      want to limit recursion to stay O(N+E).  */
@@ -2557,6 +2658,9 @@ vect_detect_hybrid_slp_stmts (slp_tree node, unsigned i, slp_vect_type stype,
   gcc_assert (visited_cnt <= node->refcnt);
   bool only_edge = (visited_cnt != node->refcnt);
 
+  if (stmt_vinfo) {
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+
   /* Propagate hybrid down the SLP tree.  */
   if (stype == hybrid)
     ;
@@ -2603,7 +2707,10 @@ vect_detect_hybrid_slp_stmts (slp_tree node, unsigned i, slp_vect_type stype,
 			 stmt_vinfo->stmt);
       STMT_SLP_TYPE (stmt_vinfo) = hybrid;
     }
+  }
 
+  slp_tree child;
+  int j;
   if (!only_edge)
     FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
       if (SLP_TREE_DEF_TYPE (child) != vect_external_def
@@ -2790,8 +2897,17 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
 	= vect_get_num_vectors (vf * group_size, vectype);
     }
 
+  /* Handle purely internal nodes.  */
+  if (node->classify () == _slp_tree::SLP_CPERM)
+    return vectorizable_slp_permutation (vinfo, NULL, node, cost_vec);
+
   bool dummy;
-  return vect_analyze_stmt (stmt_info, &dummy, node, node_instance, cost_vec);
+  dummy = vect_analyze_stmt (stmt_info, &dummy, node, node_instance, cost_vec);
+  /* We eventually expect vectorizable_* functions to guess and set the
+     output vector type from the inputs.  */
+  if (!SLP_TREE_VECTYPE (node))
+    SLP_TREE_VECTYPE (node) = STMT_VINFO_VECTYPE (stmt_info);
+  return dummy;
 }
 
 /* Try to build NODE from scalars, returning true on success.
@@ -4114,10 +4230,190 @@ vect_transform_slp_perm_load (slp_tree node, vec<tree> dr_chain,
   return true;
 }
 
+
+/* Vectorize SLP permutations.  */
+
+static bool
+vectorizable_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
+			      slp_tree node, stmt_vector_for_cost *cost_vec)
+{
+  /* At analysis time compute the vector type.
+     ???  We currently only support all same vector input and output types
+     while the SLP IL should really do a concat + select and thus accept
+     arbitrary mismatches.
+     ???  Verification of the current limitation is missing here.  */
+  if (!gsi)
+    SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (SLP_TREE_CHILDREN (node)[0]);
+  tree vectype = SLP_TREE_VECTYPE (node);
+
+  vec<std::pair<unsigned, unsigned> > &perm = SLP_TREE_LANE_PERMUTATION (node);
+
+  unsigned vf = 1;
+  if (loop_vec_info linfo = dyn_cast <loop_vec_info> (vinfo))
+    vf = LOOP_VINFO_VECT_FACTOR (linfo).to_constant ();
+  unsigned olanes = vf * node->n_lanes ();
+  gcc_assert (olanes % perm.length () == 0);
+  gcc_assert (multiple_p (olanes, TYPE_VECTOR_SUBPARTS (vectype)));
+
+  /* ???   Compute { op, vector-def, lane } permutation sequence, delaying
+     final index compute and thus combining vector-defs in a particular
+     order.
+     ???   As intermediate step to actually code-gen in the SLP tree
+     representation?  */
+
+  auto_vec<unsigned> active_lane;
+  auto_vec<unsigned> defi;
+  active_lane.safe_grow_cleared (SLP_TREE_CHILDREN (node).length ());
+  defi.safe_grow_cleared (SLP_TREE_CHILDREN (node).length ());
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+		       "vectorizing permutation");
+      for (unsigned i = 0; i < perm.length (); ++i)
+	dump_printf (MSG_NOTE, " op%u[%u]", perm[i].first, perm[i].second); 
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  auto_vec<std::pair<std::pair<unsigned, unsigned>, unsigned> > vperm;
+  vperm.create (olanes);
+  for (unsigned i = 0; i < olanes / perm.length (); ++i)
+    {
+      for (unsigned pi = 0; pi < perm.length (); ++pi)
+	{
+	  std::pair<unsigned, unsigned> p = perm[pi];
+	  unsigned vnunits = TYPE_VECTOR_SUBPARTS
+	  (SLP_TREE_VECTYPE (SLP_TREE_CHILDREN (node)[p.first])).to_constant ();
+	  unsigned vi = (active_lane[p.first] + p.second) / vnunits;
+	  unsigned vl = (active_lane[p.first] + p.second) % vnunits;
+	  vperm.quick_push (std::make_pair (std::make_pair (p.first, vi), vl));
+	}
+      /* Advance to the next group.  */
+      for (unsigned j = 0; j < SLP_TREE_CHILDREN (node).length (); ++j)
+	active_lane[j] += SLP_TREE_CHILDREN (node)[j]->n_lanes ();
+    }
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+		       "as");
+      for (unsigned i = 0; i < vperm.length (); ++i)
+	{
+	  if (i != 0 && multiple_p (i, TYPE_VECTOR_SUBPARTS (vectype)))
+	    dump_printf (MSG_NOTE, ",");
+	  dump_printf (MSG_NOTE, " vops%u[%u][%u]",
+		       vperm[i].first.first, vperm[i].first.second,
+		       vperm[i].first.second); 
+	}
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We can only handle two-vector permutes, everything else should
+     be lowered on the SLP level.  */
+  std::pair<unsigned, unsigned> first_vec = std::make_pair (-1U, -1U);
+  std::pair<unsigned, unsigned> second_vec = std::make_pair (-1U, -1U);
+  unsigned int const_nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
+  unsigned int index = 0;
+  unsigned int mask_element;
+  vec_perm_builder mask;
+  mask.new_vector (const_nunits, const_nunits, 1);
+  unsigned int count = mask.encoded_nelts ();
+  mask.quick_grow (count);
+  vec_perm_indices indices;
+  unsigned nperms = 0;
+  for (unsigned i = 0; i < vperm.length (); ++i)
+    {
+      mask_element = vperm[i].second;
+      if (first_vec.first == -1U
+	  || first_vec == vperm[i].first)
+	first_vec = vperm[i].first;
+      else if (second_vec.first == -1U
+	       || second_vec == vperm[i].first)
+	{
+	  second_vec = vperm[i].first;
+	  mask_element += const_nunits;
+	}
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "permutation requires at "
+			     "least three vectors");
+	  gcc_assert (!gsi);
+	  return false;
+	}
+
+      mask[index++] = mask_element;
+
+      if (index == count)
+	{
+	  indices.new_vector (mask, second_vec.first == -1U ? 1 : 2,
+			      const_nunits);
+	  if (!can_vec_perm_const_p (TYPE_MODE (vectype), indices))
+	    {
+	      if (dump_enabled_p ())
+		{
+		  dump_printf_loc (MSG_MISSED_OPTIMIZATION,
+				   vect_location,
+				   "unsupported vect permute { ");
+		  for (i = 0; i < count; ++i)
+		    {
+		      dump_dec (MSG_MISSED_OPTIMIZATION, mask[i]);
+		      dump_printf (MSG_MISSED_OPTIMIZATION, " ");
+		    }
+		  dump_printf (MSG_MISSED_OPTIMIZATION, "}\n");
+		}
+	      gcc_assert (!gsi);
+	      return false;
+	    }
+	}
+
+      if (index == count)
+	{
+	  nperms++;
+
+	  if (gsi)
+	    {
+	      tree mask_vec = vect_gen_perm_mask_checked (vectype, indices);
+
+	      if (second_vec.first == -1U)
+		second_vec = first_vec;
+
+	      /* Generate the permute statement if necessary.  */
+	      tree first_def = gimple_get_lhs (SLP_TREE_VEC_STMTS
+	(SLP_TREE_CHILDREN (node)[first_vec.first])[first_vec.second]->stmt);
+	      tree second_def = gimple_get_lhs (SLP_TREE_VEC_STMTS
+	(SLP_TREE_CHILDREN (node)[second_vec.first])[second_vec.second]->stmt);
+	      stmt_vec_info perm_stmt_info;
+	      tree perm_dest = make_ssa_name (vectype);
+	      gassign *perm_stmt
+		  = gimple_build_assign (perm_dest, VEC_PERM_EXPR,
+					 first_def, second_def,
+					 mask_vec);
+	      /* ???  Refactor vect_finish_stmt_generation.  */
+	      gsi_insert_before (gsi, perm_stmt, GSI_SAME_STMT);
+	      perm_stmt_info = vinfo->add_stmt (perm_stmt);
+	      /* Store the vector statement in NODE.  */
+	      SLP_TREE_VEC_STMTS (node).quick_push (perm_stmt_info);
+	    }
+
+	  index = 0;
+	  first_vec = std::make_pair (-1U, -1U);
+	  second_vec = std::make_pair (-1U, -1U);
+	}
+    }
+
+  if (!gsi)
+    record_stmt_cost (cost_vec, nperms, vec_perm, NULL, 0, vect_body);
+
+  return true;
+}
+
 /* Vectorize SLP instance tree in postorder.  */
 
 static void
-vect_schedule_slp_instance (slp_tree node, slp_instance instance)
+vect_schedule_slp_instance (vec_info *vinfo,
+			    slp_tree node, slp_instance instance)
 {
   gimple_stmt_iterator si;
   stmt_vec_info stmt_info;
@@ -4135,7 +4431,7 @@ vect_schedule_slp_instance (slp_tree node, slp_instance instance)
     return;
 
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-    vect_schedule_slp_instance (child, instance);
+    vect_schedule_slp_instance (vinfo, child, instance);
 
   /* Push SLP node def-type to stmts.  */
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
@@ -4166,81 +4462,19 @@ vect_schedule_slp_instance (slp_tree node, slp_instance instance)
   stmt_vec_info last_stmt_info = vect_find_last_scalar_stmt_in_slp (node);
   si = gsi_for_stmt (last_stmt_info->stmt);
 
-  /* Handle two-operation SLP nodes by vectorizing the group with
-     both operations and then performing a merge.  */
   bool done_p = false;
-  if (SLP_TREE_TWO_OPERATORS (node))
+
+  /* Handle purely internal nodes.  */
+  if (node->classify () == _slp_tree::SLP_CPERM)
     {
-      gassign *stmt = as_a <gassign *> (stmt_info->stmt);
-      enum tree_code code0 = gimple_assign_rhs_code (stmt);
-      enum tree_code ocode = ERROR_MARK;
-      stmt_vec_info ostmt_info;
-      vec_perm_builder mask (group_size, group_size, 1);
-      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, ostmt_info)
-	{
-	  gassign *ostmt = as_a <gassign *> (ostmt_info->stmt);
-	  if (gimple_assign_rhs_code (ostmt) != code0)
-	    {
-	      mask.quick_push (1);
-	      ocode = gimple_assign_rhs_code (ostmt);
-	    }
-	  else
-	    mask.quick_push (0);
-	}
-      if (ocode != ERROR_MARK)
-	{
-	  vec<stmt_vec_info> v0;
-	  vec<stmt_vec_info> v1;
-	  unsigned j;
-	  tree tmask = NULL_TREE;
-	  vect_transform_stmt (stmt_info, &si, node, instance);
-	  v0 = SLP_TREE_VEC_STMTS (node).copy ();
-	  SLP_TREE_VEC_STMTS (node).truncate (0);
-	  gimple_assign_set_rhs_code (stmt, ocode);
-	  vect_transform_stmt (stmt_info, &si, node, instance);
-	  gimple_assign_set_rhs_code (stmt, code0);
-	  v1 = SLP_TREE_VEC_STMTS (node).copy ();
-	  SLP_TREE_VEC_STMTS (node).truncate (0);
-	  tree meltype = build_nonstandard_integer_type
-	      (GET_MODE_BITSIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype))), 1);
-	  tree mvectype = get_same_sized_vectype (meltype, vectype);
-	  unsigned k = 0, l;
-	  for (j = 0; j < v0.length (); ++j)
-	    {
-	      /* Enforced by vect_build_slp_tree, which rejects variable-length
-		 vectors for SLP_TREE_TWO_OPERATORS.  */
-	      unsigned int const_nunits = nunits.to_constant ();
-	      tree_vector_builder melts (mvectype, const_nunits, 1);
-	      for (l = 0; l < const_nunits; ++l)
-		{
-		  if (k >= group_size)
-		    k = 0;
-		  tree t = build_int_cst (meltype,
-					  mask[k++] * const_nunits + l);
-		  melts.quick_push (t);
-		}
-	      tmask = melts.build ();
-
-	      /* ???  Not all targets support a VEC_PERM_EXPR with a
-	         constant mask that would translate to a vec_merge RTX
-		 (with their vec_perm_const_ok).  We can either not
-		 vectorize in that case or let veclower do its job.
-		 Unfortunately that isn't too great and at least for
-		 plus/minus we'd eventually like to match targets
-		 vector addsub instructions.  */
-	      gimple *vstmt;
-	      vstmt = gimple_build_assign (make_ssa_name (vectype),
-					   VEC_PERM_EXPR,
-					   gimple_assign_lhs (v0[j]->stmt),
-					   gimple_assign_lhs (v1[j]->stmt),
-					   tmask);
-	      SLP_TREE_VEC_STMTS (node).quick_push
-		(vect_finish_stmt_generation (stmt_info, vstmt, &si));
-	    }
-	  v0.release ();
-	  v1.release ();
-	  done_p = true;
-	}
+      /* ???  the transform kind is stored to STMT_VINFO_TYPE which might
+	 be shared with different SLP nodes (but usually it's the same
+	 operation apart from the case the stmt is only there for denoting
+	 the actual scalar lane defs ...).  So do not call vect_transform_stmt
+	 but open-code it here (partly).  */
+      bool done = vectorizable_slp_permutation (vinfo, &si, node, NULL);
+      gcc_assert (done);
+      done_p = true;
     }
   if (!done_p)
     vect_transform_stmt (stmt_info, &si, node, instance);
@@ -4281,6 +4515,7 @@ vect_remove_slp_scalar_calls (slp_tree node, hash_set<slp_tree> &visited)
 
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
     {
+      if (!stmt_info)
+	continue;
       gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt);
       if (!stmt || gimple_bb (stmt) == NULL)
 	continue;
@@ -4366,7 +4601,7 @@ vect_schedule_slp (vec_info *vinfo)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
       /* Schedule the tree of INSTANCE.  */
-      vect_schedule_slp_instance (node, instance);
+      vect_schedule_slp_instance (vinfo, node, instance);
 
       if (SLP_INSTANCE_ROOT_STMT (instance))
 	vectorize_slp_instance_root_stmt (node, instance);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index f34ac151681..25dad1d712a 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -10730,7 +10730,7 @@ can_vectorize_live_stmts (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       unsigned int i;
       FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
 	{
-	  if (STMT_VINFO_LIVE_P (slp_stmt_info)
+	  if (slp_stmt_info && STMT_VINFO_LIVE_P (slp_stmt_info)
 	      && !vectorizable_live_operation (slp_stmt_info, gsi, slp_node,
 					       slp_node_instance, i,
 					       vec_stmt_p, cost_vec))
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 9e75098d1f2..aca5fc62dbe 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -117,6 +117,15 @@ typedef struct _slp_tree *slp_tree;
 /* A computation tree of an SLP instance.  Each node corresponds to a group of
    stmts to be packed in a SIMD stmt.  */
 struct _slp_tree {
+  _slp_tree();
+  ~_slp_tree();
+
+  enum kind { SLP_OTHER, SLP_CPERM };
+  kind classify ();
+
+  /* Number of (scalar) lanes produced by this node.  */
+  unsigned n_lanes ();
+
   /* Nodes that contain def-stmts of this node statements operands.  */
   vec<slp_tree> children;
 
@@ -128,7 +137,12 @@ struct _slp_tree {
   /* Load permutation relative to the stores, NULL if there is no
      permutation.  */
   vec<unsigned> load_permutation;
+  /* Lane permutation of the operands scalar lanes encoded as pairs
+     of { operand number, lane number }.  The number of elements
+     denotes the number of output lanes.  */
+  vec<std::pair<unsigned, unsigned> > lane_permutation;
 
+  tree vectype;
   /* Vectorized stmt/s.  */
   vec<stmt_vec_info> vec_stmts;
   /* Number of vector stmts that are created to replace the group of scalar
@@ -142,10 +156,10 @@ struct _slp_tree {
   /* The maximum number of vector elements for the subtree rooted
      at this node.  */
   poly_uint64 max_nunits;
-  /* Whether the scalar computations use two different operators.  */
-  bool two_operators;
   /* The DEF type of this node.  */
   enum vect_def_type def_type;
+  /* The operation of this node.  */
+  enum tree_code code;
 };
 
 
@@ -183,8 +197,10 @@ public:
 #define SLP_TREE_VEC_STMTS(S)                    (S)->vec_stmts
 #define SLP_TREE_NUMBER_OF_VEC_STMTS(S)          (S)->vec_stmts_size
 #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
-#define SLP_TREE_TWO_OPERATORS(S)		 (S)->two_operators
+#define SLP_TREE_LANE_PERMUTATION(S)             (S)->lane_permutation
 #define SLP_TREE_DEF_TYPE(S)			 (S)->def_type
+#define SLP_TREE_CODE(S)			 (S)->code
+#define SLP_TREE_VECTYPE(S)			 (S)->vectype
 
 /* Key for map that records association between
    scalar conditions and corresponding loop mask, and


* RE: How to extend SLP to support this case
  2020-03-13 11:57     ` Richard Biener
@ 2020-03-18 11:37       ` Tamar Christina
  0 siblings, 0 replies; 9+ messages in thread
From: Tamar Christina @ 2020-03-18 11:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen.Lin, GCC Development, Segher Boessenkool

Thanks Richard!

This is very helpful to see where you’re going with the changes!

Cheers,
Tamar

From: Richard Biener <richard.guenther@gmail.com>
Sent: Friday, March 13, 2020 11:57 AM
To: Tamar Christina <Tamar.Christina@arm.com>
Cc: Kewen.Lin <linkw@linux.ibm.com>; GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool <segher@kernel.crashing.org>
Subject: Re: How to extend SLP to support this case

On Tue, Mar 10, 2020 at 12:32 PM Tamar Christina <Tamar.Christina@arm.com> wrote:

> -----Original Message-----
> From: Gcc <gcc-bounces@gcc.gnu.org> On Behalf Of Richard Biener
> Sent: Tuesday, March 10, 2020 11:12 AM
> To: Kewen.Lin <linkw@linux.ibm.com>
> Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
> <segher@kernel.crashing.org>
> Subject: Re: How to extend SLP to support this case
>
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> >   extern void test(unsigned int t[4][4]);
> >
> >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> >   {
> >     unsigned int tmp[4][4];
> >     unsigned int a0, a1, a2, a3;
> >
> >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >       int t0 = a0 + a1;
> >       int t1 = a0 - a1;
> >       int t2 = a2 + a3;
> >       int t3 = a2 - a3;
> >
> >       tmp[i][0] = t0 + t2;
> >       tmp[i][2] = t0 - t2;
> >       tmp[i][1] = t1 + t3;
> >       tmp[i][3] = t1 - t3;
> >     }
> >     test(tmp);
> >   }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with
> > very inefficient code.  It builds the SLP instance from the store group
> > {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0,
> > a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after
> > parsing operands for tmp* and t*.  It means it's unable to make the
> > isomorphic group for a0, a1, a2, a3, although they appear isomorphic
> > enough to merge.  Even if it can recognize the over_widening pattern
> > and do some parallel work for two a0 values from two iterations, it's
> > still inefficient (high cost).
> >
> > In this context, it looks better to build <a0, a1, a2, a3> first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this? Currently the processing
> > starts from the bottom up (from stores); can we do some analysis on the
> > SLP instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex rotations
> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd like
> to improve SLP discovery, but it's on my list only after I've managed to get rid of
> the non-SLP code paths.  I have played with some ideas (even produced
> hackish patches) to find "seeds" to form SLP groups from using multi-level
> hashing of stmts.

I still have this but missed the stage-1 deadline after doing the rewrite to C++ 😊

We've also been looking at this and the approach I'm investigating now is trying to get
the SLP codepath to handle this after it's been fully unrolled. I'm looking into whether
the build-slp can be improved to work for the group size == 16 case that it tries but fails
on.

My intention is to see if doing so would make it simpler to recognize this as just 4 linear
loads and two permutes.  I think the loop-aware SLP will have a much harder time with this,
given the load permutations it thinks it needs because of the permutes caused by the +/-
pattern.

One idea I had before, from your comment on the complex number patch, is to try to
move up TWO_OPERATORS and always undo the permute when doing +/-.  This would simplify
the load permute handling, and if a target doesn't have an instruction to support this it would just
fall back to doing an explicit permute after the loads.  But I wasn't sure this approach would get me the
results I wanted.

In the end you don't want a loop here at all.  And in order to do the above with TWO_OPERATORS I would
have to let the SLP pattern matcher reduce the group size and increase the number of iterations during
the matching; otherwise the matching itself becomes quite difficult in certain cases.

Just to show where I'm heading I'm attaching current work-in-progress that introduces
an explicit SLP merge node and implementing SLP_TREE_TWO_OPERATORS that way:

   v1 + v2      v1 - v2
       \              /
          merge

the SLP merge operation is concatenating operands in order and then applies
a lane permutation mask (basically a select from the input lanes).  The actual patch
depends on earlier cleanups and more preparatory changes are in order to make
it "nice".  With such SLP merge operation in place it should be straight-forward
to represent smaller and larger vectorizations in a single graph (generating optimal
code is of course an entirely different problem).  I'm working on some of the preparatory
refactorings right now after which I'll post the series up to SLP_TREE_TWO_OPERATORS.
The next step is then to represent load permutations that way.

Richard.



Tamar

> My plan is to rewrite SLP discovery completely, starting from a SLP graph that
> 1:1 reflects the SSA use-def graph (no groups formed yet) and then form
> groups from seeds and insert "connection" nodes (merging two subgraphs
> with less lanes into one with more lanes, splitting lanes, permuting lanes,
> etc.).
>
> Currently I'm working on doing exactly this but only for SLP loads (because
> that's where it's most difficult...).
>
> > Besides, I also tried the standalone SLP pass to see whether it can
> > handle this, but it failed too: it stops at a* due to incompatible
> > vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
> >   d1_16_23 = load_dword(p1 + 2*i1)
> >   d1_24_31 = load_dword(p1 + 3*i1)
> >
> >   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
> >   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
> >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
> >                       {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
> >                       {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> >
> >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> >
> >   // Do the subtraction together (all 4x4 bytes)
> >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> >
> >   // Do some unpack and get the promoted vector int
> >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
> >   V_a0_1 = V_a0_tmp << 16
> >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
> >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> >   V_a0 = V_a0_0 + V_a0_1
> >
> >   // Get the similar for V_a1, V_a2, V_a3
> >
> >   // Compute t0/t1/t2/t3
> >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> >   V_t0 = V_a0 + V_a1
> >   V_t1 = V_a0 - V_a1
> >   V_t2 = V_a2 + V_a3
> >   V_t3 = V_a2 - V_a3
> >
> >   // Compute tmps
> >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> >   V_tmp0 = V_t0 + V_t2
> >   V_tmp2 = V_t0 - V_t2
> >   V_tmp1 = V_t1 + V_t3
> >   V_tmp3 = V_t1 - V_t3
> >
> >   // Finally construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> >   // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> >
> > From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
> >   tmp[i][0] i=/0,1,2,3/ (one group)
> >   tmp[i][1] i=/0,1,2,3/ (one group)
> >   tmp[i][2] i=/0,1,2,3/ (one group)
> >   tmp[i][3] i=/0,1,2,3/ (one group)
> >
> > where each tmp[i][j] group has the same isomorphic computations.  But
> > currently SLP is unable to divide groups this way (call it A-way for
> > now).
> >
> > It's understandable since it has better adjacent store groups like,
> >   tmp[0][i] i=/0,1,2,3/ (one group)
> >   tmp[1][i] i=/0,1,2,3/ (one group)
> >   tmp[2][i] i=/0,1,2,3/ (one group)
> >   tmp[3][i] i=/0,1,2,3/ (one group)
> >
> > A-way requires some additional vector permutations.  However, I
> > thought that if the existing scheme can't find any SLP opportunities, it
> > looks reasonable to extend it to consider this A-way grouping.  Does that
> > make sense?
> >
> > Another question is that even if we can go with A-way grouping, it can
> > only handle packing one byte (four iterations -> 4); where would it be
> > reasonable to extend it to pack more?  How about scanning all leaf
> > nodes and considering/transforming them together?  Too hacky?
>
> Well, not sure - in the end it will all be heuristics since otherwise the
> exploration space is too big.  But surely concentrating on load/store groups is
> good.
>
> The SLP discovery code is already quite complicated, btw.; I'd hate to add
> "unstructured" hacks on top of it right now without future design goals in
> mind.
>
> Richard.
>
> > Any comments are highly appreciated.  Thanks in advance!
> >
> >
> > BR,
> > Kewen
> >



Thread overview: 9+ messages
2020-03-10  6:52 How to extend SLP to support this case Kewen.Lin
2020-03-10 11:12 ` Richard Biener
2020-03-10 11:14   ` Richard Biener
2020-03-11  5:34     ` Kewen.Lin
2020-03-10 11:31   ` Tamar Christina
2020-03-11  3:58     ` Kewen.Lin
2020-03-13 11:57     ` Richard Biener
2020-03-18 11:37       ` Tamar Christina
2020-03-11  1:56   ` Kewen.Lin
