On Tue, Mar 10, 2020 at 12:32 PM Tamar Christina wrote:
>
> > -----Original Message-----
> > From: Gcc On Behalf Of Richard Biener
> > Sent: Tuesday, March 10, 2020 11:12 AM
> > To: Kewen.Lin
> > Cc: GCC Development ; Segher Boessenkool
> >
> > Subject: Re: How to extend SLP to support this case
> >
> > On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
> > >
> > > Hi all,
> > >
> > > I'm investigating whether GCC can vectorize the below case on ppc64le.
> > >
> > > extern void test(unsigned int t[4][4]);
> > >
> > > void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> > > {
> > >   unsigned int tmp[4][4];
> > >   unsigned int a0, a1, a2, a3;
> > >
> > >   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> > >     a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> > >     a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> > >     a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> > >     a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> > >
> > >     int t0 = a0 + a1;
> > >     int t1 = a0 - a1;
> > >     int t2 = a2 + a3;
> > >     int t3 = a2 - a3;
> > >
> > >     tmp[i][0] = t0 + t2;
> > >     tmp[i][2] = t0 - t2;
> > >     tmp[i][1] = t1 + t3;
> > >     tmp[i][3] = t1 - t3;
> > >   }
> > >   test(tmp);
> > > }
> > >
> > > With unlimited costs, I saw that loop-aware SLP can vectorize it, but
> > > with very inefficient code.  It builds the SLP instance from the store
> > > group {tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3]} and builds the nodes
> > > {a0, a0, a0, a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3}
> > > after parsing the operands for tmp* and t*.  That means it's unable to
> > > form an isomorphic group for a0, a1, a2, a3, although they appear
> > > isomorphic enough to merge.  Even if it can recognize the over_widening
> > > pattern and do some of the work in parallel for two a0 values from two
> > > iterations, it's still inefficient (high cost).
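The testcase above is self-contained enough to exercise directly.  The
harness below is illustrative only: the all-ones/all-zeros inputs and the
checking variant of test() are made up here, not part of the original
report.  With those inputs every byte difference is 1, so only tmp[i][0]
is nonzero.

```c
#include <assert.h>

void test(unsigned int t[4][4])
{
  /* With p1 all-ones and p2 all-zeros, a0..a3 are each 1 + (1 << 16),
     so t1 = t3 = 0 and tmp[i][0] = 4 * (1 + (1 << 16)).  */
  for (int i = 0; i < 4; i++) {
    assert(t[i][0] == 4 * (1u + (1u << 16)));
    assert(t[i][1] == 0 && t[i][2] == 0 && t[i][3] == 0);
  }
}

void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  unsigned int a0, a1, a2, a3;

  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
    a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
    a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
    a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

    int t0 = a0 + a1;
    int t1 = a0 - a1;
    int t2 = a2 + a3;
    int t3 = a2 - a3;

    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
  test(tmp);
}
```

Calling foo on 32-byte buffers with stride 8 keeps the last iteration's
reads (bytes 24..31) in bounds.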
> > >
> > > In this context, it looks better to build the {a0, a1, a2, a3} group
> > > first by leveraging the isomorphic computation trees constructing
> > > them, e.g.:
> > >   w1_0123 = load_word(p1)
> > >   V1_0123 = construct_vec(w1_0123)
> > >   w1_4567 = load_word(p1 + 4)
> > >   V1_4567 = construct_vec(w1_4567)
> > >   w2_0123 = load_word(p2)
> > >   V2_0123 = construct_vec(w2_0123)
> > >   w2_4567 = load_word(p2 + 4)
> > >   V2_4567 = construct_vec(w2_4567)
> > >   V_a0123 = (V1_0123 - V2_0123) + ((V1_4567 - V2_4567) << 16)
> > >
> > > But how do we teach it to be aware of this?  Currently the processing
> > > goes bottom-up (from the stores); can we do some analysis on the SLP
> > > instance, detect some pattern, and update the whole instance?
> >
> > In theory yes (Tamar had something like that for AArch64 complex
> > rotations IIRC).  And yes, the issue boils down to how we handle SLP
> > discovery.  I'd like to improve SLP discovery, but it's on my list only
> > after I've managed to get rid of the non-SLP code paths.  I have played
> > with some ideas (even produced hackish patches) to find "seeds" to form
> > SLP groups from, using multi-level hashing of stmts.
>
> I still have this but missed the stage 1 deadline after doing the
> rewriting to C++ 😊
>
> We've also been looking at this, and the approach I'm investigating now
> is trying to get the SLP codepath to handle this after it's been fully
> unrolled.  I'm looking into whether build-slp can be improved to work
> for the group size == 16 case that it tries but fails on.
>
> My intention is to see if doing so would make it simpler to recognize
> this as just 4 linear loads and two permutes.  I think loop-aware SLP
> will have a much harder time with this, given the load permutations it
> thinks it needs because of the permutes caused by the +/- pattern.
>
> One idea I had before, from your comment on the complex number patch, is
> to try to move TWO_OPERATORS up and always undo the permute when doing
> +/-.
> This would simplify the load permute handling, and if a target doesn't
> have an instruction to support this it would just fall back to doing an
> explicit permute after the loads.  But I wasn't sure this approach would
> get me the results I wanted.
>
> In the end you don't want a loop here at all.  And in order to do the
> above with TWO_OPERATORS I would have to let the SLP pattern matcher
> reduce the group size and increase the number of iterations during the
> matching, otherwise the matching itself becomes quite difficult in
> certain cases.

Just to show where I'm heading, I'm attaching current work-in-progress
that introduces an explicit SLP merge node, implementing
SLP_TREE_TWO_OPERATORS that way:

   v1 + v2    v1 - v2
        \      /
         merge

The SLP merge operation concatenates its operands in order and then
applies a lane permutation mask (basically a select from the input
lanes).  The actual patch depends on earlier cleanups, and more
preparatory changes are in order to make it "nice".

With such an SLP merge operation in place it should be straightforward
to represent smaller and larger vectorizations in a single graph
(generating optimal code is of course an entirely different problem).

I'm working on some of the preparatory refactorings right now, after
which I'll post the series up to SLP_TREE_TWO_OPERATORS.  The next step
is then to represent load permutations that way.

Richard.

> Tamar
>
> > My plan is to rewrite SLP discovery completely, starting from an SLP
> > graph that 1:1 reflects the SSA use-def graph (no groups formed yet),
> > and then form groups from seeds and insert "connection" nodes (merging
> > two subgraphs with fewer lanes into one with more lanes, splitting
> > lanes, permuting lanes, etc.).
> >
> > Currently I'm working on doing exactly this, but only for SLP loads
> > (because that's where it's most difficult...).
> >
> > > Besides, I also tried whether the standalone SLP pass can handle it;
> > > it failed to.
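The merge node sketched in the thread can be mimicked at the source level
with GCC's generic vector extensions: compute both full-width operator
results, then select lanes with one two-input shuffle.  This is only a
conceptual sketch of SLP_TREE_TWO_OPERATORS as a merge, not GCC's
internal representation; the function name and lane mask are made up.

```c
typedef int v4si __attribute__((vector_size(16)));

/* Model of an SLP "merge" node for a +/- (TWO_OPERATORS) group: both
   operator results are computed on all lanes, then a lane permutation
   selects from their concatenation (shuffle indices 0-3 pick from
   'sum', 4-7 from 'dif').  */
v4si two_operators_merge(v4si a, v4si b)
{
  v4si sum = a + b;
  v4si dif = a - b;
  const v4si mask = { 0, 5, 2, 7 };  /* {sum[0], dif[1], sum[2], dif[3]} */
  return __builtin_shuffle(sum, dif, mask);
}
```

On a target with a fused add/sub-style instruction the shuffle can fold
away; otherwise this compiles to the two operations plus an explicit
permute, matching the fallback discussed above.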
> > > It stops at a* due to incompatible vector types.
> > >
> > > The optimal vectorized SLP code can be:
> > >   // p1 byte 0 to byte 7
> > >   d1_0_7 = load_dword(p1)
> > >   // p1+i1 bytes 0 to 7, renamed as bytes 8 to 15
> > >   d1_8_15 = load_dword(p1 + i1)
> > >   d1_16_23 = load_dword(p1 + 2*i1)
> > >   d1_24_31 = load_dword(p1 + 3*i1)
> > >
> > >   V_d1_0_15 = construct_vec(d1_0_7, d1_8_15)   // vector char
> > >   V_d1_16_31 = construct_vec(d1_16_23, d1_24_31)
> > >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
> > >                        {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> > >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
> > >                        {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> > >
> > >   // Do the similar for p2 with i2, getting V_d2_0_3_all, V_d2_4_7_all
> > >
> > >   // Do the subtraction together (all 4x4 bytes)
> > >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> > >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> > >
> > >   // Do some unpacking to get the promoted vector int
> > >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3})  // vector int {b4 b12 b20 b28}
> > >   V_a0_1 = V_a0_tmp << 16
> > >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3})    // vector int {b0 b8 b16 b24}
> > >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> > >   V_a0 = V_a0_0 + V_a0_1
> > >
> > >   // Get the similar for V_a1, V_a2, V_a3
> > >
> > >   // Compute t0/t1/t2/t3
> > >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> > >   V_t0 = V_a0 + V_a1
> > >   V_t1 = V_a0 - V_a1
> > >   V_t2 = V_a2 + V_a3
> > >   V_t3 = V_a2 - V_a3
> > >
> > >   // Compute the tmps
> > >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> > >   V_tmp0 = V_t0 + V_t2
> > >   V_tmp2 = V_t0 - V_t2
> > >   V_tmp1 = V_t1 + V_t3
> > >   V_tmp3 = V_t1 - V_t3
> > >
> > >   // Finally construct {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
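The vperm steps in the sketch are plain two-input byte shuffles and can
be written with GCC's generic vector extensions.  The sketch below
assumes the second operand is the vector holding bytes 16-31, so that
mask indices 16-31 select real data; the actual POWER vperm intrinsic
has its own operand conventions, so this is illustration only.

```c
typedef unsigned char v16qi __attribute__((vector_size(16)));

/* V_d1_0_3_all from the sketch: 'lo' holds rows 0-1 (bytes 0-15),
   'hi' holds rows 2-3 (bytes 16-31).  Result lanes 4*j .. 4*j+3 are
   byte j of rows 0..3.  Shuffle indices 0-15 select from 'lo',
   16-31 from 'hi'.  */
v16qi gather_bytes_0_3(v16qi lo, v16qi hi)
{
  const v16qi mask = { 0, 8, 16, 24, 1, 9, 17, 25,
                       2, 10, 18, 26, 3, 11, 19, 27 };
  return __builtin_shuffle(lo, hi, mask);
}
```

The second permute in the sketch is the same operation with every index
offset by 4, gathering bytes 4-7 of each row instead.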
> > >   // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
> > >
> > > From the above, the key thing is to group tmp[i][j] for i = 0,1,2,3
> > > together, e.g.:
> > >   tmp[i][0] i=0,1,2,3 (one group)
> > >   tmp[i][1] i=0,1,2,3 (one group)
> > >   tmp[i][2] i=0,1,2,3 (one group)
> > >   tmp[i][3] i=0,1,2,3 (one group)
> > >
> > > where the tmp[i][j] groups have the same isomorphic computations.
> > > But currently SLP is unable to divide the group up this way (call it
> > > A-way for now).
> > >
> > > That's understandable, since it has the better adjacent store groups:
> > >   tmp[0][i] i=0,1,2,3 (one group)
> > >   tmp[1][i] i=0,1,2,3 (one group)
> > >   tmp[2][i] i=0,1,2,3 (one group)
> > >   tmp[3][i] i=0,1,2,3 (one group)
> > >
> > > A-way requires some additional vector permutations.  However, I
> > > thought that if the existing scheme can't find any SLP chances, it
> > > looks reasonable to extend it to consider this A-way grouping.  Does
> > > that make sense?
> > >
> > > Another question: even if we can go with A-way grouping, it can only
> > > handle packing one byte (four iterations -> 4); where is a reasonable
> > > place to extend it to pack more?  How about scanning all leaf nodes
> > > and considering/transforming them together?  Too hacky?
> >
> > Well, not sure - in the end it will all be heuristics, since otherwise
> > the exploration space is too big.  But surely concentrating on
> > load/store groups is good.
> >
> > The SLP discovery code is already quite complicated, btw.; I'd hate to
> > add "unstructured" hacks on top of it right now without future design
> > goals in mind.
> >
> > Richard.
> >
> > > Any comments are highly appreciated.  Thanks in advance!
> > >
> > >
> > > BR,
> > > Kewen
> >
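As a closing illustration: the further permutations in the final
construction step amount to transposing V_tmp0..V_tmp3 from per-column
back to per-row order.  A generic-vector sketch of that transpose
follows; it spends eight two-input shuffles, so take the six-permute
figure quoted above as a target-specific refinement (the names here are
made up).

```c
typedef unsigned int v4su __attribute__((vector_size(16)));

/* cj holds column j of tmp across the four iterations, i.e.
   cj = {tmp[0][j], tmp[1][j], tmp[2][j], tmp[3][j]}.  The transpose
   produces out[i] = {tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3]},
   the layout the adjacent store group wants.  */
void transpose_4x4(v4su c0, v4su c1, v4su c2, v4su c3, v4su out[4])
{
  const v4su lo = { 0, 4, 1, 5 }, hi = { 2, 6, 3, 7 };
  v4su t0 = __builtin_shuffle(c0, c1, lo);  /* {c0[0], c1[0], c0[1], c1[1]} */
  v4su t1 = __builtin_shuffle(c2, c3, lo);
  v4su t2 = __builtin_shuffle(c0, c1, hi);  /* {c0[2], c1[2], c0[3], c1[3]} */
  v4su t3 = __builtin_shuffle(c2, c3, hi);
  const v4su a = { 0, 1, 4, 5 }, b = { 2, 3, 6, 7 };
  out[0] = __builtin_shuffle(t0, t1, a);    /* row 0 */
  out[1] = __builtin_shuffle(t0, t1, b);    /* row 1 */
  out[2] = __builtin_shuffle(t2, t3, a);    /* row 2 */
  out[3] = __builtin_shuffle(t2, t3, b);    /* row 3 */
}
```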