How to extend SLP to support this case

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Development <gcc@gcc.gnu.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	Richard Biener <richard.guenther@gmail.com>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	Bill Schmidt <wschmidt@linux.ibm.com>
Subject: How to extend SLP to support this case
Date: Tue, 10 Mar 2020 14:52:04 +0800	[thread overview]
Message-ID: <3d1278f2-ecd6-09c1-2179-19b7a1ecd5d4@linux.ibm.com> (raw)

Hi all,

I'm investigating whether GCC can vectorize the below case on ppc64le.

  extern void test(unsigned int t[4][4]);

  void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
    unsigned int tmp[4][4];
    unsigned int a0, a1, a2, a3;

    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
      a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
      a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
      a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

      int t0 = a0 + a1;
      int t1 = a0 - a1;
      int t2 = a2 + a3;
      int t3 = a2 - a3;

      tmp[i][0] = t0 + t2;
      tmp[i][2] = t0 - t2;
      tmp[i][1] = t1 + t3;
      tmp[i][3] = t1 - t3;
    }
    test(tmp);
  }

With unlimited costs, I saw loop aware SLP can vectorize it but with very
inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0}, 
{a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
for tmp* and t*.  It means it's unable to make the isomorphic group for
a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
recognize over_widening pattern and do some parallel for two a0 from two
iterations, but it's still inefficient (high cost).

In this context, it looks better to build <a0, a1, a2, a3> first by
leveraging isomorphic computation trees constructing them, eg:
  w1_0123 = load_word(p1)
  V1_0123 = construct_vec(w1_0123)
  w1_4567 = load_word(p1 + 4)
  V1_4567 = construct_vec(w1_4567)
  w2_0123 = load_word(p2)
  V2_0123 = construct_vec(w2_0123)
  w2_4567 = load_word(p2 + 4)
  V2_4567 = construct_vec(w2_4567)
  V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16

But how to teach it to be aware of this? Currently the processing starts
from bottom to up (from stores), can we do some analysis on the SLP 
instance, detect some pattern and update the whole instance?

Besides, I also tried whether the standalone SLP pass can handle it,
it failed to. It stops at a* due to incompatible vector types.

The optimal vectorized SLP codes can be:
  // p1 byte 0 to byte 7 
  d1_0_7 = load_dword(p1)
  // p1+i1 b0 to b7, rename it as 8 to 15       
  d1_8_15 = load_dword(p1 + i1)
  d1_16_23 = load_dword(p1 + 2*i1) 
  d1_24_31 = load_dword(p1 + 3*i1)

  V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
  V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
  V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15, 
                      {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
  V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15, 
                      {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

  // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all

  // Do the subtraction together (all 4x4 bytes)
  V_sub1 = V_d1_0_3_all - V_d2_0_3_all
  V_sub2 = V_d1_4_7_all - V_d2_4_7_all

  // Do some unpack and get the promoted vector int
  V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
  V_a0_1 = V_a0_tmp << 16
  V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
  // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
  V_a0 = V_a0_0 + V_a0_1

  // Get the similar for V_a1, V_a2, V_a3

  // Compute t0/t1/t2/t3
  // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
  V_t0 = V_a0 + V_a1  
  V_t1 = V_a0 - V_a1
  V_t2 = V_a2 + V_a3
  V_t3 = V_a2 - V_a3

  // Compute tmps
  // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
  V_tmp0 = V_t0 + V_t2
  V_tmp2 = V_t0 - V_t2
  V_tmp1 = V_t1 + V_t3
  V_tmp3 = V_t1 - V_t3

  // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
  // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3

From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
  tmp[i][0] i=/0,1,2,3/ (one group)
  tmp[i][1] i=/0,1,2,3/ (one group)
  tmp[i][2] i=/0,1,2,3/ (one group)
  tmp[i][3] i=/0,1,2,3/ (one group)

which tmp[i][j] group have the same isomorphic computations.  But currently
SLP is unable to divide group like this way. (call it as A-way for now)

It's understandable since it has better adjacent store groups like,
  tmp[0][i] i=/0,1,2,3/ (one group)
  tmp[1][i] i=/0,1,2,3/ (one group)
  tmp[2][i] i=/0,1,2,3/ (one group)
  tmp[3][i] i=/0,1,2,3/ (one group)

A-way requires some additional vector permutations.  However, I thought
if the existing scheme can't get any SLP chances, it looks reasonable to
extend it to consider this A-way grouping.  Does it make sense?

Another question is that even if we can go with A-way grouping, it can
only handle packing one byte (four iteration -> 4), what place is
reasonable to extend it to pack more?  How about scaning all leaf
nodes and consider/transform them together?  too hacky?

Any comments are highly appreciated.  Thanks in advance!

BR,
Kewen

next             reply	other threads:[~2020-03-10  6:52 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-10  6:52 Kewen.Lin [this message]
2020-03-10 11:12 ` Richard Biener
2020-03-10 11:14   ` Richard Biener
2020-03-11  5:34     ` Kewen.Lin
2020-03-10 11:31   ` Tamar Christina
2020-03-11  3:58     ` Kewen.Lin
2020-03-13 11:57     ` Richard Biener
2020-03-18 11:37       ` Tamar Christina
2020-03-11  1:56   ` Kewen.Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3d1278f2-ecd6-09c1-2179-19b7a1ecd5d4@linux.ibm.com \
    --to=linkw@linux.ibm.com \
    --cc=gcc@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=richard.sandiford@arm.com \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).