public inbox for gcc-bugs@sourceware.org
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98138] BB vect fail to SLP one case
Date: Tue, 05 Jan 2021 08:42:20 +0000	[thread overview]
Message-ID: <bug-98138-4-2FOtsrAh58@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-98138-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #3)
> 
> IIUC, in current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently 
> 
> When all these attempts fail, could we retry with the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then start
> from these newly adjusted groups? The built operands should be isomorphic
> then. Maybe too hackish?

This thought seems impractical in the current framework. I was wondering whether
we can use another vectorization scheme which is sub-optimal but fits the
current framework better. It only considers the work within one loop iteration
and starts from the grouped stores { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }.
It can be:

  w1_0123 = load_word(p1)
  V1_0123 = scalar_to_vec(w1_0123) // also with promotion
  w1_4567 = load_word(p1+4)
  V1_4567 = scalar_to_vec(w1_4567)

  w2_0123 = load_word(p2)
  V2_0123 = scalar_to_vec(w2_0123)
  w2_4567 = load_word(p2+4)
  V2_4567 = scalar_to_vec(w2_4567)

  V_a0123 = (V1_0123 - V2_0123) + ((V1_4567 - V2_4567) << 16)

  V1 = V_a0123                  // {a0, a1, a2, a3}
  V2 = vperm (V1, <1, 0, 3, 2>) // {a1, a0, a3, a2}

  V3 = V1 - V2  // {t1, -t1, t3, -t3}
  V4 = V1 + V2  // {t0,  t0, t2,  t2}

  V5 = vperm (V3, V4, <4, 0, 6, 2>) // {t0, t1, t2, t3}
  V6 = vperm (V3, V4, <6, 2, 4, 0>) // {t2, t3, t0, t1}

  V7 = V5 - V6 // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
  V8 = V5 + V6 // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}

  V9 = vperm (V7, V8, <4, 5, 0, 1>)

  vec_store (tmp[i], V9)

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2) {
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize this simple case. SLP doesn't go deep enough
to expose the grouped-load opportunity on the bottom nodes loading p1/p2. It
stops early at the nodes { _13, _14, _13, _14 } and { _15, _16, _15, _16 }
because operands like {a0_28, a0_28, a0_28, a0_28} satisfy the all_uniform_p
condition, so it decides "Building parent vector operands from scalars instead".

One rough idea would be:
  1) Relax the all_uniform_p condition somehow so that SLP instance building
goes deeper and turns those p1/p2 loads into SLP nodes.
  2) Introduce one more vect_pattern recognizer to catch this kind of pattern
and transform the SLP instance as we expect. I assume that once we can see the
whole SLP instance, we can transform it as we want here. We probably need some
costing condition to gate this pattern matching.
  3) If 2) fails, trim the SLP instance back at the nodes satisfying the
all_uniform_p condition, to ensure the result is the same as before.

Does it sound reasonable?

Thread overview: 15+ messages
2020-12-04 10:46 [Bug tree-optimization/98138] New: " linkw at gcc dot gnu.org
2020-12-04 10:52 ` [Bug tree-optimization/98138] " linkw at gcc dot gnu.org
2020-12-04 12:19 ` rguenth at gcc dot gnu.org
2020-12-07  3:10 ` linkw at gcc dot gnu.org
2021-01-05  8:42 ` linkw at gcc dot gnu.org [this message]
2021-01-06  3:29 ` linkw at gcc dot gnu.org
2021-01-06  9:48 ` rguenth at gcc dot gnu.org
2021-01-12  7:23 ` linkw at gcc dot gnu.org
2021-01-12  7:25 ` linkw at gcc dot gnu.org
2021-08-04 10:31 ` rguenth at gcc dot gnu.org
2022-07-06 10:39 ` ktkachov at gcc dot gnu.org
2023-02-01  8:19 ` manolis.tsamis at vrull dot eu
2023-10-04 22:37 ` jiangning.liu at amperecomputing dot com
2023-10-05  6:58 ` rguenth at gcc dot gnu.org
2023-10-09  7:38 ` rguenth at gcc dot gnu.org
