public inbox for gcc-bugs@sourceware.org
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98138] BB vect fail to SLP one case
Date: Tue, 12 Jan 2021 07:23:42 +0000	[thread overview]
Message-ID: <bug-98138-4-f1s0S5xjjF@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-98138-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> Starting from the loads is not how SLP discovery works so there will be
> zero re-use of code.  Sure - the only important thing is you end up
> with a valid SLP graph.
> 
> But going back to the original testcase and the proposed vectorization
> for power - is that faster in the end?
> 

Good question. I coded foo2 and foo3 with altivec built-in functions following the
proposals: foo2 implements the original version (#c0) and foo3 the sub-optimal
version (#c4). Both have been verified against some random inputs to ensure
correctness.

I evaluated the timings on Power8/Power9/Power10 (pre-release) built with -Ofast
and the corresponding -mcpu=power{8,9,10}. Taking foo1 (the scalar version) as
the baseline 100 and normalizing, the speed-ups are as follows (higher is
better):

     foo1    foo2     foo3
P8   100    163.26   164.97
P9   100    101.15   101.60
P10  100    172.42   159.10

The evaluation shows that the vectorized versions give big gains on Power8 and
Power10, but the gain is trivial on Power9.  I also ran the P8 executable on P9;
it didn't show any remarkable difference.  There are some known issues on P9
with fp/vect-intensive workloads, and since the vectorized versions are full of
vector code, the flat result is probably due to that.  Sigh, we might have to
use different costs for P9, but that's another story.  Btw, I also ran the P9
executable on P10, and it gains about as much as the P10 row shows.  So it's
profitable to vectorize this, as the P8 and P10 results show.

> For the "rewrite" of the vectorizer into all-SLP we do have to address
> that "interleaving scheme not carried out as interleaving" at some point,
> but that's usually for loop vectorization - for BB vectorization all
> we have is optimize_slp.  I have patches that would build the vector load
> SLP node (you still have to kill that 'build from scalars' thing to make
> it trigger ).  But then we end up with a shared vector load node and
> N extract/splat operations at the 'scalar' points.  It's not entirely
> clear to me how to re-arrange the SLP graph at that point.
> 

Great, it's good to know we can end up with a shared vector load node. It looks
like we can further simplify the graph on top of that. For the simplified case
in #c4, assume we have two shared vector load nodes:
  N1 = <p1[0], p1[1], p1[2], p1[3]>
  N2 = <p2[0], p2[1], p2[2], p2[3]>

1) If all lanes of one shared vector load node only work with all lanes of
another shared vector load node, we can move up (postpone) the extract/splat
operations.  For this case, we need 4 extract/splats for p1's lanes, 4
extract/splats for p2's lanes, and 4 minuses for {T,T,T,T} /T=a{0,1,2,3}/; this
can be simplified to one shared node N3 = N1 - N2 (1 minus) plus 4
extract/splats.

2) Then we have the shared vector node <a0,a1,a2,a3>.  Before it is
extracted/splatted to {T,T,T,T} /T=a{0,1,2,3}/, we can check all the uses of
the extracted/splatted lanes, try to detect a pattern, and simplify further.
For this case, the node can be permuted to <a1,a0,a3,a2>; doing the minus/plus
and the permutations then yields the nodes N4 <_13, _14, _13, _14> and N5 <_15,
_16, _15, _16>, saving the extract/splats.

3) Going further, we can detect the pattern on the SLP graph and skip
generating N4 and N5 altogether: generate <_13, _14, _15, _16> instead and
repeat the minus/plus and permutations for the required t1/t2/t3/t4
computation.

> Btw, on current trunk the simplified testcase no longer runs into the
> 'scalar operand' build case but of course vectorization is thought to be
> not profitable.  pattern recog of the plus/minus subgraphs may help
> (not sure if ppc has those as instruction, x86 has).

You meant the simplified case in #c4? I rebased onto the latest trunk
(r11-6601) but still hit the 'scalar operand' build.  Unfortunately ppc doesn't
support that kind of instruction, sigh.

> 
> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

Thanks for the information!

