Re: [PATCH] Add emulated gather capability to the vectorizer

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <rguenther@suse.de>
To: "Kewen.Lin" <linkw@linux.ibm.com>
Cc: richard.sandiford@arm.com, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Add emulated gather capability to the vectorizer
Date: Mon, 2 Aug 2021 11:11:46 +0200 (CEST)	[thread overview]
Message-ID: <nycvar.YFH.7.76.2108021107210.11781@zhemvz.fhfr.qr> (raw)
In-Reply-To: <169f58db-4e9d-df26-309f-8e8c0c4b691f@linux.ibm.com>

On Mon, 2 Aug 2021, Kewen.Lin wrote:

> on 2021/8/2 下午3:09, Richard Biener wrote:
> > On Mon, 2 Aug 2021, Kewen.Lin wrote:
> > 
> >> on 2021/7/30 下午10:04, Kewen.Lin via Gcc-patches wrote:
> >>> Hi Richi,
> >>>
> >>> on 2021/7/30 下午7:34, Richard Biener wrote:
> >>>> This adds a gather vectorization capability to the vectorizer
> >>>> without target support by decomposing the offset vector, doing
> >>>> sclar loads and then building a vector from the result.  This
> >>>> is aimed mainly at cases where vectorizing the rest of the loop
> >>>> offsets the cost of vectorizing the gather.
> >>>>
> >>>> Note it's difficult to avoid vectorizing the offset load, but in
> >>>> some cases later passes can turn the vector load + extract into
> >>>> scalar loads, see the followup patch.
> >>>>
> >>>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> >>>> to 219s on a Zen2 CPU which has its native gather instructions
> >>>> disabled (using those the runtime instead increases to 254s)
> >>>> using -Ofast -march=znver2 [-flto].  It turns out the critical
> >>>> loops in this benchmark all perform gather operations.
> >>>>
> >>>
> >>> Wow, it sounds promising!
> >>>
> >>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >>>>
> >>>> Any comments?  I still plan to run this over full SPEC and
> >>>> I have to apply TLC to the followup patch before I can post it.
> >>>>
> >>>> I think neither power nor z has gather so I'm curious if the
> >>>> patch helps 510.parest_r there, I'm unsure about neon/advsimd.
> >>>
> >>> Yes, Power (latest Power10) doesn't support gather load.
> >>> I just measured 510.parest_r with this patch on Power9 at option
> >>> -Ofast -mcpu=power9 {,-funroll-loops}, both are neutral.
> >>>
> >>> It fails to vectorize the loop in vect-gather-1.c:
> >>>
> >>> vect-gather.c:12:28: missed:  failed: evolution of base is not affine.
> >>> vect-gather.c:11:46: missed:   not vectorized: data ref analysis failed _6 = *_5;
> >>> vect-gather.c:12:28: missed:   not vectorized: data ref analysis failed: _6 = *_5;
> >>> vect-gather.c:11:46: missed:  bad data references.
> >>> vect-gather.c:11:46: missed: couldn't vectorize loop
> >>>
> >>
> >> By further investigation, it's due to that rs6000 fails to make
> >> maybe_gather true in:
> >>
> >> 	  bool maybe_gather
> >> 	    = DR_IS_READ (dr)
> >> 	      && !TREE_THIS_VOLATILE (DR_REF (dr))
> >> 	      && (targetm.vectorize.builtin_gather != NULL
> >> 		  || supports_vec_gather_load_p ());
> >>
> >> With the hacking defining TARGET_VECTORIZE_BUILTIN_GATHER (as
> >> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
> >> case gets vectorized as expected.
> > 
> > Ah, yeah - this check needs to be elided.  Checking with a cross
> > that's indeed what is missing.
> > 
> >> But re-evaluated 510.parest_r with this extra hacking, the
> >> runtime performance doesn't have any changes.
> > 
> > OK, it likely needs the followup as well then.  For the added
> > testcase I end up with
> > 
> > .L4:
> >         lwz %r10,8(%r9)
> >         lwz %r31,12(%r9)
> >         addi %r8,%r8,32
> >         addi %r9,%r9,16
> >         lwz %r7,-16(%r9)
> >         lwz %r0,-12(%r9)
> >         lxv %vs10,-16(%r8)
> >         lxv %vs12,-32(%r8)
> >         extswsli %r10,%r10,3
> >         extswsli %r31,%r31,3
> >         extswsli %r7,%r7,3
> >         extswsli %r0,%r0,3
> >         ldx %r10,%r6,%r10
> >         ldx %r31,%r6,%r31
> >         ldx %r7,%r6,%r7
> >         ldx %r12,%r6,%r0
> >         mtvsrdd %vs0,%r31,%r10
> >         mtvsrdd %vs11,%r12,%r7
> >         xvmuldp %vs0,%vs0,%vs10
> >         xvmaddadp %vs0,%vs12,%vs11
> >         xvadddp %vs32,%vs32,%vs0
> >         bdnz .L4
> > 
> > as the inner vectorized loop.  Looks like power doesn't have
> > extending SI->DImode loads?  Otherwise the code looks
> 
> You meant one instruction for both left shifting and loads?

No, I meant doing

         lwz %r10,8(%r9)
         extswsli %r10,%r10,3

in one step.  I think extswsli should be a SImode->DImode sign-extend.
Ah - it does the multiplication by 8 and the sign-extend.  OK, that's
nice as well.  On x86 the multiplication by 8 is done via complex
addressing mode on the dependent load and the scalar load does
the sign-extension part.

> It doesn't if so.  btw, the above code sequence looks nice!
> > straight-forward (I've used -mcpu=power10), but eventually
> > the whole gather, which I think boils down to
> > 
> >         lwz %r10,8(%r9)
> >         lwz %r31,12(%r9)
> >         addi %r9,%r9,16
> >         lwz %r7,-16(%r9)
> >         lwz %r0,-12(%r9)
> >         extswsli %r10,%r10,3
> >         extswsli %r31,%r31,3
> >         extswsli %r7,%r7,3
> >         extswsli %r0,%r0,3
> >         ldx %r10,%r6,%r10
> >         ldx %r31,%r6,%r31
> >         ldx %r7,%r6,%r7
> >         ldx %r12,%r6,%r0
> >         mtvsrdd %vs0,%r31,%r10
> >         mtvsrdd %vs11,%r12,%r7
> > 
> > is too difficult for the integer/load side of the pipeline.  It's
> > of course the same in the scalar loop but there the FP ops only
> > need to wait for one lane to complete.  Note the above is with
> > the followup patch.
> > 
> 
> Good point!  Will check the actual effect after this and its follow-up
> are landed, may need some cost tweaking on it.

Yes - I hope the vectorizer costing part is conservative enough.

Richard.

next prev parent reply	other threads:[~2021-08-02  9:11 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-30 11:34 Richard Biener
2021-07-30 13:42 ` Richard Sandiford
2021-08-02  7:44   ` Richard Biener
2021-08-02  8:05     ` Richard Sandiford
2021-08-02  8:15       ` Richard Biener
2021-08-02  8:20         ` Richard Sandiford
2021-07-30 14:04 ` Tamar Christina
2021-07-30 14:04 ` Kewen.Lin
2021-08-02  6:07   ` Kewen.Lin
2021-08-02  7:09     ` Richard Biener
2021-08-02  8:35       ` Kewen.Lin
2021-08-02  9:11         ` Richard Biener [this message]
2021-08-02  9:20           ` Kewen.Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.YFH.7.76.2108021107210.11781@zhemvz.fhfr.qr \
    --to=rguenther@suse.de \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=linkw@linux.ibm.com \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).