Re: [PATCH 2/2][RFC] Add loop masking support for x86

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <rguenther@suse.de>
To: Hongtao Liu <crazylht@gmail.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	 Hongtao Liu <hongtao.liu@intel.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 2/2][RFC] Add loop masking support for x86
Date: Tue, 20 Jul 2021 09:38:57 +0200 (CEST)	[thread overview]
Message-ID: <nycvar.YFH.7.76.2107200931290.10711@zhemvz.fhfr.qr> (raw)
In-Reply-To: <CAMZc-bw3sDERU6YunDQ_eku_cg4TySCC8bZMX_B83A6hCycfHA@mail.gmail.com>

On Tue, 20 Jul 2021, Hongtao Liu wrote:

> On Fri, Jul 16, 2021 at 5:11 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Thu, 15 Jul 2021, Richard Biener wrote:
> >
> > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > >
> > > > OK, guess I was more looking at
> > > >
> > > > #define N 32
> > > > int foo (unsigned long *a, unsigned long * __restrict b,
> > > >          unsigned int *c, unsigned int * __restrict d,
> > > >          int n)
> > > > {
> > > >   unsigned sum = 1;
> > > >   for (int i = 0; i < n; ++i)
> > > >     {
> > > >       b[i] += a[i];
> > > >       d[i] += c[i];
> > > >     }
> > > >   return sum;
> > > > }
> > > >
> > > > where we on x86 AVX512 vectorize with V8DI and V16SI and we
> > > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > > mask for V16SI.  With SVE I see
> > > >
> > > >         punpklo p1.h, p0.b
> > > >         punpkhi p2.h, p0.b
> > > >
> > > > that's sth I expected to see for AVX512 as well, using the V16SI
> > > > mask and unpacking that to two V8DI ones.  But I see
> > > >
> > > >         vpbroadcastd    %eax, %ymm0
> > > >         vpaddd  %ymm12, %ymm0, %ymm0
> > > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > > >         vpbroadcastd    %eax, %xmm0
> > > >         vpaddd  %xmm10, %xmm0, %xmm0
> > > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > > >         kortestb        %k1, %k1
> > > >         jne     .L3
> > > >
> > > > so three %k masks generated by vpcmpud.  I'll have to look what's
> > > > the magic for SVE and why that doesn't trigger for x86 here.
> > >
> > > So answer myself, vect_maybe_permute_loop_masks looks for
> > > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > > QImode so that doesn't play well here.  Not sure if there
> > > are proper mask instructions to use (I guess there's a shift
> > > and lopart is free).  This is QI:8 to two QI:4 (bits) mask
> Yes, for 16bit and more, we have KUNPCKBW/D/Q. but for 8bit
> unpack_lo/hi, only shift.
> > > conversion.  Not sure how to better ask the target here - again
> > > VnBImode might have been easier here.
> >
> > So I've managed to "emulate" the unpack_lo/hi for the case of
> > !VECTOR_MODE_P masks by using sub-vector select (we're asking
> > to turn vector(8) <signed-boolean:1> into two
> > vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> > produces the desired single mask producer and
> >
> >   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
> >   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
> >
> > note for the lowpart we can just view-convert away the excess bits,
> > fully re-using the mask.  We generate surprisingly "good" code:
> >
> >         kmovb   %k1, %edi
> >         shrb    $4, %dil
> >         kmovb   %edi, %k2
> >
> > besides the lack of using kshiftrb.  I guess we're just lacking
> > a mask register alternative for
> Yes, we can do it similar as kor/kand/kxor.
> >
> > (insn 22 20 25 4 (parallel [
> >             (set (reg:QI 94 [ loop_mask_37 ])
> >                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
> >                     (const_int 4 [0x4])))
> >             (clobber (reg:CC 17 flags))
> >         ]) 724 {*lshrqi3_1}
> >      (expr_list:REG_UNUSED (reg:CC 17 flags)
> >         (nil)))
> >
> > and so we reload.  For the above cited loop the AVX512 vectorization
> > with --param vect-partial-vector-usage=1 does look quite sensible
> > to me.  Instead of a SSE vectorized epilogue plus a scalar
> > epilogue we get a single fully masked AVX512 "iteration" for both.
> > I suppose it's still mostly a code-size optimization (384 bytes
> > with the masked epiloge vs. 474 bytes with trunk) since it will
> > be likely slower for very low iteration counts but it's good
> > for icache usage then and good for less branch predictor usage.
> >
> > That said, I have to set up SPEC on a AVX512 machine to do
> Does patch  land in trunk already, i can have a test on CLX.

I'm still experimenting a bit right now but hope to get something
trunk ready at the end of this or beginning next week.  Since it's
disabled by default we can work on improving it during stage1 then.

I'm mostly struggling with the GIMPLE IL to be used for the
mask unpacking since we currently reject both the BIT_FIELD_REF
and the VIEW_CONVERT we generate (why do AVX512 masks not all have
SImode but sometimes QImode and sometimes HImode ...).  Unfortunately
we've dropped whole-vector shifts in favor of VEC_PERM but that
doesn't work well either for integer mode vectors.  So I'm still
playing with my options here and looking for something that doesn't
require too much surgery on the RTL side to recover good mask
register code ...

Another part missing is expanders for the various cond_* patterns

OPTAB_D (cond_add_optab, "cond_add$a")
OPTAB_D (cond_sub_optab, "cond_sub$a")
OPTAB_D (cond_smul_optab, "cond_mul$a")
OPTAB_D (cond_sdiv_optab, "cond_div$a")
OPTAB_D (cond_smod_optab, "cond_mod$a")
OPTAB_D (cond_udiv_optab, "cond_udiv$a")
OPTAB_D (cond_umod_optab, "cond_umod$a")
OPTAB_D (cond_and_optab, "cond_and$a")
OPTAB_D (cond_ior_optab, "cond_ior$a")
OPTAB_D (cond_xor_optab, "cond_xor$a")
OPTAB_D (cond_ashl_optab, "cond_ashl$a")
OPTAB_D (cond_ashr_optab, "cond_ashr$a")
OPTAB_D (cond_lshr_optab, "cond_lshr$a")
OPTAB_D (cond_smin_optab, "cond_smin$a")
OPTAB_D (cond_smax_optab, "cond_smax$a")
OPTAB_D (cond_umin_optab, "cond_umin$a")
OPTAB_D (cond_umax_optab, "cond_umax$a")
OPTAB_D (cond_fma_optab, "cond_fma$a")
OPTAB_D (cond_fms_optab, "cond_fms$a")
OPTAB_D (cond_fnma_optab, "cond_fnma$a")
OPTAB_D (cond_fnms_optab, "cond_fnms$a")

I think the most useful are those for possibly trapping ops
(will be used by if-conversion) and those for reduction operations
(add,min,max) which would enable a masked reduction epilogue.

The good thing is that I've been able to get my hands on a
Cascadelake system so I can at least test things for correctness.

Richard.

> > any meaningful measurements (I suspect with just AVX2 we're not
> > going to see any benefit from masking).  Hints/help how to fix
> > the missing kshiftrb appreciated.
> >
> > Oh, and if there's only V4DImode and V16HImode data then
> > we don't go the vect_maybe_permute_loop_masks path - that is,
> > we don't generate the (not used) intermediate mask but end up
> > generating two while_ult parts.
> >
> > Thanks,
> > Richard.

next prev parent reply	other threads:[~2021-07-20  7:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-15 10:30 Richard Biener
2021-07-15 10:45 ` Richard Biener
2021-07-15 11:20   ` Hongtao Liu
2021-07-15 11:48     ` Richard Biener
2021-07-15 14:57       ` Richard Sandiford
2021-07-15 15:15         ` Richard Biener
2021-07-15 15:31           ` Richard Biener
2021-07-16  9:11             ` Richard Biener
2021-07-20  4:20               ` Hongtao Liu
2021-07-20  7:38                 ` Richard Biener [this message]
2021-07-20 11:07                   ` Hongtao Liu
2021-07-20 11:09                     ` Richard Biener
2021-07-21  7:57                   ` Hongtao Liu
2021-07-21  8:16                     ` Richard Biener
2021-07-21  9:38                       ` Hongtao Liu
2021-07-21 10:13                         ` Richard Biener
2021-07-16  1:46       ` Hongtao Liu
2021-07-16  6:09         ` Richard Biener
2021-07-15 13:49 ` Richard Sandiford
2021-07-15 13:54   ` Richard Biener
2021-07-20 13:48   ` Richard Biener
2021-07-21  6:17     ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.YFH.7.76.2107200931290.10711@zhemvz.fhfr.qr \
    --to=rguenther@suse.de \
    --cc=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hongtao.liu@intel.com \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).