public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: "Liu, Hongtao" <hongtao.liu@intel.com>
To: Richard Biener <rguenther@suse.de>, Andrew Stubbs <ams@codesourcery.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	"richard.sandiford@arm.com" <richard.sandiford@arm.com>,
	Jan Hubicka <hubicka@ucw.cz>,
	"kirill.yukhin@gmail.com" <kirill.yukhin@gmail.com>
Subject: RE: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 05:50:45 +0000	[thread overview]
Message-ID: <SA1PR11MB67573854CBC1B787A4395D53E55BA@SA1PR11MB6757.namprd11.prod.outlook.com> (raw)
In-Reply-To: <DA5221DD-8F3C-47A3-821D-8CB09DA5E60B@suse.de>



> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Wednesday, June 14, 2023 10:30 PM
> To: Andrew Stubbs <ams@codesourcery.com>
> Cc: gcc-patches@gcc.gnu.org; richard.sandiford@arm.com; Jan Hubicka
> <hubicka@ucw.cz>; Liu, Hongtao <hongtao.liu@intel.com>;
> kirill.yukhin@gmail.com
> Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
> 
> 
> 
> > Am 14.06.2023 um 16:27 schrieb Andrew Stubbs
> <ams@codesourcery.com>:
> >
> > On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >> This implemens fully masked vectorization or a masked epilog for
> >> AVX512 style masks which single themselves out by representing each
> >> lane with a single bit and by using integer modes for the mask (both
> >> is much like GCN).
> >> AVX512 is also special in that it doesn't have any instruction to
> >> compute the mask from a scalar IV like SVE has with while_ult.
> >> Instead the masks are produced by vector compares and the loop
> >> control retains the scalar IV (mainly to avoid dependences on mask
> >> generation, a suitable mask test instruction is available).
> >
> > This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> which expands to a vector compare against a vector of stepped values. This
> requires an additional instruction to prepare the comparison vector
> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> the DImode bitmask, so it works reasonably well.
> >
> >> Like RVV code generation prefers a decrementing IV though IVOPTs
> >> messes things up in some cases removing that IV to eliminate it with
> >> an incrementing one used for address generation.
> >> One of the motivating testcases is from PR108410 which in turn is
> >> extracted from x264 where large size vectorization shows issues with
> >> small trip loops.  Execution time there improves compared to classic
> >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> >> size   scalar     128     256     512    512e    512f
> >>     1    9.42   11.32    9.35   11.17   15.13   16.89
> >>     2    5.72    6.53    6.66    6.66    7.62    8.56
> >>     3    4.49    5.10    5.10    5.74    5.08    5.73
> >>     4    4.10    4.33    4.29    5.21    3.79    4.25
> >>     6    3.78    3.85    3.86    4.76    2.54    2.85
> >>     8    3.64    1.89    3.76    4.50    1.92    2.16
> >>    12    3.56    2.21    3.75    4.26    1.26    1.42
> >>    16    3.36    0.83    1.06    4.16    0.95    1.07
> >>    20    3.39    1.42    1.33    4.07    0.75    0.85
> >>    24    3.23    0.66    1.72    4.22    0.62    0.70
> >>    28    3.18    1.09    2.04    4.20    0.54    0.61
> >>    32    3.16    0.47    0.41    0.41    0.47    0.53
> >>    34    3.16    0.67    0.61    0.56    0.44    0.50
> >>    38    3.19    0.95    0.95    0.82    0.40    0.45
> >>    42    3.09    0.58    1.21    1.13    0.36    0.40
> >> 'size' specifies the number of actual iterations, 512e is for a
> >> masked epilog and 512f for the fully masked loop.  From
> >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> >> winner, the fully masked variant is clearly worse and it's size
> >> benefit is also tiny.
> >
> > Let me check I understand correctly. In the fully masked case, there is a
> single loop in which a new mask is generated at the start of each iteration. In
> the masked epilogue case, the main loop uses no masking whatsoever, thus
> avoiding the need for generating a mask, carrying the mask, inserting
> vec_merge operations, etc, and then the epilogue looks much like the fully
> masked case, but unlike smaller mode epilogues there is no loop because the
> eplogue vector size is the same. Is that right?
> 
> Yes.
What about vectorizer and unroll, when vector size is the same, unroll factor is N, but there're at most N - 1 iterations for epilogue loop, will there still a loop? 
> > This scheme seems like it might also benefit GCN, in so much as it simplifies
> the hot code path.
> >
> > GCN does not actually have smaller vector sizes, so there's no analogue to
> AVX2 (we pretend we have some smaller sizes, but that's because the
> middle end can't do masking everywhere yet, and it helps make some vector
> constants smaller, perhaps).
> >
> >> This patch does not enable using fully masked loops or masked
> >> epilogues by default.  More work on cost modeling and vectorization
> >> kind selection on x86_64 is necessary for this.
> >> Implementation wise this introduces
> LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >> which could be exploited further to unify some of the flags we have
> >> right now but there didn't seem to be many easy things to merge, so
> >> I'm leaving this for followups.
> >> Mask requirements as registered by vect_record_loop_mask are kept in
> >> their original form and recorded in a hash_set now instead of being
> >> processed to a vector of rgroup_controls.  Instead that's now left to
> >> the final analysis phase which tries forming the rgroup_controls
> >> vector using while_ult and if that fails now tries AVX512 style which
> >> needs a different organization and instead fills a hash_map with the
> >> relevant info.  vect_get_loop_mask now has two implementations, one
> >> for the two mask styles we then have.
> >> I have decided against interweaving
> >> vect_set_loop_condition_partial_vectors
> >> with conditions to do AVX512 style masking and instead opted to
> >> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >> Likewise for vect_verify_full_masking vs
> vect_verify_full_masking_avx512.
> >> I was split between making 'vec_loop_masks' a class with methods,
> >> possibly merging in the _len stuff into a single registry.  It seemed
> >> to be too many changes for the purpose of getting AVX512 working.
> >> I'm going to play wait and see what happens with RISC-V here since
> >> they are going to get both masks and lengths registered I think.
> >> The vect_prepare_for_masked_peels hunk might run into issues with
> >> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >> looked odd.
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run the
> >> testsuite with --param vect-partial-vector-usage=2 with and without
> >> -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one
> >> latent wrong-code (PR110237).
> >> There's followup work to be done to try enabling masked epilogues for
> >> x86-64 by default (when AVX512 is enabled, possibly only when
> >> -mprefer-vector-width=512).  Getting cost modeling and decision right
> >> is going to be challenging.
> >> Any comments?
> >> OK?
> >> Btw, testing on GCN would be welcome - the _avx512 paths could work
> >> for it so in case the while_ult path fails (not sure if it ever does)
> >> it could get _avx512 style masking.  Likewise testing on ARM just to
> >> see I didn't break anything here.
> >> I don't have SVE hardware so testing is probably meaningless.
> >
> > I can set some tests going. Is vect.exp enough?
> 
> Well, only you know (from experience), but sure that’s a nice start.
> 
> Richard
> 
> > Andrew
> >

  reply	other threads:[~2023-06-15  5:50 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20230614115450.28CEA3858288@sourceware.org>
2023-06-14 14:26 ` Andrew Stubbs
2023-06-14 14:29   ` Richard Biener
2023-06-15  5:50     ` Liu, Hongtao [this message]
2023-06-15  6:51       ` Richard Biener
2023-06-15  9:26     ` Andrew Stubbs
2023-06-15  9:58       ` Richard Biener
2023-06-15 10:13         ` Andrew Stubbs
2023-06-15 11:06           ` Richard Biener
2023-06-15 13:04             ` Andrew Stubbs
2023-06-15 13:34               ` Richard Biener
2023-06-15 13:52                 ` Andrew Stubbs
2023-06-15 14:00                   ` Richard Biener
2023-06-15 14:04                     ` Andrew Stubbs
2023-06-15 16:16                       ` Richard Biener
2023-06-15  9:58       ` Richard Sandiford
     [not found] <20230614115429.D400C3858433@sourceware.org>
2023-06-14 18:45 ` Richard Sandiford
2023-06-15 12:14   ` Richard Biener
2023-06-15 12:53     ` Richard Biener
2023-06-14 11:54 Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=SA1PR11MB67573854CBC1B787A4395D53E55BA@SA1PR11MB6757.namprd11.prod.outlook.com \
    --to=hongtao.liu@intel.com \
    --cc=ams@codesourcery.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hubicka@ucw.cz \
    --cc=kirill.yukhin@gmail.com \
    --cc=rguenther@suse.de \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).