public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH 3/3] AVX512 fully masked vectorization
       [not found] <20230614115450.28CEA3858288@sourceware.org>
@ 2023-06-14 14:26 ` Andrew Stubbs
  2023-06-14 14:29   ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-14 14:26 UTC (permalink / raw)
  To: Richard Biener, gcc-patches
  Cc: richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> This implements fully masked vectorization or a masked epilog for
> AVX512 style masks, which set themselves apart by representing
> each lane with a single bit and by using integer modes for the mask
> (both much like GCN).
> 
> AVX512 is also special in that it doesn't have any instruction
> to compute the mask from a scalar IV like SVE has with while_ult.
> Instead the masks are produced by vector compares and the loop
> control retains the scalar IV (mainly to avoid dependences on
> mask generation, a suitable mask test instruction is available).

This also sounds like GCN. We currently use WHILE_ULT in the middle 
end which expands to a vector compare against a vector of stepped 
values. This requires an additional instruction to prepare the 
comparison vector (compared to SVE), but the "while_ultv64sidi" pattern 
(for example) returns the DImode bitmask, so it works reasonably well.
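
For reference, the value our while_ultv64sidi pattern produces boils
down to this (a scalar sketch of the semantics only, not the md
pattern):

/* WHILE_ULT (i, n) as a 64-lane DImode bitmask: bit l is set iff
   i + l < n.  The hardware gets this from one vector compare of a
   stepped vector against the bound.  */
unsigned long long
while_ult_semantics (unsigned long long i, unsigned long long n)
{
  unsigned long long mask = 0;
  for (int l = 0; l < 64; l++)
    if (i + l < n)
      mask |= 1ull << l;
  return mask;
}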

> Like RVV, code generation prefers a decrementing IV, though IVOPTs
> messes things up in some cases, replacing that IV with the
> incrementing one used for address generation.
> 
> One of the motivating testcases is from PR108410 which in turn
> is extracted from x264 where large size vectorization shows
> issues with small trip loops.  Execution time there improves
> compared to classic AVX512 with AVX2 epilogues for the cases
> of less than 32 iterations.
> 
> size   scalar     128     256     512    512e    512f
>      1    9.42   11.32    9.35   11.17   15.13   16.89
>      2    5.72    6.53    6.66    6.66    7.62    8.56
>      3    4.49    5.10    5.10    5.74    5.08    5.73
>      4    4.10    4.33    4.29    5.21    3.79    4.25
>      6    3.78    3.85    3.86    4.76    2.54    2.85
>      8    3.64    1.89    3.76    4.50    1.92    2.16
>     12    3.56    2.21    3.75    4.26    1.26    1.42
>     16    3.36    0.83    1.06    4.16    0.95    1.07
>     20    3.39    1.42    1.33    4.07    0.75    0.85
>     24    3.23    0.66    1.72    4.22    0.62    0.70
>     28    3.18    1.09    2.04    4.20    0.54    0.61
>     32    3.16    0.47    0.41    0.41    0.47    0.53
>     34    3.16    0.67    0.61    0.56    0.44    0.50
>     38    3.19    0.95    0.95    0.82    0.40    0.45
>     42    3.09    0.58    1.21    1.13    0.36    0.40
> 
> 'size' specifies the number of actual iterations, 512e is for
> a masked epilog and 512f for the fully masked loop.  From
> 4 scalar iterations on, the AVX512 masked epilog code is clearly
> the winner; the fully masked variant is clearly worse and
> its size benefit is also tiny.

Let me check I understand correctly. In the fully masked case, there is 
a single loop in which a new mask is generated at the start of each 
iteration. In the masked epilogue case, the main loop uses no masking 
whatsoever, thus avoiding the need for generating a mask, carrying the 
mask, inserting vec_merge operations, etc., and then the epilogue looks 
much like the fully masked case, but unlike smaller-mode epilogues there 
is no loop because the epilogue vector size is the same. Is that right?
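
Concretely, I'm picturing the two shapes roughly like this (my own
intrinsics sketch for a simple a[i] += b[i] loop with 16 int lanes, not
code from the patch):

#include <immintrin.h>

/* Fully masked: a single loop, with a fresh mask computed by a vector
   compare against the remaining trip count at the start of every
   iteration.  */
void
add_fully_masked (int *a, const int *b, int n)
{
  const __m512i lane = _mm512_set_epi32 (15, 14, 13, 12, 11, 10, 9, 8,
                                         7, 6, 5, 4, 3, 2, 1, 0);
  for (int i = 0; i < n; i += 16)
    {
      __mmask16 m
        = _mm512_cmplt_epi32_mask (lane, _mm512_set1_epi32 (n - i));
      __m512i va = _mm512_maskz_loadu_epi32 (m, a + i);
      __m512i vb = _mm512_maskz_loadu_epi32 (m, b + i);
      _mm512_mask_storeu_epi32 (a + i, m, _mm512_add_epi32 (va, vb));
    }
}

/* Masked epilogue: the main loop is completely unmasked and the
   leftovers are handled by a single masked step - same vector size,
   so no epilogue loop.  */
void
add_masked_epilogue (int *a, const int *b, int n)
{
  int i = 0;
  for (; i + 16 <= n; i += 16)
    _mm512_storeu_si512 (a + i,
                         _mm512_add_epi32 (_mm512_loadu_si512 (a + i),
                                           _mm512_loadu_si512 (b + i)));
  if (i < n)
    {
      __mmask16 m = (__mmask16) ((1u << (n - i)) - 1);
      __m512i va = _mm512_maskz_loadu_epi32 (m, a + i);
      __m512i vb = _mm512_maskz_loadu_epi32 (m, b + i);
      _mm512_mask_storeu_epi32 (a + i, m, _mm512_add_epi32 (va, vb));
    }
}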

This scheme seems like it might also benefit GCN, insofar as it 
simplifies the hot code path.

GCN does not actually have smaller vector sizes, so there's no analogue 
to AVX2 (we pretend we have some smaller sizes, but that's because the 
middle end can't do masking everywhere yet, and it helps make some 
vector constants smaller, perhaps).

> This patch does not enable using fully masked loops or
> masked epilogues by default.  More work on cost modeling
> and vectorization kind selection on x86_64 is necessary
> for this.
> 
> Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> which could be exploited further to unify some of the flags
> we have right now, but there didn't seem to be many easy things
> to merge, so I'm leaving this for followups.
> 
> Mask requirements as registered by vect_record_loop_mask are kept in their
> original form and recorded in a hash_set now instead of being
> processed into a vector of rgroup_controls.  Instead that's now
> left to the final analysis phase, which tries forming the rgroup_controls
> vector using while_ult and, if that fails, tries AVX512 style,
> which needs a different organization and instead fills a hash_map
> with the relevant info.  vect_get_loop_mask now has two implementations,
> one for each of the two mask styles we then have.
> 
> I have decided against interweaving vect_set_loop_condition_partial_vectors
> with conditions to do AVX512 style masking and instead opted to
> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> 
> I was torn about making 'vec_loop_masks' a class with methods,
> possibly merging in the _len stuff into a single registry.  It
> seemed to be too many changes for the purpose of getting AVX512
> working.  I'm going to wait and see what happens with RISC-V
> here since they are going to get both masks and lengths registered,
> I think.
> 
> The vect_prepare_for_masked_peels hunk might run into issues with
> SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> looked odd.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> the testsuite with --param vect-partial-vector-usage=2 with and
> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> and one latent wrong-code (PR110237).
> 
> There's followup work to be done to try enabling masked epilogues
> for x86-64 by default (when AVX512 is enabled, possibly only when
> -mprefer-vector-width=512).  Getting cost modeling and decision
> right is going to be challenging.
> 
> Any comments?
> 
> OK?
> 
> Btw, testing on GCN would be welcome - the _avx512 paths could
> work for it, so in case the while_ult path fails (not sure if
> it ever does) it could get _avx512 style masking.  Likewise
> testing on ARM, just to see I didn't break anything here.
> I don't have SVE hardware, so my own testing is probably meaningless.

I can set some tests going. Is vect.exp enough?

Andrew



* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-14 14:26 ` [PATCH 3/3] AVX512 fully masked vectorization Andrew Stubbs
@ 2023-06-14 14:29   ` Richard Biener
  2023-06-15  5:50     ` Liu, Hongtao
  2023-06-15  9:26     ` Andrew Stubbs
  0 siblings, 2 replies; 19+ messages in thread
From: Richard Biener @ 2023-06-14 14:29 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin



> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
> 
> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>> This implemens fully masked vectorization or a masked epilog for
>> AVX512 style masks which single themselves out by representing
>> each lane with a single bit and by using integer modes for the mask
>> (both is much like GCN).
>> AVX512 is also special in that it doesn't have any instruction
>> to compute the mask from a scalar IV like SVE has with while_ult.
>> Instead the masks are produced by vector compares and the loop
>> control retains the scalar IV (mainly to avoid dependences on
>> mask generation, a suitable mask test instruction is available).
> 
> This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well.
> 
>> Like RVV code generation prefers a decrementing IV though IVOPTs
>> messes things up in some cases removing that IV to eliminate
>> it with an incrementing one used for address generation.
>> One of the motivating testcases is from PR108410 which in turn
>> is extracted from x264 where large size vectorization shows
>> issues with small trip loops.  Execution time there improves
>> compared to classic AVX512 with AVX2 epilogues for the cases
>> of less than 32 iterations.
>> size   scalar     128     256     512    512e    512f
>>     1    9.42   11.32    9.35   11.17   15.13   16.89
>>     2    5.72    6.53    6.66    6.66    7.62    8.56
>>     3    4.49    5.10    5.10    5.74    5.08    5.73
>>     4    4.10    4.33    4.29    5.21    3.79    4.25
>>     6    3.78    3.85    3.86    4.76    2.54    2.85
>>     8    3.64    1.89    3.76    4.50    1.92    2.16
>>    12    3.56    2.21    3.75    4.26    1.26    1.42
>>    16    3.36    0.83    1.06    4.16    0.95    1.07
>>    20    3.39    1.42    1.33    4.07    0.75    0.85
>>    24    3.23    0.66    1.72    4.22    0.62    0.70
>>    28    3.18    1.09    2.04    4.20    0.54    0.61
>>    32    3.16    0.47    0.41    0.41    0.47    0.53
>>    34    3.16    0.67    0.61    0.56    0.44    0.50
>>    38    3.19    0.95    0.95    0.82    0.40    0.45
>>    42    3.09    0.58    1.21    1.13    0.36    0.40
>> 'size' specifies the number of actual iterations, 512e is for
>> a masked epilog and 512f for the fully masked loop.  From
>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>> the winner, the fully masked variant is clearly worse and
>> it's size benefit is also tiny.
> 
> Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right?

Yes.

> This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path.
> 
> GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps).
> 
>> This patch does not enable using fully masked loops or
>> masked epilogues by default.  More work on cost modeling
>> and vectorization kind selection on x86_64 is necessary
>> for this.
>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>> which could be exploited further to unify some of the flags
>> we have right now but there didn't seem to be many easy things
>> to merge, so I'm leaving this for followups.
>> Mask requirements as registered by vect_record_loop_mask are kept in their
>> original form and recorded in a hash_set now instead of being
>> processed to a vector of rgroup_controls.  Instead that's now
>> left to the final analysis phase which tries forming the rgroup_controls
>> vector using while_ult and if that fails now tries AVX512 style
>> which needs a different organization and instead fills a hash_map
>> with the relevant info.  vect_get_loop_mask now has two implementations,
>> one for the two mask styles we then have.
>> I have decided against interweaving vect_set_loop_condition_partial_vectors
>> with conditions to do AVX512 style masking and instead opted to
>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>> I was split between making 'vec_loop_masks' a class with methods,
>> possibly merging in the _len stuff into a single registry.  It
>> seemed to be too many changes for the purpose of getting AVX512
>> working.  I'm going to play wait and see what happens with RISC-V
>> here since they are going to get both masks and lengths registered
>> I think.
>> The vect_prepare_for_masked_peels hunk might run into issues with
>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>> looked odd.
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>> the testsuite with --param vect-partial-vector-usage=2 with and
>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>> and one latent wrong-code (PR110237).
>> There's followup work to be done to try enabling masked epilogues
>> for x86-64 by default (when AVX512 is enabled, possibly only when
>> -mprefer-vector-width=512).  Getting cost modeling and decision
>> right is going to be challenging.
>> Any comments?
>> OK?
>> Btw, testing on GCN would be welcome - the _avx512 paths could
>> work for it so in case the while_ult path fails (not sure if
>> it ever does) it could get _avx512 style masking.  Likewise
>> testing on ARM just to see I didn't break anything here.
>> I don't have SVE hardware so testing is probably meaningless.
> 
> I can set some tests going. Is vect.exp enough?

Well, only you know (from experience), but sure that’s a nice start.

Richard 

> Andrew
> 


* RE: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-14 14:29   ` Richard Biener
@ 2023-06-15  5:50     ` Liu, Hongtao
  2023-06-15  6:51       ` Richard Biener
  2023-06-15  9:26     ` Andrew Stubbs
  1 sibling, 1 reply; 19+ messages in thread
From: Liu, Hongtao @ 2023-06-15  5:50 UTC (permalink / raw)
  To: Richard Biener, Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, kirill.yukhin



> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Wednesday, June 14, 2023 10:30 PM
> To: Andrew Stubbs <ams@codesourcery.com>
> Cc: gcc-patches@gcc.gnu.org; richard.sandiford@arm.com; Jan Hubicka
> <hubicka@ucw.cz>; Liu, Hongtao <hongtao.liu@intel.com>;
> kirill.yukhin@gmail.com
> Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
> 
> 
> 
> > Am 14.06.2023 um 16:27 schrieb Andrew Stubbs
> <ams@codesourcery.com>:
> >
> > On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >> This implemens fully masked vectorization or a masked epilog for
> >> AVX512 style masks which single themselves out by representing each
> >> lane with a single bit and by using integer modes for the mask (both
> >> is much like GCN).
> >> AVX512 is also special in that it doesn't have any instruction to
> >> compute the mask from a scalar IV like SVE has with while_ult.
> >> Instead the masks are produced by vector compares and the loop
> >> control retains the scalar IV (mainly to avoid dependences on mask
> >> generation, a suitable mask test instruction is available).
> >
> > This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> which expands to a vector compare against a vector of stepped values. This
> requires an additional instruction to prepare the comparison vector
> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> the DImode bitmask, so it works reasonably well.
> >
> >> Like RVV code generation prefers a decrementing IV though IVOPTs
> >> messes things up in some cases removing that IV to eliminate it with
> >> an incrementing one used for address generation.
> >> One of the motivating testcases is from PR108410 which in turn is
> >> extracted from x264 where large size vectorization shows issues with
> >> small trip loops.  Execution time there improves compared to classic
> >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> >> size   scalar     128     256     512    512e    512f
> >>     1    9.42   11.32    9.35   11.17   15.13   16.89
> >>     2    5.72    6.53    6.66    6.66    7.62    8.56
> >>     3    4.49    5.10    5.10    5.74    5.08    5.73
> >>     4    4.10    4.33    4.29    5.21    3.79    4.25
> >>     6    3.78    3.85    3.86    4.76    2.54    2.85
> >>     8    3.64    1.89    3.76    4.50    1.92    2.16
> >>    12    3.56    2.21    3.75    4.26    1.26    1.42
> >>    16    3.36    0.83    1.06    4.16    0.95    1.07
> >>    20    3.39    1.42    1.33    4.07    0.75    0.85
> >>    24    3.23    0.66    1.72    4.22    0.62    0.70
> >>    28    3.18    1.09    2.04    4.20    0.54    0.61
> >>    32    3.16    0.47    0.41    0.41    0.47    0.53
> >>    34    3.16    0.67    0.61    0.56    0.44    0.50
> >>    38    3.19    0.95    0.95    0.82    0.40    0.45
> >>    42    3.09    0.58    1.21    1.13    0.36    0.40
> >> 'size' specifies the number of actual iterations, 512e is for a
> >> masked epilog and 512f for the fully masked loop.  From
> >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> >> winner, the fully masked variant is clearly worse and it's size
> >> benefit is also tiny.
> >
> > Let me check I understand correctly. In the fully masked case, there is a
> single loop in which a new mask is generated at the start of each iteration. In
> the masked epilogue case, the main loop uses no masking whatsoever, thus
> avoiding the need for generating a mask, carrying the mask, inserting
> vec_merge operations, etc, and then the epilogue looks much like the fully
> masked case, but unlike smaller mode epilogues there is no loop because the
> eplogue vector size is the same. Is that right?
> 
> Yes.
What about the vectorizer and unrolling: when the vector size is the same and the unroll factor is N, but there are at most N - 1 iterations for the epilogue loop, will there still be a loop?
> > This scheme seems like it might also benefit GCN, in so much as it simplifies
> the hot code path.
> >
> > GCN does not actually have smaller vector sizes, so there's no analogue to
> AVX2 (we pretend we have some smaller sizes, but that's because the
> middle end can't do masking everywhere yet, and it helps make some vector
> constants smaller, perhaps).
> >
> >> This patch does not enable using fully masked loops or masked
> >> epilogues by default.  More work on cost modeling and vectorization
> >> kind selection on x86_64 is necessary for this.
> >> Implementation wise this introduces
> LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >> which could be exploited further to unify some of the flags we have
> >> right now but there didn't seem to be many easy things to merge, so
> >> I'm leaving this for followups.
> >> Mask requirements as registered by vect_record_loop_mask are kept in
> >> their original form and recorded in a hash_set now instead of being
> >> processed to a vector of rgroup_controls.  Instead that's now left to
> >> the final analysis phase which tries forming the rgroup_controls
> >> vector using while_ult and if that fails now tries AVX512 style which
> >> needs a different organization and instead fills a hash_map with the
> >> relevant info.  vect_get_loop_mask now has two implementations, one
> >> for the two mask styles we then have.
> >> I have decided against interweaving
> >> vect_set_loop_condition_partial_vectors
> >> with conditions to do AVX512 style masking and instead opted to
> >> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >> Likewise for vect_verify_full_masking vs
> vect_verify_full_masking_avx512.
> >> I was split between making 'vec_loop_masks' a class with methods,
> >> possibly merging in the _len stuff into a single registry.  It seemed
> >> to be too many changes for the purpose of getting AVX512 working.
> >> I'm going to play wait and see what happens with RISC-V here since
> >> they are going to get both masks and lengths registered I think.
> >> The vect_prepare_for_masked_peels hunk might run into issues with
> >> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >> looked odd.
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run the
> >> testsuite with --param vect-partial-vector-usage=2 with and without
> >> -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one
> >> latent wrong-code (PR110237).
> >> There's followup work to be done to try enabling masked epilogues for
> >> x86-64 by default (when AVX512 is enabled, possibly only when
> >> -mprefer-vector-width=512).  Getting cost modeling and decision right
> >> is going to be challenging.
> >> Any comments?
> >> OK?
> >> Btw, testing on GCN would be welcome - the _avx512 paths could work
> >> for it so in case the while_ult path fails (not sure if it ever does)
> >> it could get _avx512 style masking.  Likewise testing on ARM just to
> >> see I didn't break anything here.
> >> I don't have SVE hardware so testing is probably meaningless.
> >
> > I can set some tests going. Is vect.exp enough?
> 
> Well, only you know (from experience), but sure that’s a nice start.
> 
> Richard
> 
> > Andrew
> >


* RE: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15  5:50     ` Liu, Hongtao
@ 2023-06-15  6:51       ` Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-06-15  6:51 UTC (permalink / raw)
  To: Liu, Hongtao
  Cc: Andrew Stubbs, gcc-patches, richard.sandiford, Jan Hubicka,
	kirill.yukhin


On Thu, 15 Jun 2023, Liu, Hongtao wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Wednesday, June 14, 2023 10:30 PM
> > To: Andrew Stubbs <ams@codesourcery.com>
> > Cc: gcc-patches@gcc.gnu.org; richard.sandiford@arm.com; Jan Hubicka
> > <hubicka@ucw.cz>; Liu, Hongtao <hongtao.liu@intel.com>;
> > kirill.yukhin@gmail.com
> > Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
> > 
> > 
> > 
> > > Am 14.06.2023 um 16:27 schrieb Andrew Stubbs
> > <ams@codesourcery.com>:
> > >
> > > On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> > >> This implemens fully masked vectorization or a masked epilog for
> > >> AVX512 style masks which single themselves out by representing each
> > >> lane with a single bit and by using integer modes for the mask (both
> > >> is much like GCN).
> > >> AVX512 is also special in that it doesn't have any instruction to
> > >> compute the mask from a scalar IV like SVE has with while_ult.
> > >> Instead the masks are produced by vector compares and the loop
> > >> control retains the scalar IV (mainly to avoid dependences on mask
> > >> generation, a suitable mask test instruction is available).
> > >
> > > This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> > which expands to a vector compare against a vector of stepped values. This
> > requires an additional instruction to prepare the comparison vector
> > (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> > the DImode bitmask, so it works reasonably well.
> > >
> > >> Like RVV code generation prefers a decrementing IV though IVOPTs
> > >> messes things up in some cases removing that IV to eliminate it with
> > >> an incrementing one used for address generation.
> > >> One of the motivating testcases is from PR108410 which in turn is
> > >> extracted from x264 where large size vectorization shows issues with
> > >> small trip loops.  Execution time there improves compared to classic
> > >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> > >> size   scalar     128     256     512    512e    512f
> > >>     1    9.42   11.32    9.35   11.17   15.13   16.89
> > >>     2    5.72    6.53    6.66    6.66    7.62    8.56
> > >>     3    4.49    5.10    5.10    5.74    5.08    5.73
> > >>     4    4.10    4.33    4.29    5.21    3.79    4.25
> > >>     6    3.78    3.85    3.86    4.76    2.54    2.85
> > >>     8    3.64    1.89    3.76    4.50    1.92    2.16
> > >>    12    3.56    2.21    3.75    4.26    1.26    1.42
> > >>    16    3.36    0.83    1.06    4.16    0.95    1.07
> > >>    20    3.39    1.42    1.33    4.07    0.75    0.85
> > >>    24    3.23    0.66    1.72    4.22    0.62    0.70
> > >>    28    3.18    1.09    2.04    4.20    0.54    0.61
> > >>    32    3.16    0.47    0.41    0.41    0.47    0.53
> > >>    34    3.16    0.67    0.61    0.56    0.44    0.50
> > >>    38    3.19    0.95    0.95    0.82    0.40    0.45
> > >>    42    3.09    0.58    1.21    1.13    0.36    0.40
> > >> 'size' specifies the number of actual iterations, 512e is for a
> > >> masked epilog and 512f for the fully masked loop.  From
> > >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> > >> winner, the fully masked variant is clearly worse and it's size
> > >> benefit is also tiny.
> > >
> > > Let me check I understand correctly. In the fully masked case, there is a
> > single loop in which a new mask is generated at the start of each iteration. In
> > the masked epilogue case, the main loop uses no masking whatsoever, thus
> > avoiding the need for generating a mask, carrying the mask, inserting
> > vec_merge operations, etc, and then the epilogue looks much like the fully
> > masked case, but unlike smaller mode epilogues there is no loop because the
> > eplogue vector size is the same. Is that right?
> > 
> > Yes.
> What about the vectorizer and unrolling: when the vector size is the
> same and the unroll factor is N, but there are at most N - 1 iterations
> for the epilogue loop, will there still be a loop?

Yes, it looks like vect_determine_partial_vectors_and_peeling has
a special exception for that; it does

         If we are unrolling we also do not want to use partial vectors.  This
         is to avoid the overhead of generating multiple masks and also to
         avoid having to execute entire iterations of FALSE masked instructions
         when dealing with one or less full iterations.
...
      if ((param_vect_partial_vector_usage == 1
           || loop_vinfo->suggested_unroll_factor > 1)
          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
          && !vect_known_niters_smaller_than_vf (loop_vinfo))
        LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
      else
        LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;

So when we unroll the main loop we are forcing partial vectors on
the epilog even though it is going to iterate.  The comment suggests
that wasn't the main intent; rather, the idea was to not use partial
vectors for the main loop.

Of course, since we can have at most a single vectorized epilog
(but not a vectorized epilog of the epilog), it might still make
sense to use partial vectors there.
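
For illustration, the kind of shape in question (my own intrinsics
sketch, not generated code; 16 int lanes, unroll factor two, so the
main loop steps by 32 and the masked epilogue can be left with up to
31 elements and thus still iterates):

#include <immintrin.h>

void
add_unrolled_with_masked_epilogue (int *a, const int *b, int n)
{
  int i = 0;
  /* Unmasked, two-times unrolled main loop.  */
  for (; i + 32 <= n; i += 32)
    {
      _mm512_storeu_si512 (a + i,
                           _mm512_add_epi32 (_mm512_loadu_si512 (a + i),
                                             _mm512_loadu_si512 (b + i)));
      _mm512_storeu_si512 (a + i + 16,
                           _mm512_add_epi32 (_mm512_loadu_si512 (a + i + 16),
                                             _mm512_loadu_si512 (b + i + 16)));
    }
  /* Masked epilogue; with up to 31 elements left it can run twice.  */
  for (; i < n; i += 16)
    {
      int rest = n - i;
      __mmask16 m = rest >= 16 ? (__mmask16) -1
                               : (__mmask16) ((1u << rest) - 1);
      __m512i va = _mm512_maskz_loadu_epi32 (m, a + i);
      __m512i vb = _mm512_maskz_loadu_epi32 (m, b + i);
      _mm512_mask_storeu_epi32 (a + i, m, _mm512_add_epi32 (va, vb));
    }
}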

> > > This scheme seems like it might also benefit GCN, in so much as it simplifies
> > the hot code path.
> > >
> > > GCN does not actually have smaller vector sizes, so there's no analogue to
> > AVX2 (we pretend we have some smaller sizes, but that's because the
> > middle end can't do masking everywhere yet, and it helps make some vector
> > constants smaller, perhaps).
> > >
> > >> This patch does not enable using fully masked loops or masked
> > >> epilogues by default.  More work on cost modeling and vectorization
> > >> kind selection on x86_64 is necessary for this.
> > >> Implementation wise this introduces
> > LOOP_VINFO_PARTIAL_VECTORS_STYLE
> > >> which could be exploited further to unify some of the flags we have
> > >> right now but there didn't seem to be many easy things to merge, so
> > >> I'm leaving this for followups.
> > >> Mask requirements as registered by vect_record_loop_mask are kept in
> > >> their original form and recorded in a hash_set now instead of being
> > >> processed to a vector of rgroup_controls.  Instead that's now left to
> > >> the final analysis phase which tries forming the rgroup_controls
> > >> vector using while_ult and if that fails now tries AVX512 style which
> > >> needs a different organization and instead fills a hash_map with the
> > >> relevant info.  vect_get_loop_mask now has two implementations, one
> > >> for the two mask styles we then have.
> > >> I have decided against interweaving
> > >> vect_set_loop_condition_partial_vectors
> > >> with conditions to do AVX512 style masking and instead opted to
> > >> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > >> Likewise for vect_verify_full_masking vs
> > vect_verify_full_masking_avx512.
> > >> I was split between making 'vec_loop_masks' a class with methods,
> > >> possibly merging in the _len stuff into a single registry.  It seemed
> > >> to be too many changes for the purpose of getting AVX512 working.
> > >> I'm going to play wait and see what happens with RISC-V here since
> > >> they are going to get both masks and lengths registered I think.
> > >> The vect_prepare_for_masked_peels hunk might run into issues with
> > >> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > >> looked odd.
> > >> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run the
> > >> testsuite with --param vect-partial-vector-usage=2 with and without
> > >> -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one
> > >> latent wrong-code (PR110237).
> > >> There's followup work to be done to try enabling masked epilogues for
> > >> x86-64 by default (when AVX512 is enabled, possibly only when
> > >> -mprefer-vector-width=512).  Getting cost modeling and decision right
> > >> is going to be challenging.
> > >> Any comments?
> > >> OK?
> > >> Btw, testing on GCN would be welcome - the _avx512 paths could work
> > >> for it so in case the while_ult path fails (not sure if it ever does)
> > >> it could get _avx512 style masking.  Likewise testing on ARM just to
> > >> see I didn't break anything here.
> > >> I don't have SVE hardware so testing is probably meaningless.
> > >
> > > I can set some tests going. Is vect.exp enough?
> > 
> > Well, only you know (from experience), but sure that’s a nice start.
> > 
> > Richard
> > 
> > > Andrew
> > >
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-14 14:29   ` Richard Biener
  2023-06-15  5:50     ` Liu, Hongtao
@ 2023-06-15  9:26     ` Andrew Stubbs
  2023-06-15  9:58       ` Richard Biener
  2023-06-15  9:58       ` Richard Sandiford
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-15  9:26 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 14/06/2023 15:29, Richard Biener wrote:
> 
> 
>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>
>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>> This implemens fully masked vectorization or a masked epilog for
>>> AVX512 style masks which single themselves out by representing
>>> each lane with a single bit and by using integer modes for the mask
>>> (both is much like GCN).
>>> AVX512 is also special in that it doesn't have any instruction
>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>> Instead the masks are produced by vector compares and the loop
>>> control retains the scalar IV (mainly to avoid dependences on
>>> mask generation, a suitable mask test instruction is available).
>>
>> This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well.
>>
>>> Like RVV code generation prefers a decrementing IV though IVOPTs
>>> messes things up in some cases removing that IV to eliminate
>>> it with an incrementing one used for address generation.
>>> One of the motivating testcases is from PR108410 which in turn
>>> is extracted from x264 where large size vectorization shows
>>> issues with small trip loops.  Execution time there improves
>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>> of less than 32 iterations.
>>> size   scalar     128     256     512    512e    512f
>>>      1    9.42   11.32    9.35   11.17   15.13   16.89
>>>      2    5.72    6.53    6.66    6.66    7.62    8.56
>>>      3    4.49    5.10    5.10    5.74    5.08    5.73
>>>      4    4.10    4.33    4.29    5.21    3.79    4.25
>>>      6    3.78    3.85    3.86    4.76    2.54    2.85
>>>      8    3.64    1.89    3.76    4.50    1.92    2.16
>>>     12    3.56    2.21    3.75    4.26    1.26    1.42
>>>     16    3.36    0.83    1.06    4.16    0.95    1.07
>>>     20    3.39    1.42    1.33    4.07    0.75    0.85
>>>     24    3.23    0.66    1.72    4.22    0.62    0.70
>>>     28    3.18    1.09    2.04    4.20    0.54    0.61
>>>     32    3.16    0.47    0.41    0.41    0.47    0.53
>>>     34    3.16    0.67    0.61    0.56    0.44    0.50
>>>     38    3.19    0.95    0.95    0.82    0.40    0.45
>>>     42    3.09    0.58    1.21    1.13    0.36    0.40
>>> 'size' specifies the number of actual iterations, 512e is for
>>> a masked epilog and 512f for the fully masked loop.  From
>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>>> the winner, the fully masked variant is clearly worse and
>>> it's size benefit is also tiny.
>>
>> Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right?
> 
> Yes.
> 
>> This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path.
>>
>> GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps).
>>
>>> This patch does not enable using fully masked loops or
>>> masked epilogues by default.  More work on cost modeling
>>> and vectorization kind selection on x86_64 is necessary
>>> for this.
>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>> which could be exploited further to unify some of the flags
>>> we have right now but there didn't seem to be many easy things
>>> to merge, so I'm leaving this for followups.
>>> Mask requirements as registered by vect_record_loop_mask are kept in their
>>> original form and recorded in a hash_set now instead of being
>>> processed to a vector of rgroup_controls.  Instead that's now
>>> left to the final analysis phase which tries forming the rgroup_controls
>>> vector using while_ult and if that fails now tries AVX512 style
>>> which needs a different organization and instead fills a hash_map
>>> with the relevant info.  vect_get_loop_mask now has two implementations,
>>> one for the two mask styles we then have.
>>> I have decided against interweaving vect_set_loop_condition_partial_vectors
>>> with conditions to do AVX512 style masking and instead opted to
>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>> I was split between making 'vec_loop_masks' a class with methods,
>>> possibly merging in the _len stuff into a single registry.  It
>>> seemed to be too many changes for the purpose of getting AVX512
>>> working.  I'm going to play wait and see what happens with RISC-V
>>> here since they are going to get both masks and lengths registered
>>> I think.
>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>> looked odd.
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>> and one latent wrong-code (PR110237).
>>> There's followup work to be done to try enabling masked epilogues
>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>> right is going to be challenging.
>>> Any comments?
>>> OK?
>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>> work for it so in case the while_ult path fails (not sure if
>>> it ever does) it could get _avx512 style masking.  Likewise
>>> testing on ARM just to see I didn't break anything here.
>>> I don't have SVE hardware so testing is probably meaningless.
>>
>> I can set some tests going. Is vect.exp enough?
> 
> Well, only you know (from experience), but sure that’s a nice start.

I tested vect.exp for both gcc and gfortran and there were no 
regressions. I have another run going with the other param settings.

(Side note: vect.exp used to be a nice quick test for use during 
development, but the tsvc tests are now really slow, at least when run 
on a single GPU thread.)

I tried some small examples with --param vect-partial-vector-usage=1 
(IIUC this prevents masked loops, but not masked epilogues, right?) and 
the results look good. I plan to do some benchmarking shortly. One 
comment: building a vector constant {0, 1, 2, 3, ..., 63} results in a 
very large entry in the constant pool and an unnecessary memory load (it 
literally has to use this sequence to generate the addresses to load the 
constant!). Generating the sequence via VEC_SERIES would be a no-op for 
GCN, because we have an ABI-mandated register that already holds that 
value. (Perhaps I have another piece missing here, IDK?)

Andrew


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15  9:26     ` Andrew Stubbs
@ 2023-06-15  9:58       ` Richard Biener
  2023-06-15 10:13         ` Andrew Stubbs
  2023-06-15  9:58       ` Richard Sandiford
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Biener @ 2023-06-15  9:58 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin


On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> On 14/06/2023 15:29, Richard Biener wrote:
> > 
> > 
> >> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
> >>
> >> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >>> This implemens fully masked vectorization or a masked epilog for
> >>> AVX512 style masks which single themselves out by representing
> >>> each lane with a single bit and by using integer modes for the mask
> >>> (both is much like GCN).
> >>> AVX512 is also special in that it doesn't have any instruction
> >>> to compute the mask from a scalar IV like SVE has with while_ult.
> >>> Instead the masks are produced by vector compares and the loop
> >>> control retains the scalar IV (mainly to avoid dependences on
> >>> mask generation, a suitable mask test instruction is available).
> >>
> >> This is also sounds like GCN. We currently use WHILE_ULT in the middle end
> >> which expands to a vector compare against a vector of stepped values. This
> >> requires an additional instruction to prepare the comparison vector
> >> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
> >> the DImode bitmask, so it works reasonably well.
> >>
> >>> Like RVV code generation prefers a decrementing IV though IVOPTs
> >>> messes things up in some cases removing that IV to eliminate
> >>> it with an incrementing one used for address generation.
> >>> One of the motivating testcases is from PR108410 which in turn
> >>> is extracted from x264 where large size vectorization shows
> >>> issues with small trip loops.  Execution time there improves
> >>> compared to classic AVX512 with AVX2 epilogues for the cases
> >>> of less than 32 iterations.
> >>> size   scalar     128     256     512    512e    512f
> >>>      1    9.42   11.32    9.35   11.17   15.13   16.89
> >>>      2    5.72    6.53    6.66    6.66    7.62    8.56
> >>>      3    4.49    5.10    5.10    5.74    5.08    5.73
> >>>      4    4.10    4.33    4.29    5.21    3.79    4.25
> >>>      6    3.78    3.85    3.86    4.76    2.54    2.85
> >>>      8    3.64    1.89    3.76    4.50    1.92    2.16
> >>>     12    3.56    2.21    3.75    4.26    1.26    1.42
> >>>     16    3.36    0.83    1.06    4.16    0.95    1.07
> >>>     20    3.39    1.42    1.33    4.07    0.75    0.85
> >>>     24    3.23    0.66    1.72    4.22    0.62    0.70
> >>>     28    3.18    1.09    2.04    4.20    0.54    0.61
> >>>     32    3.16    0.47    0.41    0.41    0.47    0.53
> >>>     34    3.16    0.67    0.61    0.56    0.44    0.50
> >>>     38    3.19    0.95    0.95    0.82    0.40    0.45
> >>>     42    3.09    0.58    1.21    1.13    0.36    0.40
> >>> 'size' specifies the number of actual iterations, 512e is for
> >>> a masked epilog and 512f for the fully masked loop.  From
> >>> 4 scalar iterations on the AVX512 masked epilog code is clearly
> >>> the winner, the fully masked variant is clearly worse and
> >>> it's size benefit is also tiny.
> >>
> >> Let me check I understand correctly. In the fully masked case, there is a
> >> single loop in which a new mask is generated at the start of each
> >> iteration. In the masked epilogue case, the main loop uses no masking
> >> whatsoever, thus avoiding the need for generating a mask, carrying the
> >> mask, inserting vec_merge operations, etc, and then the epilogue looks much
> >> like the fully masked case, but unlike smaller mode epilogues there is no
> >> loop because the eplogue vector size is the same. Is that right?
> > 
> > Yes.
> > 
> >> This scheme seems like it might also benefit GCN, in so much as it
> >> simplifies the hot code path.
> >>
> >> GCN does not actually have smaller vector sizes, so there's no analogue to
> >> AVX2 (we pretend we have some smaller sizes, but that's because the middle
> >> end can't do masking everywhere yet, and it helps make some vector
> >> constants smaller, perhaps).
> >>
> >>> This patch does not enable using fully masked loops or
> >>> masked epilogues by default.  More work on cost modeling
> >>> and vectorization kind selection on x86_64 is necessary
> >>> for this.
> >>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >>> which could be exploited further to unify some of the flags
> >>> we have right now but there didn't seem to be many easy things
> >>> to merge, so I'm leaving this for followups.
> >>> Mask requirements as registered by vect_record_loop_mask are kept in their
> >>> original form and recorded in a hash_set now instead of being
> >>> processed to a vector of rgroup_controls.  Instead that's now
> >>> left to the final analysis phase which tries forming the rgroup_controls
> >>> vector using while_ult and if that fails now tries AVX512 style
> >>> which needs a different organization and instead fills a hash_map
> >>> with the relevant info.  vect_get_loop_mask now has two implementations,
> >>> one for the two mask styles we then have.
> >>> I have decided against interweaving
> >>> vect_set_loop_condition_partial_vectors
> >>> with conditions to do AVX512 style masking and instead opted to
> >>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> >>> I was split between making 'vec_loop_masks' a class with methods,
> >>> possibly merging in the _len stuff into a single registry.  It
> >>> seemed to be too many changes for the purpose of getting AVX512
> >>> working.  I'm going to play wait and see what happens with RISC-V
> >>> here since they are going to get both masks and lengths registered
> >>> I think.
> >>> The vect_prepare_for_masked_peels hunk might run into issues with
> >>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >>> looked odd.
> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> >>> the testsuite with --param vect-partial-vector-usage=2 with and
> >>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> >>> and one latent wrong-code (PR110237).
> >>> There's followup work to be done to try enabling masked epilogues
> >>> for x86-64 by default (when AVX512 is enabled, possibly only when
> >>> -mprefer-vector-width=512).  Getting cost modeling and decision
> >>> right is going to be challenging.
> >>> Any comments?
> >>> OK?
> >>> Btw, testing on GCN would be welcome - the _avx512 paths could
> >>> work for it so in case the while_ult path fails (not sure if
> >>> it ever does) it could get _avx512 style masking.  Likewise
> >>> testing on ARM just to see I didn't break anything here.
> >>> I don't have SVE hardware so testing is probably meaningless.
> >>
> >> I can set some tests going. Is vect.exp enough?
> > 
> > Well, only you know (from experience), but sure that’s a nice start.
> 
> I tested vect.exp for both gcc and gfortran and there were no regressions. I
> have another run going with the other param settings.
> 
> (Side note: vect.exp used to be a nice quick test for use during development,
> but the tsvc tests are now really slow, at least when run on a single GPU
> thread.)
> 
> I tried some small examples with --param vect-partial-vector-usage=1 (IIUC
> this prevents masked loops, but not masked epilogues, right?)

Yes.  That should also work with the while_ult style btw.

> and the results
> look good. I plan to do some benchmarking shortly. One comment: building a
> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the
> constant pool and an unnecessary memory load (it literally has to use this
> sequence to generate the addresses to load the constant!) Generating the
> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
> ABI-mandated register that already holds that value. (Perhaps I have another
> piece missing here, IDK?)

I failed to special-case the {0, 1, 2, 3, ... } constant because I
couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
It might be that the target needs to pattern match these constants
at RTL expansion time?

Btw, did you disable your while_ult pattern for the experiment?

Richard.

> 
> Andrew
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15  9:26     ` Andrew Stubbs
  2023-06-15  9:58       ` Richard Biener
@ 2023-06-15  9:58       ` Richard Sandiford
  1 sibling, 0 replies; 19+ messages in thread
From: Richard Sandiford @ 2023-06-15  9:58 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: Richard Biener, gcc-patches, Jan Hubicka, hongtao.liu, kirill.yukhin

Andrew Stubbs <ams@codesourcery.com> writes:
> One 
> comment: building a vector constant {0, 1, 2, 3, ...., 63} results in a 
> very large entry in the constant pool and an unnecessary memory load (it 
> literally has to use this sequence to generate the addresses to load the 
> constant!) Generating the sequence via VEC_SERIES would be a no-op, for 
> GCN, because we have an ABI-mandated register that already holds that 
> value. (Perhaps I have another piece missing here, IDK?)

A constant like that should become a CONST_VECTOR in RTL, so I think
the way to handle it would be to treat such a CONST_VECTOR as a valid
immediate operand, including providing an alternative for it in the
move patterns.  const_vec_series_p provides a quick way to test.
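
Something along these lines, roughly (just a sketch of the shape; the
predicate name is made up and it assumes GCC's rtl.h):

/* Accept the step-1 series starting at 0, i.e. { 0, 1, 2, ... }, so
   the move patterns can source it from the ABI-provided lane-id
   register rather than the constant pool.  */
static bool
gcn_zero_one_series_p (rtx x)
{
  rtx base, step;
  return (const_vec_series_p (x, &base, &step)
          && base == const0_rtx
          && step == const1_rtx);
}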

Thanks,
Richard


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15  9:58       ` Richard Biener
@ 2023-06-15 10:13         ` Andrew Stubbs
  2023-06-15 11:06           ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-15 10:13 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 15/06/2023 10:58, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> 
>> On 14/06/2023 15:29, Richard Biener wrote:
>>>
>>>
>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>>>
>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>> This implemens fully masked vectorization or a masked epilog for
>>>>> AVX512 style masks which single themselves out by representing
>>>>> each lane with a single bit and by using integer modes for the mask
>>>>> (both is much like GCN).
>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>> Instead the masks are produced by vector compares and the loop
>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>> mask generation, a suitable mask test instruction is available).
>>>>
>>>> This is also sounds like GCN. We currently use WHILE_ULT in the middle end
>>>> which expands to a vector compare against a vector of stepped values. This
>>>> requires an additional instruction to prepare the comparison vector
>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns
>>>> the DImode bitmask, so it works reasonably well.
>>>>
>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
>>>>> messes things up in some cases removing that IV to eliminate
>>>>> it with an incrementing one used for address generation.
>>>>> One of the motivating testcases is from PR108410 which in turn
>>>>> is extracted from x264 where large size vectorization shows
>>>>> issues with small trip loops.  Execution time there improves
>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>>>> of less than 32 iterations.
>>>>> size   scalar     128     256     512    512e    512f
>>>>>       1    9.42   11.32    9.35   11.17   15.13   16.89
>>>>>       2    5.72    6.53    6.66    6.66    7.62    8.56
>>>>>       3    4.49    5.10    5.10    5.74    5.08    5.73
>>>>>       4    4.10    4.33    4.29    5.21    3.79    4.25
>>>>>       6    3.78    3.85    3.86    4.76    2.54    2.85
>>>>>       8    3.64    1.89    3.76    4.50    1.92    2.16
>>>>>      12    3.56    2.21    3.75    4.26    1.26    1.42
>>>>>      16    3.36    0.83    1.06    4.16    0.95    1.07
>>>>>      20    3.39    1.42    1.33    4.07    0.75    0.85
>>>>>      24    3.23    0.66    1.72    4.22    0.62    0.70
>>>>>      28    3.18    1.09    2.04    4.20    0.54    0.61
>>>>>      32    3.16    0.47    0.41    0.41    0.47    0.53
>>>>>      34    3.16    0.67    0.61    0.56    0.44    0.50
>>>>>      38    3.19    0.95    0.95    0.82    0.40    0.45
>>>>>      42    3.09    0.58    1.21    1.13    0.36    0.40
>>>>> 'size' specifies the number of actual iterations, 512e is for
>>>>> a masked epilog and 512f for the fully masked loop.  From
>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>>>>> the winner, the fully masked variant is clearly worse and
>>>>> it's size benefit is also tiny.
>>>>
>>>> Let me check I understand correctly. In the fully masked case, there is a
>>>> single loop in which a new mask is generated at the start of each
>>>> iteration. In the masked epilogue case, the main loop uses no masking
>>>> whatsoever, thus avoiding the need for generating a mask, carrying the
>>>> mask, inserting vec_merge operations, etc, and then the epilogue looks much
>>>> like the fully masked case, but unlike smaller mode epilogues there is no
>>>> loop because the eplogue vector size is the same. Is that right?
>>>
>>> Yes.
>>>
>>>> This scheme seems like it might also benefit GCN, in so much as it
>>>> simplifies the hot code path.
>>>>
>>>> GCN does not actually have smaller vector sizes, so there's no analogue to
>>>> AVX2 (we pretend we have some smaller sizes, but that's because the middle
>>>> end can't do masking everywhere yet, and it helps make some vector
>>>> constants smaller, perhaps).
>>>>
>>>>> This patch does not enable using fully masked loops or
>>>>> masked epilogues by default.  More work on cost modeling
>>>>> and vectorization kind selection on x86_64 is necessary
>>>>> for this.
>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>>>> which could be exploited further to unify some of the flags
>>>>> we have right now but there didn't seem to be many easy things
>>>>> to merge, so I'm leaving this for followups.
>>>>> Mask requirements as registered by vect_record_loop_mask are kept in their
>>>>> original form and recorded in a hash_set now instead of being
>>>>> processed to a vector of rgroup_controls.  Instead that's now
>>>>> left to the final analysis phase which tries forming the rgroup_controls
>>>>> vector using while_ult and if that fails now tries AVX512 style
>>>>> which needs a different organization and instead fills a hash_map
>>>>> with the relevant info.  vect_get_loop_mask now has two implementations,
>>>>> one for the two mask styles we then have.
>>>>> I have decided against interweaving
>>>>> vect_set_loop_condition_partial_vectors
>>>>> with conditions to do AVX512 style masking and instead opted to
>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>>>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>>>> I was split between making 'vec_loop_masks' a class with methods,
>>>>> possibly merging in the _len stuff into a single registry.  It
>>>>> seemed to be too many changes for the purpose of getting AVX512
>>>>> working.  I'm going to play wait and see what happens with RISC-V
>>>>> here since they are going to get both masks and lengths registered
>>>>> I think.
>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>>>> looked odd.
>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>>>> and one latent wrong-code (PR110237).
>>>>> There's followup work to be done to try enabling masked epilogues
>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>>>> right is going to be challenging.
>>>>> Any comments?
>>>>> OK?
>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>> work for it so in case the while_ult path fails (not sure if
>>>>> it ever does) it could get _avx512 style masking.  Likewise
>>>>> testing on ARM just to see I didn't break anything here.
>>>>> I don't have SVE hardware so testing is probably meaningless.
>>>>
>>>> I can set some tests going. Is vect.exp enough?
>>>
> >> Well, only you know (from experience), but sure that’s a nice start.
>>
>> I tested vect.exp for both gcc and gfortran and there were no regressions. I
>> have another run going with the other param settings.
>>
>> (Side note: vect.exp used to be a nice quick test for use during development,
>> but the tsvc tests are now really slow, at least when run on a single GPU
>> thread.)
>>
>> I tried some small examples with --param vect-partial-vector-usage=1 (IIUC
>> this prevents masked loops, but not masked epilogues, right?)
> 
> Yes.  That should also work with the while_ult style btw.
> 
>> and the results
>> look good. I plan to do some benchmarking shortly. One comment: building a
>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the
>> constant pool and an unnecessary memory load (it literally has to use this
>> sequence to generate the addresses to load the constant!) Generating the
>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
>> ABI-mandated register that already holds that value. (Perhaps I have another
>> piece missing here, IDK?)
> 
> I failed to special-case the {0, 1, 2, 3, ... } constant because I
> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> It might be that the target needs to pattern match these constants
> at RTL expansion time?
> 
> Btw, did you disable your while_ult pattern for the experiment?

I tried it both ways; both appear to work, and the while_ult case does 
avoid the constant vector. I also don't seem to need while_ult for the 
fully masked case any more (is that new?).

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 10:13         ` Andrew Stubbs
@ 2023-06-15 11:06           ` Richard Biener
  2023-06-15 13:04             ` Andrew Stubbs
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2023-06-15 11:06 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin


On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> On 15/06/2023 10:58, Richard Biener wrote:
> > 
> > I failed to special-case the {0, 1, 2, 3, ... } constant because I
> > couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> > It might be that the target needs to pattern match these constants
> > at RTL expansion time?
> > 
> > Btw, did you disable your while_ult pattern for the experiment?
> 
> I tried it both ways; both appear to work, and the while_ult case does avoid
> the constant vector. I also don't seem to need while_ult for the fully masked
> case any more (is that new?).

Yes, while_ult always compares against {0, 1, 2, 3, ...}, which seems
conveniently available, but it has to multiply the IV by the number
of scalars per iteration, and the overflow issues that introduces
have to be compensated for by choosing a wider IV.  I'm avoiding
that issue (except for alignment peeling) by instead altering the
constant vector that is compared against.  On x86 the constant
vector is always a load, but the multiplication would add to the
latency of mask production, which already isn't too great.

And yes, the alternate scheme doesn't rely on while_ult but instead
on vec_cmpu to produce the masks.
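
To make that concrete, here is a minimal scalar C sketch of the
vec_cmpu scheme -- purely illustrative (16 lanes and one scalar per
lane assumed), not code from the patch:

  #include <stdint.h>

  /* Lane k of the mask is set iff k is below the number of scalar
     iterations still to be processed.  The left-hand side of the
     compare plays the role of the "constant vector" above; for two
     scalars per lane group it would be { 0, 0, 1, 1, ... } instead
     of { 0, 1, 2, 3, ... }.  */
  static uint16_t
  iteration_mask (uint64_t remaining)
  {
    uint16_t mask = 0;
    for (unsigned lane = 0; lane < 16; lane++)
      if (lane < remaining)            /* the unsigned vector compare */
        mask |= (uint16_t) 1 << lane;
    return mask;
  }

With while_ult the compare instead involves iv * nscalars_per_iter,
which is where the overflow concern above comes from.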

You might be able to produce the {0, 0, 1, 1, ... } constant
by interleaving v1 with itself?  Any non-power-of-two duplication
looks more difficult though.

Richard.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 11:06           ` Richard Biener
@ 2023-06-15 13:04             ` Andrew Stubbs
  2023-06-15 13:34               ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-15 13:04 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 15/06/2023 12:06, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>> I tried it both ways; both appear to work, and the while_ult case does avoid
>> the constant vector. I also don't seem to need while_ult for the fully masked
>> case any more (is that new?).
> 
> Yes, while_ult always compares against {0, 1, 2, 3, ...}, which seems
> conveniently available, but it has to multiply the IV by the number
> of scalars per iteration, and the overflow issues that introduces
> have to be compensated for by choosing a wider IV.  I'm avoiding
> that issue (except for alignment peeling) by instead altering the
> constant vector that is compared against.  On x86 the constant
> vector is always a load, but the multiplication would add to the
> latency of mask production, which already isn't too great.

Is the multiplication not usually a shift?

> And yes, the alternate scheme doesn't rely on while_ult but instead
> on vec_cmpu to produce the masks.
> 
> You might be able to produce the {0, 0, 1, 1, ... } constant
> by interleaving v1 with itself?  Any non-power-of-two duplication
> looks more difficult though.

I think that would need to use a full permutation, which is probably 
faster than a cold load, but in all these cases the vector that defines 
the permutation looks exactly like the result, so ......

I've been playing with this stuff some more and I find that even though 
GCN supports fully masked loops and uses them when I test without 
offload, it's actually been running in 
param_vect_partial_vector_usage==0 mode for offload because i386.cc has 
that hardcoded and the offload compiler inherits param settings from the 
host.

I tried running the Babelstream benchmark with the various settings and 
it's a wash for most of the measurements (memory limited, most likely), 
but the "Dot" benchmark is considerably slower when fully masked (about 
50%). This probably explains why adding the additional "fake" smaller 
vector sizes was so good for our numbers, but confirms that the partial 
epilogue is a good option.

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 13:04             ` Andrew Stubbs
@ 2023-06-15 13:34               ` Richard Biener
  2023-06-15 13:52                 ` Andrew Stubbs
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2023-06-15 13:34 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin


On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> I've been playing with this stuff some more and I find that even though GCN
> supports fully masked loops and uses them when I test without offload, it's
> actually been running in param_vect_partial_vector_usage==0 mode for offload
> because i386.cc has that hardcoded and the offload compiler inherits param
> settings from the host.

Doesn't that mean it will have a scalar epilog and a very large VF for the
main loop due to the large vector size?

> I tried running the Babelstream benchmark with the various settings and it's a
> wash for most of the measurements (memory limited, most likely), but the "Dot"
> benchmark is considerably slower when fully masked (about 50%). This probably
> explains why adding the additional "fake" smaller vector sizes was so good for
> our numbers, but confirms that the partial epilogue is a good option.

Ah, "fake" smaller vector sizes probably then made up for this with
"fixed" size epilogue vectorization?  But yes, I think a vectorized
epilog with partial vectors that then does not iterate would get you
the best of both worlds.

So param_vect_partial_vector_usage == 1.

Whether with a while_ult optab or the vec_cmpu scheme should then
depend on generated code quality.
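
For reference, the settings discussed here look roughly like this on
the command line (a sketch only; test.c is a placeholder and
-march=x86-64-v4 merely an example of an AVX512-enabled target):

  gcc -O3 -march=x86-64-v4 --param vect-partial-vector-usage=0 test.c  # no partial vectors
  gcc -O3 -march=x86-64-v4 --param vect-partial-vector-usage=1 test.c  # only where it avoids iterating (masked epilogue)
  gcc -O3 -march=x86-64-v4 --param vect-partial-vector-usage=2 test.c  # also for the main loop (fully masked)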

Richard.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 13:34               ` Richard Biener
@ 2023-06-15 13:52                 ` Andrew Stubbs
  2023-06-15 14:00                   ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-15 13:52 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 15/06/2023 14:34, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> 
>> I've been playing with this stuff some more and I find that even though GCN
>> supports fully masked loops and uses them when I test without offload, it's
>> actually been running in param_vect_partial_vector_usage==0 mode for offload
>> because i386.cc has that hardcoded and the offload compiler inherits param
>> settings from the host.
> 
> Doesn't that mean it will have a scalar epilog and a very large VF for the
> main loop due to the large vector size?
> 
>> I tried running the Babelstream benchmark with the various settings and it's a
>> wash for most of the measurements (memory limited, most likely), but the "Dot"
>> benchmark is considerably slower when fully masked (about 50%). This probably
>> explains why adding the additional "fake" smaller vector sizes was so good for
>> our numbers, but confirms that the partial epilogue is a good option.
> 
> Ah, "fake" smaller vector sizes probably then made up for this with
> "fixed" size epilogue vectorization?  But yes, I think a vectorized
> epilog with partial vectors that then does not iterate would get you
> the best of both worlds.

Yes, it uses V32 for the epilogue, which won't fit every case but is
better than nothing.

> So param_vect_partial_vector_usage == 1.

Unfortunately, there doesn't seem to be a way to set this *only* for 
the offload compiler. If you could fix it for x86_64 soon then that 
would be awesome. :)

> Whether with a while_ult optab or the vec_cmpu scheme should then
> depend on generated code quality.

Which looks like it depends on these constants alone.

Thanks

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 13:52                 ` Andrew Stubbs
@ 2023-06-15 14:00                   ` Richard Biener
  2023-06-15 14:04                     ` Andrew Stubbs
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2023-06-15 14:00 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin


On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> On 15/06/2023 14:34, Richard Biener wrote:
> > On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> > 
> >> On 15/06/2023 12:06, Richard Biener wrote:
> >>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >>>
> >>>> On 15/06/2023 10:58, Richard Biener wrote:
> >>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >>>>>
> >>>>>> On 14/06/2023 15:29, Richard Biener wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
> >>>>>>>>
> >>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >>>>>>>>> This implemens fully masked vectorization or a masked epilog for
> >>>>>>>>> AVX512 style masks which single themselves out by representing
> >>>>>>>>> each lane with a single bit and by using integer modes for the mask
> >>>>>>>>> (both is much like GCN).
> >>>>>>>>> AVX512 is also special in that it doesn't have any instruction
> >>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
> >>>>>>>>> Instead the masks are produced by vector compares and the loop
> >>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
> >>>>>>>>> mask generation, a suitable mask test instruction is available).
> >>>>>>>>
> >>>>>>>> This is also sounds like GCN. We currently use WHILE_ULT in the
> >>>>>>>> middle
> >>>>>>>> end
> >>>>>>>> which expands to a vector compare against a vector of stepped values.
> >>>>>>>> This
> >>>>>>>> requires an additional instruction to prepare the comparison vector
> >>>>>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
> >>>>>>>> returns
> >>>>>>>> the DImode bitmask, so it works reasonably well.
> >>>>>>>>
> >>>>>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
> >>>>>>>>> messes things up in some cases removing that IV to eliminate
> >>>>>>>>> it with an incrementing one used for address generation.
> >>>>>>>>> One of the motivating testcases is from PR108410 which in turn
> >>>>>>>>> is extracted from x264 where large size vectorization shows
> >>>>>>>>> issues with small trip loops.  Execution time there improves
> >>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
> >>>>>>>>> of less than 32 iterations.
> >>>>>>>>> size   scalar     128     256     512    512e    512f
> >>>>>>>>>         1    9.42   11.32    9.35   11.17   15.13   16.89
> >>>>>>>>>         2    5.72    6.53    6.66    6.66    7.62    8.56
> >>>>>>>>>         3    4.49    5.10    5.10    5.74    5.08    5.73
> >>>>>>>>>         4    4.10    4.33    4.29    5.21    3.79    4.25
> >>>>>>>>>         6    3.78    3.85    3.86    4.76    2.54    2.85
> >>>>>>>>>         8    3.64    1.89    3.76    4.50    1.92    2.16
> >>>>>>>>>        12    3.56    2.21    3.75    4.26    1.26    1.42
> >>>>>>>>>        16    3.36    0.83    1.06    4.16    0.95    1.07
> >>>>>>>>>        20    3.39    1.42    1.33    4.07    0.75    0.85
> >>>>>>>>>        24    3.23    0.66    1.72    4.22    0.62    0.70
> >>>>>>>>>        28    3.18    1.09    2.04    4.20    0.54    0.61
> >>>>>>>>>        32    3.16    0.47    0.41    0.41    0.47    0.53
> >>>>>>>>>        34    3.16    0.67    0.61    0.56    0.44    0.50
> >>>>>>>>>        38    3.19    0.95    0.95    0.82    0.40    0.45
> >>>>>>>>>        42    3.09    0.58    1.21    1.13    0.36    0.40
> >>>>>>>>> 'size' specifies the number of actual iterations, 512e is for
> >>>>>>>>> a masked epilog and 512f for the fully masked loop.  From
> >>>>>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
> >>>>>>>>> the winner, the fully masked variant is clearly worse and
> >>>>>>>>> it's size benefit is also tiny.
> >>>>>>>>
> >>>>>>>> Let me check I understand correctly. In the fully masked case, there
> >>>>>>>> is
> >>>>>>>> a
> >>>>>>>> single loop in which a new mask is generated at the start of each
> >>>>>>>> iteration. In the masked epilogue case, the main loop uses no masking
> >>>>>>>> whatsoever, thus avoiding the need for generating a mask, carrying
> >>>>>>>> the
> >>>>>>>> mask, inserting vec_merge operations, etc, and then the epilogue
> >>>>>>>> looks
> >>>>>>>> much
> >>>>>>>> like the fully masked case, but unlike smaller mode epilogues there
> >>>>>>>> is
> >>>>>>>> no
> >>>>>>>> loop because the eplogue vector size is the same. Is that right?
> >>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>>> This scheme seems like it might also benefit GCN, in so much as it
> >>>>>>>> simplifies the hot code path.
> >>>>>>>>
> >>>>>>>> GCN does not actually have smaller vector sizes, so there's no
> >>>>>>>> analogue
> >>>>>>>> to
> >>>>>>>> AVX2 (we pretend we have some smaller sizes, but that's because the
> >>>>>>>> middle
> >>>>>>>> end can't do masking everywhere yet, and it helps make some vector
> >>>>>>>> constants smaller, perhaps).
> >>>>>>>>
> >>>>>>>>> This patch does not enable using fully masked loops or
> >>>>>>>>> masked epilogues by default.  More work on cost modeling
> >>>>>>>>> and vectorization kind selection on x86_64 is necessary
> >>>>>>>>> for this.
> >>>>>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >>>>>>>>> which could be exploited further to unify some of the flags
> >>>>>>>>> we have right now but there didn't seem to be many easy things
> >>>>>>>>> to merge, so I'm leaving this for followups.
> >>>>>>>>> Mask requirements as registered by vect_record_loop_mask are kept in
> >>>>>>>>> their
> >>>>>>>>> original form and recorded in a hash_set now instead of being
> >>>>>>>>> processed to a vector of rgroup_controls.  Instead that's now
> >>>>>>>>> left to the final analysis phase which tries forming the
> >>>>>>>>> rgroup_controls
> >>>>>>>>> vector using while_ult and if that fails now tries AVX512 style
> >>>>>>>>> which needs a different organization and instead fills a hash_map
> >>>>>>>>> with the relevant info.  vect_get_loop_mask now has two
> >>>>>>>>> implementations,
> >>>>>>>>> one for the two mask styles we then have.
> >>>>>>>>> I have decided against interweaving
> >>>>>>>>> vect_set_loop_condition_partial_vectors
> >>>>>>>>> with conditions to do AVX512 style masking and instead opted to
> >>>>>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >>>>>>>>> Likewise for vect_verify_full_masking vs
> >>>>>>>>> vect_verify_full_masking_avx512.
> >>>>>>>>> I was split between making 'vec_loop_masks' a class with methods,
> >>>>>>>>> possibly merging in the _len stuff into a single registry.  It
> >>>>>>>>> seemed to be too many changes for the purpose of getting AVX512
> >>>>>>>>> working.  I'm going to play wait and see what happens with RISC-V
> >>>>>>>>> here since they are going to get both masks and lengths registered
> >>>>>>>>> I think.
> >>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
> >>>>>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >>>>>>>>> looked odd.
> >>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> >>>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
> >>>>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> >>>>>>>>> and one latent wrong-code (PR110237).
> >>>>>>>>> There's followup work to be done to try enabling masked epilogues
> >>>>>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
> >>>>>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
> >>>>>>>>> right is going to be challenging.
> >>>>>>>>> Any comments?
> >>>>>>>>> OK?
> >>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
> >>>>>>>>> work for it so in case the while_ult path fails (not sure if
> >>>>>>>>> it ever does) it could get _avx512 style masking.  Likewise
> >>>>>>>>> testing on ARM just to see I didn't break anything here.
> >>>>>>>>> I don't have SVE hardware so testing is probably meaningless.
> >>>>>>>>
> >>>>>>>> I can set some tests going. Is vect.exp enough?
> >>>>>>>
> >>>>>>> Well, only you know (from experience), but sure that?s a nice start.
> >>>>>>
> >>>>>> I tested vect.exp for both gcc and gfortran and there were no
> >>>>>> regressions.
> >>>>>> I
> >>>>>> have another run going with the other param settings.
> >>>>>>
> >>>>>> (Side note: vect.exp used to be a nice quick test for use during
> >>>>>> development,
> >>>>>> but the tsvc tests are now really slow, at least when run on a single
> >>>>>> GPU
> >>>>>> thread.)
> >>>>>>
> >>>>>> I tried some small examples with --param vect-partial-vector-usage=1
> >>>>>> (IIUC
> >>>>>> this prevents masked loops, but not masked epilogues, right?)
> >>>>>
> >>>>> Yes.  That should also work with the while_ult style btw.
> >>>>>
> >>>>>> and the results
> >>>>>> look good. I plan to do some benchmarking shortly. One comment:
> >>>>>> building
> >>>>>> a
> >>>>>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in
> >>>>>> the
> >>>>>> constant pool and an unnecessary memory load (it literally has to use
> >>>>>> this
> >>>>>> sequence to generate the addresses to load the constant!) Generating
> >>>>>> the
> >>>>>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
> >>>>>> ABI-mandated register that already holds that value. (Perhaps I have
> >>>>>> another
> >>>>>> piece missing here, IDK?)
> >>>>>
> >>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
> >>>>> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> >>>>> It might be that the target needs to pattern match these constants
> >>>>> at RTL expansion time?
> >>>>>
> >>>>> Btw, did you disable your while_ult pattern for the experiment?
> >>>>
> >>>> I tried it both ways; both appear to work, and the while_ult case does
> >>>> avoid
> >>>> the constant vector. I also don't seem to need while_ult for the fully
> >>>> masked
> >>>> case any more (is that new?).
> >>>
> >>> Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
> >>> conveniently available but it has to multiply the IV with the
> >>> number of scalars per iter which has overflow issues it has
> >>> to compensate for by choosing a wider IV.  I'm avoiding that
> >>> issue (besides for alignment peeling) by instead altering
> >>> the constant vector to compare against.  On x86 the constant
> >>> vector is always a load but the multiplication would add to
> >>> the latency of mask production which already isn't too great.
> >>
> >> Is the multiplication not usually a shift?
> >>
> >>> And yes, the alternate scheme doesn't rely on while_ult but instead
> >>> on vec_cmpu to produce the masks.
> >>>
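For illustration, the vec_cmpu-style mask production boils down to something like the following hand-written AVX-512 sketch (the helper name and the 16-lane, 32-bit-element shape are assumptions, not anything taken from the patch):

  #include <immintrin.h>

  /* Compute the loop mask for the remaining scalar iterations:
     lane i is active  <=>  i < MIN (remaining, VF).  */
  static inline __mmask16
  lanes_below (unsigned remaining)
  {
    const __m512i iota = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    unsigned rem = remaining < 16 ? remaining : 16;   /* rem = MIN <iv, VF>  */
    __m512i vrem = _mm512_set1_epi32 ((int) rem);
    /* mask = { 0, 1, 2, ... } < { rem, rem, rem, ... }  */
    return _mm512_cmplt_epu32_mask (iota, vrem);
  }
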
> >>> You might be able to produce the {0, 0, 1, 1, ... } constant
> >>> by interleaving v1 with itself?  Any non-power-of-two duplication
> >>> looks more difficult though.
> >>
> >> I think that would need to use a full permutation, which is probably faster
> >> than a cold load, but in all these cases the vector that defines the
> >> permutation looks exactly like the result, so ......
> >>
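As an aside (not something either message proposes): when the repetition factor is a power of two, the duplicated series can also be derived from the plain series by an element-wise shift rather than a load or permute, e.g. { 0, 1, 2, 3, ... } >> 1 == { 0, 0, 1, 1, ... }.  A sketch in x86 intrinsics, purely to show the arithmetic:

  #include <immintrin.h>

  /* Derive { 0, 0, 1, 1, 2, 2, ... } from { 0, 1, 2, 3, ... } without a
     constant-pool load; this only works for power-of-two repetition
     factors (shift by log2 of the factor).  */
  static inline __m512i
  dup2_series (__m512i iota)      /* iota = { 0, 1, 2, 3, ... }  */
  {
    return _mm512_srli_epi32 (iota, 1);
  }
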
> >> I've been playing with this stuff some more and I find that even though GCN
> >> supports fully masked loops and uses them when I test without offload, it's
> >> actually been running in param_vect_partial_vector_usage==0 mode for
> >> offload
> >> because i386.cc has that hardcoded and the offload compiler inherits param
> >> settings from the host.
> > 
> > Doesn't that mean it will have a scalar epilog and a very large VF for the
> > main loop due to the large vector size?
> > 
> >> I tried running the Babelstream benchmark with the various settings and
> >> it's a
> >> wash for most of the measurements (memory limited, most likely), but the
> >> "Dot"
> >> benchmark is considerably slower when fully masked (about 50%). This
> >> probably
> >> explains why adding the additional "fake" smaller vector sizes was so good
> >> for
> >> our numbers, but confirms that the partial epilogue is a good option.
> > 
> > Ah, "fake" smaller vector sizes probably then made up for this with
> > "fixed" size epilogue vectorization?  But yes, I think a vectorized
> > epilog with partial vectors that then does not iterate would get you
> > the best of both worlds.
> 
> Yes, it uses V32 for the epilogue, which won't fit every case but it's better
> than nothing.
> 
> > So param_vect_partial_vector_usage == 1.
> 
> Unfortunately, there doesn't seem to be a way to set this *only* for the
> offload compiler. If you could fix it for x86_64 soon then that would be
> awesome. :)

So some opts tweaking in the GCN option_override hook doesn't work?

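For reference, the kind of override-hook tweak meant here might look like the fragment below (a sketch only: it assumes GCC's internal option machinery from opts.h, and the function name and its placement in the GCN backend are illustrative):

  /* Hypothetical fragment for the GCN target's option-override hook.  */
  static void
  gcn_override_partial_vector_param (void)
  {
    /* Ask for unmasked main loops with a masked epilogue; this only takes
       effect when the param was not explicitly set elsewhere.  */
    SET_OPTION_IF_UNSET (&global_options, &global_options_set,
                         param_vect_partial_vector_usage, 1);
  }
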
> > Whether with a while_ult optab or the vec_cmpu scheme should then
> > depend on generated code quality.
> 
> Which looks like it depends on these constants alone.

Understood, compared to while_ult these now appear explicitly.  As
Richard said, you likely have to tweak the backend to make use of 'r1'.

Richard.


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 14:00                   ` Richard Biener
@ 2023-06-15 14:04                     ` Andrew Stubbs
  2023-06-15 16:16                       ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Stubbs @ 2023-06-15 14:04 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin

On 15/06/2023 15:00, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> 
>> On 15/06/2023 14:34, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>
>>>> On 15/06/2023 12:06, Richard Biener wrote:
>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>
>>>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>>
>>>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>>>>>>>>>
>>>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>>>> This implemens fully masked vectorization or a masked epilog for
>>>>>>>>>>> AVX512 style masks which single themselves out by representing
>>>>>>>>>>> each lane with a single bit and by using integer modes for the mask
>>>>>>>>>>> (both is much like GCN).
>>>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>>>>>>>> Instead the masks are produced by vector compares and the loop
>>>>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>>>>>>>> mask generation, a suitable mask test instruction is available).
>>>>>>>>>>
>>>>>>>>>> This is also sounds like GCN. We currently use WHILE_ULT in the
>>>>>>>>>> middle
>>>>>>>>>> end
>>>>>>>>>> which expands to a vector compare against a vector of stepped values.
>>>>>>>>>> This
>>>>>>>>>> requires an additional instruction to prepare the comparison vector
>>>>>>>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
>>>>>>>>>> returns
>>>>>>>>>> the DImode bitmask, so it works reasonably well.
>>>>>>>>>>
>>>>>>>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
>>>>>>>>>>> messes things up in some cases removing that IV to eliminate
>>>>>>>>>>> it with an incrementing one used for address generation.
>>>>>>>>>>> One of the motivating testcases is from PR108410 which in turn
>>>>>>>>>>> is extracted from x264 where large size vectorization shows
>>>>>>>>>>> issues with small trip loops.  Execution time there improves
>>>>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>>>>>>>>>> of less than 32 iterations.
>>>>>>>>>>> size   scalar     128     256     512    512e    512f
>>>>>>>>>>>          1    9.42   11.32    9.35   11.17   15.13   16.89
>>>>>>>>>>>          2    5.72    6.53    6.66    6.66    7.62    8.56
>>>>>>>>>>>          3    4.49    5.10    5.10    5.74    5.08    5.73
>>>>>>>>>>>          4    4.10    4.33    4.29    5.21    3.79    4.25
>>>>>>>>>>>          6    3.78    3.85    3.86    4.76    2.54    2.85
>>>>>>>>>>>          8    3.64    1.89    3.76    4.50    1.92    2.16
>>>>>>>>>>>         12    3.56    2.21    3.75    4.26    1.26    1.42
>>>>>>>>>>>         16    3.36    0.83    1.06    4.16    0.95    1.07
>>>>>>>>>>>         20    3.39    1.42    1.33    4.07    0.75    0.85
>>>>>>>>>>>         24    3.23    0.66    1.72    4.22    0.62    0.70
>>>>>>>>>>>         28    3.18    1.09    2.04    4.20    0.54    0.61
>>>>>>>>>>>         32    3.16    0.47    0.41    0.41    0.47    0.53
>>>>>>>>>>>         34    3.16    0.67    0.61    0.56    0.44    0.50
>>>>>>>>>>>         38    3.19    0.95    0.95    0.82    0.40    0.45
>>>>>>>>>>>         42    3.09    0.58    1.21    1.13    0.36    0.40
>>>>>>>>>>> 'size' specifies the number of actual iterations, 512e is for
>>>>>>>>>>> a masked epilog and 512f for the fully masked loop.  From
>>>>>>>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>>>>>>>>>>> the winner, the fully masked variant is clearly worse and
>>>>>>>>>>> it's size benefit is also tiny.
>>>>>>>>>>
>>>>>>>>>> Let me check I understand correctly. In the fully masked case, there
>>>>>>>>>> is
>>>>>>>>>> a
>>>>>>>>>> single loop in which a new mask is generated at the start of each
>>>>>>>>>> iteration. In the masked epilogue case, the main loop uses no masking
>>>>>>>>>> whatsoever, thus avoiding the need for generating a mask, carrying
>>>>>>>>>> the
>>>>>>>>>> mask, inserting vec_merge operations, etc, and then the epilogue
>>>>>>>>>> looks
>>>>>>>>>> much
>>>>>>>>>> like the fully masked case, but unlike smaller mode epilogues there
>>>>>>>>>> is
>>>>>>>>>> no
>>>>>>>>>> loop because the eplogue vector size is the same. Is that right?
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>> This scheme seems like it might also benefit GCN, in so much as it
>>>>>>>>>> simplifies the hot code path.
>>>>>>>>>>
>>>>>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>>>>>> analogue
>>>>>>>>>> to
>>>>>>>>>> AVX2 (we pretend we have some smaller sizes, but that's because the
>>>>>>>>>> middle
>>>>>>>>>> end can't do masking everywhere yet, and it helps make some vector
>>>>>>>>>> constants smaller, perhaps).
>>>>>>>>>>
>>>>>>>>>>> This patch does not enable using fully masked loops or
>>>>>>>>>>> masked epilogues by default.  More work on cost modeling
>>>>>>>>>>> and vectorization kind selection on x86_64 is necessary
>>>>>>>>>>> for this.
>>>>>>>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>>>>>>>>>> which could be exploited further to unify some of the flags
>>>>>>>>>>> we have right now but there didn't seem to be many easy things
>>>>>>>>>>> to merge, so I'm leaving this for followups.
>>>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are kept in
>>>>>>>>>>> their
>>>>>>>>>>> original form and recorded in a hash_set now instead of being
>>>>>>>>>>> processed to a vector of rgroup_controls.  Instead that's now
>>>>>>>>>>> left to the final analysis phase which tries forming the
>>>>>>>>>>> rgroup_controls
>>>>>>>>>>> vector using while_ult and if that fails now tries AVX512 style
>>>>>>>>>>> which needs a different organization and instead fills a hash_map
>>>>>>>>>>> with the relevant info.  vect_get_loop_mask now has two
>>>>>>>>>>> implementations,
>>>>>>>>>>> one for the two mask styles we then have.
>>>>>>>>>>> I have decided against interweaving
>>>>>>>>>>> vect_set_loop_condition_partial_vectors
>>>>>>>>>>> with conditions to do AVX512 style masking and instead opted to
>>>>>>>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>>>>>>>>>> Likewise for vect_verify_full_masking vs
>>>>>>>>>>> vect_verify_full_masking_avx512.
>>>>>>>>>>> I was split between making 'vec_loop_masks' a class with methods,
>>>>>>>>>>> possibly merging in the _len stuff into a single registry.  It
>>>>>>>>>>> seemed to be too many changes for the purpose of getting AVX512
>>>>>>>>>>> working.  I'm going to play wait and see what happens with RISC-V
>>>>>>>>>>> here since they are going to get both masks and lengths registered
>>>>>>>>>>> I think.
>>>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>>>>>>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>>>>>>>>>> looked odd.
>>>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>>>>>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>>>>>>>>>> and one latent wrong-code (PR110237).
>>>>>>>>>>> There's followup work to be done to try enabling masked epilogues
>>>>>>>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>>>>>>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>>>>>>>>>> right is going to be challenging.
>>>>>>>>>>> Any comments?
>>>>>>>>>>> OK?
>>>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>>>>>>>> work for it so in case the while_ult path fails (not sure if
>>>>>>>>>>> it ever does) it could get _avx512 style masking.  Likewise
>>>>>>>>>>> testing on ARM just to see I didn't break anything here.
>>>>>>>>>>> I don't have SVE hardware so testing is probably meaningless.
>>>>>>>>>>
>>>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>>>
>>>>>>>>> Well, only you know (from experience), but sure that's a nice start.
>>>>>>>>
>>>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>>>> regressions.
>>>>>>>> I
>>>>>>>> have another run going with the other param settings.
>>>>>>>>
>>>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>>>> development,
>>>>>>>> but the tsvc tests are now really slow, at least when run on a single
>>>>>>>> GPU
>>>>>>>> thread.)
>>>>>>>>
>>>>>>>> I tried some small examples with --param vect-partial-vector-usage=1
>>>>>>>> (IIUC
>>>>>>>> this prevents masked loops, but not masked epilogues, right?)
>>>>>>>
>>>>>>> Yes.  That should also work with the while_ult style btw.
>>>>>>>
>>>>>>>> and the results
>>>>>>>> look good. I plan to do some benchmarking shortly. One comment:
>>>>>>>> building
>>>>>>>> a
>>>>>>>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in
>>>>>>>> the
>>>>>>>> constant pool and an unnecessary memory load (it literally has to use
>>>>>>>> this
>>>>>>>> sequence to generate the addresses to load the constant!) Generating
>>>>>>>> the
>>>>>>>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
>>>>>>>> ABI-mandated register that already holds that value. (Perhaps I have
>>>>>>>> another
>>>>>>>> piece missing here, IDK?)
>>>>>>>
>>>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
>>>>>>> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
>>>>>>> It might be that the target needs to pattern match these constants
>>>>>>> at RTL expansion time?
>>>>>>>
>>>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>>>
>>>>>> I tried it both ways; both appear to work, and the while_ult case does
>>>>>> avoid
>>>>>> the constant vector. I also don't seem to need while_ult for the fully
>>>>>> masked
>>>>>> case any more (is that new?).
>>>>>
>>>>> Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
>>>>> conveniently available but it has to multiply the IV with the
>>>>> number of scalars per iter which has overflow issues it has
>>>>> to compensate for by choosing a wider IV.  I'm avoiding that
>>>>> issue (besides for alignment peeling) by instead altering
>>>>> the constant vector to compare against.  On x86 the constant
>>>>> vector is always a load but the multiplication would add to
>>>>> the latency of mask production which already isn't too great.
>>>>
>>>> Is the multiplication not usually a shift?
>>>>
>>>>> And yes, the alternate scheme doesn't rely on while_ult but instead
>>>>> on vec_cmpu to produce the masks.
>>>>>
>>>>> You might be able to produce the {0, 0, 1, 1, ... } constant
>>>>> by interleaving v1 with itself?  Any non-power-of-two duplication
>>>>> looks more difficult though.
>>>>
>>>> I think that would need to use a full permutation, which is probably faster
>>>> than a cold load, but in all these cases the vector that defines the
>>>> permutation looks exactly like the result, so ......
>>>>
>>>> I've been playing with this stuff some more and I find that even though GCN
>>>> supports fully masked loops and uses them when I test without offload, it's
>>>> actually been running in param_vect_partial_vector_usage==0 mode for
>>>> offload
>>>> because i386.cc has that hardcoded and the offload compiler inherits param
>>>> settings from the host.
>>>
>>> Doesn't that mean it will have a scalar epilog and a very large VF for the
>>> main loop due to the large vector size?
>>>
>>>> I tried running the Babelstream benchmark with the various settings and
>>>> it's a
>>>> wash for most of the measurements (memory limited, most likely), but the
>>>> "Dot"
>>>> benchmark is considerably slower when fully masked (about 50%). This
>>>> probably
>>>> explains why adding the additional "fake" smaller vector sizes was so good
>>>> for
>>>> our numbers, but confirms that the partial epilogue is a good option.
>>>
>>> Ah, "fake" smaller vector sizes probably then made up for this with
>>> "fixed" size epilogue vectorization?  But yes, I think a vectorized
>>> epilog with partial vectors that then does not iterate would get you
>>> the best of both worlds.
>>
>> Yes, it uses V32 for the epilogue, which won't fit every case but it's better
>> than nothing.
>>
>>> So param_vect_partial_vector_usage == 1.
>>
>> Unfortunately, there doesn't seem to be a way to set this *only* for the
>> offload compiler. If you could fix it for x86_64 soon then that would be
>> awesome. :)
> 
> So some opts tweaking in the GCN option_override hook doesn't work?

I didn't try that, but 
-foffload-options=--param=vect-partial-vector-usage=1 had no effect.

>>> Whether with a while_ult optab or the vec_cmpu scheme should then
>>> depend on generated code quality.
>>
>> Which looks like it depends on these constants alone.
> 
> Understood, compared to while_ult these now appear explicitly.  As
> Richard said, you likely have to tweak the backend to make use of 'r1'.

Yes, I think I understood what he meant. Should be doable.

Andrew


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 14:04                     ` Andrew Stubbs
@ 2023-06-15 16:16                       ` Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-06-15 16:16 UTC (permalink / raw)
  To: Andrew Stubbs
  Cc: gcc-patches, richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin



> Am 15.06.2023 um 16:04 schrieb Andrew Stubbs <ams@codesourcery.com>:
> 
> On 15/06/2023 15:00, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>> On 15/06/2023 14:34, Richard Biener wrote:
>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>> 
>>>>> On 15/06/2023 12:06, Richard Biener wrote:
>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>> 
>>>>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>>> 
>>>>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>>>>>>>>>> 
>>>>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>>>>> This implemens fully masked vectorization or a masked epilog for
>>>>>>>>>>>> AVX512 style masks which single themselves out by representing
>>>>>>>>>>>> each lane with a single bit and by using integer modes for the mask
>>>>>>>>>>>> (both is much like GCN).
>>>>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>>>>>>>>> Instead the masks are produced by vector compares and the loop
>>>>>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>>>>>>>>> mask generation, a suitable mask test instruction is available).
>>>>>>>>>>> 
>>>>>>>>>>> This is also sounds like GCN. We currently use WHILE_ULT in the
>>>>>>>>>>> middle
>>>>>>>>>>> end
>>>>>>>>>>> which expands to a vector compare against a vector of stepped values.
>>>>>>>>>>> This
>>>>>>>>>>> requires an additional instruction to prepare the comparison vector
>>>>>>>>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
>>>>>>>>>>> returns
>>>>>>>>>>> the DImode bitmask, so it works reasonably well.
>>>>>>>>>>> 
>>>>>>>>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
>>>>>>>>>>>> messes things up in some cases removing that IV to eliminate
>>>>>>>>>>>> it with an incrementing one used for address generation.
>>>>>>>>>>>> One of the motivating testcases is from PR108410 which in turn
>>>>>>>>>>>> is extracted from x264 where large size vectorization shows
>>>>>>>>>>>> issues with small trip loops.  Execution time there improves
>>>>>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>>>>>>>>>>> of less than 32 iterations.
>>>>>>>>>>>> size   scalar     128     256     512    512e    512f
>>>>>>>>>>>>         1    9.42   11.32    9.35   11.17   15.13   16.89
>>>>>>>>>>>>         2    5.72    6.53    6.66    6.66    7.62    8.56
>>>>>>>>>>>>         3    4.49    5.10    5.10    5.74    5.08    5.73
>>>>>>>>>>>>         4    4.10    4.33    4.29    5.21    3.79    4.25
>>>>>>>>>>>>         6    3.78    3.85    3.86    4.76    2.54    2.85
>>>>>>>>>>>>         8    3.64    1.89    3.76    4.50    1.92    2.16
>>>>>>>>>>>>        12    3.56    2.21    3.75    4.26    1.26    1.42
>>>>>>>>>>>>        16    3.36    0.83    1.06    4.16    0.95    1.07
>>>>>>>>>>>>        20    3.39    1.42    1.33    4.07    0.75    0.85
>>>>>>>>>>>>        24    3.23    0.66    1.72    4.22    0.62    0.70
>>>>>>>>>>>>        28    3.18    1.09    2.04    4.20    0.54    0.61
>>>>>>>>>>>>        32    3.16    0.47    0.41    0.41    0.47    0.53
>>>>>>>>>>>>        34    3.16    0.67    0.61    0.56    0.44    0.50
>>>>>>>>>>>>        38    3.19    0.95    0.95    0.82    0.40    0.45
>>>>>>>>>>>>        42    3.09    0.58    1.21    1.13    0.36    0.40
>>>>>>>>>>>> 'size' specifies the number of actual iterations, 512e is for
>>>>>>>>>>>> a masked epilog and 512f for the fully masked loop.  From
>>>>>>>>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>>>>>>>>>>>> the winner, the fully masked variant is clearly worse and
>>>>>>>>>>>> it's size benefit is also tiny.
>>>>>>>>>>> 
>>>>>>>>>>> Let me check I understand correctly. In the fully masked case, there
>>>>>>>>>>> is
>>>>>>>>>>> a
>>>>>>>>>>> single loop in which a new mask is generated at the start of each
>>>>>>>>>>> iteration. In the masked epilogue case, the main loop uses no masking
>>>>>>>>>>> whatsoever, thus avoiding the need for generating a mask, carrying
>>>>>>>>>>> the
>>>>>>>>>>> mask, inserting vec_merge operations, etc, and then the epilogue
>>>>>>>>>>> looks
>>>>>>>>>>> much
>>>>>>>>>>> like the fully masked case, but unlike smaller mode epilogues there
>>>>>>>>>>> is
>>>>>>>>>>> no
>>>>>>>>>>> loop because the eplogue vector size is the same. Is that right?
>>>>>>>>>> 
>>>>>>>>>> Yes.
>>>>>>>>>> 
>>>>>>>>>>> This scheme seems like it might also benefit GCN, in so much as it
>>>>>>>>>>> simplifies the hot code path.
>>>>>>>>>>> 
>>>>>>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>>>>>>> analogue
>>>>>>>>>>> to
>>>>>>>>>>> AVX2 (we pretend we have some smaller sizes, but that's because the
>>>>>>>>>>> middle
>>>>>>>>>>> end can't do masking everywhere yet, and it helps make some vector
>>>>>>>>>>> constants smaller, perhaps).
>>>>>>>>>>> 
>>>>>>>>>>>> This patch does not enable using fully masked loops or
>>>>>>>>>>>> masked epilogues by default.  More work on cost modeling
>>>>>>>>>>>> and vectorization kind selection on x86_64 is necessary
>>>>>>>>>>>> for this.
>>>>>>>>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>>>>>>>>>>> which could be exploited further to unify some of the flags
>>>>>>>>>>>> we have right now but there didn't seem to be many easy things
>>>>>>>>>>>> to merge, so I'm leaving this for followups.
>>>>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are kept in
>>>>>>>>>>>> their
>>>>>>>>>>>> original form and recorded in a hash_set now instead of being
>>>>>>>>>>>> processed to a vector of rgroup_controls.  Instead that's now
>>>>>>>>>>>> left to the final analysis phase which tries forming the
>>>>>>>>>>>> rgroup_controls
>>>>>>>>>>>> vector using while_ult and if that fails now tries AVX512 style
>>>>>>>>>>>> which needs a different organization and instead fills a hash_map
>>>>>>>>>>>> with the relevant info.  vect_get_loop_mask now has two
>>>>>>>>>>>> implementations,
>>>>>>>>>>>> one for the two mask styles we then have.
>>>>>>>>>>>> I have decided against interweaving
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors
>>>>>>>>>>>> with conditions to do AVX512 style masking and instead opted to
>>>>>>>>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>>>>>>>>>>> Likewise for vect_verify_full_masking vs
>>>>>>>>>>>> vect_verify_full_masking_avx512.
>>>>>>>>>>>> I was split between making 'vec_loop_masks' a class with methods,
>>>>>>>>>>>> possibly merging in the _len stuff into a single registry.  It
>>>>>>>>>>>> seemed to be too many changes for the purpose of getting AVX512
>>>>>>>>>>>> working.  I'm going to play wait and see what happens with RISC-V
>>>>>>>>>>>> here since they are going to get both masks and lengths registered
>>>>>>>>>>>> I think.
>>>>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>>>>>>>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>>>>>>>>>>> looked odd.
>>>>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>>>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>>>>>>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>>>>>>>>>>> and one latent wrong-code (PR110237).
>>>>>>>>>>>> There's followup work to be done to try enabling masked epilogues
>>>>>>>>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>>>>>>>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>>>>>>>>>>> right is going to be challenging.
>>>>>>>>>>>> Any comments?
>>>>>>>>>>>> OK?
>>>>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>>>>>>>>> work for it so in case the while_ult path fails (not sure if
>>>>>>>>>>>> it ever does) it could get _avx512 style masking.  Likewise
>>>>>>>>>>>> testing on ARM just to see I didn't break anything here.
>>>>>>>>>>>> I don't have SVE hardware so testing is probably meaningless.
>>>>>>>>>>> 
>>>>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>>>> 
>>>>>>>>>> Well, only you know (from experience), but sure that's a nice start.
>>>>>>>>> 
>>>>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>>>>> regressions.
>>>>>>>>> I
>>>>>>>>> have another run going with the other param settings.
>>>>>>>>> 
>>>>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>>>>> development,
>>>>>>>>> but the tsvc tests are now really slow, at least when run on a single
>>>>>>>>> GPU
>>>>>>>>> thread.)
>>>>>>>>> 
>>>>>>>>> I tried some small examples with --param vect-partial-vector-usage=1
>>>>>>>>> (IIUC
>>>>>>>>> this prevents masked loops, but not masked epilogues, right?)
>>>>>>>> 
>>>>>>>> Yes.  That should also work with the while_ult style btw.
>>>>>>>> 
>>>>>>>>> and the results
>>>>>>>>> look good. I plan to do some benchmarking shortly. One comment:
>>>>>>>>> building
>>>>>>>>> a
>>>>>>>>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in
>>>>>>>>> the
>>>>>>>>> constant pool and an unnecessary memory load (it literally has to use
>>>>>>>>> this
>>>>>>>>> sequence to generate the addresses to load the constant!) Generating
>>>>>>>>> the
>>>>>>>>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
>>>>>>>>> ABI-mandated register that already holds that value. (Perhaps I have
>>>>>>>>> another
>>>>>>>>> piece missing here, IDK?)
>>>>>>>> 
>>>>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
>>>>>>>> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
>>>>>>>> It might be that the target needs to pattern match these constants
>>>>>>>> at RTL expansion time?
>>>>>>>> 
>>>>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>>>> 
>>>>>>> I tried it both ways; both appear to work, and the while_ult case does
>>>>>>> avoid
>>>>>>> the constant vector. I also don't seem to need while_ult for the fully
>>>>>>> masked
>>>>>>> case any more (is that new?).
>>>>>> 
>>>>>> Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
>>>>>> conveniently available but it has to multiply the IV with the
>>>>>> number of scalars per iter which has overflow issues it has
>>>>>> to compensate for by choosing a wider IV.  I'm avoiding that
>>>>>> issue (besides for alignment peeling) by instead altering
>>>>>> the constant vector to compare against.  On x86 the constant
>>>>>> vector is always a load but the multiplication would add to
>>>>>> the latency of mask production which already isn't too great.
>>>>> 
>>>>> Is the multiplication not usually a shift?
>>>>> 
>>>>>> And yes, the alternate scheme doesn't rely on while_ult but instead
>>>>>> on vec_cmpu to produce the masks.
>>>>>> 
>>>>>> You might be able to produce the {0, 0, 1, 1, ... } constant
>>>>>> by interleaving v1 with itself?  Any non-power-of-two duplication
>>>>>> looks more difficult though.
>>>>> 
>>>>> I think that would need to use a full permutation, which is probably faster
>>>>> than a cold load, but in all these cases the vector that defines the
>>>>> permutation looks exactly like the result, so ......
>>>>> 
>>>>> I've been playing with this stuff some more and I find that even though GCN
>>>>> supports fully masked loops and uses them when I test without offload, it's
>>>>> actually been running in param_vect_partial_vector_usage==0 mode for
>>>>> offload
>>>>> because i386.cc has that hardcoded and the offload compiler inherits param
>>>>> settings from the host.
>>>> 
>>>> Doesn't that mean it will have a scalar epilog and a very large VF for the
>>>> main loop due to the large vector size?
>>>> 
>>>>> I tried running the Babelstream benchmark with the various settings and
>>>>> it's a
>>>>> wash for most of the measurements (memory limited, most likely), but the
>>>>> "Dot"
>>>>> benchmark is considerably slower when fully masked (about 50%). This
>>>>> probably
>>>>> explains why adding the additional "fake" smaller vector sizes was so good
>>>>> for
>>>>> our numbers, but confirms that the partial epilogue is a good option.
>>>> 
>>>> Ah, "fake" smaller vector sizes probably then made up for this with
>>>> "fixed" size epilogue vectorization?  But yes, I think a vectorized
>>>> epilog with partial vectors that then does not iterate would get you
>>>> the best of both worlds.
>>> 
>>> Yes, it uses V32 for the epilogue, which won't fit every case but it's better
>>> than nothing.
>>> 
>>>> So param_vect_partial_vector_usage == 1.
>>> 
>>> Unfortunately, there doesn't seem to be a way to set this *only* for the
>>> offload compiler. If you could fix it for x86_64 soon then that would be
>>> awesome. :)
>> So some opts tweaking in the GCN option_override hook doesn't work?
> 
> I didn't try that, but -foffload-options=--param=vect-partial-vector-usage=1 had no effect.

I guess the flag is streamed as a function-specific optimization…

> 
>>>> Whether with a while_ult optab or the vec_cmpu scheme should then
>>>> depend on generated code quality.
>>> 
>>> Which looks like it depends on these constants alone.
>> Understood, compared to while_ult these now appear explicitly.  As
>> Richard said, you likely have to tweak the backend to make use of 'r1'.
> 
> Yes, I think I understood what he meant. Should be doable.
> 
> Andrew


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-15 12:14   ` Richard Biener
@ 2023-06-15 12:53     ` Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-06-15 12:53 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Richard Biener via Gcc-patches, Jan Hubicka, hongtao.liu,
	kirill.yukhin, ams

On Thu, 15 Jun 2023, Richard Biener wrote:

> On Wed, 14 Jun 2023, Richard Sandiford wrote:
> 
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > This implemens fully masked vectorization or a masked epilog for
> > > AVX512 style masks which single themselves out by representing
> > > each lane with a single bit and by using integer modes for the mask
> > > (both is much like GCN).
> > >
> > > AVX512 is also special in that it doesn't have any instruction
> > > to compute the mask from a scalar IV like SVE has with while_ult.
> > > Instead the masks are produced by vector compares and the loop
> > > control retains the scalar IV (mainly to avoid dependences on
> > > mask generation, a suitable mask test instruction is available).
> > >
> > > Like RVV code generation prefers a decrementing IV though IVOPTs
> > > messes things up in some cases removing that IV to eliminate
> > > it with an incrementing one used for address generation.
> > >
> > > One of the motivating testcases is from PR108410 which in turn
> > > is extracted from x264 where large size vectorization shows
> > > issues with small trip loops.  Execution time there improves
> > > compared to classic AVX512 with AVX2 epilogues for the cases
> > > of less than 32 iterations.
> > >
> > > size   scalar     128     256     512    512e    512f
> > >     1    9.42   11.32    9.35   11.17   15.13   16.89
> > >     2    5.72    6.53    6.66    6.66    7.62    8.56
> > >     3    4.49    5.10    5.10    5.74    5.08    5.73
> > >     4    4.10    4.33    4.29    5.21    3.79    4.25
> > >     6    3.78    3.85    3.86    4.76    2.54    2.85
> > >     8    3.64    1.89    3.76    4.50    1.92    2.16
> > >    12    3.56    2.21    3.75    4.26    1.26    1.42
> > >    16    3.36    0.83    1.06    4.16    0.95    1.07
> > >    20    3.39    1.42    1.33    4.07    0.75    0.85
> > >    24    3.23    0.66    1.72    4.22    0.62    0.70
> > >    28    3.18    1.09    2.04    4.20    0.54    0.61
> > >    32    3.16    0.47    0.41    0.41    0.47    0.53
> > >    34    3.16    0.67    0.61    0.56    0.44    0.50
> > >    38    3.19    0.95    0.95    0.82    0.40    0.45
> > >    42    3.09    0.58    1.21    1.13    0.36    0.40
> > >
> > > 'size' specifies the number of actual iterations, 512e is for
> > > a masked epilog and 512f for the fully masked loop.  From
> > > 4 scalar iterations on the AVX512 masked epilog code is clearly
> > > the winner, the fully masked variant is clearly worse and
> > > it's size benefit is also tiny.
> > >
> > > This patch does not enable using fully masked loops or
> > > masked epilogues by default.  More work on cost modeling
> > > and vectorization kind selection on x86_64 is necessary
> > > for this.
> > >
> > > Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> > > which could be exploited further to unify some of the flags
> > > we have right now but there didn't seem to be many easy things
> > > to merge, so I'm leaving this for followups.
> > >
> > > Mask requirements as registered by vect_record_loop_mask are kept in their
> > > original form and recorded in a hash_set now instead of being
> > > processed to a vector of rgroup_controls.  Instead that's now
> > > left to the final analysis phase which tries forming the rgroup_controls
> > > vector using while_ult and if that fails now tries AVX512 style
> > > which needs a different organization and instead fills a hash_map
> > > with the relevant info.  vect_get_loop_mask now has two implementations,
> > > one for the two mask styles we then have.
> > >
> > > I have decided against interweaving vect_set_loop_condition_partial_vectors
> > > with conditions to do AVX512 style masking and instead opted to
> > > "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > > Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> > >
> > > I was split between making 'vec_loop_masks' a class with methods,
> > > possibly merging in the _len stuff into a single registry.  It
> > > seemed to be too many changes for the purpose of getting AVX512
> > > working.  I'm going to play wait and see what happens with RISC-V
> > > here since they are going to get both masks and lengths registered
> > > I think.
> > >
> > > The vect_prepare_for_masked_peels hunk might run into issues with
> > > SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > > looked odd.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> > > the testsuite with --param vect-partial-vector-usage=2 with and
> > > without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> > > and one latent wrong-code (PR110237).
> > >
> > > There's followup work to be done to try enabling masked epilogues
> > > for x86-64 by default (when AVX512 is enabled, possibly only when
> > > -mprefer-vector-width=512).  Getting cost modeling and decision
> > > right is going to be challenging.
> > >
> > > Any comments?
> > >
> > > OK?
> > 
> > Some comments below, but otherwise LGTM FWIW.
> > 
> > > Btw, testing on GCN would be welcome - the _avx512 paths could
> > > work for it so in case the while_ult path fails (not sure if
> > > it ever does) it could get _avx512 style masking.  Likewise
> > > testing on ARM just to see I didn't break anything here.
> > > I don't have SVE hardware so testing is probably meaningless.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 	* tree-vectorizer.h (enum vect_partial_vector_style): New.
> > > 	(_loop_vec_info::partial_vector_style): Likewise.
> > > 	(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
> > > 	(rgroup_controls::compare_type): Add.
> > > 	(vec_loop_masks): Change from a typedef to auto_vec<>
> > > 	to a structure.
> > > 	* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
> > > 	Adjust.
> > > 	(vect_set_loop_condition_partial_vectors_avx512): New function
> > > 	implementing the AVX512 partial vector codegen.
> > > 	(vect_set_loop_condition): Dispatch to the correct
> > > 	vect_set_loop_condition_partial_vectors_* function based on
> > > 	LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > > 	(vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
> > > 	in the original niter type.
> > > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > 	partial_vector_style.
> > > 	(_loop_vec_info::~_loop_vec_info): Release the hash-map recorded
> > > 	rgroup_controls.
> > > 	(can_produce_all_loop_masks_p): Adjust.
> > > 	(vect_verify_full_masking): Produce the rgroup_controls vector
> > > 	here.  Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
> > > 	(vect_verify_full_masking_avx512): New function implementing
> > > 	verification of AVX512 style masking.
> > > 	(vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > > 	(vect_analyze_loop_2): Also try AVX512 style masking.
> > > 	Adjust condition.
> > > 	(vect_estimate_min_profitable_iters): Implement AVX512 style
> > > 	mask producing cost.
> > > 	(vect_record_loop_mask): Do not build the rgroup_controls
> > > 	vector here but record masks in a hash-set.
> > > 	(vect_get_loop_mask): Implement AVX512 style mask query,
> > > 	complementing the existing while_ult style.
> > > ---
> > >  gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++-
> > >  gcc/tree-vect-loop.cc       | 413 +++++++++++++++++++++++++++++++-----
> > >  gcc/tree-vectorizer.h       |  35 ++-
> > >  3 files changed, 651 insertions(+), 61 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > index 1c8100c1a1c..f0ecaec28f4 100644
> > > --- a/gcc/tree-vect-loop-manip.cc
> > > +++ b/gcc/tree-vect-loop-manip.cc
> > > @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "insn-config.h"
> > >  #include "rtl.h"
> > >  #include "recog.h"
> > > +#include "langhooks.h"
> > > +#include "tree-vector-builder.h"
> > > +#include "optabs-tree.h"
> > >  
> > >  /*************************************************************************
> > >    Simple Loop Peeling Utilities
> > > @@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > >    rgroup_controls *iv_rgc = nullptr;
> > >    unsigned int i;
> > >    auto_vec<rgroup_controls> *controls = use_masks_p
> > > -					  ? &LOOP_VINFO_MASKS (loop_vinfo)
> > > +					  ? &LOOP_VINFO_MASKS (loop_vinfo).rgc_vec
> > >  					  : &LOOP_VINFO_LENS (loop_vinfo);
> > >    FOR_EACH_VEC_ELT (*controls, i, rgc)
> > >      if (!rgc->controls.is_empty ())
> > > @@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > >    return cond_stmt;
> > >  }
> > >  
> > > +/* Set up the iteration condition and rgroup controls for LOOP in AVX512
> > > +   style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the
> > > +   vectorized loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
> > > +   the number of iterations of the original scalar loop that should be
> > > +   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are as
> > > +   for vect_set_loop_condition.
> > > +
> > > +   Insert the branch-back condition before LOOP_COND_GSI and return the
> > > +   final gcond.  */
> > > +
> > > +static gcond *
> > > +vect_set_loop_condition_partial_vectors_avx512 (class loop *loop,
> > > +					 loop_vec_info loop_vinfo, tree niters,
> > > +					 tree final_iv,
> > > +					 bool niters_maybe_zero,
> > > +					 gimple_stmt_iterator loop_cond_gsi)
> > > +{
> > > +  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> > > +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > > +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > +  tree orig_niters = niters;
> > > +  gimple_seq preheader_seq = NULL;
> > > +
> > > +  /* Create an IV that counts down from niters and whose step
> > > +     is the number of iterations processed in the current iteration.
> > > +     Produce the controls with compares like the following.
> > > +
> > > +       # iv_2 = PHI <niters, iv_3>
> > > +       rem_4 = MIN <iv_2, VF>;
> > > +       remv_6 = { rem_4, rem_4, rem_4, ... }
> > > +       mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6;
> > > +       iv_3 = iv_2 - VF;
> > > +       if (iv_2 > VF)
> > > +	 continue;
> > > +
> > > +     Where the constant is built with elements at most VF - 1 and
> > > +     repetitions according to max_nscalars_per_iter which is guarnateed
> > > +     to be the same within a group.  */
> > > +
> > > +  /* Convert NITERS to the determined IV type.  */
> > > +  if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters))
> > > +      && niters_maybe_zero)
> > > +    {
> > > +      /* We know that there is always at least one iteration, so if the
> > > +	 count is zero then it must have wrapped.  Cope with this by
> > > +	 subtracting 1 before the conversion and adding 1 to the result.  */
> > > +      gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters)));
> > > +      niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters),
> > > +			     niters, build_minus_one_cst (TREE_TYPE (niters)));
> > > +      niters = gimple_convert (&preheader_seq, iv_type, niters);
> > > +      niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type,
> > > +			     niters, build_one_cst (iv_type));
> > > +    }
> > > +  else
> > > +    niters = gimple_convert (&preheader_seq, iv_type, niters);
> > > +
> > > +  /* Bias the initial value of the IV in case we need to skip iterations
> > > +     at the beginning.  */
> > > +  tree niters_adj = niters;
> > > +  if (niters_skip)
> > > +    {
> > > +      tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip);
> > > +      niters_adj = gimple_build (&preheader_seq, PLUS_EXPR,
> > > +				 iv_type, niters, skip);
> > > +    }
> > > +
> > > +  /* The iteration step is the vectorization factor.  */
> > > +  tree iv_step = build_int_cst (iv_type, vf);
> > > +
> > > +  /* Create the decrement IV.  */
> > > +  tree index_before_incr, index_after_incr;
> > > +  gimple_stmt_iterator incr_gsi;
> > > +  bool insert_after;
> > > +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > > +  create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop,
> > > +	     &incr_gsi, insert_after, &index_before_incr,
> > > +	     &index_after_incr);
> > > +
> > > +  /* Iterate over all the rgroups and fill in their controls.  */
> > > +  for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > +    {
> > > +      rgroup_controls *rgc = rgcm.second;
> > > +      if (rgc->controls.is_empty ())
> > > +	continue;
> > > +
> > > +      tree ctrl_type = rgc->type;
> > > +      poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
> > > +
> > > +      tree vectype = rgc->compare_type;
> > > +
> > > +      /* index_after_incr is the IV specifying the remaining iterations in
> > > +	 the next iteration.  */
> > > +      tree rem = index_after_incr;
> > > +      /* When the data type for the compare to produce the mask is
> > > +	 smaller than the IV type we need to saturate.  Saturate to
> > > +	 the smallest possible value (IV_TYPE) so we only have to
> > > +	 saturate once (CSE will catch redundant ones we add).  */
> > > +      if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type))
> > > +	rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > +			    UNKNOWN_LOCATION,
> > > +			    MIN_EXPR, TREE_TYPE (rem), rem, iv_step);
> > > +      rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > +			    UNKNOWN_LOCATION, TREE_TYPE (vectype), rem);
> > > +
> > > +      /* Build a data vector composed of the remaining iterations.  */
> > > +      rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > +					  UNKNOWN_LOCATION, vectype, rem);
> > > +
> > > +      /* Provide a definition of each vector in the control group.  */
> > > +      tree next_ctrl = NULL_TREE;
> > > +      tree first_rem = NULL_TREE;
> > > +      tree ctrl;
> > > +      unsigned int i;
> > > +      FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> > > +	{
> > > +	  /* Previous controls will cover BIAS items.  This control covers the
> > > +	     next batch.  */
> > > +	  poly_uint64 bias = nitems_per_ctrl * i;
> > > +
> > > +	  /* Build the constant to compare the remaining iters against,
> > > +	     this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately
> > > +	     split into pieces.  */
> > > +	  unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant ();
> > > +	  tree_vector_builder builder (vectype, n, 1);
> > > +	  for (unsigned i = 0; i < n; ++i)
> > > +	    {
> > > +	      unsigned HOST_WIDE_INT val
> > > +		= (i + bias.to_constant ()) / rgc->max_nscalars_per_iter;
> > > +	      gcc_assert (val < vf.to_constant ());
> > > +	      builder.quick_push (build_int_cst (TREE_TYPE (vectype), val));
> > > +	    }
> > > +	  tree cmp_series = builder.build ();
> > > +
> > > +	  /* Create the initial control.  First include all items that
> > > +	     are within the loop limit.  */
> > > +	  tree init_ctrl = NULL_TREE;
> > > +	  poly_uint64 const_limit;
> > > +	  /* See whether the first iteration of the vector loop is known
> > > +	     to have a full control.  */
> > > +	  if (poly_int_tree_p (niters, &const_limit)
> > > +	      && known_ge (const_limit, (i + 1) * nitems_per_ctrl))
> > > +	    init_ctrl = build_minus_one_cst (ctrl_type);
> > > +	  else
> > > +	    {
> > > +	      /* The remaining work items initially are niters.  Saturate,
> > > +		 splat and compare.  */
> > > +	      if (!first_rem)
> > > +		{
> > > +		  first_rem = niters;
> > > +		  if (TYPE_PRECISION (TREE_TYPE (vectype))
> > > +		      < TYPE_PRECISION (iv_type))
> > > +		    first_rem = gimple_build (&preheader_seq,
> > > +					      MIN_EXPR, TREE_TYPE (first_rem),
> > > +					      first_rem, iv_step);
> > > +		  first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype),
> > > +					      first_rem);
> > > +		  first_rem = gimple_build_vector_from_val (&preheader_seq,
> > > +							    vectype, first_rem);
> > > +		}
> > > +	      init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type,
> > > +					cmp_series, first_rem);
> > > +	    }
> > > +
> > > +	  /* Now AND out the bits that are within the number of skipped
> > > +	     items.  */
> > > +	  poly_uint64 const_skip;
> > > +	  if (niters_skip
> > > +	      && !(poly_int_tree_p (niters_skip, &const_skip)
> > > +		   && known_le (const_skip, bias)))
> > > +	    {
> > > +	      /* For integer mode masks it's cheaper to shift out the bits
> > > +		 since that avoids loading a constant.  */
> > > +	      gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT);
> > > +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > > +					lang_hooks.types.type_for_mode
> > > +					  (TYPE_MODE (ctrl_type), 1),
> > > +					init_ctrl);
> > > +	      /* ???  But when the shift amount isn't constant this requires
> > > +		 a round-trip to GRPs.  We could apply the bias to either
> > > +		 side of the compare instead.  */
> > > +	      tree shift = gimple_build (&preheader_seq, MULT_EXPR,
> > > +					 TREE_TYPE (niters_skip),
> > > +					 niters_skip,
> > > +					 build_int_cst (TREE_TYPE (niters_skip),
> > > +							rgc->max_nscalars_per_iter));
> > > +	      init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR,
> > > +					TREE_TYPE (init_ctrl),
> > > +					init_ctrl, shift);
> > > +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > > +					ctrl_type, init_ctrl);
> > 
> > It looks like this assumes that either the first lane or the last lane
> > of the first mask is always true, is that right?  I'm not sure we ever
> > prove that, at least not for SVE.  There it's possible to have inactive
> > elements at both ends of the same mask.
> 
> It builds a mask for the first iteration without considering niters_skip
> and then shifts in inactive lanes for niters_skip.  As I'm using
> VIEW_CONVERT_EXPR this indeed can cause bits outside of the range
> relevant for the vector mask to be set but at least x86 ignores those
> (so I'm missing a final AND with TYPE_PRECISION of the mask).
> 
> So I think it should work correctly.  For variable niters_skip
> it might be faster to build a mask based on niter_skip adjusted niters
> and then AND since we can build both masks in parallel.  Any
> variable niters_skip shifting has to be done in GPRs as the mask
> register ops only can perform shifts by immediates.
> 
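A hand-written AVX-512 sketch of that shift-based adjustment (the helper name and the one-bit-per-lane V16SI shape are assumptions): the first-iteration mask is built ignoring the skip, and the skipped lanes are then knocked out by a left shift that, for a variable count, has to go through a GPR.

  #include <immintrin.h>

  /* Make the first NSKIP lanes of INIT inactive by shifting the mask
     bits up; the variable shift is done in a GPR because the k-register
     shift instructions only take immediate counts.  */
  static inline __mmask16
  skip_adjust (__mmask16 init, unsigned nskip)
  {
    unsigned bits = (unsigned) init;   /* kmov to a GPR  */
    bits <<= nskip;                    /* lanes 0 .. nskip-1 become inactive  */
    return (__mmask16) bits;           /* truncate and kmov back  */
  }
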
> > > +	    }
> > > +
> > > +	  /* Get the control value for the next iteration of the loop.  */
> > > +	  next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > +				    UNKNOWN_LOCATION,
> > > +				    LT_EXPR, ctrl_type, cmp_series, rem);
> > > +
> > > +	  vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
> > > +	}
> > > +    }
> > > +
> > > +  /* Emit all accumulated statements.  */
> > > +  add_preheader_seq (loop, preheader_seq);
> > > +
> > > +  /* Adjust the exit test using the decrementing IV.  */
> > > +  edge exit_edge = single_exit (loop);
> > > +  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
> > > +  /* When we peel for alignment with niter_skip != 0 this can
> > > +     cause niter + niter_skip to wrap and since we are comparing the
> > > +     value before the decrement here we get a false early exit.
> > > +     We can't compare the value after decrement either because that
> > > +     decrement could wrap as well as we're not doing a saturating
> > > +     decrement.  To avoid this situation we force a larger
> > > +     iv_type.  */
> > > +  gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step,
> > > +					NULL_TREE, NULL_TREE);
> > > +  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > > +
> > > +  /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times.
> > > +     Subtract one from this to get the latch count.  */
> > > +  tree niters_minus_one
> > > +    = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters,
> > > +		   build_minus_one_cst (TREE_TYPE (orig_niters)));
> > > +  tree niters_adj2 = fold_convert (iv_type, niters_minus_one);
> > > +  if (niters_skip)
> > > +    niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one,
> > > +			       fold_convert (iv_type, niters_skip));
> > > +  loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type,
> > > +				     niters_adj2, iv_step);
> > > +
> > > +  if (final_iv)
> > > +    {
> > > +      gassign *assign = gimple_build_assign (final_iv, orig_niters);
> > > +      gsi_insert_on_edge_immediate (single_exit (loop), assign);
> > > +    }
> > > +
> > > +  return cond_stmt;
> > > +}
> > > +
> > > +
> > >  /* Like vect_set_loop_condition, but handle the case in which the vector
> > >     loop handles exactly VF scalars per iteration.  */
> > >  
> > > @@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
> > >    gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
> > >  
> > >    if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> > > -    cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > > -							 niters, final_iv,
> > > -							 niters_maybe_zero,
> > > -							 loop_cond_gsi);
> > > +    {
> > > +      if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512)
> > > +	cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo,
> > > +								    niters, final_iv,
> > > +								    niters_maybe_zero,
> > > +								    loop_cond_gsi);
> > > +      else
> > > +	cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > > +							     niters, final_iv,
> > > +							     niters_maybe_zero,
> > > +							     loop_cond_gsi);
> > > +    }
> > >    else
> > >      cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv,
> > >  						niters_maybe_zero,
> > > @@ -2030,7 +2281,8 @@ void
> > >  vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
> > >  {
> > >    tree misalign_in_elems;
> > > -  tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> > > +  /* ???  With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end.  */
> > > +  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> > 
> > Like you say, I think this might cause problems for SVE, since
> > LOOP_VINFO_MASK_SKIP_NITERS is assumed to be the compare type
> > in vect_set_loop_controls_directly.  Not sure what the best way
> > around that is.
> 
> We should be able to do the conversion there, we could also simply
> leave it at what get_misalign_in_elems uses (an unsigned type
> with the width of a pointer).  The issue with the AVX512 style
> masking is that there's not a single compare type so I left
> LOOP_VINFO_RGROUP_COMPARE_TYPE as error_mark_node to catch
> any erroneous uses.
> 
> Any preference here?
> 
> > >  
> > >    gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));
> > >  
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index 1897e720389..9be66b8fbc5 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "vec-perm-indices.h"
> > >  #include "tree-eh.h"
> > >  #include "case-cfn-macros.h"
> > > +#include "langhooks.h"
> > >  
> > >  /* Loop Vectorization Pass.
> > >  
> > > @@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
> > >      mask_skip_niters (NULL_TREE),
> > >      rgroup_compare_type (NULL_TREE),
> > >      simd_if_cond (NULL_TREE),
> > > +    partial_vector_style (vect_partial_vectors_none),
> > >      unaligned_dr (NULL),
> > >      peeling_for_alignment (0),
> > >      ptr_mask (0),
> > > @@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info ()
> > >  {
> > >    free (bbs);
> > >  
> > > -  release_vec_loop_controls (&masks);
> > > +  for (auto m : masks.rgc_map)
> > > +    {
> > > +      m.second->controls.release ();
> > > +      delete m.second;
> > > +    }
> > > +  release_vec_loop_controls (&masks.rgc_vec);
> > >    release_vec_loop_controls (&lens);
> > >    delete ivexpr_map;
> > >    delete scan_map;
> > > @@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
> > >  {
> > >    rgroup_controls *rgm;
> > >    unsigned int i;
> > > -  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> > > +  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm)
> > >      if (rgm->type != NULL_TREE
> > >  	&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > >  					    cmp_type, rgm->type,
> > > @@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> > >    if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > >      return false;
> > >  
> > > +  /* Produce the rgroup controls.  */
> > > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > > +    {
> > > +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +      tree vectype = mask.first;
> > > +      unsigned nvectors = mask.second;
> > > +
> > > +      if (masks->rgc_vec.length () < nvectors)
> > > +	masks->rgc_vec.safe_grow_cleared (nvectors, true);
> > > +      rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1];
> > > +      /* The number of scalars per iteration and the number of vectors are
> > > +	 both compile-time constants.  */
> > > +      unsigned int nscalars_per_iter
> > > +	  = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > +		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > > +
> > > +      if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > > +	{
> > > +	  rgm->max_nscalars_per_iter = nscalars_per_iter;
> > > +	  rgm->type = truth_type_for (vectype);
> > > +	  rgm->factor = 1;
> > > +	}
> > > +    }
> > > +
> > >    /* Calculate the maximum number of scalars per iteration for every rgroup.  */
> > >    unsigned int max_nscalars_per_iter = 1;
> > > -  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
> > > +  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec)
> > >      max_nscalars_per_iter
> > >        = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
> > >  
> > > @@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> > >      }
> > >  
> > >    if (!cmp_type)
> > > -    return false;
> > > +    {
> > > +      LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release ();
> > > +      return false;
> > > +    }
> > >  
> > >    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type;
> > >    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult;
> > > +  return true;
> > > +}
> > > +
> > > +/* Each statement in LOOP_VINFO can be masked where necessary.  Check
> > > +   whether we can actually generate AVX512 style masks.  Return true if so,
> > > +   storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE.  */
> > > +
> > > +static bool
> > > +vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo)
> > > +{
> > > +  /* Produce differently organized rgc_vec and differently check
> > > +     we can produce masks.  */
> > > +
> > > +  /* Use a normal loop if there are no statements that need masking.
> > > +     This only happens in rare degenerate cases: it means that the loop
> > > +     has no loads, no stores, and no live-out values.  */
> > > +  if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > > +    return false;
> > > +
> > > +  /* For the decrementing IV we need to represent all values in
> > > +     [0, niter + niter_skip] where niter_skip is the elements we
> > > +     skip in the first iteration for prologue peeling.  */
> > > +  tree iv_type = NULL_TREE;
> > > +  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> > > +  unsigned int iv_precision = UINT_MAX;
> > > +  if (iv_limit != -1)
> > > +    iv_precision = wi::min_precision (iv_limit, UNSIGNED);
> > > +
> > > +  /* First compute the type for the IV we use to track the remaining
> > > +     scalar iterations.  */
> > > +  opt_scalar_int_mode cmp_mode_iter;
> > > +  FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > > +    {
> > > +      unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > > +      if (cmp_bits >= iv_precision
> > > +	  && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > > +	{
> > > +	  iv_type = build_nonstandard_integer_type (cmp_bits, true);
> > > +	  if (iv_type)
> > > +	    break;
> > > +	}
> > > +    }
> > > +  if (!iv_type)
> > > +    return false;
> > > +
> > > +  /* Produce the rgroup controls.  */
> > > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > > +    {
> > > +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > +      tree vectype = mask.first;
> > > +      unsigned nvectors = mask.second;
> > > +
> > > +      /* The number of scalars per iteration and the number of vectors are
> > > +	 both compile-time constants.  */
> > > +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > > +      unsigned int nscalars_per_iter
> > > +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> > 
> > The comment seems to be borrowed from:
> > 
> >   /* The number of scalars per iteration and the number of vectors are
> >      both compile-time constants.  */
> >   unsigned int nscalars_per_iter
> >     = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > 		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > 
> > but the calculation instead applies to_constant to the VF and to the
> > number of vector elements, which aren't in general known to be constant.
> > Does the exact_div not work here too?
> 
> It failed the overload when I use a constant nunits, but yes, the
> original code works fine, but I need ...
> 
> > Avoiding the current to_constants here only moves the problem elsewhere
> > though.  Since this is a verification routine, I think this should
> > instead use is_constant and fail if it is false.
> > 
> > > +      /* We key off a hash-map with nscalars_per_iter and the number of total
> > > +	 lanes in the mask vector and then remember the needed vector mask
> > > +	 with the largest number of lanes (thus the fewest nV).  */
> > > +      bool existed;
> > > +      rgroup_controls *& rgc
> > > +	= masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter,
> > > +							nvectors * nunits),
> 
> ^^^
> 
> this decomposition to constants (well, OK - I could have used a
> poly-int for the second half of the pair).
> 
> I did wonder if I could restrict this more - I need different groups
> for different nscalars_per_iter but also when the total number of
> work items is different.  Somehow I think it's really only
> nscalars_per_iter that's interesting but I didn't get to convince
> myself that nvectors * nunits is going to be always the same
> when I vary the vector size within a loop.

Well, yes, nvectors * nunits / nscalars_per_iter is the
vectorization factor which is invariant.  If nscalars_per_iter is
invariant within the group then nvectors * nunits should be as well.
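
To spell that out with made-up numbers (modes chosen purely for
illustration): masks are recorded per (vectype, nvectors) and

  nvectors * nunits = nscalars_per_iter * VF

so with VF = 64 a V64QI group has nvectors = 1, nunits = 64 while a
V16SI group has nvectors = 4, nunits = 16; both give
nscalars_per_iter = 1 and thus the same nvectors * nunits = 64.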

That should simplify things.

Richard.

> > > +					&existed);
> > > +      if (!existed)
> > > +	{
> > > +	  rgc = new rgroup_controls ();
> > > +	  rgc->type = truth_type_for (vectype);
> > > +	  rgc->compare_type = NULL_TREE;
> > > +	  rgc->max_nscalars_per_iter = nscalars_per_iter;
> > > +	  rgc->factor = 1;
> > > +	  rgc->bias_adjusted_ctrl = NULL_TREE;
> > > +	}
> > > +      else
> > > +	{
> > > +	  gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter);
> > > +	  if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type),
> > > +			TYPE_VECTOR_SUBPARTS (vectype)))
> > > +	    rgc->type = truth_type_for (vectype);
> > > +	}
> > > +    }
> > > +
> > > +  /* There is no fixed compare type we are going to use but we have to
> > > +     be able to get at one for each mask group.  */
> > > +  unsigned int min_ni_width
> > > +    = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED);
> > > +
> > > +  bool ok = true;
> > > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > +    {
> > > +      rgroup_controls *rgc = mask.second;
> > > +      tree mask_type = rgc->type;
> > > +      if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1)
> > > +	{
> > > +	  ok = false;
> > > +	  break;
> > > +	}
> > > +
> > > +      /* If iv_type is usable as compare type use that - we can elide the
> > > +	 saturation in that case.   */
> > > +      if (TYPE_PRECISION (iv_type) >= min_ni_width)
> > > +	{
> > > +	  tree cmp_vectype
> > > +	    = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > > +	  if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > > +	    rgc->compare_type = cmp_vectype;
> > > +	}
> > > +      if (!rgc->compare_type)
> > > +	FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > > +	  {
> > > +	    unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > > +	    if (cmp_bits >= min_ni_width
> > > +		&& targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > > +	      {
> > > +		tree cmp_type = build_nonstandard_integer_type (cmp_bits, true);
> > > +		if (!cmp_type)
> > > +		  continue;
> > > +
> > > +		/* Check whether we can produce the mask with cmp_type.  */
> > > +		tree cmp_vectype
> > > +		  = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > > +		if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > > +		  {
> > > +		    rgc->compare_type = cmp_vectype;
> > > +		    break;
> > > +		  }
> > > +	      }
> > > +	}
> > 
> > Just curious: is this fallback loop ever used in practice?
> > TYPE_PRECISION (iv_type) >= min_ni_width seems like an easy condition
> > to satisfy.
> 
> I think the only remaining case for this particular condition
> is an original unsigned IV and us peeling for alignment with a mask.
> 
> But then of course the main case is that with, say, an unsigned long IV but
> a QImode data type the x86 backend lacks a V64DImode vector type, so we
> cannot produce the mask mode for V64QImode data with a compare using a 
> data type based on that IV type.
> 
> For 'int' IVs and int/float or double data we can always use the
> original IV here.
> 
> > > +      if (!rgc->compare_type)
> > > +	{
> > > +	  ok = false;
> > > +	  break;
> > > +	}
> > > +    }
> > > +  if (!ok)
> > > +    {
> > > +      LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty ();
> > > +      return false;
> > 
> > It looks like this leaks the rgroup_controls created above.
> 
> Fixed.
> 
> > > +    }
> > > +
> > > +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node;
> > > +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512;
> > >    return true;
> > >  }
> > >  
> > > @@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo)
> > >  
> > >    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> > >    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len;
> > >  
> > >    return true;
> > >  }
> > > @@ -2712,16 +2893,24 @@ start_over:
> > >  
> > >    /* If we still have the option of using partial vectors,
> > >       check whether we can generate the necessary loop controls.  */
> > > -  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > -      && !vect_verify_full_masking (loop_vinfo)
> > > -      && !vect_verify_loop_lens (loop_vinfo))
> > > -    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > > +    {
> > > +      if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > > +	{
> > > +	  if (!vect_verify_full_masking (loop_vinfo)
> > > +	      && !vect_verify_full_masking_avx512 (loop_vinfo))
> > > +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > +	}
> > > +      else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */
> > > +	if (!vect_verify_loop_lens (loop_vinfo))
> > > +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > +    }
> > >  
> > >    /* If we're vectorizing a loop that uses length "controls" and
> > >       can iterate more than once, we apply decrementing IV approach
> > >       in loop control.  */
> > >    if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > -      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > > +      && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len
> > >        && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> > >        && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >  	   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > > @@ -3022,7 +3211,7 @@ again:
> > >    delete loop_vinfo->vector_costs;
> > >    loop_vinfo->vector_costs = nullptr;
> > >    /* Reset accumulated rgroup information.  */
> > > -  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
> > > +  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec);
> > >    release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
> > >    /* Reset assorted flags.  */
> > >    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
> > > @@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > >  			  cond_branch_not_taken, vect_epilogue);
> > >  
> > >    /* Take care of special costs for rgroup controls of partial vectors.  */
> > > -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > > +      && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > +	  == vect_partial_vectors_avx512))
> > > +    {
> > > +      /* Calculate how many masks we need to generate.  */
> > > +      unsigned int num_masks = 0;
> > > +      bool need_saturation = false;
> > > +      for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > +	{
> > > +	  rgroup_controls *rgm = rgcm.second;
> > > +	  unsigned nvectors
> > > +	    = (rgcm.first.second
> > > +	       / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ());
> > > +	  num_masks += nvectors;
> > > +	  if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type))
> > > +	      < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo)))
> > > +	    need_saturation = true;
> > > +	}
> > > +
> > > +      /* ???  The target isn't able to identify the costs below as
> > > +	 producing masks so it cannot penalize cases where we'd run
> > > +	 out of mask registers for example.  */
> > > +
> > > +      /* In the worst case, we need to generate each mask in the prologue
> > > +	 and in the loop body.  We need one splat per group and one
> > > +	 compare per mask.
> > > +
> > > +	 Sometimes we can use unpacks instead of generating prologue
> > > +	 masks and sometimes the prologue mask will fold to a constant,
> > > +	 so the actual prologue cost might be smaller.  However, it's
> > > +	 simpler and safer to use the worst-case cost; if this ends up
> > > +	 being the tie-breaker between vectorizing or not, then it's
> > > +	 probably better not to vectorize.  */
> > 
> > Not sure all of this applies to the AVX512 case.  In particular, the
> > unpacks bit doesn't seem relevant.
> 
> Removed that part and noted we fail to account for the cost of dealing
> with different nvector mask cases using the same wide masks
> (the cases that vect_get_loop_mask constructs).  For SVE it's
> re-interpreting with VIEW_CONVERT that's done there which should be
> free but for AVX512 (when we ever get multiple vector sizes in one
> loop) to split a mask in half we need one shift for the upper part.
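
(As a rough sketch of that cost, with made-up lane counts: for a 16-lane
mask and a use that only needs the upper 8 lanes, the low half is just
conversions while the high half takes the extra shift, something like

  tmp  = VIEW_CONVERT <unsigned short> (mask_16);
  tmp2 = tmp >> 8;
  tmp3 = (unsigned char) tmp2;
  hi_8 = VIEW_CONVERT <vector(8) <signed-boolean:1>> (tmp3);

matching what vect_get_loop_mask emits below for the vpart != 0 case.)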
> 
> > > +      (void) add_stmt_cost (target_cost_data,
> > > +			    num_masks
> > > +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > > +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue);
> > > +      (void) add_stmt_cost (target_cost_data,
> > > +			    num_masks
> > > +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > > +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body);
> > > +
> > > +      /* When we need saturation we need it both in the prologue and
> > > +	 the epilogue.  */
> > > +      if (need_saturation)
> > > +	{
> > > +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > > +				NULL, NULL, NULL_TREE, 0, vect_prologue);
> > > +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > > +				NULL, NULL, NULL_TREE, 0, vect_body);
> > > +	}
> > > +    }
> > > +  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > > +	   && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > +	       == vect_partial_vectors_avx512))
> 
> Eh, cut&paste error here, should be == vect_partial_vectors_while_ult
> 
> > >      {
> > >        /* Calculate how many masks we need to generate.  */
> > >        unsigned int num_masks = 0;
> > >        rgroup_controls *rgm;
> > >        unsigned int num_vectors_m1;
> > > -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> > > +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm)
> > 
> > Nit: long line.
> 
> fixed.
> 
> > >  	if (rgm->type)
> > >  	  num_masks += num_vectors_m1 + 1;
> > >        gcc_assert (num_masks > 0);
> > > @@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > >  		       unsigned int nvectors, tree vectype, tree scalar_mask)
> > >  {
> > >    gcc_assert (nvectors != 0);
> > > -  if (masks->length () < nvectors)
> > > -    masks->safe_grow_cleared (nvectors, true);
> > > -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > > -  /* The number of scalars per iteration and the number of vectors are
> > > -     both compile-time constants.  */
> > > -  unsigned int nscalars_per_iter
> > > -    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > -		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > >  
> > >    if (scalar_mask)
> > >      {
> > > @@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > >        loop_vinfo->scalar_cond_masked_set.add (cond);
> > >      }
> > >  
> > > -  if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > > -    {
> > > -      rgm->max_nscalars_per_iter = nscalars_per_iter;
> > > -      rgm->type = truth_type_for (vectype);
> > > -      rgm->factor = 1;
> > > -    }
> > > +  masks->mask_set.add (std::make_pair (vectype, nvectors));
> > >  }
> > >  
> > >  /* Given a complete set of masks MASKS, extract mask number INDEX
> > > @@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > >     arrangement.  */
> > >  
> > >  tree
> > > -vect_get_loop_mask (loop_vec_info,
> > > +vect_get_loop_mask (loop_vec_info loop_vinfo,
> > >  		    gimple_stmt_iterator *gsi, vec_loop_masks *masks,
> > >  		    unsigned int nvectors, tree vectype, unsigned int index)
> > >  {
> > > -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > > -  tree mask_type = rgm->type;
> > > -
> > > -  /* Populate the rgroup's mask array, if this is the first time we've
> > > -     used it.  */
> > > -  if (rgm->controls.is_empty ())
> > > +  if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > +      == vect_partial_vectors_while_ult)
> > >      {
> > > -      rgm->controls.safe_grow_cleared (nvectors, true);
> > > -      for (unsigned int i = 0; i < nvectors; ++i)
> > > +      rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1];
> > > +      tree mask_type = rgm->type;
> > > +
> > > +      /* Populate the rgroup's mask array, if this is the first time we've
> > > +	 used it.  */
> > > +      if (rgm->controls.is_empty ())
> > >  	{
> > > -	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > > -	  /* Provide a dummy definition until the real one is available.  */
> > > -	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > -	  rgm->controls[i] = mask;
> > > +	  rgm->controls.safe_grow_cleared (nvectors, true);
> > > +	  for (unsigned int i = 0; i < nvectors; ++i)
> > > +	    {
> > > +	      tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > > +	      /* Provide a dummy definition until the real one is available.  */
> > > +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > +	      rgm->controls[i] = mask;
> > > +	    }
> > >  	}
> > > -    }
> > >  
> > > -  tree mask = rgm->controls[index];
> > > -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > > -		TYPE_VECTOR_SUBPARTS (vectype)))
> > > +      tree mask = rgm->controls[index];
> > > +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > > +		    TYPE_VECTOR_SUBPARTS (vectype)))
> > > +	{
> > > +	  /* A loop mask for data type X can be reused for data type Y
> > > +	     if X has N times more elements than Y and if Y's elements
> > > +	     are N times bigger than X's.  In this case each sequence
> > > +	     of N elements in the loop mask will be all-zero or all-one.
> > > +	     We can then view-convert the mask so that each sequence of
> > > +	     N elements is replaced by a single element.  */
> > > +	  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > > +				  TYPE_VECTOR_SUBPARTS (vectype)));
> > > +	  gimple_seq seq = NULL;
> > > +	  mask_type = truth_type_for (vectype);
> > > +	  /* We can only use re-use the mask by reinterpreting it if it
> > > +	     occupies the same space, that is the mask with less elements
> > 
> > Nit: fewer elements
> 
> Ah, this assert was also a left-over from earlier attempts; I've removed
> it again.
> 
> > > +	     uses multiple bits for each masked elements.  */
> > > +	  gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
> > > +				* TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
> > > +				TYPE_PRECISION (TREE_TYPE (mask_type))
> > > +				* TYPE_VECTOR_SUBPARTS (mask_type)));
> > > +	  mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > > +	  if (seq)
> > > +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > > +	}
> > > +      return mask;
> > > +    }
> > > +  else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > +	   == vect_partial_vectors_avx512)
> > >      {
> > > -      /* A loop mask for data type X can be reused for data type Y
> > > -	 if X has N times more elements than Y and if Y's elements
> > > -	 are N times bigger than X's.  In this case each sequence
> > > -	 of N elements in the loop mask will be all-zero or all-one.
> > > -	 We can then view-convert the mask so that each sequence of
> > > -	 N elements is replaced by a single element.  */
> > > -      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > > -			      TYPE_VECTOR_SUBPARTS (vectype)));
> > > +      /* The number of scalars per iteration and the number of vectors are
> > > +	 both compile-time constants.  */
> > > +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > > +      unsigned int nscalars_per_iter
> > > +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> > 
> > Same disconnect between the comment and the code here.  If we do use
> > is_constant in vect_verify_full_masking_avx512 then we could reference
> > that instead.
> 
> I'm going to dig into this a bit, maybe I can simplify all this and
> just have the vector indexed by nscalars_per_iter and get rid of the
> hash-map.
> 
> > > +
> > > +      rgroup_controls *rgm
> > > +	= *masks->rgc_map.get (std::make_pair (nscalars_per_iter,
> > > +					       nvectors * nunits));
> > > +
> > > +      /* The stored nV is dependent on the mask type produced.  */
> > > +      nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > +			    TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant ();
> > > +
> > > +      /* Populate the rgroup's mask array, if this is the first time we've
> > > +	 used it.  */
> > > +      if (rgm->controls.is_empty ())
> > > +	{
> > > +	  rgm->controls.safe_grow_cleared (nvectors, true);
> > > +	  for (unsigned int i = 0; i < nvectors; ++i)
> > > +	    {
> > > +	      tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask");
> > > +	      /* Provide a dummy definition until the real one is available.  */
> > > +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > +	      rgm->controls[i] = mask;
> > > +	    }
> > > +	}
> > > +      if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type),
> > > +		    TYPE_VECTOR_SUBPARTS (vectype)))
> > > +	return rgm->controls[index];
> > > +
> > > +      /* Split the vector if needed.  Since we are dealing with integer mode
> > > +	 masks with AVX512 we can operate on the integer representation
> > > +	 performing the whole vector shifting.  */
> > > +      unsigned HOST_WIDE_INT factor;
> > > +      bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type),
> > > +				     TYPE_VECTOR_SUBPARTS (vectype), &factor);
> > > +      gcc_assert (ok);
> > > +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT);
> > > +      tree mask_type = truth_type_for (vectype);
> > > +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT);
> > > +      unsigned vi = index / factor;
> > > +      unsigned vpart = index % factor;
> > > +      tree vec = rgm->controls[vi];
> > >        gimple_seq seq = NULL;
> > > -      mask_type = truth_type_for (vectype);
> > > -      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > > +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR,
> > > +			  lang_hooks.types.type_for_mode
> > > +				(TYPE_MODE (rgm->type), 1), vec);
> > > +      /* For integer mode masks simply shift the right bits into position.  */
> > > +      if (vpart != 0)
> > > +	vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec,
> > > +			    build_int_cst (integer_type_node, vpart * nunits));
> > > +      vec = gimple_convert (&seq, lang_hooks.types.type_for_mode
> > > +				    (TYPE_MODE (mask_type), 1), vec);
> > > +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec);
> > 
> > Would it be worth creating an rgroup_controls for each nunits derivative
> > of each rgc_map entry?  That way we'd share the calculations, and be
> > able to cost them more accurately.
> > 
> > Maybe it's not worth it, just asking. :)
> 
> I suppose we could do that.  So with AVX512 we have a common
> num_scalars_per_iter but vary on nvectors.  So we'd have to
> compute a max_nvectors.
> 
> I guess it's doable, I'll keep that in mind.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Richard
> > 
> > >        if (seq)
> > >  	gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > > +      return vec;
> > >      }
> > > -  return mask;
> > > +  else
> > > +    gcc_unreachable ();
> > >  }
> > >  
> > >  /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > index 767a0774d45..42161778dc1 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > @@ -300,6 +300,13 @@ public:
> > >  #define SLP_TREE_LANES(S)			 (S)->lanes
> > >  #define SLP_TREE_CODE(S)			 (S)->code
> > >  
> > > +enum vect_partial_vector_style {
> > > +    vect_partial_vectors_none,
> > > +    vect_partial_vectors_while_ult,
> > > +    vect_partial_vectors_avx512,
> > > +    vect_partial_vectors_len
> > > +};
> > > +
> > >  /* Key for map that records association between
> > >     scalar conditions and corresponding loop mask, and
> > >     is populated by vect_record_loop_mask.  */
> > > @@ -605,6 +612,10 @@ struct rgroup_controls {
> > >       specified number of elements; the type of the elements doesn't matter.  */
> > >    tree type;
> > >  
> > > +  /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this
> > > +     is the rgroup specific type used.  */
> > > +  tree compare_type;
> > > +
> > >    /* A vector of nV controls, in iteration order.  */
> > >    vec<tree> controls;
> > >  
> > > @@ -613,7 +624,24 @@ struct rgroup_controls {
> > >    tree bias_adjusted_ctrl;
> > >  };
> > >  
> > > -typedef auto_vec<rgroup_controls> vec_loop_masks;
> > > +struct vec_loop_masks
> > > +{
> > > +  bool is_empty () const { return mask_set.is_empty (); }
> > > +
> > > +  typedef pair_hash <nofree_ptr_hash <tree_node>,
> > > +		     int_hash<unsigned, 0>> mp_hash;
> > > +  hash_set<mp_hash> mask_set;
> > > +
> > > +  /* Default storage for rgroup_controls.  */
> > > +  auto_vec<rgroup_controls> rgc_vec;
> > > +
> > > +  /* The vect_partial_vectors_avx512 style uses a hash-map.  */
> > > +  hash_map<std::pair<unsigned /* nscalars_per_iter */,
> > > +		     unsigned /* nlanes */>, rgroup_controls *,
> > > +	   simple_hashmap_traits<pair_hash <int_hash<unsigned, 0>,
> > > +					    int_hash<unsigned, 0>>,
> > > +				 rgroup_controls *>> rgc_map;
> > > +};
> > >  
> > >  typedef auto_vec<rgroup_controls> vec_loop_lens;
> > >  
> > > @@ -741,6 +769,10 @@ public:
> > >       LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
> > >    tree rgroup_iv_type;
> > >  
> > > +  /* The style used for implementing partial vectors when
> > > +     LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
> > > +  vect_partial_vector_style partial_vector_style;
> > > +
> > >    /* Unknown DRs according to which loop was peeled.  */
> > >    class dr_vec_info *unaligned_dr;
> > >  
> > > @@ -914,6 +946,7 @@ public:
> > >  #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
> > >  #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
> > >  #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
> > > +#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style
> > >  #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
> > >  #define LOOP_VINFO_N_STMTS(L)		   (L)->shared->n_stmts
> > >  #define LOOP_VINFO_LOOP_NEST(L)            (L)->shared->loop_nest
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH 3/3] AVX512 fully masked vectorization
  2023-06-14 18:45 ` Richard Sandiford
@ 2023-06-15 12:14   ` Richard Biener
  2023-06-15 12:53     ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Biener @ 2023-06-15 12:14 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Richard Biener via Gcc-patches, Jan Hubicka, hongtao.liu,
	kirill.yukhin, ams

On Wed, 14 Jun 2023, Richard Sandiford wrote:

> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > This implements fully masked vectorization or a masked epilog for
> > AVX512 style masks which single themselves out by representing
> > each lane with a single bit and by using integer modes for the mask
> > (both much like GCN).
> >
> > AVX512 is also special in that it doesn't have any instruction
> > to compute the mask from a scalar IV like SVE has with while_ult.
> > Instead the masks are produced by vector compares and the loop
> > control retains the scalar IV (mainly to avoid dependences on
> > mask generation, a suitable mask test instruction is available).
> >
> > Like RVV code generation prefers a decrementing IV though IVOPTs
> > messes things up in some cases removing that IV to eliminate
> > it with an incrementing one used for address generation.
> >
> > One of the motivating testcases is from PR108410 which in turn
> > is extracted from x264 where large size vectorization shows
> > issues with small trip loops.  Execution time there improves
> > compared to classic AVX512 with AVX2 epilogues for the cases
> > of less than 32 iterations.
> >
> > size   scalar     128     256     512    512e    512f
> >     1    9.42   11.32    9.35   11.17   15.13   16.89
> >     2    5.72    6.53    6.66    6.66    7.62    8.56
> >     3    4.49    5.10    5.10    5.74    5.08    5.73
> >     4    4.10    4.33    4.29    5.21    3.79    4.25
> >     6    3.78    3.85    3.86    4.76    2.54    2.85
> >     8    3.64    1.89    3.76    4.50    1.92    2.16
> >    12    3.56    2.21    3.75    4.26    1.26    1.42
> >    16    3.36    0.83    1.06    4.16    0.95    1.07
> >    20    3.39    1.42    1.33    4.07    0.75    0.85
> >    24    3.23    0.66    1.72    4.22    0.62    0.70
> >    28    3.18    1.09    2.04    4.20    0.54    0.61
> >    32    3.16    0.47    0.41    0.41    0.47    0.53
> >    34    3.16    0.67    0.61    0.56    0.44    0.50
> >    38    3.19    0.95    0.95    0.82    0.40    0.45
> >    42    3.09    0.58    1.21    1.13    0.36    0.40
> >
> > 'size' specifies the number of actual iterations, 512e is for
> > a masked epilog and 512f for the fully masked loop.  From
> > 4 scalar iterations on the AVX512 masked epilog code is clearly
> > the winner, the fully masked variant is clearly worse and
> > its size benefit is also tiny.
> >
> > This patch does not enable using fully masked loops or
> > masked epilogues by default.  More work on cost modeling
> > and vectorization kind selection on x86_64 is necessary
> > for this.
> >
> > Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> > which could be exploited further to unify some of the flags
> > we have right now but there didn't seem to be many easy things
> > to merge, so I'm leaving this for followups.
> >
> > Mask requirements as registered by vect_record_loop_mask are kept in their
> > original form and recorded in a hash_set now instead of being
> > processed to a vector of rgroup_controls.  Instead that's now
> > left to the final analysis phase which tries forming the rgroup_controls
> > vector using while_ult and if that fails now tries AVX512 style
> > which needs a different organization and instead fills a hash_map
> > with the relevant info.  vect_get_loop_mask now has two implementations,
> > one for the two mask styles we then have.
> >
> > I have decided against interweaving vect_set_loop_condition_partial_vectors
> > with conditions to do AVX512 style masking and instead opted to
> > "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> >
> > I was split between making 'vec_loop_masks' a class with methods,
> > possibly merging in the _len stuff into a single registry.  It
> > seemed to be too many changes for the purpose of getting AVX512
> > working.  I'm going to play wait and see what happens with RISC-V
> > here since they are going to get both masks and lengths registered
> > I think.
> >
> > The vect_prepare_for_masked_peels hunk might run into issues with
> > SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > looked odd.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> > the testsuite with --param vect-partial-vector-usage=2 with and
> > without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> > and one latent wrong-code (PR110237).
> >
> > There's followup work to be done to try enabling masked epilogues
> > for x86-64 by default (when AVX512 is enabled, possibly only when
> > -mprefer-vector-width=512).  Getting cost modeling and decision
> > right is going to be challenging.
> >
> > Any comments?
> >
> > OK?
> 
> Some comments below, but otherwise LGTM FWIW.
> 
> > Btw, testing on GCN would be welcome - the _avx512 paths could
> > work for it so in case the while_ult path fails (not sure if
> > it ever does) it could get _avx512 style masking.  Likewise
> > testing on ARM just to see I didn't break anything here.
> > I don't have SVE hardware so testing is probably meaningless.
> >
> > Thanks,
> > Richard.
> >
> > 	* tree-vectorizer.h (enum vect_partial_vector_style): New.
> > 	(_loop_vec_info::partial_vector_style): Likewise.
> > 	(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
> > 	(rgroup_controls::compare_type): Add.
> > 	(vec_loop_masks): Change from a typedef of auto_vec<>
> > 	to a structure.
> > 	* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
> > 	Adjust.
> > 	(vect_set_loop_condition_partial_vectors_avx512): New function
> > 	implementing the AVX512 partial vector codegen.
> > 	(vect_set_loop_condition): Dispatch to the correct
> > 	vect_set_loop_condition_partial_vectors_* function based on
> > 	LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > 	(vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
> > 	in the original niter type.
> > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > 	partial_vector_style.
> > 	(_loop_vec_info::~_loop_vec_info): Release the hash-map recorded
> > 	rgroup_controls.
> > 	(can_produce_all_loop_masks_p): Adjust.
> > 	(vect_verify_full_masking): Produce the rgroup_controls vector
> > 	here.  Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
> > 	(vect_verify_full_masking_avx512): New function implementing
> > 	verification of AVX512 style masking.
> > 	(vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > 	(vect_analyze_loop_2): Also try AVX512 style masking.
> > 	Adjust condition.
> > 	(vect_estimate_min_profitable_iters): Implement AVX512 style
> > 	mask producing cost.
> > 	(vect_record_loop_mask): Do not build the rgroup_controls
> > 	vector here but record masks in a hash-set.
> > 	(vect_get_loop_mask): Implement AVX512 style mask query,
> > 	complementing the existing while_ult style.
> > ---
> >  gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++-
> >  gcc/tree-vect-loop.cc       | 413 +++++++++++++++++++++++++++++++-----
> >  gcc/tree-vectorizer.h       |  35 ++-
> >  3 files changed, 651 insertions(+), 61 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > index 1c8100c1a1c..f0ecaec28f4 100644
> > --- a/gcc/tree-vect-loop-manip.cc
> > +++ b/gcc/tree-vect-loop-manip.cc
> > @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "insn-config.h"
> >  #include "rtl.h"
> >  #include "recog.h"
> > +#include "langhooks.h"
> > +#include "tree-vector-builder.h"
> > +#include "optabs-tree.h"
> >  
> >  /*************************************************************************
> >    Simple Loop Peeling Utilities
> > @@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> >    rgroup_controls *iv_rgc = nullptr;
> >    unsigned int i;
> >    auto_vec<rgroup_controls> *controls = use_masks_p
> > -					  ? &LOOP_VINFO_MASKS (loop_vinfo)
> > +					  ? &LOOP_VINFO_MASKS (loop_vinfo).rgc_vec
> >  					  : &LOOP_VINFO_LENS (loop_vinfo);
> >    FOR_EACH_VEC_ELT (*controls, i, rgc)
> >      if (!rgc->controls.is_empty ())
> > @@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> >    return cond_stmt;
> >  }
> >  
> > +/* Set up the iteration condition and rgroup controls for LOOP in AVX512
> > +   style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the
> > +   vectorized loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
> > +   the number of iterations of the original scalar loop that should be
> > +   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are as
> > +   for vect_set_loop_condition.
> > +
> > +   Insert the branch-back condition before LOOP_COND_GSI and return the
> > +   final gcond.  */
> > +
> > +static gcond *
> > +vect_set_loop_condition_partial_vectors_avx512 (class loop *loop,
> > +					 loop_vec_info loop_vinfo, tree niters,
> > +					 tree final_iv,
> > +					 bool niters_maybe_zero,
> > +					 gimple_stmt_iterator loop_cond_gsi)
> > +{
> > +  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> > +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > +  tree orig_niters = niters;
> > +  gimple_seq preheader_seq = NULL;
> > +
> > +  /* Create an IV that counts down from niters and whose step
> > +     is the number of iterations processed in the current iteration.
> > +     Produce the controls with compares like the following.
> > +
> > +       # iv_2 = PHI <niters, iv_3>
> > +       rem_4 = MIN <iv_2, VF>;
> > +       remv_6 = { rem_4, rem_4, rem_4, ... }
> > +       mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6;
> > +       iv_3 = iv_2 - VF;
> > +       if (iv_2 > VF)
> > +	 continue;
> > +
> > +     Where the constant is built with elements at most VF - 1 and
> > +     repetitions according to max_nscalars_per_iter which is guarnateed
> > +     to be the same within a group.  */
> > +
> > +  /* Convert NITERS to the determined IV type.  */
> > +  if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters))
> > +      && niters_maybe_zero)
> > +    {
> > +      /* We know that there is always at least one iteration, so if the
> > +	 count is zero then it must have wrapped.  Cope with this by
> > +	 subtracting 1 before the conversion and adding 1 to the result.  */
> > +      gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters)));
> > +      niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters),
> > +			     niters, build_minus_one_cst (TREE_TYPE (niters)));
> > +      niters = gimple_convert (&preheader_seq, iv_type, niters);
> > +      niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type,
> > +			     niters, build_one_cst (iv_type));
> > +    }
> > +  else
> > +    niters = gimple_convert (&preheader_seq, iv_type, niters);
> > +
> > +  /* Bias the initial value of the IV in case we need to skip iterations
> > +     at the beginning.  */
> > +  tree niters_adj = niters;
> > +  if (niters_skip)
> > +    {
> > +      tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip);
> > +      niters_adj = gimple_build (&preheader_seq, PLUS_EXPR,
> > +				 iv_type, niters, skip);
> > +    }
> > +
> > +  /* The iteration step is the vectorization factor.  */
> > +  tree iv_step = build_int_cst (iv_type, vf);
> > +
> > +  /* Create the decrement IV.  */
> > +  tree index_before_incr, index_after_incr;
> > +  gimple_stmt_iterator incr_gsi;
> > +  bool insert_after;
> > +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > +  create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop,
> > +	     &incr_gsi, insert_after, &index_before_incr,
> > +	     &index_after_incr);
> > +
> > +  /* Iterate over all the rgroups and fill in their controls.  */
> > +  for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > +    {
> > +      rgroup_controls *rgc = rgcm.second;
> > +      if (rgc->controls.is_empty ())
> > +	continue;
> > +
> > +      tree ctrl_type = rgc->type;
> > +      poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
> > +
> > +      tree vectype = rgc->compare_type;
> > +
> > +      /* index_after_incr is the IV specifying the remaining iterations in
> > +	 the next iteration.  */
> > +      tree rem = index_after_incr;
> > +      /* When the data type for the compare to produce the mask is
> > +	 smaller than the IV type we need to saturate.  Saturate to
> > +	 the smallest possible value (IV_TYPE) so we only have to
> > +	 saturate once (CSE will catch redundant ones we add).  */
> > +      if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type))
> > +	rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > +			    UNKNOWN_LOCATION,
> > +			    MIN_EXPR, TREE_TYPE (rem), rem, iv_step);
> > +      rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > +			    UNKNOWN_LOCATION, TREE_TYPE (vectype), rem);
> > +
> > +      /* Build a data vector composed of the remaining iterations.  */
> > +      rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > +					  UNKNOWN_LOCATION, vectype, rem);
> > +
> > +      /* Provide a definition of each vector in the control group.  */
> > +      tree next_ctrl = NULL_TREE;
> > +      tree first_rem = NULL_TREE;
> > +      tree ctrl;
> > +      unsigned int i;
> > +      FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> > +	{
> > +	  /* Previous controls will cover BIAS items.  This control covers the
> > +	     next batch.  */
> > +	  poly_uint64 bias = nitems_per_ctrl * i;
> > +
> > +	  /* Build the constant to compare the remaining iters against,
> > +	     this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately
> > +	     split into pieces.  */
> > +	  unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant ();
> > +	  tree_vector_builder builder (vectype, n, 1);
> > +	  for (unsigned i = 0; i < n; ++i)
> > +	    {
> > +	      unsigned HOST_WIDE_INT val
> > +		= (i + bias.to_constant ()) / rgc->max_nscalars_per_iter;
> > +	      gcc_assert (val < vf.to_constant ());
> > +	      builder.quick_push (build_int_cst (TREE_TYPE (vectype), val));
> > +	    }
> > +	  tree cmp_series = builder.build ();
> > +
> > +	  /* Create the initial control.  First include all items that
> > +	     are within the loop limit.  */
> > +	  tree init_ctrl = NULL_TREE;
> > +	  poly_uint64 const_limit;
> > +	  /* See whether the first iteration of the vector loop is known
> > +	     to have a full control.  */
> > +	  if (poly_int_tree_p (niters, &const_limit)
> > +	      && known_ge (const_limit, (i + 1) * nitems_per_ctrl))
> > +	    init_ctrl = build_minus_one_cst (ctrl_type);
> > +	  else
> > +	    {
> > +	      /* The remaining work items initially are niters.  Saturate,
> > +		 splat and compare.  */
> > +	      if (!first_rem)
> > +		{
> > +		  first_rem = niters;
> > +		  if (TYPE_PRECISION (TREE_TYPE (vectype))
> > +		      < TYPE_PRECISION (iv_type))
> > +		    first_rem = gimple_build (&preheader_seq,
> > +					      MIN_EXPR, TREE_TYPE (first_rem),
> > +					      first_rem, iv_step);
> > +		  first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype),
> > +					      first_rem);
> > +		  first_rem = gimple_build_vector_from_val (&preheader_seq,
> > +							    vectype, first_rem);
> > +		}
> > +	      init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type,
> > +					cmp_series, first_rem);
> > +	    }
> > +
> > +	  /* Now AND out the bits that are within the number of skipped
> > +	     items.  */
> > +	  poly_uint64 const_skip;
> > +	  if (niters_skip
> > +	      && !(poly_int_tree_p (niters_skip, &const_skip)
> > +		   && known_le (const_skip, bias)))
> > +	    {
> > +	      /* For integer mode masks it's cheaper to shift out the bits
> > +		 since that avoids loading a constant.  */
> > +	      gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT);
> > +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > +					lang_hooks.types.type_for_mode
> > +					  (TYPE_MODE (ctrl_type), 1),
> > +					init_ctrl);
> > +	      /* ???  But when the shift amount isn't constant this requires
> > +		 a round-trip to GPRs.  We could apply the bias to either
> > +		 side of the compare instead.  */
> > +	      tree shift = gimple_build (&preheader_seq, MULT_EXPR,
> > +					 TREE_TYPE (niters_skip),
> > +					 niters_skip,
> > +					 build_int_cst (TREE_TYPE (niters_skip),
> > +							rgc->max_nscalars_per_iter));
> > +	      init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR,
> > +					TREE_TYPE (init_ctrl),
> > +					init_ctrl, shift);
> > +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > +					ctrl_type, init_ctrl);
> 
> It looks like this assumes that either the first lane or the last lane
> of the first mask is always true, is that right?  I'm not sure we ever
> prove that, at least not for SVE.  There it's possible to have inactive
> elements at both ends of the same mask.

It builds a mask for the first iteration without considering niters_skip
and then shifts in inactive lanes for niters_skip.  As I'm using
VIEW_CONVERT_EXPR this indeed can cause bits outside of the range
relevant for the vector mask to be set, but at least x86 ignores those
(so I'm missing a final AND with TYPE_PRECISION of the mask).
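
To make that concrete (all numbers made up): with VF = 8,
max_nscalars_per_iter = 1, niters = 5 and niters_skip = 2 we get

  series = { 0, 1, 2, 3, 4, 5, 6, 7 }
  init   = series < { 5, 5, ... }               -> 0b00011111
  init <<= niters_skip * max_nscalars_per_iter  -> 0b01111100

so the two skipped lanes end up inactive and the five real scalar
iterations stay active.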

So I think it should work correctly.  For variable niters_skip
it might be faster to build a mask based on the niters_skip-adjusted niters
and then AND, since we can build both masks in parallel.  Any
variable niters_skip shifting has to be done in GPRs as the mask
register ops can only perform shifts by immediates.

> > +	    }
> > +
> > +	  /* Get the control value for the next iteration of the loop.  */
> > +	  next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > +				    UNKNOWN_LOCATION,
> > +				    LT_EXPR, ctrl_type, cmp_series, rem);
> > +
> > +	  vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
> > +	}
> > +    }
> > +
> > +  /* Emit all accumulated statements.  */
> > +  add_preheader_seq (loop, preheader_seq);
> > +
> > +  /* Adjust the exit test using the decrementing IV.  */
> > +  edge exit_edge = single_exit (loop);
> > +  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
> > +  /* When we peel for alignment with niter_skip != 0 this can
> > +     cause niter + niter_skip to wrap and since we are comparing the
> > +     value before the decrement here we get a false early exit.
> > +     We can't compare the value after decrement either because that
> > +     decrement could wrap as well as we're not doing a saturating
> > +     decrement.  To avoid this situation we force a larger
> > +     iv_type.  */
> > +  gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step,
> > +					NULL_TREE, NULL_TREE);
> > +  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > +
> > +  /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times.
> > +     Subtract one from this to get the latch count.  */
> > +  tree niters_minus_one
> > +    = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters,
> > +		   build_minus_one_cst (TREE_TYPE (orig_niters)));
> > +  tree niters_adj2 = fold_convert (iv_type, niters_minus_one);
> > +  if (niters_skip)
> > +    niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one,
> > +			       fold_convert (iv_type, niters_skip));
> > +  loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type,
> > +				     niters_adj2, iv_step);
> > +
> > +  if (final_iv)
> > +    {
> > +      gassign *assign = gimple_build_assign (final_iv, orig_niters);
> > +      gsi_insert_on_edge_immediate (single_exit (loop), assign);
> > +    }
> > +
> > +  return cond_stmt;
> > +}
> > +
> > +
> >  /* Like vect_set_loop_condition, but handle the case in which the vector
> >     loop handles exactly VF scalars per iteration.  */
> >  
> > @@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
> >    gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
> >  
> >    if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> > -    cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > -							 niters, final_iv,
> > -							 niters_maybe_zero,
> > -							 loop_cond_gsi);
> > +    {
> > +      if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512)
> > +	cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo,
> > +								    niters, final_iv,
> > +								    niters_maybe_zero,
> > +								    loop_cond_gsi);
> > +      else
> > +	cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > +							     niters, final_iv,
> > +							     niters_maybe_zero,
> > +							     loop_cond_gsi);
> > +    }
> >    else
> >      cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv,
> >  						niters_maybe_zero,
> > @@ -2030,7 +2281,8 @@ void
> >  vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
> >  {
> >    tree misalign_in_elems;
> > -  tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> > +  /* ???  With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end.  */
> > +  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> 
> Like you say, I think this might cause problems for SVE, since
> LOOP_VINFO_MASK_SKIP_NITERS is assumed to be the compare type
> in vect_set_loop_controls_directly.  Not sure what the best way
> around that is.

We should be able to do the conversion there; we could also simply
leave it at what get_misalign_in_elems uses (an unsigned type
with the width of a pointer).  The issue with the AVX512 style
masking is that there's not a single compare type so I left
LOOP_VINFO_RGROUP_COMPARE_TYPE as error_mark_node to catch
any erroneous uses.

Any preference here?

> >  
> >    gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));
> >  
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 1897e720389..9be66b8fbc5 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "vec-perm-indices.h"
> >  #include "tree-eh.h"
> >  #include "case-cfn-macros.h"
> > +#include "langhooks.h"
> >  
> >  /* Loop Vectorization Pass.
> >  
> > @@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
> >      mask_skip_niters (NULL_TREE),
> >      rgroup_compare_type (NULL_TREE),
> >      simd_if_cond (NULL_TREE),
> > +    partial_vector_style (vect_partial_vectors_none),
> >      unaligned_dr (NULL),
> >      peeling_for_alignment (0),
> >      ptr_mask (0),
> > @@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info ()
> >  {
> >    free (bbs);
> >  
> > -  release_vec_loop_controls (&masks);
> > +  for (auto m : masks.rgc_map)
> > +    {
> > +      m.second->controls.release ();
> > +      delete m.second;
> > +    }
> > +  release_vec_loop_controls (&masks.rgc_vec);
> >    release_vec_loop_controls (&lens);
> >    delete ivexpr_map;
> >    delete scan_map;
> > @@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
> >  {
> >    rgroup_controls *rgm;
> >    unsigned int i;
> > -  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> > +  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm)
> >      if (rgm->type != NULL_TREE
> >  	&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
> >  					    cmp_type, rgm->type,
> > @@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> >    if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> >      return false;
> >  
> > +  /* Produce the rgroup controls.  */
> > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > +    {
> > +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +      tree vectype = mask.first;
> > +      unsigned nvectors = mask.second;
> > +
> > +      if (masks->rgc_vec.length () < nvectors)
> > +	masks->rgc_vec.safe_grow_cleared (nvectors, true);
> > +      rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1];
> > +      /* The number of scalars per iteration and the number of vectors are
> > +	 both compile-time constants.  */
> > +      unsigned int nscalars_per_iter
> > +	  = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > +		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > +
> > +      if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > +	{
> > +	  rgm->max_nscalars_per_iter = nscalars_per_iter;
> > +	  rgm->type = truth_type_for (vectype);
> > +	  rgm->factor = 1;
> > +	}
> > +    }
> > +
> >    /* Calculate the maximum number of scalars per iteration for every rgroup.  */
> >    unsigned int max_nscalars_per_iter = 1;
> > -  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
> > +  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec)
> >      max_nscalars_per_iter
> >        = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
> >  
> > @@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> >      }
> >  
> >    if (!cmp_type)
> > -    return false;
> > +    {
> > +      LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release ();
> > +      return false;
> > +    }
> >  
> >    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type;
> >    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult;
> > +  return true;
> > +}
> > +
> > +/* Each statement in LOOP_VINFO can be masked where necessary.  Check
> > +   whether we can actually generate AVX512 style masks.  Return true if so,
> > +   storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE.  */
> > +
> > +static bool
> > +vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo)
> > +{
> > +  /* Produce differently organized rgc_vec and differently check
> > +     we can produce masks.  */
> > +
> > +  /* Use a normal loop if there are no statements that need masking.
> > +     This only happens in rare degenerate cases: it means that the loop
> > +     has no loads, no stores, and no live-out values.  */
> > +  if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > +    return false;
> > +
> > +  /* For the decrementing IV we need to represent all values in
> > +     [0, niter + niter_skip] where niter_skip is the elements we
> > +     skip in the first iteration for prologue peeling.  */
> > +  tree iv_type = NULL_TREE;
> > +  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> > +  unsigned int iv_precision = UINT_MAX;
> > +  if (iv_limit != -1)
> > +    iv_precision = wi::min_precision (iv_limit, UNSIGNED);
> > +
> > +  /* First compute the type for the IV we use to track the remaining
> > +     scalar iterations.  */
> > +  opt_scalar_int_mode cmp_mode_iter;
> > +  FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > +    {
> > +      unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > +      if (cmp_bits >= iv_precision
> > +	  && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > +	{
> > +	  iv_type = build_nonstandard_integer_type (cmp_bits, true);
> > +	  if (iv_type)
> > +	    break;
> > +	}
> > +    }
> > +  if (!iv_type)
> > +    return false;
> > +
> > +  /* Produce the rgroup controls.  */
> > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > +    {
> > +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +      tree vectype = mask.first;
> > +      unsigned nvectors = mask.second;
> > +
> > +      /* The number of scalars per iteration and the number of vectors are
> > +	 both compile-time constants.  */
> > +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > +      unsigned int nscalars_per_iter
> > +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> 
> The comment seems to be borrowed from:
> 
>   /* The number of scalars per iteration and the number of vectors are
>      both compile-time constants.  */
>   unsigned int nscalars_per_iter
>     = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> 		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> 
> but the calculation instead applies to_constant to the VF and to the
> number of vector elements, which aren't in general known to be constant.
> Does the exact_div not work here too?

It failed the overload when I used a constant nunits; otherwise, yes,
the original code works fine, but I do need ...

> Avoiding the current to_constants here only moves the problem elsewhere
> though.  Since this is a verification routine, I think this should
> instead use is_constant and fail if it is false.
> 
> > +      /* We key off a hash-map with nscalars_per_iter and the number of total
> > +	 lanes in the mask vector and then remember the needed vector mask
> > +	 with the largest number of lanes (thus the fewest nV).  */
> > +      bool existed;
> > +      rgroup_controls *& rgc
> > +	= masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter,
> > +							nvectors * nunits),

^^^

this decomposition to constants (well, OK - I could have used a
poly-int for the second half of the pair).

I did wonder whether I could restrict this more - I need different
groups for different nscalars_per_iter but also when the total number
of work items differs.  I suspect it is really only nscalars_per_iter
that matters, but I could not convince myself that nvectors * nunits
is always the same when the vector size varies within a loop.
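
As a sanity check for myself, here is a small self-contained sketch
(plain C++, not GCC code; the mixed-size scenario is made up) of how
two mask requirements from one loop collapse onto a single
(nscalars_per_iter, nvectors * nunits) key while keeping the widest
mask type:

  #include <cstdio>
  #include <map>
  #include <utility>

  int main ()
  {
    const unsigned vf = 16;
    /* (nunits, nvectors) as vect_record_loop_mask would record them,
       e.g. V16SI x 1 and V8DF x 2 in the same loop (hypothetical).  */
    const unsigned reqs[2][2] = { { 16, 1 }, { 8, 2 } };
    std::map<std::pair<unsigned, unsigned>, unsigned> rgc_map;
    for (const auto &r : reqs)
      {
        unsigned nunits = r[0], nvectors = r[1];
        unsigned nscalars_per_iter = nvectors * nunits / vf;
        auto key = std::make_pair (nscalars_per_iter, nvectors * nunits);
        /* Keep the mask type with the most lanes (thus the fewest nV).  */
        unsigned &max_lanes = rgc_map[key];
        if (nunits > max_lanes)
          max_lanes = nunits;
      }
    for (const auto &e : rgc_map)
      std::printf ("key (%u, %u) -> widest mask has %u lanes\n",
                   e.first.first, e.first.second, e.second);
    return 0;
  }

Both requirements end up under the key (1, 16), which is what the
get_or_insert above keys off.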

> > +					&existed);
> > +      if (!existed)
> > +	{
> > +	  rgc = new rgroup_controls ();
> > +	  rgc->type = truth_type_for (vectype);
> > +	  rgc->compare_type = NULL_TREE;
> > +	  rgc->max_nscalars_per_iter = nscalars_per_iter;
> > +	  rgc->factor = 1;
> > +	  rgc->bias_adjusted_ctrl = NULL_TREE;
> > +	}
> > +      else
> > +	{
> > +	  gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter);
> > +	  if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type),
> > +			TYPE_VECTOR_SUBPARTS (vectype)))
> > +	    rgc->type = truth_type_for (vectype);
> > +	}
> > +    }
> > +
> > +  /* There is no fixed compare type we are going to use but we have to
> > +     be able to get at one for each mask group.  */
> > +  unsigned int min_ni_width
> > +    = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED);
> > +
> > +  bool ok = true;
> > +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > +    {
> > +      rgroup_controls *rgc = mask.second;
> > +      tree mask_type = rgc->type;
> > +      if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1)
> > +	{
> > +	  ok = false;
> > +	  break;
> > +	}
> > +
> > +      /* If iv_type is usable as compare type use that - we can elide the
> > +	 saturation in that case.   */
> > +      if (TYPE_PRECISION (iv_type) >= min_ni_width)
> > +	{
> > +	  tree cmp_vectype
> > +	    = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > +	  if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > +	    rgc->compare_type = cmp_vectype;
> > +	}
> > +      if (!rgc->compare_type)
> > +	FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > +	  {
> > +	    unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > +	    if (cmp_bits >= min_ni_width
> > +		&& targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > +	      {
> > +		tree cmp_type = build_nonstandard_integer_type (cmp_bits, true);
> > +		if (!cmp_type)
> > +		  continue;
> > +
> > +		/* Check whether we can produce the mask with cmp_type.  */
> > +		tree cmp_vectype
> > +		  = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > +		if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > +		  {
> > +		    rgc->compare_type = cmp_vectype;
> > +		    break;
> > +		  }
> > +	      }
> > +	}
> 
> Just curious: is this fallback loop ever used in practice?
> TYPE_PRECISION (iv_type) >= min_ni_width seems like an easy condition
> to satisfy.

I think the only remaining case for this particular condition
is an originally unsigned IV combined with peeling for alignment
with a mask.

But then of course the main case is that with, say, an unsigned long
IV but a QImode data type the x86 backend lacks a V64DImode vector
type, so we cannot produce the mask for V64QImode data with a compare
using a data type based on that IV type.

For 'int' IVs and int/float or double data we can always use the
original IV here.
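
To make that concrete, here is a hypothetical example of the kind of
loop meant here (assuming 64-byte vectors so the data vectors are
V64QImode while the IV is unsigned long, i.e. DImode; with no
V64DImode vector available the compare producing the mask has to use
a narrower element type than the IV):

  void
  f (unsigned char *__restrict dst, unsigned char *__restrict src,
     unsigned long n)
  {
    for (unsigned long i = 0; i < n; ++i)
      dst[i] = src[i] + 1;
  }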

> > +      if (!rgc->compare_type)
> > +	{
> > +	  ok = false;
> > +	  break;
> > +	}
> > +    }
> > +  if (!ok)
> > +    {
> > +      LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty ();
> > +      return false;
> 
> It looks like this leaks the rgroup_controls created above.

Fixed.

> > +    }
> > +
> > +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node;
> > +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512;
> >    return true;
> >  }
> >  
> > @@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo)
> >  
> >    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> >    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len;
> >  
> >    return true;
> >  }
> > @@ -2712,16 +2893,24 @@ start_over:
> >  
> >    /* If we still have the option of using partial vectors,
> >       check whether we can generate the necessary loop controls.  */
> > -  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > -      && !vect_verify_full_masking (loop_vinfo)
> > -      && !vect_verify_loop_lens (loop_vinfo))
> > -    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > +    {
> > +      if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > +	{
> > +	  if (!vect_verify_full_masking (loop_vinfo)
> > +	      && !vect_verify_full_masking_avx512 (loop_vinfo))
> > +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +	}
> > +      else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */
> > +	if (!vect_verify_loop_lens (loop_vinfo))
> > +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +    }
> >  
> >    /* If we're vectorizing a loop that uses length "controls" and
> >       can iterate more than once, we apply decrementing IV approach
> >       in loop control.  */
> >    if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > -      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > +      && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len
> >        && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> >        && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >  	   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > @@ -3022,7 +3211,7 @@ again:
> >    delete loop_vinfo->vector_costs;
> >    loop_vinfo->vector_costs = nullptr;
> >    /* Reset accumulated rgroup information.  */
> > -  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
> > +  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec);
> >    release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
> >    /* Reset assorted flags.  */
> >    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
> > @@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> >  			  cond_branch_not_taken, vect_epilogue);
> >  
> >    /* Take care of special costs for rgroup controls of partial vectors.  */
> > -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > +      && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > +	  == vect_partial_vectors_avx512))
> > +    {
> > +      /* Calculate how many masks we need to generate.  */
> > +      unsigned int num_masks = 0;
> > +      bool need_saturation = false;
> > +      for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > +	{
> > +	  rgroup_controls *rgm = rgcm.second;
> > +	  unsigned nvectors
> > +	    = (rgcm.first.second
> > +	       / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ());
> > +	  num_masks += nvectors;
> > +	  if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type))
> > +	      < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo)))
> > +	    need_saturation = true;
> > +	}
> > +
> > +      /* ???  The target isn't able to identify the costs below as
> > +	 producing masks so it cannot penalize cases where we'd run
> > +	 out of mask registers for example.  */
> > +
> > +      /* In the worst case, we need to generate each mask in the prologue
> > +	 and in the loop body.  We need one splat per group and one
> > +	 compare per mask.
> > +
> > +	 Sometimes we can use unpacks instead of generating prologue
> > +	 masks and sometimes the prologue mask will fold to a constant,
> > +	 so the actual prologue cost might be smaller.  However, it's
> > +	 simpler and safer to use the worst-case cost; if this ends up
> > +	 being the tie-breaker between vectorizing or not, then it's
> > +	 probably better not to vectorize.  */
> 
> Not sure all of this applies to the AVX512 case.  In particular, the
> unpacks bit doesn't seem relevant.

Removed that part and noted that we fail to account for the cost of
serving mask uses with different nvectors from the same wide masks
(the cases that vect_get_loop_mask constructs).  For SVE that is done
by re-interpreting with VIEW_CONVERT, which should be free, but for
AVX512 (if we ever get multiple vector sizes in one loop) splitting a
mask in half needs one shift for the upper part.
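
In plain C the upper-part split boils down to something like the
following sketch (nunits and vpart are made-up values; it stands in
for the VIEW_CONVERT, shift and truncate sequence vect_get_loop_mask
emits):

  /* Extract the sub-mask for lanes [vpart * nunits, (vpart + 1) * nunits)
     of a 64-lane AVX512 mask held in its DImode integer form.  */
  static unsigned short
  extract_submask (unsigned long wide_mask)
  {
    const unsigned nunits = 16, vpart = 2;
    /* One shift moves the selected part down; truncating to the
       narrower integer corresponds to re-interpreting it as the
       16-lane mask.  */
    return (unsigned short) (wide_mask >> (vpart * nunits));
  }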

> > +      (void) add_stmt_cost (target_cost_data,
> > +			    num_masks
> > +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue);
> > +      (void) add_stmt_cost (target_cost_data,
> > +			    num_masks
> > +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body);
> > +
> > +      /* When we need saturation we need it both in the prologue and
> > +	 the epilogue.  */
> > +      if (need_saturation)
> > +	{
> > +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > +				NULL, NULL, NULL_TREE, 0, vect_prologue);
> > +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > +				NULL, NULL, NULL_TREE, 0, vect_body);
> > +	}
> > +    }
> > +  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > +	   && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > +	       == vect_partial_vectors_avx512))

Eh, cut&paste error here, it should be == vect_partial_vectors_while_ult

> >      {
> >        /* Calculate how many masks we need to generate.  */
> >        unsigned int num_masks = 0;
> >        rgroup_controls *rgm;
> >        unsigned int num_vectors_m1;
> > -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> > +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm)
> 
> Nit: long line.

fixed.

> >  	if (rgm->type)
> >  	  num_masks += num_vectors_m1 + 1;
> >        gcc_assert (num_masks > 0);
> > @@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> >  		       unsigned int nvectors, tree vectype, tree scalar_mask)
> >  {
> >    gcc_assert (nvectors != 0);
> > -  if (masks->length () < nvectors)
> > -    masks->safe_grow_cleared (nvectors, true);
> > -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > -  /* The number of scalars per iteration and the number of vectors are
> > -     both compile-time constants.  */
> > -  unsigned int nscalars_per_iter
> > -    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > -		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> >  
> >    if (scalar_mask)
> >      {
> > @@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> >        loop_vinfo->scalar_cond_masked_set.add (cond);
> >      }
> >  
> > -  if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > -    {
> > -      rgm->max_nscalars_per_iter = nscalars_per_iter;
> > -      rgm->type = truth_type_for (vectype);
> > -      rgm->factor = 1;
> > -    }
> > +  masks->mask_set.add (std::make_pair (vectype, nvectors));
> >  }
> >  
> >  /* Given a complete set of masks MASKS, extract mask number INDEX
> > @@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> >     arrangement.  */
> >  
> >  tree
> > -vect_get_loop_mask (loop_vec_info,
> > +vect_get_loop_mask (loop_vec_info loop_vinfo,
> >  		    gimple_stmt_iterator *gsi, vec_loop_masks *masks,
> >  		    unsigned int nvectors, tree vectype, unsigned int index)
> >  {
> > -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > -  tree mask_type = rgm->type;
> > -
> > -  /* Populate the rgroup's mask array, if this is the first time we've
> > -     used it.  */
> > -  if (rgm->controls.is_empty ())
> > +  if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > +      == vect_partial_vectors_while_ult)
> >      {
> > -      rgm->controls.safe_grow_cleared (nvectors, true);
> > -      for (unsigned int i = 0; i < nvectors; ++i)
> > +      rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1];
> > +      tree mask_type = rgm->type;
> > +
> > +      /* Populate the rgroup's mask array, if this is the first time we've
> > +	 used it.  */
> > +      if (rgm->controls.is_empty ())
> >  	{
> > -	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > -	  /* Provide a dummy definition until the real one is available.  */
> > -	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > -	  rgm->controls[i] = mask;
> > +	  rgm->controls.safe_grow_cleared (nvectors, true);
> > +	  for (unsigned int i = 0; i < nvectors; ++i)
> > +	    {
> > +	      tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > +	      /* Provide a dummy definition until the real one is available.  */
> > +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > +	      rgm->controls[i] = mask;
> > +	    }
> >  	}
> > -    }
> >  
> > -  tree mask = rgm->controls[index];
> > -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > -		TYPE_VECTOR_SUBPARTS (vectype)))
> > +      tree mask = rgm->controls[index];
> > +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > +		    TYPE_VECTOR_SUBPARTS (vectype)))
> > +	{
> > +	  /* A loop mask for data type X can be reused for data type Y
> > +	     if X has N times more elements than Y and if Y's elements
> > +	     are N times bigger than X's.  In this case each sequence
> > +	     of N elements in the loop mask will be all-zero or all-one.
> > +	     We can then view-convert the mask so that each sequence of
> > +	     N elements is replaced by a single element.  */
> > +	  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > +				  TYPE_VECTOR_SUBPARTS (vectype)));
> > +	  gimple_seq seq = NULL;
> > +	  mask_type = truth_type_for (vectype);
> > +	  /* We can only use re-use the mask by reinterpreting it if it
> > +	     occupies the same space, that is the mask with less elements
> 
> Nit: fewer elements

Ah, this assert was also a left-over from earlier attempts; I've
removed it again.

> > +	     uses multiple bits for each masked elements.  */
> > +	  gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
> > +				* TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
> > +				TYPE_PRECISION (TREE_TYPE (mask_type))
> > +				* TYPE_VECTOR_SUBPARTS (mask_type)));
> > +	  mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > +	  if (seq)
> > +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > +	}
> > +      return mask;
> > +    }
> > +  else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > +	   == vect_partial_vectors_avx512)
> >      {
> > -      /* A loop mask for data type X can be reused for data type Y
> > -	 if X has N times more elements than Y and if Y's elements
> > -	 are N times bigger than X's.  In this case each sequence
> > -	 of N elements in the loop mask will be all-zero or all-one.
> > -	 We can then view-convert the mask so that each sequence of
> > -	 N elements is replaced by a single element.  */
> > -      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > -			      TYPE_VECTOR_SUBPARTS (vectype)));
> > +      /* The number of scalars per iteration and the number of vectors are
> > +	 both compile-time constants.  */
> > +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > +      unsigned int nscalars_per_iter
> > +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> 
> Same disconnect between the comment and the code here.  If we do use
> is_constant in vect_verify_full_masking_avx512 then we could reference
> that instead.

I'm going to dig into this a bit; maybe I can simplify all of this
by indexing the vector by nscalars_per_iter and getting rid of the
hash-map.

> > +
> > +      rgroup_controls *rgm
> > +	= *masks->rgc_map.get (std::make_pair (nscalars_per_iter,
> > +					       nvectors * nunits));
> > +
> > +      /* The stored nV is dependent on the mask type produced.  */
> > +      nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > +			    TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant ();
> > +
> > +      /* Populate the rgroup's mask array, if this is the first time we've
> > +	 used it.  */
> > +      if (rgm->controls.is_empty ())
> > +	{
> > +	  rgm->controls.safe_grow_cleared (nvectors, true);
> > +	  for (unsigned int i = 0; i < nvectors; ++i)
> > +	    {
> > +	      tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask");
> > +	      /* Provide a dummy definition until the real one is available.  */
> > +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > +	      rgm->controls[i] = mask;
> > +	    }
> > +	}
> > +      if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type),
> > +		    TYPE_VECTOR_SUBPARTS (vectype)))
> > +	return rgm->controls[index];
> > +
> > +      /* Split the vector if needed.  Since we are dealing with integer mode
> > +	 masks with AVX512 we can operate on the integer representation
> > +	 performing the whole vector shifting.  */
> > +      unsigned HOST_WIDE_INT factor;
> > +      bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type),
> > +				     TYPE_VECTOR_SUBPARTS (vectype), &factor);
> > +      gcc_assert (ok);
> > +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT);
> > +      tree mask_type = truth_type_for (vectype);
> > +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT);
> > +      unsigned vi = index / factor;
> > +      unsigned vpart = index % factor;
> > +      tree vec = rgm->controls[vi];
> >        gimple_seq seq = NULL;
> > -      mask_type = truth_type_for (vectype);
> > -      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR,
> > +			  lang_hooks.types.type_for_mode
> > +				(TYPE_MODE (rgm->type), 1), vec);
> > +      /* For integer mode masks simply shift the right bits into position.  */
> > +      if (vpart != 0)
> > +	vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec,
> > +			    build_int_cst (integer_type_node, vpart * nunits));
> > +      vec = gimple_convert (&seq, lang_hooks.types.type_for_mode
> > +				    (TYPE_MODE (mask_type), 1), vec);
> > +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec);
> 
> Would it be worth creating an rgroup_controls for each nunits derivative
> of each rgc_map entry?  That way we'd share the calculations, and be
> able to cost them more accurately.
> 
> Maybe it's not worth it, just asking. :)

I suppose we could do that.  With AVX512 we have a common
nscalars_per_iter but vary in nvectors, so we'd have to compute a
max_nvectors.

I guess it's doable; I'll keep that in mind.

Thanks,
Richard.

> Thanks,
> Richard
> 
> >        if (seq)
> >  	gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > +      return vec;
> >      }
> > -  return mask;
> > +  else
> > +    gcc_unreachable ();
> >  }
> >  
> >  /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 767a0774d45..42161778dc1 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -300,6 +300,13 @@ public:
> >  #define SLP_TREE_LANES(S)			 (S)->lanes
> >  #define SLP_TREE_CODE(S)			 (S)->code
> >  
> > +enum vect_partial_vector_style {
> > +    vect_partial_vectors_none,
> > +    vect_partial_vectors_while_ult,
> > +    vect_partial_vectors_avx512,
> > +    vect_partial_vectors_len
> > +};
> > +
> >  /* Key for map that records association between
> >     scalar conditions and corresponding loop mask, and
> >     is populated by vect_record_loop_mask.  */
> > @@ -605,6 +612,10 @@ struct rgroup_controls {
> >       specified number of elements; the type of the elements doesn't matter.  */
> >    tree type;
> >  
> > +  /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this
> > +     is the rgroup specific type used.  */
> > +  tree compare_type;
> > +
> >    /* A vector of nV controls, in iteration order.  */
> >    vec<tree> controls;
> >  
> > @@ -613,7 +624,24 @@ struct rgroup_controls {
> >    tree bias_adjusted_ctrl;
> >  };
> >  
> > -typedef auto_vec<rgroup_controls> vec_loop_masks;
> > +struct vec_loop_masks
> > +{
> > +  bool is_empty () const { return mask_set.is_empty (); }
> > +
> > +  typedef pair_hash <nofree_ptr_hash <tree_node>,
> > +		     int_hash<unsigned, 0>> mp_hash;
> > +  hash_set<mp_hash> mask_set;
> > +
> > +  /* Default storage for rgroup_controls.  */
> > +  auto_vec<rgroup_controls> rgc_vec;
> > +
> > +  /* The vect_partial_vectors_avx512 style uses a hash-map.  */
> > +  hash_map<std::pair<unsigned /* nscalars_per_iter */,
> > +		     unsigned /* nlanes */>, rgroup_controls *,
> > +	   simple_hashmap_traits<pair_hash <int_hash<unsigned, 0>,
> > +					    int_hash<unsigned, 0>>,
> > +				 rgroup_controls *>> rgc_map;
> > +};
> >  
> >  typedef auto_vec<rgroup_controls> vec_loop_lens;
> >  
> > @@ -741,6 +769,10 @@ public:
> >       LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
> >    tree rgroup_iv_type;
> >  
> > +  /* The style used for implementing partial vectors when
> > +     LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
> > +  vect_partial_vector_style partial_vector_style;
> > +
> >    /* Unknown DRs according to which loop was peeled.  */
> >    class dr_vec_info *unaligned_dr;
> >  
> > @@ -914,6 +946,7 @@ public:
> >  #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
> >  #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
> >  #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
> > +#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style
> >  #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
> >  #define LOOP_VINFO_N_STMTS(L)		   (L)->shared->n_stmts
> >  #define LOOP_VINFO_LOOP_NEST(L)            (L)->shared->loop_nest
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] AVX512 fully masked vectorization
       [not found] <20230614115429.D400C3858433@sourceware.org>
@ 2023-06-14 18:45 ` Richard Sandiford
  2023-06-15 12:14   ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Sandiford @ 2023-06-14 18:45 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches
  Cc: Richard Biener, Jan Hubicka, hongtao.liu, kirill.yukhin, ams

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> This implemens fully masked vectorization or a masked epilog for
> AVX512 style masks which single themselves out by representing
> each lane with a single bit and by using integer modes for the mask
> (both is much like GCN).
>
> AVX512 is also special in that it doesn't have any instruction
> to compute the mask from a scalar IV like SVE has with while_ult.
> Instead the masks are produced by vector compares and the loop
> control retains the scalar IV (mainly to avoid dependences on
> mask generation, a suitable mask test instruction is available).
>
> Like RVV code generation prefers a decrementing IV though IVOPTs
> messes things up in some cases removing that IV to eliminate
> it with an incrementing one used for address generation.
>
> One of the motivating testcases is from PR108410 which in turn
> is extracted from x264 where large size vectorization shows
> issues with small trip loops.  Execution time there improves
> compared to classic AVX512 with AVX2 epilogues for the cases
> of less than 32 iterations.
>
> size   scalar     128     256     512    512e    512f
>     1    9.42   11.32    9.35   11.17   15.13   16.89
>     2    5.72    6.53    6.66    6.66    7.62    8.56
>     3    4.49    5.10    5.10    5.74    5.08    5.73
>     4    4.10    4.33    4.29    5.21    3.79    4.25
>     6    3.78    3.85    3.86    4.76    2.54    2.85
>     8    3.64    1.89    3.76    4.50    1.92    2.16
>    12    3.56    2.21    3.75    4.26    1.26    1.42
>    16    3.36    0.83    1.06    4.16    0.95    1.07
>    20    3.39    1.42    1.33    4.07    0.75    0.85
>    24    3.23    0.66    1.72    4.22    0.62    0.70
>    28    3.18    1.09    2.04    4.20    0.54    0.61
>    32    3.16    0.47    0.41    0.41    0.47    0.53
>    34    3.16    0.67    0.61    0.56    0.44    0.50
>    38    3.19    0.95    0.95    0.82    0.40    0.45
>    42    3.09    0.58    1.21    1.13    0.36    0.40
>
> 'size' specifies the number of actual iterations, 512e is for
> a masked epilog and 512f for the fully masked loop.  From
> 4 scalar iterations on the AVX512 masked epilog code is clearly
> the winner, the fully masked variant is clearly worse and
> it's size benefit is also tiny.
>
> This patch does not enable using fully masked loops or
> masked epilogues by default.  More work on cost modeling
> and vectorization kind selection on x86_64 is necessary
> for this.
>
> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> which could be exploited further to unify some of the flags
> we have right now but there didn't seem to be many easy things
> to merge, so I'm leaving this for followups.
>
> Mask requirements as registered by vect_record_loop_mask are kept in their
> original form and recorded in a hash_set now instead of being
> processed to a vector of rgroup_controls.  Instead that's now
> left to the final analysis phase which tries forming the rgroup_controls
> vector using while_ult and if that fails now tries AVX512 style
> which needs a different organization and instead fills a hash_map
> with the relevant info.  vect_get_loop_mask now has two implementations,
> one for the two mask styles we then have.
>
> I have decided against interweaving vect_set_loop_condition_partial_vectors
> with conditions to do AVX512 style masking and instead opted to
> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>
> I was split between making 'vec_loop_masks' a class with methods,
> possibly merging in the _len stuff into a single registry.  It
> seemed to be too many changes for the purpose of getting AVX512
> working.  I'm going to play wait and see what happens with RISC-V
> here since they are going to get both masks and lengths registered
> I think.
>
> The vect_prepare_for_masked_peels hunk might run into issues with
> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> looked odd.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> the testsuite with --param vect-partial-vector-usage=2 with and
> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> and one latent wrong-code (PR110237).
>
> There's followup work to be done to try enabling masked epilogues
> for x86-64 by default (when AVX512 is enabled, possibly only when
> -mprefer-vector-width=512).  Getting cost modeling and decision
> right is going to be challenging.
>
> Any comments?
>
> OK?

Some comments below, but otherwise LGTM FWIW.

> Btw, testing on GCN would be welcome - the _avx512 paths could
> work for it so in case the while_ult path fails (not sure if
> it ever does) it could get _avx512 style masking.  Likewise
> testing on ARM just to see I didn't break anything here.
> I don't have SVE hardware so testing is probably meaningless.
>
> Thanks,
> Richard.
>
> 	* tree-vectorizer.h (enum vect_partial_vector_style): New.
> 	(_loop_vec_info::partial_vector_style): Likewise.
> 	(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
> 	(rgroup_controls::compare_type): Add.
> 	(vec_loop_masks): Change from a typedef to auto_vec<>
> 	to a structure.
> 	* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
> 	Adjust.
> 	(vect_set_loop_condition_partial_vectors_avx512): New function
> 	implementing the AVX512 partial vector codegen.
> 	(vect_set_loop_condition): Dispatch to the correct
> 	vect_set_loop_condition_partial_vectors_* function based on
> 	LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> 	(vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
> 	in the original niter type.
> 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> 	partial_vector_style.
> 	(_loop_vec_info::~_loop_vec_info): Release the hash-map recorded
> 	rgroup_controls.
> 	(can_produce_all_loop_masks_p): Adjust.
> 	(vect_verify_full_masking): Produce the rgroup_controls vector
> 	here.  Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
> 	(vect_verify_full_masking_avx512): New function implementing
> 	verification of AVX512 style masking.
> 	(vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> 	(vect_analyze_loop_2): Also try AVX512 style masking.
> 	Adjust condition.
> 	(vect_estimate_min_profitable_iters): Implement AVX512 style
> 	mask producing cost.
> 	(vect_record_loop_mask): Do not build the rgroup_controls
> 	vector here but record masks in a hash-set.
> 	(vect_get_loop_mask): Implement AVX512 style mask query,
> 	complementing the existing while_ult style.
> ---
>  gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++-
>  gcc/tree-vect-loop.cc       | 413 +++++++++++++++++++++++++++++++-----
>  gcc/tree-vectorizer.h       |  35 ++-
>  3 files changed, 651 insertions(+), 61 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 1c8100c1a1c..f0ecaec28f4 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "insn-config.h"
>  #include "rtl.h"
>  #include "recog.h"
> +#include "langhooks.h"
> +#include "tree-vector-builder.h"
> +#include "optabs-tree.h"
>  
>  /*************************************************************************
>    Simple Loop Peeling Utilities
> @@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>    rgroup_controls *iv_rgc = nullptr;
>    unsigned int i;
>    auto_vec<rgroup_controls> *controls = use_masks_p
> -					  ? &LOOP_VINFO_MASKS (loop_vinfo)
> +					  ? &LOOP_VINFO_MASKS (loop_vinfo).rgc_vec
>  					  : &LOOP_VINFO_LENS (loop_vinfo);
>    FOR_EACH_VEC_ELT (*controls, i, rgc)
>      if (!rgc->controls.is_empty ())
> @@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>    return cond_stmt;
>  }
>  
> +/* Set up the iteration condition and rgroup controls for LOOP in AVX512
> +   style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the
> +   vectorized loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
> +   the number of iterations of the original scalar loop that should be
> +   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are as
> +   for vect_set_loop_condition.
> +
> +   Insert the branch-back condition before LOOP_COND_GSI and return the
> +   final gcond.  */
> +
> +static gcond *
> +vect_set_loop_condition_partial_vectors_avx512 (class loop *loop,
> +					 loop_vec_info loop_vinfo, tree niters,
> +					 tree final_iv,
> +					 bool niters_maybe_zero,
> +					 gimple_stmt_iterator loop_cond_gsi)
> +{
> +  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  tree orig_niters = niters;
> +  gimple_seq preheader_seq = NULL;
> +
> +  /* Create an IV that counts down from niters and whose step
> +     is the number of iterations processed in the current iteration.
> +     Produce the controls with compares like the following.
> +
> +       # iv_2 = PHI <niters, iv_3>
> +       rem_4 = MIN <iv_2, VF>;
> +       remv_6 = { rem_4, rem_4, rem_4, ... }
> +       mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6;
> +       iv_3 = iv_2 - VF;
> +       if (iv_2 > VF)
> +	 continue;
> +
> +     Where the constant is built with elements at most VF - 1 and
> +     repetitions according to max_nscalars_per_iter which is guaranteed
> +     to be the same within a group.  */
> +
> +  /* Convert NITERS to the determined IV type.  */
> +  if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters))
> +      && niters_maybe_zero)
> +    {
> +      /* We know that there is always at least one iteration, so if the
> +	 count is zero then it must have wrapped.  Cope with this by
> +	 subtracting 1 before the conversion and adding 1 to the result.  */
> +      gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters)));
> +      niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters),
> +			     niters, build_minus_one_cst (TREE_TYPE (niters)));
> +      niters = gimple_convert (&preheader_seq, iv_type, niters);
> +      niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type,
> +			     niters, build_one_cst (iv_type));
> +    }
> +  else
> +    niters = gimple_convert (&preheader_seq, iv_type, niters);
> +
> +  /* Bias the initial value of the IV in case we need to skip iterations
> +     at the beginning.  */
> +  tree niters_adj = niters;
> +  if (niters_skip)
> +    {
> +      tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip);
> +      niters_adj = gimple_build (&preheader_seq, PLUS_EXPR,
> +				 iv_type, niters, skip);
> +    }
> +
> +  /* The iteration step is the vectorization factor.  */
> +  tree iv_step = build_int_cst (iv_type, vf);
> +
> +  /* Create the decrement IV.  */
> +  tree index_before_incr, index_after_incr;
> +  gimple_stmt_iterator incr_gsi;
> +  bool insert_after;
> +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> +  create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop,
> +	     &incr_gsi, insert_after, &index_before_incr,
> +	     &index_after_incr);
> +
> +  /* Iterate over all the rgroups and fill in their controls.  */
> +  for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> +    {
> +      rgroup_controls *rgc = rgcm.second;
> +      if (rgc->controls.is_empty ())
> +	continue;
> +
> +      tree ctrl_type = rgc->type;
> +      poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
> +
> +      tree vectype = rgc->compare_type;
> +
> +      /* index_after_incr is the IV specifying the remaining iterations in
> +	 the next iteration.  */
> +      tree rem = index_after_incr;
> +      /* When the data type for the compare to produce the mask is
> +	 smaller than the IV type we need to saturate.  Saturate to
> +	 the smallest possible value (IV_TYPE) so we only have to
> +	 saturate once (CSE will catch redundant ones we add).  */
> +      if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type))
> +	rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> +			    UNKNOWN_LOCATION,
> +			    MIN_EXPR, TREE_TYPE (rem), rem, iv_step);
> +      rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING,
> +			    UNKNOWN_LOCATION, TREE_TYPE (vectype), rem);
> +
> +      /* Build a data vector composed of the remaining iterations.  */
> +      rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING,
> +					  UNKNOWN_LOCATION, vectype, rem);
> +
> +      /* Provide a definition of each vector in the control group.  */
> +      tree next_ctrl = NULL_TREE;
> +      tree first_rem = NULL_TREE;
> +      tree ctrl;
> +      unsigned int i;
> +      FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> +	{
> +	  /* Previous controls will cover BIAS items.  This control covers the
> +	     next batch.  */
> +	  poly_uint64 bias = nitems_per_ctrl * i;
> +
> +	  /* Build the constant to compare the remaining iters against,
> +	     this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately
> +	     split into pieces.  */
> +	  unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant ();
> +	  tree_vector_builder builder (vectype, n, 1);
> +	  for (unsigned i = 0; i < n; ++i)
> +	    {
> +	      unsigned HOST_WIDE_INT val
> +		= (i + bias.to_constant ()) / rgc->max_nscalars_per_iter;
> +	      gcc_assert (val < vf.to_constant ());
> +	      builder.quick_push (build_int_cst (TREE_TYPE (vectype), val));
> +	    }
> +	  tree cmp_series = builder.build ();
> +
> +	  /* Create the initial control.  First include all items that
> +	     are within the loop limit.  */
> +	  tree init_ctrl = NULL_TREE;
> +	  poly_uint64 const_limit;
> +	  /* See whether the first iteration of the vector loop is known
> +	     to have a full control.  */
> +	  if (poly_int_tree_p (niters, &const_limit)
> +	      && known_ge (const_limit, (i + 1) * nitems_per_ctrl))
> +	    init_ctrl = build_minus_one_cst (ctrl_type);
> +	  else
> +	    {
> +	      /* The remaining work items initially are niters.  Saturate,
> +		 splat and compare.  */
> +	      if (!first_rem)
> +		{
> +		  first_rem = niters;
> +		  if (TYPE_PRECISION (TREE_TYPE (vectype))
> +		      < TYPE_PRECISION (iv_type))
> +		    first_rem = gimple_build (&preheader_seq,
> +					      MIN_EXPR, TREE_TYPE (first_rem),
> +					      first_rem, iv_step);
> +		  first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype),
> +					      first_rem);
> +		  first_rem = gimple_build_vector_from_val (&preheader_seq,
> +							    vectype, first_rem);
> +		}
> +	      init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type,
> +					cmp_series, first_rem);
> +	    }
> +
> +	  /* Now AND out the bits that are within the number of skipped
> +	     items.  */
> +	  poly_uint64 const_skip;
> +	  if (niters_skip
> +	      && !(poly_int_tree_p (niters_skip, &const_skip)
> +		   && known_le (const_skip, bias)))
> +	    {
> +	      /* For integer mode masks it's cheaper to shift out the bits
> +		 since that avoids loading a constant.  */
> +	      gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT);
> +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> +					lang_hooks.types.type_for_mode
> +					  (TYPE_MODE (ctrl_type), 1),
> +					init_ctrl);
> +	      /* ???  But when the shift amount isn't constant this requires
> +		 a round-trip to GPRs.  We could apply the bias to either
> +		 side of the compare instead.  */
> +	      tree shift = gimple_build (&preheader_seq, MULT_EXPR,
> +					 TREE_TYPE (niters_skip),
> +					 niters_skip,
> +					 build_int_cst (TREE_TYPE (niters_skip),
> +							rgc->max_nscalars_per_iter));
> +	      init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR,
> +					TREE_TYPE (init_ctrl),
> +					init_ctrl, shift);
> +	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> +					ctrl_type, init_ctrl);

It looks like this assumes that either the first lane or the last lane
of the first mask is always true, is that right?  I'm not sure we ever
prove that, at least not for SVE.  There it's possible to have inactive
elements at both ends of the same mask.

> +	    }
> +
> +	  /* Get the control value for the next iteration of the loop.  */
> +	  next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> +				    UNKNOWN_LOCATION,
> +				    LT_EXPR, ctrl_type, cmp_series, rem);
> +
> +	  vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
> +	}
> +    }
> +
> +  /* Emit all accumulated statements.  */
> +  add_preheader_seq (loop, preheader_seq);
> +
> +  /* Adjust the exit test using the decrementing IV.  */
> +  edge exit_edge = single_exit (loop);
> +  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
> +  /* When we peel for alignment with niter_skip != 0 this can
> +     cause niter + niter_skip to wrap and since we are comparing the
> +     value before the decrement here we get a false early exit.
> +     We can't compare the value after decrement either because that
> +     decrement could wrap as well as we're not doing a saturating
> +     decrement.  To avoid this situation we force a larger
> +     iv_type.  */
> +  gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step,
> +					NULL_TREE, NULL_TREE);
> +  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> +
> +  /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times.
> +     Subtract one from this to get the latch count.  */
> +  tree niters_minus_one
> +    = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters,
> +		   build_minus_one_cst (TREE_TYPE (orig_niters)));
> +  tree niters_adj2 = fold_convert (iv_type, niters_minus_one);
> +  if (niters_skip)
> +    niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one,
> +			       fold_convert (iv_type, niters_skip));
> +  loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type,
> +				     niters_adj2, iv_step);
> +
> +  if (final_iv)
> +    {
> +      gassign *assign = gimple_build_assign (final_iv, orig_niters);
> +      gsi_insert_on_edge_immediate (single_exit (loop), assign);
> +    }
> +
> +  return cond_stmt;
> +}
> +
> +
>  /* Like vect_set_loop_condition, but handle the case in which the vector
>     loop handles exactly VF scalars per iteration.  */
>  
> @@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
>    gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
>  
>    if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> -    cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> -							 niters, final_iv,
> -							 niters_maybe_zero,
> -							 loop_cond_gsi);
> +    {
> +      if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512)
> +	cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo,
> +								    niters, final_iv,
> +								    niters_maybe_zero,
> +								    loop_cond_gsi);
> +      else
> +	cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> +							     niters, final_iv,
> +							     niters_maybe_zero,
> +							     loop_cond_gsi);
> +    }
>    else
>      cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv,
>  						niters_maybe_zero,
> @@ -2030,7 +2281,8 @@ void
>  vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
>  {
>    tree misalign_in_elems;
> -  tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +  /* ???  With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end.  */
> +  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));

Like you say, I think this might cause problems for SVE, since
LOOP_VINFO_MASK_SKIP_NITERS is assumed to have the compare type
in vect_set_loop_controls_directly.  Not sure what the best way
around that is.

>  
>    gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));
>  
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 1897e720389..9be66b8fbc5 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "vec-perm-indices.h"
>  #include "tree-eh.h"
>  #include "case-cfn-macros.h"
> +#include "langhooks.h"
>  
>  /* Loop Vectorization Pass.
>  
> @@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      mask_skip_niters (NULL_TREE),
>      rgroup_compare_type (NULL_TREE),
>      simd_if_cond (NULL_TREE),
> +    partial_vector_style (vect_partial_vectors_none),
>      unaligned_dr (NULL),
>      peeling_for_alignment (0),
>      ptr_mask (0),
> @@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info ()
>  {
>    free (bbs);
>  
> -  release_vec_loop_controls (&masks);
> +  for (auto m : masks.rgc_map)
> +    {
> +      m.second->controls.release ();
> +      delete m.second;
> +    }
> +  release_vec_loop_controls (&masks.rgc_vec);
>    release_vec_loop_controls (&lens);
>    delete ivexpr_map;
>    delete scan_map;
> @@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
>  {
>    rgroup_controls *rgm;
>    unsigned int i;
> -  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> +  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm)
>      if (rgm->type != NULL_TREE
>  	&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
>  					    cmp_type, rgm->type,
> @@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>    if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
>      return false;
>  
> +  /* Produce the rgroup controls.  */
> +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> +    {
> +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +      tree vectype = mask.first;
> +      unsigned nvectors = mask.second;
> +
> +      if (masks->rgc_vec.length () < nvectors)
> +	masks->rgc_vec.safe_grow_cleared (nvectors, true);
> +      rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1];
> +      /* The number of scalars per iteration and the number of vectors are
> +	 both compile-time constants.  */
> +      unsigned int nscalars_per_iter
> +	  = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> +		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> +
> +      if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> +	{
> +	  rgm->max_nscalars_per_iter = nscalars_per_iter;
> +	  rgm->type = truth_type_for (vectype);
> +	  rgm->factor = 1;
> +	}
> +    }
> +
>    /* Calculate the maximum number of scalars per iteration for every rgroup.  */
>    unsigned int max_nscalars_per_iter = 1;
> -  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
> +  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec)
>      max_nscalars_per_iter
>        = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
>  
> @@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>      }
>  
>    if (!cmp_type)
> -    return false;
> +    {
> +      LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release ();
> +      return false;
> +    }
>  
>    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type;
>    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult;
> +  return true;
> +}
> +
> +/* Each statement in LOOP_VINFO can be masked where necessary.  Check
> +   whether we can actually generate AVX512 style masks.  Return true if so,
> +   storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE.  */
> +
> +static bool
> +vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo)
> +{
> +  /* Produce differently organized rgc_vec and differently check
> +     we can produce masks.  */
> +
> +  /* Use a normal loop if there are no statements that need masking.
> +     This only happens in rare degenerate cases: it means that the loop
> +     has no loads, no stores, and no live-out values.  */
> +  if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> +    return false;
> +
> +  /* For the decrementing IV we need to represent all values in
> +     [0, niter + niter_skip] where niter_skip is the elements we
> +     skip in the first iteration for prologue peeling.  */
> +  tree iv_type = NULL_TREE;
> +  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +  unsigned int iv_precision = UINT_MAX;
> +  if (iv_limit != -1)
> +    iv_precision = wi::min_precision (iv_limit, UNSIGNED);
> +
> +  /* First compute the type for the IV we use to track the remaining
> +     scalar iterations.  */
> +  opt_scalar_int_mode cmp_mode_iter;
> +  FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> +    {
> +      unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> +      if (cmp_bits >= iv_precision
> +	  && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> +	{
> +	  iv_type = build_nonstandard_integer_type (cmp_bits, true);
> +	  if (iv_type)
> +	    break;
> +	}
> +    }
> +  if (!iv_type)
> +    return false;
> +
> +  /* Produce the rgroup controls.  */
> +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> +    {
> +      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +      tree vectype = mask.first;
> +      unsigned nvectors = mask.second;
> +
> +      /* The number of scalars per iteration and the number of vectors are
> +	 both compile-time constants.  */
> +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> +      unsigned int nscalars_per_iter
> +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();

The comment seems to be borrowed from:

  /* The number of scalars per iteration and the number of vectors are
     both compile-time constants.  */
  unsigned int nscalars_per_iter
    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();

but the calculation instead applies to_constant to the VF and to the
number of vector elements, which aren't in general known to be constant.
Does the exact_div not work here too?

Avoiding the current to_constants here only moves the problem elsewhere
though.  Since this is a verification routine, I think this should
instead use is_constant and fail if it is false.

> +      /* We key off a hash-map with nscalars_per_iter and the number of total
> +	 lanes in the mask vector and then remember the needed vector mask
> +	 with the largest number of lanes (thus the fewest nV).  */
> +      bool existed;
> +      rgroup_controls *& rgc
> +	= masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter,
> +							nvectors * nunits),
> +					&existed);
> +      if (!existed)
> +	{
> +	  rgc = new rgroup_controls ();
> +	  rgc->type = truth_type_for (vectype);
> +	  rgc->compare_type = NULL_TREE;
> +	  rgc->max_nscalars_per_iter = nscalars_per_iter;
> +	  rgc->factor = 1;
> +	  rgc->bias_adjusted_ctrl = NULL_TREE;
> +	}
> +      else
> +	{
> +	  gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter);
> +	  if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type),
> +			TYPE_VECTOR_SUBPARTS (vectype)))
> +	    rgc->type = truth_type_for (vectype);
> +	}
> +    }
> +
> +  /* There is no fixed compare type we are going to use but we have to
> +     be able to get at one for each mask group.  */
> +  unsigned int min_ni_width
> +    = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED);
> +
> +  bool ok = true;
> +  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> +    {
> +      rgroup_controls *rgc = mask.second;
> +      tree mask_type = rgc->type;
> +      if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1)
> +	{
> +	  ok = false;
> +	  break;
> +	}
> +
> +      /* If iv_type is usable as compare type use that - we can elide the
> +	 saturation in that case.   */
> +      if (TYPE_PRECISION (iv_type) >= min_ni_width)
> +	{
> +	  tree cmp_vectype
> +	    = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type));
> +	  if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> +	    rgc->compare_type = cmp_vectype;
> +	}
> +      if (!rgc->compare_type)
> +	FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> +	  {
> +	    unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> +	    if (cmp_bits >= min_ni_width
> +		&& targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> +	      {
> +		tree cmp_type = build_nonstandard_integer_type (cmp_bits, true);
> +		if (!cmp_type)
> +		  continue;
> +
> +		/* Check whether we can produce the mask with cmp_type.  */
> +		tree cmp_vectype
> +		  = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type));
> +		if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> +		  {
> +		    rgc->compare_type = cmp_vectype;
> +		    break;
> +		  }
> +	      }
> +	}

Just curious: is this fallback loop ever used in practice?
TYPE_PRECISION (iv_type) >= min_ni_width seems like an easy condition
to satisfy.

> +      if (!rgc->compare_type)
> +	{
> +	  ok = false;
> +	  break;
> +	}
> +    }
> +  if (!ok)
> +    {
> +      LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty ();
> +      return false;

It looks like this leaks the rgroup_controls created above.

> +    }
> +
> +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node;
> +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512;
>    return true;
>  }
>  
> @@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo)
>  
>    LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
>    LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len;
>  
>    return true;
>  }
> @@ -2712,16 +2893,24 @@ start_over:
>  
>    /* If we still have the option of using partial vectors,
>       check whether we can generate the necessary loop controls.  */
> -  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -      && !vect_verify_full_masking (loop_vinfo)
> -      && !vect_verify_loop_lens (loop_vinfo))
> -    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> +	{
> +	  if (!vect_verify_full_masking (loop_vinfo)
> +	      && !vect_verify_full_masking_avx512 (loop_vinfo))
> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	}
> +      else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */
> +	if (!vect_verify_loop_lens (loop_vinfo))
> +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
>  
>    /* If we're vectorizing a loop that uses length "controls" and
>       can iterate more than once, we apply decrementing IV approach
>       in loop control.  */
>    if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> +      && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len
>        && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
>        && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>  	   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> @@ -3022,7 +3211,7 @@ again:
>    delete loop_vinfo->vector_costs;
>    loop_vinfo->vector_costs = nullptr;
>    /* Reset accumulated rgroup information.  */
> -  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
> +  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec);
>    release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
>    /* Reset assorted flags.  */
>    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
> @@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  			  cond_branch_not_taken, vect_epilogue);
>  
>    /* Take care of special costs for rgroup controls of partial vectors.  */
> -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> +      && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> +	  == vect_partial_vectors_avx512))
> +    {
> +      /* Calculate how many masks we need to generate.  */
> +      unsigned int num_masks = 0;
> +      bool need_saturation = false;
> +      for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> +	{
> +	  rgroup_controls *rgm = rgcm.second;
> +	  unsigned nvectors
> +	    = (rgcm.first.second
> +	       / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ());
> +	  num_masks += nvectors;
> +	  if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type))
> +	      < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo)))
> +	    need_saturation = true;
> +	}
> +
> +      /* ???  The target isn't able to identify the costs below as
> +	 producing masks so it cannot penalize cases where we'd run
> +	 out of mask registers for example.  */
> +
> +      /* In the worst case, we need to generate each mask in the prologue
> +	 and in the loop body.  We need one splat per group and one
> +	 compare per mask.
> +
> +	 Sometimes we can use unpacks instead of generating prologue
> +	 masks and sometimes the prologue mask will fold to a constant,
> +	 so the actual prologue cost might be smaller.  However, it's
> +	 simpler and safer to use the worst-case cost; if this ends up
> +	 being the tie-breaker between vectorizing or not, then it's
> +	 probably better not to vectorize.  */

Not sure all of this applies to the AVX512 case.  In particular, the
unpacks bit doesn't seem relevant.

> +      (void) add_stmt_cost (target_cost_data,
> +			    num_masks
> +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue);
> +      (void) add_stmt_cost (target_cost_data,
> +			    num_masks
> +			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> +			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body);
> +
> +      /* When we need saturation we need it both in the prologue and
> +	 the epilogue.  */
> +      if (need_saturation)
> +	{
> +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> +				NULL, NULL, NULL_TREE, 0, vect_prologue);
> +	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> +				NULL, NULL, NULL_TREE, 0, vect_body);
> +	}
> +    }
> +  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> +	   && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> +	       == vect_partial_vectors_while_ult))
>      {
>        /* Calculate how many masks we need to generate.  */
>        unsigned int num_masks = 0;
>        rgroup_controls *rgm;
>        unsigned int num_vectors_m1;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm)

Nit: long line.

>  	if (rgm->type)
>  	  num_masks += num_vectors_m1 + 1;
>        gcc_assert (num_masks > 0);
> @@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
>  		       unsigned int nvectors, tree vectype, tree scalar_mask)
>  {
>    gcc_assert (nvectors != 0);
> -  if (masks->length () < nvectors)
> -    masks->safe_grow_cleared (nvectors, true);
> -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> -  /* The number of scalars per iteration and the number of vectors are
> -     both compile-time constants.  */
> -  unsigned int nscalars_per_iter
> -    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> -		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
>  
>    if (scalar_mask)
>      {
> @@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
>        loop_vinfo->scalar_cond_masked_set.add (cond);
>      }
>  
> -  if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> -    {
> -      rgm->max_nscalars_per_iter = nscalars_per_iter;
> -      rgm->type = truth_type_for (vectype);
> -      rgm->factor = 1;
> -    }
> +  masks->mask_set.add (std::make_pair (vectype, nvectors));
>  }
>  
>  /* Given a complete set of masks MASKS, extract mask number INDEX
> @@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
>     arrangement.  */
>  
>  tree
> -vect_get_loop_mask (loop_vec_info,
> +vect_get_loop_mask (loop_vec_info loop_vinfo,
>  		    gimple_stmt_iterator *gsi, vec_loop_masks *masks,
>  		    unsigned int nvectors, tree vectype, unsigned int index)
>  {
> -  rgroup_controls *rgm = &(*masks)[nvectors - 1];
> -  tree mask_type = rgm->type;
> -
> -  /* Populate the rgroup's mask array, if this is the first time we've
> -     used it.  */
> -  if (rgm->controls.is_empty ())
> +  if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> +      == vect_partial_vectors_while_ult)
>      {
> -      rgm->controls.safe_grow_cleared (nvectors, true);
> -      for (unsigned int i = 0; i < nvectors; ++i)
> +      rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1];
> +      tree mask_type = rgm->type;
> +
> +      /* Populate the rgroup's mask array, if this is the first time we've
> +	 used it.  */
> +      if (rgm->controls.is_empty ())
>  	{
> -	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> -	  /* Provide a dummy definition until the real one is available.  */
> -	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> -	  rgm->controls[i] = mask;
> +	  rgm->controls.safe_grow_cleared (nvectors, true);
> +	  for (unsigned int i = 0; i < nvectors; ++i)
> +	    {
> +	      tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> +	      /* Provide a dummy definition until the real one is available.  */
> +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> +	      rgm->controls[i] = mask;
> +	    }
>  	}
> -    }
>  
> -  tree mask = rgm->controls[index];
> -  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> -		TYPE_VECTOR_SUBPARTS (vectype)))
> +      tree mask = rgm->controls[index];
> +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> +		    TYPE_VECTOR_SUBPARTS (vectype)))
> +	{
> +	  /* A loop mask for data type X can be reused for data type Y
> +	     if X has N times more elements than Y and if Y's elements
> +	     are N times bigger than X's.  In this case each sequence
> +	     of N elements in the loop mask will be all-zero or all-one.
> +	     We can then view-convert the mask so that each sequence of
> +	     N elements is replaced by a single element.  */
> +	  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> +				  TYPE_VECTOR_SUBPARTS (vectype)));
> +	  gimple_seq seq = NULL;
> +	  mask_type = truth_type_for (vectype);
> +	  /* We can only use re-use the mask by reinterpreting it if it
> +	     occupies the same space, that is the mask with less elements

Nit: fewer elements

> +	     uses multiple bits for each masked elements.  */
> +	  gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
> +				* TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
> +				TYPE_PRECISION (TREE_TYPE (mask_type))
> +				* TYPE_VECTOR_SUBPARTS (mask_type)));
> +	  mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> +	  if (seq)
> +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> +	}
> +      return mask;
> +    }
> +  else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> +	   == vect_partial_vectors_avx512)
>      {
> -      /* A loop mask for data type X can be reused for data type Y
> -	 if X has N times more elements than Y and if Y's elements
> -	 are N times bigger than X's.  In this case each sequence
> -	 of N elements in the loop mask will be all-zero or all-one.
> -	 We can then view-convert the mask so that each sequence of
> -	 N elements is replaced by a single element.  */
> -      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> -			      TYPE_VECTOR_SUBPARTS (vectype)));
> +      /* The number of scalars per iteration and the number of vectors are
> +	 both compile-time constants.  */
> +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> +      unsigned int nscalars_per_iter
> +	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();

Same disconnect between the comment and the code here.  If we do use
is_constant in vect_verify_full_masking_avx512 then we could reference
that instead.

> +
> +      rgroup_controls *rgm
> +	= *masks->rgc_map.get (std::make_pair (nscalars_per_iter,
> +					       nvectors * nunits));
> +
> +      /* The stored nV is dependent on the mask type produced.  */
> +      nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> +			    TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant ();
> +
> +      /* Populate the rgroup's mask array, if this is the first time we've
> +	 used it.  */
> +      if (rgm->controls.is_empty ())
> +	{
> +	  rgm->controls.safe_grow_cleared (nvectors, true);
> +	  for (unsigned int i = 0; i < nvectors; ++i)
> +	    {
> +	      tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask");
> +	      /* Provide a dummy definition until the real one is available.  */
> +	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> +	      rgm->controls[i] = mask;
> +	    }
> +	}
> +      if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type),
> +		    TYPE_VECTOR_SUBPARTS (vectype)))
> +	return rgm->controls[index];
> +
> +      /* Split the vector if needed.  Since we are dealing with integer mode
> +	 masks with AVX512 we can operate on the integer representation
> +	 performing the whole vector shifting.  */
> +      unsigned HOST_WIDE_INT factor;
> +      bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type),
> +				     TYPE_VECTOR_SUBPARTS (vectype), &factor);
> +      gcc_assert (ok);
> +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT);
> +      tree mask_type = truth_type_for (vectype);
> +      gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT);
> +      unsigned vi = index / factor;
> +      unsigned vpart = index % factor;
> +      tree vec = rgm->controls[vi];
>        gimple_seq seq = NULL;
> -      mask_type = truth_type_for (vectype);
> -      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR,
> +			  lang_hooks.types.type_for_mode
> +				(TYPE_MODE (rgm->type), 1), vec);
> +      /* For integer mode masks simply shift the right bits into position.  */
> +      if (vpart != 0)
> +	vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec,
> +			    build_int_cst (integer_type_node, vpart * nunits));
> +      vec = gimple_convert (&seq, lang_hooks.types.type_for_mode
> +				    (TYPE_MODE (mask_type), 1), vec);
> +      vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec);

Would it be worth creating an rgroup_controls for each nunits derivative
of each rgc_map entry?  That way we'd share the calculations, and be
able to cost them more accurately.

Maybe it's not worth it, just asking. :)

Thanks,
Richard

>        if (seq)
>  	gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> +      return vec;
>      }
> -  return mask;
> +  else
> +    gcc_unreachable ();
>  }
>  
>  /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 767a0774d45..42161778dc1 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -300,6 +300,13 @@ public:
>  #define SLP_TREE_LANES(S)			 (S)->lanes
>  #define SLP_TREE_CODE(S)			 (S)->code
>  
> +enum vect_partial_vector_style {
> +    vect_partial_vectors_none,
> +    vect_partial_vectors_while_ult,
> +    vect_partial_vectors_avx512,
> +    vect_partial_vectors_len
> +};
> +
>  /* Key for map that records association between
>     scalar conditions and corresponding loop mask, and
>     is populated by vect_record_loop_mask.  */
> @@ -605,6 +612,10 @@ struct rgroup_controls {
>       specified number of elements; the type of the elements doesn't matter.  */
>    tree type;
>  
> +  /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this
> +     is the rgroup specific type used.  */
> +  tree compare_type;
> +
>    /* A vector of nV controls, in iteration order.  */
>    vec<tree> controls;
>  
> @@ -613,7 +624,24 @@ struct rgroup_controls {
>    tree bias_adjusted_ctrl;
>  };
>  
> -typedef auto_vec<rgroup_controls> vec_loop_masks;
> +struct vec_loop_masks
> +{
> +  bool is_empty () const { return mask_set.is_empty (); }
> +
> +  typedef pair_hash <nofree_ptr_hash <tree_node>,
> +		     int_hash<unsigned, 0>> mp_hash;
> +  hash_set<mp_hash> mask_set;
> +
> +  /* Default storage for rgroup_controls.  */
> +  auto_vec<rgroup_controls> rgc_vec;
> +
> +  /* The vect_partial_vectors_avx512 style uses a hash-map.  */
> +  hash_map<std::pair<unsigned /* nscalars_per_iter */,
> +		     unsigned /* nlanes */>, rgroup_controls *,
> +	   simple_hashmap_traits<pair_hash <int_hash<unsigned, 0>,
> +					    int_hash<unsigned, 0>>,
> +				 rgroup_controls *>> rgc_map;
> +};
>  
>  typedef auto_vec<rgroup_controls> vec_loop_lens;
>  
> @@ -741,6 +769,10 @@ public:
>       LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
>    tree rgroup_iv_type;
>  
> +  /* The style used for implementing partial vectors when
> +     LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
> +  vect_partial_vector_style partial_vector_style;
> +
>    /* Unknown DRs according to which loop was peeled.  */
>    class dr_vec_info *unaligned_dr;
>  
> @@ -914,6 +946,7 @@ public:
>  #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
>  #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
>  #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
> +#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style
>  #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
>  #define LOOP_VINFO_N_STMTS(L)		   (L)->shared->n_stmts
>  #define LOOP_VINFO_LOOP_NEST(L)            (L)->shared->loop_nest


* [PATCH 3/3] AVX512 fully masked vectorization
@ 2023-06-14 11:54 Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-06-14 11:54 UTC (permalink / raw)
  To: gcc-patches
  Cc: richard.sandiford, Jan Hubicka, hongtao.liu, kirill.yukhin, ams

This implements fully masked vectorization or a masked epilog for
AVX512 style masks, which set themselves apart by representing
each lane with a single bit and by using integer modes for the mask
(both much like GCN).

AVX512 is also special in that it doesn't have any instruction
to compute the mask from a scalar IV like SVE has with while_ult.
Instead the masks are produced by vector compares and the loop
control retains the scalar IV (mainly to avoid dependences on
mask generation; a suitable mask test instruction is available).

Like RVV, code generation prefers a decrementing IV, though IVOPTs
messes things up in some cases, removing that IV and replacing it
with an incrementing one used for address generation.
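
The scheme corresponds to a hand-written AVX-512 loop along the lines
of the following sketch (an illustration only, not code produced by
this patch; the int copy loop, the function name and the fixed V16SI
factor are assumptions for the example, and it needs -mavx512f):

  #include <immintrin.h>
  #include <cstdint>

  /* Copy N ints with a fully masked loop: the mask comes from a vector
     compare against the remaining scalar count, while the backedge test
     uses the scalar IV value from before the decrement.  */
  void
  masked_copy (int *dst, const int *src, uint64_t n)
  {
    if (n == 0)
      return;
    const uint64_t vf = 16;   /* V16SI vectorization factor.  */
    const __m512i lanes
      = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                           8, 9, 10, 11, 12, 13, 14, 15);
    uint64_t iv = n, off = 0, prev;
    do
      {
        prev = iv;
        /* rem = MIN <iv, VF>;  mask = { 0, 1, ..., 15 } < splat (rem)  */
        uint32_t rem = (uint32_t) (iv < vf ? iv : vf);
        __mmask16 k
          = _mm512_cmplt_epu32_mask (lanes, _mm512_set1_epi32 ((int) rem));
        __m512i v = _mm512_maskz_loadu_epi32 (k, src + off);
        _mm512_mask_storeu_epi32 (dst + off, k, v);
        off += vf;
        iv -= vf;   /* May wrap, hence the exit test on the old value.  */
      }
    while (prev > vf);
  }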

One of the motivating testcases is from PR108410, which in turn
is extracted from x264, where large-size vectorization shows
issues with small trip-count loops.  Execution time there improves
compared to classic AVX512 with AVX2 epilogues for the cases
of fewer than 32 iterations.

size   scalar     128     256     512    512e    512f
    1    9.42   11.32    9.35   11.17   15.13   16.89
    2    5.72    6.53    6.66    6.66    7.62    8.56
    3    4.49    5.10    5.10    5.74    5.08    5.73
    4    4.10    4.33    4.29    5.21    3.79    4.25
    6    3.78    3.85    3.86    4.76    2.54    2.85
    8    3.64    1.89    3.76    4.50    1.92    2.16
   12    3.56    2.21    3.75    4.26    1.26    1.42
   16    3.36    0.83    1.06    4.16    0.95    1.07
   20    3.39    1.42    1.33    4.07    0.75    0.85
   24    3.23    0.66    1.72    4.22    0.62    0.70
   28    3.18    1.09    2.04    4.20    0.54    0.61
   32    3.16    0.47    0.41    0.41    0.47    0.53
   34    3.16    0.67    0.61    0.56    0.44    0.50
   38    3.19    0.95    0.95    0.82    0.40    0.45
   42    3.09    0.58    1.21    1.13    0.36    0.40

'size' specifies the number of actual iterations; 512e is for
a masked epilog and 512f for the fully masked loop.  From
4 scalar iterations on, the AVX512 masked epilog code is clearly
the winner; the fully masked variant is clearly worse and
its size benefit is also tiny.

This patch does not enable using fully masked loops or
masked epilogues by default.  More work on cost modeling
and vectorization kind selection on x86_64 is necessary
for this.

Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE,
which could be exploited further to unify some of the flags
we have right now, but there didn't seem to be many easy things
to merge, so I'm leaving this for followups.

Mask requirements as registered by vect_record_loop_mask are now kept
in their original form and recorded in a hash_set instead of being
processed into a vector of rgroup_controls.  That processing is now
left to the final analysis phase, which first tries forming the
rgroup_controls vector using while_ult and, if that fails, tries the
AVX512 style, which needs a different organization and instead fills
a hash_map with the relevant info.  vect_get_loop_mask now has two
implementations, one for each of the two mask styles we then have.
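
As a stand-alone illustration of how one wider kmask is reused for a
vector type with fewer lanes in the new vect_get_loop_mask path (a
sketch, not the patch's code; the helper name is made up): since the
masks are integer-mode values the split is just a shift of the integer
representation, which is what the VIEW_CONVERT_EXPR plus RSHIFT_EXPR
sequence there amounts to.

  #include <immintrin.h>

  /* Extract sub-mask PART (0 or 1) of a 16-lane kmask for use with an
     8-lane vector type; the split is just an integer shift.  */
  static inline __mmask8
  submask8 (__mmask16 wide, unsigned part)
  {
    return (__mmask8) ((unsigned) wide >> (part * 8));
  }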

I have decided against interweaving vect_set_loop_condition_partial_vectors
with conditions to do AVX512 style masking and instead opted to
"duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.

I was split on making 'vec_loop_masks' a class with methods and
possibly merging the _len stuff into a single registry.  It
seemed like too many changes for the purpose of getting AVX512
working.  I'm going to wait and see what happens with RISC-V
here, since I think they are going to get both masks and lengths
registered.

The vect_prepare_for_masked_peels hunk might run into issues with
SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
looked odd.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
the testsuite with --param vect-partial-vector-usage=2 with and
without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
and one latent wrong-code (PR110237).

There's followup work to be done to try enabling masked epilogues
for x86-64 by default (when AVX512 is enabled, possibly only when
-mprefer-vector-width=512).  Getting cost modeling and decision
right is going to be challenging.

Any comments?

OK?

Btw, testing on GCN would be welcome - the _avx512 paths could
work for it so in case the while_ult path fails (not sure if
it ever does) it could get _avx512 style masking.  Likewise
testing on ARM just to see I didn't break anything here.
I don't have SVE hardware so testing is probably meaningless.

Thanks,
Richard.

	* tree-vectorizer.h (enum vect_partial_vector_style): New.
	(_loop_vec_info::partial_vector_style): Likewise.
	(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
	(rgroup_controls::compare_type): Add.
	(vec_loop_masks): Change from a typedef of auto_vec<>
	to a structure.
	* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
	Adjust.
	(vect_set_loop_condition_partial_vectors_avx512): New function
	implementing the AVX512 partial vector codegen.
	(vect_set_loop_condition): Dispatch to the correct
	vect_set_loop_condition_partial_vectors_* function based on
	LOOP_VINFO_PARTIAL_VECTORS_STYLE.
	(vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
	in the original niter type.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
	partial_vector_style.
	(_loop_vec_info::~_loop_vec_info): Release the hash-map recorded
	rgroup_controls.
	(can_produce_all_loop_masks_p): Adjust.
	(vect_verify_full_masking): Produce the rgroup_controls vector
	here.  Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
	(vect_verify_full_masking_avx512): New function implementing
	verification of AVX512 style masking.
	(vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
	(vect_analyze_loop_2): Also try AVX512 style masking.
	Adjust condition.
	(vect_estimate_min_profitable_iters): Implement AVX512 style
	mask producing cost.
	(vect_record_loop_mask): Do not build the rgroup_controls
	vector here but record masks in a hash-set.
	(vect_get_loop_mask): Implement AVX512 style mask query,
	complementing the existing while_ult style.
---
 gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++-
 gcc/tree-vect-loop.cc       | 413 +++++++++++++++++++++++++++++++-----
 gcc/tree-vectorizer.h       |  35 ++-
 3 files changed, 651 insertions(+), 61 deletions(-)

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 1c8100c1a1c..f0ecaec28f4 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -50,6 +50,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "insn-config.h"
 #include "rtl.h"
 #include "recog.h"
+#include "langhooks.h"
+#include "tree-vector-builder.h"
+#include "optabs-tree.h"
 
 /*************************************************************************
   Simple Loop Peeling Utilities
@@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   rgroup_controls *iv_rgc = nullptr;
   unsigned int i;
   auto_vec<rgroup_controls> *controls = use_masks_p
-					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  ? &LOOP_VINFO_MASKS (loop_vinfo).rgc_vec
 					  : &LOOP_VINFO_LENS (loop_vinfo);
   FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
@@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   return cond_stmt;
 }
 
+/* Set up the iteration condition and rgroup controls for LOOP in AVX512
+   style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the
+   vectorized loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
+   the number of iterations of the original scalar loop that should be
+   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are as
+   for vect_set_loop_condition.
+
+   Insert the branch-back condition before LOOP_COND_GSI and return the
+   final gcond.  */
+
+static gcond *
+vect_set_loop_condition_partial_vectors_avx512 (class loop *loop,
+					 loop_vec_info loop_vinfo, tree niters,
+					 tree final_iv,
+					 bool niters_maybe_zero,
+					 gimple_stmt_iterator loop_cond_gsi)
+{
+  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
+  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree orig_niters = niters;
+  gimple_seq preheader_seq = NULL;
+
+  /* Create an IV that counts down from niters and whose step
+     is the number of iterations processed in the current iteration.
+     Produce the controls with compares like the following.
+
+       # iv_2 = PHI <niters, iv_3>
+       rem_4 = MIN <iv_2, VF>;
+       remv_6 = { rem_4, rem_4, rem_4, ... }
+       mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6;
+       iv_3 = iv_2 - VF;
+       if (iv_2 > VF)
+	 continue;
+
+     Where the constant is built with elements at most VF - 1 and
+     repetitions according to max_nscalars_per_iter which is guaranteed
+     to be the same within a group.  */
+
+  /* Convert NITERS to the determined IV type.  */
+  if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters))
+      && niters_maybe_zero)
+    {
+      /* We know that there is always at least one iteration, so if the
+	 count is zero then it must have wrapped.  Cope with this by
+	 subtracting 1 before the conversion and adding 1 to the result.  */
+      gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters)));
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters),
+			     niters, build_minus_one_cst (TREE_TYPE (niters)));
+      niters = gimple_convert (&preheader_seq, iv_type, niters);
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type,
+			     niters, build_one_cst (iv_type));
+    }
+  else
+    niters = gimple_convert (&preheader_seq, iv_type, niters);
+
+  /* Bias the initial value of the IV in case we need to skip iterations
+     at the beginning.  */
+  tree niters_adj = niters;
+  if (niters_skip)
+    {
+      tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip);
+      niters_adj = gimple_build (&preheader_seq, PLUS_EXPR,
+				 iv_type, niters, skip);
+    }
+
+  /* The iteration step is the vectorization factor.  */
+  tree iv_step = build_int_cst (iv_type, vf);
+
+  /* Create the decrement IV.  */
+  tree index_before_incr, index_after_incr;
+  gimple_stmt_iterator incr_gsi;
+  bool insert_after;
+  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop,
+	     &incr_gsi, insert_after, &index_before_incr,
+	     &index_after_incr);
+
+  /* Iterate over all the rgroups and fill in their controls.  */
+  for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
+    {
+      rgroup_controls *rgc = rgcm.second;
+      if (rgc->controls.is_empty ())
+	continue;
+
+      tree ctrl_type = rgc->type;
+      poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
+
+      tree vectype = rgc->compare_type;
+
+      /* index_after_incr is the IV specifying the remaining iterations in
+	 the next iteration.  */
+      tree rem = index_after_incr;
+      /* When the data type for the compare to produce the mask is
+	 smaller than the IV type we need to saturate.  Saturate to
+	 the smallest possible value (IV_TYPE) so we only have to
+	 saturate once (CSE will catch redundant ones we add).  */
+      if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type))
+	rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
+			    UNKNOWN_LOCATION,
+			    MIN_EXPR, TREE_TYPE (rem), rem, iv_step);
+      rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING,
+			    UNKNOWN_LOCATION, TREE_TYPE (vectype), rem);
+
+      /* Build a data vector composed of the remaining iterations.  */
+      rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING,
+					  UNKNOWN_LOCATION, vectype, rem);
+
+      /* Provide a definition of each vector in the control group.  */
+      tree next_ctrl = NULL_TREE;
+      tree first_rem = NULL_TREE;
+      tree ctrl;
+      unsigned int i;
+      FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
+	{
+	  /* Previous controls will cover BIAS items.  This control covers the
+	     next batch.  */
+	  poly_uint64 bias = nitems_per_ctrl * i;
+
+	  /* Build the constant to compare the remaining iters against,
+	     this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately
+	     split into pieces.  */
+	  unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant ();
+	  tree_vector_builder builder (vectype, n, 1);
+	  for (unsigned i = 0; i < n; ++i)
+	    {
+	      unsigned HOST_WIDE_INT val
+		= (i + bias.to_constant ()) / rgc->max_nscalars_per_iter;
+	      gcc_assert (val < vf.to_constant ());
+	      builder.quick_push (build_int_cst (TREE_TYPE (vectype), val));
+	    }
+	  tree cmp_series = builder.build ();
+
+	  /* Create the initial control.  First include all items that
+	     are within the loop limit.  */
+	  tree init_ctrl = NULL_TREE;
+	  poly_uint64 const_limit;
+	  /* See whether the first iteration of the vector loop is known
+	     to have a full control.  */
+	  if (poly_int_tree_p (niters, &const_limit)
+	      && known_ge (const_limit, (i + 1) * nitems_per_ctrl))
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    {
+	      /* The remaining work items initially are niters.  Saturate,
+		 splat and compare.  */
+	      if (!first_rem)
+		{
+		  first_rem = niters;
+		  if (TYPE_PRECISION (TREE_TYPE (vectype))
+		      < TYPE_PRECISION (iv_type))
+		    first_rem = gimple_build (&preheader_seq,
+					      MIN_EXPR, TREE_TYPE (first_rem),
+					      first_rem, iv_step);
+		  first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype),
+					      first_rem);
+		  first_rem = gimple_build_vector_from_val (&preheader_seq,
+							    vectype, first_rem);
+		}
+	      init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type,
+					cmp_series, first_rem);
+	    }
+
+	  /* Now AND out the bits that are within the number of skipped
+	     items.  */
+	  poly_uint64 const_skip;
+	  if (niters_skip
+	      && !(poly_int_tree_p (niters_skip, &const_skip)
+		   && known_le (const_skip, bias)))
+	    {
+	      /* For integer mode masks it's cheaper to shift out the bits
+		 since that avoids loading a constant.  */
+	      gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT);
+	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
+					lang_hooks.types.type_for_mode
+					  (TYPE_MODE (ctrl_type), 1),
+					init_ctrl);
+	      /* ???  But when the shift amount isn't constant this requires
+		 a round-trip to GPRs.  We could apply the bias to either
+		 side of the compare instead.  */
+	      tree shift = gimple_build (&preheader_seq, MULT_EXPR,
+					 TREE_TYPE (niters_skip),
+					 niters_skip,
+					 build_int_cst (TREE_TYPE (niters_skip),
+							rgc->max_nscalars_per_iter));
+	      init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR,
+					TREE_TYPE (init_ctrl),
+					init_ctrl, shift);
+	      init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
+					ctrl_type, init_ctrl);
+	    }
+
+	  /* Get the control value for the next iteration of the loop.  */
+	  next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
+				    UNKNOWN_LOCATION,
+				    LT_EXPR, ctrl_type, cmp_series, rem);
+
+	  vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
+	}
+    }
+
+  /* Emit all accumulated statements.  */
+  add_preheader_seq (loop, preheader_seq);
+
+  /* Adjust the exit test using the decrementing IV.  */
+  edge exit_edge = single_exit (loop);
+  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
+  /* When we peel for alignment with niter_skip != 0 this can
+     cause niter + niter_skip to wrap and since we are comparing the
+     value before the decrement here we get a false early exit.
+     We can't compare the value after decrement either because that
+     decrement could wrap as well as we're not doing a saturating
+     decrement.  To avoid this situation we force a larger
+     iv_type.  */
+  gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step,
+					NULL_TREE, NULL_TREE);
+  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+
+  /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times.
+     Subtract one from this to get the latch count.  */
+  tree niters_minus_one
+    = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters,
+		   build_minus_one_cst (TREE_TYPE (orig_niters)));
+  tree niters_adj2 = fold_convert (iv_type, niters_minus_one);
+  if (niters_skip)
+    niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one,
+			       fold_convert (iv_type, niters_skip));
+  loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type,
+				     niters_adj2, iv_step);
+
+  if (final_iv)
+    {
+      gassign *assign = gimple_build_assign (final_iv, orig_niters);
+      gsi_insert_on_edge_immediate (single_exit (loop), assign);
+    }
+
+  return cond_stmt;
+}
+
+
 /* Like vect_set_loop_condition, but handle the case in which the vector
    loop handles exactly VF scalars per iteration.  */
 
@@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
   gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
 
   if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
-    cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
-							 niters, final_iv,
-							 niters_maybe_zero,
-							 loop_cond_gsi);
+    {
+      if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512)
+	cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo,
+								    niters, final_iv,
+								    niters_maybe_zero,
+								    loop_cond_gsi);
+      else
+	cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
+							     niters, final_iv,
+							     niters_maybe_zero,
+							     loop_cond_gsi);
+    }
   else
     cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv,
 						niters_maybe_zero,
@@ -2030,7 +2281,8 @@ void
 vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
 {
   tree misalign_in_elems;
-  tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+  /* ???  With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end.  */
+  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
 
   gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));
 
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1897e720389..9be66b8fbc5 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "tree-eh.h"
 #include "case-cfn-macros.h"
+#include "langhooks.h"
 
 /* Loop Vectorization Pass.
 
@@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
     simd_if_cond (NULL_TREE),
+    partial_vector_style (vect_partial_vectors_none),
     unaligned_dr (NULL),
     peeling_for_alignment (0),
     ptr_mask (0),
@@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info ()
 {
   free (bbs);
 
-  release_vec_loop_controls (&masks);
+  for (auto m : masks.rgc_map)
+    {
+      m.second->controls.release ();
+      delete m.second;
+    }
+  release_vec_loop_controls (&masks.rgc_vec);
   release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
@@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
 {
   rgroup_controls *rgm;
   unsigned int i;
-  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm)
     if (rgm->type != NULL_TREE
 	&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
 					    cmp_type, rgm->type,
@@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
     return false;
 
+  /* Produce the rgroup controls.  */
+  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
+    {
+      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+      tree vectype = mask.first;
+      unsigned nvectors = mask.second;
+
+      if (masks->rgc_vec.length () < nvectors)
+	masks->rgc_vec.safe_grow_cleared (nvectors, true);
+      rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1];
+      /* The number of scalars per iteration and the number of vectors are
+	 both compile-time constants.  */
+      unsigned int nscalars_per_iter
+	  = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		       LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+      if (rgm->max_nscalars_per_iter < nscalars_per_iter)
+	{
+	  rgm->max_nscalars_per_iter = nscalars_per_iter;
+	  rgm->type = truth_type_for (vectype);
+	  rgm->factor = 1;
+	}
+    }
+
   /* Calculate the maximum number of scalars per iteration for every rgroup.  */
   unsigned int max_nscalars_per_iter = 1;
-  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
+  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec)
     max_nscalars_per_iter
       = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
 
@@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
     }
 
   if (!cmp_type)
-    return false;
+    {
+      LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release ();
+      return false;
+    }
 
   LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type;
   LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult;
+  return true;
+}
+
+/* Each statement in LOOP_VINFO can be masked where necessary.  Check
+   whether we can actually generate AVX512 style masks.  Return true if so,
+   storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE.  */
+
+static bool
+vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo)
+{
+  /* Produce differently organized rgc_vec and differently check
+     we can produce masks.  */
+
+  /* Use a normal loop if there are no statements that need masking.
+     This only happens in rare degenerate cases: it means that the loop
+     has no loads, no stores, and no live-out values.  */
+  if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
+    return false;
+
+  /* For the decrementing IV we need to represent all values in
+     [0, niter + niter_skip] where niter_skip is the elements we
+     skip in the first iteration for prologue peeling.  */
+  tree iv_type = NULL_TREE;
+  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+  unsigned int iv_precision = UINT_MAX;
+  if (iv_limit != -1)
+    iv_precision = wi::min_precision (iv_limit, UNSIGNED);
+
+  /* First compute the type for the IV we use to track the remaining
+     scalar iterations.  */
+  opt_scalar_int_mode cmp_mode_iter;
+  FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
+    {
+      unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
+      if (cmp_bits >= iv_precision
+	  && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
+	{
+	  iv_type = build_nonstandard_integer_type (cmp_bits, true);
+	  if (iv_type)
+	    break;
+	}
+    }
+  if (!iv_type)
+    return false;
+
+  /* Produce the rgroup controls.  */
+  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
+    {
+      vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+      tree vectype = mask.first;
+      unsigned nvectors = mask.second;
+
+      /* The number of scalars per iteration and the number of vectors are
+	 both compile-time constants.  */
+      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
+      unsigned int nscalars_per_iter
+	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
+
+      /* We key off a hash-map with nscalars_per_iter and the number of total
+	 lanes in the mask vector and then remember the needed vector mask
+	 with the largest number of lanes (thus the fewest nV).  */
+      bool existed;
+      rgroup_controls *& rgc
+	= masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter,
+							nvectors * nunits),
+					&existed);
+      if (!existed)
+	{
+	  rgc = new rgroup_controls ();
+	  rgc->type = truth_type_for (vectype);
+	  rgc->compare_type = NULL_TREE;
+	  rgc->max_nscalars_per_iter = nscalars_per_iter;
+	  rgc->factor = 1;
+	  rgc->bias_adjusted_ctrl = NULL_TREE;
+	}
+      else
+	{
+	  gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter);
+	  if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type),
+			TYPE_VECTOR_SUBPARTS (vectype)))
+	    rgc->type = truth_type_for (vectype);
+	}
+    }
+
+  /* There is no fixed compare type we are going to use but we have to
+     be able to get at one for each mask group.  */
+  unsigned int min_ni_width
+    = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED);
+
+  bool ok = true;
+  for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
+    {
+      rgroup_controls *rgc = mask.second;
+      tree mask_type = rgc->type;
+      if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1)
+	{
+	  ok = false;
+	  break;
+	}
+
+      /* If iv_type is usable as compare type use that - we can elide the
+	 saturation in that case.   */
+      if (TYPE_PRECISION (iv_type) >= min_ni_width)
+	{
+	  tree cmp_vectype
+	    = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type));
+	  if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
+	    rgc->compare_type = cmp_vectype;
+	}
+      if (!rgc->compare_type)
+	FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
+	  {
+	    unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
+	    if (cmp_bits >= min_ni_width
+		&& targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
+	      {
+		tree cmp_type = build_nonstandard_integer_type (cmp_bits, true);
+		if (!cmp_type)
+		  continue;
+
+		/* Check whether we can produce the mask with cmp_type.  */
+		tree cmp_vectype
+		  = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type));
+		if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
+		  {
+		    rgc->compare_type = cmp_vectype;
+		    break;
+		  }
+	      }
+	}
+      if (!rgc->compare_type)
+	{
+	  ok = false;
+	  break;
+	}
+    }
+  if (!ok)
+    {
+      LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty ();
+      return false;
+    }
+
+  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node;
+  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512;
   return true;
 }
 
@@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo)
 
   LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
   LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len;
 
   return true;
 }
@@ -2712,16 +2893,24 @@ start_over:
 
   /* If we still have the option of using partial vectors,
      check whether we can generate the necessary loop controls.  */
-  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-      && !vect_verify_full_masking (loop_vinfo)
-      && !vect_verify_loop_lens (loop_vinfo))
-    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
+	{
+	  if (!vect_verify_full_masking (loop_vinfo)
+	      && !vect_verify_full_masking_avx512 (loop_vinfo))
+	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	}
+      else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */
+	if (!vect_verify_loop_lens (loop_vinfo))
+	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 
   /* If we're vectorizing a loop that uses length "controls" and
      can iterate more than once, we apply decrementing IV approach
      in loop control.  */
   if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
+      && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len
       && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
       && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 	   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
@@ -3022,7 +3211,7 @@ again:
   delete loop_vinfo->vector_costs;
   loop_vinfo->vector_costs = nullptr;
   /* Reset accumulated rgroup information.  */
-  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec);
   release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
@@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			  cond_branch_not_taken, vect_epilogue);
 
   /* Take care of special costs for rgroup controls of partial vectors.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
+	  == vect_partial_vectors_avx512))
+    {
+      /* Calculate how many masks we need to generate.  */
+      unsigned int num_masks = 0;
+      bool need_saturation = false;
+      for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
+	{
+	  rgroup_controls *rgm = rgcm.second;
+	  unsigned nvectors
+	    = (rgcm.first.second
+	       / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ());
+	  num_masks += nvectors;
+	  if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type))
+	      < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo)))
+	    need_saturation = true;
+	}
+
+      /* ???  The target isn't able to identify the costs below as
+	 producing masks so it cannot penalize cases where we'd run
+	 out of mask registers for example.  */
+
+      /* In the worst case, we need to generate each mask in the prologue
+	 and in the loop body.  We need one splat per group and one
+	 compare per mask.
+
+	 Sometimes we can use unpacks instead of generating prologue
+	 masks and sometimes the prologue mask will fold to a constant,
+	 so the actual prologue cost might be smaller.  However, it's
+	 simpler and safer to use the worst-case cost; if this ends up
+	 being the tie-breaker between vectorizing or not, then it's
+	 probably better not to vectorize.  */
+      (void) add_stmt_cost (target_cost_data,
+			    num_masks
+			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
+			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (target_cost_data,
+			    num_masks
+			    + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
+			    vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body);
+
+      /* When we need saturation we need it both in the prologue and
+	 the epilogue.  */
+      if (need_saturation)
+	{
+	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
+				NULL, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
+				NULL, NULL, NULL_TREE, 0, vect_body);
+	}
+    }
+  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	   && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
+	       == vect_partial_vectors_while_ult))
     {
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
       rgroup_controls *rgm;
       unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm)
 	if (rgm->type)
 	  num_masks += num_vectors_m1 + 1;
       gcc_assert (num_masks > 0);
@@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
 		       unsigned int nvectors, tree vectype, tree scalar_mask)
 {
   gcc_assert (nvectors != 0);
-  if (masks->length () < nvectors)
-    masks->safe_grow_cleared (nvectors, true);
-  rgroup_controls *rgm = &(*masks)[nvectors - 1];
-  /* The number of scalars per iteration and the number of vectors are
-     both compile-time constants.  */
-  unsigned int nscalars_per_iter
-    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
-		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
 
   if (scalar_mask)
     {
@@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
       loop_vinfo->scalar_cond_masked_set.add (cond);
     }
 
-  if (rgm->max_nscalars_per_iter < nscalars_per_iter)
-    {
-      rgm->max_nscalars_per_iter = nscalars_per_iter;
-      rgm->type = truth_type_for (vectype);
-      rgm->factor = 1;
-    }
+  masks->mask_set.add (std::make_pair (vectype, nvectors));
 }
 
 /* Given a complete set of masks MASKS, extract mask number INDEX
@@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
    arrangement.  */
 
 tree
-vect_get_loop_mask (loop_vec_info,
+vect_get_loop_mask (loop_vec_info loop_vinfo,
 		    gimple_stmt_iterator *gsi, vec_loop_masks *masks,
 		    unsigned int nvectors, tree vectype, unsigned int index)
 {
-  rgroup_controls *rgm = &(*masks)[nvectors - 1];
-  tree mask_type = rgm->type;
-
-  /* Populate the rgroup's mask array, if this is the first time we've
-     used it.  */
-  if (rgm->controls.is_empty ())
+  if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
+      == vect_partial_vectors_while_ult)
     {
-      rgm->controls.safe_grow_cleared (nvectors, true);
-      for (unsigned int i = 0; i < nvectors; ++i)
+      rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1];
+      tree mask_type = rgm->type;
+
+      /* Populate the rgroup's mask array, if this is the first time we've
+	 used it.  */
+      if (rgm->controls.is_empty ())
 	{
-	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
-	  /* Provide a dummy definition until the real one is available.  */
-	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
-	  rgm->controls[i] = mask;
+	  rgm->controls.safe_grow_cleared (nvectors, true);
+	  for (unsigned int i = 0; i < nvectors; ++i)
+	    {
+	      tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
+	      /* Provide a dummy definition until the real one is available.  */
+	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
+	      rgm->controls[i] = mask;
+	    }
 	}
-    }
 
-  tree mask = rgm->controls[index];
-  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
-		TYPE_VECTOR_SUBPARTS (vectype)))
+      tree mask = rgm->controls[index];
+      if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
+		    TYPE_VECTOR_SUBPARTS (vectype)))
+	{
+	  /* A loop mask for data type X can be reused for data type Y
+	     if X has N times more elements than Y and if Y's elements
+	     are N times bigger than X's.  In this case each sequence
+	     of N elements in the loop mask will be all-zero or all-one.
+	     We can then view-convert the mask so that each sequence of
+	     N elements is replaced by a single element.  */
+	  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
+				  TYPE_VECTOR_SUBPARTS (vectype)));
+	  gimple_seq seq = NULL;
+	  mask_type = truth_type_for (vectype);
+	  /* We can only use re-use the mask by reinterpreting it if it
+	     occupies the same space, that is the mask with less elements
+	     uses multiple bits for each masked elements.  */
+	  gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
+				* TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
+				TYPE_PRECISION (TREE_TYPE (mask_type))
+				* TYPE_VECTOR_SUBPARTS (mask_type)));
+	  mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
+	  if (seq)
+	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
+	}
+      return mask;
+    }
+  else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
+	   == vect_partial_vectors_avx512)
     {
-      /* A loop mask for data type X can be reused for data type Y
-	 if X has N times more elements than Y and if Y's elements
-	 are N times bigger than X's.  In this case each sequence
-	 of N elements in the loop mask will be all-zero or all-one.
-	 We can then view-convert the mask so that each sequence of
-	 N elements is replaced by a single element.  */
-      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
-			      TYPE_VECTOR_SUBPARTS (vectype)));
+      /* The number of scalars per iteration and the number of vectors are
+	 both compile-time constants.  */
+      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
+      unsigned int nscalars_per_iter
+	= nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
+
+      rgroup_controls *rgm
+	= *masks->rgc_map.get (std::make_pair (nscalars_per_iter,
+					       nvectors * nunits));
+
+      /* The stored nV is dependent on the mask type produced.  */
+      nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+			    TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant ();
+
+      /* Populate the rgroup's mask array, if this is the first time we've
+	 used it.  */
+      if (rgm->controls.is_empty ())
+	{
+	  rgm->controls.safe_grow_cleared (nvectors, true);
+	  for (unsigned int i = 0; i < nvectors; ++i)
+	    {
+	      tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask");
+	      /* Provide a dummy definition until the real one is available.  */
+	      SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
+	      rgm->controls[i] = mask;
+	    }
+	}
+      if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type),
+		    TYPE_VECTOR_SUBPARTS (vectype)))
+	return rgm->controls[index];
+
+      /* Split the vector if needed.  Since we are dealing with integer mode
+	 masks with AVX512 we can operate on the integer representation
+	 performing the whole vector shifting.  */
+      unsigned HOST_WIDE_INT factor;
+      bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type),
+				     TYPE_VECTOR_SUBPARTS (vectype), &factor);
+      gcc_assert (ok);
+      gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT);
+      tree mask_type = truth_type_for (vectype);
+      gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT);
+      unsigned vi = index / factor;
+      unsigned vpart = index % factor;
+      tree vec = rgm->controls[vi];
       gimple_seq seq = NULL;
-      mask_type = truth_type_for (vectype);
-      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
+      vec = gimple_build (&seq, VIEW_CONVERT_EXPR,
+			  lang_hooks.types.type_for_mode
+				(TYPE_MODE (rgm->type), 1), vec);
+      /* For integer mode masks simply shift the right bits into position.  */
+      if (vpart != 0)
+	vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec,
+			    build_int_cst (integer_type_node, vpart * nunits));
+      vec = gimple_convert (&seq, lang_hooks.types.type_for_mode
+				    (TYPE_MODE (mask_type), 1), vec);
+      vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec);
       if (seq)
 	gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
+      return vec;
     }
-  return mask;
+  else
+    gcc_unreachable ();
 }
 
 /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 767a0774d45..42161778dc1 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -300,6 +300,13 @@ public:
 #define SLP_TREE_LANES(S)			 (S)->lanes
 #define SLP_TREE_CODE(S)			 (S)->code
 
+enum vect_partial_vector_style {
+    vect_partial_vectors_none,
+    vect_partial_vectors_while_ult,
+    vect_partial_vectors_avx512,
+    vect_partial_vectors_len
+};
+
 /* Key for map that records association between
    scalar conditions and corresponding loop mask, and
    is populated by vect_record_loop_mask.  */
@@ -605,6 +612,10 @@ struct rgroup_controls {
      specified number of elements; the type of the elements doesn't matter.  */
   tree type;
 
+  /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this
+     is the rgroup specific type used.  */
+  tree compare_type;
+
   /* A vector of nV controls, in iteration order.  */
   vec<tree> controls;
 
@@ -613,7 +624,24 @@ struct rgroup_controls {
   tree bias_adjusted_ctrl;
 };
 
-typedef auto_vec<rgroup_controls> vec_loop_masks;
+struct vec_loop_masks
+{
+  bool is_empty () const { return mask_set.is_empty (); }
+
+  typedef pair_hash <nofree_ptr_hash <tree_node>,
+		     int_hash<unsigned, 0>> mp_hash;
+  hash_set<mp_hash> mask_set;
+
+  /* Default storage for rgroup_controls.  */
+  auto_vec<rgroup_controls> rgc_vec;
+
+  /* The vect_partial_vectors_avx512 style uses a hash-map.  */
+  hash_map<std::pair<unsigned /* nscalars_per_iter */,
+		     unsigned /* nlanes */>, rgroup_controls *,
+	   simple_hashmap_traits<pair_hash <int_hash<unsigned, 0>,
+					    int_hash<unsigned, 0>>,
+				 rgroup_controls *>> rgc_map;
+};
 
 typedef auto_vec<rgroup_controls> vec_loop_lens;
 
@@ -741,6 +769,10 @@ public:
      LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
   tree rgroup_iv_type;
 
+  /* The style used for implementing partial vectors when
+     LOOP_VINFO_USING_PARTIAL_VECTORS_P is true.  */
+  vect_partial_vector_style partial_vector_style;
+
   /* Unknown DRs according to which loop was peeled.  */
   class dr_vec_info *unaligned_dr;
 
@@ -914,6 +946,7 @@ public:
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
 #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
+#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style
 #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
 #define LOOP_VINFO_N_STMTS(L)		   (L)->shared->n_stmts
 #define LOOP_VINFO_LOOP_NEST(L)            (L)->shared->loop_nest
-- 
2.35.3


end of thread, other threads:[~2023-06-15 16:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230614115450.28CEA3858288@sourceware.org>
2023-06-14 14:26 ` [PATCH 3/3] AVX512 fully masked vectorization Andrew Stubbs
2023-06-14 14:29   ` Richard Biener
2023-06-15  5:50     ` Liu, Hongtao
2023-06-15  6:51       ` Richard Biener
2023-06-15  9:26     ` Andrew Stubbs
2023-06-15  9:58       ` Richard Biener
2023-06-15 10:13         ` Andrew Stubbs
2023-06-15 11:06           ` Richard Biener
2023-06-15 13:04             ` Andrew Stubbs
2023-06-15 13:34               ` Richard Biener
2023-06-15 13:52                 ` Andrew Stubbs
2023-06-15 14:00                   ` Richard Biener
2023-06-15 14:04                     ` Andrew Stubbs
2023-06-15 16:16                       ` Richard Biener
2023-06-15  9:58       ` Richard Sandiford
     [not found] <20230614115429.D400C3858433@sourceware.org>
2023-06-14 18:45 ` Richard Sandiford
2023-06-15 12:14   ` Richard Biener
2023-06-15 12:53     ` Richard Biener
2023-06-14 11:54 Richard Biener
