Date: Thu, 15 Jun 2023 12:53:29 +0000 (UTC)
From: Richard Biener
To: Richard Sandiford
Cc: Richard Biener via Gcc-patches, Jan Hubicka, hongtao.liu@intel.com,
    kirill.yukhin@gmail.com, ams@codesourcery.com
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization

On Thu, 15 Jun 2023, Richard Biener wrote:

> On Wed, 14 Jun 2023, Richard Sandiford wrote:
> 
> > Richard Biener via Gcc-patches writes:
> > > This implements fully masked vectorization or a masked epilog for
> > > AVX512 style masks, which single themselves out by representing
> > > each lane with a single bit and by using integer modes for the mask
> > > (both much like GCN).
> > > 
> > > AVX512 is also special in that it doesn't have any instruction
> > > to compute the mask from a scalar IV like SVE has with while_ult.
> > > Instead the masks are produced by vector compares and the loop
> > > control retains the scalar IV (mainly to avoid dependences on
> > > mask generation; a suitable mask test instruction is available).
> > > 
> > > Like RVV, code generation prefers a decrementing IV, though IVOPTs
> > > messes things up in some cases, removing that IV and replacing it
> > > with an incrementing one used for address generation.
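> > > 
> > > To illustrate the scheme, a fully masked loop in this style
> > > corresponds roughly to the following hand-written sketch using
> > > AVX512 intrinsics (illustration only; the function and the names
> > > in it are made up, and the vectorizer of course emits the
> > > equivalent GIMPLE rather than intrinsics):
> > > 
> > >   /* Sketch only: fully masked vector add with VF = 16 ints.  */
> > >   #include <immintrin.h>
> > >   #include <stddef.h>
> > > 
> > >   void
> > >   vadd_masked (int *dst, const int *src, size_t n)
> > >   {
> > >     /* Constant lane indices 0..VF-1; compared against the splat
> > >        remaining iteration count to produce the mask.  */
> > >     const __m512i lane
> > >       = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
> > >                            8, 9, 10, 11, 12, 13, 14, 15);
> > >     /* Decrementing scalar IV of remaining scalar iterations.  */
> > >     for (size_t rem = n; rem > 0; rem = rem > 16 ? rem - 16 : 0)
> > >       {
> > >         /* Saturate the remaining count to VF and splat it.  */
> > >         int r = rem > 16 ? 16 : (int) rem;
> > >         __mmask16 k
> > >           = _mm512_cmplt_epi32_mask (lane, _mm512_set1_epi32 (r));
> > >         size_t base = n - rem;
> > >         /* Masked loads/stores suppress faults on inactive lanes.  */
> > >         __m512i x = _mm512_maskz_loadu_epi32 (k, src + base);
> > >         __m512i y = _mm512_maskz_loadu_epi32 (k, dst + base);
> > >         _mm512_mask_storeu_epi32 (dst + base, k,
> > >                                   _mm512_add_epi32 (x, y));
> > >       }
> > >   }
> > > 
> > > The mask here is a plain integer (__mmask16), one bit per lane,
> > > which is what "using integer modes for the mask" refers to.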
> > > 
> > > One of the motivating testcases is from PR108410, which in turn
> > > is extracted from x264, where large-size vectorization shows
> > > issues with small trip-count loops.  Execution time there improves
> > > compared to classic AVX512 with AVX2 epilogues for the cases
> > > of fewer than 32 iterations.
> > > 
> > >   size   scalar     128     256     512    512e    512f
> > >      1     9.42   11.32    9.35   11.17   15.13   16.89
> > >      2     5.72    6.53    6.66    6.66    7.62    8.56
> > >      3     4.49    5.10    5.10    5.74    5.08    5.73
> > >      4     4.10    4.33    4.29    5.21    3.79    4.25
> > >      6     3.78    3.85    3.86    4.76    2.54    2.85
> > >      8     3.64    1.89    3.76    4.50    1.92    2.16
> > >     12     3.56    2.21    3.75    4.26    1.26    1.42
> > >     16     3.36    0.83    1.06    4.16    0.95    1.07
> > >     20     3.39    1.42    1.33    4.07    0.75    0.85
> > >     24     3.23    0.66    1.72    4.22    0.62    0.70
> > >     28     3.18    1.09    2.04    4.20    0.54    0.61
> > >     32     3.16    0.47    0.41    0.41    0.47    0.53
> > >     34     3.16    0.67    0.61    0.56    0.44    0.50
> > >     38     3.19    0.95    0.95    0.82    0.40    0.45
> > >     42     3.09    0.58    1.21    1.13    0.36    0.40
> > > 
> > > 'size' specifies the number of actual iterations; 512e is for
> > > a masked epilog and 512f for the fully masked loop.  From
> > > 4 scalar iterations on, the AVX512 masked-epilog code is clearly
> > > the winner; the fully masked variant is clearly worse and
> > > its size benefit is also tiny.
> > > 
> > > This patch does not enable using fully masked loops or
> > > masked epilogues by default.  More work on cost modeling
> > > and vectorization kind selection on x86_64 is necessary
> > > for this.
> > > 
> > > Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE,
> > > which could be exploited further to unify some of the flags
> > > we have right now, but there didn't seem to be many easy things
> > > to merge, so I'm leaving this for followups.
> > > 
> > > Mask requirements as registered by vect_record_loop_mask are now
> > > kept in their original form and recorded in a hash_set instead of
> > > being processed into a vector of rgroup_controls.  That processing
> > > is now left to the final analysis phase, which tries forming the
> > > rgroup_controls vector using while_ult and, if that fails, tries
> > > AVX512 style masking, which needs a different organization and
> > > instead fills a hash_map with the relevant info.
> > > vect_get_loop_mask now has two implementations, one for each of
> > > the two mask styles we then have.
> > > 
> > > I have decided against interweaving vect_set_loop_condition_partial_vectors
> > > with conditions to do AVX512 style masking and instead opted to
> > > "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > > Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> > > 
> > > I was torn about making 'vec_loop_masks' a class with methods,
> > > possibly merging in the _len stuff into a single registry, but it
> > > seemed to be too many changes for the purpose of getting AVX512
> > > working.  I'm going to wait and see what happens with RISC-V here,
> > > since they are going to get both masks and lengths registered,
> > > I think.
> > > 
> > > The vect_prepare_for_masked_peels hunk might run into issues with
> > > SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > > looked odd.
> > > 
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> > > the testsuite with --param vect-partial-vector-usage=2 with and
> > > without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> > > and one latent wrong-code (PR110237).
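> > > 
> > > For reference, the masked-epilogue and fully masked code paths can
> > > be exercised with a command along these lines (the -march value is
> > > just an example of an AVX512-capable target, not something the
> > > patch mandates):
> > > 
> > >   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 \
> > >       --param vect-partial-vector-usage=2 [-fno-vect-cost-model] t.c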
> > > > > > There's followup work to be done to try enabling masked epilogues > > > for x86-64 by default (when AVX512 is enabled, possibly only when > > > -mprefer-vector-width=512). Getting cost modeling and decision > > > right is going to be challenging. > > > > > > Any comments? > > > > > > OK? > > > > Some comments below, but otherwise LGTM FWIW. > > > > > Btw, testing on GCN would be welcome - the _avx512 paths could > > > work for it so in case the while_ult path fails (not sure if > > > it ever does) it could get _avx512 style masking. Likewise > > > testing on ARM just to see I didn't break anything here. > > > I don't have SVE hardware so testing is probably meaningless. > > > > > > Thanks, > > > Richard. > > > > > > * tree-vectorizer.h (enum vect_partial_vector_style): New. > > > (_loop_vec_info::partial_vector_style): Likewise. > > > (LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise. > > > (rgroup_controls::compare_type): Add. > > > (vec_loop_masks): Change from a typedef to auto_vec<> > > > to a structure. > > > * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors): > > > Adjust. > > > (vect_set_loop_condition_partial_vectors_avx512): New function > > > implementing the AVX512 partial vector codegen. > > > (vect_set_loop_condition): Dispatch to the correct > > > vect_set_loop_condition_partial_vectors_* function based on > > > LOOP_VINFO_PARTIAL_VECTORS_STYLE. > > > (vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS > > > in the original niter type. > > > * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize > > > partial_vector_style. > > > (_loop_vec_info::~_loop_vec_info): Release the hash-map recorded > > > rgroup_controls. > > > (can_produce_all_loop_masks_p): Adjust. > > > (vect_verify_full_masking): Produce the rgroup_controls vector > > > here. Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success. > > > (vect_verify_full_masking_avx512): New function implementing > > > verification of AVX512 style masking. > > > (vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE. > > > (vect_analyze_loop_2): Also try AVX512 style masking. > > > Adjust condition. > > > (vect_estimate_min_profitable_iters): Implement AVX512 style > > > mask producing cost. > > > (vect_record_loop_mask): Do not build the rgroup_controls > > > vector here but record masks in a hash-set. > > > (vect_get_loop_mask): Implement AVX512 style mask query, > > > complementing the existing while_ult style. > > > --- > > > gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++- > > > gcc/tree-vect-loop.cc | 413 +++++++++++++++++++++++++++++++----- > > > gcc/tree-vectorizer.h | 35 ++- > > > 3 files changed, 651 insertions(+), 61 deletions(-) > > > > > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > > > index 1c8100c1a1c..f0ecaec28f4 100644 > > > --- a/gcc/tree-vect-loop-manip.cc > > > +++ b/gcc/tree-vect-loop-manip.cc > > > @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3. If not see > > > #include "insn-config.h" > > > #include "rtl.h" > > > #include "recog.h" > > > +#include "langhooks.h" > > > +#include "tree-vector-builder.h" > > > +#include "optabs-tree.h" > > > > > > /************************************************************************* > > > Simple Loop Peeling Utilities > > > @@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop, > > > rgroup_controls *iv_rgc = nullptr; > > > unsigned int i; > > > auto_vec *controls = use_masks_p > > > - ? &LOOP_VINFO_MASKS (loop_vinfo) > > > + ? 
&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec > > > : &LOOP_VINFO_LENS (loop_vinfo); > > > FOR_EACH_VEC_ELT (*controls, i, rgc) > > > if (!rgc->controls.is_empty ()) > > > @@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop, > > > return cond_stmt; > > > } > > > > > > +/* Set up the iteration condition and rgroup controls for LOOP in AVX512 > > > + style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the > > > + vectorized loop. LOOP_VINFO describes the vectorization of LOOP. NITERS is > > > + the number of iterations of the original scalar loop that should be > > > + handled by the vector loop. NITERS_MAYBE_ZERO and FINAL_IV are as > > > + for vect_set_loop_condition. > > > + > > > + Insert the branch-back condition before LOOP_COND_GSI and return the > > > + final gcond. */ > > > + > > > +static gcond * > > > +vect_set_loop_condition_partial_vectors_avx512 (class loop *loop, > > > + loop_vec_info loop_vinfo, tree niters, > > > + tree final_iv, > > > + bool niters_maybe_zero, > > > + gimple_stmt_iterator loop_cond_gsi) > > > +{ > > > + tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo); > > > + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo); > > > + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > > > + tree orig_niters = niters; > > > + gimple_seq preheader_seq = NULL; > > > + > > > + /* Create an IV that counts down from niters and whose step > > > + is the number of iterations processed in the current iteration. > > > + Produce the controls with compares like the following. > > > + > > > + # iv_2 = PHI > > > + rem_4 = MIN ; > > > + remv_6 = { rem_4, rem_4, rem_4, ... } > > > + mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6; > > > + iv_3 = iv_2 - VF; > > > + if (iv_2 > VF) > > > + continue; > > > + > > > + Where the constant is built with elements at most VF - 1 and > > > + repetitions according to max_nscalars_per_iter which is guarnateed > > > + to be the same within a group. */ > > > + > > > + /* Convert NITERS to the determined IV type. */ > > > + if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters)) > > > + && niters_maybe_zero) > > > + { > > > + /* We know that there is always at least one iteration, so if the > > > + count is zero then it must have wrapped. Cope with this by > > > + subtracting 1 before the conversion and adding 1 to the result. */ > > > + gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters))); > > > + niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters), > > > + niters, build_minus_one_cst (TREE_TYPE (niters))); > > > + niters = gimple_convert (&preheader_seq, iv_type, niters); > > > + niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type, > > > + niters, build_one_cst (iv_type)); > > > + } > > > + else > > > + niters = gimple_convert (&preheader_seq, iv_type, niters); > > > + > > > + /* Bias the initial value of the IV in case we need to skip iterations > > > + at the beginning. */ > > > + tree niters_adj = niters; > > > + if (niters_skip) > > > + { > > > + tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip); > > > + niters_adj = gimple_build (&preheader_seq, PLUS_EXPR, > > > + iv_type, niters, skip); > > > + } > > > + > > > + /* The iteration step is the vectorization factor. */ > > > + tree iv_step = build_int_cst (iv_type, vf); > > > + > > > + /* Create the decrement IV. 
*/ > > > + tree index_before_incr, index_after_incr; > > > + gimple_stmt_iterator incr_gsi; > > > + bool insert_after; > > > + standard_iv_increment_position (loop, &incr_gsi, &insert_after); > > > + create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop, > > > + &incr_gsi, insert_after, &index_before_incr, > > > + &index_after_incr); > > > + > > > + /* Iterate over all the rgroups and fill in their controls. */ > > > + for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map) > > > + { > > > + rgroup_controls *rgc = rgcm.second; > > > + if (rgc->controls.is_empty ()) > > > + continue; > > > + > > > + tree ctrl_type = rgc->type; > > > + poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type); > > > + > > > + tree vectype = rgc->compare_type; > > > + > > > + /* index_after_incr is the IV specifying the remaining iterations in > > > + the next iteration. */ > > > + tree rem = index_after_incr; > > > + /* When the data type for the compare to produce the mask is > > > + smaller than the IV type we need to saturate. Saturate to > > > + the smallest possible value (IV_TYPE) so we only have to > > > + saturate once (CSE will catch redundant ones we add). */ > > > + if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type)) > > > + rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING, > > > + UNKNOWN_LOCATION, > > > + MIN_EXPR, TREE_TYPE (rem), rem, iv_step); > > > + rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING, > > > + UNKNOWN_LOCATION, TREE_TYPE (vectype), rem); > > > + > > > + /* Build a data vector composed of the remaining iterations. */ > > > + rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING, > > > + UNKNOWN_LOCATION, vectype, rem); > > > + > > > + /* Provide a definition of each vector in the control group. */ > > > + tree next_ctrl = NULL_TREE; > > > + tree first_rem = NULL_TREE; > > > + tree ctrl; > > > + unsigned int i; > > > + FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl) > > > + { > > > + /* Previous controls will cover BIAS items. This control covers the > > > + next batch. */ > > > + poly_uint64 bias = nitems_per_ctrl * i; > > > + > > > + /* Build the constant to compare the remaining iters against, > > > + this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately > > > + split into pieces. */ > > > + unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant (); > > > + tree_vector_builder builder (vectype, n, 1); > > > + for (unsigned i = 0; i < n; ++i) > > > + { > > > + unsigned HOST_WIDE_INT val > > > + = (i + bias.to_constant ()) / rgc->max_nscalars_per_iter; > > > + gcc_assert (val < vf.to_constant ()); > > > + builder.quick_push (build_int_cst (TREE_TYPE (vectype), val)); > > > + } > > > + tree cmp_series = builder.build (); > > > + > > > + /* Create the initial control. First include all items that > > > + are within the loop limit. */ > > > + tree init_ctrl = NULL_TREE; > > > + poly_uint64 const_limit; > > > + /* See whether the first iteration of the vector loop is known > > > + to have a full control. */ > > > + if (poly_int_tree_p (niters, &const_limit) > > > + && known_ge (const_limit, (i + 1) * nitems_per_ctrl)) > > > + init_ctrl = build_minus_one_cst (ctrl_type); > > > + else > > > + { > > > + /* The remaining work items initially are niters. Saturate, > > > + splat and compare. 
*/ > > > + if (!first_rem) > > > + { > > > + first_rem = niters; > > > + if (TYPE_PRECISION (TREE_TYPE (vectype)) > > > + < TYPE_PRECISION (iv_type)) > > > + first_rem = gimple_build (&preheader_seq, > > > + MIN_EXPR, TREE_TYPE (first_rem), > > > + first_rem, iv_step); > > > + first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype), > > > + first_rem); > > > + first_rem = gimple_build_vector_from_val (&preheader_seq, > > > + vectype, first_rem); > > > + } > > > + init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type, > > > + cmp_series, first_rem); > > > + } > > > + > > > + /* Now AND out the bits that are within the number of skipped > > > + items. */ > > > + poly_uint64 const_skip; > > > + if (niters_skip > > > + && !(poly_int_tree_p (niters_skip, &const_skip) > > > + && known_le (const_skip, bias))) > > > + { > > > + /* For integer mode masks it's cheaper to shift out the bits > > > + since that avoids loading a constant. */ > > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT); > > > + init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR, > > > + lang_hooks.types.type_for_mode > > > + (TYPE_MODE (ctrl_type), 1), > > > + init_ctrl); > > > + /* ??? But when the shift amount isn't constant this requires > > > + a round-trip to GRPs. We could apply the bias to either > > > + side of the compare instead. */ > > > + tree shift = gimple_build (&preheader_seq, MULT_EXPR, > > > + TREE_TYPE (niters_skip), > > > + niters_skip, > > > + build_int_cst (TREE_TYPE (niters_skip), > > > + rgc->max_nscalars_per_iter)); > > > + init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR, > > > + TREE_TYPE (init_ctrl), > > > + init_ctrl, shift); > > > + init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR, > > > + ctrl_type, init_ctrl); > > > > It looks like this assumes that either the first lane or the last lane > > of the first mask is always true, is that right? I'm not sure we ever > > prove that, at least not for SVE. There it's possible to have inactive > > elements at both ends of the same mask. > > It builds a mask for the first iteration without considering niters_skip > and then shifts in inactive lanes for niters_skip. As I'm using > VIEW_CONVERT_EXPR this indeed can cause bits outside of the range > relevant for the vector mask to be set but at least x86 ignores those > (so I'm missing a final AND with TYPE_PRECISION of the mask). > > So I think it should work correctly. For variable niters_skip > it might be faster to build a mask based on niter_skip adjusted niters > and then AND since we can build both masks in parallel. Any > variable niters_skip shifting has to be done in GPRs as the mask > register ops only can perform shifts by immediates. > > > > + } > > > + > > > + /* Get the control value for the next iteration of the loop. */ > > > + next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING, > > > + UNKNOWN_LOCATION, > > > + LT_EXPR, ctrl_type, cmp_series, rem); > > > + > > > + vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl); > > > + } > > > + } > > > + > > > + /* Emit all accumulated statements. */ > > > + add_preheader_seq (loop, preheader_seq); > > > + > > > + /* Adjust the exit test using the decrementing IV. */ > > > + edge exit_edge = single_exit (loop); > > > + tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? 
LE_EXPR : GT_EXPR; > > > + /* When we peel for alignment with niter_skip != 0 this can > > > + cause niter + niter_skip to wrap and since we are comparing the > > > + value before the decrement here we get a false early exit. > > > + We can't compare the value after decrement either because that > > > + decrement could wrap as well as we're not doing a saturating > > > + decrement. To avoid this situation we force a larger > > > + iv_type. */ > > > + gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step, > > > + NULL_TREE, NULL_TREE); > > > + gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT); > > > + > > > + /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times. > > > + Subtract one from this to get the latch count. */ > > > + tree niters_minus_one > > > + = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters, > > > + build_minus_one_cst (TREE_TYPE (orig_niters))); > > > + tree niters_adj2 = fold_convert (iv_type, niters_minus_one); > > > + if (niters_skip) > > > + niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one, > > > + fold_convert (iv_type, niters_skip)); > > > + loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type, > > > + niters_adj2, iv_step); > > > + > > > + if (final_iv) > > > + { > > > + gassign *assign = gimple_build_assign (final_iv, orig_niters); > > > + gsi_insert_on_edge_immediate (single_exit (loop), assign); > > > + } > > > + > > > + return cond_stmt; > > > +} > > > + > > > + > > > /* Like vect_set_loop_condition, but handle the case in which the vector > > > loop handles exactly VF scalars per iteration. */ > > > > > > @@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo, > > > gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond); > > > > > > if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) > > > - cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo, > > > - niters, final_iv, > > > - niters_maybe_zero, > > > - loop_cond_gsi); > > > + { > > > + if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512) > > > + cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo, > > > + niters, final_iv, > > > + niters_maybe_zero, > > > + loop_cond_gsi); > > > + else > > > + cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo, > > > + niters, final_iv, > > > + niters_maybe_zero, > > > + loop_cond_gsi); > > > + } > > > else > > > cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv, > > > niters_maybe_zero, > > > @@ -2030,7 +2281,8 @@ void > > > vect_prepare_for_masked_peels (loop_vec_info loop_vinfo) > > > { > > > tree misalign_in_elems; > > > - tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); > > > + /* ??? With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end. */ > > > + tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)); > > > > Like you say, I think this might cause problems for SVE, since > > LOOP_VINFO_MASK_SKIP_NITERS is assumed to be the compare type > > in vect_set_loop_controls_directly. Not sure what the best way > > around that is. > > We should be able to do the conversion there, we could also simply > leave it at what get_misalign_in_elems uses (an unsigned type > with the width of a pointer). The issue with the AVX512 style > masking is that there's not a single compare type so I left > LOOP_VINFO_RGROUP_COMPARE_TYPE as error_mark_node to catch > any erroneous uses. > > Any preference here? 
> > > > > > > gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo)); > > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > > index 1897e720389..9be66b8fbc5 100644 > > > --- a/gcc/tree-vect-loop.cc > > > +++ b/gcc/tree-vect-loop.cc > > > @@ -55,6 +55,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "vec-perm-indices.h" > > > #include "tree-eh.h" > > > #include "case-cfn-macros.h" > > > +#include "langhooks.h" > > > > > > /* Loop Vectorization Pass. > > > > > > @@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) > > > mask_skip_niters (NULL_TREE), > > > rgroup_compare_type (NULL_TREE), > > > simd_if_cond (NULL_TREE), > > > + partial_vector_style (vect_partial_vectors_none), > > > unaligned_dr (NULL), > > > peeling_for_alignment (0), > > > ptr_mask (0), > > > @@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info () > > > { > > > free (bbs); > > > > > > - release_vec_loop_controls (&masks); > > > + for (auto m : masks.rgc_map) > > > + { > > > + m.second->controls.release (); > > > + delete m.second; > > > + } > > > + release_vec_loop_controls (&masks.rgc_vec); > > > release_vec_loop_controls (&lens); > > > delete ivexpr_map; > > > delete scan_map; > > > @@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type) > > > { > > > rgroup_controls *rgm; > > > unsigned int i; > > > - FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm) > > > + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm) > > > if (rgm->type != NULL_TREE > > > && !direct_internal_fn_supported_p (IFN_WHILE_ULT, > > > cmp_type, rgm->type, > > > @@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo) > > > if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ()) > > > return false; > > > > > > + /* Produce the rgroup controls. */ > > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set) > > > + { > > > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > > > + tree vectype = mask.first; > > > + unsigned nvectors = mask.second; > > > + > > > + if (masks->rgc_vec.length () < nvectors) > > > + masks->rgc_vec.safe_grow_cleared (nvectors, true); > > > + rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1]; > > > + /* The number of scalars per iteration and the number of vectors are > > > + both compile-time constants. */ > > > + unsigned int nscalars_per_iter > > > + = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype), > > > + LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant (); > > > + > > > + if (rgm->max_nscalars_per_iter < nscalars_per_iter) > > > + { > > > + rgm->max_nscalars_per_iter = nscalars_per_iter; > > > + rgm->type = truth_type_for (vectype); > > > + rgm->factor = 1; > > > + } > > > + } > > > + > > > /* Calculate the maximum number of scalars per iteration for every rgroup. 
*/ > > > unsigned int max_nscalars_per_iter = 1; > > > - for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo)) > > > + for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec) > > > max_nscalars_per_iter > > > = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter); > > > > > > @@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo) > > > } > > > > > > if (!cmp_type) > > > - return false; > > > + { > > > + LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release (); > > > + return false; > > > + } > > > > > > LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type; > > > LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type; > > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult; > > > + return true; > > > +} > > > + > > > +/* Each statement in LOOP_VINFO can be masked where necessary. Check > > > + whether we can actually generate AVX512 style masks. Return true if so, > > > + storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE. */ > > > + > > > +static bool > > > +vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo) > > > +{ > > > + /* Produce differently organized rgc_vec and differently check > > > + we can produce masks. */ > > > + > > > + /* Use a normal loop if there are no statements that need masking. > > > + This only happens in rare degenerate cases: it means that the loop > > > + has no loads, no stores, and no live-out values. */ > > > + if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ()) > > > + return false; > > > + > > > + /* For the decrementing IV we need to represent all values in > > > + [0, niter + niter_skip] where niter_skip is the elements we > > > + skip in the first iteration for prologue peeling. */ > > > + tree iv_type = NULL_TREE; > > > + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo); > > > + unsigned int iv_precision = UINT_MAX; > > > + if (iv_limit != -1) > > > + iv_precision = wi::min_precision (iv_limit, UNSIGNED); > > > + > > > + /* First compute the type for the IV we use to track the remaining > > > + scalar iterations. */ > > > + opt_scalar_int_mode cmp_mode_iter; > > > + FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT) > > > + { > > > + unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ()); > > > + if (cmp_bits >= iv_precision > > > + && targetm.scalar_mode_supported_p (cmp_mode_iter.require ())) > > > + { > > > + iv_type = build_nonstandard_integer_type (cmp_bits, true); > > > + if (iv_type) > > > + break; > > > + } > > > + } > > > + if (!iv_type) > > > + return false; > > > + > > > + /* Produce the rgroup controls. */ > > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set) > > > + { > > > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > > > + tree vectype = mask.first; > > > + unsigned nvectors = mask.second; > > > + > > > + /* The number of scalars per iteration and the number of vectors are > > > + both compile-time constants. */ > > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); > > > + unsigned int nscalars_per_iter > > > + = nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant (); > > > > The comment seems to be borrowed from: > > > > /* The number of scalars per iteration and the number of vectors are > > both compile-time constants. 
*/ > > unsigned int nscalars_per_iter > > = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype), > > LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant (); > > > > but the calculation instead applies to_constant to the VF and to the > > number of vector elements, which aren't in general known to be constant. > > Does the exact_div not work here too? > > It failed the overload when I use a constant nunits, but yes, the > original code works fine, but I need ... > > > Avoiding the current to_constants here only moves the problem elsewhere > > though. Since this is a verification routine, I think the should > > instead use is_constant and fail if it is false. > > > > > + /* We key off a hash-map with nscalars_per_iter and the number of total > > > + lanes in the mask vector and then remember the needed vector mask > > > + with the largest number of lanes (thus the fewest nV). */ > > > + bool existed; > > > + rgroup_controls *& rgc > > > + = masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter, > > > + nvectors * nunits), > > ^^^ > > this decomposition to constants (well, OK - I could have used a > poly-int for the second half of the pair). > > I did wonder if I could restrict this more - I need different groups > for different nscalars_per_iter but also when the total number of > work items is different. Somehow I think it's really only > nscalars_per_iter that's interesting but I didn't get to convince > myself that nvectors * nunits is going to be always the same > when I vary the vector size within a loop. Well, yes, nvectors * nunits / nscalars_per_iter is the vectorization factor which is invariant. If nscalars_per_iter is invariant within the group then nvectors * nunits should be as well. That should simplify things. Richard. > > > + &existed); > > > + if (!existed) > > > + { > > > + rgc = new rgroup_controls (); > > > + rgc->type = truth_type_for (vectype); > > > + rgc->compare_type = NULL_TREE; > > > + rgc->max_nscalars_per_iter = nscalars_per_iter; > > > + rgc->factor = 1; > > > + rgc->bias_adjusted_ctrl = NULL_TREE; > > > + } > > > + else > > > + { > > > + gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter); > > > + if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type), > > > + TYPE_VECTOR_SUBPARTS (vectype))) > > > + rgc->type = truth_type_for (vectype); > > > + } > > > + } > > > + > > > + /* There is no fixed compare type we are going to use but we have to > > > + be able to get at one for each mask group. */ > > > + unsigned int min_ni_width > > > + = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED); > > > + > > > + bool ok = true; > > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map) > > > + { > > > + rgroup_controls *rgc = mask.second; > > > + tree mask_type = rgc->type; > > > + if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1) > > > + { > > > + ok = false; > > > + break; > > > + } > > > + > > > + /* If iv_type is usable as compare type use that - we can elide the > > > + saturation in that case. 
*/ > > > + if (TYPE_PRECISION (iv_type) >= min_ni_width) > > > + { > > > + tree cmp_vectype > > > + = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type)); > > > + if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR)) > > > + rgc->compare_type = cmp_vectype; > > > + } > > > + if (!rgc->compare_type) > > > + FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT) > > > + { > > > + unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ()); > > > + if (cmp_bits >= min_ni_width > > > + && targetm.scalar_mode_supported_p (cmp_mode_iter.require ())) > > > + { > > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits, true); > > > + if (!cmp_type) > > > + continue; > > > + > > > + /* Check whether we can produce the mask with cmp_type. */ > > > + tree cmp_vectype > > > + = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type)); > > > + if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR)) > > > + { > > > + rgc->compare_type = cmp_vectype; > > > + break; > > > + } > > > + } > > > + } > > > > Just curious: is this fallback loop ever used in practice? > > TYPE_PRECISION (iv_type) >= min_ni_width seems like an easy condition > > to satisfy. > > I think the only remaining case for this particular condition > is an original unsigned IV and us peeling for alignment with a mask. > > But then of course the main case is that with say a unsigned long IV but > a QImode data type the x86 backend lacks a V64DImode vector type so we > cannot produce the mask mode for V64QImode data with a compare using a > data type based on that IV type. > > For 'int' IVs and int/float or double data we can always use the > original IV here. > > > > + if (!rgc->compare_type) > > > + { > > > + ok = false; > > > + break; > > > + } > > > + } > > > + if (!ok) > > > + { > > > + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty (); > > > + return false; > > > > It looks like this leaks the rgroup_controls created above. > > Fixed. > > > > + } > > > + > > > + LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node; > > > + LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type; > > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512; > > > return true; > > > } > > > > > > @@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo) > > > > > > LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type; > > > LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type; > > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len; > > > > > > return true; > > > } > > > @@ -2712,16 +2893,24 @@ start_over: > > > > > > /* If we still have the option of using partial vectors, > > > check whether we can generate the necessary loop controls. 
*/ > > > - if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > > > - && !vect_verify_full_masking (loop_vinfo) > > > - && !vect_verify_loop_lens (loop_vinfo)) > > > - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > > > + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > > > + { > > > + if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ()) > > > + { > > > + if (!vect_verify_full_masking (loop_vinfo) > > > + && !vect_verify_full_masking_avx512 (loop_vinfo)) > > > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > > > + } > > > + else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */ > > > + if (!vect_verify_loop_lens (loop_vinfo)) > > > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > > > + } > > > > > > /* If we're vectorizing a loop that uses length "controls" and > > > can iterate more than once, we apply decrementing IV approach > > > in loop control. */ > > > if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > > > - && !LOOP_VINFO_LENS (loop_vinfo).is_empty () > > > + && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len > > > && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0 > > > && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > > > && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo), > > > @@ -3022,7 +3211,7 @@ again: > > > delete loop_vinfo->vector_costs; > > > loop_vinfo->vector_costs = nullptr; > > > /* Reset accumulated rgroup information. */ > > > - release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo)); > > > + release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec); > > > release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo)); > > > /* Reset assorted flags. */ > > > LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false; > > > @@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > > > cond_branch_not_taken, vect_epilogue); > > > > > > /* Take care of special costs for rgroup controls of partial vectors. */ > > > - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > > > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) > > > + && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) > > > + == vect_partial_vectors_avx512)) > > > + { > > > + /* Calculate how many masks we need to generate. */ > > > + unsigned int num_masks = 0; > > > + bool need_saturation = false; > > > + for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map) > > > + { > > > + rgroup_controls *rgm = rgcm.second; > > > + unsigned nvectors > > > + = (rgcm.first.second > > > + / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ()); > > > + num_masks += nvectors; > > > + if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type)) > > > + < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo))) > > > + need_saturation = true; > > > + } > > > + > > > + /* ??? The target isn't able to identify the costs below as > > > + producing masks so it cannot penaltize cases where we'd run > > > + out of mask registers for example. */ > > > + > > > + /* In the worst case, we need to generate each mask in the prologue > > > + and in the loop body. We need one splat per group and one > > > + compare per mask. > > > + > > > + Sometimes we can use unpacks instead of generating prologue > > > + masks and sometimes the prologue mask will fold to a constant, > > > + so the actual prologue cost might be smaller. However, it's > > > + simpler and safer to use the worst-case cost; if this ends up > > > + being the tie-breaker between vectorizing or not, then it's > > > + probably better not to vectorize. 
*/ > > > > Not sure all of this applies to the AVX512 case. In particular, the > > unpacks bit doesn't seem relevant. > > Removed that part and noted we fail to account for the cost of dealing > with different nvector mask cases using the same wide masks > (the cases that vect_get_loop_mask constructs). For SVE it's > re-interpreting with VIEW_CONVERT that's done there which should be > free but for AVX512 (when we ever get multiple vector sizes in one > loop) to split a mask in half we need one shift for the upper part. > > > > + (void) add_stmt_cost (target_cost_data, > > > + num_masks > > > + + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (), > > > + vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue); > > > + (void) add_stmt_cost (target_cost_data, > > > + num_masks > > > + + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (), > > > + vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body); > > > + > > > + /* When we need saturation we need it both in the prologue and > > > + the epilogue. */ > > > + if (need_saturation) > > > + { > > > + (void) add_stmt_cost (target_cost_data, 1, scalar_stmt, > > > + NULL, NULL, NULL_TREE, 0, vect_prologue); > > > + (void) add_stmt_cost (target_cost_data, 1, scalar_stmt, > > > + NULL, NULL, NULL_TREE, 0, vect_body); > > > + } > > > + } > > > + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) > > > + && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) > > > + == vect_partial_vectors_avx512)) > > Eh, cut&past error here, should be == vect_partial_vectors_while_ult > > > > { > > > /* Calculate how many masks we need to generate. */ > > > unsigned int num_masks = 0; > > > rgroup_controls *rgm; > > > unsigned int num_vectors_m1; > > > - FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) > > > + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm) > > > > Nit: long line. > > fixed. > > > > if (rgm->type) > > > num_masks += num_vectors_m1 + 1; > > > gcc_assert (num_masks > 0); > > > @@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks, > > > unsigned int nvectors, tree vectype, tree scalar_mask) > > > { > > > gcc_assert (nvectors != 0); > > > - if (masks->length () < nvectors) > > > - masks->safe_grow_cleared (nvectors, true); > > > - rgroup_controls *rgm = &(*masks)[nvectors - 1]; > > > - /* The number of scalars per iteration and the number of vectors are > > > - both compile-time constants. */ > > > - unsigned int nscalars_per_iter > > > - = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype), > > > - LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant (); > > > > > > if (scalar_mask) > > > { > > > @@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks, > > > loop_vinfo->scalar_cond_masked_set.add (cond); > > > } > > > > > > - if (rgm->max_nscalars_per_iter < nscalars_per_iter) > > > - { > > > - rgm->max_nscalars_per_iter = nscalars_per_iter; > > > - rgm->type = truth_type_for (vectype); > > > - rgm->factor = 1; > > > - } > > > + masks->mask_set.add (std::make_pair (vectype, nvectors)); > > > } > > > > > > /* Given a complete set of masks MASKS, extract mask number INDEX > > > @@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks, > > > arrangement. 
*/ > > > > > > tree > > > -vect_get_loop_mask (loop_vec_info, > > > +vect_get_loop_mask (loop_vec_info loop_vinfo, > > > gimple_stmt_iterator *gsi, vec_loop_masks *masks, > > > unsigned int nvectors, tree vectype, unsigned int index) > > > { > > > - rgroup_controls *rgm = &(*masks)[nvectors - 1]; > > > - tree mask_type = rgm->type; > > > - > > > - /* Populate the rgroup's mask array, if this is the first time we've > > > - used it. */ > > > - if (rgm->controls.is_empty ()) > > > + if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) > > > + == vect_partial_vectors_while_ult) > > > { > > > - rgm->controls.safe_grow_cleared (nvectors, true); > > > - for (unsigned int i = 0; i < nvectors; ++i) > > > + rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1]; > > > + tree mask_type = rgm->type; > > > + > > > + /* Populate the rgroup's mask array, if this is the first time we've > > > + used it. */ > > > + if (rgm->controls.is_empty ()) > > > { > > > - tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask"); > > > - /* Provide a dummy definition until the real one is available. */ > > > - SSA_NAME_DEF_STMT (mask) = gimple_build_nop (); > > > - rgm->controls[i] = mask; > > > + rgm->controls.safe_grow_cleared (nvectors, true); > > > + for (unsigned int i = 0; i < nvectors; ++i) > > > + { > > > + tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask"); > > > + /* Provide a dummy definition until the real one is available. */ > > > + SSA_NAME_DEF_STMT (mask) = gimple_build_nop (); > > > + rgm->controls[i] = mask; > > > + } > > > } > > > - } > > > > > > - tree mask = rgm->controls[index]; > > > - if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type), > > > - TYPE_VECTOR_SUBPARTS (vectype))) > > > + tree mask = rgm->controls[index]; > > > + if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type), > > > + TYPE_VECTOR_SUBPARTS (vectype))) > > > + { > > > + /* A loop mask for data type X can be reused for data type Y > > > + if X has N times more elements than Y and if Y's elements > > > + are N times bigger than X's. In this case each sequence > > > + of N elements in the loop mask will be all-zero or all-one. > > > + We can then view-convert the mask so that each sequence of > > > + N elements is replaced by a single element. */ > > > + gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type), > > > + TYPE_VECTOR_SUBPARTS (vectype))); > > > + gimple_seq seq = NULL; > > > + mask_type = truth_type_for (vectype); > > > + /* We can only use re-use the mask by reinterpreting it if it > > > + occupies the same space, that is the mask with less elements > > > > Nit: fewer elements > > Ah this assert was also a left-over from earlier attempts, I've removed > it again. > > > > + uses multiple bits for each masked elements. */ > > > + gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask))) > > > + * TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)), > > > + TYPE_PRECISION (TREE_TYPE (mask_type)) > > > + * TYPE_VECTOR_SUBPARTS (mask_type))); > > > + mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask); > > > + if (seq) > > > + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); > > > + } > > > + return mask; > > > + } > > > + else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) > > > + == vect_partial_vectors_avx512) > > > { > > > - /* A loop mask for data type X can be reused for data type Y > > > - if X has N times more elements than Y and if Y's elements > > > - are N times bigger than X's. In this case each sequence > > > - of N elements in the loop mask will be all-zero or all-one. 
> > > - We can then view-convert the mask so that each sequence of > > > - N elements is replaced by a single element. */ > > > - gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type), > > > - TYPE_VECTOR_SUBPARTS (vectype))); > > > + /* The number of scalars per iteration and the number of vectors are > > > + both compile-time constants. */ > > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant (); > > > + unsigned int nscalars_per_iter > > > + = nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant (); > > > > Same disconnect between the comment and the code here. If we do use > > is_constant in vect_verify_full_masking_avx512 then we could reference > > that instead. > > I'm going to dig into this a bit, maybe I can simplify all this and > just have the vector indexed by nscalars_per_iter and get rid of the > hash-map. > > > > + > > > + rgroup_controls *rgm > > > + = *masks->rgc_map.get (std::make_pair (nscalars_per_iter, > > > + nvectors * nunits)); > > > + > > > + /* The stored nV is dependent on the mask type produced. */ > > > + nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype), > > > + TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant (); > > > + > > > + /* Populate the rgroup's mask array, if this is the first time we've > > > + used it. */ > > > + if (rgm->controls.is_empty ()) > > > + { > > > + rgm->controls.safe_grow_cleared (nvectors, true); > > > + for (unsigned int i = 0; i < nvectors; ++i) > > > + { > > > + tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask"); > > > + /* Provide a dummy definition until the real one is available. */ > > > + SSA_NAME_DEF_STMT (mask) = gimple_build_nop (); > > > + rgm->controls[i] = mask; > > > + } > > > + } > > > + if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type), > > > + TYPE_VECTOR_SUBPARTS (vectype))) > > > + return rgm->controls[index]; > > > + > > > + /* Split the vector if needed. Since we are dealing with integer mode > > > + masks with AVX512 we can operate on the integer representation > > > + performing the whole vector shifting. */ > > > + unsigned HOST_WIDE_INT factor; > > > + bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type), > > > + TYPE_VECTOR_SUBPARTS (vectype), &factor); > > > + gcc_assert (ok); > > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT); > > > + tree mask_type = truth_type_for (vectype); > > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT); > > > + unsigned vi = index / factor; > > > + unsigned vpart = index % factor; > > > + tree vec = rgm->controls[vi]; > > > gimple_seq seq = NULL; > > > - mask_type = truth_type_for (vectype); > > > - mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask); > > > + vec = gimple_build (&seq, VIEW_CONVERT_EXPR, > > > + lang_hooks.types.type_for_mode > > > + (TYPE_MODE (rgm->type), 1), vec); > > > + /* For integer mode masks simply shift the right bits into position. */ > > > + if (vpart != 0) > > > + vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec, > > > + build_int_cst (integer_type_node, vpart * nunits)); > > > + vec = gimple_convert (&seq, lang_hooks.types.type_for_mode > > > + (TYPE_MODE (mask_type), 1), vec); > > > + vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec); > > > > Would it be worth creating an rgroup_controls for each nunits derivative > > of each rgc_map entry? That way we'd share the calculations, and be > > able to cost them more accurately. > > > > Maybe it's not worth it, just asking. :) > > I suppose we could do that. 
So with AVX512 we have a common > num_scalars_per_iter but vary on nvectors. So we'd have to > compute a max_nvectors. > > I guess it's doable, I'll keep that in mind. > > Thanks, > Richard. > > > Thanks, > > Richard > > > > > if (seq) > > > gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); > > > + return vec; > > > } > > > - return mask; > > > + else > > > + gcc_unreachable (); > > > } > > > > > > /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > > index 767a0774d45..42161778dc1 100644 > > > --- a/gcc/tree-vectorizer.h > > > +++ b/gcc/tree-vectorizer.h > > > @@ -300,6 +300,13 @@ public: > > > #define SLP_TREE_LANES(S) (S)->lanes > > > #define SLP_TREE_CODE(S) (S)->code > > > > > > +enum vect_partial_vector_style { > > > + vect_partial_vectors_none, > > > + vect_partial_vectors_while_ult, > > > + vect_partial_vectors_avx512, > > > + vect_partial_vectors_len > > > +}; > > > + > > > /* Key for map that records association between > > > scalar conditions and corresponding loop mask, and > > > is populated by vect_record_loop_mask. */ > > > @@ -605,6 +612,10 @@ struct rgroup_controls { > > > specified number of elements; the type of the elements doesn't matter. */ > > > tree type; > > > > > > + /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this > > > + is the rgroup specific type used. */ > > > + tree compare_type; > > > + > > > /* A vector of nV controls, in iteration order. */ > > > vec controls; > > > > > > @@ -613,7 +624,24 @@ struct rgroup_controls { > > > tree bias_adjusted_ctrl; > > > }; > > > > > > -typedef auto_vec vec_loop_masks; > > > +struct vec_loop_masks > > > +{ > > > + bool is_empty () const { return mask_set.is_empty (); } > > > + > > > + typedef pair_hash , > > > + int_hash> mp_hash; > > > + hash_set mask_set; > > > + > > > + /* Default storage for rgroup_controls. */ > > > + auto_vec rgc_vec; > > > + > > > + /* The vect_partial_vectors_avx512 style uses a hash-map. */ > > > + hash_map > > + unsigned /* nlanes */>, rgroup_controls *, > > > + simple_hashmap_traits, > > > + int_hash>, > > > + rgroup_controls *>> rgc_map; > > > +}; > > > > > > typedef auto_vec vec_loop_lens; > > > > > > @@ -741,6 +769,10 @@ public: > > > LOOP_VINFO_USING_PARTIAL_VECTORS_P is true. */ > > > tree rgroup_iv_type; > > > > > > + /* The style used for implementing partial vectors when > > > + LOOP_VINFO_USING_PARTIAL_VECTORS_P is true. */ > > > + vect_partial_vector_style partial_vector_style; > > > + > > > /* Unknown DRs according to which loop was peeled. */ > > > class dr_vec_info *unaligned_dr; > > > > > > @@ -914,6 +946,7 @@ public: > > > #define LOOP_VINFO_MASK_SKIP_NITERS(L) (L)->mask_skip_niters > > > #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L) (L)->rgroup_compare_type > > > #define LOOP_VINFO_RGROUP_IV_TYPE(L) (L)->rgroup_iv_type > > > +#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style > > > #define LOOP_VINFO_PTR_MASK(L) (L)->ptr_mask > > > #define LOOP_VINFO_N_STMTS(L) (L)->shared->n_stmts > > > #define LOOP_VINFO_LOOP_NEST(L) (L)->shared->loop_nest > > > > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg)