From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>,
Jan Hubicka <hubicka@ucw.cz>,
hongtao.liu@intel.com, kirill.yukhin@gmail.com,
ams@codesourcery.com
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 12:53:29 +0000 (UTC) [thread overview]
Message-ID: <nycvar.YFH.7.77.849.2306151251330.4723@jbgna.fhfr.qr> (raw)
In-Reply-To: <nycvar.YFH.7.77.849.2306151116140.4723@jbgna.fhfr.qr>
On Thu, 15 Jun 2023, Richard Biener wrote:
> On Wed, 14 Jun 2023, Richard Sandiford wrote:
>
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > This implements fully masked vectorization or a masked epilog for
> > > AVX512-style masks, which single themselves out by representing
> > > each lane with a single bit and by using integer modes for the mask
> > > (both much like GCN).
> > >
> > > AVX512 is also special in that it doesn't have any instruction
> > > to compute the mask from a scalar IV like SVE has with while_ult.
> > > Instead, the masks are produced by vector compares and the loop
> > > control retains the scalar IV (mainly to avoid dependences on
> > > mask generation; a suitable mask test instruction is available).
> > >
> > > Like RVV, code generation prefers a decrementing IV, though IVOPTs
> > > messes things up in some cases, removing that IV and replacing it
> > > with an incrementing one used for address generation.
> > >
> > > One of the motivating testcases is from PR108410, which in turn
> > > is extracted from x264, where large-size vectorization shows
> > > issues with small-trip-count loops. Execution time there improves
> > > compared to classic AVX512 with AVX2 epilogues for the cases
> > > of fewer than 32 iterations.
> > >
> > > size scalar 128 256 512 512e 512f
> > > 1 9.42 11.32 9.35 11.17 15.13 16.89
> > > 2 5.72 6.53 6.66 6.66 7.62 8.56
> > > 3 4.49 5.10 5.10 5.74 5.08 5.73
> > > 4 4.10 4.33 4.29 5.21 3.79 4.25
> > > 6 3.78 3.85 3.86 4.76 2.54 2.85
> > > 8 3.64 1.89 3.76 4.50 1.92 2.16
> > > 12 3.56 2.21 3.75 4.26 1.26 1.42
> > > 16 3.36 0.83 1.06 4.16 0.95 1.07
> > > 20 3.39 1.42 1.33 4.07 0.75 0.85
> > > 24 3.23 0.66 1.72 4.22 0.62 0.70
> > > 28 3.18 1.09 2.04 4.20 0.54 0.61
> > > 32 3.16 0.47 0.41 0.41 0.47 0.53
> > > 34 3.16 0.67 0.61 0.56 0.44 0.50
> > > 38 3.19 0.95 0.95 0.82 0.40 0.45
> > > 42 3.09 0.58 1.21 1.13 0.36 0.40
> > >
> > > 'size' specifies the number of actual iterations; 512e is for
> > > a masked epilog and 512f for the fully masked loop. From
> > > 4 scalar iterations on, the AVX512 masked-epilog code is clearly
> > > the winner; the fully masked variant is clearly worse, and
> > > its size benefit is also tiny.
> > >
> > > This patch does not enable using fully masked loops or
> > > masked epilogues by default. More work on cost modeling
> > > and vectorization kind selection on x86_64 is necessary
> > > for this.
> > >
> > > Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> > > which could be exploited further to unify some of the flags
> > > we have right now but there didn't seem to be many easy things
> > > to merge, so I'm leaving this for followups.
> > >
> > > Mask requirements as registered by vect_record_loop_mask are kept in
> > > their original form and are now recorded in a hash_set instead of
> > > being processed into a vector of rgroup_controls. That processing is
> > > now left to the final analysis phase, which tries forming the
> > > rgroup_controls vector using while_ult; if that fails it tries the
> > > AVX512 style, which needs a different organization and instead fills
> > > a hash_map with the relevant info. vect_get_loop_mask now has two
> > > implementations, one for each of the two mask styles.
> > >
> > > I have decided against interweaving vect_set_loop_condition_partial_vectors
> > > with conditions to do AVX512 style masking and instead opted to
> > > "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> > > Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
> > >
> > > I was torn about making 'vec_loop_masks' a class with methods,
> > > possibly merging the _len stuff into a single registry, but that
> > > seemed like too many changes for the purpose of getting AVX512
> > > working. I'm going to play wait-and-see on what happens with
> > > RISC-V here, since they are going to get both masks and lengths
> > > registered, I think.
> > >
> > > The vect_prepare_for_masked_peels hunk might run into issues with
> > > SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> > > looked odd.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
> > > the testsuite with --param vect-partial-vector-usage=2 with and
> > > without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> > > and one latent wrong-code (PR110237).
> > >
> > > There's followup work to be done to try enabling masked epilogues
> > > for x86-64 by default (when AVX512 is enabled, possibly only when
> > > -mprefer-vector-width=512). Getting cost modeling and decision
> > > right is going to be challenging.
> > >
> > > Any comments?
> > >
> > > OK?
> >
> > Some comments below, but otherwise LGTM FWIW.
> >
> > > Btw, testing on GCN would be welcome - the _avx512 paths could
> > > work for it, so in case the while_ult path fails (not sure if
> > > it ever does) it could get _avx512-style masking. Likewise
> > > testing on ARM, just to see I didn't break anything here.
> > > I don't have SVE hardware, so my testing there is probably
> > > meaningless.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > * tree-vectorizer.h (enum vect_partial_vector_style): New.
> > > (_loop_vec_info::partial_vector_style): Likewise.
> > > (LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
> > > (rgroup_controls::compare_type): Add.
> > > (vec_loop_masks): Change from a typedef to auto_vec<>
> > > to a structure.
> > > * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
> > > Adjust.
> > > (vect_set_loop_condition_partial_vectors_avx512): New function
> > > implementing the AVX512 partial vector codegen.
> > > (vect_set_loop_condition): Dispatch to the correct
> > > vect_set_loop_condition_partial_vectors_* function based on
> > > LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > > (vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
> > > in the original niter type.
> > > * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > partial_vector_style.
> > > (_loop_vec_info::~_loop_vec_info): Release the hash-map recorded
> > > rgroup_controls.
> > > (can_produce_all_loop_masks_p): Adjust.
> > > (vect_verify_full_masking): Produce the rgroup_controls vector
> > > here. Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
> > > (vect_verify_full_masking_avx512): New function implementing
> > > verification of AVX512 style masking.
> > > (vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
> > > (vect_analyze_loop_2): Also try AVX512 style masking.
> > > Adjust condition.
> > > (vect_estimate_min_profitable_iters): Implement AVX512 style
> > > mask producing cost.
> > > (vect_record_loop_mask): Do not build the rgroup_controls
> > > vector here but record masks in a hash-set.
> > > (vect_get_loop_mask): Implement AVX512 style mask query,
> > > complementing the existing while_ult style.
> > > ---
> > > gcc/tree-vect-loop-manip.cc | 264 ++++++++++++++++++++++-
> > > gcc/tree-vect-loop.cc | 413 +++++++++++++++++++++++++++++++-----
> > > gcc/tree-vectorizer.h | 35 ++-
> > > 3 files changed, 651 insertions(+), 61 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > index 1c8100c1a1c..f0ecaec28f4 100644
> > > --- a/gcc/tree-vect-loop-manip.cc
> > > +++ b/gcc/tree-vect-loop-manip.cc
> > > @@ -50,6 +50,9 @@ along with GCC; see the file COPYING3. If not see
> > > #include "insn-config.h"
> > > #include "rtl.h"
> > > #include "recog.h"
> > > +#include "langhooks.h"
> > > +#include "tree-vector-builder.h"
> > > +#include "optabs-tree.h"
> > >
> > > /*************************************************************************
> > > Simple Loop Peeling Utilities
> > > @@ -845,7 +848,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > > rgroup_controls *iv_rgc = nullptr;
> > > unsigned int i;
> > > auto_vec<rgroup_controls> *controls = use_masks_p
> > > - ? &LOOP_VINFO_MASKS (loop_vinfo)
> > > + ? &LOOP_VINFO_MASKS (loop_vinfo).rgc_vec
> > > : &LOOP_VINFO_LENS (loop_vinfo);
> > > FOR_EACH_VEC_ELT (*controls, i, rgc)
> > > if (!rgc->controls.is_empty ())
> > > @@ -936,6 +939,246 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > > return cond_stmt;
> > > }
> > >
> > > +/* Set up the iteration condition and rgroup controls for LOOP in AVX512
> > > + style, given that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the
> > > + vectorized loop. LOOP_VINFO describes the vectorization of LOOP. NITERS is
> > > + the number of iterations of the original scalar loop that should be
> > > + handled by the vector loop. NITERS_MAYBE_ZERO and FINAL_IV are as
> > > + for vect_set_loop_condition.
> > > +
> > > + Insert the branch-back condition before LOOP_COND_GSI and return the
> > > + final gcond. */
> > > +
> > > +static gcond *
> > > +vect_set_loop_condition_partial_vectors_avx512 (class loop *loop,
> > > + loop_vec_info loop_vinfo, tree niters,
> > > + tree final_iv,
> > > + bool niters_maybe_zero,
> > > + gimple_stmt_iterator loop_cond_gsi)
> > > +{
> > > + tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> > > + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > > + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > + tree orig_niters = niters;
> > > + gimple_seq preheader_seq = NULL;
> > > +
> > > + /* Create an IV that counts down from niters and whose step
> > > + is the number of iterations processed in the current iteration.
> > > + Produce the controls with compares like the following.
> > > +
> > > + # iv_2 = PHI <niters, iv_3>
> > > + rem_4 = MIN <iv_2, VF>;
> > > + remv_6 = { rem_4, rem_4, rem_4, ... }
> > > + mask_5 = { 0, 0, 1, 1, 2, 2, ... } < remv6;
> > > + iv_3 = iv_2 - VF;
> > > + if (iv_2 > VF)
> > > + continue;
> > > +
> > > + Where the constant is built with elements at most VF - 1 and
> > > + repetitions according to max_nscalars_per_iter which is guaranteed
> > > + to be the same within a group. */
> > > +
> > > + /* Convert NITERS to the determined IV type. */
> > > + if (TYPE_PRECISION (iv_type) > TYPE_PRECISION (TREE_TYPE (niters))
> > > + && niters_maybe_zero)
> > > + {
> > > + /* We know that there is always at least one iteration, so if the
> > > + count is zero then it must have wrapped. Cope with this by
> > > + subtracting 1 before the conversion and adding 1 to the result. */
> > > + gcc_assert (TYPE_UNSIGNED (TREE_TYPE (niters)));
> > > + niters = gimple_build (&preheader_seq, PLUS_EXPR, TREE_TYPE (niters),
> > > + niters, build_minus_one_cst (TREE_TYPE (niters)));
> > > + niters = gimple_convert (&preheader_seq, iv_type, niters);
> > > + niters = gimple_build (&preheader_seq, PLUS_EXPR, iv_type,
> > > + niters, build_one_cst (iv_type));
> > > + }
> > > + else
> > > + niters = gimple_convert (&preheader_seq, iv_type, niters);
> > > +
> > > + /* Bias the initial value of the IV in case we need to skip iterations
> > > + at the beginning. */
> > > + tree niters_adj = niters;
> > > + if (niters_skip)
> > > + {
> > > + tree skip = gimple_convert (&preheader_seq, iv_type, niters_skip);
> > > + niters_adj = gimple_build (&preheader_seq, PLUS_EXPR,
> > > + iv_type, niters, skip);
> > > + }
> > > +
> > > + /* The iteration step is the vectorization factor. */
> > > + tree iv_step = build_int_cst (iv_type, vf);
> > > +
> > > + /* Create the decrement IV. */
> > > + tree index_before_incr, index_after_incr;
> > > + gimple_stmt_iterator incr_gsi;
> > > + bool insert_after;
> > > + standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > > + create_iv (niters_adj, MINUS_EXPR, iv_step, NULL_TREE, loop,
> > > + &incr_gsi, insert_after, &index_before_incr,
> > > + &index_after_incr);
> > > +
> > > + /* Iterate over all the rgroups and fill in their controls. */
> > > + for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > + {
> > > + rgroup_controls *rgc = rgcm.second;
> > > + if (rgc->controls.is_empty ())
> > > + continue;
> > > +
> > > + tree ctrl_type = rgc->type;
> > > + poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
> > > +
> > > + tree vectype = rgc->compare_type;
> > > +
> > > + /* index_after_incr is the IV specifying the remaining iterations in
> > > + the next iteration. */
> > > + tree rem = index_after_incr;
> > > + /* When the data type for the compare to produce the mask is
> > > + smaller than the IV type we need to saturate. Saturate to
> > > + the smallest possible value (IV_TYPE) so we only have to
> > > + saturate once (CSE will catch redundant ones we add). */
> > > + if (TYPE_PRECISION (TREE_TYPE (vectype)) < TYPE_PRECISION (iv_type))
> > > + rem = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > + UNKNOWN_LOCATION,
> > > + MIN_EXPR, TREE_TYPE (rem), rem, iv_step);
> > > + rem = gimple_convert (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > + UNKNOWN_LOCATION, TREE_TYPE (vectype), rem);
> > > +
> > > + /* Build a data vector composed of the remaining iterations. */
> > > + rem = gimple_build_vector_from_val (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > + UNKNOWN_LOCATION, vectype, rem);
> > > +
> > > + /* Provide a definition of each vector in the control group. */
> > > + tree next_ctrl = NULL_TREE;
> > > + tree first_rem = NULL_TREE;
> > > + tree ctrl;
> > > + unsigned int i;
> > > + FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> > > + {
> > > + /* Previous controls will cover BIAS items. This control covers the
> > > + next batch. */
> > > + poly_uint64 bias = nitems_per_ctrl * i;
> > > +
> > > + /* Build the constant to compare the remaining iters against,
> > > + this is sth like { 0, 0, 1, 1, 2, 2, 3, 3, ... } appropriately
> > > + split into pieces. */
> > > + unsigned n = TYPE_VECTOR_SUBPARTS (ctrl_type).to_constant ();
> > > + tree_vector_builder builder (vectype, n, 1);
> > > + for (unsigned i = 0; i < n; ++i)
> > > + {
> > > + unsigned HOST_WIDE_INT val
> > > + = (i + bias.to_constant ()) / rgc->max_nscalars_per_iter;
> > > + gcc_assert (val < vf.to_constant ());
> > > + builder.quick_push (build_int_cst (TREE_TYPE (vectype), val));
> > > + }
> > > + tree cmp_series = builder.build ();
> > > +
> > > + /* Create the initial control. First include all items that
> > > + are within the loop limit. */
> > > + tree init_ctrl = NULL_TREE;
> > > + poly_uint64 const_limit;
> > > + /* See whether the first iteration of the vector loop is known
> > > + to have a full control. */
> > > + if (poly_int_tree_p (niters, &const_limit)
> > > + && known_ge (const_limit, (i + 1) * nitems_per_ctrl))
> > > + init_ctrl = build_minus_one_cst (ctrl_type);
> > > + else
> > > + {
> > > + /* The remaining work items initially are niters. Saturate,
> > > + splat and compare. */
> > > + if (!first_rem)
> > > + {
> > > + first_rem = niters;
> > > + if (TYPE_PRECISION (TREE_TYPE (vectype))
> > > + < TYPE_PRECISION (iv_type))
> > > + first_rem = gimple_build (&preheader_seq,
> > > + MIN_EXPR, TREE_TYPE (first_rem),
> > > + first_rem, iv_step);
> > > + first_rem = gimple_convert (&preheader_seq, TREE_TYPE (vectype),
> > > + first_rem);
> > > + first_rem = gimple_build_vector_from_val (&preheader_seq,
> > > + vectype, first_rem);
> > > + }
> > > + init_ctrl = gimple_build (&preheader_seq, LT_EXPR, ctrl_type,
> > > + cmp_series, first_rem);
> > > + }
> > > +
> > > + /* Now AND out the bits that are within the number of skipped
> > > + items. */
> > > + poly_uint64 const_skip;
> > > + if (niters_skip
> > > + && !(poly_int_tree_p (niters_skip, &const_skip)
> > > + && known_le (const_skip, bias)))
> > > + {
> > > + /* For integer mode masks it's cheaper to shift out the bits
> > > + since that avoids loading a constant. */
> > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (ctrl_type)) == MODE_INT);
> > > + init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > > + lang_hooks.types.type_for_mode
> > > + (TYPE_MODE (ctrl_type), 1),
> > > + init_ctrl);
> > > + /* ??? But when the shift amount isn't constant this requires
> > > + a round-trip to GPRs. We could apply the bias to either
> > > + side of the compare instead. */
> > > + tree shift = gimple_build (&preheader_seq, MULT_EXPR,
> > > + TREE_TYPE (niters_skip),
> > > + niters_skip,
> > > + build_int_cst (TREE_TYPE (niters_skip),
> > > + rgc->max_nscalars_per_iter));
> > > + init_ctrl = gimple_build (&preheader_seq, LSHIFT_EXPR,
> > > + TREE_TYPE (init_ctrl),
> > > + init_ctrl, shift);
> > > + init_ctrl = gimple_build (&preheader_seq, VIEW_CONVERT_EXPR,
> > > + ctrl_type, init_ctrl);
> >
> > It looks like this assumes that either the first lane or the last lane
> > of the first mask is always true, is that right? I'm not sure we ever
> > prove that, at least not for SVE. There it's possible to have inactive
> > elements at both ends of the same mask.
>
> It builds a mask for the first iteration without considering niters_skip
> and then shifts in inactive lanes for niters_skip. As I'm using
> VIEW_CONVERT_EXPR this indeed can cause bits outside of the range
> relevant for the vector mask to be set but at least x86 ignores those
> (so I'm missing a final AND with TYPE_PRECISION of the mask).
>
> So I think it should work correctly. For variable niters_skip
> it might be faster to build a mask based on niter_skip adjusted niters
> and then AND since we can build both masks in parallel. Any
> variable niters_skip shifting has to be done in GPRs as the mask
> register ops can only perform shifts by immediates.
>
> > > + }
> > > +
> > > + /* Get the control value for the next iteration of the loop. */
> > > + next_ctrl = gimple_build (&incr_gsi, false, GSI_CONTINUE_LINKING,
> > > + UNKNOWN_LOCATION,
> > > + LT_EXPR, ctrl_type, cmp_series, rem);
> > > +
> > > + vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
> > > + }
> > > + }
> > > +
> > > + /* Emit all accumulated statements. */
> > > + add_preheader_seq (loop, preheader_seq);
> > > +
> > > + /* Adjust the exit test using the decrementing IV. */
> > > + edge exit_edge = single_exit (loop);
> > > + tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
> > > + /* When we peel for alignment with niter_skip != 0 this can
> > > + cause niter + niter_skip to wrap and since we are comparing the
> > > + value before the decrement here we get a false early exit.
> > > + We can't compare the value after decrement either because that
> > > + decrement could wrap as well as we're not doing a saturating
> > > + decrement. To avoid this situation we force a larger
> > > + iv_type. */
> > > + gcond *cond_stmt = gimple_build_cond (code, index_before_incr, iv_step,
> > > + NULL_TREE, NULL_TREE);
> > > + gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > > +
> > > + /* The loop iterates (NITERS - 1 + NITERS_SKIP) / VF + 1 times.
> > > + Subtract one from this to get the latch count. */
> > > + tree niters_minus_one
> > > + = fold_build2 (PLUS_EXPR, TREE_TYPE (orig_niters), orig_niters,
> > > + build_minus_one_cst (TREE_TYPE (orig_niters)));
> > > + tree niters_adj2 = fold_convert (iv_type, niters_minus_one);
> > > + if (niters_skip)
> > > + niters_adj2 = fold_build2 (PLUS_EXPR, iv_type, niters_minus_one,
> > > + fold_convert (iv_type, niters_skip));
> > > + loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, iv_type,
> > > + niters_adj2, iv_step);
> > > +
> > > + if (final_iv)
> > > + {
> > > + gassign *assign = gimple_build_assign (final_iv, orig_niters);
> > > + gsi_insert_on_edge_immediate (single_exit (loop), assign);
> > > + }
> > > +
> > > + return cond_stmt;
> > > +}
> > > +
> > > +
> > > /* Like vect_set_loop_condition, but handle the case in which the vector
> > > loop handles exactly VF scalars per iteration. */
> > >
> > > @@ -1114,10 +1357,18 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo,
> > > gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
> > >
> > > if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> > > - cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > > - niters, final_iv,
> > > - niters_maybe_zero,
> > > - loop_cond_gsi);
> > > + {
> > > + if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_avx512)
> > > + cond_stmt = vect_set_loop_condition_partial_vectors_avx512 (loop, loop_vinfo,
> > > + niters, final_iv,
> > > + niters_maybe_zero,
> > > + loop_cond_gsi);
> > > + else
> > > + cond_stmt = vect_set_loop_condition_partial_vectors (loop, loop_vinfo,
> > > + niters, final_iv,
> > > + niters_maybe_zero,
> > > + loop_cond_gsi);
> > > + }
> > > else
> > > cond_stmt = vect_set_loop_condition_normal (loop, niters, step, final_iv,
> > > niters_maybe_zero,
> > > @@ -2030,7 +2281,8 @@ void
> > > vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
> > > {
> > > tree misalign_in_elems;
> > > - tree type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> > > + /* ??? With AVX512 we want LOOP_VINFO_RGROUP_IV_TYPE in the end. */
> > > + tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> >
> > Like you say, I think this might cause problems for SVE, since
> > LOOP_VINFO_MASK_SKIP_NITERS is assumed to be the compare type
> > in vect_set_loop_controls_directly. Not sure what the best way
> > around that is.
>
> We should be able to do the conversion there; we could also simply
> leave it at what get_misalign_in_elems uses (an unsigned type
> with the width of a pointer). The issue with the AVX512 style
> masking is that there's not a single compare type so I left
> LOOP_VINFO_RGROUP_COMPARE_TYPE as error_mark_node to catch
> any erroneous uses.
>
> Any preference here?
>
> > >
> > > gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));
> > >
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index 1897e720389..9be66b8fbc5 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -55,6 +55,7 @@ along with GCC; see the file COPYING3. If not see
> > > #include "vec-perm-indices.h"
> > > #include "tree-eh.h"
> > > #include "case-cfn-macros.h"
> > > +#include "langhooks.h"
> > >
> > > /* Loop Vectorization Pass.
> > >
> > > @@ -963,6 +964,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
> > > mask_skip_niters (NULL_TREE),
> > > rgroup_compare_type (NULL_TREE),
> > > simd_if_cond (NULL_TREE),
> > > + partial_vector_style (vect_partial_vectors_none),
> > > unaligned_dr (NULL),
> > > peeling_for_alignment (0),
> > > ptr_mask (0),
> > > @@ -1058,7 +1060,12 @@ _loop_vec_info::~_loop_vec_info ()
> > > {
> > > free (bbs);
> > >
> > > - release_vec_loop_controls (&masks);
> > > + for (auto m : masks.rgc_map)
> > > + {
> > > + m.second->controls.release ();
> > > + delete m.second;
> > > + }
> > > + release_vec_loop_controls (&masks.rgc_vec);
> > > release_vec_loop_controls (&lens);
> > > delete ivexpr_map;
> > > delete scan_map;
> > > @@ -1108,7 +1115,7 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
> > > {
> > > rgroup_controls *rgm;
> > > unsigned int i;
> > > - FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
> > > + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, i, rgm)
> > > if (rgm->type != NULL_TREE
> > > && !direct_internal_fn_supported_p (IFN_WHILE_ULT,
> > > cmp_type, rgm->type,
> > > @@ -1203,9 +1210,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> > > if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > > return false;
> > >
> > > + /* Produce the rgroup controls. */
> > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > > + {
> > > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > + tree vectype = mask.first;
> > > + unsigned nvectors = mask.second;
> > > +
> > > + if (masks->rgc_vec.length () < nvectors)
> > > + masks->rgc_vec.safe_grow_cleared (nvectors, true);
> > > + rgroup_controls *rgm = &(*masks).rgc_vec[nvectors - 1];
> > > + /* The number of scalars per iteration and the number of vectors are
> > > + both compile-time constants. */
> > > + unsigned int nscalars_per_iter
> > > + = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > + LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > > +
> > > + if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > > + {
> > > + rgm->max_nscalars_per_iter = nscalars_per_iter;
> > > + rgm->type = truth_type_for (vectype);
> > > + rgm->factor = 1;
> > > + }
> > > + }
> > > +
> > > /* Calculate the maximum number of scalars per iteration for every rgroup. */
> > > unsigned int max_nscalars_per_iter = 1;
> > > - for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
> > > + for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo).rgc_vec)
> > > max_nscalars_per_iter
> > > = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
> > >
> > > @@ -1268,10 +1299,159 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> > > }
> > >
> > > if (!cmp_type)
> > > - return false;
> > > + {
> > > + LOOP_VINFO_MASKS (loop_vinfo).rgc_vec.release ();
> > > + return false;
> > > + }
> > >
> > > LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = cmp_type;
> > > LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_while_ult;
> > > + return true;
> > > +}
> > > +
> > > +/* Each statement in LOOP_VINFO can be masked where necessary. Check
> > > + whether we can actually generate AVX512 style masks. Return true if so,
> > > + storing the type of the scalar IV in LOOP_VINFO_RGROUP_IV_TYPE. */
> > > +
> > > +static bool
> > > +vect_verify_full_masking_avx512 (loop_vec_info loop_vinfo)
> > > +{
> > > + /* Produce differently organized rgc_vec and differently check
> > > + we can produce masks. */
> > > +
> > > + /* Use a normal loop if there are no statements that need masking.
> > > + This only happens in rare degenerate cases: it means that the loop
> > > + has no loads, no stores, and no live-out values. */
> > > + if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > > + return false;
> > > +
> > > + /* For the decrementing IV we need to represent all values in
> > > + [0, niter + niter_skip] where niter_skip is the elements we
> > > + skip in the first iteration for prologue peeling. */
> > > + tree iv_type = NULL_TREE;
> > > + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> > > + unsigned int iv_precision = UINT_MAX;
> > > + if (iv_limit != -1)
> > > + iv_precision = wi::min_precision (iv_limit, UNSIGNED);
> > > +
> > > + /* First compute the type for the IV we use to track the remaining
> > > + scalar iterations. */
> > > + opt_scalar_int_mode cmp_mode_iter;
> > > + FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > > + {
> > > + unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > > + if (cmp_bits >= iv_precision
> > > + && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > > + {
> > > + iv_type = build_nonstandard_integer_type (cmp_bits, true);
> > > + if (iv_type)
> > > + break;
> > > + }
> > > + }
> > > + if (!iv_type)
> > > + return false;
> > > +
> > > + /* Produce the rgroup controls. */
> > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).mask_set)
> > > + {
> > > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > > + tree vectype = mask.first;
> > > + unsigned nvectors = mask.second;
> > > +
> > > + /* The number of scalars per iteration and the number of vectors are
> > > + both compile-time constants. */
> > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > > + unsigned int nscalars_per_iter
> > > + = nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> >
> > The comment seems to be borrowed from:
> >
> > /* The number of scalars per iteration and the number of vectors are
> > both compile-time constants. */
> > unsigned int nscalars_per_iter
> > = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> >
> > but the calculation instead applies to_constant to the VF and to the
> > number of vector elements, which aren't in general known to be constant.
> > Does the exact_div not work here too?
>
> It failed the overload when I used a constant nunits, but yes, the
> original code works fine - though I need ...
>
> > Avoiding the current to_constants here only moves the problem elsewhere
> > though. Since this is a verification routine, I think it should
> > instead use is_constant and fail if it is false.
> >
> > > + /* We key off a hash-map with nscalars_per_iter and the number of total
> > > + lanes in the mask vector and then remember the needed vector mask
> > > + with the largest number of lanes (thus the fewest nV). */
> > > + bool existed;
> > > + rgroup_controls *& rgc
> > > + = masks->rgc_map.get_or_insert (std::make_pair (nscalars_per_iter,
> > > + nvectors * nunits),
>
> ^^^
>
> this decomposition to constants (well, OK - I could have used a
> poly-int for the second half of the pair).
>
> I did wonder if I could restrict this more - I need different groups
> for different nscalars_per_iter but also when the total number of
> work items is different. Somehow I think it's really only
> nscalars_per_iter that's interesting but I didn't get to convince
> myself that nvectors * nunits is going to be always the same
> when I vary the vector size within a loop.
Well, yes, nvectors * nunits / nscalars_per_iter is the
vectorization factor, which is invariant. If nscalars_per_iter is
invariant within the group then nvectors * nunits should be as well.
That should simplify things.
Richard.
> > > + &existed);
> > > + if (!existed)
> > > + {
> > > + rgc = new rgroup_controls ();
> > > + rgc->type = truth_type_for (vectype);
> > > + rgc->compare_type = NULL_TREE;
> > > + rgc->max_nscalars_per_iter = nscalars_per_iter;
> > > + rgc->factor = 1;
> > > + rgc->bias_adjusted_ctrl = NULL_TREE;
> > > + }
> > > + else
> > > + {
> > > + gcc_assert (rgc->max_nscalars_per_iter == nscalars_per_iter);
> > > + if (known_lt (TYPE_VECTOR_SUBPARTS (rgc->type),
> > > + TYPE_VECTOR_SUBPARTS (vectype)))
> > > + rgc->type = truth_type_for (vectype);
> > > + }
> > > + }
> > > +
> > > + /* There is no fixed compare type we are going to use but we have to
> > > + be able to get at one for each mask group. */
> > > + unsigned int min_ni_width
> > > + = wi::min_precision (vect_max_vf (loop_vinfo), UNSIGNED);
> > > +
> > > + bool ok = true;
> > > + for (auto mask : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > + {
> > > + rgroup_controls *rgc = mask.second;
> > > + tree mask_type = rgc->type;
> > > + if (TYPE_PRECISION (TREE_TYPE (mask_type)) != 1)
> > > + {
> > > + ok = false;
> > > + break;
> > > + }
> > > +
> > > + /* If iv_type is usable as compare type use that - we can elide the
> > > + saturation in that case. */
> > > + if (TYPE_PRECISION (iv_type) >= min_ni_width)
> > > + {
> > > + tree cmp_vectype
> > > + = build_vector_type (iv_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > > + if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > > + rgc->compare_type = cmp_vectype;
> > > + }
> > > + if (!rgc->compare_type)
> > > + FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
> > > + {
> > > + unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
> > > + if (cmp_bits >= min_ni_width
> > > + && targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
> > > + {
> > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits, true);
> > > + if (!cmp_type)
> > > + continue;
> > > +
> > > + /* Check whether we can produce the mask with cmp_type. */
> > > + tree cmp_vectype
> > > + = build_vector_type (cmp_type, TYPE_VECTOR_SUBPARTS (mask_type));
> > > + if (expand_vec_cmp_expr_p (cmp_vectype, mask_type, LT_EXPR))
> > > + {
> > > + rgc->compare_type = cmp_vectype;
> > > + break;
> > > + }
> > > + }
> > > + }
> >
> > Just curious: is this fallback loop ever used in practice?
> > TYPE_PRECISION (iv_type) >= min_ni_width seems like an easy condition
> > to satisfy.
>
> I think the only remaining case for this particular condition
> is an original unsigned IV and us peeling for alignment with a mask.
>
> But then of course the main case is that with, say, an unsigned long IV
> but a QImode data type the x86 backend lacks a V64DImode vector type, so
> we cannot produce the mask mode for V64QImode data with a compare using
> a data type based on that IV type.
>
> For 'int' IVs and int/float or double data we can always use the
> original IV here.
>
> > > + if (!rgc->compare_type)
> > > + {
> > > + ok = false;
> > > + break;
> > > + }
> > > + }
> > > + if (!ok)
> > > + {
> > > + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.empty ();
> > > + return false;
> >
> > It looks like this leaks the rgroup_controls created above.
>
> Fixed.
>
> > > + }
> > > +
> > > + LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = error_mark_node;
> > > + LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_avx512;
> > > return true;
> > > }
> > >
> > > @@ -1371,6 +1551,7 @@ vect_verify_loop_lens (loop_vec_info loop_vinfo)
> > >
> > > LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> > > LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> > > + LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) = vect_partial_vectors_len;
> > >
> > > return true;
> > > }
> > > @@ -2712,16 +2893,24 @@ start_over:
> > >
> > > /* If we still have the option of using partial vectors,
> > > check whether we can generate the necessary loop controls. */
> > > - if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > - && !vect_verify_full_masking (loop_vinfo)
> > > - && !vect_verify_loop_lens (loop_vinfo))
> > > - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > > + {
> > > + if (!LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
> > > + {
> > > + if (!vect_verify_full_masking (loop_vinfo)
> > > + && !vect_verify_full_masking_avx512 (loop_vinfo))
> > > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > + }
> > > + else /* !LOOP_VINFO_LENS (loop_vinfo).is_empty () */
> > > + if (!vect_verify_loop_lens (loop_vinfo))
> > > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > > + }
> > >
> > > /* If we're vectorizing a loop that uses length "controls" and
> > > can iterate more than once, we apply decrementing IV approach
> > > in loop control. */
> > > if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > - && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > > + && LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo) == vect_partial_vectors_len
> > > && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> > > && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > > @@ -3022,7 +3211,7 @@ again:
> > > delete loop_vinfo->vector_costs;
> > > loop_vinfo->vector_costs = nullptr;
> > > /* Reset accumulated rgroup information. */
> > > - release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
> > > + release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo).rgc_vec);
> > > release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
> > > /* Reset assorted flags. */
> > > LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
> > > @@ -4362,13 +4551,67 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > > cond_branch_not_taken, vect_epilogue);
> > >
> > > /* Take care of special costs for rgroup controls of partial vectors. */
> > > - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> > > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > > + && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > + == vect_partial_vectors_avx512))
> > > + {
> > > + /* Calculate how many masks we need to generate. */
> > > + unsigned int num_masks = 0;
> > > + bool need_saturation = false;
> > > + for (auto rgcm : LOOP_VINFO_MASKS (loop_vinfo).rgc_map)
> > > + {
> > > + rgroup_controls *rgm = rgcm.second;
> > > + unsigned nvectors
> > > + = (rgcm.first.second
> > > + / TYPE_VECTOR_SUBPARTS (rgm->type).to_constant ());
> > > + num_masks += nvectors;
> > > + if (TYPE_PRECISION (TREE_TYPE (rgm->compare_type))
> > > + < TYPE_PRECISION (LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo)))
> > > + need_saturation = true;
> > > + }
> > > +
> > > + /* ??? The target isn't able to identify the costs below as
> > > + producing masks so it cannot penalize cases where we'd run
> > > + out of mask registers for example. */
> > > +
> > > + /* In the worst case, we need to generate each mask in the prologue
> > > + and in the loop body. We need one splat per group and one
> > > + compare per mask.
> > > +
> > > + Sometimes we can use unpacks instead of generating prologue
> > > + masks and sometimes the prologue mask will fold to a constant,
> > > + so the actual prologue cost might be smaller. However, it's
> > > + simpler and safer to use the worst-case cost; if this ends up
> > > + being the tie-breaker between vectorizing or not, then it's
> > > + probably better not to vectorize. */
> >
> > Not sure all of this applies to the AVX512 case. In particular, the
> > unpacks bit doesn't seem relevant.
>
> Removed that part and noted that we fail to account for the cost of
> dealing with different-nvectors mask cases using the same wide masks
> (the cases that vect_get_loop_mask constructs).  For SVE it's the
> re-interpretation with VIEW_CONVERT done there, which should be free,
> but for AVX512 (if we ever get multiple vector sizes in one loop)
> splitting a mask in half needs one shift for the upper part.
>
> > > + (void) add_stmt_cost (target_cost_data,
> > > + num_masks
> > > + + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > > + vector_stmt, NULL, NULL, NULL_TREE, 0, vect_prologue);
> > > + (void) add_stmt_cost (target_cost_data,
> > > + num_masks
> > > + + LOOP_VINFO_MASKS (loop_vinfo).rgc_map.elements (),
> > > + vector_stmt, NULL, NULL, NULL_TREE, 0, vect_body);
> > > +
> > > + /* When we need saturation we need it both in the prologue and
> > > + the epilogue. */
> > > + if (need_saturation)
> > > + {
> > > + (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > > + NULL, NULL, NULL_TREE, 0, vect_prologue);
> > > + (void) add_stmt_cost (target_cost_data, 1, scalar_stmt,
> > > + NULL, NULL, NULL_TREE, 0, vect_body);
> > > + }
> > > + }
> > > + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
> > > + && (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > + == vect_partial_vectors_avx512))
>
> Eh, cut&paste error here, should be == vect_partial_vectors_while_ult
>
> > > {
> > > /* Calculate how many masks we need to generate. */
> > > unsigned int num_masks = 0;
> > > rgroup_controls *rgm;
> > > unsigned int num_vectors_m1;
> > > - FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> > > + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec, num_vectors_m1, rgm)
> >
> > Nit: long line.
>
> fixed.
>
> > > if (rgm->type)
> > > num_masks += num_vectors_m1 + 1;
> > > gcc_assert (num_masks > 0);
> > > @@ -10329,14 +10572,6 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > > unsigned int nvectors, tree vectype, tree scalar_mask)
> > > {
> > > gcc_assert (nvectors != 0);
> > > - if (masks->length () < nvectors)
> > > - masks->safe_grow_cleared (nvectors, true);
> > > - rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > > - /* The number of scalars per iteration and the number of vectors are
> > > - both compile-time constants. */
> > > - unsigned int nscalars_per_iter
> > > - = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > - LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> > >
> > > if (scalar_mask)
> > > {
> > > @@ -10344,12 +10579,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > > loop_vinfo->scalar_cond_masked_set.add (cond);
> > > }
> > >
> > > - if (rgm->max_nscalars_per_iter < nscalars_per_iter)
> > > - {
> > > - rgm->max_nscalars_per_iter = nscalars_per_iter;
> > > - rgm->type = truth_type_for (vectype);
> > > - rgm->factor = 1;
> > > - }
> > > + masks->mask_set.add (std::make_pair (vectype, nvectors));
> > > }
> > >
> > > /* Given a complete set of masks MASKS, extract mask number INDEX
> > > @@ -10360,46 +10590,121 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
> > > arrangement. */
> > >
> > > tree
> > > -vect_get_loop_mask (loop_vec_info,
> > > +vect_get_loop_mask (loop_vec_info loop_vinfo,
> > > gimple_stmt_iterator *gsi, vec_loop_masks *masks,
> > > unsigned int nvectors, tree vectype, unsigned int index)
> > > {
> > > - rgroup_controls *rgm = &(*masks)[nvectors - 1];
> > > - tree mask_type = rgm->type;
> > > -
> > > - /* Populate the rgroup's mask array, if this is the first time we've
> > > - used it. */
> > > - if (rgm->controls.is_empty ())
> > > + if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > + == vect_partial_vectors_while_ult)
> > > {
> > > - rgm->controls.safe_grow_cleared (nvectors, true);
> > > - for (unsigned int i = 0; i < nvectors; ++i)
> > > + rgroup_controls *rgm = &(masks->rgc_vec)[nvectors - 1];
> > > + tree mask_type = rgm->type;
> > > +
> > > + /* Populate the rgroup's mask array, if this is the first time we've
> > > + used it. */
> > > + if (rgm->controls.is_empty ())
> > > {
> > > - tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > > - /* Provide a dummy definition until the real one is available. */
> > > - SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > - rgm->controls[i] = mask;
> > > + rgm->controls.safe_grow_cleared (nvectors, true);
> > > + for (unsigned int i = 0; i < nvectors; ++i)
> > > + {
> > > + tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
> > > + /* Provide a dummy definition until the real one is available. */
> > > + SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > + rgm->controls[i] = mask;
> > > + }
> > > }
> > > - }
> > >
> > > - tree mask = rgm->controls[index];
> > > - if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > > - TYPE_VECTOR_SUBPARTS (vectype)))
> > > + tree mask = rgm->controls[index];
> > > + if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
> > > + TYPE_VECTOR_SUBPARTS (vectype)))
> > > + {
> > > + /* A loop mask for data type X can be reused for data type Y
> > > + if X has N times more elements than Y and if Y's elements
> > > + are N times bigger than X's. In this case each sequence
> > > + of N elements in the loop mask will be all-zero or all-one.
> > > + We can then view-convert the mask so that each sequence of
> > > + N elements is replaced by a single element. */
> > > + gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > > + TYPE_VECTOR_SUBPARTS (vectype)));
> > > + gimple_seq seq = NULL;
> > > + mask_type = truth_type_for (vectype);
> > > + /* We can only re-use the mask by reinterpreting it if it
> > > + occupies the same space, that is the mask with less elements
> >
> > Nit: fewer elements
>
> Ah this assert was also a left-over from earlier attempts, I've removed
> it again.
>
> > > + uses multiple bits for each masked elements. */
> > > + gcc_assert (known_eq (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (mask)))
> > > + * TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)),
> > > + TYPE_PRECISION (TREE_TYPE (mask_type))
> > > + * TYPE_VECTOR_SUBPARTS (mask_type)));
> > > + mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > > + if (seq)
> > > + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > > + }
> > > + return mask;
> > > + }
> > > + else if (LOOP_VINFO_PARTIAL_VECTORS_STYLE (loop_vinfo)
> > > + == vect_partial_vectors_avx512)
> > > {
> > > - /* A loop mask for data type X can be reused for data type Y
> > > - if X has N times more elements than Y and if Y's elements
> > > - are N times bigger than X's. In this case each sequence
> > > - of N elements in the loop mask will be all-zero or all-one.
> > > - We can then view-convert the mask so that each sequence of
> > > - N elements is replaced by a single element. */
> > > - gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
> > > - TYPE_VECTOR_SUBPARTS (vectype)));
> > > + /* The number of scalars per iteration and the number of vectors are
> > > + both compile-time constants. */
> > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > > + unsigned int nscalars_per_iter
> > > + = nvectors * nunits / LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
> >
> > Same disconnect between the comment and the code here. If we do use
> > is_constant in vect_verify_full_masking_avx512 then we could reference
> > that instead.
>
> I'm going to dig into this a bit, maybe I can simplify all this and
> just have the vector indexed by nscalars_per_iter and get rid of the
> hash-map.
>
> > > +
> > > + rgroup_controls *rgm
> > > + = *masks->rgc_map.get (std::make_pair (nscalars_per_iter,
> > > + nvectors * nunits));
> > > +
> > > + /* The stored nV is dependent on the mask type produced. */
> > > + nvectors = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> > > + TYPE_VECTOR_SUBPARTS (rgm->type)).to_constant ();
> > > +
> > > + /* Populate the rgroup's mask array, if this is the first time we've
> > > + used it. */
> > > + if (rgm->controls.is_empty ())
> > > + {
> > > + rgm->controls.safe_grow_cleared (nvectors, true);
> > > + for (unsigned int i = 0; i < nvectors; ++i)
> > > + {
> > > + tree mask = make_temp_ssa_name (rgm->type, NULL, "loop_mask");
> > > + /* Provide a dummy definition until the real one is available. */
> > > + SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
> > > + rgm->controls[i] = mask;
> > > + }
> > > + }
> > > + if (known_eq (TYPE_VECTOR_SUBPARTS (rgm->type),
> > > + TYPE_VECTOR_SUBPARTS (vectype)))
> > > + return rgm->controls[index];
> > > +
> > > + /* Split the vector if needed.  Since we are dealing with integer mode
> > > + masks with AVX512 we can operate on the integer representation,
> > > + shifting the whole vector at once. */
> > > + unsigned HOST_WIDE_INT factor;
> > > + bool ok = constant_multiple_p (TYPE_VECTOR_SUBPARTS (rgm->type),
> > > + TYPE_VECTOR_SUBPARTS (vectype), &factor);
> > > + gcc_assert (ok);
> > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (rgm->type)) == MODE_INT);
> > > + tree mask_type = truth_type_for (vectype);
> > > + gcc_assert (GET_MODE_CLASS (TYPE_MODE (mask_type)) == MODE_INT);
> > > + unsigned vi = index / factor;
> > > + unsigned vpart = index % factor;
> > > + tree vec = rgm->controls[vi];
> > > gimple_seq seq = NULL;
> > > - mask_type = truth_type_for (vectype);
> > > - mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
> > > + vec = gimple_build (&seq, VIEW_CONVERT_EXPR,
> > > + lang_hooks.types.type_for_mode
> > > + (TYPE_MODE (rgm->type), 1), vec);
> > > + /* For integer mode masks simply shift the right bits into position. */
> > > + if (vpart != 0)
> > > + vec = gimple_build (&seq, RSHIFT_EXPR, TREE_TYPE (vec), vec,
> > > + build_int_cst (integer_type_node, vpart * nunits));
> > > + vec = gimple_convert (&seq, lang_hooks.types.type_for_mode
> > > + (TYPE_MODE (mask_type), 1), vec);
> > > + vec = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, vec);
> >
> > Would it be worth creating an rgroup_controls for each nunits derivative
> > of each rgc_map entry? That way we'd share the calculations, and be
> > able to cost them more accurately.
> >
> > Maybe it's not worth it, just asking. :)
>
> I suppose we could do that.  With AVX512 we have a common
> nscalars_per_iter but vary in nvectors, so we'd have to compute
> a max_nvectors.
>
> I guess it's doable, I'll keep that in mind.
>
> Thanks,
> Richard.
>
> > Thanks,
> > Richard
> >
> > > if (seq)
> > > gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > > + return vec;
> > > }
> > > - return mask;
> > > + else
> > > + gcc_unreachable ();
> > > }
> > >
> > > /* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > index 767a0774d45..42161778dc1 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > @@ -300,6 +300,13 @@ public:
> > > #define SLP_TREE_LANES(S) (S)->lanes
> > > #define SLP_TREE_CODE(S) (S)->code
> > >
> > > +enum vect_partial_vector_style {
> > > + vect_partial_vectors_none,
> > > + vect_partial_vectors_while_ult,
> > > + vect_partial_vectors_avx512,
> > > + vect_partial_vectors_len
> > > +};
> > > +
> > > /* Key for map that records association between
> > > scalar conditions and corresponding loop mask, and
> > > is populated by vect_record_loop_mask. */
> > > @@ -605,6 +612,10 @@ struct rgroup_controls {
> > > specified number of elements; the type of the elements doesn't matter. */
> > > tree type;
> > >
> > > + /* When there is no uniformly used LOOP_VINFO_RGROUP_COMPARE_TYPE this
> > > + is the rgroup specific type used. */
> > > + tree compare_type;
> > > +
> > > /* A vector of nV controls, in iteration order. */
> > > vec<tree> controls;
> > >
> > > @@ -613,7 +624,24 @@ struct rgroup_controls {
> > > tree bias_adjusted_ctrl;
> > > };
> > >
> > > -typedef auto_vec<rgroup_controls> vec_loop_masks;
> > > +struct vec_loop_masks
> > > +{
> > > + bool is_empty () const { return mask_set.is_empty (); }
> > > +
> > > + typedef pair_hash <nofree_ptr_hash <tree_node>,
> > > + int_hash<unsigned, 0>> mp_hash;
> > > + hash_set<mp_hash> mask_set;
> > > +
> > > + /* Default storage for rgroup_controls. */
> > > + auto_vec<rgroup_controls> rgc_vec;
> > > +
> > > + /* The vect_partial_vectors_avx512 style uses a hash-map. */
> > > + hash_map<std::pair<unsigned /* nscalars_per_iter */,
> > > + unsigned /* nlanes */>, rgroup_controls *,
> > > + simple_hashmap_traits<pair_hash <int_hash<unsigned, 0>,
> > > + int_hash<unsigned, 0>>,
> > > + rgroup_controls *>> rgc_map;
> > > +};
> > >
> > > typedef auto_vec<rgroup_controls> vec_loop_lens;
> > >
> > > @@ -741,6 +769,10 @@ public:
> > > LOOP_VINFO_USING_PARTIAL_VECTORS_P is true. */
> > > tree rgroup_iv_type;
> > >
> > > + /* The style used for implementing partial vectors when
> > > + LOOP_VINFO_USING_PARTIAL_VECTORS_P is true. */
> > > + vect_partial_vector_style partial_vector_style;
> > > +
> > > /* Unknown DRs according to which loop was peeled. */
> > > class dr_vec_info *unaligned_dr;
> > >
> > > @@ -914,6 +946,7 @@ public:
> > > #define LOOP_VINFO_MASK_SKIP_NITERS(L) (L)->mask_skip_niters
> > > #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L) (L)->rgroup_compare_type
> > > #define LOOP_VINFO_RGROUP_IV_TYPE(L) (L)->rgroup_iv_type
> > > +#define LOOP_VINFO_PARTIAL_VECTORS_STYLE(L) (L)->partial_vector_style
> > > #define LOOP_VINFO_PTR_MASK(L) (L)->ptr_mask
> > > #define LOOP_VINFO_N_STMTS(L) (L)->shared->n_stmts
> > > #define LOOP_VINFO_LOOP_NEST(L) (L)->shared->loop_nest
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)