From: Richard Biener
Date: Wed, 15 Jun 2016 11:44:00 -0000
Subject: Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
To: Ilya Enkovich
Cc: GCC Patches
In-Reply-To: <20160519194450.GH40563@msticlxl57.ims.intel.com>
References: <20160519194450.GH40563@msticlxl57.ims.intel.com>

On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich wrote:
> Hi,
>
> This patch introduces support for loop epilogue combining. This includes
> support in cost estimation and all changes required to mask the
> vectorized loop.

I wonder why you compute a minimum number of iterations for which masking
the vectorized body is profitable rather than a maximum number of
iterations.  I'd say masking the vectorized loop is profitable if
niter/vf * masking-overhead < epilogue-cost.  Masking the epilogue is
profitable if vectorizing the epilogue with masking is profitable.
Am I missing something?

Thanks,
Richard.

> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich
>
> 	* dbgcnt.def (vect_tail_combine): New.
> 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support
> 	vect_mask_var.
> 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> 	epilogue combined with loop body.
> 	(vect_do_peeling_for_loop_bound): Likewise.
> 	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
> 	(vect_estimate_min_profitable_iters): Add
> 	ret_min_profitable_combine_niters arg, compute number of iterations
> 	for which loop epilogue combining is profitable.
> (vect_generate_tmps_on_preheader): Support combined apilogue. > (vect_gen_ivs_for_masking): New. > (vect_get_mask_index_for_elems): New. > (vect_get_mask_index_for_type): New. > (vect_gen_loop_masks): New. > (vect_mask_reduction_stmt): New. > (vect_mask_mask_load_store_stmt): New. > (vect_mask_load_store_stmt): New. > (vect_combine_loop_epilogue): New. > (vect_transform_loop): Support combined apilogue. > > > diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def > index 78ddcc2..73c2966 100644 > --- a/gcc/dbgcnt.def > +++ b/gcc/dbgcnt.def > @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert) > DEBUG_COUNTER (tree_sra) > DEBUG_COUNTER (vect_loop) > DEBUG_COUNTER (vect_slp) > +DEBUG_COUNTER (vect_tail_combine) > DEBUG_COUNTER (dom_unreachable_edges) > diff --git a/gcc/params.def b/gcc/params.def > index 62a1e40..98d6c5a 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS, > "Maximum number of may-defs visited when devirtualizing " > "speculatively", 50, 0, 0) > > +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD, > + "vect-cost-increase-combine-threshold", > + "Cost increase threshold to mask main loop for epilogue.", > + 10, 0, 300) > + > /* > > Local variables: > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c > index f275933..c5bdeb9 100644 > --- a/gcc/tree-vect-data-refs.c > +++ b/gcc/tree-vect-data-refs.c > @@ -4000,6 +4000,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name) > case vect_scalar_var: > prefix = "stmp"; > break; > + case vect_mask_var: > + prefix = "mask"; > + break; > case vect_pointer_var: > prefix = "vectp"; > break; > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > index fab5879..b3c0668 100644 > --- a/gcc/tree-vect-loop-manip.c > +++ b/gcc/tree-vect-loop-manip.c > @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > int first_guard_probability = 2 * REG_BR_PROB_BASE / 3; > int second_guard_probability = 2 * REG_BR_PROB_BASE / 3; > int probability_of_second_loop; > + bool skip_second_after_first = false; > > if (!slpeel_can_duplicate_loop_p (loop, e)) > return NULL; > @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > { > loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); > tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo); > - unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; > + unsigned limit = 0; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + skip_second_after_first = true; > + else > + limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) > limit = limit + 1; > if (check_profitability > @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > bb_between_loops = new_exit_bb; > bb_after_second_loop = split_edge (single_exit (second_loop)); > > - pre_condition = > - fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > - skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > - bb_after_second_loop, bb_before_first_loop, > - inverse_probability (second_guard_probability)); > + if (skip_second_after_first) > + /* We can just redirect edge from bb_between_loops to > + bb_after_second_loop but we have many code assuming > + we have a guard after the first loop. So just make > + always taken condtion. 
*/ > + pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, > + integer_zero_node); > + else > + pre_condition = > + fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > + skip_e > + = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > + bb_after_second_loop, bb_before_first_loop, > + inverse_probability (second_guard_probability)); > scale_loop_profile (second_loop, probability_of_second_loop, bound2); > slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop, > second_loop == new_loop, &new_exit_bb); > @@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > basic_block preheader; > int loop_num; > int max_iter; > + int bound2; > tree cond_expr = NULL_TREE; > gimple_seq cond_expr_stmt_list = NULL; > + bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > @@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > loop_num = loop->num; > > + bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); > new_loop > = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop), > &ratio_mult_vf_name, ni_name, false, > th, check_profitability, > cond_expr, cond_expr_stmt_list, > - 0, LOOP_VINFO_VECT_FACTOR (loop_vinfo)); > + 0, bound2); > gcc_assert (new_loop); > gcc_assert (loop_num == loop->num); > slpeel_checking_verify_cfg_after_peeling (loop, new_loop); > @@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 > : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; > - if (check_profitability) > + /* When epilogue is combined only profitability > + treshold matters. */ > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + max_iter = (int) th - 1; > + else if (check_profitability) > max_iter = MAX (max_iter, (int) th - 1); > record_niter_bound (new_loop, max_iter, false, true); > dump_printf (MSG_NOTE, > @@ -2036,7 +2057,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name, > bound, 0); > > gcc_assert (new_loop); > - slpeel_checking_verify_cfg_after_peeling (new_loop, loop); > + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + slpeel_checking_verify_cfg_after_peeling (new_loop, loop); > /* For vectorization factor N, we need to copy at most N-1 values > for alignment and this means N-2 loopback edge executions. */ > max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2; > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > index 31360d3..1a80c42 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see > #include "gimple-fold.h" > #include "cgraph.h" > #include "tree-if-conv.h" > +#include "alias.h" > +#include "dbgcnt.h" > > /* Loop Vectorization Pass. > > @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3. If not see > http://gcc.gnu.org/projects/tree-ssa/vectorization.html > */ > > -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *); > +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *, > + int *); > > /* Function vect_determine_vectorization_factor > > @@ -2304,8 +2307,10 @@ start_over: > > /* Analyze cost. Decide if worth while to vectorize. 
*/ > int min_profitable_estimate, min_profitable_iters; > + int min_profitable_combine_iters; > vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, > - &min_profitable_estimate); > + &min_profitable_estimate, > + &min_profitable_combine_iters); > > if (min_profitable_iters < 0) > { > @@ -2412,6 +2417,52 @@ start_over: > gcc_assert (vectorization_factor > == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo)); > > + if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) > + { > + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; > + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false; > + } > + else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) > + && min_profitable_combine_iters >= 0) > + { > + if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > + && (LOOP_VINFO_INT_NITERS (loop_vinfo) > + >= (unsigned) min_profitable_combine_iters)) > + || estimated_niter == -1 > + || estimated_niter >= min_profitable_combine_iters) > + && dbg_cnt (vect_tail_combine)) > + { > + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; > + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true; > + > + dump_printf_loc (MSG_NOTE, vect_location, > + "Decided to combine loop with its epilogue.\n"); > + > + /* We need to adjust profitability check if combine > + epilogue considering additional vector iteration > + and profitable combine iterations. */ > + if ((int)(min_profitable_combine_iters + vectorization_factor) > + > min_scalar_loop_bound) > + { > + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) > + = (unsigned) min_profitable_combine_iters; > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Updated runtime profitability treshold: %d\n", > + min_profitable_combine_iters); > + > + } > + } > + else > + { > + if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Not combined loop with epilogue: iterations " > + "count is too low (threshold is %d).\n", > + min_profitable_combine_iters); > + } > + } > + > /* Ok to vectorize! */ > return true; > > @@ -3381,7 +3432,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, > static void > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > int *ret_min_profitable_niters, > - int *ret_min_profitable_estimate) > + int *ret_min_profitable_estimate, > + int *ret_min_profitable_combine_niters) > { > int min_profitable_iters; > int min_profitable_estimate; > @@ -3625,6 +3677,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > vec_prologue_cost); > dump_printf (MSG_NOTE, " Vector epilogue cost: %d\n", > vec_epilogue_cost); > + dump_printf (MSG_NOTE, " Masking prologue cost: %d\n", > + masking_prologue_cost); > + dump_printf (MSG_NOTE, " Masking inside cost: %d\n", > + masking_inside_cost); > dump_printf (MSG_NOTE, " Scalar iteration cost: %d\n", > scalar_single_iter_cost); > dump_printf (MSG_NOTE, " Scalar outside cost: %d\n", > @@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > min_profitable_estimate); > > *ret_min_profitable_estimate = min_profitable_estimate; > + > + *ret_min_profitable_combine_niters = -1; > + > + /* Don't try to vectorize epilogue of epilogue. 
*/ > + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > + return; > + > + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) > + { > + if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED) > + { > + if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) > + *ret_min_profitable_combine_niters = 0; > + return; > + } > + > + unsigned combine_treshold > + = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); > + /* Calculate profitability combining epilogue with the main loop. > + We have a threshold for inside cost overhead (not applied > + for low trip count loop case): > + MIC * 100 < VIC * CT > + Masked iteration should be better than a scalar prologue: > + MIC + VIC < SIC * epilogue_niters */ > + if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combining loop with epilogue is not " > + "profitable.\n"); > + dump_printf_loc (MSG_NOTE, vect_location, > + " Combining overhead %d%% exceeds " > + "treshold %d%%.\n", > + masking_inside_cost * 100 / vec_inside_cost, > + combine_treshold); > + } > + *ret_min_profitable_combine_niters = -1; > + } > + else if ((int)(masking_inside_cost + vec_inside_cost) > + >= scalar_single_iter_cost * peel_iters_epilogue) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combining loop with epilogue is not " > + "profitable.\n"); > + dump_printf_loc (MSG_NOTE, vect_location, > + " Scalar epilogue is faster than a " > + "single masked iteration.\n"); > + } > + *ret_min_profitable_combine_niters = -1; > + } > + else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) > + { > + int inside_cost = vec_inside_cost + masking_inside_cost; > + int outside_cost = vec_outside_cost + masking_prologue_cost; > + int profitable_iters = ((outside_cost - scalar_outside_cost) * vf > + - inside_cost * peel_iters_prologue > + - inside_cost * peel_iters_epilogue) > + / ((scalar_single_iter_cost * vf) > + - inside_cost); > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combinig loop with epilogue " > + "pofitability treshold = %d\n", > + profitable_iters); > + *ret_min_profitable_combine_niters = profitable_iters; > + } > + } > } > > /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET > @@ -6843,20 +6970,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > else > ni_minus_gap_name = ni_name; > > - /* Create: ratio = ni >> log2(vf) */ > - /* ??? As we have ni == number of latch executions + 1, ni could > - have overflown to zero. So avoid computing ratio based on ni > - but compute it using the fact that we know ratio will be at least > - one, thus via (ni - vf) >> log2(vf) + 1. */ > - ratio_name > - = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > - fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > - fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name), > - ni_minus_gap_name, > - build_int_cst > - (TREE_TYPE (ni_name), vf)), > - log_vf), > - build_int_cst (TREE_TYPE (ni_name), 1)); > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + { > + /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop. */ > + gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); > + ratio_name > + = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > + fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > + ni_name, > + build_int_cst (TREE_TYPE (ni_name), > + vf - 1)), > + log_vf); > + } > + else > + { > + /* Create: ratio = ni >> log2(vf) */ > + /* ??? 
As we have ni == number of latch executions + 1, ni could > + have overflown to zero. So avoid computing ratio based on ni > + but compute it using the fact that we know ratio will be at least > + one, thus via (ni - vf) >> log2(vf) + 1. */ > + ratio_name > + = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > + fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > + fold_build2 (MINUS_EXPR, > + TREE_TYPE (ni_name), > + ni_minus_gap_name, > + build_int_cst > + (TREE_TYPE (ni_name), vf)), > + log_vf), > + build_int_cst (TREE_TYPE (ni_name), 1)); > + } > + > if (!is_gimple_val (ratio_name)) > { > var = create_tmp_var (TREE_TYPE (ni_name), "bnd"); > @@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > return; > } > > +/* Function vect_gen_ivs_for_masking. > + > + Create IVs to be used for masks computation to mask loop described > + by LOOP_VINFO. Created IVs are stored in IVS vector. . > + > + Initial IV values is {0, 1, ..., VF - 1} (probably split into several > + vectors, in this case IVS's elements with lower index hold IV with > + smaller numbers). IV step is {VF, VF, ..., VF}. VF is a used > + vectorization factor. */ > + > +static void > +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec *ivs) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + tree vectype = vect_get_masking_iv_type (loop_vinfo); > + tree type = TREE_TYPE (vectype); > + int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); > + int ncopies = vf / elems; > + int i, k; > + tree iv, init_val, step_val; > + bool insert_after; > + gimple_stmt_iterator gsi; > + tree *vtemp; > + > + /* Create {VF, ..., VF} vector constant. */ > + step_val = build_vector_from_val (vectype, build_int_cst (type, vf)); > + > + vtemp = XALLOCAVEC (tree, vf); > + for (i = 0; i < ncopies; i++) > + { > + /* Create initial IV value. */ > + for (k = 0; k < vf; k++) > + vtemp[k] = build_int_cst (type, k + i * elems); > + init_val = build_vector (vectype, vtemp); > + > + /* Create an inductive variable including phi node. */ > + standard_iv_increment_position (loop, &gsi, &insert_after); > + create_iv (init_val, step_val, NULL, loop, &gsi, insert_after, > + &iv, NULL); > + ivs->safe_push (iv); > + } > +} > + > +/* Function vect_get_mask_index_for_elems. > + > + A helper function to access masks vector. See vect_gen_loop_masks > + for masks vector sorting description. Return index of the first > + mask having MASK_ELEMS elements. */ > + > +static inline unsigned > +vect_get_mask_index_for_elems (unsigned mask_elems) > +{ > + return current_vector_size / mask_elems - 1; > +} > + > +/* Function vect_get_mask_index_for_type. > + > + A helper function to access masks vector. See vect_gen_loop_masks > + for masks vector sorting description. Return index of the first > + mask appropriate for VECTYPE. */ > + > +static inline unsigned > +vect_get_mask_index_for_type (tree vectype) > +{ > + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); > + return vect_get_mask_index_for_elems (elems); > +} > + > +/* Function vect_gen_loop_masks. > + > + Create masks to mask a loop desvribed by LOOP_VINFO. Masks > + are created according to LOOP_VINFO_REQUIRED_MASKS and are stored > + into MASKS vector. > + > + Index of a mask in a vector is computed according to a number > + of masks's elements. Masks are sorted by number of its elements > + in descending order. Index 0 is used to access a mask with > + current_vector_size elements. 
Among masks with the same number > + of elements the one with lower index is used to mask iterations > + with smaller iteration counter. Note that you may get NULL elements > + for masks which are not required. Use vect_get_mask_index_for_elems > + or vect_get_mask_index_for_type to access resulting vector. */ > + > +static void > +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + edge pe = loop_preheader_edge (loop); > + tree niters = LOOP_VINFO_NITERS (loop_vinfo); > + unsigned min_mask_elems, max_mask_elems, nmasks; > + unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; > + auto_vec ivs; > + tree vectype, mask_type; > + tree vec_niters, vec_niters_val, mask; > + gimple *stmt; > + basic_block bb; > + gimple_stmt_iterator gsi = gsi_after_labels (loop->header); > + unsigned vec_size; > + > + /* Create required IVs. */ > + vect_gen_ivs_for_masking (loop_vinfo, &ivs); > + vectype = TREE_TYPE (ivs[0]); > + > + vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); > + iv_elems = TYPE_VECTOR_SUBPARTS (vectype); > + > + /* Get a proper niter to build a vector. */ > + if (!is_gimple_val (niters)) > + { > + gimple_seq seq = NULL; > + niters = force_gimple_operand (niters, &seq, true, NULL); > + gsi_insert_seq_on_edge_immediate (pe, seq); > + } > + /* We may need a type cast in case niter has a too small type > + for generated IVs. */ > + if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters))) > + { > + tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype), > + NULL, "niters"); > + stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters); > + bb = gsi_insert_on_edge_immediate (pe, stmt); > + gcc_assert (!bb); > + niters = new_niters; > + } > + /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME. */ > + vec_niters_val = build_vector_from_val (vectype, niters); > + vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters"); > + stmt = gimple_build_assign (vec_niters, vec_niters_val); > + bb = gsi_insert_on_edge_immediate (pe, stmt); > + gcc_assert (!bb); > + > + /* Determine which masks we need to compute and how many. */ > + vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems); > + nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2); > + masks->safe_grow_cleared (nmasks); > + > + /* Now create base masks through comparison IV < VEC_NITERS. */ > + mask_type = build_same_sized_truth_vector_type (vectype); > + cur_mask = vect_get_mask_index_for_elems (iv_elems); > + for (unsigned i = 0; i < ivs.length (); i++) > + { > + tree iv = ivs[i]; > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + > + /* Create narrowed masks. 
*/ > + cur_mask_elems = iv_elems; > + nmasks = ivs.length (); > + while (cur_mask_elems < max_mask_elems) > + { > + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + cur_mask_elems <<= 1; > + nmasks >>= 1; > + > + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); > + > + for (unsigned i = 0; i < nmasks; i++) > + { > + tree mask_low = (*masks)[prev_mask++]; > + tree mask_hi = (*masks)[prev_mask++]; > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR, > + mask_low, mask_hi); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + } > + > + /* Created widened masks. */ > + cur_mask_elems = iv_elems; > + nmasks = ivs.length (); > + while (cur_mask_elems > min_mask_elems) > + { > + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + cur_mask_elems >>= 1; > + nmasks <<= 1; > + > + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); > + > + for (unsigned i = 0; i < nmasks; i += 2) > + { > + tree orig_mask = (*masks)[prev_mask++]; > + > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + } > +} > + > +/* Function vect_mask_reduction_stmt. > + > + Mask given vectorized reduction statement STMT using > + MASK. In case scalar reduction statement is vectorized > + into several vector statements then PREV holds a > + preceding vetor statement copy for STMT. > + > + Masking is performed using VEC_COND_EXPR. E.g. > + > + S1: r_1 = r_2 + d_3 > + > + is transformed into: > + > + S1': r_4 = r_2 + d_3 > + S2': r_1 = VEC_COND_EXPR > + > + Return generated condition statement. */ > + > +static gimple * > +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev) > +{ > + gimple_stmt_iterator gsi; > + tree vectype; > + tree lhs, rhs, tmp; > + gimple *new_stmt, *phi; > + > + lhs = gimple_assign_lhs (stmt); > + vectype = TREE_TYPE (lhs); > + > + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) > + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); > + > + /* Find operand RHS defined by PHI node. */ > + rhs = gimple_assign_rhs1 (stmt); > + gcc_assert (TREE_CODE (rhs) == SSA_NAME); > + phi = SSA_NAME_DEF_STMT (rhs); > + > + if (phi != prev && gimple_code (phi) != GIMPLE_PHI) > + { > + rhs = gimple_assign_rhs2 (stmt); > + gcc_assert (TREE_CODE (rhs) == SSA_NAME); > + phi = SSA_NAME_DEF_STMT (rhs); > + gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI); > + } > + > + /* Convert reduction stmt to ordinary assignment to TMP. */ > + tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL); > + gimple_assign_set_lhs (stmt, tmp); > + > + /* Create VEC_COND_EXPR and insert it after STMT. */ > + new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs); > + gsi = gsi_for_stmt (stmt); > + gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT); > + > + return new_stmt; > +} > + > +/* Function vect_mask_mask_load_store_stmt. > + > + Mask given vectorized MASK_LOAD or MASK_STORE statement > + STMT using MASK. 
Function replaces a mask used by STMT > + with its conjunction with MASK. */ > + > +static void > +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask) > +{ > + gimple *new_stmt; > + tree old_mask, new_mask; > + gimple_stmt_iterator gsi; > + > + gsi = gsi_for_stmt (stmt); > + old_mask = gimple_call_arg (stmt, 2); > + > + gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask))); > + > + new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL); > + new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask); > + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); > + > + gimple_call_set_arg (stmt, 2, new_mask); > + update_stmt (stmt); > +} > + > + > +/* Function vect_mask_load_store_stmt. > + > + Mask given vectorized load or store statement STMT using > + MASK. DR is a data reference for a scalar memory access. > + Assignment is transformed into MASK_LOAD or MASK_STORE > + statement. SI is either an iterator pointing to STMT and > + is to be updated or NULL. */ > + > +static void > +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask, > + data_reference *dr, gimple_stmt_iterator *si) > +{ > + tree mem, val, addr, ptr; > + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); > + unsigned align, misalign; > + tree elem_type = TREE_TYPE (vectype); > + gimple *new_stmt; > + > + gcc_assert (!si || gsi_stmt (*si) == stmt); > + > + gsi = gsi_for_stmt (stmt); > + if (gimple_store_p (stmt)) > + { > + val = gimple_assign_rhs1 (stmt); > + mem = gimple_assign_lhs (stmt); > + } > + else > + { > + val = gimple_assign_lhs (stmt); > + mem = gimple_assign_rhs1 (stmt); > + } > + > + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) > + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); > + > + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + > + align = TYPE_ALIGN_UNIT (vectype); > + if (aligned_access_p (dr)) > + misalign = 0; > + else if (DR_MISALIGNMENT (dr) == -1) > + { > + align = TYPE_ALIGN_UNIT (elem_type); > + misalign = 0; > + } > + else > + misalign = DR_MISALIGNMENT (dr); > + set_ptr_info_alignment (get_ptr_info (addr), align, misalign); > + ptr = build_int_cst (reference_alias_ptr_type (mem), > + misalign ? misalign & -misalign : align); > + > + if (gimple_store_p (stmt)) > + new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr, > + mask, val); > + else > + { > + new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr, > + mask); > + gimple_call_set_lhs (new_stmt, val); > + } > + gsi_replace (si ? si : &gsi, new_stmt, false); > +} > + > +/* Function vect_combine_loop_epilogue. > + > + Combine loop epilogue with the main vectorized body. It requires > + masking of memory accesses and reductions. */ > + > +static void > +vect_combine_loop_epilogue (loop_vec_info loop_vinfo) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); > + unsigned mask_no; > + auto_vec masks; > + > + vect_gen_loop_masks (loop_vinfo, &masks); > + > + /* Convert reduction statements if any. 
*/ > + for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++) > + { > + gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i]; > + gimple *prev_stmt = NULL; > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + > + mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info)); > + > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + while (stmt) > + { > + prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++], > + prev_stmt); > + stmt_info = vinfo_for_stmt (stmt); > + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; > + } > + } > + > + /* Scan all loop statements to convert vector load/store including masked > + form. */ > + for (unsigned i = 0; i < loop->num_nodes; i++) > + { > + basic_block bb = bbs[i]; > + for (gimple_stmt_iterator si = gsi_start_bb (bb); > + !gsi_end_p (si); gsi_next (&si)) > + { > + gimple *stmt = gsi_stmt (si); > + stmt_vec_info stmt_info = NULL; > + tree vectype = NULL; > + data_reference *dr; > + > + /* Mask load case. */ > + if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt) > + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD > + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2)))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + if (!STMT_VINFO_VEC_STMT (stmt_info)) > + continue; > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + /* Mask store case. */ > + else if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt) > + && gimple_call_internal_fn (stmt) == IFN_MASK_STORE > + && vinfo_for_stmt (stmt) > + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + vectype = TREE_TYPE (gimple_call_arg (stmt, 2)); > + } > + /* Load case. */ > + else if (gimple_assign_load_p (stmt) > + && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt)))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + /* Skip vector loads. */ > + if (!STMT_VINFO_VEC_STMT (stmt_info)) > + continue; > + /* Skip invariant loads. */ > + if (integer_zerop (nested_in_vect_loop_p (loop, stmt) > + ? STMT_VINFO_DR_STEP (stmt_info) > + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))) > + continue; > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + /* Store case. */ > + else if (gimple_code (stmt) == GIMPLE_ASSIGN > + && gimple_store_p (stmt) > + && vinfo_for_stmt (stmt) > + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + else > + continue; > + > + /* Skip hoisted out statements. */ > + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt))) > + continue; > + > + mask_no = vect_get_mask_index_for_type (vectype); > + > + dr = STMT_VINFO_DATA_REF (stmt_info); > + while (stmt) > + { > + if (is_gimple_call (stmt)) > + vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]); > + else > + vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr, > + /* Have to update iterator only if > + it points to stmt we mask. */ > + stmt == gsi_stmt (si) ? &si : NULL); > + > + stmt_info = vinfo_for_stmt (stmt); > + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; > + } > + } > + } > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "=== Loop epilogue was combined ===\n"); > +} > > /* Function vect_transform_loop. 
> > @@ -6927,7 +7550,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) > run at least the vectorization factor number of times checking > is pointless, too. */ > th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); > - if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 > + if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 > + || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) > + && th > 1)) > && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > { > if (dump_enabled_p ()) > @@ -6976,12 +7601,18 @@ vect_transform_loop (loop_vec_info loop_vinfo) > { > tree ratio_mult_vf; > if (!ni_name) > - ni_name = vect_build_loop_niters (loop_vinfo); > + { > + ni_name = vect_build_loop_niters (loop_vinfo); > + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; > + } > vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf, > &ratio); > - epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, > - ratio_mult_vf, th, > - check_profitability); > + /* If epilogue is combined with main loop peeling is not needed. */ > + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) > + || check_profitability) > + epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, > + ratio_mult_vf, th, > + check_profitability); > } > else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)), > @@ -6989,7 +7620,10 @@ vect_transform_loop (loop_vec_info loop_vinfo) > else > { > if (!ni_name) > - ni_name = vect_build_loop_niters (loop_vinfo); > + { > + ni_name = vect_build_loop_niters (loop_vinfo); > + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; > + } > vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio); > } > > @@ -7243,23 +7877,35 @@ vect_transform_loop (loop_vec_info loop_vinfo) > > slpeel_make_loop_iterate_ntimes (loop, ratio); > > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + vect_combine_loop_epilogue (loop_vinfo); > + > /* Reduce loop iterations by the vectorization factor. */ > scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), > expected_iterations / vectorization_factor); > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > && loop->nb_iterations_upper_bound != 0) > - loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1; > - loop->nb_iterations_upper_bound > - = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, > - vectorization_factor) - 1; > - > + loop->nb_iterations_upper_bound -= 1; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + loop->nb_iterations_upper_bound > + = wi::div_ceil (loop->nb_iterations_upper_bound + 1, > + vectorization_factor, UNSIGNED) - 1; > + else > + loop->nb_iterations_upper_bound > + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, > + vectorization_factor) - 1; > if (loop->any_estimate) > { > - loop->nb_iterations_estimate > - = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); > - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > - && loop->nb_iterations_estimate != 0) > - loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + loop->nb_iterations_estimate > + = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor, > + UNSIGNED); > + else > + loop->nb_iterations_estimate > + = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); > + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > + && loop->nb_iterations_estimate != 0) > + loop->nb_iterations_estimate -= 1; > } > > if (dump_enabled_p ())
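
The cost question raised above comes down to which side of one inequality
grows with the iteration count.  Below is a small standalone sketch, in
plain C with made-up cost numbers and hypothetical names (nothing here is
taken from the patch or from GCC internals), of the comparison suggested
in the review, niter/vf * masking-overhead < epilogue-cost:

#include <stdio.h>

/* Reviewer's formulation: masking the main vectorized loop pays off when
   the per-vector-iteration masking overhead, summed over all vector
   iterations, stays below the cost of the scalar epilogue it removes.  */
static int
mask_body_profitable_p (int niter, int vf, int masking_overhead,
                        int scalar_iter_cost)
{
  int vec_iters = niter / vf;           /* full vector iterations       */
  int epilogue_iters = niter % vf;      /* iterations left for epilogue */
  return vec_iters * masking_overhead < epilogue_iters * scalar_iter_cost;
}

int
main (void)
{
  /* Hypothetical costs: VF = 8, masking adds 2 units per vector
     iteration, a scalar iteration costs 5 units.  The saving from
     dropping the epilogue is bounded by (VF - 1) scalar iterations,
     while the accumulated masking overhead keeps growing with niter,
     so combining stops paying off beyond some maximum trip count.  */
  for (int niter = 13; niter <= 213; niter += 40)
    printf ("niter = %3d -> combining %s\n", niter,
            mask_body_profitable_p (niter, 8, 2, 5)
            ? "looks profitable" : "does not look profitable");
  return 0;
}

The sketch deliberately ignores the patch's additional gate on
masking_inside_cost relative to vec_inside_cost; it only illustrates why
the suggested condition naturally yields a maximum rather than a minimum
iteration count.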