From: Ilya Enkovich
To: Jeff Law
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
Date: Tue, 28 Jun 2016 13:37:00 -0000
Message-ID: <20160628122439.GB4143@msticlxl57.ims.intel.com>
References: <20160519194450.GH40563@msticlxl57.ims.intel.com> <18ccae1a-30c3-c23c-e28f-287f9d41eaa0@redhat.com>
In-Reply-To: <18ccae1a-30c3-c23c-e28f-287f9d41eaa0@redhat.com>

On 16 Jun 10:54, Jeff Law wrote:
> On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
> >Hi,
> >
> >This patch introduces support for loop epilogue combining. This includes
> >support in cost estimation and all required changes required to mask
> >vectorized loop.
> >
> >Thanks,
> >Ilya
> >--
> >gcc/
> >
> >2016-05-19  Ilya Enkovich
> >
> > 	* dbgcnt.def (vect_tail_combine): New.
> > 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> > 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> > 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> > 	epilogue combined with loop body.
> > 	(vect_do_peeling_for_loop_bound): Likewise.
> > 	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> > (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters > > arg, compute number of iterations for which loop epilogue combining is > > profitable. > > (vect_generate_tmps_on_preheader): Support combined apilogue. > > (vect_gen_ivs_for_masking): New. > > (vect_get_mask_index_for_elems): New. > > (vect_get_mask_index_for_type): New. > > (vect_gen_loop_masks): New. > > (vect_mask_reduction_stmt): New. > > (vect_mask_mask_load_store_stmt): New. > > (vect_mask_load_store_stmt): New. > > (vect_combine_loop_epilogue): New. > > (vect_transform_loop): Support combined apilogue. > > > > > >diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > >index fab5879..b3c0668 100644 > >--- a/gcc/tree-vect-loop-manip.c > >+++ b/gcc/tree-vect-loop-manip.c > >@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > > bb_between_loops = new_exit_bb; > > bb_after_second_loop = split_edge (single_exit (second_loop)); > > > >- pre_condition = > >- fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > >- skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > >- bb_after_second_loop, bb_before_first_loop, > >- inverse_probability (second_guard_probability)); > >+ if (skip_second_after_first) > >+ /* We can just redirect edge from bb_between_loops to > >+ bb_after_second_loop but we have many code assuming > >+ we have a guard after the first loop. So just make > >+ always taken condtion. */ > >+ pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, > >+ integer_zero_node); > This isn't ideal, but I don't think it's that big of an issue. > > >@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > basic_block preheader; > > int loop_num; > > int max_iter; > >+ int bound2; > > tree cond_expr = NULL_TREE; > > gimple_seq cond_expr_stmt_list = NULL; > >+ bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); > > > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > >@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > > > loop_num = loop->num; > > > >+ bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); > Can you document what the TH parameter is to the various routines that use > it in tree-vect-loop-manip.c? I realize you didn't add it, but it would > help anyone looking at this code in the future to know it's the threshold of > iterations for vectorization without having to find it in other function > comment headers ;-) > > That's pre-approved to go in immediately :-) > > >@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > > ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 > > : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; > >- if (check_profitability) > >+ /* When epilogue is combined only profitability > >+ treshold matters. */ > s/treshold/threshold/ > > > > > static void > > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > > int *ret_min_profitable_niters, > >- int *ret_min_profitable_estimate) > >+ int *ret_min_profitable_estimate, > >+ int *ret_min_profitable_combine_niters) > I'm torn a bit here. There's all kinds of things missing/incomplete in the > function comments throughout the vectorizer. And in some cases, like this > one, the parameters are largely self-documenting. But we've also got coding > standards that we'd like to adhere to. 
> > I don't think it's fair to require you to fix all these issues in the > vectorizer (though if you wanted to, I'd fully support those an independent > cleanups). > > Perhaps just document LOOP_VINFO with a generic comment about the ret_* > parameters for this function rather than a comment for each ret_* parameter. > Pre-approved for the trunk independent of the vec-tails work. > > > >@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > > min_profitable_estimate); > > > >+ > >+ unsigned combine_treshold > >+ = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); > >+ /* Calculate profitability combining epilogue with the main loop. > >+ We have a threshold for inside cost overhead (not applied > >+ for low trip count loop case): > >+ MIC * 100 < VIC * CT > >+ Masked iteration should be better than a scalar prologue: > >+ MIC + VIC < SIC * epilogue_niters */ > Can you double-check the whitespace formatting here. Where does the "100" > come from and should it be a param? I checked the formatting. We have 100 here because combine_treshold is measured in percent. E.g. value 2 means iterations masking overhead shouldn't exceed 2% of vector iteration cost. > > > >@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > > return; > > } > > > > >+ > >+/* Function vect_gen_loop_masks. > >+ > >+ Create masks to mask a loop desvribed by LOOP_VINFO. Masks > s/desvribed/described/ > > >+ are created according to LOOP_VINFO_REQUIRED_MASKS and are stored > >+ into MASKS vector. > >+ > >+ Index of a mask in a vector is computed according to a number > >+ of masks's elements. Masks are sorted by number of its elements > >+ in descending order. Index 0 is used to access a mask with > >+ current_vector_size elements. Among masks with the same number > >+ of elements the one with lower index is used to mask iterations > >+ with smaller iteration counter. Note that you may get NULL elements > >+ for masks which are not required. Use vect_get_mask_index_for_elems > >+ or vect_get_mask_index_for_type to access resulting vector. */ > >+ > >+static void > >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) > >+{ > >+ struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > >+ edge pe = loop_preheader_edge (loop); > >+ tree niters = LOOP_VINFO_NITERS (loop_vinfo); > >+ unsigned min_mask_elems, max_mask_elems, nmasks; > >+ unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; > >+ auto_vec ivs; > >+ tree vectype, mask_type; > >+ tree vec_niters, vec_niters_val, mask; > >+ gimple *stmt; > >+ basic_block bb; > >+ gimple_stmt_iterator gsi = gsi_after_labels (loop->header); > >+ unsigned vec_size; > >+ > >+ /* Create required IVs. */ > >+ vect_gen_ivs_for_masking (loop_vinfo, &ivs); > >+ vectype = TREE_TYPE (ivs[0]); > >+ > >+ vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); > >+ iv_elems = TYPE_VECTOR_SUBPARTS (vectype); > >+ > >+ /* Get a proper niter to build a vector. */ > >+ if (!is_gimple_val (niters)) > >+ { > >+ gimple_seq seq = NULL; > >+ niters = force_gimple_operand (niters, &seq, true, NULL); > >+ gsi_insert_seq_on_edge_immediate (pe, seq); > >+ } > >+ /* We may need a type cast in case niter has a too small type > >+ for generated IVs. */ > Nit. There should be vertical whitespace after the close brace and the > comment for the next logical block of code. Can you do a scan over the > patchkit looking for other instances where the vertical whitespace is > needed. 
>
> Generally, if you find that a blob of code needs a comment, then the comment
> and blob of code should have that vertical whitespace to visually separate
> it from everything else.
>
> >+/* Function vect_combine_loop_epilogue.
> >+
> >+   Combine loop epilogue with the main vectorized body.  It requires
> >+   masking of memory accesses and reductions. */
> So you mask reductions, loads & stores.  Is there anything else that we
> might potentially need to mask to combine the loop & epilogue via masking?
>
>
> I don't see anything particularly worrisome here either -- I have a slight
> concern about correctness issues with only masking loads/stores and
> reductions.  But I will defer to your judgment on whether or not there's
> other stuff that we need to mask to combine the epilogue with the loop via
> masking.

We have to mask operations which may cause errors if executed speculatively.
For others we just ignore the produced result.  So we don't truly mask
reductions but fix up their results.  I assume memory accesses are the only
ones we have to truly mask (plus non-const calls, which are rejected now).
For signalling arithmetic I assumed we just don't vectorize it.  Basically
we should act similar to if-conversion.  I'll check whether it has
restrictions I missed.

>
> Jeff

Here is an updated patch version.

Thanks,
Ilya
--
gcc/

2016-05-28  Ilya Enkovich

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): Likewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add
	ret_min_profitable_combine_niters arg, compute number of iterations
	for which loop epilogue combining is profitable.
	(vect_generate_tmps_on_preheader): Support combined epilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined epilogue.
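To make the masking scheme discussed above more concrete, here is a small
illustrative example before the patch itself.  It is not part of the patch
and is deliberately written in plain scalar C rather than GIMPLE: it only
models what the combined loop computes for a simple sum reduction, assuming
VF = 4.  The mask corresponds to the IV < {niters, ...} comparison produced
by vect_gen_loop_masks, the guarded load stands for IFN_MASK_LOAD, and
skipping the update in inactive lanes is what the VEC_COND_EXPR fix-up of
the reduction achieves.  The function name masked_sum and the explicit lane
loop are made up for the example.

#include <stdio.h>

#define VF 4	/* assumed vectorization factor for the example */

static int
masked_sum (const int *a, int n)
{
  int sum[VF] = { 0, 0, 0, 0 };	/* models the vector accumulator */

  /* The combined loop runs ceil (n / VF) iterations; the last, partial
     iteration executes under a mask instead of in a scalar epilogue.  */
  for (int i = 0; i < n; i += VF)
    for (int lane = 0; lane < VF; lane++)
      {
	/* mask = {i, i+1, ...} < {n, n, ...}, cf. vect_gen_loop_masks.  */
	int mask = (i + lane) < n;

	/* Masked load plus reduction fix-up: inactive lanes keep the old
	   accumulator value.  */
	if (mask)
	  sum[lane] += a[i + lane];
      }

  return sum[0] + sum[1] + sum[2] + sum[3];
}

int
main (void)
{
  int a[7] = { 1, 2, 3, 4, 5, 6, 7 };
  printf ("%d\n", masked_sum (a, 7));	/* prints 28; no scalar epilogue */
  return 0;
}

On the cost side, with the default vect-cost-increase-combine-threshold of
10 the first check in vect_estimate_min_profitable_iters becomes
MIC * 100 < VIC * 10, i.e. combining is rejected as soon as the masking
overhead per iteration exceeds 10% of the vector body cost; the second check
additionally requires a single masked vector iteration to be cheaper than
the scalar epilogue iterations it replaces.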
diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index 78ddcc2..73c2966 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert) DEBUG_COUNTER (tree_sra) DEBUG_COUNTER (vect_loop) DEBUG_COUNTER (vect_slp) +DEBUG_COUNTER (vect_tail_combine) DEBUG_COUNTER (dom_unreachable_edges) diff --git a/gcc/params.def b/gcc/params.def index 62a1e40..98d6c5a 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS, "Maximum number of may-defs visited when devirtualizing " "speculatively", 50, 0, 0) +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD, + "vect-cost-increase-combine-threshold", + "Cost increase threshold to mask main loop for epilogue.", + 10, 0, 300) + /* Local variables: diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index a902a50..26e0cc1 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -4007,6 +4007,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name) case vect_scalar_var: prefix = "stmp"; break; + case vect_mask_var: + prefix = "mask"; + break; case vect_pointer_var: prefix = "vectp"; break; diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index c26aa1d..7403686 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, int first_guard_probability = 2 * REG_BR_PROB_BASE / 3; int second_guard_probability = 2 * REG_BR_PROB_BASE / 3; int probability_of_second_loop; + bool skip_second_after_first = false; if (!slpeel_can_duplicate_loop_p (loop, e)) return NULL; @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, { loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo); - unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; + unsigned limit = 0; + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + skip_second_after_first = true; + else + limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) limit = limit + 1; if (check_profitability @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, bb_between_loops = new_exit_bb; bb_after_second_loop = split_edge (single_exit (second_loop)); - pre_condition = - fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); - skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, - bb_after_second_loop, bb_before_first_loop, - inverse_probability (second_guard_probability)); + if (skip_second_after_first) + /* We can just redirect edge from bb_between_loops to + bb_after_second_loop but we have many code assuming + we have a guard after the first loop. So just make + always taken condtion. 
*/ + pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, + integer_zero_node); + else + pre_condition = + fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); + skip_e + = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, + bb_after_second_loop, bb_before_first_loop, + inverse_probability (second_guard_probability)); scale_loop_profile (second_loop, probability_of_second_loop, bound2); slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop, second_loop == new_loop, &new_exit_bb); @@ -1762,8 +1776,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, basic_block preheader; int loop_num; int max_iter; + int bound2; tree cond_expr = NULL_TREE; gimple_seq cond_expr_stmt_list = NULL; + bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -1773,12 +1789,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, loop_num = loop->num; + bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); new_loop = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop), &ratio_mult_vf_name, ni_name, false, th, check_profitability, cond_expr, cond_expr_stmt_list, - 0, LOOP_VINFO_VECT_FACTOR (loop_vinfo)); + 0, bound2); gcc_assert (new_loop); gcc_assert (loop_num == loop->num); slpeel_checking_verify_cfg_after_peeling (loop, new_loop); @@ -1807,7 +1824,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; - if (check_profitability) + + /* When epilogue is combined only profitability + threshold matters. */ + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + max_iter = (int) th - 1; + else if (check_profitability) max_iter = MAX (max_iter, (int) th - 1); record_niter_bound (new_loop, max_iter, false, true); dump_printf (MSG_NOTE, @@ -2044,7 +2066,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name, bound, 0); gcc_assert (new_loop); - slpeel_checking_verify_cfg_after_peeling (new_loop, loop); + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + slpeel_checking_verify_cfg_after_peeling (new_loop, loop); /* For vectorization factor N, we need to copy at most N-1 values for alignment and this means N-2 loopback edge executions. */ max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2; diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 41b9380..08fad82 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "cgraph.h" #include "tree-if-conv.h" +#include "alias.h" +#include "dbgcnt.h" /* Loop Vectorization Pass. @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3. If not see http://gcc.gnu.org/projects/tree-ssa/vectorization.html */ -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *); +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *, + int *); /* Function vect_determine_vectorization_factor @@ -2310,8 +2313,10 @@ start_over: /* Analyze cost. Decide if worth while to vectorize. 
*/ int min_profitable_estimate, min_profitable_iters; + int min_profitable_combine_iters; vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, - &min_profitable_estimate); + &min_profitable_estimate, + &min_profitable_combine_iters); if (min_profitable_iters < 0) { @@ -2420,6 +2425,52 @@ start_over: gcc_assert (vectorization_factor == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo)); + if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) + { + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false; + } + else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && min_profitable_combine_iters >= 0) + { + if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && (LOOP_VINFO_INT_NITERS (loop_vinfo) + >= (unsigned) min_profitable_combine_iters)) + || estimated_niter == -1 + || estimated_niter >= min_profitable_combine_iters) + && dbg_cnt (vect_tail_combine)) + { + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true; + + dump_printf_loc (MSG_NOTE, vect_location, + "Decided to combine loop with its epilogue.\n"); + + /* We need to adjust profitability check if combine + epilogue considering additional vector iteration + and profitable combine iterations. */ + if ((int)(min_profitable_combine_iters + vectorization_factor) + > min_scalar_loop_bound) + { + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) + = (unsigned) min_profitable_combine_iters; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Updated runtime profitability treshold: %d\n", + min_profitable_combine_iters); + + } + } + else + { + if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Not combined loop with epilogue: iterations " + "count is too low (threshold is %d).\n", + min_profitable_combine_iters); + } + } + /* Ok to vectorize! */ return true; @@ -3392,12 +3443,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, profitability check. *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used - for static check against estimated number of iterations. */ + for static check against estimated number of iterations. + + *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold + of iterations for vectorization with combined loop epilogue. -1 means + combining is not profitable. Value may be used fo dynamic profitability + check. */ static void vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, int *ret_min_profitable_niters, - int *ret_min_profitable_estimate) + int *ret_min_profitable_estimate, + int *ret_min_profitable_combine_niters) { int min_profitable_iters; int min_profitable_estimate; @@ -3641,6 +3698,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, vec_prologue_cost); dump_printf (MSG_NOTE, " Vector epilogue cost: %d\n", vec_epilogue_cost); + dump_printf (MSG_NOTE, " Masking prologue cost: %d\n", + masking_prologue_cost); + dump_printf (MSG_NOTE, " Masking inside cost: %d\n", + masking_inside_cost); dump_printf (MSG_NOTE, " Scalar iteration cost: %d\n", scalar_single_iter_cost); dump_printf (MSG_NOTE, " Scalar outside cost: %d\n", @@ -3744,6 +3805,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, min_profitable_estimate); *ret_min_profitable_estimate = min_profitable_estimate; + + *ret_min_profitable_combine_niters = -1; + + /* Don't try to vectorize epilogue of epilogue. 
*/ + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + return; + + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) + { + if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED) + { + if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) + *ret_min_profitable_combine_niters = 0; + return; + } + + unsigned combine_treshold + = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); + /* Calculate profitability combining epilogue with the main loop. + We have a threshold for inside cost overhead (not applied + for low trip count loop case): + MIC * 100 < VIC * CT + Masked iteration should be better than a scalar prologue: + MIC + VIC < SIC * epilogue_niters */ + if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold) + { + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Combining loop with epilogue is not " + "profitable.\n"); + dump_printf_loc (MSG_NOTE, vect_location, + " Combining overhead %d%% exceeds " + "treshold %d%%.\n", + masking_inside_cost * 100 / vec_inside_cost, + combine_treshold); + } + *ret_min_profitable_combine_niters = -1; + } + else if ((int)(masking_inside_cost + vec_inside_cost) + >= scalar_single_iter_cost * peel_iters_epilogue) + { + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Combining loop with epilogue is not " + "profitable.\n"); + dump_printf_loc (MSG_NOTE, vect_location, + " Scalar epilogue is faster than a " + "single masked iteration.\n"); + } + *ret_min_profitable_combine_niters = -1; + } + else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) + { + int inside_cost = vec_inside_cost + masking_inside_cost; + int outside_cost = vec_outside_cost + masking_prologue_cost; + int profitable_iters = ((outside_cost - scalar_outside_cost) * vf + - inside_cost * peel_iters_prologue + - inside_cost * peel_iters_epilogue) + / ((scalar_single_iter_cost * vf) + - inside_cost); + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Combinig loop with epilogue " + "pofitability treshold = %d\n", + profitable_iters); + *ret_min_profitable_combine_niters = profitable_iters; + } + } } /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET @@ -6852,20 +6984,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, else ni_minus_gap_name = ni_name; - /* Create: ratio = ni >> log2(vf) */ - /* ??? As we have ni == number of latch executions + 1, ni could - have overflown to zero. So avoid computing ratio based on ni - but compute it using the fact that we know ratio will be at least - one, thus via (ni - vf) >> log2(vf) + 1. */ - ratio_name - = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), - fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), - fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name), - ni_minus_gap_name, - build_int_cst - (TREE_TYPE (ni_name), vf)), - log_vf), - build_int_cst (TREE_TYPE (ni_name), 1)); + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + { + /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop. */ + gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + ratio_name + = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), + fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), + ni_name, + build_int_cst (TREE_TYPE (ni_name), + vf - 1)), + log_vf); + } + else + { + /* Create: ratio = ni >> log2(vf) */ + /* ??? As we have ni == number of latch executions + 1, ni could + have overflown to zero. So avoid computing ratio based on ni + but compute it using the fact that we know ratio will be at least + one, thus via (ni - vf) >> log2(vf) + 1. 
*/ + ratio_name + = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), + fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), + fold_build2 (MINUS_EXPR, + TREE_TYPE (ni_name), + ni_minus_gap_name, + build_int_cst + (TREE_TYPE (ni_name), vf)), + log_vf), + build_int_cst (TREE_TYPE (ni_name), 1)); + } + if (!is_gimple_val (ratio_name)) { var = create_tmp_var (TREE_TYPE (ni_name), "bnd"); @@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, return; } +/* Function vect_gen_ivs_for_masking. + + Create IVs to be used for masks computation to mask loop described + by LOOP_VINFO. Created IVs are stored in IVS vector. . + + Initial IV values is {0, 1, ..., VF - 1} (probably split into several + vectors, in this case IVS's elements with lower index hold IV with + smaller numbers). IV step is {VF, VF, ..., VF}. VF is a used + vectorization factor. */ + +static void +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec *ivs) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + tree vectype = vect_get_masking_iv_type (loop_vinfo); + tree type = TREE_TYPE (vectype); + int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); + int ncopies = vf / elems; + int i, k; + tree iv, init_val, step_val; + bool insert_after; + gimple_stmt_iterator gsi; + tree *vtemp; + + /* Create {VF, ..., VF} vector constant. */ + step_val = build_vector_from_val (vectype, build_int_cst (type, vf)); + + vtemp = XALLOCAVEC (tree, vf); + for (i = 0; i < ncopies; i++) + { + /* Create initial IV value. */ + for (k = 0; k < vf; k++) + vtemp[k] = build_int_cst (type, k + i * elems); + init_val = build_vector (vectype, vtemp); + + /* Create an inductive variable including phi node. */ + standard_iv_increment_position (loop, &gsi, &insert_after); + create_iv (init_val, step_val, NULL, loop, &gsi, insert_after, + &iv, NULL); + ivs->safe_push (iv); + } +} + +/* Function vect_get_mask_index_for_elems. + + A helper function to access masks vector. See vect_gen_loop_masks + for masks vector sorting description. Return index of the first + mask having MASK_ELEMS elements. */ + +static inline unsigned +vect_get_mask_index_for_elems (unsigned mask_elems) +{ + return current_vector_size / mask_elems - 1; +} + +/* Function vect_get_mask_index_for_type. + + A helper function to access masks vector. See vect_gen_loop_masks + for masks vector sorting description. Return index of the first + mask appropriate for VECTYPE. */ + +static inline unsigned +vect_get_mask_index_for_type (tree vectype) +{ + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); + return vect_get_mask_index_for_elems (elems); +} + +/* Function vect_gen_loop_masks. + + Create masks to mask a loop described by LOOP_VINFO. Masks + are created according to LOOP_VINFO_REQUIRED_MASKS and are stored + into MASKS vector. + + Index of a mask in a vector is computed according to a number + of masks's elements. Masks are sorted by number of its elements + in descending order. Index 0 is used to access a mask with + current_vector_size elements. Among masks with the same number + of elements the one with lower index is used to mask iterations + with smaller iteration counter. Note that you may get NULL elements + for masks which are not required. Use vect_get_mask_index_for_elems + or vect_get_mask_index_for_type to access resulting vector. 
*/ + +static void +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + edge pe = loop_preheader_edge (loop); + tree niters = LOOP_VINFO_NITERS (loop_vinfo); + unsigned min_mask_elems, max_mask_elems, nmasks; + unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; + auto_vec ivs; + tree vectype, mask_type; + tree vec_niters, vec_niters_val, mask; + gimple *stmt; + basic_block bb; + gimple_stmt_iterator gsi = gsi_after_labels (loop->header); + unsigned vec_size; + + /* Create required IVs. */ + vect_gen_ivs_for_masking (loop_vinfo, &ivs); + vectype = TREE_TYPE (ivs[0]); + + vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); + iv_elems = TYPE_VECTOR_SUBPARTS (vectype); + + /* Get a proper niter to build a vector. */ + if (!is_gimple_val (niters)) + { + gimple_seq seq = NULL; + niters = force_gimple_operand (niters, &seq, true, NULL); + gsi_insert_seq_on_edge_immediate (pe, seq); + } + + /* We may need a type cast in case niter has a too small type + for generated IVs. */ + if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters))) + { + tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype), + NULL, "niters"); + stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters); + bb = gsi_insert_on_edge_immediate (pe, stmt); + gcc_assert (!bb); + niters = new_niters; + } + + /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME. */ + vec_niters_val = build_vector_from_val (vectype, niters); + vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters"); + stmt = gimple_build_assign (vec_niters, vec_niters_val); + bb = gsi_insert_on_edge_immediate (pe, stmt); + gcc_assert (!bb); + + /* Determine which masks we need to compute and how many. */ + vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems); + nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2); + masks->safe_grow_cleared (nmasks); + + /* Now create base masks through comparison IV < VEC_NITERS. */ + mask_type = build_same_sized_truth_vector_type (vectype); + cur_mask = vect_get_mask_index_for_elems (iv_elems); + for (unsigned i = 0; i < ivs.length (); i++) + { + tree iv = ivs[i]; + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + + /* Create narrowed masks. */ + cur_mask_elems = iv_elems; + nmasks = ivs.length (); + while (cur_mask_elems < max_mask_elems) + { + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + cur_mask_elems <<= 1; + nmasks >>= 1; + + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); + + for (unsigned i = 0; i < nmasks; i++) + { + tree mask_low = (*masks)[prev_mask++]; + tree mask_hi = (*masks)[prev_mask++]; + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR, + mask_low, mask_hi); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + } + + /* Created widened masks. 
*/ + cur_mask_elems = iv_elems; + nmasks = ivs.length (); + while (cur_mask_elems > min_mask_elems) + { + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + cur_mask_elems >>= 1; + nmasks <<= 1; + + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); + + for (unsigned i = 0; i < nmasks; i += 2) + { + tree orig_mask = (*masks)[prev_mask++]; + + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + } +} + +/* Function vect_mask_reduction_stmt. + + Mask given vectorized reduction statement STMT using + MASK. In case scalar reduction statement is vectorized + into several vector statements then PREV holds a + preceding vetor statement copy for STMT. + + Masking is performed using VEC_COND_EXPR. E.g. + + S1: r_1 = r_2 + d_3 + + is transformed into: + + S1': r_4 = r_2 + d_3 + S2': r_1 = VEC_COND_EXPR + + Return generated condition statement. */ + +static gimple * +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev) +{ + gimple_stmt_iterator gsi; + tree vectype; + tree lhs, rhs, tmp; + gimple *new_stmt, *phi; + + lhs = gimple_assign_lhs (stmt); + vectype = TREE_TYPE (lhs); + + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); + + /* Find operand RHS defined by PHI node. */ + rhs = gimple_assign_rhs1 (stmt); + gcc_assert (TREE_CODE (rhs) == SSA_NAME); + phi = SSA_NAME_DEF_STMT (rhs); + + if (phi != prev && gimple_code (phi) != GIMPLE_PHI) + { + rhs = gimple_assign_rhs2 (stmt); + gcc_assert (TREE_CODE (rhs) == SSA_NAME); + phi = SSA_NAME_DEF_STMT (rhs); + gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI); + } + + /* Convert reduction stmt to ordinary assignment to TMP. */ + tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL); + gimple_assign_set_lhs (stmt, tmp); + + /* Create VEC_COND_EXPR and insert it after STMT. */ + new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs); + gsi = gsi_for_stmt (stmt); + gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT); + + return new_stmt; +} + +/* Function vect_mask_mask_load_store_stmt. + + Mask given vectorized MASK_LOAD or MASK_STORE statement + STMT using MASK. Function replaces a mask used by STMT + with its conjunction with MASK. */ + +static void +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask) +{ + gimple *new_stmt; + tree old_mask, new_mask; + gimple_stmt_iterator gsi; + + gsi = gsi_for_stmt (stmt); + old_mask = gimple_call_arg (stmt, 2); + + gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask))); + + new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL); + new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask); + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); + + gimple_call_set_arg (stmt, 2, new_mask); + update_stmt (stmt); +} + + +/* Function vect_mask_load_store_stmt. + + Mask given vectorized load or store statement STMT using + MASK. DR is a data reference for a scalar memory access. + Assignment is transformed into MASK_LOAD or MASK_STORE + statement. SI is either an iterator pointing to STMT and + is to be updated or NULL. 
*/ + +static void +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask, + data_reference *dr, gimple_stmt_iterator *si) +{ + tree mem, val, addr, ptr; + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); + unsigned align, misalign; + tree elem_type = TREE_TYPE (vectype); + gimple *new_stmt; + + gcc_assert (!si || gsi_stmt (*si) == stmt); + + gsi = gsi_for_stmt (stmt); + if (gimple_store_p (stmt)) + { + val = gimple_assign_rhs1 (stmt); + mem = gimple_assign_lhs (stmt); + } + else + { + val = gimple_assign_lhs (stmt); + mem = gimple_assign_rhs1 (stmt); + } + + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); + + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem), + true, NULL_TREE, true, + GSI_SAME_STMT); + + align = TYPE_ALIGN_UNIT (vectype); + if (aligned_access_p (dr)) + misalign = 0; + else if (DR_MISALIGNMENT (dr) == -1) + { + align = TYPE_ALIGN_UNIT (elem_type); + misalign = 0; + } + else + misalign = DR_MISALIGNMENT (dr); + set_ptr_info_alignment (get_ptr_info (addr), align, misalign); + ptr = build_int_cst (reference_alias_ptr_type (mem), + misalign ? misalign & -misalign : align); + + if (gimple_store_p (stmt)) + new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr, + mask, val); + else + { + new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr, + mask); + gimple_call_set_lhs (new_stmt, val); + } + gsi_replace (si ? si : &gsi, new_stmt, false); +} + +/* Function vect_combine_loop_epilogue. + + Combine loop epilogue with the main vectorized body. It requires + masking of memory accesses and reductions. */ + +static void +vect_combine_loop_epilogue (loop_vec_info loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); + unsigned mask_no; + auto_vec masks; + + vect_gen_loop_masks (loop_vinfo, &masks); + + /* Convert reduction statements if any. */ + for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++) + { + gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i]; + gimple *prev_stmt = NULL; + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + + mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info)); + + stmt = STMT_VINFO_VEC_STMT (stmt_info); + while (stmt) + { + prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++], + prev_stmt); + stmt_info = vinfo_for_stmt (stmt); + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; + } + } + + /* Scan all loop statements to convert vector load/store including masked + form. */ + for (unsigned i = 0; i < loop->num_nodes; i++) + { + basic_block bb = bbs[i]; + for (gimple_stmt_iterator si = gsi_start_bb (bb); + !gsi_end_p (si); gsi_next (&si)) + { + gimple *stmt = gsi_stmt (si); + stmt_vec_info stmt_info = NULL; + tree vectype = NULL; + data_reference *dr; + + /* Mask load case. */ + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2)))) + { + stmt_info = vinfo_for_stmt (stmt); + if (!STMT_VINFO_VEC_STMT (stmt_info)) + continue; + stmt = STMT_VINFO_VEC_STMT (stmt_info); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + /* Mask store case. 
*/ + else if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_MASK_STORE + && vinfo_for_stmt (stmt) + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) + { + stmt_info = vinfo_for_stmt (stmt); + vectype = TREE_TYPE (gimple_call_arg (stmt, 2)); + } + /* Load case. */ + else if (gimple_assign_load_p (stmt) + && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt)))) + { + stmt_info = vinfo_for_stmt (stmt); + + /* Skip vector loads. */ + if (!STMT_VINFO_VEC_STMT (stmt_info)) + continue; + + /* Skip invariant loads. */ + if (integer_zerop (nested_in_vect_loop_p (loop, stmt) + ? STMT_VINFO_DR_STEP (stmt_info) + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))) + continue; + stmt = STMT_VINFO_VEC_STMT (stmt_info); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + /* Store case. */ + else if (gimple_code (stmt) == GIMPLE_ASSIGN + && gimple_store_p (stmt) + && vinfo_for_stmt (stmt) + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) + { + stmt_info = vinfo_for_stmt (stmt); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + else + continue; + + /* Skip hoisted out statements. */ + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt))) + continue; + + mask_no = vect_get_mask_index_for_type (vectype); + + dr = STMT_VINFO_DATA_REF (stmt_info); + while (stmt) + { + if (is_gimple_call (stmt)) + vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]); + else + vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr, + /* Have to update iterator only if + it points to stmt we mask. */ + stmt == gsi_stmt (si) ? &si : NULL); + + stmt_info = vinfo_for_stmt (stmt); + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; + } + } + } + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "=== Loop epilogue was combined ===\n"); +} /* Function vect_transform_loop. @@ -6936,7 +7568,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) run at least the vectorization factor number of times checking is pointless, too. */ th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 + if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 + || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) + && th > 1)) && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { if (dump_enabled_p ()) @@ -6985,12 +7619,18 @@ vect_transform_loop (loop_vec_info loop_vinfo) { tree ratio_mult_vf; if (!ni_name) - ni_name = vect_build_loop_niters (loop_vinfo); + { + ni_name = vect_build_loop_niters (loop_vinfo); + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; + } vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf, &ratio); - epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, - ratio_mult_vf, th, - check_profitability); + /* If epilogue is combined with main loop peeling is not needed. 
*/ + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) + || check_profitability) + epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, + ratio_mult_vf, th, + check_profitability); } else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)), @@ -6998,7 +7638,10 @@ vect_transform_loop (loop_vec_info loop_vinfo) else { if (!ni_name) - ni_name = vect_build_loop_niters (loop_vinfo); + { + ni_name = vect_build_loop_niters (loop_vinfo); + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; + } vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio); } @@ -7252,6 +7895,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) slpeel_make_loop_iterate_ntimes (loop, ratio); + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + vect_combine_loop_epilogue (loop_vinfo); + /* Reduce loop iterations by the vectorization factor. */ scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), expected_iterations / vectorization_factor); @@ -7263,20 +7909,28 @@ vect_transform_loop (loop_vec_info loop_vinfo) loop->nb_iterations_likely_upper_bound = loop->nb_iterations_likely_upper_bound - 1; } - loop->nb_iterations_upper_bound - = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, - vectorization_factor) - 1; - loop->nb_iterations_likely_upper_bound - = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1, - vectorization_factor) - 1; + + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + loop->nb_iterations_upper_bound + = wi::div_ceil (loop->nb_iterations_upper_bound + 1, + vectorization_factor, UNSIGNED) - 1; + else + loop->nb_iterations_upper_bound + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, + vectorization_factor) - 1; if (loop->any_estimate) { - loop->nb_iterations_estimate - = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) - && loop->nb_iterations_estimate != 0) - loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1; + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + loop->nb_iterations_estimate + = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor, + UNSIGNED); + else + loop->nb_iterations_estimate + = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + && loop->nb_iterations_estimate != 0) + loop->nb_iterations_estimate -= 1; } if (dump_enabled_p ())