[PATCH, vec-tails 07/10] Support loop epilogue combining

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH, vec-tails 07/10] Support loop epilogue combining
@ 2016-05-19 19:46 Ilya Enkovich
  2016-06-15 11:44 ` Richard Biener
  2016-06-16 16:54 ` Jeff Law
  0 siblings, 2 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-05-19 19:46 UTC (permalink / raw)
  To: gcc-patches

Hi,

This patch introduces support for loop epilogue combining.  This includes
support in cost estimation and all required changes required to mask
vectorized loop.

Thanks,
Ilya
--
gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): Likewise.
	* tree-vect-loop.c Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
	arg, compute number of iterations for which loop epilogue combining is
	profitable.
	(vect_generate_tmps_on_preheader): Support combined apilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined apilogue.


diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index 78ddcc2..73c2966 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
 DEBUG_COUNTER (tree_sra)
 DEBUG_COUNTER (vect_loop)
 DEBUG_COUNTER (vect_slp)
+DEBUG_COUNTER (vect_tail_combine)
 DEBUG_COUNTER (dom_unreachable_edges)
diff --git a/gcc/params.def b/gcc/params.def
index 62a1e40..98d6c5a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
 	  "Maximum number of may-defs visited when devirtualizing "
 	  "speculatively", 50, 0, 0)
 
+DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
+	  "vect-cost-increase-combine-threshold",
+	  "Cost increase threshold to mask main loop for epilogue.",
+	  10, 0, 300)
+
 /*
 
 Local variables:
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index f275933..c5bdeb9 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4000,6 +4000,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
   case vect_scalar_var:
     prefix = "stmp";
     break;
+  case vect_mask_var:
+    prefix = "mask";
+    break;
   case vect_pointer_var:
     prefix = "vectp";
     break;
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index fab5879..b3c0668 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int probability_of_second_loop;
+  bool skip_second_after_first = false;
 
   if (!slpeel_can_duplicate_loop_p (loop, e))
     return NULL;
@@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
     {
       loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
       tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
-      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
+      unsigned limit = 0;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	skip_second_after_first = true;
+      else
+	limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	limit = limit + 1;
       if (check_profitability
@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   bb_between_loops = new_exit_bb;
   bb_after_second_loop = split_edge (single_exit (second_loop));
 
-  pre_condition =
-	fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
-  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
-                                  bb_after_second_loop, bb_before_first_loop,
-				  inverse_probability (second_guard_probability));
+  if (skip_second_after_first)
+    /* We can just redirect edge from bb_between_loops to
+       bb_after_second_loop but we have many code assuming
+       we have a guard after the first loop.  So just make
+       always taken condtion.  */
+    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
+				 integer_zero_node);
+  else
+    pre_condition =
+      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
+  skip_e
+    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
+			     bb_after_second_loop, bb_before_first_loop,
+			     inverse_probability (second_guard_probability));
   scale_loop_profile (second_loop, probability_of_second_loop, bound2);
   slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
                                      second_loop == new_loop, &new_exit_bb);
@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   basic_block preheader;
   int loop_num;
   int max_iter;
+  int bound2;
   tree cond_expr = NULL_TREE;
   gimple_seq cond_expr_stmt_list = NULL;
+  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
 
   loop_num  = loop->num;
 
+  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   new_loop
     = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
 				     &ratio_mult_vf_name, ni_name, false,
 				     th, check_profitability,
 				     cond_expr, cond_expr_stmt_list,
-				     0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+				     0, bound2);
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
   slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
 	      ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
 	      : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
-  if (check_profitability)
+  /* When epilogue is combined only profitability
+     treshold matters.  */
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    max_iter = (int) th - 1;
+  else if (check_profitability)
     max_iter = MAX (max_iter, (int) th - 1);
   record_niter_bound (new_loop, max_iter, false, true);
   dump_printf (MSG_NOTE,
@@ -2036,7 +2057,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
 				   bound, 0);
 
   gcc_assert (new_loop);
-  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
+  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
   /* For vectorization factor N, we need to copy at most N-1 values 
      for alignment and this means N-2 loopback edge executions.  */
   max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 31360d3..1a80c42 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-if-conv.h"
+#include "alias.h"
+#include "dbgcnt.h"
 
 /* Loop Vectorization Pass.
 
@@ -149,7 +151,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						int *);
 
 /* Function vect_determine_vectorization_factor
 
@@ -2304,8 +2307,10 @@ start_over:
 
   /* Analyze cost.  Decide if worth while to vectorize.  */
   int min_profitable_estimate, min_profitable_iters;
+  int min_profitable_combine_iters;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      &min_profitable_combine_iters);
 
   if (min_profitable_iters < 0)
     {
@@ -2412,6 +2417,52 @@ start_over:
   gcc_assert (vectorization_factor
 	      == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
 
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+    {
+      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
+    }
+  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+	   && min_profitable_combine_iters >= 0)
+    {
+      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	    && (LOOP_VINFO_INT_NITERS (loop_vinfo)
+		>= (unsigned) min_profitable_combine_iters))
+	   || estimated_niter == -1
+	   || estimated_niter >= min_profitable_combine_iters)
+	  && dbg_cnt (vect_tail_combine))
+	{
+	  LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+	  LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
+
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Decided to combine loop with its epilogue.\n");
+
+	  /* We need to adjust profitability check if combine
+	     epilogue considering additional vector iteration
+	     and profitable combine iterations.  */
+	  if ((int)(min_profitable_combine_iters + vectorization_factor)
+	      > min_scalar_loop_bound)
+	    {
+	      LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
+		= (unsigned) min_profitable_combine_iters;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Updated runtime profitability treshold: %d\n",
+				 min_profitable_combine_iters);
+
+	    }
+	}
+      else
+	{
+	  if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Not combined loop with epilogue: iterations "
+			     "count is too low (threshold is %d).\n",
+			     min_profitable_combine_iters);
+	}
+    }
+
   /* Ok to vectorize!  */
   return true;
 
@@ -3381,7 +3432,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    int *ret_min_profitable_combine_niters)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -3625,6 +3677,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
                    vec_prologue_cost);
       dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
                    vec_epilogue_cost);
+      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
+                   masking_prologue_cost);
+      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
+                   masking_inside_cost);
       dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
                    scalar_single_iter_cost);
       dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		     min_profitable_estimate);
 
   *ret_min_profitable_estimate = min_profitable_estimate;
+
+  *ret_min_profitable_combine_niters = -1;
+
+  /* Don't try to vectorize epilogue of epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    return;
+
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
+	{
+	  if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	    *ret_min_profitable_combine_niters = 0;
+	  return;
+	}
+
+      unsigned combine_treshold
+	= PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
+      /* Calculate profitability combining epilogue with the main loop.
+	 We have a threshold for inside cost overhead (not applied
+	 for low trip count loop case):
+	 MIC * 100 < VIC * CT
+	 Masked iteration should be better than a scalar prologue:
+	 MIC + VIC < SIC * epilogue_niters  */
+      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Combining overhead %d%% exceeds "
+			       "treshold %d%%.\n",
+			       masking_inside_cost * 100 / vec_inside_cost,
+			       combine_treshold);
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if ((int)(masking_inside_cost + vec_inside_cost)
+	       >= scalar_single_iter_cost * peel_iters_epilogue)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Scalar epilogue is faster than a "
+			       "single masked iteration.\n");
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	{
+	  int inside_cost = vec_inside_cost + masking_inside_cost;
+	  int outside_cost = vec_outside_cost + masking_prologue_cost;
+          int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
+				  - inside_cost * peel_iters_prologue
+                                  - inside_cost * peel_iters_epilogue)
+                                 / ((scalar_single_iter_cost * vf)
+                                    - inside_cost);
+
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Combinig loop with epilogue "
+			     "pofitability treshold = %d\n",
+			     profitable_iters);
+	  *ret_min_profitable_combine_niters = profitable_iters;
+	}
+    }
 }
 
 /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
@@ -6843,20 +6970,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   else
     ni_minus_gap_name = ni_name;
 
-  /* Create: ratio = ni >> log2(vf) */
-  /* ???  As we have ni == number of latch executions + 1, ni could
-     have overflown to zero.  So avoid computing ratio based on ni
-     but compute it using the fact that we know ratio will be at least
-     one, thus via (ni - vf) >> log2(vf) + 1.  */
-  ratio_name
-    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
-		   fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
-				fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
-					     ni_minus_gap_name,
-					     build_int_cst
-					       (TREE_TYPE (ni_name), vf)),
-				log_vf),
-		   build_int_cst (TREE_TYPE (ni_name), 1));
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    {
+      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
+      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+      ratio_name
+	= fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+				    ni_name,
+				    build_int_cst (TREE_TYPE (ni_name),
+						   vf - 1)),
+		       log_vf);
+    }
+  else
+    {
+      /* Create: ratio = ni >> log2(vf) */
+      /* ???  As we have ni == number of latch executions + 1, ni could
+	 have overflown to zero.  So avoid computing ratio based on ni
+	 but compute it using the fact that we know ratio will be at least
+	 one, thus via (ni - vf) >> log2(vf) + 1.  */
+      ratio_name
+        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+				    fold_build2 (MINUS_EXPR,
+						 TREE_TYPE (ni_name),
+						 ni_minus_gap_name,
+						 build_int_cst
+						   (TREE_TYPE (ni_name), vf)),
+				    log_vf),
+		       build_int_cst (TREE_TYPE (ni_name), 1));
+    }
+
   if (!is_gimple_val (ratio_name))
     {
       var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   return;
 }
 
+/* Function vect_gen_ivs_for_masking.
+
+   Create IVs to be used for masks computation to mask loop described
+   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
+
+   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
+   vectors, in this case IVS's elements with lower index hold IV with
+   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
+   vectorization factor.  */
+
+static void
+vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = vect_get_masking_iv_type (loop_vinfo);
+  tree type = TREE_TYPE (vectype);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies  = vf / elems;
+  int i, k;
+  tree iv, init_val, step_val;
+  bool insert_after;
+  gimple_stmt_iterator gsi;
+  tree *vtemp;
+
+  /* Create {VF, ..., VF} vector constant.  */
+  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
+
+  vtemp = XALLOCAVEC (tree, vf);
+  for (i = 0; i < ncopies; i++)
+    {
+      /* Create initial IV value.  */
+      for (k = 0; k < vf; k++)
+	vtemp[k] = build_int_cst (type, k + i * elems);
+      init_val = build_vector (vectype, vtemp);
+
+      /* Create an inductive variable including phi node.  */
+      standard_iv_increment_position (loop, &gsi, &insert_after);
+      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
+		 &iv, NULL);
+      ivs->safe_push (iv);
+    }
+}
+
+/* Function vect_get_mask_index_for_elems.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask having MASK_ELEMS elements.  */
+
+static inline unsigned
+vect_get_mask_index_for_elems (unsigned mask_elems)
+{
+  return current_vector_size / mask_elems - 1;
+}
+
+/* Function vect_get_mask_index_for_type.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask appropriate for VECTYPE.  */
+
+static inline unsigned
+vect_get_mask_index_for_type (tree vectype)
+{
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  return vect_get_mask_index_for_elems (elems);
+}
+
+/* Function vect_gen_loop_masks.
+
+   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
+   into MASKS vector.
+
+   Index of a mask in a vector is computed according to a number
+   of masks's elements.  Masks are sorted by number of its elements
+   in descending order.  Index 0 is used to access a mask with
+   current_vector_size elements.  Among masks with the same number
+   of elements the one with lower index is used to mask iterations
+   with smaller iteration counter.  Note that you may get NULL elements
+   for masks which are not required.  Use vect_get_mask_index_for_elems
+   or vect_get_mask_index_for_type to access resulting vector.  */
+
+static void
+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
+  unsigned min_mask_elems, max_mask_elems, nmasks;
+  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
+  auto_vec<tree> ivs;
+  tree vectype, mask_type;
+  tree vec_niters, vec_niters_val, mask;
+  gimple *stmt;
+  basic_block bb;
+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+  unsigned vec_size;
+
+  /* Create required IVs.  */
+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
+  vectype = TREE_TYPE (ivs[0]);
+
+  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
+
+  /* Get a proper niter to build a vector.  */
+  if (!is_gimple_val (niters))
+    {
+      gimple_seq seq = NULL;
+      niters = force_gimple_operand (niters, &seq, true, NULL);
+      gsi_insert_seq_on_edge_immediate (pe, seq);
+    }
+  /* We may need a type cast in case niter has a too small type
+     for generated IVs.  */
+  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
+    {
+      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
+					    NULL, "niters");
+      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
+      bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!bb);
+      niters = new_niters;
+    }
+  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
+  vec_niters_val = build_vector_from_val (vectype, niters);
+  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
+  stmt = gimple_build_assign (vec_niters, vec_niters_val);
+  bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!bb);
+
+  /* Determine which masks we need to compute and how many.  */
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
+  masks->safe_grow_cleared (nmasks);
+
+  /* Now create base masks through comparison IV < VEC_NITERS.  */
+  mask_type = build_same_sized_truth_vector_type (vectype);
+  cur_mask = vect_get_mask_index_for_elems (iv_elems);
+  for (unsigned i = 0; i < ivs.length (); i++)
+    {
+      tree iv = ivs[i];
+      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      (*masks)[cur_mask++] = mask;
+    }
+
+  /* Create narrowed masks.  */
+  cur_mask_elems = iv_elems;
+  nmasks = ivs.length ();
+  while (cur_mask_elems < max_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems <<= 1;
+      nmasks >>= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i++)
+	{
+	  tree mask_low = (*masks)[prev_mask++];
+	  tree mask_hi = (*masks)[prev_mask++];
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
+				      mask_low, mask_hi);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+
+  /* Created widened masks.  */
+  cur_mask_elems = iv_elems;
+  nmasks = ivs.length ();
+  while (cur_mask_elems > min_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems >>= 1;
+      nmasks <<= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i += 2)
+	{
+	  tree orig_mask = (*masks)[prev_mask++];
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_mask_reduction_stmt.
+
+   Mask given vectorized reduction statement STMT using
+   MASK.  In case scalar reduction statement is vectorized
+   into several vector statements then PREV holds a
+   preceding vetor statement copy for STMT.
+
+   Masking is performed using VEC_COND_EXPR. E.g.
+
+   S1: r_1 = r_2 + d_3
+
+   is transformed into:
+
+   S1': r_4 = r_2 + d_3
+   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
+
+   Return generated condition statement.  */
+
+static gimple *
+vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
+{
+  gimple_stmt_iterator gsi;
+  tree vectype;
+  tree lhs, rhs, tmp;
+  gimple *new_stmt, *phi;
+
+  lhs = gimple_assign_lhs (stmt);
+  vectype = TREE_TYPE (lhs);
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  /* Find operand RHS defined by PHI node.  */
+  rhs = gimple_assign_rhs1 (stmt);
+  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+  phi = SSA_NAME_DEF_STMT (rhs);
+
+  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
+    {
+      rhs = gimple_assign_rhs2 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      phi = SSA_NAME_DEF_STMT (rhs);
+      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
+    }
+
+  /* Convert reduction stmt to ordinary assignment to TMP.  */
+  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
+  gimple_assign_set_lhs (stmt, tmp);
+
+  /* Create VEC_COND_EXPR and insert it after STMT.  */
+  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
+  gsi = gsi_for_stmt (stmt);
+  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+
+  return new_stmt;
+}
+
+/* Function vect_mask_mask_load_store_stmt.
+
+   Mask given vectorized MASK_LOAD or MASK_STORE statement
+   STMT using MASK.  Function replaces a mask used by STMT
+   with its conjunction with MASK.  */
+
+static void
+vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree old_mask, new_mask;
+  gimple_stmt_iterator gsi;
+
+  gsi = gsi_for_stmt (stmt);
+  old_mask = gimple_call_arg (stmt, 2);
+
+  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
+
+  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
+  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
+  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+
+  gimple_call_set_arg (stmt, 2, new_mask);
+  update_stmt (stmt);
+}
+
+
+/* Function vect_mask_load_store_stmt.
+
+   Mask given vectorized load or store statement STMT using
+   MASK.  DR is a data reference for a scalar memory access.
+   Assignment is transformed into MASK_LOAD or MASK_STORE
+   statement.  SI is either an iterator pointing to STMT and
+   is to be updated or NULL.  */
+
+static void
+vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
+			   data_reference *dr, gimple_stmt_iterator *si)
+{
+  tree mem, val, addr, ptr;
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+  unsigned align, misalign;
+  tree elem_type = TREE_TYPE (vectype);
+  gimple *new_stmt;
+
+  gcc_assert (!si || gsi_stmt (*si) == stmt);
+
+  gsi = gsi_for_stmt (stmt);
+  if (gimple_store_p (stmt))
+    {
+      val = gimple_assign_rhs1 (stmt);
+      mem = gimple_assign_lhs (stmt);
+    }
+  else
+    {
+      val = gimple_assign_lhs (stmt);
+      mem = gimple_assign_rhs1 (stmt);
+    }
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
+				   true, NULL_TREE, true,
+				   GSI_SAME_STMT);
+
+  align = TYPE_ALIGN_UNIT (vectype);
+  if (aligned_access_p (dr))
+    misalign = 0;
+  else if (DR_MISALIGNMENT (dr) == -1)
+    {
+      align = TYPE_ALIGN_UNIT (elem_type);
+      misalign = 0;
+    }
+  else
+    misalign = DR_MISALIGNMENT (dr);
+  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
+  ptr = build_int_cst (reference_alias_ptr_type (mem),
+		       misalign ? misalign & -misalign : align);
+
+  if (gimple_store_p (stmt))
+    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					   mask, val);
+  else
+    {
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
+					     mask);
+      gimple_call_set_lhs (new_stmt, val);
+    }
+  gsi_replace (si ? si : &gsi, new_stmt, false);
+}
+
+/* Function vect_combine_loop_epilogue.
+
+   Combine loop epilogue with the main vectorized body.  It requires
+   masking of memory accesses and reductions.  */
+
+static void
+vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  unsigned mask_no;
+  auto_vec<tree> masks;
+
+  vect_gen_loop_masks (loop_vinfo, &masks);
+
+  /* Convert reduction statements if any.  */
+  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple *prev_stmt = NULL;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
+
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      while (stmt)
+	{
+	  prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
+						prev_stmt);
+	  stmt_info = vinfo_for_stmt (stmt);
+	  stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	}
+    }
+
+  /* Scan all loop statements to convert vector load/store including masked
+     form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+	   !gsi_end_p (si); gsi_next (&si))
+	{
+	  gimple *stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = NULL;
+	  tree vectype = NULL;
+	  data_reference *dr;
+
+	  /* Mask load case.  */
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt)
+	      && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	      && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Mask store case.  */
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
+	    }
+	  /* Load case.  */
+	  else if (gimple_assign_load_p (stmt)
+		   && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      /* Skip vector loads.  */
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+	      /* Skip invariant loads.  */
+	      if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
+				 ? STMT_VINFO_DR_STEP (stmt_info)
+				 : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Store case.  */
+	  else if (gimple_code (stmt) == GIMPLE_ASSIGN
+		   && gimple_store_p (stmt)
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  else
+	    continue;
+
+	  /* Skip hoisted out statements.  */
+	  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	    continue;
+
+	  mask_no = vect_get_mask_index_for_type (vectype);
+
+	  dr = STMT_VINFO_DATA_REF (stmt_info);
+	  while (stmt)
+	    {
+	      if (is_gimple_call (stmt))
+		vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
+	      else
+		vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
+					   /* Have to update iterator only if
+					      it points to stmt we mask.  */
+					   stmt == gsi_stmt (si) ? &si : NULL);
+
+	      stmt_info = vinfo_for_stmt (stmt);
+	      stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	    }
+	}
+    }
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== Loop epilogue was combined ===\n");
+}
 
 /* Function vect_transform_loop.
 
@@ -6927,7 +7550,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      run at least the vectorization factor number of times checking
      is pointless, too.  */
   th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	   && th > 1))
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
       if (dump_enabled_p ())
@@ -6976,12 +7601,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
-      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
-						 ratio_mult_vf, th,
-						 check_profitability);
+      /* If epilogue is combined with main loop peeling is not needed.  */
+      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	  || check_profitability)
+	epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
+						   ratio_mult_vf, th,
+						   check_profitability);
     }
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
@@ -6989,7 +7620,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   else
     {
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
     }
 
@@ -7243,23 +7877,35 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, ratio);
 
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    vect_combine_loop_epilogue (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
 		      expected_iterations / vectorization_factor);
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
       && loop->nb_iterations_upper_bound != 0)
-    loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1;
-  loop->nb_iterations_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
-		      vectorization_factor) - 1;
-
+    loop->nb_iterations_upper_bound -= 1;
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    loop->nb_iterations_upper_bound
+      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
+		      vectorization_factor, UNSIGNED) - 1;
+  else
+    loop->nb_iterations_upper_bound
+      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
+			vectorization_factor) - 1;
   if (loop->any_estimate)
     {
-      loop->nb_iterations_estimate
-        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
-       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	   && loop->nb_iterations_estimate != 0)
-	 loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	loop->nb_iterations_estimate
+	  = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
+			  UNSIGNED);
+      else
+	loop->nb_iterations_estimate
+	  = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  && loop->nb_iterations_estimate != 0)
+	loop->nb_iterations_estimate -= 1;
     }
 
   if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-05-19 19:46 [PATCH, vec-tails 07/10] Support loop epilogue combining Ilya Enkovich
@ 2016-06-15 11:44 ` Richard Biener
  2016-06-16 15:41   ` Ilya Enkovich
  2016-06-16 16:54 ` Jeff Law
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Biener @ 2016-06-15 11:44 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: GCC Patches

On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hi,
>
> This patch introduces support for loop epilogue combining.  This includes
> support in cost estimation and all required changes required to mask
> vectorized loop.

I wonder why you compute a minimum number of iterations to make masking
of the vectorized body profitable rather than a maximum number of iterations.

I'd say masking the vectorized loop is profitable if niter/vf *
masking-overhead < epilogue-cost.
Masking the epilogue is profitable if vectorizing the epilogue with
masking is profitable.

Am I missing something?

Thanks,
Richard.

> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * dbgcnt.def (vect_tail_combine): New.
>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>         epilogue combined with loop body.
>         (vect_do_peeling_for_loop_bound): Likewise.
>         * tree-vect-loop.c Include alias.h and dbgcnt.h.
>         (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>         arg, compute number of iterations for which loop epilogue combining is
>         profitable.
>         (vect_generate_tmps_on_preheader): Support combined apilogue.
>         (vect_gen_ivs_for_masking): New.
>         (vect_get_mask_index_for_elems): New.
>         (vect_get_mask_index_for_type): New.
>         (vect_gen_loop_masks): New.
>         (vect_mask_reduction_stmt): New.
>         (vect_mask_mask_load_store_stmt): New.
>         (vect_mask_load_store_stmt): New.
>         (vect_combine_loop_epilogue): New.
>         (vect_transform_loop): Support combined apilogue.
>
>
> diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
> index 78ddcc2..73c2966 100644
> --- a/gcc/dbgcnt.def
> +++ b/gcc/dbgcnt.def
> @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
>  DEBUG_COUNTER (tree_sra)
>  DEBUG_COUNTER (vect_loop)
>  DEBUG_COUNTER (vect_slp)
> +DEBUG_COUNTER (vect_tail_combine)
>  DEBUG_COUNTER (dom_unreachable_edges)
> diff --git a/gcc/params.def b/gcc/params.def
> index 62a1e40..98d6c5a 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
>           "Maximum number of may-defs visited when devirtualizing "
>           "speculatively", 50, 0, 0)
>
> +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
> +         "vect-cost-increase-combine-threshold",
> +         "Cost increase threshold to mask main loop for epilogue.",
> +         10, 0, 300)
> +
>  /*
>
>  Local variables:
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index f275933..c5bdeb9 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -4000,6 +4000,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
>    case vect_scalar_var:
>      prefix = "stmp";
>      break;
> +  case vect_mask_var:
> +    prefix = "mask";
> +    break;
>    case vect_pointer_var:
>      prefix = "vectp";
>      break;
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index fab5879..b3c0668 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>    int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>    int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>    int probability_of_second_loop;
> +  bool skip_second_after_first = false;
>
>    if (!slpeel_can_duplicate_loop_p (loop, e))
>      return NULL;
> @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>      {
>        loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
>        tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
> -      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
> +      unsigned limit = 0;
> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +       skip_second_after_first = true;
> +      else
> +       limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
>        if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>         limit = limit + 1;
>        if (check_profitability
> @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>    bb_between_loops = new_exit_bb;
>    bb_after_second_loop = split_edge (single_exit (second_loop));
>
> -  pre_condition =
> -       fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> -  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> -                                  bb_after_second_loop, bb_before_first_loop,
> -                                 inverse_probability (second_guard_probability));
> +  if (skip_second_after_first)
> +    /* We can just redirect edge from bb_between_loops to
> +       bb_after_second_loop but we have many code assuming
> +       we have a guard after the first loop.  So just make
> +       always taken condtion.  */
> +    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
> +                                integer_zero_node);
> +  else
> +    pre_condition =
> +      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> +  skip_e
> +    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> +                            bb_after_second_loop, bb_before_first_loop,
> +                            inverse_probability (second_guard_probability));
>    scale_loop_profile (second_loop, probability_of_second_loop, bound2);
>    slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                       second_loop == new_loop, &new_exit_bb);
> @@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    basic_block preheader;
>    int loop_num;
>    int max_iter;
> +  int bound2;
>    tree cond_expr = NULL_TREE;
>    gimple_seq cond_expr_stmt_list = NULL;
> +  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>
>    loop_num  = loop->num;
>
> +  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    new_loop
>      = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
>                                      &ratio_mult_vf_name, ni_name, false,
>                                      th, check_profitability,
>                                      cond_expr, cond_expr_stmt_list,
> -                                    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +                                    0, bound2);
>    gcc_assert (new_loop);
>    gcc_assert (loop_num == loop->num);
>    slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
> @@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>               ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>               : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
> -  if (check_profitability)
> +  /* When epilogue is combined only profitability
> +     treshold matters.  */
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    max_iter = (int) th - 1;
> +  else if (check_profitability)
>      max_iter = MAX (max_iter, (int) th - 1);
>    record_niter_bound (new_loop, max_iter, false, true);
>    dump_printf (MSG_NOTE,
> @@ -2036,7 +2057,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
>                                    bound, 0);
>
>    gcc_assert (new_loop);
> -  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
> +  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
>    /* For vectorization factor N, we need to copy at most N-1 values
>       for alignment and this means N-2 loopback edge executions.  */
>    max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 31360d3..1a80c42 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-fold.h"
>  #include "cgraph.h"
>  #include "tree-if-conv.h"
> +#include "alias.h"
> +#include "dbgcnt.h"
>
>  /* Loop Vectorization Pass.
>
> @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3.  If not see
>     http://gcc.gnu.org/projects/tree-ssa/vectorization.html
>  */
>
> -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
> +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
> +                                               int *);
>
>  /* Function vect_determine_vectorization_factor
>
> @@ -2304,8 +2307,10 @@ start_over:
>
>    /* Analyze cost.  Decide if worth while to vectorize.  */
>    int min_profitable_estimate, min_profitable_iters;
> +  int min_profitable_combine_iters;
>    vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
> -                                     &min_profitable_estimate);
> +                                     &min_profitable_estimate,
> +                                     &min_profitable_combine_iters);
>
>    if (min_profitable_iters < 0)
>      {
> @@ -2412,6 +2417,52 @@ start_over:
>    gcc_assert (vectorization_factor
>               == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>
> +  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +    {
> +      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> +      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
> +    }
> +  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +          && min_profitable_combine_iters >= 0)
> +    {
> +      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +           && (LOOP_VINFO_INT_NITERS (loop_vinfo)
> +               >= (unsigned) min_profitable_combine_iters))
> +          || estimated_niter == -1
> +          || estimated_niter >= min_profitable_combine_iters)
> +         && dbg_cnt (vect_tail_combine))
> +       {
> +         LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> +         LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
> +
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "Decided to combine loop with its epilogue.\n");
> +
> +         /* We need to adjust profitability check if combine
> +            epilogue considering additional vector iteration
> +            and profitable combine iterations.  */
> +         if ((int)(min_profitable_combine_iters + vectorization_factor)
> +             > min_scalar_loop_bound)
> +           {
> +             LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
> +               = (unsigned) min_profitable_combine_iters;
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_NOTE, vect_location,
> +                                "Updated runtime profitability treshold: %d\n",
> +                                min_profitable_combine_iters);
> +
> +           }
> +       }
> +      else
> +       {
> +         if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "Not combined loop with epilogue: iterations "
> +                            "count is too low (threshold is %d).\n",
> +                            min_profitable_combine_iters);
> +       }
> +    }
> +
>    /* Ok to vectorize!  */
>    return true;
>
> @@ -3381,7 +3432,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
>  static void
>  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                                     int *ret_min_profitable_niters,
> -                                   int *ret_min_profitable_estimate)
> +                                   int *ret_min_profitable_estimate,
> +                                   int *ret_min_profitable_combine_niters)
>  {
>    int min_profitable_iters;
>    int min_profitable_estimate;
> @@ -3625,6 +3677,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                     vec_prologue_cost);
>        dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
>                     vec_epilogue_cost);
> +      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
> +                   masking_prologue_cost);
> +      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
> +                   masking_inside_cost);
>        dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
>                     scalar_single_iter_cost);
>        dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
> @@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                      min_profitable_estimate);
>
>    *ret_min_profitable_estimate = min_profitable_estimate;
> +
> +  *ret_min_profitable_combine_niters = -1;
> +
> +  /* Don't try to vectorize epilogue of epilogue.  */
> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> +    return;
> +
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
> +       {
> +         if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> +           *ret_min_profitable_combine_niters = 0;
> +         return;
> +       }
> +
> +      unsigned combine_treshold
> +       = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
> +      /* Calculate profitability combining epilogue with the main loop.
> +        We have a threshold for inside cost overhead (not applied
> +        for low trip count loop case):
> +        MIC * 100 < VIC * CT
> +        Masked iteration should be better than a scalar prologue:
> +        MIC + VIC < SIC * epilogue_niters  */
> +      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
> +       {
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Combining loop with epilogue is not "
> +                              "profitable.\n");
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "  Combining overhead %d%% exceeds "
> +                              "treshold %d%%.\n",
> +                              masking_inside_cost * 100 / vec_inside_cost,
> +                              combine_treshold);
> +           }
> +         *ret_min_profitable_combine_niters = -1;
> +       }
> +      else if ((int)(masking_inside_cost + vec_inside_cost)
> +              >= scalar_single_iter_cost * peel_iters_epilogue)
> +       {
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Combining loop with epilogue is not "
> +                              "profitable.\n");
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "  Scalar epilogue is faster than a "
> +                              "single masked iteration.\n");
> +           }
> +         *ret_min_profitable_combine_niters = -1;
> +       }
> +      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> +       {
> +         int inside_cost = vec_inside_cost + masking_inside_cost;
> +         int outside_cost = vec_outside_cost + masking_prologue_cost;
> +          int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
> +                                 - inside_cost * peel_iters_prologue
> +                                  - inside_cost * peel_iters_epilogue)
> +                                 / ((scalar_single_iter_cost * vf)
> +                                    - inside_cost);
> +
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "Combinig loop with epilogue "
> +                            "pofitability treshold = %d\n",
> +                            profitable_iters);
> +         *ret_min_profitable_combine_niters = profitable_iters;
> +       }
> +    }
>  }
>
>  /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
> @@ -6843,20 +6970,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    else
>      ni_minus_gap_name = ni_name;
>
> -  /* Create: ratio = ni >> log2(vf) */
> -  /* ???  As we have ni == number of latch executions + 1, ni could
> -     have overflown to zero.  So avoid computing ratio based on ni
> -     but compute it using the fact that we know ratio will be at least
> -     one, thus via (ni - vf) >> log2(vf) + 1.  */
> -  ratio_name
> -    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> -                  fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> -                               fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
> -                                            ni_minus_gap_name,
> -                                            build_int_cst
> -                                              (TREE_TYPE (ni_name), vf)),
> -                               log_vf),
> -                  build_int_cst (TREE_TYPE (ni_name), 1));
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    {
> +      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
> +      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> +      ratio_name
> +       = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> +                      fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> +                                   ni_name,
> +                                   build_int_cst (TREE_TYPE (ni_name),
> +                                                  vf - 1)),
> +                      log_vf);
> +    }
> +  else
> +    {
> +      /* Create: ratio = ni >> log2(vf) */
> +      /* ???  As we have ni == number of latch executions + 1, ni could
> +        have overflown to zero.  So avoid computing ratio based on ni
> +        but compute it using the fact that we know ratio will be at least
> +        one, thus via (ni - vf) >> log2(vf) + 1.  */
> +      ratio_name
> +        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> +                      fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> +                                   fold_build2 (MINUS_EXPR,
> +                                                TREE_TYPE (ni_name),
> +                                                ni_minus_gap_name,
> +                                                build_int_cst
> +                                                  (TREE_TYPE (ni_name), vf)),
> +                                   log_vf),
> +                      build_int_cst (TREE_TYPE (ni_name), 1));
> +    }
> +
>    if (!is_gimple_val (ratio_name))
>      {
>        var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
> @@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    return;
>  }
>
> +/* Function vect_gen_ivs_for_masking.
> +
> +   Create IVs to be used for masks computation to mask loop described
> +   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
> +
> +   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
> +   vectors, in this case IVS's elements with lower index hold IV with
> +   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
> +   vectorization factor.  */
> +
> +static void
> +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype = vect_get_masking_iv_type (loop_vinfo);
> +  tree type = TREE_TYPE (vectype);
> +  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> +  int ncopies  = vf / elems;
> +  int i, k;
> +  tree iv, init_val, step_val;
> +  bool insert_after;
> +  gimple_stmt_iterator gsi;
> +  tree *vtemp;
> +
> +  /* Create {VF, ..., VF} vector constant.  */
> +  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
> +
> +  vtemp = XALLOCAVEC (tree, vf);
> +  for (i = 0; i < ncopies; i++)
> +    {
> +      /* Create initial IV value.  */
> +      for (k = 0; k < vf; k++)
> +       vtemp[k] = build_int_cst (type, k + i * elems);
> +      init_val = build_vector (vectype, vtemp);
> +
> +      /* Create an inductive variable including phi node.  */
> +      standard_iv_increment_position (loop, &gsi, &insert_after);
> +      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
> +                &iv, NULL);
> +      ivs->safe_push (iv);
> +    }
> +}
> +
> +/* Function vect_get_mask_index_for_elems.
> +
> +   A helper function to access masks vector.  See vect_gen_loop_masks
> +   for masks vector sorting description.  Return index of the first
> +   mask having MASK_ELEMS elements.  */
> +
> +static inline unsigned
> +vect_get_mask_index_for_elems (unsigned mask_elems)
> +{
> +  return current_vector_size / mask_elems - 1;
> +}
> +
> +/* Function vect_get_mask_index_for_type.
> +
> +   A helper function to access masks vector.  See vect_gen_loop_masks
> +   for masks vector sorting description.  Return index of the first
> +   mask appropriate for VECTYPE.  */
> +
> +static inline unsigned
> +vect_get_mask_index_for_type (tree vectype)
> +{
> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> +  return vect_get_mask_index_for_elems (elems);
> +}
> +
> +/* Function vect_gen_loop_masks.
> +
> +   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
> +   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> +   into MASKS vector.
> +
> +   Index of a mask in a vector is computed according to a number
> +   of masks's elements.  Masks are sorted by number of its elements
> +   in descending order.  Index 0 is used to access a mask with
> +   current_vector_size elements.  Among masks with the same number
> +   of elements the one with lower index is used to mask iterations
> +   with smaller iteration counter.  Note that you may get NULL elements
> +   for masks which are not required.  Use vect_get_mask_index_for_elems
> +   or vect_get_mask_index_for_type to access resulting vector.  */
> +
> +static void
> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  edge pe = loop_preheader_edge (loop);
> +  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
> +  unsigned min_mask_elems, max_mask_elems, nmasks;
> +  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
> +  auto_vec<tree> ivs;
> +  tree vectype, mask_type;
> +  tree vec_niters, vec_niters_val, mask;
> +  gimple *stmt;
> +  basic_block bb;
> +  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
> +  unsigned vec_size;
> +
> +  /* Create required IVs.  */
> +  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
> +  vectype = TREE_TYPE (ivs[0]);
> +
> +  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
> +  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +  /* Get a proper niter to build a vector.  */
> +  if (!is_gimple_val (niters))
> +    {
> +      gimple_seq seq = NULL;
> +      niters = force_gimple_operand (niters, &seq, true, NULL);
> +      gsi_insert_seq_on_edge_immediate (pe, seq);
> +    }
> +  /* We may need a type cast in case niter has a too small type
> +     for generated IVs.  */
> +  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
> +    {
> +      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
> +                                           NULL, "niters");
> +      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
> +      bb = gsi_insert_on_edge_immediate (pe, stmt);
> +      gcc_assert (!bb);
> +      niters = new_niters;
> +    }
> +  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
> +  vec_niters_val = build_vector_from_val (vectype, niters);
> +  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
> +  stmt = gimple_build_assign (vec_niters, vec_niters_val);
> +  bb = gsi_insert_on_edge_immediate (pe, stmt);
> +  gcc_assert (!bb);
> +
> +  /* Determine which masks we need to compute and how many.  */
> +  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
> +  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
> +  masks->safe_grow_cleared (nmasks);
> +
> +  /* Now create base masks through comparison IV < VEC_NITERS.  */
> +  mask_type = build_same_sized_truth_vector_type (vectype);
> +  cur_mask = vect_get_mask_index_for_elems (iv_elems);
> +  for (unsigned i = 0; i < ivs.length (); i++)
> +    {
> +      tree iv = ivs[i];
> +      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
> +      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +      (*masks)[cur_mask++] = mask;
> +    }
> +
> +  /* Create narrowed masks.  */
> +  cur_mask_elems = iv_elems;
> +  nmasks = ivs.length ();
> +  while (cur_mask_elems < max_mask_elems)
> +    {
> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      cur_mask_elems <<= 1;
> +      nmasks >>= 1;
> +
> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> +      for (unsigned i = 0; i < nmasks; i++)
> +       {
> +         tree mask_low = (*masks)[prev_mask++];
> +         tree mask_hi = (*masks)[prev_mask++];
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
> +                                     mask_low, mask_hi);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +       }
> +    }
> +
> +  /* Created widened masks.  */
> +  cur_mask_elems = iv_elems;
> +  nmasks = ivs.length ();
> +  while (cur_mask_elems > min_mask_elems)
> +    {
> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      cur_mask_elems >>= 1;
> +      nmasks <<= 1;
> +
> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> +      for (unsigned i = 0; i < nmasks; i += 2)
> +       {
> +         tree orig_mask = (*masks)[prev_mask++];
> +
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +       }
> +    }
> +}
> +
> +/* Function vect_mask_reduction_stmt.
> +
> +   Mask given vectorized reduction statement STMT using
> +   MASK.  In case scalar reduction statement is vectorized
> +   into several vector statements then PREV holds a
> +   preceding vetor statement copy for STMT.
> +
> +   Masking is performed using VEC_COND_EXPR. E.g.
> +
> +   S1: r_1 = r_2 + d_3
> +
> +   is transformed into:
> +
> +   S1': r_4 = r_2 + d_3
> +   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
> +
> +   Return generated condition statement.  */
> +
> +static gimple *
> +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
> +{
> +  gimple_stmt_iterator gsi;
> +  tree vectype;
> +  tree lhs, rhs, tmp;
> +  gimple *new_stmt, *phi;
> +
> +  lhs = gimple_assign_lhs (stmt);
> +  vectype = TREE_TYPE (lhs);
> +
> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> +  /* Find operand RHS defined by PHI node.  */
> +  rhs = gimple_assign_rhs1 (stmt);
> +  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> +  phi = SSA_NAME_DEF_STMT (rhs);
> +
> +  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
> +    {
> +      rhs = gimple_assign_rhs2 (stmt);
> +      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> +      phi = SSA_NAME_DEF_STMT (rhs);
> +      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
> +    }
> +
> +  /* Convert reduction stmt to ordinary assignment to TMP.  */
> +  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
> +  gimple_assign_set_lhs (stmt, tmp);
> +
> +  /* Create VEC_COND_EXPR and insert it after STMT.  */
> +  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
> +  gsi = gsi_for_stmt (stmt);
> +  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
> +
> +  return new_stmt;
> +}
> +
> +/* Function vect_mask_mask_load_store_stmt.
> +
> +   Mask given vectorized MASK_LOAD or MASK_STORE statement
> +   STMT using MASK.  Function replaces a mask used by STMT
> +   with its conjunction with MASK.  */
> +
> +static void
> +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
> +{
> +  gimple *new_stmt;
> +  tree old_mask, new_mask;
> +  gimple_stmt_iterator gsi;
> +
> +  gsi = gsi_for_stmt (stmt);
> +  old_mask = gimple_call_arg (stmt, 2);
> +
> +  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
> +
> +  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
> +  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
> +  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
> +
> +  gimple_call_set_arg (stmt, 2, new_mask);
> +  update_stmt (stmt);
> +}
> +
> +
> +/* Function vect_mask_load_store_stmt.
> +
> +   Mask given vectorized load or store statement STMT using
> +   MASK.  DR is a data reference for a scalar memory access.
> +   Assignment is transformed into MASK_LOAD or MASK_STORE
> +   statement.  SI is either an iterator pointing to STMT and
> +   is to be updated or NULL.  */
> +
> +static void
> +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
> +                          data_reference *dr, gimple_stmt_iterator *si)
> +{
> +  tree mem, val, addr, ptr;
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +  unsigned align, misalign;
> +  tree elem_type = TREE_TYPE (vectype);
> +  gimple *new_stmt;
> +
> +  gcc_assert (!si || gsi_stmt (*si) == stmt);
> +
> +  gsi = gsi_for_stmt (stmt);
> +  if (gimple_store_p (stmt))
> +    {
> +      val = gimple_assign_rhs1 (stmt);
> +      mem = gimple_assign_lhs (stmt);
> +    }
> +  else
> +    {
> +      val = gimple_assign_lhs (stmt);
> +      mem = gimple_assign_rhs1 (stmt);
> +    }
> +
> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
> +                                  true, NULL_TREE, true,
> +                                  GSI_SAME_STMT);
> +
> +  align = TYPE_ALIGN_UNIT (vectype);
> +  if (aligned_access_p (dr))
> +    misalign = 0;
> +  else if (DR_MISALIGNMENT (dr) == -1)
> +    {
> +      align = TYPE_ALIGN_UNIT (elem_type);
> +      misalign = 0;
> +    }
> +  else
> +    misalign = DR_MISALIGNMENT (dr);
> +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
> +  ptr = build_int_cst (reference_alias_ptr_type (mem),
> +                      misalign ? misalign & -misalign : align);
> +
> +  if (gimple_store_p (stmt))
> +    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
> +                                          mask, val);
> +  else
> +    {
> +      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
> +                                            mask);
> +      gimple_call_set_lhs (new_stmt, val);
> +    }
> +  gsi_replace (si ? si : &gsi, new_stmt, false);
> +}
> +
> +/* Function vect_combine_loop_epilogue.
> +
> +   Combine loop epilogue with the main vectorized body.  It requires
> +   masking of memory accesses and reductions.  */
> +
> +static void
> +vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
> +  unsigned mask_no;
> +  auto_vec<tree> masks;
> +
> +  vect_gen_loop_masks (loop_vinfo, &masks);
> +
> +  /* Convert reduction statements if any.  */
> +  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
> +    {
> +      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
> +      gimple *prev_stmt = NULL;
> +      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +
> +      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
> +
> +      stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +      while (stmt)
> +       {
> +         prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
> +                                               prev_stmt);
> +         stmt_info = vinfo_for_stmt (stmt);
> +         stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> +       }
> +    }
> +
> +  /* Scan all loop statements to convert vector load/store including masked
> +     form.  */
> +  for (unsigned i = 0; i < loop->num_nodes; i++)
> +    {
> +      basic_block bb = bbs[i];
> +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
> +          !gsi_end_p (si); gsi_next (&si))
> +       {
> +         gimple *stmt = gsi_stmt (si);
> +         stmt_vec_info stmt_info = NULL;
> +         tree vectype = NULL;
> +         data_reference *dr;
> +
> +         /* Mask load case.  */
> +         if (is_gimple_call (stmt)
> +             && gimple_call_internal_p (stmt)
> +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
> +               continue;
> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         /* Mask store case.  */
> +         else if (is_gimple_call (stmt)
> +                  && gimple_call_internal_p (stmt)
> +                  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
> +                  && vinfo_for_stmt (stmt)
> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
> +           }
> +         /* Load case.  */
> +         else if (gimple_assign_load_p (stmt)
> +                  && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             /* Skip vector loads.  */
> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
> +               continue;
> +             /* Skip invariant loads.  */
> +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
> +                                ? STMT_VINFO_DR_STEP (stmt_info)
> +                                : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
> +               continue;
> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         /* Store case.  */
> +         else if (gimple_code (stmt) == GIMPLE_ASSIGN
> +                  && gimple_store_p (stmt)
> +                  && vinfo_for_stmt (stmt)
> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         else
> +           continue;
> +
> +         /* Skip hoisted out statements.  */
> +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> +           continue;
> +
> +         mask_no = vect_get_mask_index_for_type (vectype);
> +
> +         dr = STMT_VINFO_DATA_REF (stmt_info);
> +         while (stmt)
> +           {
> +             if (is_gimple_call (stmt))
> +               vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
> +             else
> +               vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
> +                                          /* Have to update iterator only if
> +                                             it points to stmt we mask.  */
> +                                          stmt == gsi_stmt (si) ? &si : NULL);
> +
> +             stmt_info = vinfo_for_stmt (stmt);
> +             stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> +           }
> +       }
> +    }
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +                    "=== Loop epilogue was combined ===\n");
> +}
>
>  /* Function vect_transform_loop.
>
> @@ -6927,7 +7550,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>       run at least the vectorization factor number of times checking
>       is pointless, too.  */
>    th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> -  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> +  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> +       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> +          && th > 1))
>        && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>      {
>        if (dump_enabled_p ())
> @@ -6976,12 +7601,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>      {
>        tree ratio_mult_vf;
>        if (!ni_name)
> -       ni_name = vect_build_loop_niters (loop_vinfo);
> +       {
> +         ni_name = vect_build_loop_niters (loop_vinfo);
> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> +       }
>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
>                                        &ratio);
> -      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> -                                                ratio_mult_vf, th,
> -                                                check_profitability);
> +      /* If epilogue is combined with main loop peeling is not needed.  */
> +      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> +         || check_profitability)
> +       epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> +                                                  ratio_mult_vf, th,
> +                                                  check_profitability);
>      }
>    else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>      ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
> @@ -6989,7 +7620,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>    else
>      {
>        if (!ni_name)
> -       ni_name = vect_build_loop_niters (loop_vinfo);
> +       {
> +         ni_name = vect_build_loop_niters (loop_vinfo);
> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> +       }
>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
>      }
>
> @@ -7243,23 +7877,35 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>
>    slpeel_make_loop_iterate_ntimes (loop, ratio);
>
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    vect_combine_loop_epilogue (loop_vinfo);
> +
>    /* Reduce loop iterations by the vectorization factor.  */
>    scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
>                       expected_iterations / vectorization_factor);
>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>        && loop->nb_iterations_upper_bound != 0)
> -    loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1;
> -  loop->nb_iterations_upper_bound
> -    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> -                     vectorization_factor) - 1;
> -
> +    loop->nb_iterations_upper_bound -= 1;
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    loop->nb_iterations_upper_bound
> +      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
> +                     vectorization_factor, UNSIGNED) - 1;
> +  else
> +    loop->nb_iterations_upper_bound
> +      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> +                       vectorization_factor) - 1;
>    if (loop->any_estimate)
>      {
> -      loop->nb_iterations_estimate
> -        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> -       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -          && loop->nb_iterations_estimate != 0)
> -        loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +       loop->nb_iterations_estimate
> +         = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
> +                         UNSIGNED);
> +      else
> +       loop->nb_iterations_estimate
> +         = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> +      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +         && loop->nb_iterations_estimate != 0)
> +       loop->nb_iterations_estimate -= 1;
>      }
>
>    if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-15 11:44 ` Richard Biener
@ 2016-06-16 15:41   ` Ilya Enkovich
  2016-06-16 15:51     ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-06-16 15:41 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

2016-06-15 14:44 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> Hi,
>>
>> This patch introduces support for loop epilogue combining.  This includes
>> support in cost estimation and all required changes required to mask
>> vectorized loop.
>
> I wonder why you compute a minimum number of iterations to make masking
> of the vectorized body profitable rather than a maximum number of iterations.
>
> I'd say masking the vectorized loop is profitable if niter/vf *
> masking-overhead < epilogue-cost.
> Masking the epilogue is profitable if vectorizing the epilogue with
> masking is profitable.
>
> Am I missing something?

We don't have two versions of vectorized loop.  The choice is between vector
and scalar loop and in this case minimum number of iterations is what we need.
Generating two vectorized loop versions would be something new to vectorizer.

Thanks,
Ilya

>
> Thanks,
> Richard.
>
>> Thanks,
>> Ilya
>> --
>> gcc/
>>
>> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>>
>>         * dbgcnt.def (vect_tail_combine): New.
>>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>>         epilogue combined with loop body.
>>         (vect_do_peeling_for_loop_bound): Likewise.
>>         * tree-vect-loop.c Include alias.h and dbgcnt.h.
>>         (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>>         arg, compute number of iterations for which loop epilogue combining is
>>         profitable.
>>         (vect_generate_tmps_on_preheader): Support combined apilogue.
>>         (vect_gen_ivs_for_masking): New.
>>         (vect_get_mask_index_for_elems): New.
>>         (vect_get_mask_index_for_type): New.
>>         (vect_gen_loop_masks): New.
>>         (vect_mask_reduction_stmt): New.
>>         (vect_mask_mask_load_store_stmt): New.
>>         (vect_mask_load_store_stmt): New.
>>         (vect_combine_loop_epilogue): New.
>>         (vect_transform_loop): Support combined apilogue.
>>
>>
>> diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
>> index 78ddcc2..73c2966 100644
>> --- a/gcc/dbgcnt.def
>> +++ b/gcc/dbgcnt.def
>> @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
>>  DEBUG_COUNTER (tree_sra)
>>  DEBUG_COUNTER (vect_loop)
>>  DEBUG_COUNTER (vect_slp)
>> +DEBUG_COUNTER (vect_tail_combine)
>>  DEBUG_COUNTER (dom_unreachable_edges)
>> diff --git a/gcc/params.def b/gcc/params.def
>> index 62a1e40..98d6c5a 100644
>> --- a/gcc/params.def
>> +++ b/gcc/params.def
>> @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
>>           "Maximum number of may-defs visited when devirtualizing "
>>           "speculatively", 50, 0, 0)
>>
>> +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
>> +         "vect-cost-increase-combine-threshold",
>> +         "Cost increase threshold to mask main loop for epilogue.",
>> +         10, 0, 300)
>> +
>>  /*
>>
>>  Local variables:
>> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>> index f275933..c5bdeb9 100644
>> --- a/gcc/tree-vect-data-refs.c
>> +++ b/gcc/tree-vect-data-refs.c
>> @@ -4000,6 +4000,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
>>    case vect_scalar_var:
>>      prefix = "stmp";
>>      break;
>> +  case vect_mask_var:
>> +    prefix = "mask";
>> +    break;
>>    case vect_pointer_var:
>>      prefix = "vectp";
>>      break;
>> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> index fab5879..b3c0668 100644
>> --- a/gcc/tree-vect-loop-manip.c
>> +++ b/gcc/tree-vect-loop-manip.c
>> @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>>    int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>>    int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>>    int probability_of_second_loop;
>> +  bool skip_second_after_first = false;
>>
>>    if (!slpeel_can_duplicate_loop_p (loop, e))
>>      return NULL;
>> @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>>      {
>>        loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
>>        tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
>> -      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
>> +      unsigned limit = 0;
>> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +       skip_second_after_first = true;
>> +      else
>> +       limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
>>        if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>>         limit = limit + 1;
>>        if (check_profitability
>> @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>>    bb_between_loops = new_exit_bb;
>>    bb_after_second_loop = split_edge (single_exit (second_loop));
>>
>> -  pre_condition =
>> -       fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
>> -  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
>> -                                  bb_after_second_loop, bb_before_first_loop,
>> -                                 inverse_probability (second_guard_probability));
>> +  if (skip_second_after_first)
>> +    /* We can just redirect edge from bb_between_loops to
>> +       bb_after_second_loop but we have many code assuming
>> +       we have a guard after the first loop.  So just make
>> +       always taken condtion.  */
>> +    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
>> +                                integer_zero_node);
>> +  else
>> +    pre_condition =
>> +      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
>> +  skip_e
>> +    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
>> +                            bb_after_second_loop, bb_before_first_loop,
>> +                            inverse_probability (second_guard_probability));
>>    scale_loop_profile (second_loop, probability_of_second_loop, bound2);
>>    slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>>                                       second_loop == new_loop, &new_exit_bb);
>> @@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>>    basic_block preheader;
>>    int loop_num;
>>    int max_iter;
>> +  int bound2;
>>    tree cond_expr = NULL_TREE;
>>    gimple_seq cond_expr_stmt_list = NULL;
>> +  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>>
>>    if (dump_enabled_p ())
>>      dump_printf_loc (MSG_NOTE, vect_location,
>> @@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>>
>>    loop_num  = loop->num;
>>
>> +  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>    new_loop
>>      = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
>>                                      &ratio_mult_vf_name, ni_name, false,
>>                                      th, check_profitability,
>>                                      cond_expr, cond_expr_stmt_list,
>> -                                    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>> +                                    0, bound2);
>>    gcc_assert (new_loop);
>>    gcc_assert (loop_num == loop->num);
>>    slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
>> @@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>>    max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>               ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>>               : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
>> -  if (check_profitability)
>> +  /* When epilogue is combined only profitability
>> +     treshold matters.  */
>> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +    max_iter = (int) th - 1;
>> +  else if (check_profitability)
>>      max_iter = MAX (max_iter, (int) th - 1);
>>    record_niter_bound (new_loop, max_iter, false, true);
>>    dump_printf (MSG_NOTE,
>> @@ -2036,7 +2057,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
>>                                    bound, 0);
>>
>>    gcc_assert (new_loop);
>> -  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
>> +  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
>>    /* For vectorization factor N, we need to copy at most N-1 values
>>       for alignment and this means N-2 loopback edge executions.  */
>>    max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 31360d3..1a80c42 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "gimple-fold.h"
>>  #include "cgraph.h"
>>  #include "tree-if-conv.h"
>> +#include "alias.h"
>> +#include "dbgcnt.h"
>>
>>  /* Loop Vectorization Pass.
>>
>> @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3.  If not see
>>     http://gcc.gnu.org/projects/tree-ssa/vectorization.html
>>  */
>>
>> -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
>> +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
>> +                                               int *);
>>
>>  /* Function vect_determine_vectorization_factor
>>
>> @@ -2304,8 +2307,10 @@ start_over:
>>
>>    /* Analyze cost.  Decide if worth while to vectorize.  */
>>    int min_profitable_estimate, min_profitable_iters;
>> +  int min_profitable_combine_iters;
>>    vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
>> -                                     &min_profitable_estimate);
>> +                                     &min_profitable_estimate,
>> +                                     &min_profitable_combine_iters);
>>
>>    if (min_profitable_iters < 0)
>>      {
>> @@ -2412,6 +2417,52 @@ start_over:
>>    gcc_assert (vectorization_factor
>>               == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>>
>> +  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
>> +    {
>> +      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
>> +      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
>> +    }
>> +  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
>> +          && min_profitable_combine_iters >= 0)
>> +    {
>> +      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>> +           && (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> +               >= (unsigned) min_profitable_combine_iters))
>> +          || estimated_niter == -1
>> +          || estimated_niter >= min_profitable_combine_iters)
>> +         && dbg_cnt (vect_tail_combine))
>> +       {
>> +         LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
>> +         LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
>> +
>> +         dump_printf_loc (MSG_NOTE, vect_location,
>> +                          "Decided to combine loop with its epilogue.\n");
>> +
>> +         /* We need to adjust profitability check if combine
>> +            epilogue considering additional vector iteration
>> +            and profitable combine iterations.  */
>> +         if ((int)(min_profitable_combine_iters + vectorization_factor)
>> +             > min_scalar_loop_bound)
>> +           {
>> +             LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
>> +               = (unsigned) min_profitable_combine_iters;
>> +             if (dump_enabled_p ())
>> +               dump_printf_loc (MSG_NOTE, vect_location,
>> +                                "Updated runtime profitability treshold: %d\n",
>> +                                min_profitable_combine_iters);
>> +
>> +           }
>> +       }
>> +      else
>> +       {
>> +         if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
>> +           dump_printf_loc (MSG_NOTE, vect_location,
>> +                            "Not combined loop with epilogue: iterations "
>> +                            "count is too low (threshold is %d).\n",
>> +                            min_profitable_combine_iters);
>> +       }
>> +    }
>> +
>>    /* Ok to vectorize!  */
>>    return true;
>>
>> @@ -3381,7 +3432,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
>>  static void
>>  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>>                                     int *ret_min_profitable_niters,
>> -                                   int *ret_min_profitable_estimate)
>> +                                   int *ret_min_profitable_estimate,
>> +                                   int *ret_min_profitable_combine_niters)
>>  {
>>    int min_profitable_iters;
>>    int min_profitable_estimate;
>> @@ -3625,6 +3677,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>>                     vec_prologue_cost);
>>        dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
>>                     vec_epilogue_cost);
>> +      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
>> +                   masking_prologue_cost);
>> +      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
>> +                   masking_inside_cost);
>>        dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
>>                     scalar_single_iter_cost);
>>        dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
>> @@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>>                      min_profitable_estimate);
>>
>>    *ret_min_profitable_estimate = min_profitable_estimate;
>> +
>> +  *ret_min_profitable_combine_niters = -1;
>> +
>> +  /* Don't try to vectorize epilogue of epilogue.  */
>> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
>> +    return;
>> +
>> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> +    {
>> +      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
>> +       {
>> +         if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
>> +           *ret_min_profitable_combine_niters = 0;
>> +         return;
>> +       }
>> +
>> +      unsigned combine_treshold
>> +       = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
>> +      /* Calculate profitability combining epilogue with the main loop.
>> +        We have a threshold for inside cost overhead (not applied
>> +        for low trip count loop case):
>> +        MIC * 100 < VIC * CT
>> +        Masked iteration should be better than a scalar prologue:
>> +        MIC + VIC < SIC * epilogue_niters  */
>> +      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
>> +       {
>> +         if (dump_enabled_p ())
>> +           {
>> +             dump_printf_loc (MSG_NOTE, vect_location,
>> +                              "Combining loop with epilogue is not "
>> +                              "profitable.\n");
>> +             dump_printf_loc (MSG_NOTE, vect_location,
>> +                              "  Combining overhead %d%% exceeds "
>> +                              "treshold %d%%.\n",
>> +                              masking_inside_cost * 100 / vec_inside_cost,
>> +                              combine_treshold);
>> +           }
>> +         *ret_min_profitable_combine_niters = -1;
>> +       }
>> +      else if ((int)(masking_inside_cost + vec_inside_cost)
>> +              >= scalar_single_iter_cost * peel_iters_epilogue)
>> +       {
>> +         if (dump_enabled_p ())
>> +           {
>> +             dump_printf_loc (MSG_NOTE, vect_location,
>> +                              "Combining loop with epilogue is not "
>> +                              "profitable.\n");
>> +             dump_printf_loc (MSG_NOTE, vect_location,
>> +                              "  Scalar epilogue is faster than a "
>> +                              "single masked iteration.\n");
>> +           }
>> +         *ret_min_profitable_combine_niters = -1;
>> +       }
>> +      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
>> +       {
>> +         int inside_cost = vec_inside_cost + masking_inside_cost;
>> +         int outside_cost = vec_outside_cost + masking_prologue_cost;
>> +          int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
>> +                                 - inside_cost * peel_iters_prologue
>> +                                  - inside_cost * peel_iters_epilogue)
>> +                                 / ((scalar_single_iter_cost * vf)
>> +                                    - inside_cost);
>> +
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_NOTE, vect_location,
>> +                            "Combinig loop with epilogue "
>> +                            "pofitability treshold = %d\n",
>> +                            profitable_iters);
>> +         *ret_min_profitable_combine_niters = profitable_iters;
>> +       }
>> +    }
>>  }
>>
>>  /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
>> @@ -6843,20 +6970,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>>    else
>>      ni_minus_gap_name = ni_name;
>>
>> -  /* Create: ratio = ni >> log2(vf) */
>> -  /* ???  As we have ni == number of latch executions + 1, ni could
>> -     have overflown to zero.  So avoid computing ratio based on ni
>> -     but compute it using the fact that we know ratio will be at least
>> -     one, thus via (ni - vf) >> log2(vf) + 1.  */
>> -  ratio_name
>> -    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
>> -                  fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
>> -                               fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
>> -                                            ni_minus_gap_name,
>> -                                            build_int_cst
>> -                                              (TREE_TYPE (ni_name), vf)),
>> -                               log_vf),
>> -                  build_int_cst (TREE_TYPE (ni_name), 1));
>> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +    {
>> +      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
>> +      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>> +      ratio_name
>> +       = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
>> +                      fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
>> +                                   ni_name,
>> +                                   build_int_cst (TREE_TYPE (ni_name),
>> +                                                  vf - 1)),
>> +                      log_vf);
>> +    }
>> +  else
>> +    {
>> +      /* Create: ratio = ni >> log2(vf) */
>> +      /* ???  As we have ni == number of latch executions + 1, ni could
>> +        have overflown to zero.  So avoid computing ratio based on ni
>> +        but compute it using the fact that we know ratio will be at least
>> +        one, thus via (ni - vf) >> log2(vf) + 1.  */
>> +      ratio_name
>> +        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
>> +                      fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
>> +                                   fold_build2 (MINUS_EXPR,
>> +                                                TREE_TYPE (ni_name),
>> +                                                ni_minus_gap_name,
>> +                                                build_int_cst
>> +                                                  (TREE_TYPE (ni_name), vf)),
>> +                                   log_vf),
>> +                      build_int_cst (TREE_TYPE (ni_name), 1));
>> +    }
>> +
>>    if (!is_gimple_val (ratio_name))
>>      {
>>        var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
>> @@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>>    return;
>>  }
>>
>> +/* Function vect_gen_ivs_for_masking.
>> +
>> +   Create IVs to be used for masks computation to mask loop described
>> +   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
>> +
>> +   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
>> +   vectors, in this case IVS's elements with lower index hold IV with
>> +   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
>> +   vectorization factor.  */
>> +
>> +static void
>> +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +  tree vectype = vect_get_masking_iv_type (loop_vinfo);
>> +  tree type = TREE_TYPE (vectype);
>> +  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
>> +  int ncopies  = vf / elems;
>> +  int i, k;
>> +  tree iv, init_val, step_val;
>> +  bool insert_after;
>> +  gimple_stmt_iterator gsi;
>> +  tree *vtemp;
>> +
>> +  /* Create {VF, ..., VF} vector constant.  */
>> +  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
>> +
>> +  vtemp = XALLOCAVEC (tree, vf);
>> +  for (i = 0; i < ncopies; i++)
>> +    {
>> +      /* Create initial IV value.  */
>> +      for (k = 0; k < vf; k++)
>> +       vtemp[k] = build_int_cst (type, k + i * elems);
>> +      init_val = build_vector (vectype, vtemp);
>> +
>> +      /* Create an inductive variable including phi node.  */
>> +      standard_iv_increment_position (loop, &gsi, &insert_after);
>> +      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
>> +                &iv, NULL);
>> +      ivs->safe_push (iv);
>> +    }
>> +}
>> +
>> +/* Function vect_get_mask_index_for_elems.
>> +
>> +   A helper function to access masks vector.  See vect_gen_loop_masks
>> +   for masks vector sorting description.  Return index of the first
>> +   mask having MASK_ELEMS elements.  */
>> +
>> +static inline unsigned
>> +vect_get_mask_index_for_elems (unsigned mask_elems)
>> +{
>> +  return current_vector_size / mask_elems - 1;
>> +}
>> +
>> +/* Function vect_get_mask_index_for_type.
>> +
>> +   A helper function to access masks vector.  See vect_gen_loop_masks
>> +   for masks vector sorting description.  Return index of the first
>> +   mask appropriate for VECTYPE.  */
>> +
>> +static inline unsigned
>> +vect_get_mask_index_for_type (tree vectype)
>> +{
>> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
>> +  return vect_get_mask_index_for_elems (elems);
>> +}
>> +
>> +/* Function vect_gen_loop_masks.
>> +
>> +   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
>> +   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
>> +   into MASKS vector.
>> +
>> +   Index of a mask in a vector is computed according to a number
>> +   of masks's elements.  Masks are sorted by number of its elements
>> +   in descending order.  Index 0 is used to access a mask with
>> +   current_vector_size elements.  Among masks with the same number
>> +   of elements the one with lower index is used to mask iterations
>> +   with smaller iteration counter.  Note that you may get NULL elements
>> +   for masks which are not required.  Use vect_get_mask_index_for_elems
>> +   or vect_get_mask_index_for_type to access resulting vector.  */
>> +
>> +static void
>> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +  edge pe = loop_preheader_edge (loop);
>> +  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
>> +  unsigned min_mask_elems, max_mask_elems, nmasks;
>> +  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
>> +  auto_vec<tree> ivs;
>> +  tree vectype, mask_type;
>> +  tree vec_niters, vec_niters_val, mask;
>> +  gimple *stmt;
>> +  basic_block bb;
>> +  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
>> +  unsigned vec_size;
>> +
>> +  /* Create required IVs.  */
>> +  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
>> +  vectype = TREE_TYPE (ivs[0]);
>> +
>> +  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
>> +  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
>> +
>> +  /* Get a proper niter to build a vector.  */
>> +  if (!is_gimple_val (niters))
>> +    {
>> +      gimple_seq seq = NULL;
>> +      niters = force_gimple_operand (niters, &seq, true, NULL);
>> +      gsi_insert_seq_on_edge_immediate (pe, seq);
>> +    }
>> +  /* We may need a type cast in case niter has a too small type
>> +     for generated IVs.  */
>> +  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
>> +    {
>> +      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
>> +                                           NULL, "niters");
>> +      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
>> +      bb = gsi_insert_on_edge_immediate (pe, stmt);
>> +      gcc_assert (!bb);
>> +      niters = new_niters;
>> +    }
>> +  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
>> +  vec_niters_val = build_vector_from_val (vectype, niters);
>> +  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
>> +  stmt = gimple_build_assign (vec_niters, vec_niters_val);
>> +  bb = gsi_insert_on_edge_immediate (pe, stmt);
>> +  gcc_assert (!bb);
>> +
>> +  /* Determine which masks we need to compute and how many.  */
>> +  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
>> +  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
>> +  masks->safe_grow_cleared (nmasks);
>> +
>> +  /* Now create base masks through comparison IV < VEC_NITERS.  */
>> +  mask_type = build_same_sized_truth_vector_type (vectype);
>> +  cur_mask = vect_get_mask_index_for_elems (iv_elems);
>> +  for (unsigned i = 0; i < ivs.length (); i++)
>> +    {
>> +      tree iv = ivs[i];
>> +      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
>> +      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
>> +      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
>> +      (*masks)[cur_mask++] = mask;
>> +    }
>> +
>> +  /* Create narrowed masks.  */
>> +  cur_mask_elems = iv_elems;
>> +  nmasks = ivs.length ();
>> +  while (cur_mask_elems < max_mask_elems)
>> +    {
>> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
>> +
>> +      cur_mask_elems <<= 1;
>> +      nmasks >>= 1;
>> +
>> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
>> +
>> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
>> +
>> +      for (unsigned i = 0; i < nmasks; i++)
>> +       {
>> +         tree mask_low = (*masks)[prev_mask++];
>> +         tree mask_hi = (*masks)[prev_mask++];
>> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
>> +         stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
>> +                                     mask_low, mask_hi);
>> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
>> +         (*masks)[cur_mask++] = mask;
>> +       }
>> +    }
>> +
>> +  /* Created widened masks.  */
>> +  cur_mask_elems = iv_elems;
>> +  nmasks = ivs.length ();
>> +  while (cur_mask_elems > min_mask_elems)
>> +    {
>> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
>> +
>> +      cur_mask_elems >>= 1;
>> +      nmasks <<= 1;
>> +
>> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
>> +
>> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
>> +
>> +      for (unsigned i = 0; i < nmasks; i += 2)
>> +       {
>> +         tree orig_mask = (*masks)[prev_mask++];
>> +
>> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
>> +         stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
>> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
>> +         (*masks)[cur_mask++] = mask;
>> +
>> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
>> +         stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
>> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
>> +         (*masks)[cur_mask++] = mask;
>> +       }
>> +    }
>> +}
>> +
>> +/* Function vect_mask_reduction_stmt.
>> +
>> +   Mask given vectorized reduction statement STMT using
>> +   MASK.  In case scalar reduction statement is vectorized
>> +   into several vector statements then PREV holds a
>> +   preceding vetor statement copy for STMT.
>> +
>> +   Masking is performed using VEC_COND_EXPR. E.g.
>> +
>> +   S1: r_1 = r_2 + d_3
>> +
>> +   is transformed into:
>> +
>> +   S1': r_4 = r_2 + d_3
>> +   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
>> +
>> +   Return generated condition statement.  */
>> +
>> +static gimple *
>> +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
>> +{
>> +  gimple_stmt_iterator gsi;
>> +  tree vectype;
>> +  tree lhs, rhs, tmp;
>> +  gimple *new_stmt, *phi;
>> +
>> +  lhs = gimple_assign_lhs (stmt);
>> +  vectype = TREE_TYPE (lhs);
>> +
>> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
>> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
>> +
>> +  /* Find operand RHS defined by PHI node.  */
>> +  rhs = gimple_assign_rhs1 (stmt);
>> +  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
>> +  phi = SSA_NAME_DEF_STMT (rhs);
>> +
>> +  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
>> +    {
>> +      rhs = gimple_assign_rhs2 (stmt);
>> +      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
>> +      phi = SSA_NAME_DEF_STMT (rhs);
>> +      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
>> +    }
>> +
>> +  /* Convert reduction stmt to ordinary assignment to TMP.  */
>> +  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
>> +  gimple_assign_set_lhs (stmt, tmp);
>> +
>> +  /* Create VEC_COND_EXPR and insert it after STMT.  */
>> +  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
>> +  gsi = gsi_for_stmt (stmt);
>> +  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
>> +
>> +  return new_stmt;
>> +}
>> +
>> +/* Function vect_mask_mask_load_store_stmt.
>> +
>> +   Mask given vectorized MASK_LOAD or MASK_STORE statement
>> +   STMT using MASK.  Function replaces a mask used by STMT
>> +   with its conjunction with MASK.  */
>> +
>> +static void
>> +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
>> +{
>> +  gimple *new_stmt;
>> +  tree old_mask, new_mask;
>> +  gimple_stmt_iterator gsi;
>> +
>> +  gsi = gsi_for_stmt (stmt);
>> +  old_mask = gimple_call_arg (stmt, 2);
>> +
>> +  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
>> +
>> +  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
>> +  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
>> +  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
>> +
>> +  gimple_call_set_arg (stmt, 2, new_mask);
>> +  update_stmt (stmt);
>> +}
>> +
>> +
>> +/* Function vect_mask_load_store_stmt.
>> +
>> +   Mask given vectorized load or store statement STMT using
>> +   MASK.  DR is a data reference for a scalar memory access.
>> +   Assignment is transformed into MASK_LOAD or MASK_STORE
>> +   statement.  SI is either an iterator pointing to STMT and
>> +   is to be updated or NULL.  */
>> +
>> +static void
>> +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
>> +                          data_reference *dr, gimple_stmt_iterator *si)
>> +{
>> +  tree mem, val, addr, ptr;
>> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
>> +  unsigned align, misalign;
>> +  tree elem_type = TREE_TYPE (vectype);
>> +  gimple *new_stmt;
>> +
>> +  gcc_assert (!si || gsi_stmt (*si) == stmt);
>> +
>> +  gsi = gsi_for_stmt (stmt);
>> +  if (gimple_store_p (stmt))
>> +    {
>> +      val = gimple_assign_rhs1 (stmt);
>> +      mem = gimple_assign_lhs (stmt);
>> +    }
>> +  else
>> +    {
>> +      val = gimple_assign_lhs (stmt);
>> +      mem = gimple_assign_rhs1 (stmt);
>> +    }
>> +
>> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
>> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
>> +
>> +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
>> +                                  true, NULL_TREE, true,
>> +                                  GSI_SAME_STMT);
>> +
>> +  align = TYPE_ALIGN_UNIT (vectype);
>> +  if (aligned_access_p (dr))
>> +    misalign = 0;
>> +  else if (DR_MISALIGNMENT (dr) == -1)
>> +    {
>> +      align = TYPE_ALIGN_UNIT (elem_type);
>> +      misalign = 0;
>> +    }
>> +  else
>> +    misalign = DR_MISALIGNMENT (dr);
>> +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
>> +  ptr = build_int_cst (reference_alias_ptr_type (mem),
>> +                      misalign ? misalign & -misalign : align);
>> +
>> +  if (gimple_store_p (stmt))
>> +    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
>> +                                          mask, val);
>> +  else
>> +    {
>> +      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
>> +                                            mask);
>> +      gimple_call_set_lhs (new_stmt, val);
>> +    }
>> +  gsi_replace (si ? si : &gsi, new_stmt, false);
>> +}
>> +
>> +/* Function vect_combine_loop_epilogue.
>> +
>> +   Combine loop epilogue with the main vectorized body.  It requires
>> +   masking of memory accesses and reductions.  */
>> +
>> +static void
>> +vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
>> +  unsigned mask_no;
>> +  auto_vec<tree> masks;
>> +
>> +  vect_gen_loop_masks (loop_vinfo, &masks);
>> +
>> +  /* Convert reduction statements if any.  */
>> +  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
>> +    {
>> +      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
>> +      gimple *prev_stmt = NULL;
>> +      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>> +
>> +      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
>> +
>> +      stmt = STMT_VINFO_VEC_STMT (stmt_info);
>> +      while (stmt)
>> +       {
>> +         prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
>> +                                               prev_stmt);
>> +         stmt_info = vinfo_for_stmt (stmt);
>> +         stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
>> +       }
>> +    }
>> +
>> +  /* Scan all loop statements to convert vector load/store including masked
>> +     form.  */
>> +  for (unsigned i = 0; i < loop->num_nodes; i++)
>> +    {
>> +      basic_block bb = bbs[i];
>> +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
>> +          !gsi_end_p (si); gsi_next (&si))
>> +       {
>> +         gimple *stmt = gsi_stmt (si);
>> +         stmt_vec_info stmt_info = NULL;
>> +         tree vectype = NULL;
>> +         data_reference *dr;
>> +
>> +         /* Mask load case.  */
>> +         if (is_gimple_call (stmt)
>> +             && gimple_call_internal_p (stmt)
>> +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
>> +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
>> +           {
>> +             stmt_info = vinfo_for_stmt (stmt);
>> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
>> +               continue;
>> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
>> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +           }
>> +         /* Mask store case.  */
>> +         else if (is_gimple_call (stmt)
>> +                  && gimple_call_internal_p (stmt)
>> +                  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
>> +                  && vinfo_for_stmt (stmt)
>> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
>> +           {
>> +             stmt_info = vinfo_for_stmt (stmt);
>> +             vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
>> +           }
>> +         /* Load case.  */
>> +         else if (gimple_assign_load_p (stmt)
>> +                  && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
>> +           {
>> +             stmt_info = vinfo_for_stmt (stmt);
>> +             /* Skip vector loads.  */
>> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
>> +               continue;
>> +             /* Skip invariant loads.  */
>> +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
>> +                                ? STMT_VINFO_DR_STEP (stmt_info)
>> +                                : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
>> +               continue;
>> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
>> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +           }
>> +         /* Store case.  */
>> +         else if (gimple_code (stmt) == GIMPLE_ASSIGN
>> +                  && gimple_store_p (stmt)
>> +                  && vinfo_for_stmt (stmt)
>> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
>> +           {
>> +             stmt_info = vinfo_for_stmt (stmt);
>> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +           }
>> +         else
>> +           continue;
>> +
>> +         /* Skip hoisted out statements.  */
>> +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
>> +           continue;
>> +
>> +         mask_no = vect_get_mask_index_for_type (vectype);
>> +
>> +         dr = STMT_VINFO_DATA_REF (stmt_info);
>> +         while (stmt)
>> +           {
>> +             if (is_gimple_call (stmt))
>> +               vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
>> +             else
>> +               vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
>> +                                          /* Have to update iterator only if
>> +                                             it points to stmt we mask.  */
>> +                                          stmt == gsi_stmt (si) ? &si : NULL);
>> +
>> +             stmt_info = vinfo_for_stmt (stmt);
>> +             stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
>> +           }
>> +       }
>> +    }
>> +
>> +  if (dump_enabled_p ())
>> +    dump_printf_loc (MSG_NOTE, vect_location,
>> +                    "=== Loop epilogue was combined ===\n");
>> +}
>>
>>  /* Function vect_transform_loop.
>>
>> @@ -6927,7 +7550,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>>       run at least the vectorization factor number of times checking
>>       is pointless, too.  */
>>    th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
>> -  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
>> +  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
>> +       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
>> +          && th > 1))
>>        && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>>      {
>>        if (dump_enabled_p ())
>> @@ -6976,12 +7601,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>>      {
>>        tree ratio_mult_vf;
>>        if (!ni_name)
>> -       ni_name = vect_build_loop_niters (loop_vinfo);
>> +       {
>> +         ni_name = vect_build_loop_niters (loop_vinfo);
>> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
>> +       }
>>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
>>                                        &ratio);
>> -      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
>> -                                                ratio_mult_vf, th,
>> -                                                check_profitability);
>> +      /* If epilogue is combined with main loop peeling is not needed.  */
>> +      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
>> +         || check_profitability)
>> +       epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
>> +                                                  ratio_mult_vf, th,
>> +                                                  check_profitability);
>>      }
>>    else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>>      ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
>> @@ -6989,7 +7620,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>>    else
>>      {
>>        if (!ni_name)
>> -       ni_name = vect_build_loop_niters (loop_vinfo);
>> +       {
>> +         ni_name = vect_build_loop_niters (loop_vinfo);
>> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
>> +       }
>>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
>>      }
>>
>> @@ -7243,23 +7877,35 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>>
>>    slpeel_make_loop_iterate_ntimes (loop, ratio);
>>
>> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +    vect_combine_loop_epilogue (loop_vinfo);
>> +
>>    /* Reduce loop iterations by the vectorization factor.  */
>>    scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
>>                       expected_iterations / vectorization_factor);
>>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>        && loop->nb_iterations_upper_bound != 0)
>> -    loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1;
>> -  loop->nb_iterations_upper_bound
>> -    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
>> -                     vectorization_factor) - 1;
>> -
>> +    loop->nb_iterations_upper_bound -= 1;
>> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +    loop->nb_iterations_upper_bound
>> +      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
>> +                     vectorization_factor, UNSIGNED) - 1;
>> +  else
>> +    loop->nb_iterations_upper_bound
>> +      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
>> +                       vectorization_factor) - 1;
>>    if (loop->any_estimate)
>>      {
>> -      loop->nb_iterations_estimate
>> -        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
>> -       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> -          && loop->nb_iterations_estimate != 0)
>> -        loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
>> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
>> +       loop->nb_iterations_estimate
>> +         = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
>> +                         UNSIGNED);
>> +      else
>> +       loop->nb_iterations_estimate
>> +         = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
>> +      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> +         && loop->nb_iterations_estimate != 0)
>> +       loop->nb_iterations_estimate -= 1;
>>      }
>>
>>    if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-16 15:41   ` Ilya Enkovich
@ 2016-06-16 15:51     ` Jeff Law
  2016-06-16 16:03       ` Ilya Enkovich
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-06-16 15:51 UTC (permalink / raw)
  To: Ilya Enkovich, Richard Biener; +Cc: GCC Patches

On 06/16/2016 09:41 AM, Ilya Enkovich wrote:
> 2016-06-15 14:44 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> Hi,
>>>
>>> This patch introduces support for loop epilogue combining.  This includes
>>> support in cost estimation and all required changes required to mask
>>> vectorized loop.
>>
>> I wonder why you compute a minimum number of iterations to make masking
>> of the vectorized body profitable rather than a maximum number of iterations.
>>
>> I'd say masking the vectorized loop is profitable if niter/vf *
>> masking-overhead < epilogue-cost.
>> Masking the epilogue is profitable if vectorizing the epilogue with
>> masking is profitable.
>>
>> Am I missing something?
>
> We don't have two versions of vectorized loop.  The choice is between vector
> and scalar loop and in this case minimum number of iterations is what we need.
> Generating two vectorized loop versions would be something new to vectorizer.
What I think Richi is saying is that we have to multiply the cost of the 
masking overhead by the number of iterations of vectorized loop to 
determine the cost of masking -- the more loop iterations we have, the 
greater the cost of masking in the loop becomes and those costs may be 
higher than the normal epilogue sequence.

Jeff



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-16 15:51     ` Jeff Law
@ 2016-06-16 16:03       ` Ilya Enkovich
  0 siblings, 0 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-06-16 16:03 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, GCC Patches

2016-06-16 18:51 GMT+03:00 Jeff Law <law@redhat.com>:
> On 06/16/2016 09:41 AM, Ilya Enkovich wrote:
>>
>> 2016-06-15 14:44 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>
>>> On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich <enkovich.gnu@gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This patch introduces support for loop epilogue combining.  This
>>>> includes
>>>> support in cost estimation and all required changes required to mask
>>>> vectorized loop.
>>>
>>>
>>> I wonder why you compute a minimum number of iterations to make masking
>>> of the vectorized body profitable rather than a maximum number of
>>> iterations.
>>>
>>> I'd say masking the vectorized loop is profitable if niter/vf *
>>> masking-overhead < epilogue-cost.
>>> Masking the epilogue is profitable if vectorizing the epilogue with
>>> masking is profitable.
>>>
>>> Am I missing something?
>>
>>
>> We don't have two versions of vectorized loop.  The choice is between
>> vector
>> and scalar loop and in this case minimum number of iterations is what we
>> need.
>> Generating two vectorized loop versions would be something new to
>> vectorizer.
>
> What I think Richi is saying is that we have to multiply the cost of the
> masking overhead by the number of iterations of vectorized loop to determine
> the cost of masking -- the more loop iterations we have, the greater the
> cost of masking in the loop becomes and those costs may be higher than the
> normal epilogue sequence.

Right.  But we compute that dynamically.  And what do we do when we see overall
masking cost becomes greater than a scalar epilogue cost?  The only case when
this check is useful is when we have vectorized non-combined version of a loop.
The original idea of combining (patches sent by Yuri last year) was to use it
only in cases when masking cost is small enough (and we expect cheap masking
computations are 'hidden' under heavier instructions by scheduler, so we don't
loose performance even for high iterations count).

Dynamically choosing between combined and non-combined versions is
another story.

Thanks,
Ilya

>
> Jeff
>
>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-05-19 19:46 [PATCH, vec-tails 07/10] Support loop epilogue combining Ilya Enkovich
  2016-06-15 11:44 ` Richard Biener
@ 2016-06-16 16:54 ` Jeff Law
  2016-06-28 13:37   ` Ilya Enkovich
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-06-16 16:54 UTC (permalink / raw)
  To: Ilya Enkovich, gcc-patches

On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
> Hi,
>
> This patch introduces support for loop epilogue combining.  This includes
> support in cost estimation and all required changes required to mask
> vectorized loop.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
> 	* dbgcnt.def (vect_tail_combine): New.
> 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> 	epilogue combined with loop body.
> 	(vect_do_peeling_for_loop_bound): Likewise.
> 	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> 	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
> 	arg, compute number of iterations for which loop epilogue combining is
> 	profitable.
> 	(vect_generate_tmps_on_preheader): Support combined apilogue.
> 	(vect_gen_ivs_for_masking): New.
> 	(vect_get_mask_index_for_elems): New.
> 	(vect_get_mask_index_for_type): New.
> 	(vect_gen_loop_masks): New.
> 	(vect_mask_reduction_stmt): New.
> 	(vect_mask_mask_load_store_stmt): New.
> 	(vect_mask_load_store_stmt): New.
> 	(vect_combine_loop_epilogue): New.
> 	(vect_transform_loop): Support combined apilogue.
>
>
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index fab5879..b3c0668 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>    bb_between_loops = new_exit_bb;
>    bb_after_second_loop = split_edge (single_exit (second_loop));
>
> -  pre_condition =
> -	fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> -  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> -                                  bb_after_second_loop, bb_before_first_loop,
> -				  inverse_probability (second_guard_probability));
> +  if (skip_second_after_first)
> +    /* We can just redirect edge from bb_between_loops to
> +       bb_after_second_loop but we have many code assuming
> +       we have a guard after the first loop.  So just make
> +       always taken condtion.  */
> +    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
> +				 integer_zero_node);
This isn't ideal, but I don't think it's that big of an issue.

> @@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    basic_block preheader;
>    int loop_num;
>    int max_iter;
> +  int bound2;
>    tree cond_expr = NULL_TREE;
>    gimple_seq cond_expr_stmt_list = NULL;
> +  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>
>    loop_num  = loop->num;
>
> +  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
Can you document what the TH parameter is to the various routines that 
use it in tree-vect-loop-manip.c?  I realize you didn't add it, but it 
would help anyone looking at this code in the future to know it's the 
threshold of iterations for vectorization without having to find it in 
other function comment headers ;-)

That's pre-approved to go in immediately :-)

> @@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>  	      ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>  	      : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
> -  if (check_profitability)
> +  /* When epilogue is combined only profitability
> +     treshold matters.  */
s/treshold/threshold/



>  static void
>  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  				    int *ret_min_profitable_niters,
> -				    int *ret_min_profitable_estimate)
> +				    int *ret_min_profitable_estimate,
> +				    int *ret_min_profitable_combine_niters)
I'm torn a bit here.  There's all kinds of things missing/incomplete in 
the function comments throughout the vectorizer.  And in some cases, 
like this one, the parameters are largely self-documenting.  But we've 
also got coding standards that we'd like to adhere to.

I don't think it's fair to require you to fix all these issues in the 
vectorizer (though if you wanted to, I'd fully support those an 
independent cleanups).

Perhaps just document LOOP_VINFO with a generic comment about the ret_* 
parameters for this function rather than a comment for each ret_* 
parameter.  Pre-approved for the trunk independent of the vec-tails work.


> @@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  		     min_profitable_estimate);
>
> +
> +      unsigned combine_treshold
> +	= PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
> +      /* Calculate profitability combining epilogue with the main loop.
> +	 We have a threshold for inside cost overhead (not applied
> +	 for low trip count loop case):
> +	 MIC * 100 < VIC * CT
> +	 Masked iteration should be better than a scalar prologue:
> +	 MIC + VIC < SIC * epilogue_niters  */
Can you double-check the whitespace formatting here.  Where does the 
"100" come from and should it be a param?


> @@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    return;
>  }
>

> +
> +/* Function vect_gen_loop_masks.
> +
> +   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
s/desvribed/described/

> +   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> +   into MASKS vector.
> +
> +   Index of a mask in a vector is computed according to a number
> +   of masks's elements.  Masks are sorted by number of its elements
> +   in descending order.  Index 0 is used to access a mask with
> +   current_vector_size elements.  Among masks with the same number
> +   of elements the one with lower index is used to mask iterations
> +   with smaller iteration counter.  Note that you may get NULL elements
> +   for masks which are not required.  Use vect_get_mask_index_for_elems
> +   or vect_get_mask_index_for_type to access resulting vector.  */
> +
> +static void
> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  edge pe = loop_preheader_edge (loop);
> +  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
> +  unsigned min_mask_elems, max_mask_elems, nmasks;
> +  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
> +  auto_vec<tree> ivs;
> +  tree vectype, mask_type;
> +  tree vec_niters, vec_niters_val, mask;
> +  gimple *stmt;
> +  basic_block bb;
> +  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
> +  unsigned vec_size;
> +
> +  /* Create required IVs.  */
> +  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
> +  vectype = TREE_TYPE (ivs[0]);
> +
> +  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
> +  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +  /* Get a proper niter to build a vector.  */
> +  if (!is_gimple_val (niters))
> +    {
> +      gimple_seq seq = NULL;
> +      niters = force_gimple_operand (niters, &seq, true, NULL);
> +      gsi_insert_seq_on_edge_immediate (pe, seq);
> +    }
> +  /* We may need a type cast in case niter has a too small type
> +     for generated IVs.  */
Nit.  There should be vertical whitespace after the close brace and the 
comment for the next logical block of code.  Can you do a scan over the 
patchkit looking for other instances where the vertical whitespace is 
needed.

Generally, if you find that a blob of code needs a comment, then the 
comment and blob of code should have that vertical whitespace to 
visually separate it from everything else.



> +/* Function vect_combine_loop_epilogue.
> +
> +   Combine loop epilogue with the main vectorized body.  It requires
> +   masking of memory accesses and reductions.  */
So you mask reductions, loads & stores.  Is there anything else that we 
might potentially need to mask to combine the loop & epilogue via masking?


I don't see anything particularly worrisome here either -- I have a 
slight concern about correctness issues with only masking loads/stores 
and reductions.  But I will defer to your judgment on whether or not 
there's other stuff that we need to mask to combine the epilogue with 
the loop via masking.

Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-16 16:54 ` Jeff Law
@ 2016-06-28 13:37   ` Ilya Enkovich
  2016-06-28 14:16     ` Ilya Enkovich
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-06-28 13:37 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On 16 Jun 10:54, Jeff Law wrote:
> On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
> >Hi,
> >
> >This patch introduces support for loop epilogue combining.  This includes
> >support in cost estimation and all required changes required to mask
> >vectorized loop.
> >
> >Thanks,
> >Ilya
> >--
> >gcc/
> >
> >2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
> >
> >	* dbgcnt.def (vect_tail_combine): New.
> >	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> >	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> >	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> >	epilogue combined with loop body.
> >	(vect_do_peeling_for_loop_bound): Likewise.
> >	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> >	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
> >	arg, compute number of iterations for which loop epilogue combining is
> >	profitable.
> >	(vect_generate_tmps_on_preheader): Support combined apilogue.
> >	(vect_gen_ivs_for_masking): New.
> >	(vect_get_mask_index_for_elems): New.
> >	(vect_get_mask_index_for_type): New.
> >	(vect_gen_loop_masks): New.
> >	(vect_mask_reduction_stmt): New.
> >	(vect_mask_mask_load_store_stmt): New.
> >	(vect_mask_load_store_stmt): New.
> >	(vect_combine_loop_epilogue): New.
> >	(vect_transform_loop): Support combined apilogue.
> >
> >
> >diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> >index fab5879..b3c0668 100644
> >--- a/gcc/tree-vect-loop-manip.c
> >+++ b/gcc/tree-vect-loop-manip.c
> >@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
> >   bb_between_loops = new_exit_bb;
> >   bb_after_second_loop = split_edge (single_exit (second_loop));
> >
> >-  pre_condition =
> >-	fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> >-  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> >-                                  bb_after_second_loop, bb_before_first_loop,
> >-				  inverse_probability (second_guard_probability));
> >+  if (skip_second_after_first)
> >+    /* We can just redirect edge from bb_between_loops to
> >+       bb_after_second_loop but we have many code assuming
> >+       we have a guard after the first loop.  So just make
> >+       always taken condtion.  */
> >+    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
> >+				 integer_zero_node);
> This isn't ideal, but I don't think it's that big of an issue.
> 
> >@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
> >   basic_block preheader;
> >   int loop_num;
> >   int max_iter;
> >+  int bound2;
> >   tree cond_expr = NULL_TREE;
> >   gimple_seq cond_expr_stmt_list = NULL;
> >+  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
> >
> >   if (dump_enabled_p ())
> >     dump_printf_loc (MSG_NOTE, vect_location,
> >@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
> >
> >   loop_num  = loop->num;
> >
> >+  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> Can you document what the TH parameter is to the various routines that use
> it in tree-vect-loop-manip.c?  I realize you didn't add it, but it would
> help anyone looking at this code in the future to know it's the threshold of
> iterations for vectorization without having to find it in other function
> comment headers ;-)
> 
> That's pre-approved to go in immediately :-)
> 
> >@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
> >   max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > 	      ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
> > 	      : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
> >-  if (check_profitability)
> >+  /* When epilogue is combined only profitability
> >+     treshold matters.  */
> s/treshold/threshold/
> 
> 
> 
> > static void
> > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > 				    int *ret_min_profitable_niters,
> >-				    int *ret_min_profitable_estimate)
> >+				    int *ret_min_profitable_estimate,
> >+				    int *ret_min_profitable_combine_niters)
> I'm torn a bit here.  There's all kinds of things missing/incomplete in the
> function comments throughout the vectorizer.  And in some cases, like this
> one, the parameters are largely self-documenting.  But we've also got coding
> standards that we'd like to adhere to.
> 
> I don't think it's fair to require you to fix all these issues in the
> vectorizer (though if you wanted to, I'd fully support those an independent
> cleanups).
> 
> Perhaps just document LOOP_VINFO with a generic comment about the ret_*
> parameters for this function rather than a comment for each ret_* parameter.
> Pre-approved for the trunk independent of the vec-tails work.
> 
> 
> >@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > 		     min_profitable_estimate);
> >
> >+
> >+      unsigned combine_treshold
> >+	= PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
> >+      /* Calculate profitability combining epilogue with the main loop.
> >+	 We have a threshold for inside cost overhead (not applied
> >+	 for low trip count loop case):
> >+	 MIC * 100 < VIC * CT
> >+	 Masked iteration should be better than a scalar prologue:
> >+	 MIC + VIC < SIC * epilogue_niters  */
> Can you double-check the whitespace formatting here.  Where does the "100"
> come from and should it be a param?

I checked the formatting.  We have 100 here because combine_treshold
is measured in percent.  E.g. value 2 means iterations masking overhead
shouldn't exceed 2% of vector iteration cost.

> 
> 
> >@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
> >   return;
> > }
> >
> 
> >+
> >+/* Function vect_gen_loop_masks.
> >+
> >+   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
> s/desvribed/described/
> 
> >+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> >+   into MASKS vector.
> >+
> >+   Index of a mask in a vector is computed according to a number
> >+   of masks's elements.  Masks are sorted by number of its elements
> >+   in descending order.  Index 0 is used to access a mask with
> >+   current_vector_size elements.  Among masks with the same number
> >+   of elements the one with lower index is used to mask iterations
> >+   with smaller iteration counter.  Note that you may get NULL elements
> >+   for masks which are not required.  Use vect_get_mask_index_for_elems
> >+   or vect_get_mask_index_for_type to access resulting vector.  */
> >+
> >+static void
> >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> >+{
> >+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> >+  edge pe = loop_preheader_edge (loop);
> >+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
> >+  unsigned min_mask_elems, max_mask_elems, nmasks;
> >+  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
> >+  auto_vec<tree> ivs;
> >+  tree vectype, mask_type;
> >+  tree vec_niters, vec_niters_val, mask;
> >+  gimple *stmt;
> >+  basic_block bb;
> >+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
> >+  unsigned vec_size;
> >+
> >+  /* Create required IVs.  */
> >+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
> >+  vectype = TREE_TYPE (ivs[0]);
> >+
> >+  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
> >+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
> >+
> >+  /* Get a proper niter to build a vector.  */
> >+  if (!is_gimple_val (niters))
> >+    {
> >+      gimple_seq seq = NULL;
> >+      niters = force_gimple_operand (niters, &seq, true, NULL);
> >+      gsi_insert_seq_on_edge_immediate (pe, seq);
> >+    }
> >+  /* We may need a type cast in case niter has a too small type
> >+     for generated IVs.  */
> Nit.  There should be vertical whitespace after the close brace and the
> comment for the next logical block of code.  Can you do a scan over the
> patchkit looking for other instances where the vertical whitespace is
> needed.
> 
> Generally, if you find that a blob of code needs a comment, then the comment
> and blob of code should have that vertical whitespace to visually separate
> it from everything else.
> 
> 
> 
> >+/* Function vect_combine_loop_epilogue.
> >+
> >+   Combine loop epilogue with the main vectorized body.  It requires
> >+   masking of memory accesses and reductions.  */
> So you mask reductions, loads & stores.  Is there anything else that we
> might potentially need to mask to combine the loop & epilogue via masking?
> 
> 
> I don't see anything particularly worrisome here either -- I have a slight
> concern about correctness issues with only masking loads/stores and
> reductions.  But I will defer to your judgment on whether or not there's
> other stuff that we need to mask to combine the epilogue with the loop via
> masking.

We have to mask operations which may cause errors if executed speculatively.
For others we just ignore produced result.  So we don't truly mask reductions
but fix-up their results.  I assume memory accesses are only ones we have to
truly mask (plus non-const calls which are rejected now).  For signalling
arithmetic I assumed we just don't vectorize it.

Basically we should act similar to if-conversion.  I'll check if it has
restrictions I miss.

> 
> Jeff

Here is an updated patch version.

Thanks,
Ilya
--
gcc/

2016-05-28  Ilya Enkovich  <ilya.enkovich@intel.com>

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): LIkewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
	arg, compute number of iterations for which loop epilogue combining is
	profitable.
	(vect_generate_tmps_on_preheader): Support combined apilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined apilogue.


diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index 78ddcc2..73c2966 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
 DEBUG_COUNTER (tree_sra)
 DEBUG_COUNTER (vect_loop)
 DEBUG_COUNTER (vect_slp)
+DEBUG_COUNTER (vect_tail_combine)
 DEBUG_COUNTER (dom_unreachable_edges)
diff --git a/gcc/params.def b/gcc/params.def
index 62a1e40..98d6c5a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
 	  "Maximum number of may-defs visited when devirtualizing "
 	  "speculatively", 50, 0, 0)
 
+DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
+	  "vect-cost-increase-combine-threshold",
+	  "Cost increase threshold to mask main loop for epilogue.",
+	  10, 0, 300)
+
 /*
 
 Local variables:
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index a902a50..26e0cc1 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4007,6 +4007,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
   case vect_scalar_var:
     prefix = "stmp";
     break;
+  case vect_mask_var:
+    prefix = "mask";
+    break;
   case vect_pointer_var:
     prefix = "vectp";
     break;
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index c26aa1d..7403686 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int probability_of_second_loop;
+  bool skip_second_after_first = false;
 
   if (!slpeel_can_duplicate_loop_p (loop, e))
     return NULL;
@@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
     {
       loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
       tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
-      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
+      unsigned limit = 0;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	skip_second_after_first = true;
+      else
+	limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	limit = limit + 1;
       if (check_profitability
@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   bb_between_loops = new_exit_bb;
   bb_after_second_loop = split_edge (single_exit (second_loop));
 
-  pre_condition =
-	fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
-  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
-                                  bb_after_second_loop, bb_before_first_loop,
-				  inverse_probability (second_guard_probability));
+  if (skip_second_after_first)
+    /* We can just redirect edge from bb_between_loops to
+       bb_after_second_loop but we have many code assuming
+       we have a guard after the first loop.  So just make
+       always taken condtion.  */
+    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
+				 integer_zero_node);
+  else
+    pre_condition =
+      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
+  skip_e
+    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
+			     bb_after_second_loop, bb_before_first_loop,
+			     inverse_probability (second_guard_probability));
   scale_loop_profile (second_loop, probability_of_second_loop, bound2);
   slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
                                      second_loop == new_loop, &new_exit_bb);
@@ -1762,8 +1776,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   basic_block preheader;
   int loop_num;
   int max_iter;
+  int bound2;
   tree cond_expr = NULL_TREE;
   gimple_seq cond_expr_stmt_list = NULL;
+  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -1773,12 +1789,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
 
   loop_num  = loop->num;
 
+  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   new_loop
     = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
 				     &ratio_mult_vf_name, ni_name, false,
 				     th, check_profitability,
 				     cond_expr, cond_expr_stmt_list,
-				     0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+				     0, bound2);
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
   slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
@@ -1807,7 +1824,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
 	      ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
 	      : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
-  if (check_profitability)
+
+  /* When epilogue is combined only profitability
+     threshold matters.  */
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    max_iter = (int) th - 1;
+  else if (check_profitability)
     max_iter = MAX (max_iter, (int) th - 1);
   record_niter_bound (new_loop, max_iter, false, true);
   dump_printf (MSG_NOTE,
@@ -2044,7 +2066,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
 				   bound, 0);
 
   gcc_assert (new_loop);
-  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
+  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
   /* For vectorization factor N, we need to copy at most N-1 values 
      for alignment and this means N-2 loopback edge executions.  */
   max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 41b9380..08fad82 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "cgraph.h"
 #include "tree-if-conv.h"
+#include "alias.h"
+#include "dbgcnt.h"
 
 /* Loop Vectorization Pass.
 
@@ -149,7 +151,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						int *);
 
 /* Function vect_determine_vectorization_factor
 
@@ -2310,8 +2313,10 @@ start_over:
 
   /* Analyze cost.  Decide if worth while to vectorize.  */
   int min_profitable_estimate, min_profitable_iters;
+  int min_profitable_combine_iters;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      &min_profitable_combine_iters);
 
   if (min_profitable_iters < 0)
     {
@@ -2420,6 +2425,52 @@ start_over:
   gcc_assert (vectorization_factor
 	      == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
 
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+    {
+      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
+    }
+  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+	   && min_profitable_combine_iters >= 0)
+    {
+      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	    && (LOOP_VINFO_INT_NITERS (loop_vinfo)
+		>= (unsigned) min_profitable_combine_iters))
+	   || estimated_niter == -1
+	   || estimated_niter >= min_profitable_combine_iters)
+	  && dbg_cnt (vect_tail_combine))
+	{
+	  LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+	  LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
+
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Decided to combine loop with its epilogue.\n");
+
+	  /* We need to adjust profitability check if combine
+	     epilogue considering additional vector iteration
+	     and profitable combine iterations.  */
+	  if ((int)(min_profitable_combine_iters + vectorization_factor)
+	      > min_scalar_loop_bound)
+	    {
+	      LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
+		= (unsigned) min_profitable_combine_iters;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Updated runtime profitability treshold: %d\n",
+				 min_profitable_combine_iters);
+
+	    }
+	}
+      else
+	{
+	  if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Not combined loop with epilogue: iterations "
+			     "count is too low (threshold is %d).\n",
+			     min_profitable_combine_iters);
+	}
+    }
+
   /* Ok to vectorize!  */
   return true;
 
@@ -3392,12 +3443,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
    profitability check.
 
    *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used
-   for static check against estimated number of iterations.  */
+   for static check against estimated number of iterations.
+
+   *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold
+   of iterations for vectorization with combined loop epilogue.  -1 means
+   combining is not profitable.  Value may be used fo dynamic profitability
+   check.  */
 
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    int *ret_min_profitable_combine_niters)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -3641,6 +3698,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
                    vec_prologue_cost);
       dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
                    vec_epilogue_cost);
+      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
+                   masking_prologue_cost);
+      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
+                   masking_inside_cost);
       dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
                    scalar_single_iter_cost);
       dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
@@ -3744,6 +3805,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		     min_profitable_estimate);
 
   *ret_min_profitable_estimate = min_profitable_estimate;
+
+  *ret_min_profitable_combine_niters = -1;
+
+  /* Don't try to vectorize epilogue of epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    return;
+
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
+	{
+	  if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	    *ret_min_profitable_combine_niters = 0;
+	  return;
+	}
+
+      unsigned combine_treshold
+	= PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
+      /* Calculate profitability combining epilogue with the main loop.
+	 We have a threshold for inside cost overhead (not applied
+	 for low trip count loop case):
+	 MIC * 100 < VIC * CT
+	 Masked iteration should be better than a scalar prologue:
+	 MIC + VIC < SIC * epilogue_niters  */
+      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Combining overhead %d%% exceeds "
+			       "treshold %d%%.\n",
+			       masking_inside_cost * 100 / vec_inside_cost,
+			       combine_treshold);
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if ((int)(masking_inside_cost + vec_inside_cost)
+	       >= scalar_single_iter_cost * peel_iters_epilogue)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Scalar epilogue is faster than a "
+			       "single masked iteration.\n");
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	{
+	  int inside_cost = vec_inside_cost + masking_inside_cost;
+	  int outside_cost = vec_outside_cost + masking_prologue_cost;
+	  int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
+				  - inside_cost * peel_iters_prologue
+				  - inside_cost * peel_iters_epilogue)
+				 / ((scalar_single_iter_cost * vf)
+				    - inside_cost);
+
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Combinig loop with epilogue "
+			     "pofitability treshold = %d\n",
+			     profitable_iters);
+	  *ret_min_profitable_combine_niters = profitable_iters;
+	}
+    }
 }
 
 /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
@@ -6852,20 +6984,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   else
     ni_minus_gap_name = ni_name;
 
-  /* Create: ratio = ni >> log2(vf) */
-  /* ???  As we have ni == number of latch executions + 1, ni could
-     have overflown to zero.  So avoid computing ratio based on ni
-     but compute it using the fact that we know ratio will be at least
-     one, thus via (ni - vf) >> log2(vf) + 1.  */
-  ratio_name
-    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
-		   fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
-				fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
-					     ni_minus_gap_name,
-					     build_int_cst
-					       (TREE_TYPE (ni_name), vf)),
-				log_vf),
-		   build_int_cst (TREE_TYPE (ni_name), 1));
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    {
+      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
+      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+      ratio_name
+	= fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+				    ni_name,
+				    build_int_cst (TREE_TYPE (ni_name),
+						   vf - 1)),
+		       log_vf);
+    }
+  else
+    {
+      /* Create: ratio = ni >> log2(vf) */
+      /* ???  As we have ni == number of latch executions + 1, ni could
+	 have overflown to zero.  So avoid computing ratio based on ni
+	 but compute it using the fact that we know ratio will be at least
+	 one, thus via (ni - vf) >> log2(vf) + 1.  */
+      ratio_name
+        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+				    fold_build2 (MINUS_EXPR,
+						 TREE_TYPE (ni_name),
+						 ni_minus_gap_name,
+						 build_int_cst
+						   (TREE_TYPE (ni_name), vf)),
+				    log_vf),
+		       build_int_cst (TREE_TYPE (ni_name), 1));
+    }
+
   if (!is_gimple_val (ratio_name))
     {
       var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
@@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   return;
 }
 
+/* Function vect_gen_ivs_for_masking.
+
+   Create IVs to be used for masks computation to mask loop described
+   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
+
+   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
+   vectors, in this case IVS's elements with lower index hold IV with
+   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
+   vectorization factor.  */
+
+static void
+vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = vect_get_masking_iv_type (loop_vinfo);
+  tree type = TREE_TYPE (vectype);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies  = vf / elems;
+  int i, k;
+  tree iv, init_val, step_val;
+  bool insert_after;
+  gimple_stmt_iterator gsi;
+  tree *vtemp;
+
+  /* Create {VF, ..., VF} vector constant.  */
+  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
+
+  vtemp = XALLOCAVEC (tree, vf);
+  for (i = 0; i < ncopies; i++)
+    {
+      /* Create initial IV value.  */
+      for (k = 0; k < vf; k++)
+	vtemp[k] = build_int_cst (type, k + i * elems);
+      init_val = build_vector (vectype, vtemp);
+
+      /* Create an inductive variable including phi node.  */
+      standard_iv_increment_position (loop, &gsi, &insert_after);
+      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
+		 &iv, NULL);
+      ivs->safe_push (iv);
+    }
+}
+
+/* Function vect_get_mask_index_for_elems.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask having MASK_ELEMS elements.  */
+
+static inline unsigned
+vect_get_mask_index_for_elems (unsigned mask_elems)
+{
+  return current_vector_size / mask_elems - 1;
+}
+
+/* Function vect_get_mask_index_for_type.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask appropriate for VECTYPE.  */
+
+static inline unsigned
+vect_get_mask_index_for_type (tree vectype)
+{
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  return vect_get_mask_index_for_elems (elems);
+}
+
+/* Function vect_gen_loop_masks.
+
+   Create masks to mask a loop described by LOOP_VINFO.  Masks
+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
+   into MASKS vector.
+
+   Index of a mask in a vector is computed according to a number
+   of masks's elements.  Masks are sorted by number of its elements
+   in descending order.  Index 0 is used to access a mask with
+   current_vector_size elements.  Among masks with the same number
+   of elements the one with lower index is used to mask iterations
+   with smaller iteration counter.  Note that you may get NULL elements
+   for masks which are not required.  Use vect_get_mask_index_for_elems
+   or vect_get_mask_index_for_type to access resulting vector.  */
+
+static void
+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
+  unsigned min_mask_elems, max_mask_elems, nmasks;
+  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
+  auto_vec<tree> ivs;
+  tree vectype, mask_type;
+  tree vec_niters, vec_niters_val, mask;
+  gimple *stmt;
+  basic_block bb;
+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+  unsigned vec_size;
+
+  /* Create required IVs.  */
+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
+  vectype = TREE_TYPE (ivs[0]);
+
+  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
+
+  /* Get a proper niter to build a vector.  */
+  if (!is_gimple_val (niters))
+    {
+      gimple_seq seq = NULL;
+      niters = force_gimple_operand (niters, &seq, true, NULL);
+      gsi_insert_seq_on_edge_immediate (pe, seq);
+    }
+
+  /* We may need a type cast in case niter has a too small type
+     for generated IVs.  */
+  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
+    {
+      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
+					    NULL, "niters");
+      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
+      bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!bb);
+      niters = new_niters;
+    }
+
+  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
+  vec_niters_val = build_vector_from_val (vectype, niters);
+  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
+  stmt = gimple_build_assign (vec_niters, vec_niters_val);
+  bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!bb);
+
+  /* Determine which masks we need to compute and how many.  */
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
+  masks->safe_grow_cleared (nmasks);
+
+  /* Now create base masks through comparison IV < VEC_NITERS.  */
+  mask_type = build_same_sized_truth_vector_type (vectype);
+  cur_mask = vect_get_mask_index_for_elems (iv_elems);
+  for (unsigned i = 0; i < ivs.length (); i++)
+    {
+      tree iv = ivs[i];
+      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      (*masks)[cur_mask++] = mask;
+    }
+
+  /* Create narrowed masks.  */
+  cur_mask_elems = iv_elems;
+  nmasks = ivs.length ();
+  while (cur_mask_elems < max_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems <<= 1;
+      nmasks >>= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i++)
+	{
+	  tree mask_low = (*masks)[prev_mask++];
+	  tree mask_hi = (*masks)[prev_mask++];
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
+				      mask_low, mask_hi);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+
+  /* Created widened masks.  */
+  cur_mask_elems = iv_elems;
+  nmasks = ivs.length ();
+  while (cur_mask_elems > min_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems >>= 1;
+      nmasks <<= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i += 2)
+	{
+	  tree orig_mask = (*masks)[prev_mask++];
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_mask_reduction_stmt.
+
+   Mask given vectorized reduction statement STMT using
+   MASK.  In case scalar reduction statement is vectorized
+   into several vector statements then PREV holds a
+   preceding vetor statement copy for STMT.
+
+   Masking is performed using VEC_COND_EXPR. E.g.
+
+   S1: r_1 = r_2 + d_3
+
+   is transformed into:
+
+   S1': r_4 = r_2 + d_3
+   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
+
+   Return generated condition statement.  */
+
+static gimple *
+vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
+{
+  gimple_stmt_iterator gsi;
+  tree vectype;
+  tree lhs, rhs, tmp;
+  gimple *new_stmt, *phi;
+
+  lhs = gimple_assign_lhs (stmt);
+  vectype = TREE_TYPE (lhs);
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  /* Find operand RHS defined by PHI node.  */
+  rhs = gimple_assign_rhs1 (stmt);
+  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+  phi = SSA_NAME_DEF_STMT (rhs);
+
+  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
+    {
+      rhs = gimple_assign_rhs2 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      phi = SSA_NAME_DEF_STMT (rhs);
+      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
+    }
+
+  /* Convert reduction stmt to ordinary assignment to TMP.  */
+  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
+  gimple_assign_set_lhs (stmt, tmp);
+
+  /* Create VEC_COND_EXPR and insert it after STMT.  */
+  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
+  gsi = gsi_for_stmt (stmt);
+  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+
+  return new_stmt;
+}
+
+/* Function vect_mask_mask_load_store_stmt.
+
+   Mask given vectorized MASK_LOAD or MASK_STORE statement
+   STMT using MASK.  Function replaces a mask used by STMT
+   with its conjunction with MASK.  */
+
+static void
+vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree old_mask, new_mask;
+  gimple_stmt_iterator gsi;
+
+  gsi = gsi_for_stmt (stmt);
+  old_mask = gimple_call_arg (stmt, 2);
+
+  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
+
+  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
+  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
+  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+
+  gimple_call_set_arg (stmt, 2, new_mask);
+  update_stmt (stmt);
+}
+
+
+/* Function vect_mask_load_store_stmt.
+
+   Mask given vectorized load or store statement STMT using
+   MASK.  DR is a data reference for a scalar memory access.
+   Assignment is transformed into MASK_LOAD or MASK_STORE
+   statement.  SI is either an iterator pointing to STMT and
+   is to be updated or NULL.  */
+
+static void
+vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
+			   data_reference *dr, gimple_stmt_iterator *si)
+{
+  tree mem, val, addr, ptr;
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+  unsigned align, misalign;
+  tree elem_type = TREE_TYPE (vectype);
+  gimple *new_stmt;
+
+  gcc_assert (!si || gsi_stmt (*si) == stmt);
+
+  gsi = gsi_for_stmt (stmt);
+  if (gimple_store_p (stmt))
+    {
+      val = gimple_assign_rhs1 (stmt);
+      mem = gimple_assign_lhs (stmt);
+    }
+  else
+    {
+      val = gimple_assign_lhs (stmt);
+      mem = gimple_assign_rhs1 (stmt);
+    }
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
+				   true, NULL_TREE, true,
+				   GSI_SAME_STMT);
+
+  align = TYPE_ALIGN_UNIT (vectype);
+  if (aligned_access_p (dr))
+    misalign = 0;
+  else if (DR_MISALIGNMENT (dr) == -1)
+    {
+      align = TYPE_ALIGN_UNIT (elem_type);
+      misalign = 0;
+    }
+  else
+    misalign = DR_MISALIGNMENT (dr);
+  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
+  ptr = build_int_cst (reference_alias_ptr_type (mem),
+		       misalign ? misalign & -misalign : align);
+
+  if (gimple_store_p (stmt))
+    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					   mask, val);
+  else
+    {
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
+					     mask);
+      gimple_call_set_lhs (new_stmt, val);
+    }
+  gsi_replace (si ? si : &gsi, new_stmt, false);
+}
+
+/* Function vect_combine_loop_epilogue.
+
+   Combine loop epilogue with the main vectorized body.  It requires
+   masking of memory accesses and reductions.  */
+
+static void
+vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  unsigned mask_no;
+  auto_vec<tree> masks;
+
+  vect_gen_loop_masks (loop_vinfo, &masks);
+
+  /* Convert reduction statements if any.  */
+  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple *prev_stmt = NULL;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
+
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      while (stmt)
+	{
+	  prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
+						prev_stmt);
+	  stmt_info = vinfo_for_stmt (stmt);
+	  stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	}
+    }
+
+  /* Scan all loop statements to convert vector load/store including masked
+     form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+	   !gsi_end_p (si); gsi_next (&si))
+	{
+	  gimple *stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = NULL;
+	  tree vectype = NULL;
+	  data_reference *dr;
+
+	  /* Mask load case.  */
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt)
+	      && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	      && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Mask store case.  */
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
+	    }
+	  /* Load case.  */
+	  else if (gimple_assign_load_p (stmt)
+		   && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+
+	      /* Skip vector loads.  */
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+
+	      /* Skip invariant loads.  */
+	      if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
+				 ? STMT_VINFO_DR_STEP (stmt_info)
+				 : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Store case.  */
+	  else if (gimple_code (stmt) == GIMPLE_ASSIGN
+		   && gimple_store_p (stmt)
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  else
+	    continue;
+
+	  /* Skip hoisted out statements.  */
+	  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	    continue;
+
+	  mask_no = vect_get_mask_index_for_type (vectype);
+
+	  dr = STMT_VINFO_DATA_REF (stmt_info);
+	  while (stmt)
+	    {
+	      if (is_gimple_call (stmt))
+		vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
+	      else
+		vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
+					   /* Have to update iterator only if
+					      it points to stmt we mask.  */
+					   stmt == gsi_stmt (si) ? &si : NULL);
+
+	      stmt_info = vinfo_for_stmt (stmt);
+	      stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	    }
+	}
+    }
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== Loop epilogue was combined ===\n");
+}
 
 /* Function vect_transform_loop.
 
@@ -6936,7 +7568,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      run at least the vectorization factor number of times checking
      is pointless, too.  */
   th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	   && th > 1))
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
       if (dump_enabled_p ())
@@ -6985,12 +7619,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
-      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
-						 ratio_mult_vf, th,
-						 check_profitability);
+      /* If epilogue is combined with main loop peeling is not needed.  */
+      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	  || check_profitability)
+	epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
+						   ratio_mult_vf, th,
+						   check_profitability);
     }
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
@@ -6998,7 +7638,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   else
     {
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
     }
 
@@ -7252,6 +7895,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, ratio);
 
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    vect_combine_loop_epilogue (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
 		      expected_iterations / vectorization_factor);
@@ -7263,20 +7909,28 @@ vect_transform_loop (loop_vec_info loop_vinfo)
         loop->nb_iterations_likely_upper_bound
 	   = loop->nb_iterations_likely_upper_bound - 1;
     }
-  loop->nb_iterations_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
-		      vectorization_factor) - 1;
-  loop->nb_iterations_likely_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1,
-		      vectorization_factor) - 1;
+
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    loop->nb_iterations_upper_bound
+      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
+		      vectorization_factor, UNSIGNED) - 1;
+  else
+    loop->nb_iterations_upper_bound
+      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
+			vectorization_factor) - 1;
 
   if (loop->any_estimate)
     {
-      loop->nb_iterations_estimate
-        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
-       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	   && loop->nb_iterations_estimate != 0)
-	 loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	loop->nb_iterations_estimate
+	  = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
+			  UNSIGNED);
+      else
+	loop->nb_iterations_estimate
+	  = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  && loop->nb_iterations_estimate != 0)
+	loop->nb_iterations_estimate -= 1;
     }
 
   if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-28 13:37   ` Ilya Enkovich
@ 2016-06-28 14:16     ` Ilya Enkovich
  2016-07-11 13:39     ` Ilya Enkovich
  2016-07-14 22:04     ` Jeff Law
  2 siblings, 0 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-06-28 14:16 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On 28 Jun 15:24, Ilya Enkovich wrote:
> On 16 Jun 10:54, Jeff Law wrote:
> > 
> > I don't see anything particularly worrisome here either -- I have a slight
> > concern about correctness issues with only masking loads/stores and
> > reductions.  But I will defer to your judgment on whether or not there's
> > other stuff that we need to mask to combine the epilogue with the loop via
> > masking.
> 
> We have to mask operations which may cause errors if executed speculatively.
> For others we just ignore produced result.  So we don't truly mask reductions
> but fix-up their results.  I assume memory accesses are only ones we have to
> truly mask (plus non-const calls which are rejected now).  For signalling
> arithmetic I assumed we just don't vectorize it.
> 
> Basically we should act similar to if-conversion.  I'll check if it has
> restrictions I miss.

I see if-conversion pass checks non-memory statements using gimple_could_trap_p_1
but I don't see similar checks in the vectorizer.  Therefore I add similar
check into patch #05.

Currently we don't have a way to mask such statements except mask their operands
to exclude zeros, NaNs etc. and replace them with whatever we want.  This seems
inefficient though and most probably wouldn't fold into real masked instruction
on RTL.  We may handle those cases later if find them important.

Thanks,
Ilya

> 
> > 
> > Jeff
> 
> Here is an updated patch version.
> 
> Thanks,
> Ilya
> --

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-28 13:37   ` Ilya Enkovich
  2016-06-28 14:16     ` Ilya Enkovich
@ 2016-07-11 13:39     ` Ilya Enkovich
  2016-07-14 22:04     ` Jeff Law
  2 siblings, 0 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-11 13:39 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, Yuri Rumyantsev, Igor Zamyatin

Ping

2016-06-28 15:24 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> On 16 Jun 10:54, Jeff Law wrote:
>> On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
>> >Hi,
>> >
>> >This patch introduces support for loop epilogue combining.  This includes
>> >support in cost estimation and all required changes required to mask
>> >vectorized loop.
>> >
>> >Thanks,
>> >Ilya
>> >--
>> >gcc/
>> >
>> >2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>> >
>> >     * dbgcnt.def (vect_tail_combine): New.
>> >     * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>> >     * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>> >     * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>> >     epilogue combined with loop body.
>> >     (vect_do_peeling_for_loop_bound): Likewise.
>> >     * tree-vect-loop.c Include alias.h and dbgcnt.h.
>> >     (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>> >     arg, compute number of iterations for which loop epilogue combining is
>> >     profitable.
>> >     (vect_generate_tmps_on_preheader): Support combined apilogue.
>> >     (vect_gen_ivs_for_masking): New.
>> >     (vect_get_mask_index_for_elems): New.
>> >     (vect_get_mask_index_for_type): New.
>> >     (vect_gen_loop_masks): New.
>> >     (vect_mask_reduction_stmt): New.
>> >     (vect_mask_mask_load_store_stmt): New.
>> >     (vect_mask_load_store_stmt): New.
>> >     (vect_combine_loop_epilogue): New.
>> >     (vect_transform_loop): Support combined apilogue.
>> >
>> >
>> >diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> >index fab5879..b3c0668 100644
>> >--- a/gcc/tree-vect-loop-manip.c
>> >+++ b/gcc/tree-vect-loop-manip.c
>> >@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>> >   bb_between_loops = new_exit_bb;
>> >   bb_after_second_loop = split_edge (single_exit (second_loop));
>> >
>> >-  pre_condition =
>> >-    fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
>> >-  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
>> >-                                  bb_after_second_loop, bb_before_first_loop,
>> >-                              inverse_probability (second_guard_probability));
>> >+  if (skip_second_after_first)
>> >+    /* We can just redirect edge from bb_between_loops to
>> >+       bb_after_second_loop but we have many code assuming
>> >+       we have a guard after the first loop.  So just make
>> >+       always taken condtion.  */
>> >+    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
>> >+                             integer_zero_node);
>> This isn't ideal, but I don't think it's that big of an issue.
>>
>> >@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> >   basic_block preheader;
>> >   int loop_num;
>> >   int max_iter;
>> >+  int bound2;
>> >   tree cond_expr = NULL_TREE;
>> >   gimple_seq cond_expr_stmt_list = NULL;
>> >+  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>> >
>> >   if (dump_enabled_p ())
>> >     dump_printf_loc (MSG_NOTE, vect_location,
>> >@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> >
>> >   loop_num  = loop->num;
>> >
>> >+  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> Can you document what the TH parameter is to the various routines that use
>> it in tree-vect-loop-manip.c?  I realize you didn't add it, but it would
>> help anyone looking at this code in the future to know it's the threshold of
>> iterations for vectorization without having to find it in other function
>> comment headers ;-)
>>
>> That's pre-approved to go in immediately :-)
>>
>> >@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> >   max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> >           ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>> >           : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
>> >-  if (check_profitability)
>> >+  /* When epilogue is combined only profitability
>> >+     treshold matters.  */
>> s/treshold/threshold/
>>
>>
>>
>> > static void
>> > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>> >                                 int *ret_min_profitable_niters,
>> >-                                int *ret_min_profitable_estimate)
>> >+                                int *ret_min_profitable_estimate,
>> >+                                int *ret_min_profitable_combine_niters)
>> I'm torn a bit here.  There's all kinds of things missing/incomplete in the
>> function comments throughout the vectorizer.  And in some cases, like this
>> one, the parameters are largely self-documenting.  But we've also got coding
>> standards that we'd like to adhere to.
>>
>> I don't think it's fair to require you to fix all these issues in the
>> vectorizer (though if you wanted to, I'd fully support those an independent
>> cleanups).
>>
>> Perhaps just document LOOP_VINFO with a generic comment about the ret_*
>> parameters for this function rather than a comment for each ret_* parameter.
>> Pre-approved for the trunk independent of the vec-tails work.
>>
>>
>> >@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>> >                  min_profitable_estimate);
>> >
>> >+
>> >+      unsigned combine_treshold
>> >+    = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
>> >+      /* Calculate profitability combining epilogue with the main loop.
>> >+     We have a threshold for inside cost overhead (not applied
>> >+     for low trip count loop case):
>> >+     MIC * 100 < VIC * CT
>> >+     Masked iteration should be better than a scalar prologue:
>> >+     MIC + VIC < SIC * epilogue_niters  */
>> Can you double-check the whitespace formatting here.  Where does the "100"
>> come from and should it be a param?
>
> I checked the formatting.  We have 100 here because combine_treshold
> is measured in percent.  E.g. value 2 means iterations masking overhead
> shouldn't exceed 2% of vector iteration cost.
>
>>
>>
>> >@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>> >   return;
>> > }
>> >
>>
>> >+
>> >+/* Function vect_gen_loop_masks.
>> >+
>> >+   Create masks to mask a loop desvribed by LOOP_VINFO.  Masks
>> s/desvribed/described/
>>
>> >+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
>> >+   into MASKS vector.
>> >+
>> >+   Index of a mask in a vector is computed according to a number
>> >+   of masks's elements.  Masks are sorted by number of its elements
>> >+   in descending order.  Index 0 is used to access a mask with
>> >+   current_vector_size elements.  Among masks with the same number
>> >+   of elements the one with lower index is used to mask iterations
>> >+   with smaller iteration counter.  Note that you may get NULL elements
>> >+   for masks which are not required.  Use vect_get_mask_index_for_elems
>> >+   or vect_get_mask_index_for_type to access resulting vector.  */
>> >+
>> >+static void
>> >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
>> >+{
>> >+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> >+  edge pe = loop_preheader_edge (loop);
>> >+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
>> >+  unsigned min_mask_elems, max_mask_elems, nmasks;
>> >+  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
>> >+  auto_vec<tree> ivs;
>> >+  tree vectype, mask_type;
>> >+  tree vec_niters, vec_niters_val, mask;
>> >+  gimple *stmt;
>> >+  basic_block bb;
>> >+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
>> >+  unsigned vec_size;
>> >+
>> >+  /* Create required IVs.  */
>> >+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
>> >+  vectype = TREE_TYPE (ivs[0]);
>> >+
>> >+  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
>> >+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
>> >+
>> >+  /* Get a proper niter to build a vector.  */
>> >+  if (!is_gimple_val (niters))
>> >+    {
>> >+      gimple_seq seq = NULL;
>> >+      niters = force_gimple_operand (niters, &seq, true, NULL);
>> >+      gsi_insert_seq_on_edge_immediate (pe, seq);
>> >+    }
>> >+  /* We may need a type cast in case niter has a too small type
>> >+     for generated IVs.  */
>> Nit.  There should be vertical whitespace after the close brace and the
>> comment for the next logical block of code.  Can you do a scan over the
>> patchkit looking for other instances where the vertical whitespace is
>> needed.
>>
>> Generally, if you find that a blob of code needs a comment, then the comment
>> and blob of code should have that vertical whitespace to visually separate
>> it from everything else.
>>
>>
>>
>> >+/* Function vect_combine_loop_epilogue.
>> >+
>> >+   Combine loop epilogue with the main vectorized body.  It requires
>> >+   masking of memory accesses and reductions.  */
>> So you mask reductions, loads & stores.  Is there anything else that we
>> might potentially need to mask to combine the loop & epilogue via masking?
>>
>>
>> I don't see anything particularly worrisome here either -- I have a slight
>> concern about correctness issues with only masking loads/stores and
>> reductions.  But I will defer to your judgment on whether or not there's
>> other stuff that we need to mask to combine the epilogue with the loop via
>> masking.
>
> We have to mask operations which may cause errors if executed speculatively.
> For others we just ignore produced result.  So we don't truly mask reductions
> but fix-up their results.  I assume memory accesses are only ones we have to
> truly mask (plus non-const calls which are rejected now).  For signalling
> arithmetic I assumed we just don't vectorize it.
>
> Basically we should act similar to if-conversion.  I'll check if it has
> restrictions I miss.
>
>>
>> Jeff
>
> Here is an updated patch version.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-28  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * dbgcnt.def (vect_tail_combine): New.
>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>         epilogue combined with loop body.
>         (vect_do_peeling_for_loop_bound): LIkewise.
>         (vect_do_peeling_for_alignment): ???
>         * tree-vect-loop.c Include alias.h and dbgcnt.h.
>         (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>         arg, compute number of iterations for which loop epilogue combining is
>         profitable.
>         (vect_generate_tmps_on_preheader): Support combined apilogue.
>         (vect_gen_ivs_for_masking): New.
>         (vect_get_mask_index_for_elems): New.
>         (vect_get_mask_index_for_type): New.
>         (vect_gen_loop_masks): New.
>         (vect_mask_reduction_stmt): New.
>         (vect_mask_mask_load_store_stmt): New.
>         (vect_mask_load_store_stmt): New.
>         (vect_combine_loop_epilogue): New.
>         (vect_transform_loop): Support combined apilogue.
>
>
> diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
> index 78ddcc2..73c2966 100644
> --- a/gcc/dbgcnt.def
> +++ b/gcc/dbgcnt.def
> @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
>  DEBUG_COUNTER (tree_sra)
>  DEBUG_COUNTER (vect_loop)
>  DEBUG_COUNTER (vect_slp)
> +DEBUG_COUNTER (vect_tail_combine)
>  DEBUG_COUNTER (dom_unreachable_edges)
> diff --git a/gcc/params.def b/gcc/params.def
> index 62a1e40..98d6c5a 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
>           "Maximum number of may-defs visited when devirtualizing "
>           "speculatively", 50, 0, 0)
>
> +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
> +         "vect-cost-increase-combine-threshold",
> +         "Cost increase threshold to mask main loop for epilogue.",
> +         10, 0, 300)
> +
>  /*
>
>  Local variables:
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index a902a50..26e0cc1 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -4007,6 +4007,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
>    case vect_scalar_var:
>      prefix = "stmp";
>      break;
> +  case vect_mask_var:
> +    prefix = "mask";
> +    break;
>    case vect_pointer_var:
>      prefix = "vectp";
>      break;
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index c26aa1d..7403686 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>    int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>    int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
>    int probability_of_second_loop;
> +  bool skip_second_after_first = false;
>
>    if (!slpeel_can_duplicate_loop_p (loop, e))
>      return NULL;
> @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>      {
>        loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
>        tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
> -      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
> +      unsigned limit = 0;
> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +       skip_second_after_first = true;
> +      else
> +       limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
>        if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>         limit = limit + 1;
>        if (check_profitability
> @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>    bb_between_loops = new_exit_bb;
>    bb_after_second_loop = split_edge (single_exit (second_loop));
>
> -  pre_condition =
> -       fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> -  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> -                                  bb_after_second_loop, bb_before_first_loop,
> -                                 inverse_probability (second_guard_probability));
> +  if (skip_second_after_first)
> +    /* We can just redirect edge from bb_between_loops to
> +       bb_after_second_loop but we have many code assuming
> +       we have a guard after the first loop.  So just make
> +       always taken condtion.  */
> +    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
> +                                integer_zero_node);
> +  else
> +    pre_condition =
> +      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> +  skip_e
> +    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> +                            bb_after_second_loop, bb_before_first_loop,
> +                            inverse_probability (second_guard_probability));
>    scale_loop_profile (second_loop, probability_of_second_loop, bound2);
>    slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                       second_loop == new_loop, &new_exit_bb);
> @@ -1762,8 +1776,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    basic_block preheader;
>    int loop_num;
>    int max_iter;
> +  int bound2;
>    tree cond_expr = NULL_TREE;
>    gimple_seq cond_expr_stmt_list = NULL;
> +  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -1773,12 +1789,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>
>    loop_num  = loop->num;
>
> +  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    new_loop
>      = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
>                                      &ratio_mult_vf_name, ni_name, false,
>                                      th, check_profitability,
>                                      cond_expr, cond_expr_stmt_list,
> -                                    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +                                    0, bound2);
>    gcc_assert (new_loop);
>    gcc_assert (loop_num == loop->num);
>    slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
> @@ -1807,7 +1824,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>    max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>               ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>               : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
> -  if (check_profitability)
> +
> +  /* When epilogue is combined only profitability
> +     threshold matters.  */
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    max_iter = (int) th - 1;
> +  else if (check_profitability)
>      max_iter = MAX (max_iter, (int) th - 1);
>    record_niter_bound (new_loop, max_iter, false, true);
>    dump_printf (MSG_NOTE,
> @@ -2044,7 +2066,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
>                                    bound, 0);
>
>    gcc_assert (new_loop);
> -  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
> +  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
>    /* For vectorization factor N, we need to copy at most N-1 values
>       for alignment and this means N-2 loopback edge executions.  */
>    max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 41b9380..08fad82 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-fold.h"
>  #include "cgraph.h"
>  #include "tree-if-conv.h"
> +#include "alias.h"
> +#include "dbgcnt.h"
>
>  /* Loop Vectorization Pass.
>
> @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3.  If not see
>     http://gcc.gnu.org/projects/tree-ssa/vectorization.html
>  */
>
> -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
> +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
> +                                               int *);
>
>  /* Function vect_determine_vectorization_factor
>
> @@ -2310,8 +2313,10 @@ start_over:
>
>    /* Analyze cost.  Decide if worth while to vectorize.  */
>    int min_profitable_estimate, min_profitable_iters;
> +  int min_profitable_combine_iters;
>    vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
> -                                     &min_profitable_estimate);
> +                                     &min_profitable_estimate,
> +                                     &min_profitable_combine_iters);
>
>    if (min_profitable_iters < 0)
>      {
> @@ -2420,6 +2425,52 @@ start_over:
>    gcc_assert (vectorization_factor
>               == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>
> +  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +    {
> +      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> +      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
> +    }
> +  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +          && min_profitable_combine_iters >= 0)
> +    {
> +      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +           && (LOOP_VINFO_INT_NITERS (loop_vinfo)
> +               >= (unsigned) min_profitable_combine_iters))
> +          || estimated_niter == -1
> +          || estimated_niter >= min_profitable_combine_iters)
> +         && dbg_cnt (vect_tail_combine))
> +       {
> +         LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> +         LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
> +
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "Decided to combine loop with its epilogue.\n");
> +
> +         /* We need to adjust profitability check if combine
> +            epilogue considering additional vector iteration
> +            and profitable combine iterations.  */
> +         if ((int)(min_profitable_combine_iters + vectorization_factor)
> +             > min_scalar_loop_bound)
> +           {
> +             LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
> +               = (unsigned) min_profitable_combine_iters;
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_NOTE, vect_location,
> +                                "Updated runtime profitability treshold: %d\n",
> +                                min_profitable_combine_iters);
> +
> +           }
> +       }
> +      else
> +       {
> +         if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "Not combined loop with epilogue: iterations "
> +                            "count is too low (threshold is %d).\n",
> +                            min_profitable_combine_iters);
> +       }
> +    }
> +
>    /* Ok to vectorize!  */
>    return true;
>
> @@ -3392,12 +3443,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
>     profitability check.
>
>     *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used
> -   for static check against estimated number of iterations.  */
> +   for static check against estimated number of iterations.
> +
> +   *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold
> +   of iterations for vectorization with combined loop epilogue.  -1 means
> +   combining is not profitable.  Value may be used fo dynamic profitability
> +   check.  */
>
>  static void
>  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                                     int *ret_min_profitable_niters,
> -                                   int *ret_min_profitable_estimate)
> +                                   int *ret_min_profitable_estimate,
> +                                   int *ret_min_profitable_combine_niters)
>  {
>    int min_profitable_iters;
>    int min_profitable_estimate;
> @@ -3641,6 +3698,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                     vec_prologue_cost);
>        dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
>                     vec_epilogue_cost);
> +      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
> +                   masking_prologue_cost);
> +      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
> +                   masking_inside_cost);
>        dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
>                     scalar_single_iter_cost);
>        dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
> @@ -3744,6 +3805,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>                      min_profitable_estimate);
>
>    *ret_min_profitable_estimate = min_profitable_estimate;
> +
> +  *ret_min_profitable_combine_niters = -1;
> +
> +  /* Don't try to vectorize epilogue of epilogue.  */
> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> +    return;
> +
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
> +       {
> +         if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> +           *ret_min_profitable_combine_niters = 0;
> +         return;
> +       }
> +
> +      unsigned combine_treshold
> +       = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
> +      /* Calculate profitability combining epilogue with the main loop.
> +        We have a threshold for inside cost overhead (not applied
> +        for low trip count loop case):
> +        MIC * 100 < VIC * CT
> +        Masked iteration should be better than a scalar prologue:
> +        MIC + VIC < SIC * epilogue_niters  */
> +      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
> +       {
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Combining loop with epilogue is not "
> +                              "profitable.\n");
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "  Combining overhead %d%% exceeds "
> +                              "treshold %d%%.\n",
> +                              masking_inside_cost * 100 / vec_inside_cost,
> +                              combine_treshold);
> +           }
> +         *ret_min_profitable_combine_niters = -1;
> +       }
> +      else if ((int)(masking_inside_cost + vec_inside_cost)
> +              >= scalar_single_iter_cost * peel_iters_epilogue)
> +       {
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Combining loop with epilogue is not "
> +                              "profitable.\n");
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "  Scalar epilogue is faster than a "
> +                              "single masked iteration.\n");
> +           }
> +         *ret_min_profitable_combine_niters = -1;
> +       }
> +      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> +       {
> +         int inside_cost = vec_inside_cost + masking_inside_cost;
> +         int outside_cost = vec_outside_cost + masking_prologue_cost;
> +         int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
> +                                 - inside_cost * peel_iters_prologue
> +                                 - inside_cost * peel_iters_epilogue)
> +                                / ((scalar_single_iter_cost * vf)
> +                                   - inside_cost);
> +
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "Combinig loop with epilogue "
> +                            "pofitability treshold = %d\n",
> +                            profitable_iters);
> +         *ret_min_profitable_combine_niters = profitable_iters;
> +       }
> +    }
>  }
>
>  /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
> @@ -6852,20 +6984,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    else
>      ni_minus_gap_name = ni_name;
>
> -  /* Create: ratio = ni >> log2(vf) */
> -  /* ???  As we have ni == number of latch executions + 1, ni could
> -     have overflown to zero.  So avoid computing ratio based on ni
> -     but compute it using the fact that we know ratio will be at least
> -     one, thus via (ni - vf) >> log2(vf) + 1.  */
> -  ratio_name
> -    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> -                  fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> -                               fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
> -                                            ni_minus_gap_name,
> -                                            build_int_cst
> -                                              (TREE_TYPE (ni_name), vf)),
> -                               log_vf),
> -                  build_int_cst (TREE_TYPE (ni_name), 1));
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    {
> +      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
> +      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> +      ratio_name
> +       = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> +                      fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> +                                   ni_name,
> +                                   build_int_cst (TREE_TYPE (ni_name),
> +                                                  vf - 1)),
> +                      log_vf);
> +    }
> +  else
> +    {
> +      /* Create: ratio = ni >> log2(vf) */
> +      /* ???  As we have ni == number of latch executions + 1, ni could
> +        have overflown to zero.  So avoid computing ratio based on ni
> +        but compute it using the fact that we know ratio will be at least
> +        one, thus via (ni - vf) >> log2(vf) + 1.  */
> +      ratio_name
> +        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> +                      fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> +                                   fold_build2 (MINUS_EXPR,
> +                                                TREE_TYPE (ni_name),
> +                                                ni_minus_gap_name,
> +                                                build_int_cst
> +                                                  (TREE_TYPE (ni_name), vf)),
> +                                   log_vf),
> +                      build_int_cst (TREE_TYPE (ni_name), 1));
> +    }
> +
>    if (!is_gimple_val (ratio_name))
>      {
>        var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
> @@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    return;
>  }
>
> +/* Function vect_gen_ivs_for_masking.
> +
> +   Create IVs to be used for masks computation to mask loop described
> +   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
> +
> +   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
> +   vectors, in this case IVS's elements with lower index hold IV with
> +   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
> +   vectorization factor.  */
> +
> +static void
> +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype = vect_get_masking_iv_type (loop_vinfo);
> +  tree type = TREE_TYPE (vectype);
> +  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> +  int ncopies  = vf / elems;
> +  int i, k;
> +  tree iv, init_val, step_val;
> +  bool insert_after;
> +  gimple_stmt_iterator gsi;
> +  tree *vtemp;
> +
> +  /* Create {VF, ..., VF} vector constant.  */
> +  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
> +
> +  vtemp = XALLOCAVEC (tree, vf);
> +  for (i = 0; i < ncopies; i++)
> +    {
> +      /* Create initial IV value.  */
> +      for (k = 0; k < vf; k++)
> +       vtemp[k] = build_int_cst (type, k + i * elems);
> +      init_val = build_vector (vectype, vtemp);
> +
> +      /* Create an inductive variable including phi node.  */
> +      standard_iv_increment_position (loop, &gsi, &insert_after);
> +      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
> +                &iv, NULL);
> +      ivs->safe_push (iv);
> +    }
> +}
> +
> +/* Function vect_get_mask_index_for_elems.
> +
> +   A helper function to access masks vector.  See vect_gen_loop_masks
> +   for masks vector sorting description.  Return index of the first
> +   mask having MASK_ELEMS elements.  */
> +
> +static inline unsigned
> +vect_get_mask_index_for_elems (unsigned mask_elems)
> +{
> +  return current_vector_size / mask_elems - 1;
> +}
> +
> +/* Function vect_get_mask_index_for_type.
> +
> +   A helper function to access masks vector.  See vect_gen_loop_masks
> +   for masks vector sorting description.  Return index of the first
> +   mask appropriate for VECTYPE.  */
> +
> +static inline unsigned
> +vect_get_mask_index_for_type (tree vectype)
> +{
> +  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> +  return vect_get_mask_index_for_elems (elems);
> +}
> +
> +/* Function vect_gen_loop_masks.
> +
> +   Create masks to mask a loop described by LOOP_VINFO.  Masks
> +   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> +   into MASKS vector.
> +
> +   Index of a mask in a vector is computed according to a number
> +   of masks's elements.  Masks are sorted by number of its elements
> +   in descending order.  Index 0 is used to access a mask with
> +   current_vector_size elements.  Among masks with the same number
> +   of elements the one with lower index is used to mask iterations
> +   with smaller iteration counter.  Note that you may get NULL elements
> +   for masks which are not required.  Use vect_get_mask_index_for_elems
> +   or vect_get_mask_index_for_type to access resulting vector.  */
> +
> +static void
> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  edge pe = loop_preheader_edge (loop);
> +  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
> +  unsigned min_mask_elems, max_mask_elems, nmasks;
> +  unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
> +  auto_vec<tree> ivs;
> +  tree vectype, mask_type;
> +  tree vec_niters, vec_niters_val, mask;
> +  gimple *stmt;
> +  basic_block bb;
> +  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
> +  unsigned vec_size;
> +
> +  /* Create required IVs.  */
> +  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
> +  vectype = TREE_TYPE (ivs[0]);
> +
> +  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
> +  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +  /* Get a proper niter to build a vector.  */
> +  if (!is_gimple_val (niters))
> +    {
> +      gimple_seq seq = NULL;
> +      niters = force_gimple_operand (niters, &seq, true, NULL);
> +      gsi_insert_seq_on_edge_immediate (pe, seq);
> +    }
> +
> +  /* We may need a type cast in case niter has a too small type
> +     for generated IVs.  */
> +  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
> +    {
> +      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
> +                                           NULL, "niters");
> +      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
> +      bb = gsi_insert_on_edge_immediate (pe, stmt);
> +      gcc_assert (!bb);
> +      niters = new_niters;
> +    }
> +
> +  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
> +  vec_niters_val = build_vector_from_val (vectype, niters);
> +  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
> +  stmt = gimple_build_assign (vec_niters, vec_niters_val);
> +  bb = gsi_insert_on_edge_immediate (pe, stmt);
> +  gcc_assert (!bb);
> +
> +  /* Determine which masks we need to compute and how many.  */
> +  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
> +  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
> +  masks->safe_grow_cleared (nmasks);
> +
> +  /* Now create base masks through comparison IV < VEC_NITERS.  */
> +  mask_type = build_same_sized_truth_vector_type (vectype);
> +  cur_mask = vect_get_mask_index_for_elems (iv_elems);
> +  for (unsigned i = 0; i < ivs.length (); i++)
> +    {
> +      tree iv = ivs[i];
> +      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
> +      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +      (*masks)[cur_mask++] = mask;
> +    }
> +
> +  /* Create narrowed masks.  */
> +  cur_mask_elems = iv_elems;
> +  nmasks = ivs.length ();
> +  while (cur_mask_elems < max_mask_elems)
> +    {
> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      cur_mask_elems <<= 1;
> +      nmasks >>= 1;
> +
> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> +      for (unsigned i = 0; i < nmasks; i++)
> +       {
> +         tree mask_low = (*masks)[prev_mask++];
> +         tree mask_hi = (*masks)[prev_mask++];
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
> +                                     mask_low, mask_hi);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +       }
> +    }
> +
> +  /* Created widened masks.  */
> +  cur_mask_elems = iv_elems;
> +  nmasks = ivs.length ();
> +  while (cur_mask_elems > min_mask_elems)
> +    {
> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      cur_mask_elems >>= 1;
> +      nmasks <<= 1;
> +
> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> +      for (unsigned i = 0; i < nmasks; i += 2)
> +       {
> +         tree orig_mask = (*masks)[prev_mask++];
> +
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +
> +         mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +         stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
> +         gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +         (*masks)[cur_mask++] = mask;
> +       }
> +    }
> +}
> +
> +/* Function vect_mask_reduction_stmt.
> +
> +   Mask given vectorized reduction statement STMT using
> +   MASK.  In case scalar reduction statement is vectorized
> +   into several vector statements then PREV holds a
> +   preceding vetor statement copy for STMT.
> +
> +   Masking is performed using VEC_COND_EXPR. E.g.
> +
> +   S1: r_1 = r_2 + d_3
> +
> +   is transformed into:
> +
> +   S1': r_4 = r_2 + d_3
> +   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
> +
> +   Return generated condition statement.  */
> +
> +static gimple *
> +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
> +{
> +  gimple_stmt_iterator gsi;
> +  tree vectype;
> +  tree lhs, rhs, tmp;
> +  gimple *new_stmt, *phi;
> +
> +  lhs = gimple_assign_lhs (stmt);
> +  vectype = TREE_TYPE (lhs);
> +
> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> +  /* Find operand RHS defined by PHI node.  */
> +  rhs = gimple_assign_rhs1 (stmt);
> +  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> +  phi = SSA_NAME_DEF_STMT (rhs);
> +
> +  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
> +    {
> +      rhs = gimple_assign_rhs2 (stmt);
> +      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> +      phi = SSA_NAME_DEF_STMT (rhs);
> +      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
> +    }
> +
> +  /* Convert reduction stmt to ordinary assignment to TMP.  */
> +  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
> +  gimple_assign_set_lhs (stmt, tmp);
> +
> +  /* Create VEC_COND_EXPR and insert it after STMT.  */
> +  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
> +  gsi = gsi_for_stmt (stmt);
> +  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
> +
> +  return new_stmt;
> +}
> +
> +/* Function vect_mask_mask_load_store_stmt.
> +
> +   Mask given vectorized MASK_LOAD or MASK_STORE statement
> +   STMT using MASK.  Function replaces a mask used by STMT
> +   with its conjunction with MASK.  */
> +
> +static void
> +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
> +{
> +  gimple *new_stmt;
> +  tree old_mask, new_mask;
> +  gimple_stmt_iterator gsi;
> +
> +  gsi = gsi_for_stmt (stmt);
> +  old_mask = gimple_call_arg (stmt, 2);
> +
> +  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
> +
> +  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
> +  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
> +  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
> +
> +  gimple_call_set_arg (stmt, 2, new_mask);
> +  update_stmt (stmt);
> +}
> +
> +
> +/* Function vect_mask_load_store_stmt.
> +
> +   Mask given vectorized load or store statement STMT using
> +   MASK.  DR is a data reference for a scalar memory access.
> +   Assignment is transformed into MASK_LOAD or MASK_STORE
> +   statement.  SI is either an iterator pointing to STMT and
> +   is to be updated or NULL.  */
> +
> +static void
> +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
> +                          data_reference *dr, gimple_stmt_iterator *si)
> +{
> +  tree mem, val, addr, ptr;
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +  unsigned align, misalign;
> +  tree elem_type = TREE_TYPE (vectype);
> +  gimple *new_stmt;
> +
> +  gcc_assert (!si || gsi_stmt (*si) == stmt);
> +
> +  gsi = gsi_for_stmt (stmt);
> +  if (gimple_store_p (stmt))
> +    {
> +      val = gimple_assign_rhs1 (stmt);
> +      mem = gimple_assign_lhs (stmt);
> +    }
> +  else
> +    {
> +      val = gimple_assign_lhs (stmt);
> +      mem = gimple_assign_rhs1 (stmt);
> +    }
> +
> +  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +             == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> +  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
> +                                  true, NULL_TREE, true,
> +                                  GSI_SAME_STMT);
> +
> +  align = TYPE_ALIGN_UNIT (vectype);
> +  if (aligned_access_p (dr))
> +    misalign = 0;
> +  else if (DR_MISALIGNMENT (dr) == -1)
> +    {
> +      align = TYPE_ALIGN_UNIT (elem_type);
> +      misalign = 0;
> +    }
> +  else
> +    misalign = DR_MISALIGNMENT (dr);
> +  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
> +  ptr = build_int_cst (reference_alias_ptr_type (mem),
> +                      misalign ? misalign & -misalign : align);
> +
> +  if (gimple_store_p (stmt))
> +    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
> +                                          mask, val);
> +  else
> +    {
> +      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
> +                                            mask);
> +      gimple_call_set_lhs (new_stmt, val);
> +    }
> +  gsi_replace (si ? si : &gsi, new_stmt, false);
> +}
> +
> +/* Function vect_combine_loop_epilogue.
> +
> +   Combine loop epilogue with the main vectorized body.  It requires
> +   masking of memory accesses and reductions.  */
> +
> +static void
> +vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
> +  unsigned mask_no;
> +  auto_vec<tree> masks;
> +
> +  vect_gen_loop_masks (loop_vinfo, &masks);
> +
> +  /* Convert reduction statements if any.  */
> +  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
> +    {
> +      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
> +      gimple *prev_stmt = NULL;
> +      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +
> +      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
> +
> +      stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +      while (stmt)
> +       {
> +         prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
> +                                               prev_stmt);
> +         stmt_info = vinfo_for_stmt (stmt);
> +         stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> +       }
> +    }
> +
> +  /* Scan all loop statements to convert vector load/store including masked
> +     form.  */
> +  for (unsigned i = 0; i < loop->num_nodes; i++)
> +    {
> +      basic_block bb = bbs[i];
> +      for (gimple_stmt_iterator si = gsi_start_bb (bb);
> +          !gsi_end_p (si); gsi_next (&si))
> +       {
> +         gimple *stmt = gsi_stmt (si);
> +         stmt_vec_info stmt_info = NULL;
> +         tree vectype = NULL;
> +         data_reference *dr;
> +
> +         /* Mask load case.  */
> +         if (is_gimple_call (stmt)
> +             && gimple_call_internal_p (stmt)
> +             && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +             && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
> +               continue;
> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         /* Mask store case.  */
> +         else if (is_gimple_call (stmt)
> +                  && gimple_call_internal_p (stmt)
> +                  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
> +                  && vinfo_for_stmt (stmt)
> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
> +           }
> +         /* Load case.  */
> +         else if (gimple_assign_load_p (stmt)
> +                  && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +
> +             /* Skip vector loads.  */
> +             if (!STMT_VINFO_VEC_STMT (stmt_info))
> +               continue;
> +
> +             /* Skip invariant loads.  */
> +             if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
> +                                ? STMT_VINFO_DR_STEP (stmt_info)
> +                                : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
> +               continue;
> +             stmt = STMT_VINFO_VEC_STMT (stmt_info);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         /* Store case.  */
> +         else if (gimple_code (stmt) == GIMPLE_ASSIGN
> +                  && gimple_store_p (stmt)
> +                  && vinfo_for_stmt (stmt)
> +                  && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> +           {
> +             stmt_info = vinfo_for_stmt (stmt);
> +             vectype = STMT_VINFO_VECTYPE (stmt_info);
> +           }
> +         else
> +           continue;
> +
> +         /* Skip hoisted out statements.  */
> +         if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> +           continue;
> +
> +         mask_no = vect_get_mask_index_for_type (vectype);
> +
> +         dr = STMT_VINFO_DATA_REF (stmt_info);
> +         while (stmt)
> +           {
> +             if (is_gimple_call (stmt))
> +               vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
> +             else
> +               vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
> +                                          /* Have to update iterator only if
> +                                             it points to stmt we mask.  */
> +                                          stmt == gsi_stmt (si) ? &si : NULL);
> +
> +             stmt_info = vinfo_for_stmt (stmt);
> +             stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> +           }
> +       }
> +    }
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +                    "=== Loop epilogue was combined ===\n");
> +}
>
>  /* Function vect_transform_loop.
>
> @@ -6936,7 +7568,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>       run at least the vectorization factor number of times checking
>       is pointless, too.  */
>    th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> -  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> +  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> +       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> +          && th > 1))
>        && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>      {
>        if (dump_enabled_p ())
> @@ -6985,12 +7619,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>      {
>        tree ratio_mult_vf;
>        if (!ni_name)
> -       ni_name = vect_build_loop_niters (loop_vinfo);
> +       {
> +         ni_name = vect_build_loop_niters (loop_vinfo);
> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> +       }
>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
>                                        &ratio);
> -      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> -                                                ratio_mult_vf, th,
> -                                                check_profitability);
> +      /* If epilogue is combined with main loop peeling is not needed.  */
> +      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> +         || check_profitability)
> +       epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> +                                                  ratio_mult_vf, th,
> +                                                  check_profitability);
>      }
>    else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>      ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
> @@ -6998,7 +7638,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>    else
>      {
>        if (!ni_name)
> -       ni_name = vect_build_loop_niters (loop_vinfo);
> +       {
> +         ni_name = vect_build_loop_niters (loop_vinfo);
> +         LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> +       }
>        vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
>      }
>
> @@ -7252,6 +7895,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>
>    slpeel_make_loop_iterate_ntimes (loop, ratio);
>
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    vect_combine_loop_epilogue (loop_vinfo);
> +
>    /* Reduce loop iterations by the vectorization factor.  */
>    scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
>                       expected_iterations / vectorization_factor);
> @@ -7263,20 +7909,28 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>          loop->nb_iterations_likely_upper_bound
>            = loop->nb_iterations_likely_upper_bound - 1;
>      }
> -  loop->nb_iterations_upper_bound
> -    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> -                     vectorization_factor) - 1;
> -  loop->nb_iterations_likely_upper_bound
> -    = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1,
> -                     vectorization_factor) - 1;
> +
> +  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +    loop->nb_iterations_upper_bound
> +      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
> +                     vectorization_factor, UNSIGNED) - 1;
> +  else
> +    loop->nb_iterations_upper_bound
> +      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> +                       vectorization_factor) - 1;
>
>    if (loop->any_estimate)
>      {
> -      loop->nb_iterations_estimate
> -        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> -       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -          && loop->nb_iterations_estimate != 0)
> -        loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
> +      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> +       loop->nb_iterations_estimate
> +         = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
> +                         UNSIGNED);
> +      else
> +       loop->nb_iterations_estimate
> +         = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> +      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +         && loop->nb_iterations_estimate != 0)
> +       loop->nb_iterations_estimate -= 1;
>      }
>
>    if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-06-28 13:37   ` Ilya Enkovich
  2016-06-28 14:16     ` Ilya Enkovich
  2016-07-11 13:39     ` Ilya Enkovich
@ 2016-07-14 22:04     ` Jeff Law
  2016-07-20 14:40       ` Ilya Enkovich
  2 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-14 22:04 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On 06/28/2016 06:24 AM, Ilya Enkovich wrote:

>
> Here is an updated patch version.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-28  Ilya Enkovich  <ilya.enkovich@intel.com>
>
> 	* dbgcnt.def (vect_tail_combine): New.
> 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> 	epilogue combined with loop body.
> 	(vect_do_peeling_for_loop_bound): LIkewise.
> 	(vect_do_peeling_for_alignment): ???
> 	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> 	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
> 	arg, compute number of iterations for which loop epilogue combining is
> 	profitable.
> 	(vect_generate_tmps_on_preheader): Support combined apilogue.
> 	(vect_gen_ivs_for_masking): New.
> 	(vect_get_mask_index_for_elems): New.
> 	(vect_get_mask_index_for_type): New.
> 	(vect_gen_loop_masks): New.
> 	(vect_mask_reduction_stmt): New.
> 	(vect_mask_mask_load_store_stmt): New.
> 	(vect_mask_load_store_stmt): New.
> 	(vect_combine_loop_epilogue): New.
> 	(vect_transform_loop): Support combined apilogue.
>
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 41b9380..08fad82 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>    return;
>  }
>
  +
> +/* Function vect_gen_loop_masks.
> +
> +   Create masks to mask a loop described by LOOP_VINFO.  Masks
> +   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> +   into MASKS vector.
> +
> +   Index of a mask in a vector is computed according to a number
> +   of masks's elements.  Masks are sorted by number of its elements
> +   in descending order.  Index 0 is used to access a mask with
> +   current_vector_size elements.  Among masks with the same number
> +   of elements the one with lower index is used to mask iterations
> +   with smaller iteration counter.  Note that you may get NULL elements
> +   for masks which are not required.  Use vect_get_mask_index_for_elems
> +   or vect_get_mask_index_for_type to access resulting vector.  */
> +
> +static void
> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
I find myself wondering if this ought to be broken down a bit (without 
changing the underlying semantics).

> +
> +  /* Create narrowed masks.  */
> +  cur_mask_elems = iv_elems;
> +  nmasks = ivs.length ();
> +  while (cur_mask_elems < max_mask_elems)
> +    {
> +      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      cur_mask_elems <<= 1;
> +      nmasks >>= 1;
> +
> +      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> +      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> +      for (unsigned i = 0; i < nmasks; i++)
> +	{
> +	  tree mask_low = (*masks)[prev_mask++];
> +	  tree mask_hi = (*masks)[prev_mask++];
> +	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> +	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
> +				      mask_low, mask_hi);
> +	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> +	  (*masks)[cur_mask++] = mask;
> +	}
> +    }
For example, pull this into its own function as well as the code to 
create widened masks.  In fact, didn't I see those functions in one of 
the other patches as their own separate subroutines?

It's not a huge deal and I don't think it requires another round of 
review.  I just found myself scrolling through multiple pages of this 
function and thought it'd be slightly easier to grok if were simply smaller.


> +
> +/* Function vect_mask_reduction_stmt.
> +
> +   Mask given vectorized reduction statement STMT using
> +   MASK.  In case scalar reduction statement is vectorized
> +   into several vector statements then PREV holds a
> +   preceding vetor statement copy for STMT.
s/vetor/vector/

With the one function split up and the typo fix I think this is OK for 
the trunk when the set as a whole is ready.

jeff


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-14 22:04     ` Jeff Law
@ 2016-07-20 14:40       ` Ilya Enkovich
  2016-07-20 16:24         ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-20 14:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On 14 Jul 16:04, Jeff Law wrote:
> On 06/28/2016 06:24 AM, Ilya Enkovich wrote:
> 
> >
> >Here is an updated patch version.
> >
> >Thanks,
> >Ilya
> >--
> >gcc/
> >
> >+/* Function vect_gen_loop_masks.
> >+
> >+   Create masks to mask a loop described by LOOP_VINFO.  Masks
> >+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> >+   into MASKS vector.
> >+
> >+   Index of a mask in a vector is computed according to a number
> >+   of masks's elements.  Masks are sorted by number of its elements
> >+   in descending order.  Index 0 is used to access a mask with
> >+   current_vector_size elements.  Among masks with the same number
> >+   of elements the one with lower index is used to mask iterations
> >+   with smaller iteration counter.  Note that you may get NULL elements
> >+   for masks which are not required.  Use vect_get_mask_index_for_elems
> >+   or vect_get_mask_index_for_type to access resulting vector.  */
> >+
> >+static void
> >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> I find myself wondering if this ought to be broken down a bit (without
> changing the underlying semantics).
> 
> >+
> >+  /* Create narrowed masks.  */
> >+  cur_mask_elems = iv_elems;
> >+  nmasks = ivs.length ();
> >+  while (cur_mask_elems < max_mask_elems)
> >+    {
> >+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> >+
> >+      cur_mask_elems <<= 1;
> >+      nmasks >>= 1;
> >+
> >+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> >+
> >+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> >+
> >+      for (unsigned i = 0; i < nmasks; i++)
> >+	{
> >+	  tree mask_low = (*masks)[prev_mask++];
> >+	  tree mask_hi = (*masks)[prev_mask++];
> >+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> >+	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
> >+				      mask_low, mask_hi);
> >+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> >+	  (*masks)[cur_mask++] = mask;
> >+	}
> >+    }
> For example, pull this into its own function as well as the code to create
> widened masks.  In fact, didn't I see those functions in one of the other
> patches as their own separate subroutines?

There were functions which check we may generate such masks.  Here we
actually generate them.  I moved the code into separate functions.

> 
> It's not a huge deal and I don't think it requires another round of review.
> I just found myself scrolling through multiple pages of this function and
> thought it'd be slightly easier to grok if were simply smaller.
> 
> 
> >+
> >+/* Function vect_mask_reduction_stmt.
> >+
> >+   Mask given vectorized reduction statement STMT using
> >+   MASK.  In case scalar reduction statement is vectorized
> >+   into several vector statements then PREV holds a
> >+   preceding vetor statement copy for STMT.
> s/vetor/vector/
> 
> With the one function split up and the typo fix I think this is OK for the
> trunk when the set as a whole is ready.
> 
> jeff
> 
> 

Here is an updated version.

Thanks,
Ilya
--
gcc/

2016-07-20  Ilya Enkovich  <ilya.enkovich@intel.com>

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): LIkewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
	arg, compute number of iterations for which loop epilogue combining is
	profitable.
	(vect_generate_tmps_on_preheader): Support combined apilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_create_narrowed_masks): New.
	(vect_create_widened_masks): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined apilogue.


diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index 78ddcc2..73c2966 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
 DEBUG_COUNTER (tree_sra)
 DEBUG_COUNTER (vect_loop)
 DEBUG_COUNTER (vect_slp)
+DEBUG_COUNTER (vect_tail_combine)
 DEBUG_COUNTER (dom_unreachable_edges)
diff --git a/gcc/params.def b/gcc/params.def
index b86d592..745da4c 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1232,6 +1232,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
 	  "Maximum number of may-defs visited when devirtualizing "
 	  "speculatively", 50, 0, 0)
 
+DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
+	  "vect-cost-increase-combine-threshold",
+	  "Cost increase threshold to mask main loop for epilogue.",
+	  10, 0, 300)
+
 /*
 
 Local variables:
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index cdf2ec9..195b033 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4097,6 +4097,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
   case vect_scalar_var:
     prefix = "stmp";
     break;
+  case vect_mask_var:
+    prefix = "mask";
+    break;
   case vect_pointer_var:
     prefix = "vectp";
     break;
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index af53889..59996f6 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1194,6 +1194,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
   int probability_of_second_loop;
+  bool skip_second_after_first = false;
 
   if (!slpeel_can_duplicate_loop_p (loop, e))
     return NULL;
@@ -1392,7 +1393,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
     {
       loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
       tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
-      unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
+      unsigned limit = 0;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	skip_second_after_first = true;
+      else
+	limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	limit = limit + 1;
       if (check_profitability
@@ -1463,11 +1468,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
   bb_between_loops = new_exit_bb;
   bb_after_second_loop = split_edge (single_exit (second_loop));
 
-  pre_condition =
-	fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
-  skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
-                                  bb_after_second_loop, bb_before_first_loop,
-				  inverse_probability (second_guard_probability));
+  if (skip_second_after_first)
+    /* We can just redirect edge from bb_between_loops to
+       bb_after_second_loop but we have many code assuming
+       we have a guard after the first loop.  So just make
+       always taken condtion.  */
+    pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
+				 integer_zero_node);
+  else
+    pre_condition =
+      fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
+  skip_e
+    = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
+			     bb_after_second_loop, bb_before_first_loop,
+			     inverse_probability (second_guard_probability));
   scale_loop_profile (second_loop, probability_of_second_loop, bound2);
   slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
                                      second_loop == new_loop, &new_exit_bb);
@@ -1759,8 +1773,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   basic_block preheader;
   int loop_num;
   int max_iter;
+  int bound2;
   tree cond_expr = NULL_TREE;
   gimple_seq cond_expr_stmt_list = NULL;
+  bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -1770,12 +1786,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
 
   loop_num  = loop->num;
 
+  bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   new_loop
     = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
 				     &ratio_mult_vf_name, ni_name, false,
 				     th, check_profitability,
 				     cond_expr, cond_expr_stmt_list,
-				     0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+				     0, bound2);
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
   slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
@@ -1804,7 +1821,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
   max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
 	      ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
 	      : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
-  if (check_profitability)
+
+  /* When epilogue is combined only profitability
+     threshold matters.  */
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    max_iter = (int) th - 1;
+  else if (check_profitability)
     max_iter = MAX (max_iter, (int) th - 1);
   record_niter_bound (new_loop, max_iter, false, true);
   dump_printf (MSG_NOTE,
@@ -2041,7 +2063,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
 				   bound, 0);
 
   gcc_assert (new_loop);
-  slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
+  if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
   /* For vectorization factor N, we need to copy at most N-1 values 
      for alignment and this means N-2 loopback edge executions.  */
   max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 12c6a66..81d5ad1 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -51,6 +51,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "cgraph.h"
 #include "tree-cfg.h"
 #include "tree-if-conv.h"
+#include "alias.h"
+#include "dbgcnt.h"
 
 /* Loop Vectorization Pass.
 
@@ -150,7 +152,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						int *);
 
 /* Function vect_determine_vectorization_factor
 
@@ -2288,8 +2291,10 @@ start_over:
 
   /* Analyze cost.  Decide if worth while to vectorize.  */
   int min_profitable_estimate, min_profitable_iters;
+  int min_profitable_combine_iters;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      &min_profitable_combine_iters);
 
   if (min_profitable_iters < 0)
     {
@@ -2398,6 +2403,52 @@ start_over:
   gcc_assert (vectorization_factor
 	      == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
 
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+    {
+      LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+      LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
+    }
+  else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+	   && min_profitable_combine_iters >= 0)
+    {
+      if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	    && (LOOP_VINFO_INT_NITERS (loop_vinfo)
+		>= (unsigned) min_profitable_combine_iters))
+	   || estimated_niter == -1
+	   || estimated_niter >= min_profitable_combine_iters)
+	  && dbg_cnt (vect_tail_combine))
+	{
+	  LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
+	  LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
+
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Decided to combine loop with its epilogue.\n");
+
+	  /* We need to adjust profitability check if combine
+	     epilogue considering additional vector iteration
+	     and profitable combine iterations.  */
+	  if ((int)(min_profitable_combine_iters + vectorization_factor)
+	      > min_scalar_loop_bound)
+	    {
+	      LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
+		= (unsigned) min_profitable_combine_iters;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Updated runtime profitability treshold: %d\n",
+				 min_profitable_combine_iters);
+
+	    }
+	}
+      else
+	{
+	  if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Not combined loop with epilogue: iterations "
+			     "count is too low (threshold is %d).\n",
+			     min_profitable_combine_iters);
+	}
+    }
+
   /* Ok to vectorize!  */
   return true;
 
@@ -3367,12 +3418,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
    profitability check.
 
    *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used
-   for static check against estimated number of iterations.  */
+   for static check against estimated number of iterations.
+
+   *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold
+   of iterations for vectorization with combined loop epilogue.  -1 means
+   combining is not profitable.  Value may be used fo dynamic profitability
+   check.  */
 
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    int *ret_min_profitable_combine_niters)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -3616,6 +3673,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
                    vec_prologue_cost);
       dump_printf (MSG_NOTE, "  Vector epilogue cost: %d\n",
                    vec_epilogue_cost);
+      dump_printf (MSG_NOTE, "  Masking prologue cost: %d\n",
+                   masking_prologue_cost);
+      dump_printf (MSG_NOTE, "  Masking inside cost: %d\n",
+                   masking_inside_cost);
       dump_printf (MSG_NOTE, "  Scalar iteration cost: %d\n",
                    scalar_single_iter_cost);
       dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
@@ -3719,6 +3780,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 		     min_profitable_estimate);
 
   *ret_min_profitable_estimate = min_profitable_estimate;
+
+  *ret_min_profitable_combine_niters = -1;
+
+  /* Don't try to vectorize epilogue of epilogue.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    return;
+
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
+	{
+	  if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	    *ret_min_profitable_combine_niters = 0;
+	  return;
+	}
+
+      unsigned combine_treshold
+	= PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
+      /* Calculate profitability combining epilogue with the main loop.
+	 We have a threshold for inside cost overhead (not applied
+	 for low trip count loop case):
+	 MIC * 100 < VIC * CT
+	 Masked iteration should be better than a scalar prologue:
+	 MIC + VIC < SIC * epilogue_niters  */
+      if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Combining overhead %d%% exceeds "
+			       "treshold %d%%.\n",
+			       masking_inside_cost * 100 / vec_inside_cost,
+			       combine_treshold);
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if ((int)(masking_inside_cost + vec_inside_cost)
+	       >= scalar_single_iter_cost * peel_iters_epilogue)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Combining loop with epilogue is not "
+			       "profitable.\n");
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "  Scalar epilogue is faster than a "
+			       "single masked iteration.\n");
+	    }
+	  *ret_min_profitable_combine_niters = -1;
+	}
+      else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
+	{
+	  int inside_cost = vec_inside_cost + masking_inside_cost;
+	  int outside_cost = vec_outside_cost + masking_prologue_cost;
+	  int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
+				  - inside_cost * peel_iters_prologue
+				  - inside_cost * peel_iters_epilogue)
+				 / ((scalar_single_iter_cost * vf)
+				    - inside_cost);
+
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Combinig loop with epilogue "
+			     "pofitability treshold = %d\n",
+			     profitable_iters);
+	  *ret_min_profitable_combine_niters = profitable_iters;
+	}
+    }
 }
 
 /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
@@ -6860,20 +6992,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   else
     ni_minus_gap_name = ni_name;
 
-  /* Create: ratio = ni >> log2(vf) */
-  /* ???  As we have ni == number of latch executions + 1, ni could
-     have overflown to zero.  So avoid computing ratio based on ni
-     but compute it using the fact that we know ratio will be at least
-     one, thus via (ni - vf) >> log2(vf) + 1.  */
-  ratio_name
-    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
-		   fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
-				fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
-					     ni_minus_gap_name,
-					     build_int_cst
-					       (TREE_TYPE (ni_name), vf)),
-				log_vf),
-		   build_int_cst (TREE_TYPE (ni_name), 1));
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    {
+      /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop.  */
+      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+      ratio_name
+	= fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+				    ni_name,
+				    build_int_cst (TREE_TYPE (ni_name),
+						   vf - 1)),
+		       log_vf);
+    }
+  else
+    {
+      /* Create: ratio = ni >> log2(vf) */
+      /* ???  As we have ni == number of latch executions + 1, ni could
+	 have overflown to zero.  So avoid computing ratio based on ni
+	 but compute it using the fact that we know ratio will be at least
+	 one, thus via (ni - vf) >> log2(vf) + 1.  */
+      ratio_name
+        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+				    fold_build2 (MINUS_EXPR,
+						 TREE_TYPE (ni_name),
+						 ni_minus_gap_name,
+						 build_int_cst
+						   (TREE_TYPE (ni_name), vf)),
+				    log_vf),
+		       build_int_cst (TREE_TYPE (ni_name), 1));
+    }
+
   if (!is_gimple_val (ratio_name))
     {
       var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
@@ -6903,6 +7052,525 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   return;
 }
 
+/* Function vect_gen_ivs_for_masking.
+
+   Create IVs to be used for masks computation to mask loop described
+   by LOOP_VINFO.  Created IVs are stored in IVS vector.  .
+
+   Initial IV values is {0, 1, ..., VF - 1} (probably split into several
+   vectors, in this case IVS's elements with lower index hold IV with
+   smaller numbers).  IV step is {VF, VF, ..., VF}.  VF is a used
+   vectorization factor.  */
+
+static void
+vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = vect_get_masking_iv_type (loop_vinfo);
+  tree type = TREE_TYPE (vectype);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies  = vf / elems;
+  int i, k;
+  tree iv, init_val, step_val;
+  bool insert_after;
+  gimple_stmt_iterator gsi;
+  tree *vtemp;
+
+  /* Create {VF, ..., VF} vector constant.  */
+  step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
+
+  vtemp = XALLOCAVEC (tree, vf);
+  for (i = 0; i < ncopies; i++)
+    {
+      /* Create initial IV value.  */
+      for (k = 0; k < vf; k++)
+	vtemp[k] = build_int_cst (type, k + i * elems);
+      init_val = build_vector (vectype, vtemp);
+
+      /* Create an inductive variable including phi node.  */
+      standard_iv_increment_position (loop, &gsi, &insert_after);
+      create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
+		 &iv, NULL);
+      ivs->safe_push (iv);
+    }
+}
+
+/* Function vect_get_mask_index_for_elems.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask having MASK_ELEMS elements.  */
+
+static inline unsigned
+vect_get_mask_index_for_elems (unsigned mask_elems)
+{
+  return current_vector_size / mask_elems - 1;
+}
+
+/* Function vect_get_mask_index_for_type.
+
+   A helper function to access masks vector.  See vect_gen_loop_masks
+   for masks vector sorting description.  Return index of the first
+   mask appropriate for VECTYPE.  */
+
+static inline unsigned
+vect_get_mask_index_for_type (tree vectype)
+{
+  unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
+  return vect_get_mask_index_for_elems (elems);
+}
+
+/* Function vect_create_narrowed_masks.
+
+   Create masks by narrowing NMASKS base masks having BASE_MASK_ELEMS
+   elements each and put them into MASKS vector.  MAX_MASK_ELEMS holds
+   the maximum number of elements in a mask required.  Generated
+   statements are inserted before GSI.  */
+static void
+vect_create_narrowed_masks (vec<tree> *masks, unsigned nmasks,
+			    unsigned base_mask_elems, unsigned max_mask_elems,
+			    gimple_stmt_iterator *gsi)
+{
+  unsigned cur_mask_elems = base_mask_elems;
+  unsigned cur_mask, prev_mask;
+  unsigned vec_size = current_vector_size;
+  tree mask_type, mask;
+  gimple *stmt;
+
+  while (cur_mask_elems < max_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems <<= 1;
+      nmasks >>= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i++)
+	{
+	  tree mask_low = (*masks)[prev_mask++];
+	  tree mask_hi = (*masks)[prev_mask++];
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
+				      mask_low, mask_hi);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_create_widened_masks.
+
+   Create masks by widening NMASKS base masks having BASE_MASK_ELEMS
+   elements each and put them into MASKS vector.  MIN_MASK_ELEMS holds
+   the minimum number of elements in a mask required.  Generated
+   statements are inserted before GSI.  */
+static void
+vect_create_widened_masks (vec<tree> *masks, unsigned nmasks,
+			   unsigned base_mask_elems, unsigned min_mask_elems,
+			   gimple_stmt_iterator *gsi)
+{
+  unsigned cur_mask_elems = base_mask_elems;
+  unsigned cur_mask, prev_mask;
+  unsigned vec_size = current_vector_size;
+  tree mask_type, mask;
+  gimple *stmt;
+
+  while (cur_mask_elems > min_mask_elems)
+    {
+      prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      cur_mask_elems >>= 1;
+      nmasks <<= 1;
+
+      cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
+
+      mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
+
+      for (unsigned i = 0; i < nmasks; i += 2)
+	{
+	  tree orig_mask = (*masks)[prev_mask++];
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+
+	  mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+	  stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
+	  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+	  (*masks)[cur_mask++] = mask;
+	}
+    }
+}
+
+/* Function vect_gen_loop_masks.
+
+   Create masks to mask a loop described by LOOP_VINFO.  Masks
+   are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
+   into MASKS vector.
+
+   Index of a mask in a vector is computed according to a number
+   of masks's elements.  Masks are sorted by number of its elements
+   in descending order.  Index 0 is used to access a mask with
+   current_vector_size elements.  Among masks with the same number
+   of elements the one with lower index is used to mask iterations
+   with smaller iteration counter.  Note that vector may have NULL values
+   for masks which are not required.  Use vect_get_mask_index_for_elems
+   or vect_get_mask_index_for_type to access resulting vector.  */
+
+static void
+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
+  unsigned min_mask_elems, max_mask_elems, nmasks;
+  unsigned iv_elems, cur_mask;
+  auto_vec<tree> ivs;
+  tree vectype, mask_type;
+  tree vec_niters, vec_niters_val, mask;
+  gimple *stmt;
+  basic_block bb;
+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+  unsigned vec_size;
+
+  /* Create required IVs.  */
+  vect_gen_ivs_for_masking (loop_vinfo, &ivs);
+  vectype = TREE_TYPE (ivs[0]);
+
+  vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+  iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
+
+  /* Get a proper niter to build a vector.  */
+  if (!is_gimple_val (niters))
+    {
+      gimple_seq seq = NULL;
+      niters = force_gimple_operand (niters, &seq, true, NULL);
+      gsi_insert_seq_on_edge_immediate (pe, seq);
+    }
+
+  /* We may need a type cast in case niter has a too small type
+     for generated IVs.  */
+  if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
+    {
+      tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
+					    NULL, "niters");
+      stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
+      bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!bb);
+      niters = new_niters;
+    }
+
+  /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME.  */
+  vec_niters_val = build_vector_from_val (vectype, niters);
+  vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
+  stmt = gimple_build_assign (vec_niters, vec_niters_val);
+  bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!bb);
+
+  /* Determine which masks we need to compute and how many.  */
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+  nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
+  masks->safe_grow_cleared (nmasks);
+
+  /* Now create base masks through comparison IV < VEC_NITERS.  */
+  mask_type = build_same_sized_truth_vector_type (vectype);
+  cur_mask = vect_get_mask_index_for_elems (iv_elems);
+  for (unsigned i = 0; i < ivs.length (); i++)
+    {
+      tree iv = ivs[i];
+      mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
+      stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      (*masks)[cur_mask++] = mask;
+    }
+
+  vect_create_narrowed_masks (masks, ivs.length (), iv_elems,
+			      max_mask_elems, &gsi);
+
+  vect_create_widened_masks (masks, ivs.length (), iv_elems,
+			     min_mask_elems, &gsi);
+}
+
+/* Function vect_mask_reduction_stmt.
+
+   Mask given vectorized reduction statement STMT using
+   MASK.  In case scalar reduction statement is vectorized
+   into several vector statements then PREV holds a
+   preceding vector statement copy for STMT.
+
+   Masking is performed using VEC_COND_EXPR. E.g.
+
+   S1: r_1 = r_2 + d_3
+
+   is transformed into:
+
+   S1': r_4 = r_2 + d_3
+   S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
+
+   Return generated condition statement.  */
+
+static gimple *
+vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
+{
+  gimple_stmt_iterator gsi;
+  tree vectype;
+  tree lhs, rhs, tmp;
+  gimple *new_stmt, *phi;
+
+  lhs = gimple_assign_lhs (stmt);
+  vectype = TREE_TYPE (lhs);
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  /* Find operand RHS defined by PHI node.  */
+  rhs = gimple_assign_rhs1 (stmt);
+  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+  phi = SSA_NAME_DEF_STMT (rhs);
+
+  if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
+    {
+      rhs = gimple_assign_rhs2 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      phi = SSA_NAME_DEF_STMT (rhs);
+      gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
+    }
+
+  /* Convert reduction stmt to ordinary assignment to TMP.  */
+  tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
+  gimple_assign_set_lhs (stmt, tmp);
+
+  /* Create VEC_COND_EXPR and insert it after STMT.  */
+  new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
+  gsi = gsi_for_stmt (stmt);
+  gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+
+  return new_stmt;
+}
+
+/* Function vect_mask_mask_load_store_stmt.
+
+   Mask given vectorized MASK_LOAD or MASK_STORE statement
+   STMT using MASK.  Function replaces a mask used by STMT
+   with its conjunction with MASK.  */
+
+static void
+vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree old_mask, new_mask;
+  gimple_stmt_iterator gsi;
+
+  gsi = gsi_for_stmt (stmt);
+  old_mask = gimple_call_arg (stmt, 2);
+
+  gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
+
+  new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
+  new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
+  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+
+  gimple_call_set_arg (stmt, 2, new_mask);
+  update_stmt (stmt);
+}
+
+
+/* Function vect_mask_load_store_stmt.
+
+   Mask given vectorized load or store statement STMT using
+   MASK.  DR is a data reference for a scalar memory access.
+   Assignment is transformed into MASK_LOAD or MASK_STORE
+   statement.  SI is either an iterator pointing to STMT and
+   is to be updated or NULL.  */
+
+static void
+vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
+			   data_reference *dr, gimple_stmt_iterator *si)
+{
+  tree mem, val, addr, ptr;
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+  unsigned align, misalign;
+  tree elem_type = TREE_TYPE (vectype);
+  gimple *new_stmt;
+
+  gcc_assert (!si || gsi_stmt (*si) == stmt);
+
+  gsi = gsi_for_stmt (stmt);
+  if (gimple_store_p (stmt))
+    {
+      val = gimple_assign_rhs1 (stmt);
+      mem = gimple_assign_lhs (stmt);
+    }
+  else
+    {
+      val = gimple_assign_lhs (stmt);
+      mem = gimple_assign_rhs1 (stmt);
+    }
+
+  gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+	      == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
+
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
+				   true, NULL_TREE, true,
+				   GSI_SAME_STMT);
+
+  align = TYPE_ALIGN_UNIT (vectype);
+  if (aligned_access_p (dr))
+    misalign = 0;
+  else if (DR_MISALIGNMENT (dr) == -1)
+    {
+      align = TYPE_ALIGN_UNIT (elem_type);
+      misalign = 0;
+    }
+  else
+    misalign = DR_MISALIGNMENT (dr);
+  set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
+  ptr = build_int_cst (reference_alias_ptr_type (mem),
+		       misalign ? misalign & -misalign : align);
+
+  if (gimple_store_p (stmt))
+    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					   mask, val);
+  else
+    {
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
+					     mask);
+      gimple_call_set_lhs (new_stmt, val);
+    }
+  gsi_replace (si ? si : &gsi, new_stmt, false);
+}
+
+/* Function vect_combine_loop_epilogue.
+
+   Combine loop epilogue with the main vectorized body.  It requires
+   masking of memory accesses and reductions.  */
+
+static void
+vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  unsigned mask_no;
+  auto_vec<tree> masks;
+
+  vect_gen_loop_masks (loop_vinfo, &masks);
+
+  /* Convert reduction statements if any.  */
+  for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple *prev_stmt = NULL;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+      mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
+
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      while (stmt)
+	{
+	  prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
+						prev_stmt);
+	  stmt_info = vinfo_for_stmt (stmt);
+	  stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	}
+    }
+
+  /* Scan all loop statements to convert vector load/store including masked
+     form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+	   !gsi_end_p (si); gsi_next (&si))
+	{
+	  gimple *stmt = gsi_stmt (si);
+	  stmt_vec_info stmt_info = NULL;
+	  tree vectype = NULL;
+	  data_reference *dr;
+
+	  /* Mask load case.  */
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt)
+	      && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	      && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Mask store case.  */
+	  else if (is_gimple_call (stmt)
+		   && gimple_call_internal_p (stmt)
+		   && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
+	    }
+	  /* Load case.  */
+	  else if (gimple_assign_load_p (stmt)
+		   && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+
+	      /* Skip vector loads.  */
+	      if (!STMT_VINFO_VEC_STMT (stmt_info))
+		continue;
+
+	      /* Skip invariant loads.  */
+	      if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
+				 ? STMT_VINFO_DR_STEP (stmt_info)
+				 : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
+		continue;
+	      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  /* Store case.  */
+	  else if (gimple_code (stmt) == GIMPLE_ASSIGN
+		   && gimple_store_p (stmt)
+		   && vinfo_for_stmt (stmt)
+		   && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
+	    {
+	      stmt_info = vinfo_for_stmt (stmt);
+	      vectype = STMT_VINFO_VECTYPE (stmt_info);
+	    }
+	  else
+	    continue;
+
+	  /* Skip hoisted out statements.  */
+	  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	    continue;
+
+	  mask_no = vect_get_mask_index_for_type (vectype);
+
+	  dr = STMT_VINFO_DATA_REF (stmt_info);
+	  while (stmt)
+	    {
+	      if (is_gimple_call (stmt))
+		vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
+	      else
+		vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
+					   /* Have to update iterator only if
+					      it points to stmt we mask.  */
+					   stmt == gsi_stmt (si) ? &si : NULL);
+
+	      stmt_info = vinfo_for_stmt (stmt);
+	      stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
+	    }
+	}
+    }
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== Loop epilogue was combined ===\n");
+}
 
 /* Function vect_transform_loop.
 
@@ -6944,7 +7612,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      run at least the vectorization factor number of times checking
      is pointless, too.  */
   th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-  if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+  if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
+       || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	   && th > 1))
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
       if (dump_enabled_p ())
@@ -6993,12 +7663,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
-      epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
-						 ratio_mult_vf, th,
-						 check_profitability);
+      /* If epilogue is combined with main loop peeling is not needed.  */
+      if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
+	  || check_profitability)
+	epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
+						   ratio_mult_vf, th,
+						   check_profitability);
     }
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
@@ -7006,7 +7682,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   else
     {
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
     }
 
@@ -7257,6 +7936,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, ratio);
 
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    vect_combine_loop_epilogue (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
 		      expected_iterations / vectorization_factor);
@@ -7268,20 +7950,28 @@ vect_transform_loop (loop_vec_info loop_vinfo)
         loop->nb_iterations_likely_upper_bound
 	   = loop->nb_iterations_likely_upper_bound - 1;
     }
-  loop->nb_iterations_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
-		      vectorization_factor) - 1;
-  loop->nb_iterations_likely_upper_bound
-    = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1,
-		      vectorization_factor) - 1;
+
+  if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+    loop->nb_iterations_upper_bound
+      = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
+		      vectorization_factor, UNSIGNED) - 1;
+  else
+    loop->nb_iterations_upper_bound
+      = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
+			vectorization_factor) - 1;
 
   if (loop->any_estimate)
     {
-      loop->nb_iterations_estimate
-        = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
-       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	   && loop->nb_iterations_estimate != 0)
-	 loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
+      if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
+	loop->nb_iterations_estimate
+	  = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
+			  UNSIGNED);
+      else
+	loop->nb_iterations_estimate
+	  = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  && loop->nb_iterations_estimate != 0)
+	loop->nb_iterations_estimate -= 1;
     }
 
   if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-20 14:40       ` Ilya Enkovich
@ 2016-07-20 16:24         ` Jeff Law
  2016-07-21  9:15           ` Ilya Enkovich
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-20 16:24 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On 07/20/2016 08:37 AM, Ilya Enkovich wrote:

> Here is an updated version.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-07-20  Ilya Enkovich  <ilya.enkovich@intel.com>
>
> 	* dbgcnt.def (vect_tail_combine): New.
> 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> 	epilogue combined with loop body.
> 	(vect_do_peeling_for_loop_bound): LIkewise.
> 	(vect_do_peeling_for_alignment): ???
> 	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> 	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
> 	arg, compute number of iterations for which loop epilogue combining is
> 	profitable.
> 	(vect_generate_tmps_on_preheader): Support combined apilogue.
> 	(vect_gen_ivs_for_masking): New.
> 	(vect_get_mask_index_for_elems): New.
> 	(vect_get_mask_index_for_type): New.
> 	(vect_create_narrowed_masks): New.
> 	(vect_create_widened_masks): New.
> 	(vect_gen_loop_masks): New.
> 	(vect_mask_reduction_stmt): New.
> 	(vect_mask_mask_load_store_stmt): New.
> 	(vect_mask_load_store_stmt): New.
> 	(vect_combine_loop_epilogue): New.
> 	(vect_transform_loop): Support combined apilogue.
I think this is OK.  We've just got patch #5 to work through now, correct?

Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-20 16:24         ` Jeff Law
@ 2016-07-21  9:15           ` Ilya Enkovich
  2016-07-21 16:34             ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-21  9:15 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

In my list I see #1, #4, and #5 are not approved.

Thanks,
Ilya

2016-07-20 19:24 GMT+03:00 Jeff Law <law@redhat.com>:
> On 07/20/2016 08:37 AM, Ilya Enkovich wrote:
>
>> Here is an updated version.
>>
>> Thanks,
>> Ilya
>> --
>> gcc/
>>
>> 2016-07-20  Ilya Enkovich  <ilya.enkovich@intel.com>
>>
>>         * dbgcnt.def (vect_tail_combine): New.
>>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support
>> vect_mask_var.
>>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>>         epilogue combined with loop body.
>>         (vect_do_peeling_for_loop_bound): LIkewise.
>>         (vect_do_peeling_for_alignment): ???
>>         * tree-vect-loop.c Include alias.h and dbgcnt.h.
>>         (vect_estimate_min_profitable_iters): Add
>> ret_min_profitable_combine_niters
>>         arg, compute number of iterations for which loop epilogue
>> combining is
>>         profitable.
>>         (vect_generate_tmps_on_preheader): Support combined apilogue.
>>         (vect_gen_ivs_for_masking): New.
>>         (vect_get_mask_index_for_elems): New.
>>         (vect_get_mask_index_for_type): New.
>>         (vect_create_narrowed_masks): New.
>>         (vect_create_widened_masks): New.
>>         (vect_gen_loop_masks): New.
>>         (vect_mask_reduction_stmt): New.
>>         (vect_mask_mask_load_store_stmt): New.
>>         (vect_mask_load_store_stmt): New.
>>         (vect_combine_loop_epilogue): New.
>>         (vect_transform_loop): Support combined apilogue.
>
> I think this is OK.  We've just got patch #5 to work through now, correct?
>
> Jeff
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-21  9:15           ` Ilya Enkovich
@ 2016-07-21 16:34             ` Jeff Law
  2016-07-22 11:36               ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-21 16:34 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: gcc-patches

On 07/21/2016 03:15 AM, Ilya Enkovich wrote:
> In my list I see #1, #4, and #5 are not approved.
So I think Richi wanted to see param control for the new options; Joseph 
wanted the new options properly documented in invoke.texi;  I had a few 
higher level questions which you answered.  Your updated patch #1 added 
param control and invoke.texi documentation.  So IMHO, #1 is approved.

#4 had some whitespace nits and needed some light doc improvements which 
you've done.  I think the only real implementation issue was computing 
costs in a single scan vs restarting the scan.  I was OK with the single 
scan approach you took -- not sure how strongly Richi feels about 
restarting the scan.  Seems like Richi needs to chime in on that topic.

I'm hoping to finish reviewing the update to #5 today.

jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-21 16:34             ` Jeff Law
@ 2016-07-22 11:36               ` Richard Biener
  2016-07-25 18:01                 ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2016-07-22 11:36 UTC (permalink / raw)
  To: Jeff Law; +Cc: Ilya Enkovich, gcc-patches

On Thu, Jul 21, 2016 at 6:34 PM, Jeff Law <law@redhat.com> wrote:
> On 07/21/2016 03:15 AM, Ilya Enkovich wrote:
>>
>> In my list I see #1, #4, and #5 are not approved.
>
> So I think Richi wanted to see param control for the new options; Joseph
> wanted the new options properly documented in invoke.texi;  I had a few
> higher level questions which you answered.  Your updated patch #1 added
> param control and invoke.texi documentation.  So IMHO, #1 is approved.
>
>
> #4 had some whitespace nits and needed some light doc improvements which
> you've done.  I think the only real implementation issue was computing costs
> in a single scan vs restarting the scan.  I was OK with the single scan
> approach you took -- not sure how strongly Richi feels about restarting the
> scan.  Seems like Richi needs to chime in on that topic.

I'm ok with a single scan as well.

The thing that needs work I think is re-running of if-conversion.
Also I don't like at
all that we have many variants of vectorizing but somehow the decision which one
to choose is rather unclear.  The way the epilogue vectorization code
is hooked in
is rather awkward and bound to be a maintainance burden (well, maybe a
small one).

And last, I double there is a case for a masked vectorized loop - I can bet that
doing a non-masked vectorized loop plus a masked epilogue (with no iteration
then!) will be always faster unless you hit the window of very few iterations
(or optimizing for size - in which case vectorizing is questionable on
its own and
disabled IIRC).

I don't mind Jeff not caring too much about this as I do not have
sufficient time
to iterate over this patch series with you within reasonable time.

Thanks,
Richard.

> I'm hoping to finish reviewing the update to #5 today.
>
> jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-22 11:36               ` Richard Biener
@ 2016-07-25 18:01                 ` Jeff Law
  2016-07-25 18:33                   ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-25 18:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches

On 07/22/2016 05:36 AM, Richard Biener wrote:
> The thing that needs work I think is re-running of if-conversion.
I wonder if we could revamp if-conversion to work on a subset of the 
CFG?   I can see that potentially being useful in other contexts.  Would 
that work for you Richi?

We've already got Bin doing that for DOM...


> Also I don't like at
> all that we have many variants of vectorizing but somehow the decision which one
> to choose is rather unclear.  The way the epilogue vectorization code
> is hooked in
> is rather awkward and bound to be a maintainance burden (well, maybe a
> small one).
I think it's going to be a small one.  I suspect that we really need 
another architecture with masking capabilities to really be able to see 
how the costing models ought to work and bring sanity to that decision.

>
> And last, I double there is a case for a masked vectorized loop - I can bet that
> doing a non-masked vectorized loop plus a masked epilogue (with no iteration
> then!) will be always faster unless you hit the window of very few iterations
> (or optimizing for size - in which case vectorizing is questionable on
> its own and
> disabled IIRC).
Ilya, does this case make a noticeable difference with the ICC 
implementation?

Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-25 18:01                 ` Jeff Law
@ 2016-07-25 18:33                   ` Richard Biener
  2016-07-25 21:08                     ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2016-07-25 18:33 UTC (permalink / raw)
  To: Jeff Law; +Cc: Ilya Enkovich, gcc-patches

On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>On 07/22/2016 05:36 AM, Richard Biener wrote:
>> The thing that needs work I think is re-running of if-conversion.
>I wonder if we could revamp if-conversion to work on a subset of the 
>CFG?   I can see that potentially being useful in other contexts. 
>Would 
>that work for you Richi?

Well, you need to make it not need post-dominators or preserve them (or compute "post-dominators" on SESE regions).

What doesn't work with the idea to clone the epilogue using __built-in_vectorized()
For the if- vs. Not if-converted loop?

Richard.

>We've already got Bin doing that for DOM...
>
>
>> Also I don't like at
>> all that we have many variants of vectorizing but somehow the
>decision which one
>> to choose is rather unclear.  The way the epilogue vectorization code
>> is hooked in
>> is rather awkward and bound to be a maintainance burden (well, maybe
>a
>> small one).
>I think it's going to be a small one.  I suspect that we really need 
>another architecture with masking capabilities to really be able to see
>
>how the costing models ought to work and bring sanity to that decision.
>
>>
>> And last, I double there is a case for a masked vectorized loop - I
>can bet that
>> doing a non-masked vectorized loop plus a masked epilogue (with no
>iteration
>> then!) will be always faster unless you hit the window of very few
>iterations
>> (or optimizing for size - in which case vectorizing is questionable
>on
>> its own and
>> disabled IIRC).
>Ilya, does this case make a noticeable difference with the ICC 
>implementation?
>
>Jeff


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-25 18:33                   ` Richard Biener
@ 2016-07-25 21:08                     ` Jeff Law
  2016-07-26  9:57                       ` Ilya Enkovich
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-25 21:08 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches

On 07/25/2016 12:32 PM, Richard Biener wrote:
> On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>> On 07/22/2016 05:36 AM, Richard Biener wrote:
>>> The thing that needs work I think is re-running of if-conversion.
>> I wonder if we could revamp if-conversion to work on a subset of the
>> CFG?   I can see that potentially being useful in other contexts.
>> Would
>> that work for you Richi?
>
> Well, you need to make it not need post-dominators or preserve them (or compute "post-dominators" on SESE regions).
Oh, but it'd be so nice to have DOMs and/or PDOMs on regions.  But 
that's probably out of scope for gcc-7.


>
> What doesn't work with the idea to clone the epilogue using __built-in_vectorized()
> For the if- vs. Not if-converted loop?
I must be missing something.   I don't see how 
builtin_vectorized_function helps, but maybe I've got the wrong built-in 
or don't understand what you're suggesting.

It sounds like this is the biggest impediment to moving forward.  So 
let's reset and make sure we're all on the same page here.

Ilya, what's the fundamental reason why we need to run if-conversion 
again?    Yes, I know you want to if-convert the epilogue, but why?

What are the consequences of not doing if-conversion on the epilogue? 
Presumably we miss a vectorization opportunity on the tail.  But that 
may be a reasonable limitation to allow the existing work to move 
forward while you go back and revamp things a little.

Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-25 21:08                     ` Jeff Law
@ 2016-07-26  9:57                       ` Ilya Enkovich
  2016-07-26 11:51                         ` Richard Biener
  2016-07-26 15:26                         ` Jeff Law
  0 siblings, 2 replies; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-26  9:57 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, gcc-patches

2016-07-26 0:08 GMT+03:00 Jeff Law <law@redhat.com>:
> On 07/25/2016 12:32 PM, Richard Biener wrote:
>>
>> On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 07/22/2016 05:36 AM, Richard Biener wrote:
>>>>
>>>> The thing that needs work I think is re-running of if-conversion.
>>>
>>> I wonder if we could revamp if-conversion to work on a subset of the
>>> CFG?   I can see that potentially being useful in other contexts.
>>> Would
>>> that work for you Richi?
>>
>>
>> Well, you need to make it not need post-dominators or preserve them (or
>> compute "post-dominators" on SESE regions).
>
> Oh, but it'd be so nice to have DOMs and/or PDOMs on regions.  But that's
> probably out of scope for gcc-7.
>
>
>>
>> What doesn't work with the idea to clone the epilogue using
>> __built-in_vectorized()
>> For the if- vs. Not if-converted loop?
>
> I must be missing something.   I don't see how builtin_vectorized_function
> helps, but maybe I've got the wrong built-in or don't understand what you're
> suggesting.
>
> It sounds like this is the biggest impediment to moving forward.  So let's
> reset and make sure we're all on the same page here.
>
> Ilya, what's the fundamental reason why we need to run if-conversion again?
> Yes, I know you want to if-convert the epilogue, but why?
>
> What are the consequences of not doing if-conversion on the epilogue?
> Presumably we miss a vectorization opportunity on the tail.  But that may be
> a reasonable limitation to allow the existing work to move forward while you
> go back and revamp things a little.

If we have some control-flow in a loop then we have to if-convert it
for vectorizer.
We need to preserve both versions: if-converted one for vectorizer and
the original
one to be used if vectorization fails.  For epilogues we have similar
situation and
need two versions.  I do it by running if-conversion on a copy of original loop.
Note that it doesn't run full if-conversion pass. If-conversion is
called for epilogue
loop only.

Thanks,
Ilya

>
> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26  9:57                       ` Ilya Enkovich
@ 2016-07-26 11:51                         ` Richard Biener
  2016-07-26 13:03                           ` Ilya Enkovich
  2016-07-26 15:26                         ` Jeff Law
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Biener @ 2016-07-26 11:51 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Jeff Law, gcc-patches

On Tue, Jul 26, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-07-26 0:08 GMT+03:00 Jeff Law <law@redhat.com>:
>> On 07/25/2016 12:32 PM, Richard Biener wrote:
>>>
>>> On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>>>>
>>>> On 07/22/2016 05:36 AM, Richard Biener wrote:
>>>>>
>>>>> The thing that needs work I think is re-running of if-conversion.
>>>>
>>>> I wonder if we could revamp if-conversion to work on a subset of the
>>>> CFG?   I can see that potentially being useful in other contexts.
>>>> Would
>>>> that work for you Richi?
>>>
>>>
>>> Well, you need to make it not need post-dominators or preserve them (or
>>> compute "post-dominators" on SESE regions).
>>
>> Oh, but it'd be so nice to have DOMs and/or PDOMs on regions.  But that's
>> probably out of scope for gcc-7.
>>
>>
>>>
>>> What doesn't work with the idea to clone the epilogue using
>>> __built-in_vectorized()
>>> For the if- vs. Not if-converted loop?
>>
>> I must be missing something.   I don't see how builtin_vectorized_function
>> helps, but maybe I've got the wrong built-in or don't understand what you're
>> suggesting.
>>
>> It sounds like this is the biggest impediment to moving forward.  So let's
>> reset and make sure we're all on the same page here.
>>
>> Ilya, what's the fundamental reason why we need to run if-conversion again?
>> Yes, I know you want to if-convert the epilogue, but why?
>>
>> What are the consequences of not doing if-conversion on the epilogue?
>> Presumably we miss a vectorization opportunity on the tail.  But that may be
>> a reasonable limitation to allow the existing work to move forward while you
>> go back and revamp things a little.
>
> If we have some control-flow in a loop then we have to if-convert it
> for vectorizer.
> We need to preserve both versions: if-converted one for vectorizer and
> the original
> one to be used if vectorization fails.  For epilogues we have similar
> situation and
> need two versions.  I do it by running if-conversion on a copy of original loop.
> Note that it doesn't run full if-conversion pass. If-conversion is
> called for epilogue
> loop only.

But it will still compute post-dominators for the full function for example.

You have the if-converted loop available already - it's the loop we are going
to vectorize.  If if-conversion generated if (__builtin_vectorized_p ()) style
loop copies then you can simply create the epilogue in the same way.
If it didn't then the loop is already if-converted anyway.

I see no need to re-run if-conversion here.

Richard.

> Thanks,
> Ilya
>
>>
>> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26 11:51                         ` Richard Biener
@ 2016-07-26 13:03                           ` Ilya Enkovich
  2016-07-26 13:05                             ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-26 13:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches

2016-07-26 14:51 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Tue, Jul 26, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-07-26 0:08 GMT+03:00 Jeff Law <law@redhat.com>:
>>> On 07/25/2016 12:32 PM, Richard Biener wrote:
>>>>
>>>> On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>>>>>
>>>>> On 07/22/2016 05:36 AM, Richard Biener wrote:
>>>>>>
>>>>>> The thing that needs work I think is re-running of if-conversion.
>>>>>
>>>>> I wonder if we could revamp if-conversion to work on a subset of the
>>>>> CFG?   I can see that potentially being useful in other contexts.
>>>>> Would
>>>>> that work for you Richi?
>>>>
>>>>
>>>> Well, you need to make it not need post-dominators or preserve them (or
>>>> compute "post-dominators" on SESE regions).
>>>
>>> Oh, but it'd be so nice to have DOMs and/or PDOMs on regions.  But that's
>>> probably out of scope for gcc-7.
>>>
>>>
>>>>
>>>> What doesn't work with the idea to clone the epilogue using
>>>> __built-in_vectorized()
>>>> For the if- vs. Not if-converted loop?
>>>
>>> I must be missing something.   I don't see how builtin_vectorized_function
>>> helps, but maybe I've got the wrong built-in or don't understand what you're
>>> suggesting.
>>>
>>> It sounds like this is the biggest impediment to moving forward.  So let's
>>> reset and make sure we're all on the same page here.
>>>
>>> Ilya, what's the fundamental reason why we need to run if-conversion again?
>>> Yes, I know you want to if-convert the epilogue, but why?
>>>
>>> What are the consequences of not doing if-conversion on the epilogue?
>>> Presumably we miss a vectorization opportunity on the tail.  But that may be
>>> a reasonable limitation to allow the existing work to move forward while you
>>> go back and revamp things a little.
>>
>> If we have some control-flow in a loop then we have to if-convert it
>> for vectorizer.
>> We need to preserve both versions: if-converted one for vectorizer and
>> the original
>> one to be used if vectorization fails.  For epilogues we have similar
>> situation and
>> need two versions.  I do it by running if-conversion on a copy of original loop.
>> Note that it doesn't run full if-conversion pass. If-conversion is
>> called for epilogue
>> loop only.
>
> But it will still compute post-dominators for the full function for example.
>
> You have the if-converted loop available already - it's the loop we are going
> to vectorize.  If if-conversion generated if (__builtin_vectorized_p ()) style
> loop copies then you can simply create the epilogue in the same way.
> If it didn't then the loop is already if-converted anyway.
>

Agree.  Calling if-conversion is just much simpler in implementation.

Thanks,
Ilya

> I see no need to re-run if-conversion here.
>
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26 13:03                           ` Ilya Enkovich
@ 2016-07-26 13:05                             ` Richard Biener
  0 siblings, 0 replies; 29+ messages in thread
From: Richard Biener @ 2016-07-26 13:05 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Jeff Law, gcc-patches

On Tue, Jul 26, 2016 at 3:03 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-07-26 14:51 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Tue, Jul 26, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> 2016-07-26 0:08 GMT+03:00 Jeff Law <law@redhat.com>:
>>>> On 07/25/2016 12:32 PM, Richard Biener wrote:
>>>>>
>>>>> On July 25, 2016 8:01:17 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>>>>>>
>>>>>> On 07/22/2016 05:36 AM, Richard Biener wrote:
>>>>>>>
>>>>>>> The thing that needs work I think is re-running of if-conversion.
>>>>>>
>>>>>> I wonder if we could revamp if-conversion to work on a subset of the
>>>>>> CFG?   I can see that potentially being useful in other contexts.
>>>>>> Would
>>>>>> that work for you Richi?
>>>>>
>>>>>
>>>>> Well, you need to make it not need post-dominators or preserve them (or
>>>>> compute "post-dominators" on SESE regions).
>>>>
>>>> Oh, but it'd be so nice to have DOMs and/or PDOMs on regions.  But that's
>>>> probably out of scope for gcc-7.
>>>>
>>>>
>>>>>
>>>>> What doesn't work with the idea to clone the epilogue using
>>>>> __built-in_vectorized()
>>>>> For the if- vs. Not if-converted loop?
>>>>
>>>> I must be missing something.   I don't see how builtin_vectorized_function
>>>> helps, but maybe I've got the wrong built-in or don't understand what you're
>>>> suggesting.
>>>>
>>>> It sounds like this is the biggest impediment to moving forward.  So let's
>>>> reset and make sure we're all on the same page here.
>>>>
>>>> Ilya, what's the fundamental reason why we need to run if-conversion again?
>>>> Yes, I know you want to if-convert the epilogue, but why?
>>>>
>>>> What are the consequences of not doing if-conversion on the epilogue?
>>>> Presumably we miss a vectorization opportunity on the tail.  But that may be
>>>> a reasonable limitation to allow the existing work to move forward while you
>>>> go back and revamp things a little.
>>>
>>> If we have some control-flow in a loop then we have to if-convert it
>>> for vectorizer.
>>> We need to preserve both versions: if-converted one for vectorizer and
>>> the original
>>> one to be used if vectorization fails.  For epilogues we have similar
>>> situation and
>>> need two versions.  I do it by running if-conversion on a copy of original loop.
>>> Note that it doesn't run full if-conversion pass. If-conversion is
>>> called for epilogue
>>> loop only.
>>
>> But it will still compute post-dominators for the full function for example.
>>
>> You have the if-converted loop available already - it's the loop we are going
>> to vectorize.  If if-conversion generated if (__builtin_vectorized_p ()) style
>> loop copies then you can simply create the epilogue in the same way.
>> If it didn't then the loop is already if-converted anyway.
>>
>
> Agree.  Calling if-conversion is just much simpler in implementation.

Agreed, but it's also quadratic in the number of vectorized loops in a function.
Not sure if it is really very much simpler either.

Richard.

> Thanks,
> Ilya
>
>> I see no need to re-run if-conversion here.
>>
>> Richard.
>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26  9:57                       ` Ilya Enkovich
  2016-07-26 11:51                         ` Richard Biener
@ 2016-07-26 15:26                         ` Jeff Law
  2016-07-26 15:38                           ` Ilya Enkovich
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-07-26 15:26 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Richard Biener, gcc-patches

On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>
>> Ilya, what's the fundamental reason why we need to run
>> if-conversion again? Yes, I know you want to if-convert the
>> epilogue, but why?
>>
>> What are the consequences of not doing if-conversion on the
>> epilogue? Presumably we miss a vectorization opportunity on the
>> tail.  But that may be a reasonable limitation to allow the
>> existing work to move forward while you go back and revamp things a
>> little.
>
> If we have some control-flow in a loop then we have to if-convert it
> for vectorizer. We need to preserve both versions: if-converted one
> for vectorizer and the original one to be used if vectorization
> fails.  For epilogues we have similar situation and need two
> versions.  I do it by running if-conversion on a copy of original
> loop. Note that it doesn't run full if-conversion pass. If-conversion
> is called for epilogue loop only.
Right.  So what I think Richi wants you to try is to use the 
if-converted loop to construct the if-converted epilogue.  It seems 
conceptually simple and low cost -- the question is on the 
implementation side.  I have no clue how painful that would be.

jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26 15:26                         ` Jeff Law
@ 2016-07-26 15:38                           ` Ilya Enkovich
  2016-08-01  9:09                             ` Ilya Enkovich
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-07-26 15:38 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, gcc-patches, Yuri Rumyantsev, Igor Zamyatin

2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>
>>>
>>> Ilya, what's the fundamental reason why we need to run
>>> if-conversion again? Yes, I know you want to if-convert the
>>> epilogue, but why?
>>>
>>> What are the consequences of not doing if-conversion on the
>>> epilogue? Presumably we miss a vectorization opportunity on the
>>> tail.  But that may be a reasonable limitation to allow the
>>> existing work to move forward while you go back and revamp things a
>>> little.
>>
>>
>> If we have some control-flow in a loop then we have to if-convert it
>> for vectorizer. We need to preserve both versions: if-converted one
>> for vectorizer and the original one to be used if vectorization
>> fails.  For epilogues we have similar situation and need two
>> versions.  I do it by running if-conversion on a copy of original
>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>> is called for epilogue loop only.
>
> Right.  So what I think Richi wants you to try is to use the if-converted
> loop to construct the if-converted epilogue.  It seems conceptually simple
> and low cost -- the question is on the implementation side.  I have no clue
> how painful that would be.

Probably another part of if-conversion may be re-used to build required
epilogue.  I'll have a look.

Thanks,
Ilya

>
> jeff
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-07-26 15:38                           ` Ilya Enkovich
@ 2016-08-01  9:09                             ` Ilya Enkovich
  2016-08-01 16:10                               ` Jeff Law
  0 siblings, 1 reply; 29+ messages in thread
From: Ilya Enkovich @ 2016-08-01  9:09 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, gcc-patches, Yuri Rumyantsev, Igor Zamyatin

2016-07-26 18:38 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> 2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
>> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>>
>>>>
>>>> Ilya, what's the fundamental reason why we need to run
>>>> if-conversion again? Yes, I know you want to if-convert the
>>>> epilogue, but why?
>>>>
>>>> What are the consequences of not doing if-conversion on the
>>>> epilogue? Presumably we miss a vectorization opportunity on the
>>>> tail.  But that may be a reasonable limitation to allow the
>>>> existing work to move forward while you go back and revamp things a
>>>> little.
>>>
>>>
>>> If we have some control-flow in a loop then we have to if-convert it
>>> for vectorizer. We need to preserve both versions: if-converted one
>>> for vectorizer and the original one to be used if vectorization
>>> fails.  For epilogues we have similar situation and need two
>>> versions.  I do it by running if-conversion on a copy of original
>>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>>> is called for epilogue loop only.
>>
>> Right.  So what I think Richi wants you to try is to use the if-converted
>> loop to construct the if-converted epilogue.  It seems conceptually simple
>> and low cost -- the question is on the implementation side.  I have no clue
>> how painful that would be.
>
> Probably another part of if-conversion may be re-used to build required
> epilogue.  I'll have a look.

Hi,

Yuri will continue my work from this point.

Thanks,
Ilya

>
> Thanks,
> Ilya
>
>>
>> jeff
>>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-08-01  9:09                             ` Ilya Enkovich
@ 2016-08-01 16:10                               ` Jeff Law
  2016-09-02 14:46                                 ` Yuri Rumyantsev
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Law @ 2016-08-01 16:10 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Richard Biener, gcc-patches, Yuri Rumyantsev, Igor Zamyatin

On 08/01/2016 03:09 AM, Ilya Enkovich wrote:
> 2016-07-26 18:38 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
>> 2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
>>> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>>>
>>>>>
>>>>> Ilya, what's the fundamental reason why we need to run
>>>>> if-conversion again? Yes, I know you want to if-convert the
>>>>> epilogue, but why?
>>>>>
>>>>> What are the consequences of not doing if-conversion on the
>>>>> epilogue? Presumably we miss a vectorization opportunity on the
>>>>> tail.  But that may be a reasonable limitation to allow the
>>>>> existing work to move forward while you go back and revamp things a
>>>>> little.
>>>>
>>>>
>>>> If we have some control-flow in a loop then we have to if-convert it
>>>> for vectorizer. We need to preserve both versions: if-converted one
>>>> for vectorizer and the original one to be used if vectorization
>>>> fails.  For epilogues we have similar situation and need two
>>>> versions.  I do it by running if-conversion on a copy of original
>>>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>>>> is called for epilogue loop only.
>>>
>>> Right.  So what I think Richi wants you to try is to use the if-converted
>>> loop to construct the if-converted epilogue.  It seems conceptually simple
>>> and low cost -- the question is on the implementation side.  I have no clue
>>> how painful that would be.
>>
>> Probably another part of if-conversion may be re-used to build required
>> epilogue.  I'll have a look.
>
> Hi,
>
> Yuri will continue my work from this point.
Understood.  I'm actually got some comments on #5 and Yuri is already on 
the CC list for that draft message.

Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-08-01 16:10                               ` Jeff Law
@ 2016-09-02 14:46                                 ` Yuri Rumyantsev
  2016-09-02 16:33                                   ` Bin.Cheng
  2016-09-05  7:39                                   ` Richard Biener
  0 siblings, 2 replies; 29+ messages in thread
From: Yuri Rumyantsev @ 2016-09-02 14:46 UTC (permalink / raw)
  To: Jeff Law; +Cc: Ilya Enkovich, Richard Biener, gcc-patches, Igor Zamyatin

Hi Jeff,

I am trying to reduce cost of repeated call of if-conversion for
epilogue vectorization. I'd like to clarify your recommendation -
should I design additional support for versioning in
vect_do_peeling_for_loop_bound or lightweight version of if-conversion
is sufficient? Any help in clarification will be appreciated.

Thanks ahead.
Yuri.

2016-08-01 19:10 GMT+03:00 Jeff Law <law@redhat.com>:
> On 08/01/2016 03:09 AM, Ilya Enkovich wrote:
>>
>> 2016-07-26 18:38 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
>>>
>>> 2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
>>>>
>>>> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ilya, what's the fundamental reason why we need to run
>>>>>> if-conversion again? Yes, I know you want to if-convert the
>>>>>> epilogue, but why?
>>>>>>
>>>>>> What are the consequences of not doing if-conversion on the
>>>>>> epilogue? Presumably we miss a vectorization opportunity on the
>>>>>> tail.  But that may be a reasonable limitation to allow the
>>>>>> existing work to move forward while you go back and revamp things a
>>>>>> little.
>>>>>
>>>>>
>>>>>
>>>>> If we have some control-flow in a loop then we have to if-convert it
>>>>> for vectorizer. We need to preserve both versions: if-converted one
>>>>> for vectorizer and the original one to be used if vectorization
>>>>> fails.  For epilogues we have similar situation and need two
>>>>> versions.  I do it by running if-conversion on a copy of original
>>>>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>>>>> is called for epilogue loop only.
>>>>
>>>>
>>>> Right.  So what I think Richi wants you to try is to use the
>>>> if-converted
>>>> loop to construct the if-converted epilogue.  It seems conceptually
>>>> simple
>>>> and low cost -- the question is on the implementation side.  I have no
>>>> clue
>>>> how painful that would be.
>>>
>>>
>>> Probably another part of if-conversion may be re-used to build required
>>> epilogue.  I'll have a look.
>>
>>
>> Hi,
>>
>> Yuri will continue my work from this point.
>
> Understood.  I'm actually got some comments on #5 and Yuri is already on the
> CC list for that draft message.
>
> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-09-02 14:46                                 ` Yuri Rumyantsev
@ 2016-09-02 16:33                                   ` Bin.Cheng
  2016-09-05  7:39                                   ` Richard Biener
  1 sibling, 0 replies; 29+ messages in thread
From: Bin.Cheng @ 2016-09-02 16:33 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Jeff Law, Ilya Enkovich, Richard Biener, gcc-patches, Igor Zamyatin

On Fri, Sep 2, 2016 at 3:46 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Jeff,
>
> I am trying to reduce cost of repeated call of if-conversion for
> epilogue vectorization. I'd like to clarify your recommendation -
> should I design additional support for versioning in
> vect_do_peeling_for_loop_bound or lightweight version of if-conversion
Hi Yuri,
I didn't read the patch, so please correct me if I mis-understand
anything.  It might be better not to introduce versioning logic in
peeling stuff if possible.  The peeling part is complicated and
generates somehow inefficient CFG.  I am preparing patches rewriting
the peeling stuff.

Thanks,
bin
> is sufficient? Any help in clarification will be appreciated.
>
> Thanks ahead.
> Yuri.
>
> 2016-08-01 19:10 GMT+03:00 Jeff Law <law@redhat.com>:
>> On 08/01/2016 03:09 AM, Ilya Enkovich wrote:
>>>
>>> 2016-07-26 18:38 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
>>>>
>>>> 2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
>>>>>
>>>>> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ilya, what's the fundamental reason why we need to run
>>>>>>> if-conversion again? Yes, I know you want to if-convert the
>>>>>>> epilogue, but why?
>>>>>>>
>>>>>>> What are the consequences of not doing if-conversion on the
>>>>>>> epilogue? Presumably we miss a vectorization opportunity on the
>>>>>>> tail.  But that may be a reasonable limitation to allow the
>>>>>>> existing work to move forward while you go back and revamp things a
>>>>>>> little.
>>>>>>
>>>>>>
>>>>>>
>>>>>> If we have some control-flow in a loop then we have to if-convert it
>>>>>> for vectorizer. We need to preserve both versions: if-converted one
>>>>>> for vectorizer and the original one to be used if vectorization
>>>>>> fails.  For epilogues we have similar situation and need two
>>>>>> versions.  I do it by running if-conversion on a copy of original
>>>>>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>>>>>> is called for epilogue loop only.
>>>>>
>>>>>
>>>>> Right.  So what I think Richi wants you to try is to use the
>>>>> if-converted
>>>>> loop to construct the if-converted epilogue.  It seems conceptually
>>>>> simple
>>>>> and low cost -- the question is on the implementation side.  I have no
>>>>> clue
>>>>> how painful that would be.
>>>>
>>>>
>>>> Probably another part of if-conversion may be re-used to build required
>>>> epilogue.  I'll have a look.
>>>
>>>
>>> Hi,
>>>
>>> Yuri will continue my work from this point.
>>
>> Understood.  I'm actually got some comments on #5 and Yuri is already on the
>> CC list for that draft message.
>>
>> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
  2016-09-02 14:46                                 ` Yuri Rumyantsev
  2016-09-02 16:33                                   ` Bin.Cheng
@ 2016-09-05  7:39                                   ` Richard Biener
  1 sibling, 0 replies; 29+ messages in thread
From: Richard Biener @ 2016-09-05  7:39 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Jeff Law, Ilya Enkovich, gcc-patches, Igor Zamyatin

On Fri, Sep 2, 2016 at 4:46 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Jeff,
>
> I am trying to reduce cost of repeated call of if-conversion for
> epilogue vectorization. I'd like to clarify your recommendation -
> should I design additional support for versioning in
> vect_do_peeling_for_loop_bound or lightweight version of if-conversion
> is sufficient? Any help in clarification will be appreciated.

For general infrastructure it would be nice to expose a (post-)dominator
compute for MESE (post-dominators) / SEME (dominators) regions.  I believe
what makes if-conversion expensive is the post-dom compute which happens
for each loop for the whole function.  It shouldn't be very difficult
to write this,
sharing as much as possible code with the current DOM code might need
quite some refactoring though.

If you want to avoid this work then you have to go the versioning route.

Richard.

> Thanks ahead.
> Yuri.
>
> 2016-08-01 19:10 GMT+03:00 Jeff Law <law@redhat.com>:
>> On 08/01/2016 03:09 AM, Ilya Enkovich wrote:
>>>
>>> 2016-07-26 18:38 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
>>>>
>>>> 2016-07-26 18:26 GMT+03:00 Jeff Law <law@redhat.com>:
>>>>>
>>>>> On 07/26/2016 03:57 AM, Ilya Enkovich wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ilya, what's the fundamental reason why we need to run
>>>>>>> if-conversion again? Yes, I know you want to if-convert the
>>>>>>> epilogue, but why?
>>>>>>>
>>>>>>> What are the consequences of not doing if-conversion on the
>>>>>>> epilogue? Presumably we miss a vectorization opportunity on the
>>>>>>> tail.  But that may be a reasonable limitation to allow the
>>>>>>> existing work to move forward while you go back and revamp things a
>>>>>>> little.
>>>>>>
>>>>>>
>>>>>>
>>>>>> If we have some control-flow in a loop then we have to if-convert it
>>>>>> for vectorizer. We need to preserve both versions: if-converted one
>>>>>> for vectorizer and the original one to be used if vectorization
>>>>>> fails.  For epilogues we have similar situation and need two
>>>>>> versions.  I do it by running if-conversion on a copy of original
>>>>>> loop. Note that it doesn't run full if-conversion pass. If-conversion
>>>>>> is called for epilogue loop only.
>>>>>
>>>>>
>>>>> Right.  So what I think Richi wants you to try is to use the
>>>>> if-converted
>>>>> loop to construct the if-converted epilogue.  It seems conceptually
>>>>> simple
>>>>> and low cost -- the question is on the implementation side.  I have no
>>>>> clue
>>>>> how painful that would be.
>>>>
>>>>
>>>> Probably another part of if-conversion may be re-used to build required
>>>> epilogue.  I'll have a look.
>>>
>>>
>>> Hi,
>>>
>>> Yuri will continue my work from this point.
>>
>> Understood.  I'm actually got some comments on #5 and Yuri is already on the
>> CC list for that draft message.
>>
>> Jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-09-05  7:21 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-19 19:46 [PATCH, vec-tails 07/10] Support loop epilogue combining Ilya Enkovich
2016-06-15 11:44 ` Richard Biener
2016-06-16 15:41   ` Ilya Enkovich
2016-06-16 15:51     ` Jeff Law
2016-06-16 16:03       ` Ilya Enkovich
2016-06-16 16:54 ` Jeff Law
2016-06-28 13:37   ` Ilya Enkovich
2016-06-28 14:16     ` Ilya Enkovich
2016-07-11 13:39     ` Ilya Enkovich
2016-07-14 22:04     ` Jeff Law
2016-07-20 14:40       ` Ilya Enkovich
2016-07-20 16:24         ` Jeff Law
2016-07-21  9:15           ` Ilya Enkovich
2016-07-21 16:34             ` Jeff Law
2016-07-22 11:36               ` Richard Biener
2016-07-25 18:01                 ` Jeff Law
2016-07-25 18:33                   ` Richard Biener
2016-07-25 21:08                     ` Jeff Law
2016-07-26  9:57                       ` Ilya Enkovich
2016-07-26 11:51                         ` Richard Biener
2016-07-26 13:03                           ` Ilya Enkovich
2016-07-26 13:05                             ` Richard Biener
2016-07-26 15:26                         ` Jeff Law
2016-07-26 15:38                           ` Ilya Enkovich
2016-08-01  9:09                             ` Ilya Enkovich
2016-08-01 16:10                               ` Jeff Law
2016-09-02 14:46                                 ` Yuri Rumyantsev
2016-09-02 16:33                                   ` Bin.Cheng
2016-09-05  7:39                                   ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).