public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [RFC] Combine vectorized loops with its scalar remainder.
@ 2015-10-28 10:57 Yuri Rumyantsev
  2015-11-03 10:08 ` Richard Henderson
  2015-11-03 11:47 ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-10-28 10:57 UTC (permalink / raw)
  To: gcc-patches, Jeff Law, Richard Biener, Igor Zamyatin,
	Илья
	Энкович

[-- Attachment #1: Type: text/plain, Size: 1663 bytes --]

Hi All,

Here is a preliminary patch to combine a vectorized loop with its scalar
remainder, a draft of which was proposed by Kirill Yukhin a month ago:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
It was tested with the '-mavx2' option on a Haswell processor.
Its main goal is to improve the performance of vectorized loops for AVX512.
Note that only loads/stores and simple reductions with binary operations are
converted to masked form, e.g. load --> masked load, and a reduction like
r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2.  Masking is performed by
creating a new vector induction variable initialized with the consecutive
values 0..VF-1, a new constant vector upper bound containing the number of
iterations, and a comparison of the two whose result is used as the mask
vector.
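
For illustration, here is a rough scalar sketch (not part of the patch; VF,
the names and the per-lane loops are only illustrative) of what the combined
loop computes for a simple sum reduction:

  float
  masked_sum (const float *a, int n)
  {
    enum { VF = 8 };                     /* assumed vectorization factor */
    float red[VF] = { 0.0f };            /* vector reduction accumulator */
    int bnd = ((n + VF - 1) / VF) * VF;  /* adjusted trip count */
    for (int i = 0; i < bnd; i += VF)
      for (int lane = 0; lane < VF; lane++)
        {
          int m = (i + lane) < n;                          /* vec_iv < vec_niters */
          float t = red[lane] + (m ? a[i + lane] : 0.0f);  /* masked load */
          red[lane] = m ? t : red[lane];                   /* r1 = m ? t : r2 */
        }
    float sum = 0.0f;
    for (int lane = 0; lane < VF; lane++)
      sum += red[lane];
    return sum;
  }
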
This implementation has several restrictions:

1. Multiple types are not supported.
2. SLP is not supported.
3. Gathers/scatters are also not supported.
4. Vectorization of loops with a low trip count is not implemented yet since
   it requires additional design and tuning.

We are planning to eliminate all these restrictions in GCCv7.

This patch will be extended to include a cost model to reject unprofitable
transformations; e.g. the new vector body cost will be evaluated through a new
target hook which estimates the cost of masking different vector statements.
A new threshold parameter will be introduced which determines the permissible
cost increase and which will be tuned on an AVX512 machine.
This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
load/store support, since only part of them is in the trunk compiler.
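
As a rough sketch of the intended profitability check (the names below are
illustrative, not from the patch), the transformation would be rejected when
masking makes the vector body more than the threshold percent more expensive:

  static bool
  combine_profitable_p (unsigned body_cost, unsigned masked_body_cost,
                        unsigned threshold_percent)
  {
    /* masked_body_cost would be body_cost plus the per-statement masking
       costs returned by the new target hook.  */
    return masked_body_cost * 100 <= body_cost * (100 + threshold_percent);
  }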

Any comments will be appreciated.

[-- Attachment #2: remainder.patch.1 --]
[-- Type: application/octet-stream, Size: 27229 bytes --]

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index cc51597..b85bfc5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -938,6 +938,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_NITERSM1 (res) = NULL;
   LOOP_VINFO_NITERS (res) = NULL;
   LOOP_VINFO_NITERS_UNCHANGED (res) = NULL;
+  LOOP_VINFO_NITERS_VECT_LOOP (res) = NULL;
   LOOP_VINFO_COST_MODEL_THRESHOLD (res) = 0;
   LOOP_VINFO_VECTORIZABLE_P (res) = 0;
   LOOP_VINFO_PEELING_FOR_ALIGNMENT (res) = 0;
@@ -6232,9 +6233,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
+      LOOP_VINFO_NITERS_VECT_LOOP (loop_vinfo) = ratio_mult_vf;
       vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, ratio_mult_vf,
 				      th, check_profitability);
     }
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 82fca0c..02e1359 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -52,6 +52,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "cgraph.h"
 #include "builtins.h"
+#include "tree-ssa-address.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -8627,3 +8629,714 @@ supportable_narrowing_operation (enum tree_code code,
   interm_types->release ();
   return false;
 }
+
+/* Fix the trip count of the vectorized loop to also cover the remainder.  */
+
+static void
+fix_vec_loop_trip_count (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree niters;
+  tree ratio_mult_vf = LOOP_VINFO_NITERS_VECT_LOOP (loop_vinfo);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  gimple *stmt;
+  gimple_stmt_iterator gsi;
+
+  niters = (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)) ?
+	    LOOP_VINFO_NITERS (loop_vinfo)
+	    : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
+
+  if (TREE_CODE (ratio_mult_vf) == SSA_NAME)
+    {
+      gimple *def = SSA_NAME_DEF_STMT (ratio_mult_vf);
+      tree bnd, lhs, tmp, log_vf;
+      gimple *def_bnd;
+      gimple *new_def_bnd;
+      gcc_assert (gimple_code (def) == GIMPLE_ASSIGN);
+      gcc_assert (gimple_assign_rhs_code (def) == LSHIFT_EXPR);
+      bnd = gimple_assign_rhs1 (def);
+      gcc_assert (TREE_CODE (bnd) == SSA_NAME);
+      gcc_assert (TREE_CODE (gimple_assign_rhs2 (def)) == INTEGER_CST);
+      def_bnd = SSA_NAME_DEF_STMT (bnd);
+      gsi = gsi_for_stmt (def_bnd);
+      /* Create t = niters + vfm1 statement.  */
+      lhs = create_tmp_var (TREE_TYPE (bnd));
+      stmt = gimple_build_assign (lhs, PLUS_EXPR, niters,
+				  build_int_cst (TREE_TYPE (bnd), vf - 1));
+      tmp = make_ssa_name (lhs, stmt);
+      gimple_assign_set_lhs (stmt, tmp);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      /* Replace BND definition with bnd = t >> log2 (vf).  */
+      log_vf = build_int_cst (TREE_TYPE (tmp), exact_log2 (vf));
+      new_def_bnd = gimple_build_assign (bnd, RSHIFT_EXPR, tmp, log_vf);
+      gsi_replace (&gsi, new_def_bnd, false);
+    }
+  else
+    {
+      tree op_const;
+      unsigned n;
+      unsigned logvf = exact_log2 (vf);
+      gcond *cond;
+      gcc_assert (TREE_CODE (ratio_mult_vf) == INTEGER_CST);
+      gcc_assert (TREE_CODE (niters) == INTEGER_CST);
+      /* Change value of bnd in GIMPLE_COND.  */
+      gcc_assert (loop->num_nodes == 2);
+      stmt = last_stmt (loop->header);
+      gcc_assert (gimple_code (stmt) == GIMPLE_COND);
+      n = tree_to_uhwi (niters);
+      n = ((n + (vf - 1)) >> logvf) << logvf;
+      op_const = build_int_cst (TREE_TYPE (gimple_cond_lhs (stmt)), n);
+      gcc_assert (TREE_CODE (gimple_cond_rhs (stmt)) == INTEGER_CST);
+      cond = dyn_cast <gcond *> (stmt);
+      gimple_cond_set_rhs (cond, op_const);
+    }
+}
+
+/* Make the scalar remainder unreachable from the vectorized loop.  */
+
+static void
+isolate_remainder (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge e;
+  basic_block bb = loop->header;
+  gimple *last;
+  gcond *cond;
+
+  e = EDGE_SUCC ((bb), 0);
+  if (flow_bb_inside_loop_p (loop, e->dest))
+    e = EDGE_SUCC ((bb), 1);
+  bb = e->dest;
+  gcc_assert (!flow_bb_inside_loop_p (loop, bb));
+  last = last_stmt (bb);
+  gcc_assert (gimple_code (last) == GIMPLE_COND);
+  cond = as_a <gcond *> (last);
+  /* Assume that target of false edge is scalar loop preheader.  */
+  gimple_cond_make_true (cond);
+}
+
+/* Generate the induction vector which will be used for mask evaluation.  */
+
+static tree
+gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  vec<constructor_elt, va_gc> *v;
+  gimple *stmt;
+  gimple_stmt_iterator gsi;
+  gphi *induction_phi;
+  tree iv_type, vectype;
+  tree lhs, rhs, iv;
+  unsigned n;
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  int i;
+  tree new_vec, new_var;
+  tree vec_init, vec_step, vec_dest, vec_def;
+  tree val;
+  tree induc_def;
+  basic_block new_bb;
+  machine_mode mode;
+
+  /* Find control iv.  */
+  stmt = last_stmt (loop->header);
+  gcc_assert (gimple_code (stmt) == GIMPLE_COND);
+  lhs = gimple_cond_lhs (stmt);
+  rhs = gimple_cond_rhs (stmt);
+  /* Assume any operand order.  */
+  if (TREE_CODE (lhs) != SSA_NAME)
+    iv = rhs;
+  else
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (lhs);
+      if (gimple_bb (def_stmt) != loop->header)
+	iv = rhs;
+      else
+	iv = lhs;
+    }
+  gcc_assert (TREE_CODE (iv) == SSA_NAME);
+  /* Determine type to build vector index aka induction vector.  */
+  n = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (iv)));
+  if (n > elem_size)
+    /* Multiple types are not yet supported.  */
+    return NULL_TREE;
+  if (n == elem_size && !TYPE_UNSIGNED (TREE_TYPE (iv)))
+    iv_type = TREE_TYPE (iv);
+  else
+    iv_type = build_nonstandard_integer_type (elem_size, 0);
+  vectype = get_vectype_for_scalar_type_and_size (iv_type, size);
+  mode =  TYPE_MODE (vectype);
+  /* Check that vector comparison for IV_TYPE is supported.  */
+  if (get_vcond_icode (mode, mode, 0)== CODE_FOR_nothing)
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "type is not supported for vector compare!\n");
+	  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+	}
+      return NULL_TREE;
+    }
+
+  /* Build induction initialization and insert it to loop preheader.  */
+  vec_alloc (v, vf);
+  for (i = 0; i < vf; i++)
+    {
+      tree elem;
+      elem = build_int_cst (iv_type, i);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, elem);
+    }
+  new_vec = build_vector_from_ctor (vectype, v);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_init = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_init);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+
+  /* Create vector-step consisting from VF.  */
+  val = build_int_cst (iv_type, vf);
+  new_vec = build_vector_from_val (vectype, val);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_step = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_step);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+
+  /* Create the induction-phi.  */
+  vec_dest = vect_get_new_vect_var (vectype, vect_simple_var, "vec_iv_");
+  induction_phi = create_phi_node (vec_dest, loop->header);
+  induc_def = PHI_RESULT (induction_phi);
+
+  /* Create vector iv increment inside loop.  */
+  gsi = gsi_after_labels (loop->header);
+  stmt = gimple_build_assign (vec_dest, PLUS_EXPR, induc_def, vec_step);
+  vec_def = make_ssa_name (vec_dest, stmt);
+  gimple_assign_set_lhs (stmt, vec_def);
+  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+
+  /* Set the arguments of phi node.  */
+  add_phi_arg (induction_phi, vec_init, pe, UNKNOWN_LOCATION);
+  add_phi_arg (induction_phi, vec_def, loop_latch_edge (loop),
+	       UNKNOWN_LOCATION);
+  return induc_def;
+}
+
+/* Produce the mask vector which will be used for masking.  */
+
+static tree
+gen_mask_for_remainder (loop_vec_info loop_vinfo, tree vec_index, unsigned size)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree new_vec, new_var;
+  tree niters, vec_niters, new_niters, vec_res, vec_mask;
+  gimple *stmt;
+  basic_block new_bb;
+  edge pe = loop_preheader_edge (loop);
+  gimple_stmt_iterator gsi;
+  tree vectype = TREE_TYPE (vec_index);
+  tree s_vectype;
+
+  gsi = gsi_after_labels (loop->header);
+  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	   ? LOOP_VINFO_NITERS (loop_vinfo)
+	   : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
+
+  /* Create vector for comparison consisting of niters.  */
+  if (!types_compatible_p (TREE_TYPE (niters), TREE_TYPE (vectype)))
+    {
+      tree new_type = TREE_TYPE (vectype);
+      enum tree_code cop;
+      cop = tree_to_uhwi (TYPE_SIZE (new_type)) ==
+	    tree_to_uhwi (TYPE_SIZE (TREE_TYPE (niters)))
+	    ? NOP_EXPR : CONVERT_EXPR;
+      new_niters = make_ssa_name (new_type);
+      stmt = gimple_build_assign (new_niters, cop, niters);
+      new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!new_bb);
+    }
+  else
+    new_niters = niters;
+  new_vec = build_vector_from_val (vectype, new_niters);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_niters = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_niters);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+  /* Create vector comparison the result of which will be used as mask
+     for loads/stores.  */
+  if (TYPE_UNSIGNED (vectype))
+    {
+      /* Create signed vectype.  */
+      tree stype = TREE_TYPE (vectype);
+      unsigned sz = tree_to_uhwi (TYPE_SIZE (stype));
+      tree new_type = build_nonstandard_integer_type (sz, 0);
+      s_vectype = get_vectype_for_scalar_type_and_size (new_type, size);
+      gcc_assert (s_vectype);
+    }
+  else
+    s_vectype = vectype;
+  vec_mask = vect_get_new_vect_var (s_vectype, vect_simple_var, "vec_mask_");
+  stmt = gimple_build_assign (vec_mask, LT_EXPR, vec_index, vec_niters);
+  vec_res = make_ssa_name (vec_mask, stmt);
+  gimple_assign_set_lhs (stmt, vec_res);
+  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+  return vec_res;
+}
+
+/* Convert each load to masked load.  */
+
+static void
+convert_loads_to_masked (vec<gimple *> *loads, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree addr, ref;
+  gimple_stmt_iterator gsi;
+
+  while (loads->length () > 0)
+    {
+      tree lhs, ptr;
+      stmt = loads->pop ();
+      gsi = gsi_for_stmt (stmt);
+      lhs = gimple_assign_lhs (stmt);
+      ref = gimple_assign_rhs1 (stmt);
+      addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+				       true, NULL_TREE, true,
+				       GSI_SAME_STMT);
+      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+      if (!SSA_NAME_PTR_INFO (addr))
+	copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3,
+					     addr, ptr, mask);
+      gimple_call_set_lhs (new_stmt, lhs);
+      gsi_replace (&gsi, new_stmt, false);
+    }
+}
+
+/* Convert each store to masked one.  */
+
+static void
+convert_stores_to_masked (vec<gimple *> *stores, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree addr, ref;
+  gimple_stmt_iterator gsi;
+
+  while (stores->length () > 0)
+    {
+      tree rhs, ptr;
+      stmt = stores->pop ();
+      gsi = gsi_for_stmt (stmt);
+      ref = gimple_assign_lhs (stmt);
+      rhs = gimple_assign_rhs1 (stmt);
+      addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+				       true, NULL_TREE, true,
+				       GSI_SAME_STMT);
+      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+      if (!SSA_NAME_PTR_INFO (addr))
+	copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
+      new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					      mask, rhs);
+      gsi_replace (&gsi, new_stmt, false);
+    }
+}
+
+static void
+fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree old, lhs, vectype, var, n_lhs;
+  gimple_stmt_iterator gsi;
+
+  while (masked_stmt->length () > 0)
+    {
+      stmt = masked_stmt->pop ();
+      gsi = gsi_for_stmt (stmt);
+      old = gimple_call_arg (stmt, 2);
+      vectype = TREE_TYPE (old);
+      if (TREE_TYPE (mask) != vectype)
+	{
+	  tree new_vtype = TREE_TYPE (mask);
+	  tree n_var;
+	  tree conv_expr;
+	  n_var = vect_get_new_vect_var (new_vtype, vect_simple_var, NULL);
+	  conv_expr = build1 (VIEW_CONVERT_EXPR, new_vtype, old);
+	  new_stmt = gimple_build_assign (n_var, conv_expr);
+	  n_lhs = make_ssa_name (n_var);
+	  gimple_assign_set_lhs (new_stmt, n_lhs);
+	  vectype = new_vtype;
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
+      else
+	n_lhs = old;
+      var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      new_stmt = gimple_build_assign (var, BIT_AND_EXPR, mask, n_lhs);
+      lhs = make_ssa_name (var, new_stmt);
+      gimple_assign_set_lhs (new_stmt, lhs);
+      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+      gimple_call_set_arg (stmt, 2, lhs);
+      update_stmt (stmt);
+    }
+}
+
+/* Convert vectorized reductions to VEC_COND statements to preserve
+   reduction semantics:
+	s1 = x + s2 --> t = x + s2; s1 = (mask) ? t : s2.  */
+
+static void
+convert_reductions (loop_vec_info loop_vinfo, tree mask)
+{
+  unsigned i;
+  for (i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple_stmt_iterator gsi;
+      tree vectype;
+      tree lhs, rhs;
+      tree var, new_lhs, vec_cond_expr;
+      gimple *new_stmt, *def;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      lhs = gimple_assign_lhs (stmt);
+      vectype = TREE_TYPE (lhs);
+      gsi = gsi_for_stmt (stmt);
+      rhs = gimple_assign_rhs1 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      def = SSA_NAME_DEF_STMT (rhs);
+      if (gimple_code (def) != GIMPLE_PHI)
+	{
+	  rhs = gimple_assign_rhs2 (stmt);
+	  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+	  def = SSA_NAME_DEF_STMT (rhs);
+	  gcc_assert (gimple_code (def) == GIMPLE_PHI);
+	}
+      /* Change lhs of STMT.  */
+      var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      new_lhs = make_ssa_name (var, stmt);
+      gimple_assign_set_lhs (stmt, new_lhs);
+      /* Generate new VEC_COND expr.  */
+      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
+      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
+      gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+    }
+}
+
+/* Return true if MEM_REF is incremented by vector size and false otherwise.  */
+
+static bool
+mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = TREE_TYPE (lhs);
+  unsigned n = GET_MODE_SIZE (TYPE_MODE (vectype));
+  gphi *phi;
+  edge e = loop_latch_edge (loop);
+  tree arg;
+  gimple *def;
+  tree name;
+  if (TREE_CODE (lhs) != MEM_REF)
+    return false;
+  name = TREE_OPERAND (lhs, 0);
+  if (TREE_CODE (name) != SSA_NAME)
+    return false;
+  def = SSA_NAME_DEF_STMT (name);
+  if (!def || gimple_code (def) != GIMPLE_PHI)
+    return false;
+  phi = as_a <gphi *> (def);
+  arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+  gcc_assert (TREE_CODE (arg) == SSA_NAME);
+  def = SSA_NAME_DEF_STMT (arg);
+  if (gimple_code (def) != GIMPLE_ASSIGN
+      || gimple_assign_rhs_code (def) != POINTER_PLUS_EXPR)
+    return false;
+  arg = gimple_assign_rhs2 (def);
+  if (TREE_CODE (arg) != INTEGER_CST)
+    arg = gimple_assign_rhs1 (def);
+  if (TREE_CODE (arg) != INTEGER_CST)
+    return false;
+  if (compare_tree_int (arg, n) != 0)
+    return false;
+  return true;
+}
+
+/* Combine the vectorized loop with its scalar remainder by masking
+   statements such as memory reads/writes and reductions so as to produce
+   a legal result.  A new vector induction variable is created to generate
+   the mask, which is simply the result of comparing that variable with a
+   vector containing the number of iterations.  The loop trip count is
+   adjusted and the scalar remainder loop is made unreachable.  */
+
+void
+combine_vect_loop_remainder (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  auto_vec<gimple *, 10> loads;
+  auto_vec<gimple *, 5> stores;
+  auto_vec<gimple *, 5> masked_ld_st;
+  int elem_size = 0;
+  int n;
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  gimple *stmt;
+  stmt_vec_info stmt_info;
+  tree lhs, rhs, vectype;
+  tree vec_index, vec_mask;
+  bool has_reductions = false;
+  unsigned size = 0;
+
+  if (!loop)
+    return;
+  if (loop->inner)
+    return;  /* do not support outer-loop vectorization.  */
+  gcc_assert (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo));
+  vect_location = find_loop_location (loop);
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    return;
+  if (!LOOP_VINFO_REDUCTION_CHAINS (loop_vinfo).is_empty ()
+      || !LOOP_VINFO_GROUPED_STORES (loop_vinfo).is_empty ())
+    return;
+  bb = loop->header;
+  /* Collect all loads and stores.  */
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      stmt_info = vinfo_for_stmt (stmt);
+      if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	/* Not supported yet!  */
+	return;
+      /* Check that we support the given def type.  */
+      if (stmt_info)
+	switch (STMT_VINFO_DEF_TYPE (stmt_info))
+	  {
+	    case vect_induction_def:
+	      if (STMT_VINFO_LIVE_P (stmt_info))
+		return;
+	      break;
+	    case vect_nested_cycle:
+	    case vect_double_reduction_def:
+	    case vect_external_def:
+	      return;
+	    default:
+	      break;
+	  }
+
+      if (gimple_assign_load_p (stmt))
+	{
+	  lhs = gimple_assign_lhs (stmt);
+	  rhs = gimple_assign_rhs1 (stmt);
+	  vectype = TREE_TYPE (lhs);
+	  if (may_be_nonaddressable_p (rhs))
+	    return;
+	  if (!VECTOR_TYPE_P (vectype))
+	    {
+	      struct data_reference *dr;
+	      if (!stmt_info)
+		continue;
+	      dr = STMT_VINFO_DATA_REF (stmt_info);
+	      if (!dr)
+		continue;
+	      if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
+		return;
+	      if (tree_int_cst_compare (DR_STEP (dr), size_zero_node) <= 0)
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				 "Load with decrement is not masked.\n");
+		  return;
+		}
+	      continue;
+	    }
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype), true))
+	    {
+	      if (dump_enabled_p ())
+		{
+		  dump_printf_loc (MSG_NOTE, vect_location,
+				   "type is not supported for masking!\n");
+		  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+		}
+	      return;
+	    }
+	  loads.safe_push (stmt);
+	}
+      else if (gimple_store_p (stmt))
+	{
+	  gcc_assert (gimple_assign_single_p (stmt));
+	  lhs = gimple_assign_lhs (stmt);
+	  if (may_be_nonaddressable_p (lhs))
+	    return;
+	  vectype = TREE_TYPE (lhs);
+	  if (!VECTOR_TYPE_P (vectype))
+	    continue;
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	      elem_size = n;
+	  else if (n != elem_size)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  if (!mem_ref_is_vec_size_incremented (loop_vinfo, lhs))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Store with decrement is not masked.\n");
+	      return;
+	    }
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype), false))
+	    {
+	      if (dump_enabled_p ())
+		{
+		  dump_printf_loc (MSG_NOTE, vect_location,
+				   "type is not supported for masking!\n");
+		  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+		}
+	      return;
+	    }
+	  stores.safe_push (stmt);
+	}
+      else if (is_gimple_call (stmt)
+	       && gimple_call_internal_p (stmt)
+	       && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+		   || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
+	/* Need to figure out what is vectype for new mask.  */
+	masked_ld_st.safe_push (stmt);
+      else if (is_gimple_call (stmt))
+	return;
+    }
+
+  /* Check that all vectorizable reductions can be converted to VCOND.  */
+  if (!LOOP_VINFO_REDUCTIONS (loop_vinfo).is_empty ())
+    {
+      unsigned i;
+      has_reductions = true;
+      for (i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+	{
+	  machine_mode mode;
+
+	  stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+	  stmt_info = vinfo_for_stmt (stmt);
+	  gcc_assert (stmt_info);
+	  if (PURE_SLP_STMT (stmt_info))
+	    return;
+	  gcc_assert (STMT_VINFO_VEC_STMT (stmt_info));
+	  stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	  if (gimple_code (stmt) != GIMPLE_ASSIGN)
+	    return;
+	  /* Only reduction with binary operation is supported.  */
+	  if (get_gimple_rhs_class (gimple_assign_rhs_code (stmt))
+	      != GIMPLE_BINARY_RHS)
+	    return;
+	  lhs = gimple_assign_lhs (stmt);
+	  vectype = TREE_TYPE (lhs);
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    /* Not yet supported!  */
+	    return;
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    /* Not yet supported!  */
+	    return;
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  mode = TYPE_MODE (vectype);
+	  if (get_vcond_icode (mode, mode, TYPE_UNSIGNED (vectype))
+	      == CODE_FOR_nothing)
+	    return;
+	}
+    }
+  /* Check masked loads/stores, if any.  */
+  if (!masked_ld_st.is_empty ())
+    {
+      unsigned i;
+      for (i = 0; i < masked_ld_st.length (); i++)
+	{
+	  tree mask;
+	  tree vectype;
+	  optab tab;
+	  stmt = masked_ld_st[i];
+	  mask = gimple_call_arg (stmt, 2);
+	  vectype = TREE_TYPE (mask);
+	  n = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    /* Mask conversion is not supported yet!  */
+	    return;
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  /* Check that BIT_AND is supported on target.  */
+	  tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+	  if (!tab)
+	    return;
+	  if (optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	    return;
+	}
+    }
+
+  /* Generate induction vector which will be used to evaluate mask.  */
+  vec_index = gen_vec_induction (loop_vinfo, elem_size, size);
+  if (!vec_index)
+    return;
+
+  /* Generate the mask vector used to mask the saved statements.  */
+  vec_mask = gen_mask_for_remainder (loop_vinfo, vec_index, size);
+  gcc_assert (vec_mask);
+
+  /* Convert vectorized loads to masked ones.  */
+  if (!loads.is_empty ())
+    convert_loads_to_masked (&loads, vec_mask);
+
+  /* Convert vectorized stores to masked ones.  */
+  if (!stores.is_empty ())
+    convert_stores_to_masked (&stores, vec_mask);
+
+  if (has_reductions)
+    convert_reductions (loop_vinfo, vec_mask);
+
+  if (!masked_ld_st.is_empty ())
+    fix_mask_for_masked_ld_st (&masked_ld_st, vec_mask);
+
+  /* Fix loop trip count.  */
+  fix_vec_loop_trip_count (loop_vinfo);
+
+  /* Fix up cfg to make scalar loop remainder unreachable.  */
+  isolate_remainder (loop_vinfo);
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== scalar remainder has been deleted ===\n");
+}
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3e6fd35..f7366c1 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -559,6 +559,18 @@ vectorize_loops (void)
 	  }
       }
 
+  /* Try to combine vectorized loop and scalar remainder.  */
+  for (i = 1; i < vect_loops_num; i++)
+    {
+      loop_vec_info loop_vinfo;
+      loop = get_loop (cfun, i);
+      if (!loop || loop->inner)
+	continue;
+      loop_vinfo = (loop_vec_info) loop->aux;
+      if (loop_vinfo && LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
+	combine_vect_loop_remainder (loop_vinfo);
+    }
+
   for (i = 1; i < vect_loops_num; i++)
     {
       loop_vec_info loop_vinfo;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index bf01ded..e8865bc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -230,6 +230,8 @@ typedef struct _loop_vec_info : public vec_info {
   tree num_iters;
   /* Number of iterations of the original loop.  */
   tree num_iters_unchanged;
+  /* Number of iterations of the vectorized loop.  */
+  tree num_iters_vect_loop;
 
   /* Threshold of number of iterations below which vectorzation will not be
      performed. It is calculated from MIN_PROFITABLE_ITERS and
@@ -335,6 +337,7 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_BBS(L)                  (L)->bbs
 #define LOOP_VINFO_NITERSM1(L)             (L)->num_itersm1
 #define LOOP_VINFO_NITERS(L)               (L)->num_iters
+#define LOOP_VINFO_NITERS_VECT_LOOP(L)    (L)->num_iters_vect_loop
 /* Since LOOP_VINFO_NITERS and LOOP_VINFO_NITERSM1 can change after
    prologue peeling retain total unchanged scalar loop iterations for
    cost model.  */
@@ -994,6 +997,7 @@ extern void vect_get_vec_defs (tree, tree, gimple *, vec<tree> *,
 			       vec<tree> *, slp_tree, int);
 extern tree vect_gen_perm_mask_any (tree, const unsigned char *);
 extern tree vect_gen_perm_mask_checked (tree, const unsigned char *);
+extern void combine_vect_loop_remainder (loop_vec_info);
 
 /* In tree-vect-data-refs.c.  */
 extern bool vect_can_force_dr_alignment_p (const_tree, unsigned int);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-10-28 10:57 [RFC] Combine vectorized loops with its scalar remainder Yuri Rumyantsev
@ 2015-11-03 10:08 ` Richard Henderson
  2015-11-03 10:35   ` Yuri Rumyantsev
  2015-11-03 11:47 ` Richard Biener
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2015-11-03 10:08 UTC (permalink / raw)
  To: Yuri Rumyantsev, gcc-patches, Jeff Law, Richard Biener,
	Igor Zamyatin,
	Илья
	Энкович

On 10/28/2015 11:45 AM, Yuri Rumyantsev wrote:
> Hi All,
>
> Here is a preliminary patch to combine vectorized loop with its scalar
> remainder, draft of which was proposed by Kirill Yukhin month ago:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
> It was tested wwith '-mavx2' option to run on Haswell processor.
> The main goal of it is to improve performance of vectorized loops for AVX512.

Ought this really be enabled for avx2?  While it's nice for testing to be
able to use normal vcond patterns with current hardware, I have trouble
imagining that it's an improvement without the real masked operations.

I tried to have a look myself at what kind of output we'd be getting, but the 
very first test I tried produced an ICE:

void foo(float *a, float *b, int n)
{
   int i;
   for (i = 0; i < n; ++i)
     a[i] += b[i];
}

$ ./cc1 -O3 -mavx2 z.c
  foo
Analyzing compilation unit
Performing interprocedural optimizations
  <*free_lang_data> <visibility> <build_ssa_passes> <opt_local_passes> 
<free-inline-summary> <whole-program> <profile_estimate> <icf> <devirt> <cp> 
<targetclone> <inline> <pure-const> <static-var> <single-use> 
<comdats>Assembling functions:
  <dispachercalls> foo
z.c: In function ‘foo’:
z.c:1:6: error: bogus comparison result type
  void foo(float *a, float *b, int n)
       ^
vector(8) signed int
vect_vec_mask_.24_116 = vect_vec_iv_.22_112 < vect_cst_.23_115;
z.c:1:6: internal compiler error: verify_gimple failed
0xb20d17 verify_gimple_in_cfg(function*, bool)
	../../git-master/gcc/tree-cfg.c:5082
0xa16d77 execute_function_todo
	../../git-master/gcc/passes.c:1940
0xa1769b execute_todo
	../../git-master/gcc/passes.c:1995
Please submit a full bug report,


r~

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-03 10:08 ` Richard Henderson
@ 2015-11-03 10:35   ` Yuri Rumyantsev
  0 siblings, 0 replies; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-11-03 10:35 UTC (permalink / raw)
  To: Richard Henderson
  Cc: gcc-patches, Jeff Law, Richard Biener, Igor Zamyatin,
	Илья
	Энкович

This is an expected failure since this patch is not in sync with the
latest patches related to masking support for AVX512.
I am waiting for the masked load/store support, which is not yet
integrated into trunk.  To get a workable version of the compiler, use a
revision before r229128.

2015-11-03 13:08 GMT+03:00 Richard Henderson <rth@redhat.com>:
> On 10/28/2015 11:45 AM, Yuri Rumyantsev wrote:
>>
>> Hi All,
>>
>> Here is a preliminary patch to combine vectorized loop with its scalar
>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>> It was tested wwith '-mavx2' option to run on Haswell processor.
>> The main goal of it is to improve performance of vectorized loops for
>> AVX512.
>
>
> Ought this really be enabled for avx2?  While it's nice for testing to be
> able to use normal vcond patterns to be able to test with current hardware,
> I have trouble imagining that it's an improvement without the real masked
> operations.
>
> I tried to have a look myself at what kind of output we'd be getting, but
> the very first test I tried produced an ICE:
>
> void foo(float *a, float *b, int n)
> {
>   int i;
>   for (i = 0; i < n; ++i)
>     a[i] += b[i];
> }
>
> $ ./cc1 -O3 -mavx2 z.c
>  foo
> Analyzing compilation unit
> Performing interprocedural optimizations
>  <*free_lang_data> <visibility> <build_ssa_passes> <opt_local_passes>
> <free-inline-summary> <whole-program> <profile_estimate> <icf> <devirt> <cp>
> <targetclone> <inline> <pure-const> <static-var> <single-use>
> <comdats>Assembling functions:
>  <dispachercalls> foo
> z.c: In function ‘foo’:
> z.c:1:6: error: bogus comparison result type
>  void foo(float *a, float *b, int n)
>       ^
> vector(8) signed int
> vect_vec_mask_.24_116 = vect_vec_iv_.22_112 < vect_cst_.23_115;
> z.c:1:6: internal compiler error: verify_gimple failed
> 0xb20d17 verify_gimple_in_cfg(function*, bool)
>         ../../git-master/gcc/tree-cfg.c:5082
> 0xa16d77 execute_function_todo
>         ../../git-master/gcc/passes.c:1940
> 0xa1769b execute_todo
>         ../../git-master/gcc/passes.c:1995
> Please submit a full bug report,
>
>
> r~

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-10-28 10:57 [RFC] Combine vectorized loops with its scalar remainder Yuri Rumyantsev
  2015-11-03 10:08 ` Richard Henderson
@ 2015-11-03 11:47 ` Richard Biener
  2015-11-03 12:08   ` Yuri Rumyantsev
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2015-11-03 11:47 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: gcc-patches, Jeff Law, Igor Zamyatin,
	Илья
	Энкович

On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> Here is a preliminary patch to combine vectorized loop with its scalar
> remainder, draft of which was proposed by Kirill Yukhin month ago:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
> It was tested wwith '-mavx2' option to run on Haswell processor.
> The main goal of it is to improve performance of vectorized loops for AVX512.
> Note that only loads/stores and simple reductions with binary operations are
> converted to masked form, e.g. load --> masked load and reduction like
> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
> creation of a new vector induction variable initialized with consequent values
> from 0.. VF-1, new const vector upper bound which contains number of iterations
> and the result of comparison which is considered as mask vector.
> This implementation has several restrictions:
>
> 1. Multiple types are not supported.
> 2. SLP is not supported.
> 3. Gather/Scatter's are also not supported.
> 4. Vectorization of the loops with low trip count is not implemented yet since
>    it requires additional design and tuning.
>
> We are planning to eleminate all these restrictions in GCCv7.
>
> This patch will be extended to include cost model to reject unprofutable
> transformations, e.g. new vector body cost will be evaluated through new
> target hook which estimates cast of masking different vector statements. New
> threshold parameter will be introduced which determines permissible cost
> increasing which will be tuned on an AVX512 machine.
> This patch is not in sync with changes of Ilya Enkovich for AVX512 masked
> load/store support since only part of them is in trunk compiler.
>
> Any comments will be appreciated.

As stated in the previous discussion I don't think the extra mask IV is a
good idea; instead we should have a masked final iteration for the epilogue
(yes, that's not really "combined" then).  This is because in the end we'd
not only want AVX512 to benefit from this work but also other ISAs that can
do unaligned or masked operations (we can overlap the epilogue work with the
vectorized work or use masked loads/stores available with AVX).  Note that
the same applies to the alignment prologue if present; I can't see how you
can handle that with the in-loop approach.
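
Roughly (purely illustrative, with an assumed VF and scalar loops standing
in for the vector code), such a masked epilogue for the earlier foo example
would compute:

  void
  foo (float *a, float *b, int n)
  {
    enum { VF = 8 };                 /* assumed vectorization factor */
    int i = 0;
    for (; i + VF <= n; i += VF)     /* main vector loop, unmasked */
      for (int lane = 0; lane < VF; lane++)
        a[i + lane] += b[i + lane];
    if (i < n)                       /* one masked vector iteration */
      for (int lane = 0; lane < VF; lane++)
        if (i + lane < n)            /* lane mask: iv < niters */
          a[i + lane] += b[i + lane];
  }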

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-03 11:47 ` Richard Biener
@ 2015-11-03 12:08   ` Yuri Rumyantsev
  2015-11-10 12:30     ` Richard Biener
  0 siblings, 1 reply; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-11-03 12:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, Jeff Law, Igor Zamyatin,
	Илья
	Энкович

Richard,

It looks like a misunderstanding - we assumed that for GCCv6 the simple
scheme for the remainder would be used, through introducing a new IV:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html

Is that true, or did we miss something?
Now we are testing vectorization of loops with a small non-constant trip count.
Yuri.

2015-11-03 14:47 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi All,
>>
>> Here is a preliminary patch to combine vectorized loop with its scalar
>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>> It was tested wwith '-mavx2' option to run on Haswell processor.
>> The main goal of it is to improve performance of vectorized loops for AVX512.
>> Note that only loads/stores and simple reductions with binary operations are
>> converted to masked form, e.g. load --> masked load and reduction like
>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
>> creation of a new vector induction variable initialized with consequent values
>> from 0.. VF-1, new const vector upper bound which contains number of iterations
>> and the result of comparison which is considered as mask vector.
>> This implementation has several restrictions:
>>
>> 1. Multiple types are not supported.
>> 2. SLP is not supported.
>> 3. Gather/Scatter's are also not supported.
>> 4. Vectorization of the loops with low trip count is not implemented yet since
>>    it requires additional design and tuning.
>>
>> We are planning to eleminate all these restrictions in GCCv7.
>>
>> This patch will be extended to include cost model to reject unprofutable
>> transformations, e.g. new vector body cost will be evaluated through new
>> target hook which estimates cast of masking different vector statements. New
>> threshold parameter will be introduced which determines permissible cost
>> increasing which will be tuned on an AVX512 machine.
>> This patch is not in sync with changes of Ilya Enkovich for AVX512 masked
>> load/store support since only part of them is in trunk compiler.
>>
>> Any comments will be appreciated.
>
> As stated in the previous discussion I don't think the extra mask IV
> is a good idea
> and we instead should have a masked final iteration for the epilogue
> (yes, that's
> not really "combined" then).  This is because in the end we'd not only
> want AVX512
> to benefit from this work but also other ISAs that can do unaligned or masked
> operations (we can overlap the epilogue work with the vectorized work or use
> masked loads/stores available with AVX).  Note that the same applies to
> the alignment prologue if present, I can't see how you can handle that with the
> in-loop approach.
>
> Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-03 12:08   ` Yuri Rumyantsev
@ 2015-11-10 12:30     ` Richard Biener
  2015-11-10 13:02       ` Ilya Enkovich
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2015-11-10 12:30 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: gcc-patches, Jeff Law, Igor Zamyatin,
	Илья
	Энкович

On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Richard,
>
> It looks like misunderstanding - we assume that for GCCv6 the simple
> scheme of remainder will be used through introducing new IV :
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>
> Is it true or we missed something?

<quote>
> > Do you have an idea how "masking" is better be organized to be usable
> > for both 4b and 4c?
>
> Do 2a ...
Okay.
</quote>

Richard.

> Now we are testing vectorization of loops with small non-constant trip count.
> Yuri.
>
> 2015-11-03 14:47 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi All,
>>>
>>> Here is a preliminary patch to combine vectorized loop with its scalar
>>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>> It was tested wwith '-mavx2' option to run on Haswell processor.
>>> The main goal of it is to improve performance of vectorized loops for AVX512.
>>> Note that only loads/stores and simple reductions with binary operations are
>>> converted to masked form, e.g. load --> masked load and reduction like
>>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
>>> creation of a new vector induction variable initialized with consequent values
>>> from 0.. VF-1, new const vector upper bound which contains number of iterations
>>> and the result of comparison which is considered as mask vector.
>>> This implementation has several restrictions:
>>>
>>> 1. Multiple types are not supported.
>>> 2. SLP is not supported.
>>> 3. Gather/Scatter's are also not supported.
>>> 4. Vectorization of the loops with low trip count is not implemented yet since
>>>    it requires additional design and tuning.
>>>
>>> We are planning to eleminate all these restrictions in GCCv7.
>>>
>>> This patch will be extended to include cost model to reject unprofutable
>>> transformations, e.g. new vector body cost will be evaluated through new
>>> target hook which estimates cast of masking different vector statements. New
>>> threshold parameter will be introduced which determines permissible cost
>>> increasing which will be tuned on an AVX512 machine.
>>> This patch is not in sync with changes of Ilya Enkovich for AVX512 masked
>>> load/store support since only part of them is in trunk compiler.
>>>
>>> Any comments will be appreciated.
>>
>> As stated in the previous discussion I don't think the extra mask IV
>> is a good idea
>> and we instead should have a masked final iteration for the epilogue
>> (yes, that's
>> not really "combined" then).  This is because in the end we'd not only
>> want AVX512
>> to benefit from this work but also other ISAs that can do unaligned or masked
>> operations (we can overlap the epilogue work with the vectorized work or use
>> masked loads/stores available with AVX).  Note that the same applies to
>> the alignment prologue if present, I can't see how you can handle that with the
>> in-loop approach.
>>
>> Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-10 12:30     ` Richard Biener
@ 2015-11-10 13:02       ` Ilya Enkovich
  2015-11-10 14:52         ` Richard Biener
  0 siblings, 1 reply; 17+ messages in thread
From: Ilya Enkovich @ 2015-11-10 13:02 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yuri Rumyantsev, gcc-patches, Jeff Law, Igor Zamyatin

2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Richard,
>>
>> It looks like misunderstanding - we assume that for GCCv6 the simple
>> scheme of remainder will be used through introducing new IV :
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>
>> Is it true or we missed something?
>
> <quote>
>> > Do you have an idea how "masking" is better be organized to be usable
>> > for both 4b and 4c?
>>
>> Do 2a ...
> Okay.
> </quote>

2a was 'transform the already vectorized loop as a separate
post-processing'.  Isn't that what this prototype patch implements?
The current version only masks the loop body, which in practice is
applicable to AVX-512 only in most cases.  With AVX-512 it's easier to see
how profitable masking might be, and it is the main target for the first
masking version.  Extending it to prologues/epilogues, and thus making
it more profitable for other targets, is the next step and is out of
the scope of this patch.

Thanks,
Ilya

>
> Richard.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-10 13:02       ` Ilya Enkovich
@ 2015-11-10 14:52         ` Richard Biener
  2015-11-13 10:36           ` Yuri Rumyantsev
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2015-11-10 14:52 UTC (permalink / raw)
  To: Ilya Enkovich; +Cc: Yuri Rumyantsev, gcc-patches, Jeff Law, Igor Zamyatin

On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Richard,
>>>
>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>> scheme of remainder will be used through introducing new IV :
>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>
>>> Is it true or we missed something?
>>
>> <quote>
>>> > Do you have an idea how "masking" is better be organized to be usable
>>> > for both 4b and 4c?
>>>
>>> Do 2a ...
>> Okay.
>> </quote>
>
> 2a was 'transform already vectorized loop as a separate
> post-processing'. Isn't it what this prototype patch implements?
> Current version only masks loop body which is in practice applicable
> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
> how profitable masking might be and it is a main target for the first
> masking version.  Extending it to prologues/epilogues and thus making
> it more profitable for other targets is the next step and is out of
> the scope of this patch.

Ok, technically the prototype transforms the already vectorized loop.
Of course I meant that the vectorized loop be copied, masked and the
result used as the epilogue...

I'll queue a more detailed look into the patch for this week.

Did you perform any measurements with this patch, like the number of
masked epilogues in SPEC 2006 FP (and any speedup)?

Thanks,
Richard.

> Thanks,
> Ilya
>
>>
>> Richard.
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-10 14:52         ` Richard Biener
@ 2015-11-13 10:36           ` Yuri Rumyantsev
  2015-11-23 15:54             ` Yuri Rumyantsev
  2015-11-27 13:49             ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-11-13 10:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]

Hi Richard,

Here is an updated version of the patch which (1) is in sync with the trunk
compiler and (2) contains a simple cost model to estimate the profitability
of scalar epilogue elimination.  The part related to vectorization of
loops with a small trip count is still being developed.  Note that the
implemented cost model has not been tuned well for HASWELL and KNL, but we
got a ~6% speed-up on 436.cactusADM from the spec2006 suite on HASWELL.

2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Richard,
>>>>
>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>> scheme of remainder will be used through introducing new IV :
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>
>>>> Is it true or we missed something?
>>>
>>> <quote>
>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>> > for both 4b and 4c?
>>>>
>>>> Do 2a ...
>>> Okay.
>>> </quote>
>>
>> 2a was 'transform already vectorized loop as a separate
>> post-processing'. Isn't it what this prototype patch implements?
>> Current version only masks loop body which is in practice applicable
>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>> how profitable masking might be and it is a main target for the first
>> masking version.  Extending it to prologues/epilogues and thus making
>> it more profitable for other targets is the next step and is out of
>> the scope of this patch.
>
> Ok, technically the prototype transforms the already vectorized loop.
> Of course I meant the vectorized loop be copied, masked and that
> result used as epilogue...
>
> I'll queue a more detailed look into the patch for this week.
>
> Did you perform any measurements with this patch like # of
> masked epilogues in SPEC 2006 FP (and any speedup?)
>
> Thanks,
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> Richard.
>>>

[-- Attachment #2: remainder.patch.2 --]
[-- Type: application/octet-stream, Size: 34445 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9205d49..4951b0a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -48577,6 +48577,25 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
     }
 }
 
+/* Implement targetm.vectorize.builtin_masking_cost.  */
+
+static int
+ix86_builtin_masking_cost (enum vect_cost_for_masking k, tree vectype)
+{
+  if (GET_MODE_CLASS (TYPE_MODE (vectype)) != MODE_VECTOR_INT)
+    return 0;
+
+  switch (k)
+    {
+      case masking_load:
+	return 0;
+      case masking_store:
+	return (ix86_tune == PROCESSOR_HASWELL) ? 10 : 0;
+      default:
+	return ix86_builtin_vectorization_cost (vector_stmt, NULL_TREE, 0);
+    }
+}
+
 /* A cached (set (nil) (vselect (vconcat (nil) (nil)) (parallel [])))
    insn, so that expand_vselect{,_vconcat} doesn't have to create a fresh
    insn every time.  */
@@ -54300,6 +54319,9 @@ ix86_addr_space_zero_address_valid (addr_space_t as)
 #undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST \
   ix86_builtin_vectorization_cost
+#undef TARGET_VECTORIZE_BUILTIN_MASKING_COST
+#define TARGET_VECTORIZE_BUILTIN_MASKING_COST \
+  ix86_builtin_masking_cost
 #undef TARGET_VECTORIZE_VEC_PERM_CONST_OK
 #define TARGET_VECTORIZE_VEC_PERM_CONST_OK \
   ix86_vectorize_vec_perm_const_ok
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 96ca063a..ba6a841 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4222,6 +4222,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
 
+@hook TARGET_VECTORIZE_BUILTIN_MASKING_COST
+
 @hook TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST_OK
diff --git a/gcc/params.def b/gcc/params.def
index c5d96e7..849373b 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1177,6 +1177,11 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
 	  "Maximum recursion depth allowed when querying a property of an"
 	  " SSA name.",
 	  2, 1, 0)
+
+DEFPARAM (PARAM_VECT_COST_INCREASE_THRESHOLD,
+	  "vect-cost-increase-threshold",
+	  "Threshold for cost increase in scalar epilogue vectorization.",
+	  10, 0, 100)
 /*
 
 Local variables:
diff --git a/gcc/target.def b/gcc/target.def
index 61cb14b..aa95f52 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1768,6 +1768,14 @@ misalignment value (@var{misalign}).",
  int, (enum vect_cost_for_stmt type_of_cost, tree vectype, int misalign),
  default_builtin_vectorization_cost)
 
+/* Cost of masking different statements in a vectorized loop for
+   scalar epilogue vectorization.  */
+DEFHOOK
+(builtin_masking_cost,
+ "Returns cost of convertion vector statement to masked form.",
+  int, (enum vect_cost_for_masking kind, tree vectype),
+  default_builtin_masking_cost)
+
 /* Return true if vector alignment is reachable (by peeling N
    iterations) for the given type.  */
 DEFHOOK
diff --git a/gcc/target.h b/gcc/target.h
index ffc4d6a..3f7d3c6 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -173,6 +173,15 @@ enum vect_cost_for_stmt
   vec_construct
 };
 
+/* Types of costs for masking statements in vectorized loops.  */
+enum vect_cost_for_masking
+{
+  masking_load,
+  masking_store,
+  masking_masked_stmt,
+  masking_reduction
+};
+
 /* Separate locations for which the vectorizer cost model should
    track costs.  */
 enum vect_cost_model_location {
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c34b4e9..ca3a044 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -591,6 +591,26 @@ default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
     }
 }
 
+/* Default masking cost model values.  */
+
+int
+default_builtin_masking_cost (enum vect_cost_for_masking kind,
+			      tree vectype ATTRIBUTE_UNUSED)
+{
+  switch (kind)
+    {
+      case masking_load:
+      case masking_store:
+	return 1;
+
+      case masking_reduction:
+      case masking_masked_stmt:
+	return default_builtin_vectorization_cost (vector_stmt, NULL_TREE, 0);
+      default:
+	gcc_unreachable ();
+    }
+}
+
 /* Reciprocal.  */
 
 tree
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 55e5309..500add9 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1064,6 +1064,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_NITERSM1 (res) = NULL;
   LOOP_VINFO_NITERS (res) = NULL;
   LOOP_VINFO_NITERS_UNCHANGED (res) = NULL;
+  LOOP_VINFO_NITERS_VECT_LOOP (res) = NULL;
   LOOP_VINFO_COST_MODEL_THRESHOLD (res) = 0;
   LOOP_VINFO_VECTORIZABLE_P (res) = 0;
   LOOP_VINFO_PEELING_FOR_ALIGNMENT (res) = 0;
@@ -3210,6 +3211,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	       &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
+
+  /* Save the cost of a single vector iteration for possible vectorization
+     of a scalar epilogue.  */
+  LOOP_VINFO_SINGLE_VECTOR_ITERATION_COST (loop_vinfo) = vec_inside_cost;
   
   if (dump_enabled_p ())
     {
@@ -6405,9 +6410,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
+      LOOP_VINFO_NITERS_VECT_LOOP (loop_vinfo) = ratio_mult_vf;
       vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, ratio_mult_vf,
 				      th, check_profitability);
     }
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cfe30e0..39ea438b 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -47,6 +47,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
 #include "builtins.h"
+#include "tree-ssa-address.h"
+#include "tree-ssa-loop-ivopts.h"
+#include "params.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -8938,3 +8941,742 @@ supportable_narrowing_operation (enum tree_code code,
   interm_types->release ();
   return false;
 }
+
+/* Fix the trip count of the vectorized loop to also cover the remainder.  */
+
+static void
+fix_vec_loop_trip_count (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree niters;
+  tree ratio_mult_vf = LOOP_VINFO_NITERS_VECT_LOOP (loop_vinfo);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  gimple *stmt;
+  gimple_stmt_iterator gsi;
+
+  niters = (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)) ?
+	    LOOP_VINFO_NITERS (loop_vinfo)
+	    : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
+
+  if (TREE_CODE (ratio_mult_vf) == SSA_NAME)
+    {
+      gimple *def = SSA_NAME_DEF_STMT (ratio_mult_vf);
+      tree bnd, lhs, tmp, log_vf;
+      gimple *def_bnd;
+      gimple *new_def_bnd;
+      gcc_assert (gimple_code (def) == GIMPLE_ASSIGN);
+      gcc_assert (gimple_assign_rhs_code (def) == LSHIFT_EXPR);
+      bnd = gimple_assign_rhs1 (def);
+      gcc_assert (TREE_CODE (bnd) == SSA_NAME);
+      gcc_assert (TREE_CODE (gimple_assign_rhs2 (def)) == INTEGER_CST);
+      def_bnd = SSA_NAME_DEF_STMT (bnd);
+      gsi = gsi_for_stmt (def_bnd);
+      /* Create the statement t = niters + (vf - 1).  */
+      lhs = create_tmp_var (TREE_TYPE (bnd));
+      stmt = gimple_build_assign (lhs, PLUS_EXPR, niters,
+				  build_int_cst (TREE_TYPE (bnd), vf - 1));
+      tmp = make_ssa_name (lhs, stmt);
+      gimple_assign_set_lhs (stmt, tmp);
+      gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+      /* Replace BND definition with bnd = t >> log2 (vf).  */
+      log_vf = build_int_cst (TREE_TYPE (tmp), exact_log2 (vf));
+      new_def_bnd = gimple_build_assign (bnd, RSHIFT_EXPR, tmp, log_vf);
+      gsi_replace (&gsi, new_def_bnd, false);
+    }
+  else
+    {
+      tree op_const;
+      unsigned n;
+      unsigned logvf = exact_log2 (vf);
+      gcond *cond;
+      gcc_assert (TREE_CODE (ratio_mult_vf) == INTEGER_CST);
+      gcc_assert (TREE_CODE (niters) == INTEGER_CST);
+      /* Change value of bnd in GIMPLE_COND.  */
+      gcc_assert (loop->num_nodes == 2);
+      stmt = last_stmt (loop->header);
+      gcc_assert (gimple_code (stmt) == GIMPLE_COND);
+      n = tree_to_uhwi (niters);
+      n = ((n + (vf - 1)) >> logvf) << logvf;
+      op_const = build_int_cst (TREE_TYPE (gimple_cond_lhs (stmt)), n);
+      gcc_assert (TREE_CODE (gimple_cond_rhs (stmt)) == INTEGER_CST);
+      cond = dyn_cast <gcond *> (stmt);
+      gimple_cond_set_rhs (cond, op_const);
+    }
+}
+
+/* Make the scalar remainder unreachable from the vectorized loop.  */
+
+static void
+isolate_remainder (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge e;
+  basic_block bb = loop->header;
+  gimple *last;
+  gcond *cond;
+
+  e = EDGE_SUCC ((bb), 0);
+  if (flow_bb_inside_loop_p (loop, e->dest))
+    e = EDGE_SUCC ((bb), 1);
+  bb = e->dest;
+  gcc_assert (!flow_bb_inside_loop_p (loop, bb));
+  last = last_stmt (bb);
+  gcc_assert (gimple_code (last) == GIMPLE_COND);
+  cond = as_a <gcond *> (last);
+  /* Assume that target of false edge is scalar loop preheader.  */
+  gimple_cond_make_true (cond);
+}
+
+/* Generate the induction vector which will be used for mask evaluation.  */
+
+static tree
+gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  vec<constructor_elt, va_gc> *v;
+  gimple *stmt;
+  gimple_stmt_iterator gsi;
+  gphi *induction_phi;
+  tree iv_type, vectype, cmp_vectype;
+  tree lhs, rhs, iv;
+  unsigned n;
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  int i;
+  tree new_vec, new_var;
+  tree vec_init, vec_step, vec_dest, vec_def;
+  tree val;
+  tree induc_def;
+  basic_block new_bb;
+
+  /* Find control iv.  */
+  stmt = last_stmt (loop->header);
+  gcc_assert (gimple_code (stmt) == GIMPLE_COND);
+  lhs = gimple_cond_lhs (stmt);
+  rhs = gimple_cond_rhs (stmt);
+  /* Assume any operand order.  */
+  if (TREE_CODE (lhs) != SSA_NAME)
+    iv = rhs;
+  else
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (lhs);
+      if (gimple_bb (def_stmt) != loop->header)
+	iv = rhs;
+      else
+	iv = lhs;
+    }
+  gcc_assert (TREE_CODE (iv) == SSA_NAME);
+  /* Determine type to build vector index aka induction vector.  */
+  n = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (iv)));
+  if (n > elem_size)
+    /* Multiple types are not yet supported.  */
+    return NULL_TREE;
+  if (n == elem_size && !TYPE_UNSIGNED (TREE_TYPE (iv)))
+    iv_type = TREE_TYPE (iv);
+  else
+    iv_type = build_nonstandard_integer_type (elem_size, 0);
+  vectype = get_vectype_for_scalar_type_and_size (iv_type, size);
+  /* Check that vector comparison for IV_TYPE is supported.  */
+  cmp_vectype = build_same_sized_truth_vector_type (vectype);
+  if (!expand_vec_cmp_expr_p (vectype, cmp_vectype))
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "type is not supported for vector compare!\n");
+	  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+	}
+      return NULL_TREE;
+    }
+
+  /* Build induction initialization and insert it to loop preheader.  */
+  vec_alloc (v, vf);
+  for (i = 0; i < vf; i++)
+    {
+      tree elem;
+      elem = build_int_cst (iv_type, i);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, elem);
+    }
+  new_vec = build_vector_from_ctor (vectype, v);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_init = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_init);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+
+  /* Create the vector step whose elements are all VF.  */
+  val = build_int_cst (iv_type, vf);
+  new_vec = build_vector_from_val (vectype, val);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_step = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_step);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+
+  /* Create the induction-phi.  */
+  vec_dest = vect_get_new_vect_var (vectype, vect_simple_var, "vec_iv_");
+  induction_phi = create_phi_node (vec_dest, loop->header);
+  induc_def = PHI_RESULT (induction_phi);
+
+  /* Create vector iv increment inside loop.  */
+  gsi = gsi_after_labels (loop->header);
+  stmt = gimple_build_assign (vec_dest, PLUS_EXPR, induc_def, vec_step);
+  vec_def = make_ssa_name (vec_dest, stmt);
+  gimple_assign_set_lhs (stmt, vec_def);
+  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+
+  /* Set the arguments of phi node.  */
+  add_phi_arg (induction_phi, vec_init, pe, UNKNOWN_LOCATION);
+  add_phi_arg (induction_phi, vec_def, loop_latch_edge (loop),
+	       UNKNOWN_LOCATION);
+  return induc_def;
+}
+
+/* Produce the mask for masking: the result of comparing the induction
+   vector VEC_INDEX with a vector filled with the iteration count.  */
+
+static tree
+gen_mask_for_remainder (loop_vec_info loop_vinfo, tree vec_index)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree new_vec, new_var;
+  tree niters, vec_niters, new_niters, vec_res, vec_mask;
+  gimple *stmt;
+  basic_block new_bb;
+  edge pe = loop_preheader_edge (loop);
+  gimple_stmt_iterator gsi;
+  tree vectype = TREE_TYPE (vec_index);
+  tree s_vectype;
+
+  gsi = gsi_after_labels (loop->header);
+  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	   ? LOOP_VINFO_NITERS (loop_vinfo)
+	   : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
+
+  /* Create the comparison vector whose elements are all NITERS.  */
+  if (!types_compatible_p (TREE_TYPE (niters), TREE_TYPE (vectype)))
+    {
+      tree new_type = TREE_TYPE (vectype);
+      enum tree_code cop;
+      cop = tree_to_uhwi (TYPE_SIZE (new_type)) ==
+	    tree_to_uhwi (TYPE_SIZE (TREE_TYPE (niters)))
+	    ? NOP_EXPR : CONVERT_EXPR;
+      new_niters = make_ssa_name (new_type);
+      stmt = gimple_build_assign (new_niters, cop, niters);
+      new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!new_bb);
+    }
+  else
+    new_niters = niters;
+  new_vec = build_vector_from_val (vectype, new_niters);
+  new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_niters = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_niters);
+  new_bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!new_bb);
+  /* Create the vector comparison whose result will be used as the mask
+     for loads/stores.  */
+  s_vectype = build_same_sized_truth_vector_type (vectype);
+  vec_mask = vect_get_new_vect_var (s_vectype, vect_simple_var, "vec_mask_");
+  stmt = gimple_build_assign (vec_mask, LT_EXPR, vec_index, vec_niters);
+  vec_res = make_ssa_name (vec_mask, stmt);
+  gimple_assign_set_lhs (stmt, vec_res);
+  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+  return vec_res;
+}
+
+/* Convert each load to a masked load.  */
+
+static void
+convert_loads_to_masked (vec<gimple *> *loads, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree addr, ref;
+  gimple_stmt_iterator gsi;
+
+  while (loads->length () > 0)
+    {
+      tree lhs, ptr;
+      stmt = loads->pop ();
+      gsi = gsi_for_stmt (stmt);
+      lhs = gimple_assign_lhs (stmt);
+      ref = gimple_assign_rhs1 (stmt);
+      addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+				       true, NULL_TREE, true,
+				       GSI_SAME_STMT);
+      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+      if (!SSA_NAME_PTR_INFO (addr))
+	copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3,
+					     addr, ptr, mask);
+      gimple_call_set_lhs (new_stmt, lhs);
+      gsi_replace (&gsi, new_stmt, false);
+    }
+}
+
+/* Convert each store to a masked store.  */
+
+static void
+convert_stores_to_masked (vec<gimple *> *stores, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree addr, ref;
+  gimple_stmt_iterator gsi;
+
+  while (stores->length () > 0)
+    {
+      tree rhs, ptr;
+      stmt = stores->pop ();
+      gsi = gsi_for_stmt (stmt);
+      ref = gimple_assign_lhs (stmt);
+      rhs = gimple_assign_rhs1 (stmt);
+      addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+				       true, NULL_TREE, true,
+				       GSI_SAME_STMT);
+      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+      if (!SSA_NAME_PTR_INFO (addr))
+	copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
+      new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					      mask, rhs);
+      gsi_replace (&gsi, new_stmt, false);
+    }
+}
+
+/* Combine the original mask argument of each masked load/store in
+   MASKED_STMT with MASK using BIT_AND so that lanes beyond the original
+   trip count remain inactive.  */
+
+static void
+fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree old, lhs, vectype, var, n_lhs;
+  gimple_stmt_iterator gsi;
+
+  while (masked_stmt->length () > 0)
+    {
+      stmt = masked_stmt->pop ();
+      gsi = gsi_for_stmt (stmt);
+      old = gimple_call_arg (stmt, 2);
+      vectype = TREE_TYPE (old);
+      if (TREE_TYPE (mask) != vectype)
+	{
+	  tree new_vtype = TREE_TYPE (mask);
+	  tree n_var;
+	  tree conv_expr;
+	  n_var = vect_get_new_vect_var (new_vtype, vect_simple_var, NULL);
+	  conv_expr = build1 (VIEW_CONVERT_EXPR, new_vtype, old);
+	  new_stmt = gimple_build_assign (n_var, conv_expr);
+	  n_lhs = make_ssa_name (n_var);
+	  gimple_assign_set_lhs (new_stmt, n_lhs);
+	  vectype = new_vtype;
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
+      else
+	n_lhs = old;
+      var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      new_stmt = gimple_build_assign (var, BIT_AND_EXPR, mask, n_lhs);
+      lhs = make_ssa_name (var, new_stmt);
+      gimple_assign_set_lhs (new_stmt, lhs);
+      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+      gimple_call_set_arg (stmt, 2, lhs);
+      update_stmt (stmt);
+    }
+}
+
+/* Convert vectorized reductions to VEC_COND statements to preserve
+   reduction semantics:
+	s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
+
+static void
+convert_reductions (loop_vec_info loop_vinfo, tree mask)
+{
+  unsigned i;
+  for (i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple_stmt_iterator gsi;
+      tree vectype;
+      tree lhs, rhs;
+      tree var, new_lhs, vec_cond_expr;
+      gimple *new_stmt, *def;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      lhs = gimple_assign_lhs (stmt);
+      vectype = TREE_TYPE (lhs);
+      gsi = gsi_for_stmt (stmt);
+      rhs = gimple_assign_rhs1 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      def = SSA_NAME_DEF_STMT (rhs);
+      if (gimple_code (def) != GIMPLE_PHI)
+	{
+	  rhs = gimple_assign_rhs2 (stmt);
+	  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+	  def = SSA_NAME_DEF_STMT (rhs);
+	  gcc_assert (gimple_code (def) == GIMPLE_PHI);
+	}
+      /* Change lhs of STMT.  */
+      var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      new_lhs = make_ssa_name (var, stmt);
+      gimple_assign_set_lhs (stmt, new_lhs);
+      /* Generate new VEC_COND expr.  */
+      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
+      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
+      gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+    }
+}
+
+/* Return true if the memory reference LHS is incremented by the vector
+   size, and false otherwise.  */
+
+static bool
+mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = TREE_TYPE (lhs);
+  unsigned n = GET_MODE_SIZE (TYPE_MODE (vectype));
+  gphi *phi;
+  edge e = loop_latch_edge (loop);
+  tree arg;
+  gimple *def;
+  tree name;
+  if (TREE_CODE (lhs) != MEM_REF)
+    return false;
+  name = TREE_OPERAND (lhs, 0);
+  if (TREE_CODE (name) != SSA_NAME)
+    return false;
+  def = SSA_NAME_DEF_STMT (name);
+  if (!def || gimple_code (def) != GIMPLE_PHI)
+    return false;
+  phi = as_a <gphi *> (def);
+  arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+  gcc_assert (TREE_CODE (arg) == SSA_NAME);
+  def = SSA_NAME_DEF_STMT (arg);
+  if (gimple_code (def) != GIMPLE_ASSIGN
+      || gimple_assign_rhs_code (def) != POINTER_PLUS_EXPR)
+    return false;
+  arg = gimple_assign_rhs2 (def);
+  if (TREE_CODE (arg) != INTEGER_CST)
+    arg = gimple_assign_rhs1 (def);
+  if (TREE_CODE (arg) != INTEGER_CST)
+    return false;
+  if (compare_tree_int (arg, n) != 0)
+    return false;
+  return true;
+}
+
+/* Combine the vectorized loop with its scalar remainder by masking
+   statements such as memory reads/writes and reductions so that a legal
+   result is produced.  A new vector induction variable is created to
+   generate the mask, which is simply the result of comparing that
+   variable with a vector containing the number of iterations.  The loop
+   trip count is adjusted and the scalar loop corresponding to the
+   remainder is made unreachable from the vectorized loop.  */
+
+void
+combine_vect_loop_remainder (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  auto_vec<gimple *, 10> loads;
+  auto_vec<gimple *, 5> stores;
+  auto_vec<gimple *, 5> masked_ld_st;
+  int elem_size = 0;
+  int n;
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  gimple *stmt;
+  stmt_vec_info stmt_info;
+  tree lhs, rhs, vectype, mask_vectype;
+  tree vec_index, vec_mask;
+  bool has_reductions = false;
+  unsigned size = 0;
+  unsigned additional_cost;
+  unsigned val;
+
+  if (!loop)
+    return;
+  if (loop->inner)
+    return;  /* Do not support outer-loop vectorization.  */
+  gcc_assert (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo));
+  vect_location = find_loop_location (loop);
+  if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    return;
+  if (!LOOP_VINFO_REDUCTION_CHAINS (loop_vinfo).is_empty ()
+      || !LOOP_VINFO_GROUPED_STORES (loop_vinfo).is_empty ())
+    return;
+  bb = loop->header;
+  /* Initialize ADDITIONAL_COST as cost of two vector statements.  */
+  additional_cost = builtin_vectorization_cost (vector_stmt, NULL_TREE, 0) * 2;
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== try to eliminate scalar epilogue ===\n");
+
+  /* Collect all statements that need to be fixed, compute cost of masking.  */
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      stmt_info = vinfo_for_stmt (stmt);
+      if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	/* Not supported yet!  */
+	return;
+      /* Check that we support the given def type.  */
+      if (stmt_info)
+	switch (STMT_VINFO_DEF_TYPE (stmt_info))
+	  {
+	    case vect_induction_def:
+	      if (STMT_VINFO_LIVE_P (stmt_info))
+		return;
+	      break;
+	    case vect_nested_cycle:
+	    case vect_double_reduction_def:
+	    case vect_external_def:
+	      return;
+	    default:
+	      break;
+	  }
+
+      if (gimple_assign_load_p (stmt))
+	{
+	  lhs = gimple_assign_lhs (stmt);
+	  rhs = gimple_assign_rhs1 (stmt);
+	  vectype = TREE_TYPE (lhs);
+	  if (may_be_nonaddressable_p (rhs))
+	    return;
+	  if (!VECTOR_TYPE_P (vectype))
+	    {
+	      struct data_reference *dr;
+	      if (!stmt_info)
+		continue;
+	      dr = STMT_VINFO_DATA_REF (stmt_info);
+	      if (!dr)
+		continue;
+	      if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
+		return;
+	      if (tree_int_cst_compare (DR_STEP (dr), size_zero_node) <= 0)
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				 "Load with decrement is not masked.\n");
+		  return;
+		}
+	      continue;
+	    }
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  mask_vectype = build_same_sized_truth_vector_type (vectype);
+	  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
+					  TYPE_MODE (mask_vectype),
+					  true))
+	    {
+	      if (dump_enabled_p ())
+		{
+		  dump_printf_loc (MSG_NOTE, vect_location,
+				   "type is not supported for masking!\n");
+		  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+		}
+	      return;
+	    }
+	  additional_cost += builtin_masking_cost (masking_load, mask_vectype);
+	  loads.safe_push (stmt);
+	}
+      else if (gimple_store_p (stmt))
+	{
+	  gcc_assert (gimple_assign_single_p (stmt));
+	  lhs = gimple_assign_lhs (stmt);
+	  if (may_be_nonaddressable_p (lhs))
+	    return;
+	  vectype = TREE_TYPE (lhs);
+	  if (!VECTOR_TYPE_P (vectype))
+	    continue;
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	      elem_size = n;
+	  else if (n != elem_size)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "multiple-types are not supported yet.\n");
+	      return;
+	    }
+	  if (!mem_ref_is_vec_size_incremented (loop_vinfo, lhs))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Store with decrement is not masked.\n");
+	      return;
+	    }
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  mask_vectype = build_same_sized_truth_vector_type (vectype);
+	  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
+					  TYPE_MODE (mask_vectype),
+					  false))
+	    {
+	      if (dump_enabled_p ())
+		{
+		  dump_printf_loc (MSG_NOTE, vect_location,
+				   "type is not supported for masking!\n");
+		  dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+		}
+	      return;
+	    }
+	  additional_cost += builtin_masking_cost (masking_store, mask_vectype);
+	  stores.safe_push (stmt);
+	}
+      else if (is_gimple_call (stmt)
+	       && gimple_call_internal_p (stmt)
+	       && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+		   || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
+	{
+	  tree mask = gimple_call_arg (stmt, 2);
+	  additional_cost += builtin_masking_cost (masking_masked_stmt,
+						   TREE_TYPE (mask));
+	  masked_ld_st.safe_push (stmt);
+	}
+      else if (is_gimple_call (stmt))
+	return;
+    }
+
+  /* Check that all vectorizable reductions can be converted to VCOND.  */
+  if (!LOOP_VINFO_REDUCTIONS (loop_vinfo).is_empty ())
+    {
+      unsigned i;
+      has_reductions = true;
+      for (i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+	{
+	  tree mask_vectype;
+
+	  stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+	  stmt_info = vinfo_for_stmt (stmt);
+	  gcc_assert (stmt_info);
+	  if (PURE_SLP_STMT (stmt_info))
+	    return;
+	  gcc_assert (STMT_VINFO_VEC_STMT (stmt_info));
+	  stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	  if (gimple_code (stmt) != GIMPLE_ASSIGN)
+	    return;
+	  /* Only reduction with binary operation is supported.  */
+	  if (get_gimple_rhs_class (gimple_assign_rhs_code (stmt))
+	      != GIMPLE_BINARY_RHS)
+	    return;
+	  lhs = gimple_assign_lhs (stmt);
+	  vectype = TREE_TYPE (lhs);
+	  if (vf / TYPE_VECTOR_SUBPARTS (vectype) > 1)
+	    /* Not yet supported!  */
+	    return;
+	  n = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    /* Not yet supported!  */
+	    return;
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  mask_vectype = build_same_sized_truth_vector_type (vectype);
+	  if (!expand_vec_cond_expr_p (vectype, mask_vectype))
+	    return;
+	  additional_cost += builtin_masking_cost (masking_reduction,
+						   mask_vectype);
+	}
+    }
+  /* Check masked loads/stores, if any.  */
+  if (!masked_ld_st.is_empty ())
+    {
+      unsigned i;
+      for (i = 0; i < masked_ld_st.length (); i++)
+	{
+	  tree mask;
+	  tree vectype;
+	  optab tab;
+	  stmt = masked_ld_st[i];
+	  mask = gimple_call_arg (stmt, 2);
+	  vectype = TREE_TYPE (mask);
+	  n = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (vectype)));
+	  if (elem_size == 0)
+	    elem_size = n;
+	  else if (n != elem_size)
+	    /* Mask conversion is not supported yet!  */
+	    return;
+	  if (size == 0)
+	    size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
+	  /* Check that BIT_AND is supported on target.  */
+	  tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+	  if (!tab)
+	    return;
+	  if (optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	    return;
+	}
+    }
+
+  /* Vectorization of the scalar epilogue is reasonable if the cost
+     increase in percent does not exceed the value of the
+     "vect-cost-increase-threshold" parameter.  */
+  val = additional_cost * 100;
+  val /= LOOP_VINFO_SINGLE_VECTOR_ITERATION_COST (loop_vinfo);
+  if (val > (unsigned) PARAM_VALUE (PARAM_VECT_COST_INCREASE_THRESHOLD))
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location, "Vectorization of scalar"
+			   " epilogue is not profitable.\n");
+	  dump_printf (MSG_NOTE, "Vector iteration cost: %d\n",
+		       LOOP_VINFO_SINGLE_VECTOR_ITERATION_COST (loop_vinfo));
+	  dump_printf (MSG_NOTE, " Additional iteration cost: %d\n",
+		       additional_cost);
+	}
+      return;
+    }
+  /* Generate the induction vector used to evaluate the mask.  */
+  vec_index = gen_vec_induction (loop_vinfo, elem_size, size);
+  if (!vec_index)
+    return;
+
+  /* Generate the mask vector used to mask the collected statements.  */
+  vec_mask = gen_mask_for_remainder (loop_vinfo, vec_index);
+  gcc_assert (vec_mask);
+
+  /* Convert vectorized loads to masked ones.  */
+  if (!loads.is_empty ())
+    convert_loads_to_masked (&loads, vec_mask);
+
+  /* Convert vectorized stores to masked ones.  */
+  if (!stores.is_empty ())
+    convert_stores_to_masked (&stores, vec_mask);
+
+  if (has_reductions)
+    convert_reductions (loop_vinfo, vec_mask);
+
+  if (!masked_ld_st.is_empty ())
+    fix_mask_for_masked_ld_st (&masked_ld_st, vec_mask);
+
+  /* Fix loop trip count.  */
+  fix_vec_loop_trip_count (loop_vinfo);
+
+  /* Fix up cfg to make scalar loop remainder unreachable.  */
+  isolate_remainder (loop_vinfo);
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== scalar epilogue has been deleted ===\n");
+}
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 41e87a8..46535a2 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -580,6 +580,18 @@ vectorize_loops (void)
 	  }
       }
 
+  /* Try to combine vectorized loop and scalar remainder.  */
+  for (i = 1; i < vect_loops_num; i++)
+    {
+      loop_vec_info loop_vinfo;
+      loop = get_loop (cfun, i);
+      if (!loop || loop->inner)
+	continue;
+      loop_vinfo = (loop_vec_info) loop->aux;
+      if (loop_vinfo && LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
+	combine_vect_loop_remainder (loop_vinfo);
+    }
+
   for (i = 1; i < vect_loops_num; i++)
     {
       loop_vec_info loop_vinfo;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6ad0cc4..6cedc01 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -231,6 +231,8 @@ typedef struct _loop_vec_info : public vec_info {
   tree num_iters;
   /* Number of iterations of the original loop.  */
   tree num_iters_unchanged;
+  /* Number of iterations of the vectorized loop.  */
+  tree num_iters_vect_loop;
 
   /* Threshold of number of iterations below which vectorzation will not be
      performed. It is calculated from MIN_PROFITABLE_ITERS and
@@ -291,6 +293,9 @@ typedef struct _loop_vec_info : public vec_info {
   /* Cost of a single scalar iteration.  */
   int single_scalar_iteration_cost;
 
+  /* Cost of a single vector iteration.  */
+  unsigned single_vector_iteration_cost;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -336,6 +341,7 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_BBS(L)                  (L)->bbs
 #define LOOP_VINFO_NITERSM1(L)             (L)->num_itersm1
 #define LOOP_VINFO_NITERS(L)               (L)->num_iters
+#define LOOP_VINFO_NITERS_VECT_LOOP(L)    (L)->num_iters_vect_loop
 /* Since LOOP_VINFO_NITERS and LOOP_VINFO_NITERSM1 can change after
    prologue peeling retain total unchanged scalar loop iterations for
    cost model.  */
@@ -366,6 +372,8 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_SCALAR_LOOP(L)	   (L)->scalar_loop
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
+#define LOOP_VINFO_SINGLE_VECTOR_ITERATION_COST(L) \
+  (L)->single_vector_iteration_cost
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
   ((L)->may_misalign_stmts.length () > 0)
@@ -822,6 +830,14 @@ builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 						       vectype, misalign);
 }
 
+/* Alias targetm.vectorize.builtin_masking_cost.  */
+
+static inline int
+builtin_masking_cost (enum vect_cost_for_masking kind, tree vectype)
+{
+  return targetm.vectorize.builtin_masking_cost (kind, vectype);
+}
+
 /* Get cost by calling cost target builtin.  */
 
 static inline
@@ -1001,6 +1017,7 @@ extern void vect_get_vec_defs (tree, tree, gimple *, vec<tree> *,
 			       vec<tree> *, slp_tree, int);
 extern tree vect_gen_perm_mask_any (tree, const unsigned char *);
 extern tree vect_gen_perm_mask_checked (tree, const unsigned char *);
+extern void combine_vect_loop_remainder (loop_vec_info);
 
 /* In tree-vect-data-refs.c.  */
 extern bool vect_can_force_dr_alignment_p (const_tree, unsigned int);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-13 10:36           ` Yuri Rumyantsev
@ 2015-11-23 15:54             ` Yuri Rumyantsev
  2015-11-24  9:21               ` Richard Biener
  2015-11-27 13:49             ` Richard Biener
  1 sibling, 1 reply; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-11-23 15:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

Hi Richard,

Did you have a chance to look at this?

Thanks.
Yuri.

2015-11-13 13:35 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Hi Richard,
>
> Here is updated version of the patch which 91) is in sync with trunk
> compiler and (2) contains simple cost model to estimate profitability
> of scalar epilogue elimination. The part related to vectorization of
> loops with small trip count is in process of developing. Note that
> implemented cost model was not tuned  well for HASWELL and KNL but we
> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>
> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>> Richard,
>>>>>
>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>> scheme of remainder will be used through introducing new IV :
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>
>>>>> Is it true or we missed something?
>>>>
>>>> <quote>
>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>> > for both 4b and 4c?
>>>>>
>>>>> Do 2a ...
>>>> Okay.
>>>> </quote>
>>>
>>> 2a was 'transform already vectorized loop as a separate
>>> post-processing'. Isn't it what this prototype patch implements?
>>> Current version only masks loop body which is in practice applicable
>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>> how profitable masking might be and it is a main target for the first
>>> masking version.  Extending it to prologues/epilogues and thus making
>>> it more profitable for other targets is the next step and is out of
>>> the scope of this patch.
>>
>> Ok, technically the prototype transforms the already vectorized loop.
>> Of course I meant the vectorized loop be copied, masked and that
>> result used as epilogue...
>>
>> I'll queue a more detailed look into the patch for this week.
>>
>> Did you perform any measurements with this patch like # of
>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Richard.
>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-23 15:54             ` Yuri Rumyantsev
@ 2015-11-24  9:21               ` Richard Biener
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Biener @ 2015-11-24  9:21 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

On Mon, Nov 23, 2015 at 4:52 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Richard,
>
> Did you have a chance to look at this?

It's on my list - I'm still swamped with patches to review.

Richard.

> Thanks.
> Yuri.
>
> 2015-11-13 13:35 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> Hi Richard,
>>
>> Here is updated version of the patch which 91) is in sync with trunk
>> compiler and (2) contains simple cost model to estimate profitability
>> of scalar epilogue elimination. The part related to vectorization of
>> loops with small trip count is in process of developing. Note that
>> implemented cost model was not tuned  well for HASWELL and KNL but we
>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>>
>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>> Richard,
>>>>>>
>>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>>> scheme of remainder will be used through introducing new IV :
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>
>>>>>> Is it true or we missed something?
>>>>>
>>>>> <quote>
>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>> > for both 4b and 4c?
>>>>>>
>>>>>> Do 2a ...
>>>>> Okay.
>>>>> </quote>
>>>>
>>>> 2a was 'transform already vectorized loop as a separate
>>>> post-processing'. Isn't it what this prototype patch implements?
>>>> Current version only masks loop body which is in practice applicable
>>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>>> how profitable masking might be and it is a main target for the first
>>>> masking version.  Extending it to prologues/epilogues and thus making
>>>> it more profitable for other targets is the next step and is out of
>>>> the scope of this patch.
>>>
>>> Ok, technically the prototype transforms the already vectorized loop.
>>> Of course I meant the vectorized loop be copied, masked and that
>>> result used as epilogue...
>>>
>>> I'll queue a more detailed look into the patch for this week.
>>>
>>> Did you perform any measurements with this patch like # of
>>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Ilya
>>>>
>>>>>
>>>>> Richard.
>>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-13 10:36           ` Yuri Rumyantsev
  2015-11-23 15:54             ` Yuri Rumyantsev
@ 2015-11-27 13:49             ` Richard Biener
  2015-11-30 15:04               ` Yuri Rumyantsev
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2015-11-27 13:49 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Richard,
>
> Here is updated version of the patch which 91) is in sync with trunk
> compiler and (2) contains simple cost model to estimate profitability
> of scalar epilogue elimination. The part related to vectorization of
> loops with small trip count is in process of developing. Note that
> implemented cost model was not tuned  well for HASWELL and KNL but we
> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.

Ok, so I don't know where to start with this.

First of all, while I wanted the actual stmt processing to be done as
post-processing on the vectorized loop body, I didn't want to have this
completely separated from vectorizing.

So, do combine_vect_loop_remainder () from vect_transform_loop, not by iterating
over all (vectorized) loops at the end.

Second, all the adjustments of the number of iterations for the vector
loop should
be integrated into the main vectorization scheme as should determining the
cost of the predication.  So you'll end up adding a
LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
cost analysis and during code generation adjust vector iteration computation
accordingly and _not_ generate the epilogue loop (or wire it up correctly in
the first place).

The actual stmt processing should then still happen in a similar way as you do.

So I'm going to comment on that part only as I expect the rest will look a lot
different.

+/* Generate induction_vector which will be used to mask evaluation.  */
+
+static tree
+gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
+{

please make use of create_iv.  Add more comments.  I reverse-engineered
that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
in gen_mask_for_remainder
by comparing it against { niter, ..., niter }.
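
In GIMPLE-like pseudocode the scheme then is roughly the following (an
illustrative sketch only; names are made up, the internal-fn argument
order is the one the patch builds):

  # vec_iv = PHI <{ 0, 1, ..., vf-1 } (preheader), vec_iv_next (latch)>
  vec_mask = vec_iv < { niter, niter, ..., niter };
  vect_x = MASK_LOAD (addr, ptr, vec_mask);
  vect_t = vect_x + vect_s2;
  vect_s1 = VEC_COND_EXPR <vec_mask, vect_t, vect_s2>;
  MASK_STORE (addr, ptr, vec_mask, vect_y);
  vec_iv_next = vec_iv + { vf, vf, ..., vf };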

+  gsi = gsi_after_labels (loop->header);
+  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+          ? LOOP_VINFO_NITERS (loop_vinfo)
+          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);

that's either wrong or unnecessary.  if ! peeling for alignment
loop-vinfo-niters
is equal to loop-vinfo-niters-unchanged.

+      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+      if (!SSA_NAME_PTR_INFO (addr))
+       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);

vect_duplicate_ssa_name_ptr_info.

+
+static void
+fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
+{
+  gimple *stmt, *new_stmt;
+  tree old, lhs, vectype, var, n_lhs;

no comment?  what's this for.

+/* Convert vectorized reductions to VEC_COND statements to preserve
+   reduction semantic:
+       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
+
+static void
+convert_reductions (loop_vec_info loop_vinfo, tree mask)
+{

for reductions it looks like preserving the last iteration x plus the mask
could avoid predicating it this way and compensate in the reduction
epilogue by "subtracting" x & mask?  With true predication support
that'll likely be more expensive of course.

+      /* Generate new VEC_COND expr.  */
+      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
+      new_stmt = gimple_build_assign (lhs, vec_cond_expr);

gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);

+/* Return true if MEM_REF is incremented by vector size and false
otherwise.  */
+
+static bool
+mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);

what?!  Just look at DR_STEP of the store?


+void
+combine_vect_loop_remainder (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  auto_vec<gimple *, 10> loads;
+  auto_vec<gimple *, 5> stores;

so you need to re-structure this in a way that it computes

  a) whether it can perform the operation - and you need to do that
      reliably before the operation has taken place
  b) its cost

instead of looking at def types or gimple_assign_load/store_p predicates
please look at STMT_VINFO_TYPE instead.

I don't like the new target hook for the costing.  We do need some major
re-structuring of the vectorizer cost model implementation; this doesn't go
in the right direction.

A simplistic hook following the current scheme would have used
the vect_cost_for_stmt as argument and mirrored builtin_vectorization_cost.

There is not a single testcase in the patch.  I would have expected one that
makes sure we keep the 6% speedup for cactusADM at least.


So this was a 45-minute "overall" review not going into all the
implementation details.

Thanks,
Richard.


> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>> Richard,
>>>>>
>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>> scheme of remainder will be used through introducing new IV :
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>
>>>>> Is it true or we missed something?
>>>>
>>>> <quote>
>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>> > for both 4b and 4c?
>>>>>
>>>>> Do 2a ...
>>>> Okay.
>>>> </quote>
>>>
>>> 2a was 'transform already vectorized loop as a separate
>>> post-processing'. Isn't it what this prototype patch implements?
>>> Current version only masks loop body which is in practice applicable
>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>> how profitable masking might be and it is a main target for the first
>>> masking version.  Extending it to prologues/epilogues and thus making
>>> it more profitable for other targets is the next step and is out of
>>> the scope of this patch.
>>
>> Ok, technically the prototype transforms the already vectorized loop.
>> Of course I meant the vectorized loop be copied, masked and that
>> result used as epilogue...
>>
>> I'll queue a more detailed look into the patch for this week.
>>
>> Did you perform any measurements with this patch like # of
>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Richard.
>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-27 13:49             ` Richard Biener
@ 2015-11-30 15:04               ` Yuri Rumyantsev
  2015-12-15 16:41                 ` Yuri Rumyantsev
  0 siblings, 1 reply; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-11-30 15:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

Richard,

Thanks a lot for your detailed comments!

A few words about the 436.cactusADM gain.  The loop which was transformed
for AVX2 is very large and is the last innermost loop in routine
Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366).  If you don't
have the sources, let me know.

Yuri.

2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi Richard,
>>
>> Here is updated version of the patch which 91) is in sync with trunk
>> compiler and (2) contains simple cost model to estimate profitability
>> of scalar epilogue elimination. The part related to vectorization of
>> loops with small trip count is in process of developing. Note that
>> implemented cost model was not tuned  well for HASWELL and KNL but we
>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>
> Ok, so I don't know where to start with this.
>
> First of all while I wanted to have the actual stmt processing to be
> as post-processing
> on the vectorized loop body I didn't want to have this competely separated from
> vectorizing.
>
> So, do combine_vect_loop_remainder () from vect_transform_loop, not by iterating
> over all (vectorized) loops at the end.
>
> Second, all the adjustments of the number of iterations for the vector
> loop should
> be integrated into the main vectorization scheme as should determining the
> cost of the predication.  So you'll end up adding a
> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
> cost analysis and during code generation adjust vector iteration computation
> accordingly and _not_ generate the epilogue loop (or wire it up correctly in
> the first place).
>
> The actual stmt processing should then still happen in a similar way as you do.
>
> So I'm going to comment on that part only as I expect the rest will look a lot
> different.
>
> +/* Generate induction_vector which will be used to mask evaluation.  */
> +
> +static tree
> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
> +{
>
> please make use of create_iv.  Add more comments.  I reverse-engineered
> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
> in gen_mask_for_remainder
> by comparing it against { niter, ..., niter }.
>
> +  gsi = gsi_after_labels (loop->header);
> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +          ? LOOP_VINFO_NITERS (loop_vinfo)
> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>
> that's either wrong or unnecessary.  if ! peeling for alignment
> loop-vinfo-niters
> is equal to loop-vinfo-niters-unchanged.
>
> +      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> +      if (!SSA_NAME_PTR_INFO (addr))
> +       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>
> vect_duplicate_ssa_name_ptr_info.
>
> +
> +static void
> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
> +{
> +  gimple *stmt, *new_stmt;
> +  tree old, lhs, vectype, var, n_lhs;
>
> no comment?  what's this for.
>
> +/* Convert vectorized reductions to VEC_COND statements to preserve
> +   reduction semantic:
> +       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
> +
> +static void
> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
> +{
>
> for reductions it looks like preserving the last iteration x plus the mask
> could avoid predicating it this way and compensate in the reduction
> epilogue by "subtracting" x & mask?  With true predication support
> that'll likely be more expensive of course.
>
> +      /* Generate new VEC_COND expr.  */
> +      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
> +      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>
> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>
> +/* Return true if MEM_REF is incremented by vector size and false
> otherwise.  */
> +
> +static bool
> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>
> what?!  Just look at DR_STEP of the store?
>
>
> +void
> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  auto_vec<gimple *, 10> loads;
> +  auto_vec<gimple *, 5> stores;
>
> so you need to re-structure this in a way that it computes
>
>   a) wheter it can perform the operation - and you need to do that
>       reliably before the operation has taken place
>   b) its cost
>
> instead of looking at def types or gimple_assign_load/store_p predicates
> please look at STMT_VINFO_TYPE instead.
>
> I don't like the new target hook for the costing.  We do need some major
> re-structuring in the vectorizer cost model implementation, this doesn't go
> into the right direction.
>
> A simplistic hook following the current scheme would have used
> the vect_cost_for_stmt as argument and mirror builtin_vectorization_cost.
>
> There is not a single testcase in the patch.  I would have expected one that
> makes sure we keep the 6% speedup for cactusADM at least.
>
>
> So this was a 45minute "overall" review not going into all the
> implementation details.
>
> Thanks,
> Richard.
>
>
>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>> Richard,
>>>>>>
>>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>>> scheme of remainder will be used through introducing new IV :
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>
>>>>>> Is it true or we missed something?
>>>>>
>>>>> <quote>
>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>> > for both 4b and 4c?
>>>>>>
>>>>>> Do 2a ...
>>>>> Okay.
>>>>> </quote>
>>>>
>>>> 2a was 'transform already vectorized loop as a separate
>>>> post-processing'. Isn't it what this prototype patch implements?
>>>> Current version only masks loop body which is in practice applicable
>>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>>> how profitable masking might be and it is a main target for the first
>>>> masking version.  Extending it to prologues/epilogues and thus making
>>>> it more profitable for other targets is the next step and is out of
>>>> the scope of this patch.
>>>
>>> Ok, technically the prototype transforms the already vectorized loop.
>>> Of course I meant the vectorized loop be copied, masked and that
>>> result used as epilogue...
>>>
>>> I'll queue a more detailed look into the patch for this week.
>>>
>>> Did you perform any measurements with this patch like # of
>>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Ilya
>>>>
>>>>>
>>>>> Richard.
>>>>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-11-30 15:04               ` Yuri Rumyantsev
@ 2015-12-15 16:41                 ` Yuri Rumyantsev
  2016-01-11 10:07                   ` Yuri Rumyantsev
  2016-02-09 16:10                   ` Ilya Enkovich
  0 siblings, 2 replies; 17+ messages in thread
From: Yuri Rumyantsev @ 2015-12-15 16:41 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 9486 bytes --]

Hi Richard,

I re-designed the patch to determine the ability to mask the loop on the
fly during vectorization analysis, and to invoke the masking after loop
transformation.  A test case is also provided.
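
For reference, the kind of loop this targets looks roughly like the
sketch below (an illustration only, not the actual
vect-mask-loop_for_epilogue1.c from the patch; the scanned dump string
is the one the earlier version of the patch printed and may differ
here):

  /* { dg-do compile } */
  /* { dg-options "-O3 -mavx2 -fdump-tree-vect-details" } */

  float
  foo (float *a, float *b, int n)
  {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
      {
        b[i] = a[i] * 2.0f;   /* becomes masked load + masked store.  */
        s += a[i];            /* reduction masked via VEC_COND_EXPR.  */
      }
    return s;
  }

  /* { dg-final { scan-tree-dump "scalar epilogue has been deleted" "vect" } } */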

What is your opinion?

Thanks.
Yuri.

ChangeLog::
2015-12-15  Yuri Rumyantsev  <ysrumyan@gmail.com>

* config/i386/i386.c (ix86_builtin_vectorization_cost): Add handling
of new cases.
* config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): Add new target
macros.
* config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New
tuner.
* params.def (PARAM_VECT_COST_INCREASE_THRESHOLD): New parameter.
* target.h (enum vect_cost_for_stmt): Add new elements.
* targhooks.c (default_builtin_vectorization_cost): Extend switch for
new enum elements.
* tree-vect-loop.c : Include 3 header files.
(vect_analyze_loop_operations): Add new fields initialization and
resetting, add computation of profitability of masking the loop for
the epilogue.
(vectorizable_reduction): Determine ability of reduction masking
and compute its cost.
(vect_can_build_vector_iv): New function.
(vect_generate_tmps_on_preheader): Adjust computation of ratio depending
on epilogue generation.
(gen_vec_iv_for_masking): New function.
(gen_vec_mask_for_loop): Likewise.
(mask_vect_load_store): Likewise.
(convert_reductions_for_masking): Likewise.
(fix_mask_for_masked_ld_st): Likewise.
(mask_loop_for_epilogue): Likewise.
(vect_transform_loop): Do not perform loop masking if it requires
peeling for gaps, add check on the ability to mask the loop, turn off
loop peeling if loop masking is performed, save recomputed NITERS to
the corresponding field of loop_vec_info, invoke mask_loop_for_epilogue
after vectorization if masking is possible.
* tree-vect-stmts.c : Include tree-ssa-loop-ivopts.h.
(can_mask_load_store): New function.
(vectorizable_mask_load_store): Determine ability of load/store
masking and compute its cost.
(vectorizable_load):  Likewise.
* tree-vectorizer.h (additional_loop_body_cost): New field of
loop_vec_info.
(mask_main_loop_for_epilogue): Likewise.
(LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST): New macros.
(LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE): Likewise.

gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-mask-loop_for_epilogue1.c: New test.

2015-11-30 18:03 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Richard,
>
> Thanks a lot for your detailed comments!
>
> Few words about 436.cactusADM gain. The loop which was transformed for
> avx2 is very huge and this is the last inner-most loop in routine
> Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366). If you don't
> have sources, let me know.
>
> Yuri.
>
> 2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> Here is updated version of the patch which 91) is in sync with trunk
>>> compiler and (2) contains simple cost model to estimate profitability
>>> of scalar epilogue elimination. The part related to vectorization of
>>> loops with small trip count is in process of developing. Note that
>>> implemented cost model was not tuned  well for HASWELL and KNL but we
>>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>>
>> Ok, so I don't know where to start with this.
>>
>> First of all while I wanted to have the actual stmt processing to be
>> as post-processing
>> on the vectorized loop body I didn't want to have this competely separated from
>> vectorizing.
>>
>> So, do combine_vect_loop_remainder () from vect_transform_loop, not by iterating
>> over all (vectorized) loops at the end.
>>
>> Second, all the adjustments of the number of iterations for the vector
>> loop should
>> be integrated into the main vectorization scheme as should determining the
>> cost of the predication.  So you'll end up adding a
>> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
>> cost analysis and during code generation adjust vector iteration computation
>> accordingly and _not_ generate the epilogue loop (or wire it up correctly in
>> the first place).
>>
>> The actual stmt processing should then still happen in a similar way as you do.
>>
>> So I'm going to comment on that part only as I expect the rest will look a lot
>> different.
>>
>> +/* Generate induction_vector which will be used to mask evaluation.  */
>> +
>> +static tree
>> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
>> +{
>>
>> please make use of create_iv.  Add more comments.  I reverse-engineered
>> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
>> in gen_mask_for_remainder
>> by comparing it against { niter, ..., niter }.
>>
>> +  gsi = gsi_after_labels (loop->header);
>> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>> +          ? LOOP_VINFO_NITERS (loop_vinfo)
>> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>>
>> that's either wrong or unnecessary.  if ! peeling for alignment
>> loop-vinfo-niters
>> is equal to loop-vinfo-niters-unchanged.
>>
>> +      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
>> +      if (!SSA_NAME_PTR_INFO (addr))
>> +       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>>
>> vect_duplicate_ssa_name_ptr_info.
>>
>> +
>> +static void
>> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
>> +{
>> +  gimple *stmt, *new_stmt;
>> +  tree old, lhs, vectype, var, n_lhs;
>>
>> no comment?  what's this for.
>>
>> +/* Convert vectorized reductions to VEC_COND statements to preserve
>> +   reduction semantic:
>> +       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
>> +
>> +static void
>> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
>> +{
>>
>> for reductions it looks like preserving the last iteration x plus the mask
>> could avoid predicating it this way and compensate in the reduction
>> epilogue by "subtracting" x & mask?  With true predication support
>> that'll likely be more expensive of course.
>>
>> +      /* Generate new VEC_COND expr.  */
>> +      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>> +      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>>
>> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>
>> +/* Return true if MEM_REF is incremented by vector size and false
>> otherwise.  */
>> +
>> +static bool
>> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>
>> what?!  Just look at DR_STEP of the store?
>>
>>
>> +void
>> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +  auto_vec<gimple *, 10> loads;
>> +  auto_vec<gimple *, 5> stores;
>>
>> so you need to re-structure this in a way that it computes
>>
>>   a) wheter it can perform the operation - and you need to do that
>>       reliably before the operation has taken place
>>   b) its cost
>>
>> instead of looking at def types or gimple_assign_load/store_p predicates
>> please look at STMT_VINFO_TYPE instead.
>>
>> I don't like the new target hook for the costing.  We do need some major
>> re-structuring in the vectorizer cost model implementation, this doesn't go
>> into the right direction.
>>
>> A simplistic hook following the current scheme would have used
>> the vect_cost_for_stmt as argument and mirror builtin_vectorization_cost.
>>
>> There is not a single testcase in the patch.  I would have expected one that
>> makes sure we keep the 6% speedup for cactusADM at least.
>>
>>
>> So this was a 45minute "overall" review not going into all the
>> implementation details.
>>
>> Thanks,
>> Richard.
>>
>>
>>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>> Richard,
>>>>>>>
>>>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>>>> scheme of remainder will be used through introducing new IV :
>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>>
>>>>>>> Is it true or we missed something?
>>>>>>
>>>>>> <quote>
>>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>>> > for both 4b and 4c?
>>>>>>>
>>>>>>> Do 2a ...
>>>>>> Okay.
>>>>>> </quote>
>>>>>
>>>>> 2a was 'transform already vectorized loop as a separate
>>>>> post-processing'. Isn't it what this prototype patch implements?
>>>>> Current version only masks loop body which is in practice applicable
>>>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>>>> how profitable masking might be and it is a main target for the first
>>>>> masking version.  Extending it to prologues/epilogues and thus making
>>>>> it more profitable for other targets is the next step and is out of
>>>>> the scope of this patch.
>>>>
>>>> Ok, technically the prototype transforms the already vectorized loop.
>>>> Of course I meant the vectorized loop be copied, masked and that
>>>> result used as epilogue...
>>>>
>>>> I'll queue a more detailed look into the patch for this week.
>>>>
>>>> Did you perform any measurements with this patch like # of
>>>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> Ilya
>>>>>
>>>>>>
>>>>>> Richard.
>>>>>>

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 34911 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index cecea24..e1e4420 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -48753,6 +48753,17 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	elements = TYPE_VECTOR_SUBPARTS (vectype);
 	return ix86_cost->vec_stmt_cost * (elements / 2 + 1);
 
+      case masking_vec_load:
+	return 0;
+
+      case masking_vec_store:
+	return (TARGET_INCREASE_MASK_STORE_COST
+		&& !TARGET_AVX512F) ? 10 : 0;
+
+      case masking_vec_stmt:
+	return (GET_MODE_CLASS (TYPE_MODE (vectype)) != MODE_VECTOR_INT) ? 0
+	       : ix86_cost->vec_stmt_cost;
+
       default:
         gcc_unreachable ();
     }
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index e69c9cc..177369f 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -496,6 +496,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
     ix86_tune_features[X86_TUNE_ADJUST_UNROLL]
 #define TARGET_AVOID_FALSE_DEP_FOR_BMI \
 	ix86_tune_features[X86_TUNE_AVOID_FALSE_DEP_FOR_BMI]
+#define TARGET_INCREASE_MASK_STORE_COST \
+	ix86_tune_features[X86_TUNE_INCREASE_MASK_STORE_COST]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index b2d3921..21fdc9f 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -527,6 +527,11 @@ DEF_TUNE (X86_TUNE_AVOID_VECTOR_DECODE, "avoid_vector_decode",
 DEF_TUNE (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI, "avoid_false_dep_for_bmi",
 	  m_SANDYBRIDGE | m_HASWELL | m_GENERIC)
 
+/* X86_TUNE_INCREASE_MASK_STORE_COST: Increase cost of masked store for
+   some platforms.  */
+DEF_TUNE (X86_TUNE_INCREASE_MASK_STORE_COST, "increase_mask_store_cost",
+	  m_HASWELL | m_BDVER4 | m_ZNVER1)
+
 /*****************************************************************************/
 /* This never worked well before.                                            */
 /*****************************************************************************/
diff --git a/gcc/params.def b/gcc/params.def
index 41fd8a8..f655d5b 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1177,6 +1177,12 @@ DEFPARAM (PARAM_MAX_SSA_NAME_QUERY_DEPTH,
 	  "Maximum recursion depth allowed when querying a property of an"
 	  " SSA name.",
 	  2, 1, 0)
+
+DEFPARAM (PARAM_VECT_COST_INCREASE_THRESHOLD,
+	  "vect-cost-increase-threshold",
+	  "Cost increase threshold to mask main loop for epilogue.",
+	  10, 0, 100)
+
 /*
 
 Local variables:
diff --git a/gcc/target.h b/gcc/target.h
index ffc4d6a..5102571 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -170,7 +170,10 @@ enum vect_cost_for_stmt
   cond_branch_taken,
   vec_perm,
   vec_promote_demote,
-  vec_construct
+  vec_construct,
+  masking_vec_load,
+  masking_vec_store,
+  masking_vec_stmt
 };
 
 /* Separate locations for which the vectorizer cost model should
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index dcf0863..f458717 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -592,6 +592,11 @@ default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	elements = TYPE_VECTOR_SUBPARTS (vectype);
 	return elements / 2 + 1;
 
+      case masking_vec_load:
+      case masking_vec_store:
+      case masking_vec_stmt:
+	return 1;
+
       default:
         gcc_unreachable ();
     }
diff --git a/gcc/testsuite/gcc.target/i386/vect-mask-loop_for_epilogue1.c b/gcc/testsuite/gcc.target/i386/vect-mask-loop_for_epilogue1.c
new file mode 100755
index 0000000..65dae32
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-mask-loop_for_epilogue1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-march=knl -O3 -ffast-math -fdump-tree-vect-details" } */
+
+#define N 128
+float a[N], b[N];
+float d1[N], d2[N];
+float c1[N], c2[N];
+
+void foo (int n, float x)
+{
+  int i;
+  for (i=0; i<n; i++)
+    {
+      a[i] += b[i];
+      c1[i] -= c2[i] * x;
+      d1[i] += d2[i] * x;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop was masked for epilogue" 1 "vect" } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 77ad760..1e59b6e 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -47,6 +47,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "gimple-fold.h"
 #include "cgraph.h"
+#include "alias.h"
+#include "tree-ssa-address.h"
+#include "tree-cfg.h"
 
 /* Loop Vectorization Pass.
 
@@ -1591,6 +1594,9 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "=== vect_analyze_loop_operations ===\n");
 
+  /* Determine possibility of combining vectorized loop with epilogue.  */
+  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = true;
+
   for (i = 0; i < nbbs; i++)
     {
       basic_block bb = bbs[i];
@@ -2202,6 +2208,8 @@ again:
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
+  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = true;
+  LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo) = 0;
 
   goto start_over;
 }
@@ -3419,6 +3427,22 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
                       min_profitable_iters);
 
   *ret_min_profitable_estimate = min_profitable_estimate;
+  if (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      /* Calculate profitability combining epilogue with main loop.  */
+      unsigned val = LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo);
+      unsigned param = PARAM_VALUE (PARAM_VECT_COST_INCREASE_THRESHOLD);
+      /* Add a cost of vector iv comparison and incrementation.  */
+      val += builtin_vectorization_cost (vector_stmt, NULL_TREE, 0) * 2;
+      if ( val * 100 > vec_inside_cost * param)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Masking loop for epilogue is not profitable: "
+			     "additional cost=%d.\n", val);
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	}
+    }
 }
 
 /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
@@ -5379,6 +5403,7 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
       outer_loop = loop;
       loop = loop->inner;
       nested_cycle = true;
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
     }
 
   /* 1. Is vectorizable reduction?  */
@@ -5578,6 +5603,12 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 
   gcc_assert (ncopies >= 1);
 
+  if (slp_node || PURE_SLP_STMT (stmt_info) || ncopies > 1 || code == COND_EXPR
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+	 == INTEGER_INDUC_COND_REDUCTION)
+    LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+
   vec_mode = TYPE_MODE (vectype_in);
 
   if (code == COND_EXPR)
@@ -5859,6 +5890,13 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 	  return false;
 	}
     }
+  if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      /* Check that masking of reduction is supported.  */
+      tree mask_vtype = build_same_sized_truth_vector_type (vectype_out);
+      if (!expand_vec_cond_expr_p (vectype_out, mask_vtype))
+	LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+    }
 
   if (!vec_stmt) /* transformation not required.  */
     {
@@ -5867,6 +5905,15 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 					 reduc_index))
         return false;
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+      if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  tree mask_vtype = build_same_sized_truth_vector_type (vectype_out);
+	  unsigned cost = builtin_vectorization_cost (masking_vec_stmt,
+						      mask_vtype,
+						      0);
+	  LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo) += cost;
+	}
+
       return true;
     }
 
@@ -6457,6 +6504,17 @@ vect_build_loop_niters (loop_vec_info loop_vinfo)
     }
 }
 
+/* Return true if a vector iv can be built from the number of iterations.  */
+static bool
+vect_can_build_vector_iv (loop_vec_info loop_vinfo)
+{
+  tree ni = LOOP_VINFO_NITERS (loop_vinfo);
+  tree type = TREE_TYPE (ni);
+  tree vectype = get_vectype_for_scalar_type (type);
+  int nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  return (LOOP_VINFO_VECT_FACTOR (loop_vinfo) <= nunits);
+}
+
 
 /* This function generates the following statements:
 
@@ -6502,20 +6560,36 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   else
     ni_minus_gap_name = ni_name;
 
-  /* Create: ratio = ni >> log2(vf) */
-  /* ???  As we have ni == number of latch executions + 1, ni could
-     have overflown to zero.  So avoid computing ratio based on ni
-     but compute it using the fact that we know ratio will be at least
-     one, thus via (ni - vf) >> log2(vf) + 1.  */
-  ratio_name
-    = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
-		   fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
-				fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
-					     ni_minus_gap_name,
-					     build_int_cst
-					       (TREE_TYPE (ni_name), vf)),
-				log_vf),
-		   build_int_cst (TREE_TYPE (ni_name), 1));
+  if (!LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      /* Create: ratio = ni >> log2(vf) */
+      /* ???  As we have ni == number of latch executions + 1, ni could
+	 have overflown to zero.  So avoid computing ratio based on ni
+	 but compute it using the fact that we know ratio will be at least
+	 one, thus via (ni - vf) >> log2(vf) + 1.  */
+      ratio_name
+        = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+				    fold_build2 (MINUS_EXPR,
+						 TREE_TYPE (ni_name),
+						 ni_minus_gap_name,
+						 build_int_cst
+						   (TREE_TYPE (ni_name), vf)),
+				    log_vf),
+		       build_int_cst (TREE_TYPE (ni_name), 1));
+    }
+  else
+    {
+      /* Epilogue combined with loop: ratio = (ni + vf - 1) >> log2(vf).  */
+      gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+      ratio_name
+	= fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
+		       fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
+				    ni_name,
+				    build_int_cst (TREE_TYPE (ni_name),
+						   vf - 1)),
+		       log_vf);
+    }
   if (!is_gimple_val (ratio_name))
     {
       var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
@@ -6545,6 +6619,279 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   return;
 }
 
+/* Below are functions designed for masking a vectorized loop so that no
+   epilogue loop is generated.  Only loads/stores and statements whose
+   result is live (conditional reduction is not yet supported) are masked
+   to preserve computational semantics.
+
+   Create an induction variable with initial value {0, 1, ..., vf-1}
+   and step {vf, vf, ..., vf} which will be used for mask computation.  */
+
+static tree
+gen_vec_iv_for_masking (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int elem_size = current_vector_size / vf * BITS_PER_UNIT;
+  bool insert_after;
+  gimple_stmt_iterator gsi;
+  tree type = build_nonstandard_integer_type (elem_size, 0);
+  tree vectype;
+  tree *vtemp;
+  tree vec_init, step, vec_step;
+  int k;
+  tree indx_before_incr;
+
+  gcc_assert (type);
+  vectype = get_vectype_for_scalar_type (type);
+  gcc_assert (vectype);
+  /* Create initialization vector VEC_INIT.  */
+  vtemp = XALLOCAVEC (tree, vf);
+  for (k = 0; k < vf; k++)
+    vtemp[k] = build_int_cst (type, k);
+  vec_init = build_vector (vectype, vtemp);
+
+  /* Create vector STEP with elements equal to VF.  */
+  step = build_int_cst (type, vf);
+  vec_step = build_vector_from_val (vectype, step);
+
+  /* Create an induction variable including a phi node.  */
+  standard_iv_increment_position (loop, &gsi, &insert_after);
+  create_iv (vec_init, vec_step, NULL, loop, &gsi,
+	     insert_after, &indx_before_incr, NULL);
+  return indx_before_incr;
+}
+
+/* Create a vector mask through comparison of VEC_IV with a vector
+   built from the number of iterations; the mask will be used for
+   masking statements in the loop.  */
+
+static tree
+gen_vec_mask_for_loop (loop_vec_info loop_vinfo, tree vec_iv)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = TREE_TYPE (vec_iv);
+  tree type = TREE_TYPE (vectype);
+  tree niters = LOOP_VINFO_NITERS (loop_vinfo);
+  tree ni_type = TREE_TYPE (niters);
+  tree ni;
+  tree vec_niters;
+  gimple *stmt;
+  edge pe = loop_preheader_edge (loop);
+  basic_block bb;
+
+  if (TREE_CODE (niters) != SSA_NAME)
+    niters = vect_build_loop_niters (loop_vinfo);
+
+  if (!types_compatible_p (type, ni_type))
+    {
+      /* Need to convert NITERS to TYPE.  */
+      unsigned sz = tree_to_uhwi (TYPE_SIZE (type));
+      unsigned ni_sz = tree_to_uhwi (TYPE_SIZE (ni_type));
+      edge pe = loop_preheader_edge (loop);
+      enum tree_code cop = (sz == ni_sz) ? NOP_EXPR : CONVERT_EXPR;
+      ni = make_ssa_name (type);
+      stmt = gimple_build_assign (ni, cop, niters);
+      bb = gsi_insert_on_edge_immediate (pe, stmt);
+      gcc_assert (!bb);
+    }
+  else
+    ni = niters;
+
+  /* Create a vector consisting of NITERS.  */
+  tree new_vec = build_vector_from_val (vectype, ni);
+  tree new_var = vect_get_new_vect_var (vectype, vect_simple_var, "cst_");
+  stmt = gimple_build_assign (new_var, new_vec);
+  vec_niters = make_ssa_name (new_var, stmt);
+  gimple_assign_set_lhs (stmt, vec_niters);
+  bb = gsi_insert_on_edge_immediate (pe, stmt);
+  gcc_assert (!bb);
+
+  /* Now create the resulting mask through the comparison VEC_IV < VEC_NITERS
+     and put the produced statement at the beginning of the loop header.  */
+  tree s_vectype = build_same_sized_truth_vector_type (vectype);
+  tree vec_mask = vect_get_new_vect_var (s_vectype,
+					 vect_simple_var,
+					 "vec_mask_");
+  stmt = gimple_build_assign (vec_mask, LT_EXPR, vec_iv, vec_niters);
+  tree vec_res = make_ssa_name (vec_mask, stmt);
+  gimple_assign_set_lhs (stmt, vec_res);
+  gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+  return vec_res;
+}
+
+/* Convert vector load/store to masked form.  */
+
+static void
+mask_vect_load_store (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree addr, ref;
+  gimple_stmt_iterator gsi;
+
+  tree lhs, ptr;
+  gsi = gsi_for_stmt (stmt);
+  if (gimple_store_p (stmt))
+    {
+      lhs = gimple_assign_rhs1 (stmt);
+      ref = gimple_assign_lhs (stmt);
+    }
+  else
+    {
+      lhs = gimple_assign_lhs (stmt);
+      ref = gimple_assign_rhs1 (stmt);
+    }
+  addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+				   true, NULL_TREE, true,
+				   GSI_SAME_STMT);
+  ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+  if (!SSA_NAME_PTR_INFO (addr))
+    copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
+  if (gimple_store_p (stmt))
+    new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					    mask, lhs);
+  else
+    {
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3,
+					     addr, ptr, mask);
+      gimple_call_set_lhs (new_stmt, lhs);
+    }
+  gsi_replace (&gsi, new_stmt, false);
+}
+
+/* Convert vectorized reductions to VEC_COND statements to preserve
+   reduction semantic:
+	s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
+
+static void
+convert_reductions_for_masking (loop_vec_info loop_vinfo, tree mask)
+{
+  unsigned i;
+  for (i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
+    {
+      gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
+      gimple_stmt_iterator gsi;
+      tree vectype;
+      tree lhs, rhs;
+      tree var, new_lhs, vec_cond_expr;
+      gimple *new_stmt, *def;
+      stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+      stmt = STMT_VINFO_VEC_STMT (stmt_info);
+      lhs = gimple_assign_lhs (stmt);
+      vectype = TREE_TYPE (lhs);
+      /* Find operand  RHS defined by PHI node.  */
+      rhs = gimple_assign_rhs1 (stmt);
+      gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+      def = SSA_NAME_DEF_STMT (rhs);
+      if (gimple_code (def) != GIMPLE_PHI)
+	{
+	  rhs = gimple_assign_rhs2 (stmt);
+	  gcc_assert (TREE_CODE (rhs) == SSA_NAME);
+	  def = SSA_NAME_DEF_STMT (rhs);
+	  gcc_assert (gimple_code (def) == GIMPLE_PHI);
+	}
+      /* Convert reduction stmt to ordinary assignment to NEW_LHS.  */
+      var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      new_lhs = make_ssa_name (var, stmt);
+      gimple_assign_set_lhs (stmt, new_lhs);
+
+      /* Create new statement with VEC_COND expr and insert it after STMT.  */
+      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
+      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
+      gsi = gsi_for_stmt (stmt);
+      gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
+    }
+}
+
+/* Conjoin MASK with the mask argument of masked load/store STMT.  */
+
+static void
+fix_mask_for_masked_ld_st (gimple *stmt, tree mask)
+{
+  gimple *new_stmt;
+  tree old, lhs, vectype, var, n_lhs;
+  gimple_stmt_iterator gsi;
+
+  gsi = gsi_for_stmt (stmt);
+  old = gimple_call_arg (stmt, 2);
+  vectype = TREE_TYPE (old);
+  if (TREE_TYPE (mask) != vectype)
+    {
+      /* Need to convert MASK to type of OLD.  */
+      tree n_var;
+      tree conv_expr;
+      n_var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+      conv_expr = build1 (VIEW_CONVERT_EXPR, vectype, mask);
+      new_stmt = gimple_build_assign (n_var, conv_expr);
+      n_lhs = make_ssa_name (n_var);
+      gimple_assign_set_lhs (new_stmt, n_lhs);
+      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+    }
+  else
+    n_lhs = mask;
+  var = vect_get_new_vect_var (vectype, vect_simple_var, NULL);
+  new_stmt = gimple_build_assign (var, BIT_AND_EXPR, n_lhs, old);
+  lhs = make_ssa_name (var, new_stmt);
+  gimple_assign_set_lhs (new_stmt, lhs);
+  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+  gimple_call_set_arg (stmt, 2, lhs);
+  update_stmt (stmt);
+}
+
+
+/* Perform masking to provide correct execution of statements for all
+   iterations, the number of which was adjusted so that no epilogue
+   loop is generated.  */
+
+static void
+mask_loop_for_epilogue (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
+  tree vec_iv, vec_mask;
+
+  gcc_assert (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo));
+  vec_iv = gen_vec_iv_for_masking (loop_vinfo);
+  vec_mask = gen_vec_mask_for_loop (loop_vinfo, vec_iv);
+  gcc_assert (vec_mask);
+
+  /* Convert reduction statements if any.  */
+  if (!LOOP_VINFO_REDUCTIONS (loop_vinfo).is_empty ())
+     convert_reductions_for_masking (loop_vinfo, vec_mask);
+
+  /* Scan all loop statements to convert vector loads/stores, including
+     those already in masked form.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+      for (gimple_stmt_iterator si = gsi_start_bb (bb);
+	   !gsi_end_p (si); gsi_next (&si))
+	{
+	  gimple *stmt = gsi_stmt (si);
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt)
+	      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+		  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
+	      && VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
+	    {
+	      fix_mask_for_masked_ld_st (stmt, vec_mask);
+	      continue;
+	    }
+	  if (gimple_code (stmt) != GIMPLE_ASSIGN)
+	    continue;
+	  if (!VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
+	    continue;
+	  if (gimple_assign_load_p (stmt)
+	      || gimple_store_p (stmt))
+	    mask_vect_load_store (stmt, vec_mask);
+	}
+    }
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "=== Loop was masked for epilogue ===\n");
+}
 
 /* Function vect_transform_loop.
 
@@ -6575,6 +6922,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
 
+  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+
   /* If profile is inprecise, we have chance to fix it up.  */
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
@@ -6595,6 +6945,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
       check_profitability = true;
     }
 
+  if (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    /* Need to check ability to create vector iv to mask main loop.  */
+    if (!vect_can_build_vector_iv (loop_vinfo))
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+
   /* Version the loop first, if required, so the profitability check
      comes first.  */
 
@@ -6634,11 +6989,16 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     {
       tree ratio_mult_vf;
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
 				       &ratio);
-      vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, ratio_mult_vf,
-				      th, check_profitability);
+      /* If epilogue is combined with main loop peeling is not needed.  */
+      if (!LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, ratio_mult_vf,
+					th, check_profitability);
     }
   else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
@@ -6646,7 +7006,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   else
     {
       if (!ni_name)
-	ni_name = vect_build_loop_niters (loop_vinfo);
+	{
+	  ni_name = vect_build_loop_niters (loop_vinfo);
+	  LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
+	}
       vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
     }
 
@@ -6900,6 +7263,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, ratio);
 
+  if (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    mask_loop_for_epilogue (loop_vinfo);
+
   /* Reduce loop iterations by the vectorization factor.  */
   scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
 		      expected_iterations / vectorization_factor);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index abcd9a4..2306bc4 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "builtins.h"
 #include "internal-fn.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -574,6 +575,38 @@ process_use (gimple *stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
   return true;
 }
 
+/* Return true if STMT can be converted to masked form.  */
+
+static bool
+can_mask_load_store (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  tree vectype, mask_vectype;
+  tree lhs, ref;
+
+  if (!stmt_info)
+    return false;
+  lhs = gimple_assign_lhs (stmt);
+  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
+  if (may_be_nonaddressable_p (ref))
+    return false;
+  vectype = STMT_VINFO_VECTYPE (stmt_info);
+  mask_vectype = build_same_sized_truth_vector_type (vectype);
+  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
+				  TYPE_MODE (mask_vectype),
+				  gimple_assign_load_p (stmt)))
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "Statement can't be masked.\n");
+	  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
+	}
+
+       return false;
+    }
+  return true;
+}
 
 /* Function vect_mark_stmts_to_be_vectorized.
 
@@ -1753,6 +1786,8 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 			 "multiple types in nested loop.");
       return false;
     }
+  else if (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) && ncopies > 1)
+    LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
 
   if (!STMT_VINFO_RELEVANT_P (stmt_info))
     return false;
@@ -1828,6 +1863,15 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 	       && !useless_type_conversion_p (vectype, rhs_vectype)))
     return false;
 
+  if (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      /* Check that mask conjunction is supported.  */
+      optab tab;
+      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
@@ -1836,6 +1880,13 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 			       NULL, NULL, NULL);
       else
 	vect_model_load_cost (stmt_info, ncopies, false, NULL, NULL, NULL);
+      if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  unsigned cost = builtin_vectorization_cost (masking_vec_stmt,
+						      mask_vectype,
+						      0);
+	  LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo) += cost;
+	}
       return true;
     }
 
@@ -2829,6 +2880,9 @@ vectorizable_simd_clone_call (gimple *stmt, gimple_stmt_iterator *gsi,
   if (slp_node || PURE_SLP_STMT (stmt_info))
     return false;
 
+  /* Masked clones are not yet supported.  */
+  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+
   /* Process function arguments.  */
   nargs = gimple_call_num_args (stmt);
 
@@ -5295,6 +5349,15 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 			 "multiple types in nested loop.\n");
       return false;
     }
+  else if (ncopies > 1 && loop_vinfo
+           && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                         "Mask loop for epilogue: multiple types are not yet"
+			 " supported.\n");
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+    }
 
   op = gimple_assign_rhs1 (stmt);
   if (!vect_is_simple_use (op, vinfo, &def_stmt, &dt))
@@ -5350,6 +5413,15 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 				 "negative step and reversing not supported.\n");
 	      return false;
 	    }
+	  if (loop_vinfo
+	      && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	    {
+	      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "Mask loop for epilogue: negative step"
+				 " is not supported.");
+	    }
 	}
     }
 
@@ -5358,6 +5430,16 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       grouped_store = true;
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
       group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+      if (loop_vinfo
+	  && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "Mask loop for epilogue: grouped access"
+			     " is not supported." );
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	}
+
       if (!slp
 	  && !PURE_SLP_STMT (stmt_info)
 	  && !STMT_VINFO_STRIDED_P (stmt_info))
@@ -5413,8 +5495,29 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
                              "scatter index use not simple.");
 	  return false;
 	}
+      if (loop_vinfo
+	  && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "Mask loop for epilogue: gather/scatter is"
+			     " not supported.");
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	}
     }
 
+  if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo)
+      && (slp || PURE_SLP_STMT (stmt_info) || STMT_VINFO_STRIDED_P (stmt_info)
+	  || !can_mask_load_store (stmt)))
+    {
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "Mask loop for epilogue: strided store is not"
+			 " supported.");
+    }
+
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
@@ -5422,6 +5525,14 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       if (!PURE_SLP_STMT (stmt_info))
 	vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt,
 			       NULL, NULL, NULL);
+      if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  unsigned cost = builtin_vectorization_cost (masking_vec_store,
+						      vectype,
+						      0);
+	  LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo) += cost;
+	}
+
       return true;
     }
 
@@ -6292,6 +6403,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
                          "multiple types in nested loop.\n");
       return false;
     }
+  else if (ncopies > 1 && loop_vinfo
+	   && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "Mask loop for epilogue: multiple types are not"
+			 " supported.");
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+    }
 
   /* Invalidate assumptions made by dependence analysis when vectorization
      on the unrolled body effectively re-orders stmts.  */
@@ -6326,6 +6446,16 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       grouped_load = true;
       /* FORNOW */
       gcc_assert (!nested_in_vect_loop && !STMT_VINFO_GATHER_SCATTER_P (stmt_info));
+      /* Not yet supported.  */
+      if (loop_vinfo
+	  && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "Mask loop for epilogue: grouped access is not"
+			     " supported.");
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	}
 
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
 
@@ -6424,6 +6554,17 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       gather_decl = vect_check_gather_scatter (stmt, loop_vinfo, &gather_base,
 					       &gather_off, &gather_scale);
       gcc_assert (gather_decl);
+      if (loop_vinfo
+	  && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			    "Mask loop for epilogue: gather/scatter is not"
+			    " supported.");
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	}
+
+
       if (!vect_is_simple_use (gather_off, vinfo, &def_stmt, &gather_dt,
 			       &gather_off_vectype))
 	{
@@ -6435,6 +6576,16 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
     }
   else if (STMT_VINFO_STRIDED_P (stmt_info))
     {
+      if (loop_vinfo
+	  && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "Mask loop for epilogue: strided load is not"
+			     " supported.");
+	}
+
       if ((grouped_load
 	   && (slp || PURE_SLP_STMT (stmt_info)))
 	  && (group_size > nunits
@@ -6486,8 +6637,19 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
                                  "\n");
 	      return false;
 	    }
+	  if (loop_vinfo
+	      && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			        "Negative step for masking.\n");
+	      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
+	    }
 	}
     }
+    if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo)
+        && (slp || PURE_SLP_STMT (stmt_info) || !can_mask_load_store (stmt)))
+      LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo) = false;
 
   if (!vec_stmt) /* transformation not required.  */
     {
@@ -6496,6 +6658,14 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       if (!PURE_SLP_STMT (stmt_info))
 	vect_model_load_cost (stmt_info, ncopies, load_lanes_p,
 			      NULL, NULL, NULL);
+      if (loop_vinfo && LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE (loop_vinfo))
+	{
+	  unsigned cost = builtin_vectorization_cost (masking_vec_load,
+						      vectype,
+						      0);
+	  LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST (loop_vinfo) += cost;
+	}
+
       return true;
     }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index b07f270..8be8792 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -292,6 +292,9 @@ typedef struct _loop_vec_info : public vec_info {
   /* Cost of a single scalar iteration.  */
   int single_scalar_iteration_cost;
 
+  /* Additional cost of masking main loop for epilogue.  */
+  int additional_loop_body_cost;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -326,6 +329,9 @@ typedef struct _loop_vec_info : public vec_info {
      vectorize this, so this field would be false.  */
   bool no_data_dependencies;
 
+  /* Flag to combine main loop with epilogue.  */
+  bool mask_main_loop_for_epilogue;
+
   /* If if-conversion versioned this loop before conversion, this is the
      loop version without if-conversion.  */
   struct loop *scalar_loop;
@@ -367,6 +373,10 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_SCALAR_LOOP(L)	   (L)->scalar_loop
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
+#define LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST(L) \
+  (L)->additional_loop_body_cost
+#define LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE(L) \
+  (L)->mask_main_loop_for_epilogue
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
   ((L)->may_misalign_stmts.length () > 0)


* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-12-15 16:41                 ` Yuri Rumyantsev
@ 2016-01-11 10:07                   ` Yuri Rumyantsev
  2016-02-09 16:10                   ` Ilya Enkovich
  1 sibling, 0 replies; 17+ messages in thread
From: Yuri Rumyantsev @ 2016-01-11 10:07 UTC (permalink / raw)
  To: Richard Biener; +Cc: Ilya Enkovich, gcc-patches, Jeff Law, Igor Zamyatin

Hi Richard,

Did you have a chance to look at this updated patch?

Thanks.
Yuri.

2015-12-15 19:41 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Hi Richard,
>
> I re-designed the patch to determine the ability of loop masking on the
> fly during vectorization analysis and to invoke it after loop transformation.
> Test-case is also provided.
>
> what is your opinion?
>
> Thanks.
> Yuri.
>
> ChangeLog::
> 2015-12-15  Yuri Rumyantsev  <ysrumyan@gmail.com>
>
> * config/i386/i386.c (ix86_builtin_vectorization_cost): Add handling
> of new cases.
> * config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): Add new target
> macros.
> * config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New
> tuner.
> * params.def (PARAM_VECT_COST_INCREASE_THRESHOLD): New parameter.
> * target.h (enum vect_cost_for_stmt): Add new elements.
> * targhooks.c (default_builtin_vectorization_cost): Extend switch for
> new enum elements.
> * tree-vect-loop.c : Include 3 header files.
> (vect_analyze_loop_operations): Add new fields initialization and
> resetting, add computation of profitability for masking loop for
> epilogue.
> (vectorizable_reduction): Determine ability of reduction masking
> and compute its cost.
> (vect_can_build_vector_iv): New function.
> (vect_generate_tmps_on_preheader): Adjust computation of ratio depending
> on epilogue generation.
> (gen_vec_iv_for_masking): New function.
> (gen_vec_mask_for_loop): Likewise.
> (mask_vect_load_store): Likewise.
> (convert_reductions_for_masking): Likewise.
> (fix_mask_for_masked_ld_st): Likewise.
> (mask_loop_for_epilogue): Likewise.
> (vect_transform_loop): Do not perform loop masking if it requires
> peeling for gaps, add check on ability masking of loop, turn off
> loop peeling if loop masking is performed, save recomputed NITERS to
> the corresponding field of loop_vec_info, invoke mask_loop_for_epilogue
> after vectorization if masking is possible.
> * tree-vect-stmts.c : Include tree-ssa-loop-ivopts.h.
> (can_mask_load_store): New function.
> (vectorizable_mask_load_store): Determine ability of load/store
> masking and compute its cost.
> (vectorizable_load):  Likewise.
> * tree-vectorizer.h (additional_loop_body_cost): New field of
> loop_vec_info.
> (mask_main_loop_for_epilogue): Likewise.
> (LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST): New macros.
> (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE): Likewise.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/vect-mask-loop_for_epilogue1.c: New test.
>
> 2015-11-30 18:03 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
>> Richard,
>>
>> Thanks a lot for your detailed comments!
>>
>> Few words about 436.cactusADM gain. The loop which was transformed for
>> avx2 is very huge and this is the last inner-most loop in routine
>> Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366). If you don't
>> have sources, let me know.
>>
>> Yuri.
>>
>> 2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi Richard,
>>>>
>>>> Here is an updated version of the patch which (1) is in sync with trunk
>>>> compiler and (2) contains simple cost model to estimate profitability
>>>> of scalar epilogue elimination. The part related to vectorization of
>>>> loops with small trip count is in process of developing. Note that
>>>> implemented cost model was not tuned  well for HASWELL and KNL but we
>>>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>>>
>>> Ok, so I don't know where to start with this.
>>>
>>> First of all while I wanted to have the actual stmt processing to be
>>> as post-processing
>>> on the vectorized loop body I didn't want to have this competely separated from
>>> vectorizing.
>>>
>>> So, do combine_vect_loop_remainder () from vect_transform_loop, not by iterating
>>> over all (vectorized) loops at the end.
>>>
>>> Second, all the adjustments of the number of iterations for the vector
>>> loop should
>>> be integrated into the main vectorization scheme as should determining the
>>> cost of the predication.  So you'll end up adding a
>>> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
>>> cost analysis and during code generation adjust vector iteration computation
>>> accordingly and _not_ generate the epilogue loop (or wire it up correctly in
>>> the first place).
>>>
>>> The actual stmt processing should then still happen in a similar way as you do.
>>>
>>> So I'm going to comment on that part only as I expect the rest will look a lot
>>> different.
>>>
>>> +/* Generate induction_vector which will be used to mask evaluation.  */
>>> +
>>> +static tree
>>> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
>>> +{
>>>
>>> please make use of create_iv.  Add more comments.  I reverse-engineered
>>> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
>>> in gen_mask_for_remainder
>>> by comparing it against { niter, ..., niter }.
>>>
>>> +  gsi = gsi_after_labels (loop->header);
>>> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>>> +          ? LOOP_VINFO_NITERS (loop_vinfo)
>>> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>>>
>>> that's either wrong or unnecessary.  if ! peeling for alignment
>>> loop-vinfo-niters
>>> is equal to loop-vinfo-niters-unchanged.
>>>
>>> +      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
>>> +      if (!SSA_NAME_PTR_INFO (addr))
>>> +       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>>>
>>> vect_duplicate_ssa_name_ptr_info.
>>>
>>> +
>>> +static void
>>> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
>>> +{
>>> +  gimple *stmt, *new_stmt;
>>> +  tree old, lhs, vectype, var, n_lhs;
>>>
>>> no comment?  what's this for.
>>>
>>> +/* Convert vectorized reductions to VEC_COND statements to preserve
>>> +   reduction semantic:
>>> +       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
>>> +
>>> +static void
>>> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
>>> +{
>>>
>>> for reductions it looks like preserving the last iteration x plus the mask
>>> could avoid predicating it this way and compensate in the reduction
>>> epilogue by "subtracting" x & mask?  With true predication support
>>> that'll likely be more expensive of course.
>>>
>>> +      /* Generate new VEC_COND expr.  */
>>> +      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>> +      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>>>
>>> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>>
>>> +/* Return true if MEM_REF is incremented by vector size and false
>>> otherwise.  */
>>> +
>>> +static bool
>>> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>>
>>> what?!  Just look at DR_STEP of the store?
>>>
>>>
>>> +void
>>> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>> +  auto_vec<gimple *, 10> loads;
>>> +  auto_vec<gimple *, 5> stores;
>>>
>>> so you need to re-structure this in a way that it computes
>>>
>>>   a) whether it can perform the operation - and you need to do that
>>>       reliably before the operation has taken place
>>>   b) its cost
>>>
>>> instead of looking at def types or gimple_assign_load/store_p predicates
>>> please look at STMT_VINFO_TYPE instead.
>>>
>>> I don't like the new target hook for the costing.  We do need some major
>>> re-structuring in the vectorizer cost model implementation, this doesn't go
>>> into the right direction.
>>>
>>> A simplistic hook following the current scheme would have used
>>> the vect_cost_for_stmt as argument and mirror builtin_vectorization_cost.
>>>
>>> There is not a single testcase in the patch.  I would have expected one that
>>> makes sure we keep the 6% speedup for cactusADM at least.
>>>
>>>
>>> So this was a 45minute "overall" review not going into all the
>>> implementation details.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>> Richard,
>>>>>>>>
>>>>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>>>>> scheme of remainder will be used through introducing new IV :
>>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>>>
>>>>>>>> Is it true or we missed something?
>>>>>>>
>>>>>>> <quote>
>>>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>>>> > for both 4b and 4c?
>>>>>>>>
>>>>>>>> Do 2a ...
>>>>>>> Okay.
>>>>>>> </quote>
>>>>>>
>>>>>> 2a was 'transform already vectorized loop as a separate
>>>>>> post-processing'. Isn't it what this prototype patch implements?
>>>>>> Current version only masks loop body which is in practice applicable
>>>>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>>>>> how profitable masking might be and it is a main target for the first
>>>>>> masking version.  Extending it to prologues/epilogues and thus making
>>>>>> it more profitable for other targets is the next step and is out of
>>>>>> the scope of this patch.
>>>>>
>>>>> Ok, technically the prototype transforms the already vectorized loop.
>>>>> Of course I meant the vectorized loop be copied, masked and that
>>>>> result used as epilogue...
>>>>>
>>>>> I'll queue a more detailed look into the patch for this week.
>>>>>
>>>>> Did you perform any measurements with this patch like # of
>>>>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Ilya
>>>>>>
>>>>>>>
>>>>>>> Richard.
>>>>>>>


* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2015-12-15 16:41                 ` Yuri Rumyantsev
  2016-01-11 10:07                   ` Yuri Rumyantsev
@ 2016-02-09 16:10                   ` Ilya Enkovich
  2016-02-09 16:18                     ` Jeff Law
  1 sibling, 1 reply; 17+ messages in thread
From: Ilya Enkovich @ 2016-02-09 16:10 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Richard Biener, gcc-patches, Jeff Law, Igor Zamyatin

2015-12-15 19:41 GMT+03:00 Yuri Rumyantsev <ysrumyan@gmail.com>:
> Hi Richard,
>
> I re-designed the patch to determine the ability of loop masking on the
> fly during vectorization analysis and to invoke it after loop transformation.
> Test-case is also provided.
>
> what is your opinion?
>
> Thanks.
> Yuri.
>

Hi,

I'm going to start work on extending this patch to handle mixed mask sizes,
support vectorization of the peeled loop tail, and fix the profitability
estimation to choose the proper loop tail processing.  Here is a short list
of the planned changes:

1. Don't put any restriction on the mask type when checking if a statement
can be masked.  Instead just store all required masks in
LOOP_VINFO_REQUIRED_MASKS.  After all statements are checked we additionally
check that all required masks can be produced (we have proper comparison,
widening and narrowing support); a sketch of this bookkeeping follows the list.

2. In vect_estimate_min_profitable_iters compute the overhead of mask
creation, decide what to do with the loop tail (nothing, vectorize it, or
combine it with the loop body), and additionally return the number of tail
iterations required for the chosen tail processing to be profitable.

3. In vect_transform_loop, depending on the chosen strategy, either mask the
whole loop or produce a vectorized tail.  For now it's not fully clear to me
what the best way to get a vectorized tail is.
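
To make item 1 a bit more concrete, here is a minimal sketch of the intended
bookkeeping.  LOOP_VINFO_REQUIRED_MASKS and both helper names are hypothetical
placeholders; the availability check only reuses the BIT_AND_EXPR optab test
the current patch already performs for masked loads/stores.  The widening and
narrowing checks mentioned above would hook in at the same place.

static void
record_required_mask (loop_vec_info loop_vinfo, tree vectype)
{
  /* Remember the mask vector type this statement needs; whether it can
     actually be produced is checked only after all statements have been
     analyzed.  */
  tree mask_vectype = build_same_sized_truth_vector_type (vectype);
  LOOP_VINFO_REQUIRED_MASKS (loop_vinfo).safe_push (mask_vectype);
}

static bool
required_masks_supported_p (loop_vec_info loop_vinfo)
{
  unsigned i;
  tree mask_vectype;
  FOR_EACH_VEC_ELT (LOOP_VINFO_REQUIRED_MASKS (loop_vinfo), i, mask_vectype)
    {
      /* Masks get combined via BIT_AND_EXPR, so at least that operation
         must be supported for every required mask type.  */
      optab tab = optab_for_tree_code (BIT_AND_EXPR, mask_vectype,
                                       optab_default);
      if (!tab
          || optab_handler (tab, TYPE_MODE (mask_vectype)) == CODE_FOR_nothing)
        return false;
    }
  return true;
}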

The first option is to just peel one iteration after the loop is vectorized.
But in our masking functions we use LOOP_VINFO and STMT_VINFO structures
which we lose during peeling.

Another option is to peel the scalar loop and then just run the vectorizer
one more time to vectorize and mask it.

Also we may peel the vectorized loop and use the original version (with all
STMT_VINFO still available) as a tail and the peeled version as a main loop.

Currently I think the best option is to peel the scalar loop and run the
vectorizer one more time for it.  This option is simpler and can also be used
to vectorize the loop tail with a smaller vector size when the target doesn't
support masking or masking is not profitable.  A rough outline of this flow
follows.
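
None of the code below exists in the vectorizer; it is only a hedged outline
of that preferred option.  vect_analyze_loop and vect_transform_loop are
assumed to be the usual analysis/transform entry points, and everything
marked "hypothetical" is invented purely for illustration.

static void
vectorize_loop_and_tail (struct loop *loop)
{
  loop_vec_info loop_vinfo = vect_analyze_loop (loop);
  if (!loop_vinfo)
    return;

  /* Transform the main loop as today; peeling for niters produces a
     scalar epilogue loop.  */
  vect_transform_loop (loop_vinfo);
  struct loop *tail = get_scalar_epilogue_loop (loop_vinfo); /* hypothetical */
  if (!tail)
    return;

  /* Analyze the scalar tail again, this time allowing a masked body or a
     smaller vector size (hypothetical knobs on the analysis), and transform
     it only if the cost model agrees.  */
  loop_vec_info tail_vinfo = vect_analyze_loop (tail);
  if (tail_vinfo && masked_tail_profitable_p (tail_vinfo)) /* hypothetical */
    vect_transform_loop (tail_vinfo);
}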

Any comments?

Thanks,
Ilya


* Re: [RFC] Combine vectorized loops with its scalar remainder.
  2016-02-09 16:10                   ` Ilya Enkovich
@ 2016-02-09 16:18                     ` Jeff Law
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff Law @ 2016-02-09 16:18 UTC (permalink / raw)
  To: Ilya Enkovich, Yuri Rumyantsev; +Cc: Richard Biener, gcc-patches, Igor Zamyatin

On 02/09/2016 09:09 AM, Ilya Enkovich wrote:
>
> Another option is to peel scalar loop and then just run vectorizer
> one more time to vectorize and mask it.
>
> Also we may peel vectorized loop and use original version (with all
> STMT_VINFO still available) as a tail and peeled version as a main
> loop.
>
> Currently I think the best option is to peel scalar loop and run
> vectorizer one more time for it. This option is simpler and can also
> be used to vectorize loop tail with a smaller vector size when target
> doesn't support masking or masking is not profitable.
In general, a path where we have peeling & masking as an option seems 
wise.  The sense I've gotten from rth was that there's going to be 
classes of loops where that's going to be the best option.

jeff


Thread overview: 17+ messages
2015-10-28 10:57 [RFC] Combine vectorized loops with its scalar remainder Yuri Rumyantsev
2015-11-03 10:08 ` Richard Henderson
2015-11-03 10:35   ` Yuri Rumyantsev
2015-11-03 11:47 ` Richard Biener
2015-11-03 12:08   ` Yuri Rumyantsev
2015-11-10 12:30     ` Richard Biener
2015-11-10 13:02       ` Ilya Enkovich
2015-11-10 14:52         ` Richard Biener
2015-11-13 10:36           ` Yuri Rumyantsev
2015-11-23 15:54             ` Yuri Rumyantsev
2015-11-24  9:21               ` Richard Biener
2015-11-27 13:49             ` Richard Biener
2015-11-30 15:04               ` Yuri Rumyantsev
2015-12-15 16:41                 ` Yuri Rumyantsev
2016-01-11 10:07                   ` Yuri Rumyantsev
2016-02-09 16:10                   ` Ilya Enkovich
2016-02-09 16:18                     ` Jeff Law
