From: Ilya Enkovich
To: Jeff Law
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
Date: Tue, 28 Jun 2016 13:37:00 -0000
Message-ID: <20160628122439.GB4143@msticlxl57.ims.intel.com>
References: <20160519194450.GH40563@msticlxl57.ims.intel.com> <18ccae1a-30c3-c23c-e28f-287f9d41eaa0@redhat.com>
In-Reply-To: <18ccae1a-30c3-c23c-e28f-287f9d41eaa0@redhat.com>

On 16 Jun 10:54, Jeff Law wrote:
> On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
> >Hi,
> >
> >This patch introduces support for loop epilogue combining. This includes
> >support in cost estimation and all required changes required to mask
> >vectorized loop.
> >
> >Thanks,
> >Ilya
> >--
> >gcc/
> >
> >2016-05-19  Ilya Enkovich
> >
> > 	* dbgcnt.def (vect_tail_combine): New.
> > 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> > 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> > 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> > 	epilogue combined with loop body.
> > 	(vect_do_peeling_for_loop_bound): Likewise.
> > 	* tree-vect-loop.c Include alias.h and dbgcnt.h.
> > (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters > > arg, compute number of iterations for which loop epilogue combining is > > profitable. > > (vect_generate_tmps_on_preheader): Support combined apilogue. > > (vect_gen_ivs_for_masking): New. > > (vect_get_mask_index_for_elems): New. > > (vect_get_mask_index_for_type): New. > > (vect_gen_loop_masks): New. > > (vect_mask_reduction_stmt): New. > > (vect_mask_mask_load_store_stmt): New. > > (vect_mask_load_store_stmt): New. > > (vect_combine_loop_epilogue): New. > > (vect_transform_loop): Support combined apilogue. > > > > > >diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > >index fab5879..b3c0668 100644 > >--- a/gcc/tree-vect-loop-manip.c > >+++ b/gcc/tree-vect-loop-manip.c > >@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > > bb_between_loops = new_exit_bb; > > bb_after_second_loop = split_edge (single_exit (second_loop)); > > > >- pre_condition = > >- fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > >- skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > >- bb_after_second_loop, bb_before_first_loop, > >- inverse_probability (second_guard_probability)); > >+ if (skip_second_after_first) > >+ /* We can just redirect edge from bb_between_loops to > >+ bb_after_second_loop but we have many code assuming > >+ we have a guard after the first loop. So just make > >+ always taken condtion. */ > >+ pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, > >+ integer_zero_node); > This isn't ideal, but I don't think it's that big of an issue. > > >@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > basic_block preheader; > > int loop_num; > > int max_iter; > >+ int bound2; > > tree cond_expr = NULL_TREE; > > gimple_seq cond_expr_stmt_list = NULL; > >+ bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); > > > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > >@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > > > loop_num = loop->num; > > > >+ bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); > Can you document what the TH parameter is to the various routines that use > it in tree-vect-loop-manip.c? I realize you didn't add it, but it would > help anyone looking at this code in the future to know it's the threshold of > iterations for vectorization without having to find it in other function > comment headers ;-) > > That's pre-approved to go in immediately :-) > > >@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > > ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 > > : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; > >- if (check_profitability) > >+ /* When epilogue is combined only profitability > >+ treshold matters. */ > s/treshold/threshold/ > > > > > static void > > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > > int *ret_min_profitable_niters, > >- int *ret_min_profitable_estimate) > >+ int *ret_min_profitable_estimate, > >+ int *ret_min_profitable_combine_niters) > I'm torn a bit here. There's all kinds of things missing/incomplete in the > function comments throughout the vectorizer. And in some cases, like this > one, the parameters are largely self-documenting. But we've also got coding > standards that we'd like to adhere to. 
> > I don't think it's fair to require you to fix all these issues in the > vectorizer (though if you wanted to, I'd fully support those an independent > cleanups). > > Perhaps just document LOOP_VINFO with a generic comment about the ret_* > parameters for this function rather than a comment for each ret_* parameter. > Pre-approved for the trunk independent of the vec-tails work. > > > >@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > > min_profitable_estimate); > > > >+ > >+ unsigned combine_treshold > >+ = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); > >+ /* Calculate profitability combining epilogue with the main loop. > >+ We have a threshold for inside cost overhead (not applied > >+ for low trip count loop case): > >+ MIC * 100 < VIC * CT > >+ Masked iteration should be better than a scalar prologue: > >+ MIC + VIC < SIC * epilogue_niters */ > Can you double-check the whitespace formatting here. Where does the "100" > come from and should it be a param? I checked the formatting. We have 100 here because combine_treshold is measured in percent. E.g. value 2 means iterations masking overhead shouldn't exceed 2% of vector iteration cost. > > > >@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > > return; > > } > > > > >+ > >+/* Function vect_gen_loop_masks. > >+ > >+ Create masks to mask a loop desvribed by LOOP_VINFO. Masks > s/desvribed/described/ > > >+ are created according to LOOP_VINFO_REQUIRED_MASKS and are stored > >+ into MASKS vector. > >+ > >+ Index of a mask in a vector is computed according to a number > >+ of masks's elements. Masks are sorted by number of its elements > >+ in descending order. Index 0 is used to access a mask with > >+ current_vector_size elements. Among masks with the same number > >+ of elements the one with lower index is used to mask iterations > >+ with smaller iteration counter. Note that you may get NULL elements > >+ for masks which are not required. Use vect_get_mask_index_for_elems > >+ or vect_get_mask_index_for_type to access resulting vector. */ > >+ > >+static void > >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) > >+{ > >+ struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > >+ edge pe = loop_preheader_edge (loop); > >+ tree niters = LOOP_VINFO_NITERS (loop_vinfo); > >+ unsigned min_mask_elems, max_mask_elems, nmasks; > >+ unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; > >+ auto_vec ivs; > >+ tree vectype, mask_type; > >+ tree vec_niters, vec_niters_val, mask; > >+ gimple *stmt; > >+ basic_block bb; > >+ gimple_stmt_iterator gsi = gsi_after_labels (loop->header); > >+ unsigned vec_size; > >+ > >+ /* Create required IVs. */ > >+ vect_gen_ivs_for_masking (loop_vinfo, &ivs); > >+ vectype = TREE_TYPE (ivs[0]); > >+ > >+ vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); > >+ iv_elems = TYPE_VECTOR_SUBPARTS (vectype); > >+ > >+ /* Get a proper niter to build a vector. */ > >+ if (!is_gimple_val (niters)) > >+ { > >+ gimple_seq seq = NULL; > >+ niters = force_gimple_operand (niters, &seq, true, NULL); > >+ gsi_insert_seq_on_edge_immediate (pe, seq); > >+ } > >+ /* We may need a type cast in case niter has a too small type > >+ for generated IVs. */ > Nit. There should be vertical whitespace after the close brace and the > comment for the next logical block of code. Can you do a scan over the > patchkit looking for other instances where the vertical whitespace is > needed. 
>
> Generally, if you find that a blob of code needs a comment, then the comment
> and blob of code should have that vertical whitespace to visually separate
> it from everything else.
>
> >+/* Function vect_combine_loop_epilogue.
> >+
> >+   Combine loop epilogue with the main vectorized body.  It requires
> >+   masking of memory accesses and reductions. */
> So you mask reductions, loads & stores.  Is there anything else that we
> might potentially need to mask to combine the loop & epilogue via masking?
>
>
> I don't see anything particularly worrisome here either -- I have a slight
> concern about correctness issues with only masking loads/stores and
> reductions.  But I will defer to your judgment on whether or not there's
> other stuff that we need to mask to combine the epilogue with the loop via
> masking.

We have to mask operations which may cause errors if executed speculatively.
For others we just ignore the produced result.  So we don't truly mask
reductions but fix up their results.  I assume memory accesses are the only
ones we have to truly mask (plus non-const calls, which are rejected now).
For signalling arithmetic I assumed we just don't vectorize it.  Basically
we should act similar to if-conversion.  I'll check whether it has
restrictions I missed.

>
> Jeff

Here is an updated patch version.

Thanks,
Ilya
--
gcc/

2016-05-28  Ilya Enkovich

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): Likewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add
	ret_min_profitable_combine_niters arg, compute number of iterations
	for which loop epilogue combining is profitable.
	(vect_generate_tmps_on_preheader): Support combined epilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined epilogue.
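To make the masking scheme discussed above more concrete, here is a small
illustrative example before the patch itself.  It is not part of the patch
and is deliberately written in plain scalar C rather than GIMPLE: it only
models what the combined loop computes for a simple sum reduction, assuming
VF = 4.  The mask corresponds to the IV < {niters, ...} comparison produced
by vect_gen_loop_masks, the guarded load stands for IFN_MASK_LOAD, and
skipping the update in inactive lanes is what the VEC_COND_EXPR fix-up of
the reduction achieves.  The function name masked_sum and the explicit lane
loop are made up for the example.

#include <stdio.h>

#define VF 4	/* assumed vectorization factor for the example */

static int
masked_sum (const int *a, int n)
{
  int sum[VF] = { 0, 0, 0, 0 };	/* models the vector accumulator */

  /* The combined loop runs ceil (n / VF) iterations; the last, partial
     iteration executes under a mask instead of in a scalar epilogue.  */
  for (int i = 0; i < n; i += VF)
    for (int lane = 0; lane < VF; lane++)
      {
	/* mask = {i, i+1, ...} < {n, n, ...}, cf. vect_gen_loop_masks.  */
	int mask = (i + lane) < n;

	/* Masked load plus reduction fix-up: inactive lanes keep the old
	   accumulator value.  */
	if (mask)
	  sum[lane] += a[i + lane];
      }

  return sum[0] + sum[1] + sum[2] + sum[3];
}

int
main (void)
{
  int a[7] = { 1, 2, 3, 4, 5, 6, 7 };
  printf ("%d\n", masked_sum (a, 7));	/* prints 28; no scalar epilogue */
  return 0;
}

On the cost side, with the default vect-cost-increase-combine-threshold of
10 the first check in vect_estimate_min_profitable_iters becomes
MIC * 100 < VIC * 10, i.e. combining is rejected as soon as the masking
overhead per iteration exceeds 10% of the vector body cost; the second check
additionally requires a single masked vector iteration to be cheaper than
the scalar epilogue iterations it replaces.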
diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index 78ddcc2..73c2966 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert) DEBUG_COUNTER (tree_sra) DEBUG_COUNTER (vect_loop) DEBUG_COUNTER (vect_slp) +DEBUG_COUNTER (vect_tail_combine) DEBUG_COUNTER (dom_unreachable_edges) diff --git a/gcc/params.def b/gcc/params.def index 62a1e40..98d6c5a 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS, "Maximum number of may-defs visited when devirtualizing " "speculatively", 50, 0, 0) +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD, + "vect-cost-increase-combine-threshold", + "Cost increase threshold to mask main loop for epilogue.", + 10, 0, 300) + /* Local variables: diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index a902a50..26e0cc1 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -4007,6 +4007,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name) case vect_scalar_var: prefix = "stmp"; break; + case vect_mask_var: + prefix = "mask"; + break; case vect_pointer_var: prefix = "vectp"; break; diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index c26aa1d..7403686 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, int first_guard_probability = 2 * REG_BR_PROB_BASE / 3; int second_guard_probability = 2 * REG_BR_PROB_BASE / 3; int probability_of_second_loop; + bool skip_second_after_first = false; if (!slpeel_can_duplicate_loop_p (loop, e)) return NULL; @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, { loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo); - unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; + unsigned limit = 0; + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + skip_second_after_first = true; + else + limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) limit = limit + 1; if (check_profitability @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, bb_between_loops = new_exit_bb; bb_after_second_loop = split_edge (single_exit (second_loop)); - pre_condition = - fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); - skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, - bb_after_second_loop, bb_before_first_loop, - inverse_probability (second_guard_probability)); + if (skip_second_after_first) + /* We can just redirect edge from bb_between_loops to + bb_after_second_loop but we have many code assuming + we have a guard after the first loop. So just make + always taken condtion. 
*/ + pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, + integer_zero_node); + else + pre_condition = + fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); + skip_e + = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, + bb_after_second_loop, bb_before_first_loop, + inverse_probability (second_guard_probability)); scale_loop_profile (second_loop, probability_of_second_loop, bound2); slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop, second_loop == new_loop, &new_exit_bb); @@ -1762,8 +1776,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, basic_block preheader; int loop_num; int max_iter; + int bound2; tree cond_expr = NULL_TREE; gimple_seq cond_expr_stmt_list = NULL; + bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -1773,12 +1789,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, loop_num = loop->num; + bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); new_loop = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop), &ratio_mult_vf_name, ni_name, false, th, check_profitability, cond_expr, cond_expr_stmt_list, - 0, LOOP_VINFO_VECT_FACTOR (loop_vinfo)); + 0, bound2); gcc_assert (new_loop); gcc_assert (loop_num == loop->num); slpeel_checking_verify_cfg_after_peeling (loop, new_loop); @@ -1807,7 +1824,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; - if (check_profitability) + + /* When epilogue is combined only profitability + threshold matters. */ + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + max_iter = (int) th - 1; + else if (check_profitability) max_iter = MAX (max_iter, (int) th - 1); record_niter_bound (new_loop, max_iter, false, true); dump_printf (MSG_NOTE, @@ -2044,7 +2066,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name, bound, 0); gcc_assert (new_loop); - slpeel_checking_verify_cfg_after_peeling (new_loop, loop); + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + slpeel_checking_verify_cfg_after_peeling (new_loop, loop); /* For vectorization factor N, we need to copy at most N-1 values for alignment and this means N-2 loopback edge executions. */ max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2; diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 41b9380..08fad82 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "cgraph.h" #include "tree-if-conv.h" +#include "alias.h" +#include "dbgcnt.h" /* Loop Vectorization Pass. @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3. If not see http://gcc.gnu.org/projects/tree-ssa/vectorization.html */ -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *); +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *, + int *); /* Function vect_determine_vectorization_factor @@ -2310,8 +2313,10 @@ start_over: /* Analyze cost. Decide if worth while to vectorize. 
*/ int min_profitable_estimate, min_profitable_iters; + int min_profitable_combine_iters; vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, - &min_profitable_estimate); + &min_profitable_estimate, + &min_profitable_combine_iters); if (min_profitable_iters < 0) { @@ -2420,6 +2425,52 @@ start_over: gcc_assert (vectorization_factor == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo)); + if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) + { + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false; + } + else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && min_profitable_combine_iters >= 0) + { + if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && (LOOP_VINFO_INT_NITERS (loop_vinfo) + >= (unsigned) min_profitable_combine_iters)) + || estimated_niter == -1 + || estimated_niter >= min_profitable_combine_iters) + && dbg_cnt (vect_tail_combine)) + { + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true; + + dump_printf_loc (MSG_NOTE, vect_location, + "Decided to combine loop with its epilogue.\n"); + + /* We need to adjust profitability check if combine + epilogue considering additional vector iteration + and profitable combine iterations. */ + if ((int)(min_profitable_combine_iters + vectorization_factor) + > min_scalar_loop_bound) + { + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) + = (unsigned) min_profitable_combine_iters; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Updated runtime profitability treshold: %d\n", + min_profitable_combine_iters); + + } + } + else + { + if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Not combined loop with epilogue: iterations " + "count is too low (threshold is %d).\n", + min_profitable_combine_iters); + } + } + /* Ok to vectorize! */ return true; @@ -3392,12 +3443,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, profitability check. *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used - for static check against estimated number of iterations. */ + for static check against estimated number of iterations. + + *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold + of iterations for vectorization with combined loop epilogue. -1 means + combining is not profitable. Value may be used fo dynamic profitability + check. */ static void vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, int *ret_min_profitable_niters, - int *ret_min_profitable_estimate) + int *ret_min_profitable_estimate, + int *ret_min_profitable_combine_niters) { int min_profitable_iters; int min_profitable_estimate; @@ -3641,6 +3698,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, vec_prologue_cost); dump_printf (MSG_NOTE, " Vector epilogue cost: %d\n", vec_epilogue_cost); + dump_printf (MSG_NOTE, " Masking prologue cost: %d\n", + masking_prologue_cost); + dump_printf (MSG_NOTE, " Masking inside cost: %d\n", + masking_inside_cost); dump_printf (MSG_NOTE, " Scalar iteration cost: %d\n", scalar_single_iter_cost); dump_printf (MSG_NOTE, " Scalar outside cost: %d\n", @@ -3744,6 +3805,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, min_profitable_estimate); *ret_min_profitable_estimate = min_profitable_estimate; + + *ret_min_profitable_combine_niters = -1; + + /* Don't try to vectorize epilogue of epilogue. 
*/ + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + return; + + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) + { + if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED) + { + if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) + *ret_min_profitable_combine_niters = 0; + return; + } + + unsigned combine_treshold + = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); + /* Calculate profitability combining epilogue with the main loop. + We have a threshold for inside cost overhead (not applied + for low trip count loop case): + MIC * 100 < VIC * CT + Masked iteration should be better than a scalar prologue: + MIC + VIC < SIC * epilogue_niters */ + if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold) + { + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Combining loop with epilogue is not " + "profitable.\n"); + dump_printf_loc (MSG_NOTE, vect_location, + " Combining overhead %d%% exceeds " + "treshold %d%%.\n", + masking_inside_cost * 100 / vec_inside_cost, + combine_treshold); + } + *ret_min_profitable_combine_niters = -1; + } + else if ((int)(masking_inside_cost + vec_inside_cost) + >= scalar_single_iter_cost * peel_iters_epilogue) + { + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Combining loop with epilogue is not " + "profitable.\n"); + dump_printf_loc (MSG_NOTE, vect_location, + " Scalar epilogue is faster than a " + "single masked iteration.\n"); + } + *ret_min_profitable_combine_niters = -1; + } + else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) + { + int inside_cost = vec_inside_cost + masking_inside_cost; + int outside_cost = vec_outside_cost + masking_prologue_cost; + int profitable_iters = ((outside_cost - scalar_outside_cost) * vf + - inside_cost * peel_iters_prologue + - inside_cost * peel_iters_epilogue) + / ((scalar_single_iter_cost * vf) + - inside_cost); + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Combinig loop with epilogue " + "pofitability treshold = %d\n", + profitable_iters); + *ret_min_profitable_combine_niters = profitable_iters; + } + } } /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET @@ -6852,20 +6984,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, else ni_minus_gap_name = ni_name; - /* Create: ratio = ni >> log2(vf) */ - /* ??? As we have ni == number of latch executions + 1, ni could - have overflown to zero. So avoid computing ratio based on ni - but compute it using the fact that we know ratio will be at least - one, thus via (ni - vf) >> log2(vf) + 1. */ - ratio_name - = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), - fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), - fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name), - ni_minus_gap_name, - build_int_cst - (TREE_TYPE (ni_name), vf)), - log_vf), - build_int_cst (TREE_TYPE (ni_name), 1)); + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + { + /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop. */ + gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + ratio_name + = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), + fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), + ni_name, + build_int_cst (TREE_TYPE (ni_name), + vf - 1)), + log_vf); + } + else + { + /* Create: ratio = ni >> log2(vf) */ + /* ??? As we have ni == number of latch executions + 1, ni could + have overflown to zero. So avoid computing ratio based on ni + but compute it using the fact that we know ratio will be at least + one, thus via (ni - vf) >> log2(vf) + 1. 
*/ + ratio_name + = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), + fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), + fold_build2 (MINUS_EXPR, + TREE_TYPE (ni_name), + ni_minus_gap_name, + build_int_cst + (TREE_TYPE (ni_name), vf)), + log_vf), + build_int_cst (TREE_TYPE (ni_name), 1)); + } + if (!is_gimple_val (ratio_name)) { var = create_tmp_var (TREE_TYPE (ni_name), "bnd"); @@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, return; } +/* Function vect_gen_ivs_for_masking. + + Create IVs to be used for masks computation to mask loop described + by LOOP_VINFO. Created IVs are stored in IVS vector. . + + Initial IV values is {0, 1, ..., VF - 1} (probably split into several + vectors, in this case IVS's elements with lower index hold IV with + smaller numbers). IV step is {VF, VF, ..., VF}. VF is a used + vectorization factor. */ + +static void +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec *ivs) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + tree vectype = vect_get_masking_iv_type (loop_vinfo); + tree type = TREE_TYPE (vectype); + int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); + int ncopies = vf / elems; + int i, k; + tree iv, init_val, step_val; + bool insert_after; + gimple_stmt_iterator gsi; + tree *vtemp; + + /* Create {VF, ..., VF} vector constant. */ + step_val = build_vector_from_val (vectype, build_int_cst (type, vf)); + + vtemp = XALLOCAVEC (tree, vf); + for (i = 0; i < ncopies; i++) + { + /* Create initial IV value. */ + for (k = 0; k < vf; k++) + vtemp[k] = build_int_cst (type, k + i * elems); + init_val = build_vector (vectype, vtemp); + + /* Create an inductive variable including phi node. */ + standard_iv_increment_position (loop, &gsi, &insert_after); + create_iv (init_val, step_val, NULL, loop, &gsi, insert_after, + &iv, NULL); + ivs->safe_push (iv); + } +} + +/* Function vect_get_mask_index_for_elems. + + A helper function to access masks vector. See vect_gen_loop_masks + for masks vector sorting description. Return index of the first + mask having MASK_ELEMS elements. */ + +static inline unsigned +vect_get_mask_index_for_elems (unsigned mask_elems) +{ + return current_vector_size / mask_elems - 1; +} + +/* Function vect_get_mask_index_for_type. + + A helper function to access masks vector. See vect_gen_loop_masks + for masks vector sorting description. Return index of the first + mask appropriate for VECTYPE. */ + +static inline unsigned +vect_get_mask_index_for_type (tree vectype) +{ + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); + return vect_get_mask_index_for_elems (elems); +} + +/* Function vect_gen_loop_masks. + + Create masks to mask a loop described by LOOP_VINFO. Masks + are created according to LOOP_VINFO_REQUIRED_MASKS and are stored + into MASKS vector. + + Index of a mask in a vector is computed according to a number + of masks's elements. Masks are sorted by number of its elements + in descending order. Index 0 is used to access a mask with + current_vector_size elements. Among masks with the same number + of elements the one with lower index is used to mask iterations + with smaller iteration counter. Note that you may get NULL elements + for masks which are not required. Use vect_get_mask_index_for_elems + or vect_get_mask_index_for_type to access resulting vector. 
*/ + +static void +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + edge pe = loop_preheader_edge (loop); + tree niters = LOOP_VINFO_NITERS (loop_vinfo); + unsigned min_mask_elems, max_mask_elems, nmasks; + unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; + auto_vec ivs; + tree vectype, mask_type; + tree vec_niters, vec_niters_val, mask; + gimple *stmt; + basic_block bb; + gimple_stmt_iterator gsi = gsi_after_labels (loop->header); + unsigned vec_size; + + /* Create required IVs. */ + vect_gen_ivs_for_masking (loop_vinfo, &ivs); + vectype = TREE_TYPE (ivs[0]); + + vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); + iv_elems = TYPE_VECTOR_SUBPARTS (vectype); + + /* Get a proper niter to build a vector. */ + if (!is_gimple_val (niters)) + { + gimple_seq seq = NULL; + niters = force_gimple_operand (niters, &seq, true, NULL); + gsi_insert_seq_on_edge_immediate (pe, seq); + } + + /* We may need a type cast in case niter has a too small type + for generated IVs. */ + if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters))) + { + tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype), + NULL, "niters"); + stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters); + bb = gsi_insert_on_edge_immediate (pe, stmt); + gcc_assert (!bb); + niters = new_niters; + } + + /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME. */ + vec_niters_val = build_vector_from_val (vectype, niters); + vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters"); + stmt = gimple_build_assign (vec_niters, vec_niters_val); + bb = gsi_insert_on_edge_immediate (pe, stmt); + gcc_assert (!bb); + + /* Determine which masks we need to compute and how many. */ + vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems); + nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2); + masks->safe_grow_cleared (nmasks); + + /* Now create base masks through comparison IV < VEC_NITERS. */ + mask_type = build_same_sized_truth_vector_type (vectype); + cur_mask = vect_get_mask_index_for_elems (iv_elems); + for (unsigned i = 0; i < ivs.length (); i++) + { + tree iv = ivs[i]; + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + + /* Create narrowed masks. */ + cur_mask_elems = iv_elems; + nmasks = ivs.length (); + while (cur_mask_elems < max_mask_elems) + { + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + cur_mask_elems <<= 1; + nmasks >>= 1; + + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); + + for (unsigned i = 0; i < nmasks; i++) + { + tree mask_low = (*masks)[prev_mask++]; + tree mask_hi = (*masks)[prev_mask++]; + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR, + mask_low, mask_hi); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + } + + /* Created widened masks. 
*/ + cur_mask_elems = iv_elems; + nmasks = ivs.length (); + while (cur_mask_elems > min_mask_elems) + { + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + cur_mask_elems >>= 1; + nmasks <<= 1; + + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); + + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); + + for (unsigned i = 0; i < nmasks; i += 2) + { + tree orig_mask = (*masks)[prev_mask++]; + + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); + stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + (*masks)[cur_mask++] = mask; + } + } +} + +/* Function vect_mask_reduction_stmt. + + Mask given vectorized reduction statement STMT using + MASK. In case scalar reduction statement is vectorized + into several vector statements then PREV holds a + preceding vetor statement copy for STMT. + + Masking is performed using VEC_COND_EXPR. E.g. + + S1: r_1 = r_2 + d_3 + + is transformed into: + + S1': r_4 = r_2 + d_3 + S2': r_1 = VEC_COND_EXPR + + Return generated condition statement. */ + +static gimple * +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev) +{ + gimple_stmt_iterator gsi; + tree vectype; + tree lhs, rhs, tmp; + gimple *new_stmt, *phi; + + lhs = gimple_assign_lhs (stmt); + vectype = TREE_TYPE (lhs); + + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); + + /* Find operand RHS defined by PHI node. */ + rhs = gimple_assign_rhs1 (stmt); + gcc_assert (TREE_CODE (rhs) == SSA_NAME); + phi = SSA_NAME_DEF_STMT (rhs); + + if (phi != prev && gimple_code (phi) != GIMPLE_PHI) + { + rhs = gimple_assign_rhs2 (stmt); + gcc_assert (TREE_CODE (rhs) == SSA_NAME); + phi = SSA_NAME_DEF_STMT (rhs); + gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI); + } + + /* Convert reduction stmt to ordinary assignment to TMP. */ + tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL); + gimple_assign_set_lhs (stmt, tmp); + + /* Create VEC_COND_EXPR and insert it after STMT. */ + new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs); + gsi = gsi_for_stmt (stmt); + gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT); + + return new_stmt; +} + +/* Function vect_mask_mask_load_store_stmt. + + Mask given vectorized MASK_LOAD or MASK_STORE statement + STMT using MASK. Function replaces a mask used by STMT + with its conjunction with MASK. */ + +static void +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask) +{ + gimple *new_stmt; + tree old_mask, new_mask; + gimple_stmt_iterator gsi; + + gsi = gsi_for_stmt (stmt); + old_mask = gimple_call_arg (stmt, 2); + + gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask))); + + new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL); + new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask); + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); + + gimple_call_set_arg (stmt, 2, new_mask); + update_stmt (stmt); +} + + +/* Function vect_mask_load_store_stmt. + + Mask given vectorized load or store statement STMT using + MASK. DR is a data reference for a scalar memory access. + Assignment is transformed into MASK_LOAD or MASK_STORE + statement. SI is either an iterator pointing to STMT and + is to be updated or NULL. 
*/ + +static void +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask, + data_reference *dr, gimple_stmt_iterator *si) +{ + tree mem, val, addr, ptr; + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); + unsigned align, misalign; + tree elem_type = TREE_TYPE (vectype); + gimple *new_stmt; + + gcc_assert (!si || gsi_stmt (*si) == stmt); + + gsi = gsi_for_stmt (stmt); + if (gimple_store_p (stmt)) + { + val = gimple_assign_rhs1 (stmt); + mem = gimple_assign_lhs (stmt); + } + else + { + val = gimple_assign_lhs (stmt); + mem = gimple_assign_rhs1 (stmt); + } + + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); + + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem), + true, NULL_TREE, true, + GSI_SAME_STMT); + + align = TYPE_ALIGN_UNIT (vectype); + if (aligned_access_p (dr)) + misalign = 0; + else if (DR_MISALIGNMENT (dr) == -1) + { + align = TYPE_ALIGN_UNIT (elem_type); + misalign = 0; + } + else + misalign = DR_MISALIGNMENT (dr); + set_ptr_info_alignment (get_ptr_info (addr), align, misalign); + ptr = build_int_cst (reference_alias_ptr_type (mem), + misalign ? misalign & -misalign : align); + + if (gimple_store_p (stmt)) + new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr, + mask, val); + else + { + new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr, + mask); + gimple_call_set_lhs (new_stmt, val); + } + gsi_replace (si ? si : &gsi, new_stmt, false); +} + +/* Function vect_combine_loop_epilogue. + + Combine loop epilogue with the main vectorized body. It requires + masking of memory accesses and reductions. */ + +static void +vect_combine_loop_epilogue (loop_vec_info loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); + unsigned mask_no; + auto_vec masks; + + vect_gen_loop_masks (loop_vinfo, &masks); + + /* Convert reduction statements if any. */ + for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++) + { + gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i]; + gimple *prev_stmt = NULL; + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + + mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info)); + + stmt = STMT_VINFO_VEC_STMT (stmt_info); + while (stmt) + { + prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++], + prev_stmt); + stmt_info = vinfo_for_stmt (stmt); + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; + } + } + + /* Scan all loop statements to convert vector load/store including masked + form. */ + for (unsigned i = 0; i < loop->num_nodes; i++) + { + basic_block bb = bbs[i]; + for (gimple_stmt_iterator si = gsi_start_bb (bb); + !gsi_end_p (si); gsi_next (&si)) + { + gimple *stmt = gsi_stmt (si); + stmt_vec_info stmt_info = NULL; + tree vectype = NULL; + data_reference *dr; + + /* Mask load case. */ + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2)))) + { + stmt_info = vinfo_for_stmt (stmt); + if (!STMT_VINFO_VEC_STMT (stmt_info)) + continue; + stmt = STMT_VINFO_VEC_STMT (stmt_info); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + /* Mask store case. 
*/ + else if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_MASK_STORE + && vinfo_for_stmt (stmt) + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) + { + stmt_info = vinfo_for_stmt (stmt); + vectype = TREE_TYPE (gimple_call_arg (stmt, 2)); + } + /* Load case. */ + else if (gimple_assign_load_p (stmt) + && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt)))) + { + stmt_info = vinfo_for_stmt (stmt); + + /* Skip vector loads. */ + if (!STMT_VINFO_VEC_STMT (stmt_info)) + continue; + + /* Skip invariant loads. */ + if (integer_zerop (nested_in_vect_loop_p (loop, stmt) + ? STMT_VINFO_DR_STEP (stmt_info) + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))) + continue; + stmt = STMT_VINFO_VEC_STMT (stmt_info); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + /* Store case. */ + else if (gimple_code (stmt) == GIMPLE_ASSIGN + && gimple_store_p (stmt) + && vinfo_for_stmt (stmt) + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) + { + stmt_info = vinfo_for_stmt (stmt); + vectype = STMT_VINFO_VECTYPE (stmt_info); + } + else + continue; + + /* Skip hoisted out statements. */ + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt))) + continue; + + mask_no = vect_get_mask_index_for_type (vectype); + + dr = STMT_VINFO_DATA_REF (stmt_info); + while (stmt) + { + if (is_gimple_call (stmt)) + vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]); + else + vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr, + /* Have to update iterator only if + it points to stmt we mask. */ + stmt == gsi_stmt (si) ? &si : NULL); + + stmt_info = vinfo_for_stmt (stmt); + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; + } + } + } + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "=== Loop epilogue was combined ===\n"); +} /* Function vect_transform_loop. @@ -6936,7 +7568,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) run at least the vectorization factor number of times checking is pointless, too. */ th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 + if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 + || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) + && th > 1)) && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { if (dump_enabled_p ()) @@ -6985,12 +7619,18 @@ vect_transform_loop (loop_vec_info loop_vinfo) { tree ratio_mult_vf; if (!ni_name) - ni_name = vect_build_loop_niters (loop_vinfo); + { + ni_name = vect_build_loop_niters (loop_vinfo); + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; + } vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf, &ratio); - epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, - ratio_mult_vf, th, - check_profitability); + /* If epilogue is combined with main loop peeling is not needed. 
*/ + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) + || check_profitability) + epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, + ratio_mult_vf, th, + check_profitability); } else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)), @@ -6998,7 +7638,10 @@ vect_transform_loop (loop_vec_info loop_vinfo) else { if (!ni_name) - ni_name = vect_build_loop_niters (loop_vinfo); + { + ni_name = vect_build_loop_niters (loop_vinfo); + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; + } vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio); } @@ -7252,6 +7895,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) slpeel_make_loop_iterate_ntimes (loop, ratio); + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + vect_combine_loop_epilogue (loop_vinfo); + /* Reduce loop iterations by the vectorization factor. */ scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), expected_iterations / vectorization_factor); @@ -7263,20 +7909,28 @@ vect_transform_loop (loop_vec_info loop_vinfo) loop->nb_iterations_likely_upper_bound = loop->nb_iterations_likely_upper_bound - 1; } - loop->nb_iterations_upper_bound - = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, - vectorization_factor) - 1; - loop->nb_iterations_likely_upper_bound - = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1, - vectorization_factor) - 1; + + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + loop->nb_iterations_upper_bound + = wi::div_ceil (loop->nb_iterations_upper_bound + 1, + vectorization_factor, UNSIGNED) - 1; + else + loop->nb_iterations_upper_bound + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, + vectorization_factor) - 1; if (loop->any_estimate) { - loop->nb_iterations_estimate - = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) - && loop->nb_iterations_estimate != 0) - loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1; + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) + loop->nb_iterations_estimate + = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor, + UNSIGNED); + else + loop->nb_iterations_estimate + = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + && loop->nb_iterations_estimate != 0) + loop->nb_iterations_estimate -= 1; } if (dump_enabled_p ())