From: Richard Biener
Date: Wed, 15 Jun 2016 11:44:00 -0000
Subject: Re: [PATCH, vec-tails 07/10] Support loop epilogue combining
To: Ilya Enkovich
Cc: GCC Patches
In-Reply-To: <20160519194450.GH40563@msticlxl57.ims.intel.com>
References: <20160519194450.GH40563@msticlxl57.ims.intel.com>

On Thu, May 19, 2016 at 9:44 PM, Ilya Enkovich wrote:
> Hi,
>
> This patch introduces support for loop epilogue combining. This includes
> support in cost estimation and all changes required to mask the
> vectorized loop.

I wonder why you compute a minimum number of iterations for which masking
the vectorized body is profitable rather than a maximum number of
iterations.  I'd say masking the vectorized loop is profitable if
niter/vf * masking-overhead < epilogue-cost.  Masking the epilogue is
profitable if vectorizing the epilogue with masking is profitable.
Am I missing something?

Thanks,
Richard.

> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich
>
> 	* dbgcnt.def (vect_tail_combine): New.
> 	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> 	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support
> 	vect_mask_var.
> 	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> 	epilogue combined with loop body.
> 	(vect_do_peeling_for_loop_bound): Likewise.
> 	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
> 	(vect_estimate_min_profitable_iters): Add
> 	ret_min_profitable_combine_niters arg, compute number of iterations
> 	for which loop epilogue combining is profitable.
> (vect_generate_tmps_on_preheader): Support combined apilogue. > (vect_gen_ivs_for_masking): New. > (vect_get_mask_index_for_elems): New. > (vect_get_mask_index_for_type): New. > (vect_gen_loop_masks): New. > (vect_mask_reduction_stmt): New. > (vect_mask_mask_load_store_stmt): New. > (vect_mask_load_store_stmt): New. > (vect_combine_loop_epilogue): New. > (vect_transform_loop): Support combined apilogue. > > > diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def > index 78ddcc2..73c2966 100644 > --- a/gcc/dbgcnt.def > +++ b/gcc/dbgcnt.def > @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert) > DEBUG_COUNTER (tree_sra) > DEBUG_COUNTER (vect_loop) > DEBUG_COUNTER (vect_slp) > +DEBUG_COUNTER (vect_tail_combine) > DEBUG_COUNTER (dom_unreachable_edges) > diff --git a/gcc/params.def b/gcc/params.def > index 62a1e40..98d6c5a 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS, > "Maximum number of may-defs visited when devirtualizing " > "speculatively", 50, 0, 0) > > +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD, > + "vect-cost-increase-combine-threshold", > + "Cost increase threshold to mask main loop for epilogue.", > + 10, 0, 300) > + > /* > > Local variables: > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c > index f275933..c5bdeb9 100644 > --- a/gcc/tree-vect-data-refs.c > +++ b/gcc/tree-vect-data-refs.c > @@ -4000,6 +4000,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name) > case vect_scalar_var: > prefix = "stmp"; > break; > + case vect_mask_var: > + prefix = "mask"; > + break; > case vect_pointer_var: > prefix = "vectp"; > break; > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > index fab5879..b3c0668 100644 > --- a/gcc/tree-vect-loop-manip.c > +++ b/gcc/tree-vect-loop-manip.c > @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > int first_guard_probability = 2 * REG_BR_PROB_BASE / 3; > int second_guard_probability = 2 * REG_BR_PROB_BASE / 3; > int probability_of_second_loop; > + bool skip_second_after_first = false; > > if (!slpeel_can_duplicate_loop_p (loop, e)) > return NULL; > @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > { > loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); > tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo); > - unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; > + unsigned limit = 0; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + skip_second_after_first = true; > + else > + limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1; > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) > limit = limit + 1; > if (check_profitability > @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop, > bb_between_loops = new_exit_bb; > bb_after_second_loop = split_edge (single_exit (second_loop)); > > - pre_condition = > - fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > - skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > - bb_after_second_loop, bb_before_first_loop, > - inverse_probability (second_guard_probability)); > + if (skip_second_after_first) > + /* We can just redirect edge from bb_between_loops to > + bb_after_second_loop but we have many code assuming > + we have a guard after the first loop. So just make > + always taken condtion. 
*/ > + pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node, > + integer_zero_node); > + else > + pre_condition = > + fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters); > + skip_e > + = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL, > + bb_after_second_loop, bb_before_first_loop, > + inverse_probability (second_guard_probability)); > scale_loop_profile (second_loop, probability_of_second_loop, bound2); > slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop, > second_loop == new_loop, &new_exit_bb); > @@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > basic_block preheader; > int loop_num; > int max_iter; > + int bound2; > tree cond_expr = NULL_TREE; > gimple_seq cond_expr_stmt_list = NULL; > + bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo); > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > @@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > > loop_num = loop->num; > > + bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo); > new_loop > = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop), > &ratio_mult_vf_name, ni_name, false, > th, check_profitability, > cond_expr, cond_expr_stmt_list, > - 0, LOOP_VINFO_VECT_FACTOR (loop_vinfo)); > + 0, bound2); > gcc_assert (new_loop); > gcc_assert (loop_num == loop->num); > slpeel_checking_verify_cfg_after_peeling (loop, new_loop); > @@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, > max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2 > : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2; > - if (check_profitability) > + /* When epilogue is combined only profitability > + treshold matters. */ > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + max_iter = (int) th - 1; > + else if (check_profitability) > max_iter = MAX (max_iter, (int) th - 1); > record_niter_bound (new_loop, max_iter, false, true); > dump_printf (MSG_NOTE, > @@ -2036,7 +2057,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name, > bound, 0); > > gcc_assert (new_loop); > - slpeel_checking_verify_cfg_after_peeling (new_loop, loop); > + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + slpeel_checking_verify_cfg_after_peeling (new_loop, loop); > /* For vectorization factor N, we need to copy at most N-1 values > for alignment and this means N-2 loopback edge executions. */ > max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2; > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > index 31360d3..1a80c42 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see > #include "gimple-fold.h" > #include "cgraph.h" > #include "tree-if-conv.h" > +#include "alias.h" > +#include "dbgcnt.h" > > /* Loop Vectorization Pass. > > @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3. If not see > http://gcc.gnu.org/projects/tree-ssa/vectorization.html > */ > > -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *); > +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *, > + int *); > > /* Function vect_determine_vectorization_factor > > @@ -2304,8 +2307,10 @@ start_over: > > /* Analyze cost. Decide if worth while to vectorize. 
*/ > int min_profitable_estimate, min_profitable_iters; > + int min_profitable_combine_iters; > vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, > - &min_profitable_estimate); > + &min_profitable_estimate, > + &min_profitable_combine_iters); > > if (min_profitable_iters < 0) > { > @@ -2412,6 +2417,52 @@ start_over: > gcc_assert (vectorization_factor > == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo)); > > + if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) > + { > + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; > + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false; > + } > + else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) > + && min_profitable_combine_iters >= 0) > + { > + if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > + && (LOOP_VINFO_INT_NITERS (loop_vinfo) > + >= (unsigned) min_profitable_combine_iters)) > + || estimated_niter == -1 > + || estimated_niter >= min_profitable_combine_iters) > + && dbg_cnt (vect_tail_combine)) > + { > + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false; > + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true; > + > + dump_printf_loc (MSG_NOTE, vect_location, > + "Decided to combine loop with its epilogue.\n"); > + > + /* We need to adjust profitability check if combine > + epilogue considering additional vector iteration > + and profitable combine iterations. */ > + if ((int)(min_profitable_combine_iters + vectorization_factor) > + > min_scalar_loop_bound) > + { > + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) > + = (unsigned) min_profitable_combine_iters; > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Updated runtime profitability treshold: %d\n", > + min_profitable_combine_iters); > + > + } > + } > + else > + { > + if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Not combined loop with epilogue: iterations " > + "count is too low (threshold is %d).\n", > + min_profitable_combine_iters); > + } > + } > + > /* Ok to vectorize! */ > return true; > > @@ -3381,7 +3432,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, > static void > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > int *ret_min_profitable_niters, > - int *ret_min_profitable_estimate) > + int *ret_min_profitable_estimate, > + int *ret_min_profitable_combine_niters) > { > int min_profitable_iters; > int min_profitable_estimate; > @@ -3625,6 +3677,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > vec_prologue_cost); > dump_printf (MSG_NOTE, " Vector epilogue cost: %d\n", > vec_epilogue_cost); > + dump_printf (MSG_NOTE, " Masking prologue cost: %d\n", > + masking_prologue_cost); > + dump_printf (MSG_NOTE, " Masking inside cost: %d\n", > + masking_inside_cost); > dump_printf (MSG_NOTE, " Scalar iteration cost: %d\n", > scalar_single_iter_cost); > dump_printf (MSG_NOTE, " Scalar outside cost: %d\n", > @@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, > min_profitable_estimate); > > *ret_min_profitable_estimate = min_profitable_estimate; > + > + *ret_min_profitable_combine_niters = -1; > + > + /* Don't try to vectorize epilogue of epilogue. 
*/ > + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > + return; > + > + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) > + { > + if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED) > + { > + if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) > + *ret_min_profitable_combine_niters = 0; > + return; > + } > + > + unsigned combine_treshold > + = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD); > + /* Calculate profitability combining epilogue with the main loop. > + We have a threshold for inside cost overhead (not applied > + for low trip count loop case): > + MIC * 100 < VIC * CT > + Masked iteration should be better than a scalar prologue: > + MIC + VIC < SIC * epilogue_niters */ > + if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combining loop with epilogue is not " > + "profitable.\n"); > + dump_printf_loc (MSG_NOTE, vect_location, > + " Combining overhead %d%% exceeds " > + "treshold %d%%.\n", > + masking_inside_cost * 100 / vec_inside_cost, > + combine_treshold); > + } > + *ret_min_profitable_combine_niters = -1; > + } > + else if ((int)(masking_inside_cost + vec_inside_cost) > + >= scalar_single_iter_cost * peel_iters_epilogue) > + { > + if (dump_enabled_p ()) > + { > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combining loop with epilogue is not " > + "profitable.\n"); > + dump_printf_loc (MSG_NOTE, vect_location, > + " Scalar epilogue is faster than a " > + "single masked iteration.\n"); > + } > + *ret_min_profitable_combine_niters = -1; > + } > + else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE) > + { > + int inside_cost = vec_inside_cost + masking_inside_cost; > + int outside_cost = vec_outside_cost + masking_prologue_cost; > + int profitable_iters = ((outside_cost - scalar_outside_cost) * vf > + - inside_cost * peel_iters_prologue > + - inside_cost * peel_iters_epilogue) > + / ((scalar_single_iter_cost * vf) > + - inside_cost); > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Combinig loop with epilogue " > + "pofitability treshold = %d\n", > + profitable_iters); > + *ret_min_profitable_combine_niters = profitable_iters; > + } > + } > } > > /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET > @@ -6843,20 +6970,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > else > ni_minus_gap_name = ni_name; > > - /* Create: ratio = ni >> log2(vf) */ > - /* ??? As we have ni == number of latch executions + 1, ni could > - have overflown to zero. So avoid computing ratio based on ni > - but compute it using the fact that we know ratio will be at least > - one, thus via (ni - vf) >> log2(vf) + 1. */ > - ratio_name > - = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > - fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > - fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name), > - ni_minus_gap_name, > - build_int_cst > - (TREE_TYPE (ni_name), vf)), > - log_vf), > - build_int_cst (TREE_TYPE (ni_name), 1)); > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + { > + /* Create ni + (vf-1) >> log2(vf) if epilogue is combined with loop. */ > + gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); > + ratio_name > + = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > + fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > + ni_name, > + build_int_cst (TREE_TYPE (ni_name), > + vf - 1)), > + log_vf); > + } > + else > + { > + /* Create: ratio = ni >> log2(vf) */ > + /* ??? 
As we have ni == number of latch executions + 1, ni could > + have overflown to zero. So avoid computing ratio based on ni > + but compute it using the fact that we know ratio will be at least > + one, thus via (ni - vf) >> log2(vf) + 1. */ > + ratio_name > + = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name), > + fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name), > + fold_build2 (MINUS_EXPR, > + TREE_TYPE (ni_name), > + ni_minus_gap_name, > + build_int_cst > + (TREE_TYPE (ni_name), vf)), > + log_vf), > + build_int_cst (TREE_TYPE (ni_name), 1)); > + } > + > if (!is_gimple_val (ratio_name)) > { > var = create_tmp_var (TREE_TYPE (ni_name), "bnd"); > @@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, > return; > } > > +/* Function vect_gen_ivs_for_masking. > + > + Create IVs to be used for masks computation to mask loop described > + by LOOP_VINFO. Created IVs are stored in IVS vector. . > + > + Initial IV values is {0, 1, ..., VF - 1} (probably split into several > + vectors, in this case IVS's elements with lower index hold IV with > + smaller numbers). IV step is {VF, VF, ..., VF}. VF is a used > + vectorization factor. */ > + > +static void > +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec *ivs) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + tree vectype = vect_get_masking_iv_type (loop_vinfo); > + tree type = TREE_TYPE (vectype); > + int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); > + int ncopies = vf / elems; > + int i, k; > + tree iv, init_val, step_val; > + bool insert_after; > + gimple_stmt_iterator gsi; > + tree *vtemp; > + > + /* Create {VF, ..., VF} vector constant. */ > + step_val = build_vector_from_val (vectype, build_int_cst (type, vf)); > + > + vtemp = XALLOCAVEC (tree, vf); > + for (i = 0; i < ncopies; i++) > + { > + /* Create initial IV value. */ > + for (k = 0; k < vf; k++) > + vtemp[k] = build_int_cst (type, k + i * elems); > + init_val = build_vector (vectype, vtemp); > + > + /* Create an inductive variable including phi node. */ > + standard_iv_increment_position (loop, &gsi, &insert_after); > + create_iv (init_val, step_val, NULL, loop, &gsi, insert_after, > + &iv, NULL); > + ivs->safe_push (iv); > + } > +} > + > +/* Function vect_get_mask_index_for_elems. > + > + A helper function to access masks vector. See vect_gen_loop_masks > + for masks vector sorting description. Return index of the first > + mask having MASK_ELEMS elements. */ > + > +static inline unsigned > +vect_get_mask_index_for_elems (unsigned mask_elems) > +{ > + return current_vector_size / mask_elems - 1; > +} > + > +/* Function vect_get_mask_index_for_type. > + > + A helper function to access masks vector. See vect_gen_loop_masks > + for masks vector sorting description. Return index of the first > + mask appropriate for VECTYPE. */ > + > +static inline unsigned > +vect_get_mask_index_for_type (tree vectype) > +{ > + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype); > + return vect_get_mask_index_for_elems (elems); > +} > + > +/* Function vect_gen_loop_masks. > + > + Create masks to mask a loop desvribed by LOOP_VINFO. Masks > + are created according to LOOP_VINFO_REQUIRED_MASKS and are stored > + into MASKS vector. > + > + Index of a mask in a vector is computed according to a number > + of masks's elements. Masks are sorted by number of its elements > + in descending order. Index 0 is used to access a mask with > + current_vector_size elements. 
Among masks with the same number > + of elements the one with lower index is used to mask iterations > + with smaller iteration counter. Note that you may get NULL elements > + for masks which are not required. Use vect_get_mask_index_for_elems > + or vect_get_mask_index_for_type to access resulting vector. */ > + > +static void > +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec *masks) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + edge pe = loop_preheader_edge (loop); > + tree niters = LOOP_VINFO_NITERS (loop_vinfo); > + unsigned min_mask_elems, max_mask_elems, nmasks; > + unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems; > + auto_vec ivs; > + tree vectype, mask_type; > + tree vec_niters, vec_niters_val, mask; > + gimple *stmt; > + basic_block bb; > + gimple_stmt_iterator gsi = gsi_after_labels (loop->header); > + unsigned vec_size; > + > + /* Create required IVs. */ > + vect_gen_ivs_for_masking (loop_vinfo, &ivs); > + vectype = TREE_TYPE (ivs[0]); > + > + vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype)); > + iv_elems = TYPE_VECTOR_SUBPARTS (vectype); > + > + /* Get a proper niter to build a vector. */ > + if (!is_gimple_val (niters)) > + { > + gimple_seq seq = NULL; > + niters = force_gimple_operand (niters, &seq, true, NULL); > + gsi_insert_seq_on_edge_immediate (pe, seq); > + } > + /* We may need a type cast in case niter has a too small type > + for generated IVs. */ > + if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters))) > + { > + tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype), > + NULL, "niters"); > + stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters); > + bb = gsi_insert_on_edge_immediate (pe, stmt); > + gcc_assert (!bb); > + niters = new_niters; > + } > + /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME. */ > + vec_niters_val = build_vector_from_val (vectype, niters); > + vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters"); > + stmt = gimple_build_assign (vec_niters, vec_niters_val); > + bb = gsi_insert_on_edge_immediate (pe, stmt); > + gcc_assert (!bb); > + > + /* Determine which masks we need to compute and how many. */ > + vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems); > + nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2); > + masks->safe_grow_cleared (nmasks); > + > + /* Now create base masks through comparison IV < VEC_NITERS. */ > + mask_type = build_same_sized_truth_vector_type (vectype); > + cur_mask = vect_get_mask_index_for_elems (iv_elems); > + for (unsigned i = 0; i < ivs.length (); i++) > + { > + tree iv = ivs[i]; > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + > + /* Create narrowed masks. 
*/ > + cur_mask_elems = iv_elems; > + nmasks = ivs.length (); > + while (cur_mask_elems < max_mask_elems) > + { > + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + cur_mask_elems <<= 1; > + nmasks >>= 1; > + > + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); > + > + for (unsigned i = 0; i < nmasks; i++) > + { > + tree mask_low = (*masks)[prev_mask++]; > + tree mask_hi = (*masks)[prev_mask++]; > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR, > + mask_low, mask_hi); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + } > + > + /* Created widened masks. */ > + cur_mask_elems = iv_elems; > + nmasks = ivs.length (); > + while (cur_mask_elems > min_mask_elems) > + { > + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + cur_mask_elems >>= 1; > + nmasks <<= 1; > + > + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems); > + > + mask_type = build_truth_vector_type (cur_mask_elems, vec_size); > + > + for (unsigned i = 0; i < nmasks; i += 2) > + { > + tree orig_mask = (*masks)[prev_mask++]; > + > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + > + mask = vect_get_new_ssa_name (mask_type, vect_mask_var); > + stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask); > + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); > + (*masks)[cur_mask++] = mask; > + } > + } > +} > + > +/* Function vect_mask_reduction_stmt. > + > + Mask given vectorized reduction statement STMT using > + MASK. In case scalar reduction statement is vectorized > + into several vector statements then PREV holds a > + preceding vetor statement copy for STMT. > + > + Masking is performed using VEC_COND_EXPR. E.g. > + > + S1: r_1 = r_2 + d_3 > + > + is transformed into: > + > + S1': r_4 = r_2 + d_3 > + S2': r_1 = VEC_COND_EXPR > + > + Return generated condition statement. */ > + > +static gimple * > +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev) > +{ > + gimple_stmt_iterator gsi; > + tree vectype; > + tree lhs, rhs, tmp; > + gimple *new_stmt, *phi; > + > + lhs = gimple_assign_lhs (stmt); > + vectype = TREE_TYPE (lhs); > + > + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) > + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); > + > + /* Find operand RHS defined by PHI node. */ > + rhs = gimple_assign_rhs1 (stmt); > + gcc_assert (TREE_CODE (rhs) == SSA_NAME); > + phi = SSA_NAME_DEF_STMT (rhs); > + > + if (phi != prev && gimple_code (phi) != GIMPLE_PHI) > + { > + rhs = gimple_assign_rhs2 (stmt); > + gcc_assert (TREE_CODE (rhs) == SSA_NAME); > + phi = SSA_NAME_DEF_STMT (rhs); > + gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI); > + } > + > + /* Convert reduction stmt to ordinary assignment to TMP. */ > + tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL); > + gimple_assign_set_lhs (stmt, tmp); > + > + /* Create VEC_COND_EXPR and insert it after STMT. */ > + new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs); > + gsi = gsi_for_stmt (stmt); > + gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT); > + > + return new_stmt; > +} > + > +/* Function vect_mask_mask_load_store_stmt. > + > + Mask given vectorized MASK_LOAD or MASK_STORE statement > + STMT using MASK. 
Function replaces a mask used by STMT > + with its conjunction with MASK. */ > + > +static void > +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask) > +{ > + gimple *new_stmt; > + tree old_mask, new_mask; > + gimple_stmt_iterator gsi; > + > + gsi = gsi_for_stmt (stmt); > + old_mask = gimple_call_arg (stmt, 2); > + > + gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask))); > + > + new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL); > + new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask); > + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); > + > + gimple_call_set_arg (stmt, 2, new_mask); > + update_stmt (stmt); > +} > + > + > +/* Function vect_mask_load_store_stmt. > + > + Mask given vectorized load or store statement STMT using > + MASK. DR is a data reference for a scalar memory access. > + Assignment is transformed into MASK_LOAD or MASK_STORE > + statement. SI is either an iterator pointing to STMT and > + is to be updated or NULL. */ > + > +static void > +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask, > + data_reference *dr, gimple_stmt_iterator *si) > +{ > + tree mem, val, addr, ptr; > + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); > + unsigned align, misalign; > + tree elem_type = TREE_TYPE (vectype); > + gimple *new_stmt; > + > + gcc_assert (!si || gsi_stmt (*si) == stmt); > + > + gsi = gsi_for_stmt (stmt); > + if (gimple_store_p (stmt)) > + { > + val = gimple_assign_rhs1 (stmt); > + mem = gimple_assign_lhs (stmt); > + } > + else > + { > + val = gimple_assign_lhs (stmt); > + mem = gimple_assign_rhs1 (stmt); > + } > + > + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype) > + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))); > + > + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem), > + true, NULL_TREE, true, > + GSI_SAME_STMT); > + > + align = TYPE_ALIGN_UNIT (vectype); > + if (aligned_access_p (dr)) > + misalign = 0; > + else if (DR_MISALIGNMENT (dr) == -1) > + { > + align = TYPE_ALIGN_UNIT (elem_type); > + misalign = 0; > + } > + else > + misalign = DR_MISALIGNMENT (dr); > + set_ptr_info_alignment (get_ptr_info (addr), align, misalign); > + ptr = build_int_cst (reference_alias_ptr_type (mem), > + misalign ? misalign & -misalign : align); > + > + if (gimple_store_p (stmt)) > + new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr, > + mask, val); > + else > + { > + new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr, > + mask); > + gimple_call_set_lhs (new_stmt, val); > + } > + gsi_replace (si ? si : &gsi, new_stmt, false); > +} > + > +/* Function vect_combine_loop_epilogue. > + > + Combine loop epilogue with the main vectorized body. It requires > + masking of memory accesses and reductions. */ > + > +static void > +vect_combine_loop_epilogue (loop_vec_info loop_vinfo) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); > + unsigned mask_no; > + auto_vec masks; > + > + vect_gen_loop_masks (loop_vinfo, &masks); > + > + /* Convert reduction statements if any. 
*/ > + for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++) > + { > + gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i]; > + gimple *prev_stmt = NULL; > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + > + mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info)); > + > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + while (stmt) > + { > + prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++], > + prev_stmt); > + stmt_info = vinfo_for_stmt (stmt); > + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; > + } > + } > + > + /* Scan all loop statements to convert vector load/store including masked > + form. */ > + for (unsigned i = 0; i < loop->num_nodes; i++) > + { > + basic_block bb = bbs[i]; > + for (gimple_stmt_iterator si = gsi_start_bb (bb); > + !gsi_end_p (si); gsi_next (&si)) > + { > + gimple *stmt = gsi_stmt (si); > + stmt_vec_info stmt_info = NULL; > + tree vectype = NULL; > + data_reference *dr; > + > + /* Mask load case. */ > + if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt) > + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD > + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2)))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + if (!STMT_VINFO_VEC_STMT (stmt_info)) > + continue; > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + /* Mask store case. */ > + else if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt) > + && gimple_call_internal_fn (stmt) == IFN_MASK_STORE > + && vinfo_for_stmt (stmt) > + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + vectype = TREE_TYPE (gimple_call_arg (stmt, 2)); > + } > + /* Load case. */ > + else if (gimple_assign_load_p (stmt) > + && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt)))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + /* Skip vector loads. */ > + if (!STMT_VINFO_VEC_STMT (stmt_info)) > + continue; > + /* Skip invariant loads. */ > + if (integer_zerop (nested_in_vect_loop_p (loop, stmt) > + ? STMT_VINFO_DR_STEP (stmt_info) > + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))) > + continue; > + stmt = STMT_VINFO_VEC_STMT (stmt_info); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + /* Store case. */ > + else if (gimple_code (stmt) == GIMPLE_ASSIGN > + && gimple_store_p (stmt) > + && vinfo_for_stmt (stmt) > + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt))) > + { > + stmt_info = vinfo_for_stmt (stmt); > + vectype = STMT_VINFO_VECTYPE (stmt_info); > + } > + else > + continue; > + > + /* Skip hoisted out statements. */ > + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt))) > + continue; > + > + mask_no = vect_get_mask_index_for_type (vectype); > + > + dr = STMT_VINFO_DATA_REF (stmt_info); > + while (stmt) > + { > + if (is_gimple_call (stmt)) > + vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]); > + else > + vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr, > + /* Have to update iterator only if > + it points to stmt we mask. */ > + stmt == gsi_stmt (si) ? &si : NULL); > + > + stmt_info = vinfo_for_stmt (stmt); > + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL; > + } > + } > + } > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "=== Loop epilogue was combined ===\n"); > +} > > /* Function vect_transform_loop. 
> > @@ -6927,7 +7550,9 @@ vect_transform_loop (loop_vec_info loop_vinfo) > run at least the vectorization factor number of times checking > is pointless, too. */ > th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); > - if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 > + if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1 > + || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) > + && th > 1)) > && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > { > if (dump_enabled_p ()) > @@ -6976,12 +7601,18 @@ vect_transform_loop (loop_vec_info loop_vinfo) > { > tree ratio_mult_vf; > if (!ni_name) > - ni_name = vect_build_loop_niters (loop_vinfo); > + { > + ni_name = vect_build_loop_niters (loop_vinfo); > + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; > + } > vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf, > &ratio); > - epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, > - ratio_mult_vf, th, > - check_profitability); > + /* If epilogue is combined with main loop peeling is not needed. */ > + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) > + || check_profitability) > + epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name, > + ratio_mult_vf, th, > + check_profitability); > } > else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)), > @@ -6989,7 +7620,10 @@ vect_transform_loop (loop_vec_info loop_vinfo) > else > { > if (!ni_name) > - ni_name = vect_build_loop_niters (loop_vinfo); > + { > + ni_name = vect_build_loop_niters (loop_vinfo); > + LOOP_VINFO_NITERS (loop_vinfo) = ni_name; > + } > vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio); > } > > @@ -7243,23 +7877,35 @@ vect_transform_loop (loop_vec_info loop_vinfo) > > slpeel_make_loop_iterate_ntimes (loop, ratio); > > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + vect_combine_loop_epilogue (loop_vinfo); > + > /* Reduce loop iterations by the vectorization factor. */ > scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), > expected_iterations / vectorization_factor); > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > && loop->nb_iterations_upper_bound != 0) > - loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1; > - loop->nb_iterations_upper_bound > - = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, > - vectorization_factor) - 1; > - > + loop->nb_iterations_upper_bound -= 1; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + loop->nb_iterations_upper_bound > + = wi::div_ceil (loop->nb_iterations_upper_bound + 1, > + vectorization_factor, UNSIGNED) - 1; > + else > + loop->nb_iterations_upper_bound > + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, > + vectorization_factor) - 1; > if (loop->any_estimate) > { > - loop->nb_iterations_estimate > - = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); > - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > - && loop->nb_iterations_estimate != 0) > - loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1; > + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)) > + loop->nb_iterations_estimate > + = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor, > + UNSIGNED); > + else > + loop->nb_iterations_estimate > + = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor); > + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > + && loop->nb_iterations_estimate != 0) > + loop->nb_iterations_estimate -= 1; > } > > if (dump_enabled_p ())
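
The cost question raised above comes down to which side of one inequality
grows with the iteration count.  Below is a small standalone sketch, in
plain C with made-up cost numbers and hypothetical names (nothing here is
taken from the patch or from GCC internals), of the comparison suggested
in the review, niter/vf * masking-overhead < epilogue-cost:

#include <stdio.h>

/* Reviewer's formulation: masking the main vectorized loop pays off when
   the per-vector-iteration masking overhead, summed over all vector
   iterations, stays below the cost of the scalar epilogue it removes.  */
static int
mask_body_profitable_p (int niter, int vf, int masking_overhead,
                        int scalar_iter_cost)
{
  int vec_iters = niter / vf;           /* full vector iterations       */
  int epilogue_iters = niter % vf;      /* iterations left for epilogue */
  return vec_iters * masking_overhead < epilogue_iters * scalar_iter_cost;
}

int
main (void)
{
  /* Hypothetical costs: VF = 8, masking adds 2 units per vector
     iteration, a scalar iteration costs 5 units.  The saving from
     dropping the epilogue is bounded by (VF - 1) scalar iterations,
     while the accumulated masking overhead keeps growing with niter,
     so combining stops paying off beyond some maximum trip count.  */
  for (int niter = 13; niter <= 213; niter += 40)
    printf ("niter = %3d -> combining %s\n", niter,
            mask_body_profitable_p (niter, 8, 2, 5)
            ? "looks profitable" : "does not look profitable");
  return 0;
}

The sketch deliberately ignores the patch's additional gate on
masking_inside_cost relative to vec_inside_cost; it only illustrates why
the suggested condition naturally yields a maximum rather than a minimum
iteration count.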