public inbox for gcc@gcc.gnu.org
* [AVX-512] Vectorization when tripcount is less than VF.
@ 2014-02-18 12:27 Kirill Yukhin
From: Kirill Yukhin @ 2014-02-18 12:27 UTC (permalink / raw)
  To: rguenther, Jakub Jelinek, Richard Henderson, gcc

[-- Attachment #1: Type: text/plain, Size: 5917 bytes --]

Hello,

I'd like to start a discussion about autovectorization using the EVEX masking feature.

In brief: AVX-512 introduces the new EVEX encoding which (amongst other things)
can perform per-element masking of vector operations. There are two masking modes
available: merge-masking and zero-masking.

Quick example.

suppose: %zmm1 = (V8DI) {1 , 2, 3, 4, 5, 6, 7, 8}
	 %zmm2 = (V8DI) {9 ,10,11,12,13,14,15,16}
	 %zmm3 = (V8DI) {88,88,88,88,88,88,88,88}
	 %k1   = (QI) 0xB3 (10110011)

A merge-masked operation:
       vpaddq	%zmm1, %zmm2, %zmm3 {%k1}
will set %zmm3 = (V8DI) {10,12,88,88,18,20,88,24}

A zero-masked operation:
       vpaddq	%zmm1, %zmm2, %zmm3 {%k1} {z}
will set %zmm3 = (V8DI) {10,12,0 ,0 ,18,20,0 ,24}

Note that the operation is not even performed on the masked-out elements,
so no exceptions (e.g. division by zero) are raised for them.

For more info refer to [1].
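
For reference, here is the same example written with AVX-512F intrinsics (only a minimal,
self-contained illustration of the two masking modes; it is unrelated to the attached patch
and assumes compilation with -mavx512f):

  #include <immintrin.h>
  #include <stdio.h>

  int main (void)
  {
    __m512i a   = _mm512_set_epi64 (8, 7, 6, 5, 4, 3, 2, 1);     /* element 0 is 1 */
    __m512i b   = _mm512_set_epi64 (16, 15, 14, 13, 12, 11, 10, 9);
    __m512i old = _mm512_set1_epi64 (88);
    __mmask8 k  = 0xB3;                       /* elements 0, 1, 4, 5, 7 are active */

    /* Merge-masking: masked-out elements keep the value from OLD.  */
    __m512i merged = _mm512_mask_add_epi64 (old, k, a, b);

    /* Zero-masking: masked-out elements are set to zero.  */
    __m512i zeroed = _mm512_maskz_add_epi64 (k, a, b);

    long long m[8], z[8];
    _mm512_storeu_si512 (m, merged);
    _mm512_storeu_si512 (z, zeroed);
    for (int i = 0; i < 8; i++)
      printf ("%lld %lld\n", m[i], z[i]);
    return 0;
  }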

Currently we see the following main applications for this feature:
  1. Loop remainder peeling for bounds. When vectorizing a loop we need to peel off some iterations
     to make the loop trip count a multiple of the vector length. The remaining (N%VF) iterations
     form a scalar remainder. Since the remainder's trip count is < VF, we can vectorize it by
     copying the vectorized version of the `main' loop and applying a corresponding mask to it
     (a hand-written intrinsics sketch of this is given after this list).
     Example.
       FOR I=1, N DO
	 stmt_1[i,1]
	 ...
	 stmt_N[i,1]
       END DO
     Currently converted to:
       ! `Main' loop
       FOR I=1, N-(N%vf), vf DO
	 v_stmt_1[i,vf]
	 ...
	 v_stmt_N[i,vf]
       END DO

       ! Loop remainder
       FOR I=N-(N%vf)+1, N  DO
	 stmt_1[i,1]
	 ...
	 stmt_N[i,1]
       END DO

     The idea is to replace the scalar loop remainder with something like this:
       ! Broadcast the number of remainder iterations
       v_tmp_rest = {N%vf, N%vf,..., N%vf}

       ! Make the sequence 0..vf-1
       v_tmp_vf = {0, 1,..., vf-1}

       ! Set (N%vf) LSB bits in integral type, rest are zeroes
       mask = SPECIAL_CMP_GT (v_tmp_rest, v_tmp_vf)

       ! The loop remainder is just a copy of the `main' body
       v_stmt_1[i,N%vf] {mask}
       ...
       v_stmt_N[i,N%vf] {mask}

  2. Loop head peeling for alignment. When working with vector code it is important that memory
     accesses be aligned to the type's natural alignment; unaligned accesses can incur a significant
     performance penalty. That is why it is sometimes useful to peel off a few (< VF) loop
     iterations so that the `main' vectorized loop has its memory accesses aligned. We can
     vectorize this peeled head with masking as well, in the same way as in item 1.

  3. If-conversion. The vectorizer cannot handle loop bodies whose basic-block (BB) count is > 1,
     and a loop that contains conditionals has a BB count > 1. We can try to convert such a
     non-trivial CFG into a predicated single-BB statement sequence. The idea is to take, e.g.:
       IF (P) THEN
	 stmt_T
       ELSE
	 stmt_F
       END IF
     and convert it into:
       stmt_T {P}
       stmt_F {!P}

     The current implementation analyzes the stmts involved and, if they are safe, creates a special
     loop version in which all loads/stores are predicated. If the loop is vectorized, this version
     is chosen and the loads/stores are implemented with masked load/store (maskmov) instructions
     (under a well-formed mask); otherwise it is discarded.
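
To make item 1 concrete, here is a hand-written sketch, using AVX-512F intrinsics, of a
vectorized remainder for a simple a[i] = b[i] + c[i] loop over 64-bit elements. The loop and
the names in it are purely illustrative (this is not what the attached patch generates), but
the mask is computed with the broadcast-and-compare scheme from item 1:

  #include <immintrin.h>
  #include <stdint.h>

  void
  add_loop (int64_t *a, const int64_t *b, const int64_t *c, int64_t n)
  {
    const int64_t vf = 8;   /* V8DI: eight 64-bit elements per vector.  */
    int64_t i;

    /* `Main' loop: full vectors, no masking needed.  */
    for (i = 0; i + vf <= n; i += vf)
      {
        __m512i vb = _mm512_loadu_si512 (b + i);
        __m512i vc = _mm512_loadu_si512 (c + i);
        _mm512_storeu_si512 (a + i, _mm512_add_epi64 (vb, vc));
      }

    /* Vectorized remainder: the last (n % vf) iterations run under a mask
       instead of in a scalar tail loop.  */
    if (i < n)
      {
        /* v_tmp_rest = {n-i, ..., n-i}; v_tmp_vf = {0, 1, ..., vf-1};
           mask = v_tmp_rest > v_tmp_vf, i.e. the low (n - i) bits are set.  */
        __m512i v_tmp_rest = _mm512_set1_epi64 (n - i);
        __m512i v_tmp_vf   = _mm512_set_epi64 (7, 6, 5, 4, 3, 2, 1, 0);
        __mmask8 k = _mm512_cmpgt_epi64_mask (v_tmp_rest, v_tmp_vf);

        /* Masked-out elements are neither loaded nor stored, so no faults
           occur beyond the end of the arrays.  */
        __m512i vb = _mm512_maskz_loadu_epi64 (k, b + i);
        __m512i vc = _mm512_maskz_loadu_epi64 (k, c + i);
        _mm512_mask_storeu_epi64 (a + i, k, _mm512_add_epi64 (vb, vc));
      }
  }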

We are currently investigating options for implementing such optimizations.
We have already chatted about this on IRC (with RichardB).

The biggest problem here is that we don't have a GIMPLE representation for such things.

Currently I see three approaches to solving this:
  1. Introduce a corresponding masked version of every tree code
     (or a built-in, which I believe is ultimately the same):
       - We already have masked loads/stores (which came from Jakub's if-conversion work).
       - We would need to introduce an internal call for each tree code that may trap. I suspect
	 this is going to be a relatively big intrusion into GIMPLE.

  2. Split every GIMPLE stmt into two, so that instead of a = div (b, c) we have:
       t = div (b, c)
       a = mask (t, mask)
     in the hope that the combiner will join them back into an RTL pattern representing the
     masked div, leading to:  vdivps %zmm1, %zmm2, %zmm3 {%k1}
     But if the combiner (or whatever pass) fails to join these two stmts, we end up with RTL
     that represents incorrect instructions:
       vdivps	%zmm1, %zmm2, %zmm4	# may trap on the masked-out elements
       vmovaps	%zmm4, %zmm3 {%k1}
     So we either need to expand them in pairs (which looks a bit complicated), or we need to
     make sure they are combined later at the RTL level (presumably in the combine pass).
     As richi stated, relying on any "magic" is fragile, and I agree, so this is not the best
     solution.  (A small intrinsics illustration of the trapping issue is given after this list.)

  3. We might try to accompany every BB with a new field representing a mask that should be
     applied to every vector stmt inside the BB. We would need to be able to record a dependence
     from the mask's def stmt to the BB containing the use stmts, and I am not sure that is good.
     I am also not sure whether a vector stmt can be moved from one BB to another; if it can, we
     must prohibit such movement between BB pairs whose masks are not equal.
     When expanding a stmt we would check its BB's mask and apply it appropriately.
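
As a concrete illustration of the trapping concern in option 2, written with AVX-512F
intrinsics (again only a sketch, unrelated to the attached patch): a single masked divide
never evaluates the masked-out lanes, whereas the split form evaluates all lanes first and
only afterwards merges under the mask, so the masked-out lanes can still raise FP exceptions:

  #include <immintrin.h>

  /* Desired semantics: masked-out lanes of the division are never evaluated,
     which is what a single  vdivpd ... {%k}  instruction provides.  */
  __m512d
  masked_div (__m512d old, __mmask8 k, __m512d a, __m512d b)
  {
    return _mm512_mask_div_pd (old, k, a, b);
  }

  /* Split form corresponding to option 2 before combining: the full-width
     divide is executed for all lanes (and may raise exceptions for the
     masked-out ones); only then is the result merged under the mask.  */
  __m512d
  split_div (__m512d old, __mmask8 k, __m512d a, __m512d b)
  {
    __m512d t = _mm512_div_pd (a, b);        /* all lanes evaluated */
    return _mm512_mask_mov_pd (old, k, t);   /* merge under mask */
  }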

As a proof of concept we've implemented option #2 for loop remainder vectorization (patch attached [2]).
It seems to work on SPEC CPU2006, giving a 15% decrease in executed instruction count on
436.leslie3d and about a 4% icount decrease on 470.lbm (you can get the simulator here [3]).

I'd like to hear from the community which option is acceptable. Maybe there's something
I am overlooking...

[1] - http://software.intel.com/en-us/file/319433-018pdf
[2] - Attached.
[3] - http://software.intel.com/en-us/articles/intel-software-development-emulator

IV stands for Induction Variable.
VF stands for Vectorization Factor.

--
Thanks, K

[-- Attachment #2: tmp --]
[-- Type: text/plain, Size: 52362 bytes --]

 gcc/config/i386/i386.c     |  24 +++-
 gcc/config/i386/i386.opt   |   4 +
 gcc/config/i386/sse.md     |  90 ++++++++++----
 gcc/doc/tm.texi            |   4 +
 gcc/doc/tm.texi.in         |   4 +
 gcc/expr.c                 |  46 +++++++
 gcc/genopinit.c            |  33 ++++-
 gcc/gimple-pretty-print.c  |  11 ++
 gcc/gimple.c               |   2 +
 gcc/optabs.c               |  13 ++
 gcc/optabs.def             |   1 +
 gcc/optabs.h               |  43 +++++++
 gcc/target.def             |   7 ++
 gcc/targhooks.c            |   7 ++
 gcc/targhooks.h            |   1 +
 gcc/tree-cfg.c             |   6 +
 gcc/tree-inline.c          |   2 +
 gcc/tree-vect-generic.c    |   3 +
 gcc/tree-vect-loop-manip.c | 301 +++++++++++++++++++++++++++++++++++----------
 gcc/tree-vect-loop.c       |  68 ++++++++--
 gcc/tree-vect-slp.c        |   2 +-
 gcc/tree-vect-stmts.c      |  90 +++++++++++++-
 gcc/tree-vectorizer.h      |   3 +
 gcc/tree.def               |   3 +
 24 files changed, 665 insertions(+), 103 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index b16353a..a3c513a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2511,6 +2511,7 @@ ix86_target_string (HOST_WIDE_INT isa, int flags, const char *arch,
     { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
     { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
     { "-mprefer-avx128",		MASK_PREFER_AVX128},
+    { "-mavx512-vecmask",		MASK_AVX512_VECMASK},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -16700,8 +16701,11 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 	{
 	case MODE_VECTOR_INT:
 	case MODE_INT:
-	  op0 = gen_lowpart (V16SImode, op0);
-	  op1 = gen_lowpart (V16SImode, op1);
+	  if (GET_MODE (op0) != V8DImode && GET_MODE (op0) != V16SImode)
+	    {
+	      op0 = gen_lowpart (V16SImode, op0);
+	      op1 = gen_lowpart (V16SImode, op1);
+	    }
 	  /* FALLTHRU */
 
 	case MODE_VECTOR_FLOAT:
@@ -16709,6 +16713,10 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 	    {
 	    default:
 	      gcc_unreachable ();
+	    case V8DImode:
+	      load_unaligned = gen_avx512f_loaddquv8di;
+	      store_unaligned = gen_avx512f_storedquv8di;
+	      break;
 	    case V16SImode:
 	      load_unaligned = gen_avx512f_loaddquv16si;
 	      store_unaligned = gen_avx512f_storedquv16si;
@@ -45092,6 +45100,16 @@ ix86_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
   return res;
 }
 
+/* Implement the TARGET_VECTORIZE_VECMASK_SUPPORT hook.  */
+
+bool ix86_vecmask_support (void)
+{
+  if (TARGET_AVX512F && (target_flags & TARGET_AVX512_VECMASK))
+    return true;
+
+  return false;
+}
+
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
 
@@ -45563,6 +45581,8 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #define TARGET_VECTORIZE_FINISH_COST ix86_finish_cost
 #undef TARGET_VECTORIZE_DESTROY_COST_DATA
 #define TARGET_VECTORIZE_DESTROY_COST_DATA ix86_destroy_cost_data
+#undef TARGET_VECTORIZE_VECMASK_SUPPORT
+#define TARGET_VECTORIZE_VECMASK_SUPPORT ix86_vecmask_support
 
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION ix86_set_current_function
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index ebda28f..8fc891f 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -447,6 +447,10 @@ mprefer-avx128
 Target Report Mask(PREFER_AVX128) SAVE
 Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.
 
+mavx512-vecmask
+Target Report Mask(AVX512_VECMASK) SAVE
+Enable autogeneration of AVX-512's embedded masking.
+
 ;; ISA support
 
 m32
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index d4f01cb..9fb3ae8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -632,6 +632,21 @@
 (define_mode_attr bcstscalarsuff
   [(V16SI "d") (V16SF "ss") (V8DI "q") (V8DF "sd")])
 
+;; Used in `vecmask' attribute to match vector modes that
+;; supports masking
+(define_mode_attr mask_en
+  [(V64QI "no") (V32HI "no") (V16SI "yes") (V8DI "yes")
+   (V32QI "no") (V16HI "no") (V8SI "no") (V4DI "no") (V2TI "no")
+   (V16QI "no") (V8HI "no") (V4SI "no") (V2DI "no") (V1TI "no")
+   (V16SF "yes") (V8DF "yes")
+   (V8SF "no") (V4DF "no")
+   (V4SF "no") (V2DF "no")
+   (XI "yes") (OI "no") (TI "no")])
+
+;; Indicates if a vector instruction has masked version.  */
+(define_attr "vecmask" "no,yes"
+  (const_string "no"))
+
 ;; Include define_subst patterns for instructions with mask
 (include "subst.md")
 
@@ -653,7 +668,8 @@
 {
   ix86_expand_vector_move (<MODE>mode, operands);
   DONE;
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*mov<mode>_internal"
   [(set (match_operand:VMOVE 0 "nonimmediate_operand"               "=v,v ,m")
@@ -1301,7 +1317,8 @@
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE && <mask_mode512bit_condition> && <round_mode512bit_condition>"
-  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
+  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);"
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*<plusminus_insn><mode>3<mask_name><round_name>"
   [(set (match_operand:VF 0 "register_operand" "=x,v")
@@ -1340,7 +1357,8 @@
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE && <mask_mode512bit_condition> && <round_mode512bit_condition>"
-  "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")
+  "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);"
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*mul<mode>3<mask_name><round_name>"
   [(set (match_operand:VF 0 "register_operand" "=x,v")
@@ -1380,7 +1398,8 @@
 	(div:VF2 (match_operand:VF2 1 "register_operand")
 		 (match_operand:VF2 2 "nonimmediate_operand")))]
   "TARGET_SSE2"
-  "ix86_fixup_binary_operands_no_copy (DIV, <MODE>mode, operands);")
+  "ix86_fixup_binary_operands_no_copy (DIV, <MODE>mode, operands);"
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "div<mode>3"
   [(set (match_operand:VF1 0 "register_operand")
@@ -1399,7 +1418,8 @@
       ix86_emit_swdivsf (operands[0], operands[1], operands[2], <MODE>mode);
       DONE;
     }
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "<sse>_div<mode>3<mask_name><round_name>"
   [(set (match_operand:VF 0 "register_operand" "=x,v")
@@ -1474,7 +1494,9 @@
 (define_expand "sqrt<mode>2"
   [(set (match_operand:VF2 0 "register_operand")
 	(sqrt:VF2 (match_operand:VF2 1 "nonimmediate_operand")))]
-  "TARGET_SSE2")
+  "TARGET_SSE2"
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "sqrt<mode>2"
   [(set (match_operand:VF1 0 "register_operand")
@@ -1490,7 +1512,8 @@
       ix86_emit_swsqrtsf (operands[0], operands[1], <MODE>mode, false);
       DONE;
     }
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "<sse>_sqrt<mode>2<mask_name><round_name>"
   [(set (match_operand:VF 0 "register_operand" "=v")
@@ -1597,7 +1620,8 @@
   if (!flag_finite_math_only)
     operands[1] = force_reg (<MODE>mode, operands[1]);
   ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*<code><mode>3_finite<mask_name><round_saeonly_name>"
   [(set (match_operand:VF 0 "register_operand" "=x,v")
@@ -2405,7 +2429,8 @@
          (match_operand:VF_512 1 "nonimmediate_operand")
          (match_operand:VF_512 2 "nonimmediate_operand")))]
   "TARGET_AVX512F"
-  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
+  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);"
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*<code><mode>3"
   [(set (match_operand:VF 0 "register_operand" "=x,v")
@@ -2714,7 +2739,9 @@
 	  (match_operand:FMAMODEM 1 "nonimmediate_operand")
 	  (match_operand:FMAMODEM 2 "nonimmediate_operand")
 	  (match_operand:FMAMODEM 3 "nonimmediate_operand")))]
-  "")
+  ""
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "fms<mode>4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
@@ -2722,7 +2749,9 @@
 	  (match_operand:FMAMODEM 1 "nonimmediate_operand")
 	  (match_operand:FMAMODEM 2 "nonimmediate_operand")
 	  (neg:FMAMODEM (match_operand:FMAMODEM 3 "nonimmediate_operand"))))]
-  "")
+  ""
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "fnma<mode>4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
@@ -2730,7 +2759,9 @@
 	  (neg:FMAMODEM (match_operand:FMAMODEM 1 "nonimmediate_operand"))
 	  (match_operand:FMAMODEM 2 "nonimmediate_operand")
 	  (match_operand:FMAMODEM 3 "nonimmediate_operand")))]
-  "")
+  ""
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "fnms<mode>4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
@@ -2738,7 +2769,9 @@
 	  (neg:FMAMODEM (match_operand:FMAMODEM 1 "nonimmediate_operand"))
 	  (match_operand:FMAMODEM 2 "nonimmediate_operand")
 	  (neg:FMAMODEM (match_operand:FMAMODEM 3 "nonimmediate_operand"))))]
-  "")
+  ""
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 ;; The builtins for intrinsics are not constrained by SSE math enabled.
 (define_mode_iterator FMAMODE [(SF "TARGET_FMA || TARGET_FMA4 || TARGET_AVX512F")
@@ -7786,7 +7819,8 @@
 	  (match_operand:VI_AVX2 1 "nonimmediate_operand")
 	  (match_operand:VI_AVX2 2 "nonimmediate_operand")))]
   "TARGET_SSE2 && <mask_mode512bit_condition>"
-  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
+  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);"
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*<plusminus_insn><mode>3<mask_name>"
   [(set (match_operand:VI_AVX2 0 "register_operand" "=x,v")
@@ -8274,7 +8308,8 @@
       ix86_expand_sse2_mulv4si3 (operands[0], operands[1], operands[2]);
       DONE;
     }
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "*<sse4_1_avx2>_mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand" "=x,v")
@@ -8413,7 +8448,8 @@
      (if_then_else (match_operand 2 "const_int_operand")
        (const_string "1")
        (const_string "0")))
-   (set_attr "mode" "<sseinsnmode>")])
+   (set_attr "mode" "<sseinsnmode>")
+(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "<shift_insn><mode>3"
   [(set (match_operand:VI248_AVX2 0 "register_operand" "=x,x")
@@ -8448,8 +8484,8 @@
        (const_string "1")
        (const_string "0")))
    (set_attr "prefix" "evex")
-   (set_attr "mode" "<sseinsnmode>")])
-
+   (set_attr "mode" "<sseinsnmode>")
+(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "vec_shl_<mode>"
   [(set (match_operand:VI_128 0 "register_operand")
@@ -8887,6 +8923,14 @@
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+;; Comparison instruction used for generating vector mask.
+(define_expand "vec_mask_gen_<mode>"
+  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
+	(gt:<avx512fmaskmode>
+	  (match_operand:VI48_512 1 "register_operand")
+	  (match_operand:VI48_512 2 "nonimmediate_operand")))]
+  "TARGET_AVX512F")
+
 (define_insn "sse2_gt<mode>3"
   [(set (match_operand:VI124_128 0 "register_operand" "=x,x")
 	(gt:VI124_128
@@ -9051,7 +9095,8 @@
 {
   ix86_expand_vec_perm (operands);
   DONE;
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_mode_iterator VEC_PERM_CONST
   [(V4SF "TARGET_SSE") (V4SI "TARGET_SSE")
@@ -9192,7 +9237,8 @@
 {
   ix86_expand_vector_logical_operator (<CODE>, <MODE>mode, operands);
   DONE;
-})
+}
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_insn "<mask_codefor><code><mode>3<mask_name>"
   [(set (match_operand:VI 0 "register_operand" "=x,v")
@@ -13489,7 +13535,9 @@
 	(ashift:VI48_512
 	  (match_operand:VI48_512 1 "register_operand")
 	  (match_operand:VI48_512 2 "nonimmediate_operand")))]
-  "TARGET_AVX512F")
+  "TARGET_AVX512F"
+  ""
+[(set_attr "vecmask" "<mask_en>")])
 
 (define_expand "vashl<mode>3"
   [(set (match_operand:VI48_256 0 "register_operand")
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 8d220f3..b946117 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5787,6 +5787,10 @@ The default is @code{NULL_TREE} which means to not vectorize gather
 loads.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_VECMASK_SUPPORT (void) (void)
+Returns true if architecture supports vector masks.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 863e843a..e59fa72 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4414,6 +4414,10 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
+@hook TARGET_VECTORIZE_VECMASK_SUPPORT (void)
+Returns true if architecture supports vector masks.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index 5949b13..2a66dbe 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9100,6 +9100,52 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
       target = expand_vec_cond_expr (type, treeop0, treeop1, treeop2, target);
       return target;
 
+    case MASK_GEN:
+
+      rtx cmp_rtx;
+      enum machine_mode cmp_mode;
+
+      op0 = expand_normal (treeop0);
+      op1 = expand_normal (treeop1);
+
+      cmp_mode = mode_for_size (TYPE_VECTOR_SUBPARTS (TREE_TYPE (treeop0)),
+				MODE_INT, 0);
+
+      if (!target)
+	target = gen_reg_rtx (cmp_mode);
+
+      op0 = force_reg (GET_MODE (op0), op0);
+      op1 = force_reg (GET_MODE (op1), op1);
+
+      cmp_rtx = gen_rtx_UNSPEC (cmp_mode,
+				gen_rtvec (2, op1, op0), UNSPEC_MASKED_GT);
+      emit_insn (gen_rtx_SET (cmp_mode, target, cmp_rtx));
+
+      return target;
+
+    case VEC_MASK:
+
+      type = TREE_TYPE (treeop0);
+
+      op0 = expand_normal (treeop0);
+      op1 = expand_normal (treeop1);
+
+      if (TREE_CODE(treeop2) == VECTOR_CST)
+	op2 = const_vector_from_tree (treeop2);
+      else
+	op2 = expand_normal (treeop2);
+
+      op0 = force_reg (GET_MODE (op0), op0);
+      op1 = force_reg (GET_MODE (op1), op1);
+
+      if (!target)
+	  target = gen_reg_rtx (TYPE_MODE (type));
+
+      if (MEM_P (target))
+	  op2 = target;
+
+      return gen_rtx_VEC_MERGE (TYPE_MODE (type), op0, op2, op1);
+
     default:
       gcc_unreachable ();
     }
diff --git a/gcc/genopinit.c b/gcc/genopinit.c
index fb80717..03d10cf 100644
--- a/gcc/genopinit.c
+++ b/gcc/genopinit.c
@@ -142,6 +142,7 @@ typedef struct pattern_d
   unsigned int op;
   unsigned int m1, m2;
   unsigned int sort_num;
+  bool mask;
 } pattern;
 
 
@@ -247,7 +248,9 @@ gen_insn (rtx insn)
 {
   const char *name = XSTR (insn, 0);
   pattern p;
-  unsigned pindex;
+  p.mask = false;
+  unsigned pindex, i, n_attrs;
+  rtx attr;
 
   /* Don't mention "unnamed" instructions.  */
   if (*name == 0 || *name == '*')
@@ -263,6 +266,22 @@ gen_insn (rtx insn)
 	{
 	  p.op = optabs[pindex].op;
 	  p.sort_num = (p.op << 16) | (p.m2 << 8) | p.m1;
+
+	  if (XVEC (insn, 4))
+	    {
+	      n_attrs = XVECLEN (insn, 4);
+	      for (i = 0; i < n_attrs; i++)
+		{
+		  attr = RTVEC_ELT (XVEC (insn, 4), i);
+		  if (!strcmp (XSTR (attr, 0), "vecmask") &&
+		      !strcmp (XSTR (attr, 1), "yes"))
+		    {
+		      p.mask = true;
+		      break;
+		    }
+		}
+	    }
+
 	  patterns.safe_push (p);
 	  return;
 	}
@@ -413,13 +432,14 @@ main (int argc, char **argv)
 	   "\n"
 	   "struct optab_pat {\n"
 	   "  unsigned scode;\n"
+	   "  bool mask;\n"
 	   "  enum insn_code icode;\n"
 	   "};\n\n");
 
   fprintf (s_file,
 	   "static const struct optab_pat pats[NUM_OPTAB_PATTERNS] = {\n");
   for (i = 0; patterns.iterate (i, &p); ++i)
-    fprintf (s_file, "  { %#08x, CODE_FOR_%s },\n", p->sort_num, p->name);
+    fprintf (s_file, "  { %#08x, %s, CODE_FOR_%s },\n", p->sort_num, ((p->mask) ? "true" : "false"), p->name);
   fprintf (s_file, "};\n\n");
 
   fprintf (s_file, "void\ninit_all_optabs (struct target_optabs *optabs)\n{\n");
@@ -462,6 +482,15 @@ main (int argc, char **argv)
 
   fprintf (s_file,
 	   "bool\n"
+	   "raw_optab_handler_has_mask (unsigned scode)\n"
+	   "{\n"
+	   "  int i = lookup_handler (scode);\n"
+	   "  return (i >= 0 && this_fn_optabs->pat_enable[i]\n"
+	   "          ? pats[i].mask : false);\n"
+	   "}\n\n");
+
+  fprintf (s_file,
+	   "bool\n"
 	   "swap_optab_enable (optab op, enum machine_mode m, bool set)\n"
 	   "{\n"
 	   "  unsigned scode = (op << 16) | m;\n"
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 1599c80..6085f81 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -340,6 +340,7 @@ dump_binary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
   switch (code)
     {
     case COMPLEX_EXPR:
+    case MASK_GEN:
     case MIN_EXPR:
     case MAX_EXPR:
     case VEC_WIDEN_MULT_HI_EXPR:
@@ -464,6 +465,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       pp_greater (buffer);
       break;
 
+    case VEC_MASK:
+      pp_string (buffer, "VEC_MASK <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_string (buffer, ">");
+      break;
+
     default:
       gcc_unreachable ();
     }
diff --git a/gcc/gimple.c b/gcc/gimple.c
index 59fcf43..4035f07 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2517,6 +2517,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (TYPE) == tcc_reference) ? GIMPLE_SINGLE_RHS			    \
    : ((SYM) == TRUTH_AND_EXPR						    \
       || (SYM) == TRUTH_OR_EXPR						    \
+      || (SYM) == MASK_GEN                                                  \
       || (SYM) == TRUTH_XOR_EXPR) ? GIMPLE_BINARY_RHS			    \
    : (SYM) == TRUTH_NOT_EXPR ? GIMPLE_UNARY_RHS				    \
    : ((SYM) == COND_EXPR						    \
@@ -2526,6 +2527,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
+      || (SYM) == VEC_MASK                                                  \
       || (SYM) == FMA_EXPR) ? GIMPLE_TERNARY_RHS			    \
    : ((SYM) == CONSTRUCTOR						    \
       || (SYM) == OBJ_TYPE_REF						    \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 3238885..6f5b964 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -6711,6 +6711,19 @@ get_vcond_icode (enum machine_mode vmode, enum machine_mode cmode, bool uns)
   return icode;
 }
 
+/* The same as get_vcond_icode for instructions with vector mask.  */
+
+bool
+get_vcond_icode_has_mask (enum machine_mode vmode, enum machine_mode cmode, bool uns)
+{
+  bool supportable = false;
+  if (uns)
+    supportable = convert_optab_handler_has_mask (vcondu_optab, vmode, cmode);
+  else
+    supportable = convert_optab_handler_has_mask (vcond_optab, vmode, cmode);
+  return supportable;
+}
+
 /* Return TRUE iff, appropriate vector insns are available
    for vector cond expr with vector type VALUE_TYPE and a comparison
    with operand vector types in CMP_OP_TYPE.  */
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..0d4f7a3 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -281,6 +281,7 @@ OPTAB_D (vec_widen_umult_lo_optab, "vec_widen_umult_lo_$a")
 OPTAB_D (vec_widen_umult_odd_optab, "vec_widen_umult_odd_$a")
 OPTAB_D (vec_widen_ushiftl_hi_optab, "vec_widen_ushiftl_hi_$a")
 OPTAB_D (vec_widen_ushiftl_lo_optab, "vec_widen_ushiftl_lo_$a")
+OPTAB_D (vec_mask_gen_optab, "vec_mask_gen_$a")
 
 OPTAB_D (sync_add_optab, "sync_add$I$a")
 OPTAB_D (sync_and_optab, "sync_and$I$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 4de4409..ea2d3fb 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -67,6 +67,8 @@ extern const struct optab_libcall_d normlib_def[NUM_NORMLIB_OPTABS];
 
 /* Returns the active icode for the given (encoded) optab.  */
 extern enum insn_code raw_optab_handler (unsigned);
+extern bool raw_optab_handler_has_mask (unsigned);
+bool get_vcond_icode_has_mask (enum machine_mode, enum machine_mode, bool);
 extern bool swap_optab_enable (optab, enum machine_mode, bool);
 
 /* Target-dependent globals.  */
@@ -259,6 +261,14 @@ optab_handler (optab op, enum machine_mode mode)
   return raw_optab_handler (scode);
 }
 
+static inline bool
+optab_handler_has_mask (optab op, enum machine_mode mode)
+{
+  unsigned scode = (op << 16) | mode;
+  gcc_assert (op > LAST_CONV_OPTAB);
+  return raw_optab_handler_has_mask (scode);
+}
+
 /* Return the insn used to perform conversion OP from mode FROM_MODE
    to mode TO_MODE; return CODE_FOR_nothing if the target does not have
    such an insn.  */
@@ -272,6 +282,16 @@ convert_optab_handler (convert_optab op, enum machine_mode to_mode,
   return raw_optab_handler (scode);
 }
 
+static inline bool
+convert_optab_handler_has_mask (convert_optab op, enum machine_mode to_mode,
+		       enum machine_mode from_mode)
+{
+  unsigned scode = (op << 16) | (from_mode << 8) | to_mode;
+  gcc_assert (op > unknown_optab && op <= LAST_CONV_OPTAB);
+  return raw_optab_handler_has_mask (scode);
+}
+
 /* Like optab_handler, but for widening_operations that have a
    TO_MODE and a FROM_MODE.  */
 
@@ -291,6 +311,23 @@ widening_optab_handler (optab op, enum machine_mode to_mode,
   return raw_optab_handler (scode);
 }
 
+static inline bool
+widening_optab_handler_has_mask (optab op, enum machine_mode to_mode,
+			enum machine_mode from_mode)
+{
+  unsigned scode = (op << 16) | to_mode;
+  if (to_mode != from_mode && from_mode != VOIDmode)
+    {
+      /* ??? Why does find_widening_optab_handler_and_mode attempt to
+	 widen things that can't be widened?  E.g. add_optab... */
+      if (op > LAST_CONV_OPTAB)
+	return CODE_FOR_nothing;
+      scode |= from_mode << 8;
+    }
+  return raw_optab_handler_has_mask (scode);
+}
+
 /* Return the insn used to implement mode MODE of OP, or CODE_FOR_nothing
    if the target does not have such an insn.  */
 
@@ -300,6 +337,12 @@ direct_optab_handler (direct_optab op, enum machine_mode mode)
   return optab_handler (op, mode);
 }
 
+static inline bool
+direct_optab_handler_has_mask (direct_optab op, enum machine_mode mode)
+{
+  return optab_handler_has_mask (op, mode);
+}
+
 /* Return true if UNOPTAB is for a trapping-on-overflow operation.  */
 
 static inline bool
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..50c7b16 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1724,6 +1724,13 @@ DEFHOOK
  (void *data),
  default_destroy_cost_data)
 
+/* Returns true if vector masks supported */
+DEFHOOK
+(vecmask_support,
+ "",
+ bool, (void),
+ default_vecmask_support)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 03db7b4..d76eed4 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1567,4 +1567,11 @@ default_canonicalize_comparison (int *, rtx *, rtx *, bool)
 {
 }
 
+/* Default version of vecmask_support.  */
+
+bool default_vecmask_support (void)
+{
+  return false;
+}
+
 #include "gt-targhooks.h"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index aaddae9..10f4384 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -186,6 +186,7 @@ extern unsigned char default_class_max_nregs (reg_class_t, enum machine_mode);
 extern enum unwind_info_type default_debug_unwind_info (void);
 
 extern void default_canonicalize_comparison (int *, rtx *, rtx *, bool);
+extern bool default_vecmask_support (void);
 
 extern int default_label_align_after_barrier_max_skip (rtx);
 extern int default_loop_align_max_skip (rtx);
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 70930a35..d0c2bcf 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3568,6 +3568,11 @@ verify_gimple_assign_binary (gimple stmt)
         return false;
       }
 
+    case MASK_GEN:
+      /* FIXME.  */
+      return false;
+
     case PLUS_EXPR:
     case MINUS_EXPR:
       {
@@ -3828,6 +3833,7 @@ verify_gimple_assign_ternary (gimple stmt)
 
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
+    case VEC_MASK:
       /* FIXME.  */
       return false;
 
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index ebb4b91..af41c90 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3597,6 +3597,8 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case VEC_PACK_FIX_TRUNC_EXPR:
     case VEC_WIDEN_LSHIFT_HI_EXPR:
     case VEC_WIDEN_LSHIFT_LO_EXPR:
+    case MASK_GEN:
+    case VEC_MASK:
 
       return 1;
 
diff --git a/gcc/tree-vect-generic.c b/gcc/tree-vect-generic.c
index 4f67ca0..f4ad56a 100644
--- a/gcc/tree-vect-generic.c
+++ b/gcc/tree-vect-generic.c
@@ -1267,6 +1267,9 @@ expand_vector_operations_1 (gimple_stmt_iterator *gsi)
       return;
     }
 
+  if (code == MASK_GEN || code == VEC_MASK)
+    return;
+
   if (code == VEC_COND_EXPR)
     {
       expand_vector_condition (gsi);
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index bd77473..86c0d40 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1034,7 +1034,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop,
 			       tree niters, bool update_first_loop_count,
 			       unsigned int th, bool check_profitability,
 			       tree cond_expr, gimple_seq cond_expr_stmt_list,
-			       int bound1, int bound2)
+			       int bound1, int bound2, bool use_vecmask)
 {
   struct loop *new_loop = NULL, *first_loop, *second_loop;
   edge skip_e;
@@ -1141,6 +1141,10 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop,
       second_loop = loop;
     }
 
+  if (use_vecmask)
+    slpeel_make_loop_iterate_ntimes (second_loop,
+				     build_int_cst (TREE_TYPE (*first_niters), 1));
+
   /* 2.  Add the guard code in one of the following ways:
 
      2.a Add the guard that controls whether the first loop is executed.
@@ -1419,6 +1423,39 @@ vect_build_loop_niters (loop_vec_info loop_vinfo, gimple_seq seq)
   return ni_name;
 }
 
+static tree
+add_var_on_edge (const char *name, tree rhs, tree type, edge pe,
+		 bool vector, gimple_seq cond_expr_stmt_list)
+{
+  basic_block new_bb;
+  tree var;
+  gimple stmts;
+
+  if (!vector)
+    {
+      var = create_tmp_var (type, name);
+      var = force_gimple_operand (rhs, &stmts, true, var);
+    }
+  else
+    {
+      var = vect_get_new_vect_var (type, vect_simple_var, name);
+
+      stmts = gimple_build_assign  (var, rhs);
+      var = make_ssa_name (var, stmts);
+      gimple_assign_set_lhs (stmts, var);
+    }
+
+  if (cond_expr_stmt_list)
+    gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
+  else
+    {
+      new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
+      gcc_assert (!new_bb);
+    }
+
+  return var;
+}
 
 /* This function generates the following statements:
 
@@ -1436,12 +1473,8 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
 				 tree *ratio_name_ptr,
 				 gimple_seq cond_expr_stmt_list)
 {
-
   edge pe;
-  basic_block new_bb;
-  gimple_seq stmts;
   tree ni_name, ni_minus_gap_name;
-  tree var;
   tree ratio_name;
   tree ratio_mult_vf_name;
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -1462,69 +1495,28 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
      correct calculation of RATIO.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     {
-      ni_minus_gap_name = fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
-				       ni_name,
-			               build_one_cst (TREE_TYPE (ni_name)));
-      if (!is_gimple_val (ni_minus_gap_name))
-	{
-	  var = create_tmp_var (TREE_TYPE (ni), "ni_gap");
-
-          stmts = NULL;
-          ni_minus_gap_name = force_gimple_operand (ni_minus_gap_name, &stmts,
-						    true, var);
-          if (cond_expr_stmt_list)
-            gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
-          else
-            {
-              pe = loop_preheader_edge (loop);
-              new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
-              gcc_assert (!new_bb);
-            }
-        }
+      ni_minus_gap_name = fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name), ni_name,
+				       build_one_cst (TREE_TYPE (ni_name)));
+      ni_minus_gap_name = add_var_on_edge ("ni_gap", ni_minus_gap_name,
+					   TREE_TYPE (ni_name), pe, false,
+					   cond_expr_stmt_list);
     }
   else
     ni_minus_gap_name = ni_name;
 
   /* Create: ratio = ni >> log2(vf) */
-
   ratio_name = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_minus_gap_name),
 			    ni_minus_gap_name, log_vf);
-  if (!is_gimple_val (ratio_name))
-    {
-      var = create_tmp_var (TREE_TYPE (ni), "bnd");
-
-      stmts = NULL;
-      ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
-      if (cond_expr_stmt_list)
-	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
-      else
-	{
-	  pe = loop_preheader_edge (loop);
-	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
-	  gcc_assert (!new_bb);
-	}
-    }
+  ratio_name = add_var_on_edge ("bnd", ratio_name,
+				TREE_TYPE (ni_minus_gap_name), pe, false,
+				cond_expr_stmt_list);
 
   /* Create: ratio_mult_vf = ratio << log2 (vf).  */
-
   ratio_mult_vf_name = fold_build2 (LSHIFT_EXPR, TREE_TYPE (ratio_name),
 				    ratio_name, log_vf);
-  if (!is_gimple_val (ratio_mult_vf_name))
-    {
-      var = create_tmp_var (TREE_TYPE (ni), "ratio_mult_vf");
-
-      stmts = NULL;
-      ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
-						 true, var);
-      if (cond_expr_stmt_list)
-	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
-      else
-	{
-	  pe = loop_preheader_edge (loop);
-	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
-	  gcc_assert (!new_bb);
-	}
-    }
+  ratio_mult_vf_name = add_var_on_edge ("ratio_mult_vf", ratio_mult_vf_name,
+					TREE_TYPE (ratio_name), pe, false,
+					cond_expr_stmt_list);
 
   *ni_name_ptr = ni_name;
   *ratio_mult_vf_name_ptr = ratio_mult_vf_name;
@@ -1533,6 +1525,75 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
   return;
 }
 
+/* This function generates the following statements:
+
+ niter_epilog = ni_name - ratio_mult_vf_name
+ niter_epilog_vect = {niter_epilog, niter_epilog, ... } // VECT_FACTOR times
+ integer_seq = {0, 1, .... VECT_FACTOR}
+
+ mask = MASK_GEN <integer_seq, niter_epilog_vect>
+
+ and places them at the peeled iteration edge.  */
+
+static void
+vect_mask_on_preheader (loop_vec_info loop_vinfo, edge pe, tree ni_name,
+			tree ratio_mult_vf_name, tree *vecmask)
+{
+  basic_block new_bb;
+
+  tree niter_type, niter_vectype;
+  tree niter_epilog, niter_epilog_vect;
+  tree integer_seq;
+  tree mask_name;
+  gimple mask_stmt;
+
+  vec<constructor_elt, va_gc> * integer_seq_vect;
+
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+
+  if (!LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+    return;
+
+  niter_type =
+    build_nonstandard_integer_type (current_vector_size * 8 / vf, true);
+  niter_vectype = build_vector_type (niter_type, vf);
+
+  niter_epilog = fold_build2 (MINUS_EXPR, niter_type,
+			      fold_convert (niter_type, ni_name),
+			      fold_convert (niter_type, ratio_mult_vf_name));
+
+  niter_epilog = add_var_on_edge ("niter_epilog", niter_epilog, niter_type, pe,
+				  false, NULL);
+
+  niter_epilog_vect = build_vector_from_val (niter_vectype, niter_epilog);
+  niter_epilog_vect = add_var_on_edge ("niter_epilog",
+				       niter_epilog_vect,
+				       niter_vectype, pe, true, NULL);
+
+  vec_alloc (integer_seq_vect, vf);
+
+  for (int i = 0; i < vf; i++)
+    CONSTRUCTOR_APPEND_ELT (integer_seq_vect, NULL_TREE,
+			    build_int_cst (niter_type, i));
+
+  integer_seq = build_constructor (niter_vectype, integer_seq_vect);
+  integer_seq = add_var_on_edge ("integer_seq", integer_seq,
+				 niter_vectype, pe, true, NULL);
+
+  mask_name = create_tmp_var (build_nonstandard_integer_type (vf, true),
+			      "mask");
+  mask_name = make_ssa_name (mask_name, NULL);
+
+  mask_stmt = gimple_build_assign_with_ops (MASK_GEN, mask_name,
+					    integer_seq, niter_epilog_vect);
+
+  new_bb = gsi_insert_seq_on_edge_immediate (pe, mask_stmt);
+  gcc_assert (!new_bb);
+
+  *vecmask = mask_name;
+}
+
 /* Function vect_can_advance_ivs_p
 
    In case the number of iterations that LOOP iterates is unknown at compile
@@ -1753,7 +1814,7 @@ void
 vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
 				unsigned int th, bool check_profitability)
 {
-  tree ni_name, ratio_mult_vf_name;
+  tree ni_name, ratio_mult_vf_name, vecmask;
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   struct loop *new_loop;
   edge update_e;
@@ -1763,6 +1824,8 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
   tree cond_expr = NULL_TREE;
   gimple_seq cond_expr_stmt_list = NULL;
 
+  int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
                      "=== vect_do_peeling_for_loop_bound ===\n");
@@ -1780,11 +1843,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
 
   loop_num  = loop->num;
 
-  new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
-                                            &ratio_mult_vf_name, ni_name, false,
-                                            th, check_profitability,
-					    cond_expr, cond_expr_stmt_list,
-					    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+  new_loop =
+    slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
+				   &ratio_mult_vf_name, ni_name, false, th,
+				   check_profitability, cond_expr,
+				   cond_expr_stmt_list, 0, vf,
+				   LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo));
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
 #ifdef ENABLE_CHECKING
@@ -1803,9 +1867,112 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
   else
     update_e = EDGE_PRED (preheader, 1);
 
-  /* Update IVs of original loop as if they were advanced
-     by ratio_mult_vf_name steps.  */
-  vect_update_ivs_after_vectorizer (loop_vinfo, ratio_mult_vf_name, update_e);
+  /* Mask generation.  */
+  vect_mask_on_preheader (loop_vinfo, loop_preheader_edge (new_loop), ni_name,
+			  ratio_mult_vf_name, &vecmask);
+
+  /* Wrap vector statements in loop's body with VEC_MASK.  */
+  if (LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+    {
+      gimple_stmt_iterator si;
+      gimple stmt, new_stmt, tmp_stmt, mem_stmt;
+
+      tree zero_vector;
+      tree tmp_name, tmp_mem_name, tmp_ptr_name;
+      tree mem_ref;
+      tree masked_operand;
+      tree vector_type, inner_type, v_ptr_type;
+      tree op1, op2;
+
+      si = gsi_start_bb (new_loop->header);
+
+      for (; !gsi_end_p (si); gsi_next (&si))
+	{
+	  stmt = gsi_stmt (si);
+
+	  if (gimple_code (stmt) == GIMPLE_ASSIGN
+	      && TREE_CODE (TREE_TYPE (gimple_assign_rhs_to_tree (stmt))) == VECTOR_TYPE)
+	    {
+	      /* In usual operation we mask the result (i.e. LHS), in stores we
+		 mask the value to be stored i.e. RHS.  */
+	      masked_operand = (gimple_store_p (stmt))
+			       ? gimple_assign_rhs1 (stmt) 
+			       : gimple_assign_lhs (stmt);
+
+	      vector_type = gimple_store_p (stmt)
+		? TREE_TYPE (gimple_assign_lhs (stmt))
+		: TREE_TYPE (gimple_assign_rhs1 (stmt));
+
+	      inner_type = TREE_TYPE (vector_type);
+
+	      /* Skip statements with vector length isn't equal to VF.  */
+	      if (TYPE_VECTOR_SUBPARTS (vector_type) != (unsigned)vf)
+		continue; 
+
+	      /* LHS for VEC_MASK operation.  */
+	      tmp_name = create_tmp_var (vector_type, "m_tmp");
+	      tmp_name = make_ssa_name (tmp_name, stmt);
+
+	      zero_vector = build_vector_from_val (vector_type,
+						   build_zero_cst (inner_type));
+
+	      op1 = (gimple_store_p (stmt)) ? tmp_name : masked_operand;
+	      op2 = (gimple_store_p (stmt)) ? masked_operand : tmp_name;
+
+
+	      if (!gimple_store_p (stmt))
+		{
+		  new_stmt =
+		    gimple_build_assign_with_ops (VEC_MASK, op1, op2, vecmask,
+						  zero_vector);
+
+		  gsi_insert_after (&si, new_stmt, GSI_NEW_STMT);
+		  gimple_assign_set_lhs (stmt, tmp_name);
+		}
+
+	      else
+		{
+		  v_ptr_type = build_pointer_type (vector_type);
+
+		  tmp_ptr_name = create_tmp_var (v_ptr_type, "p_tmp");
+		  tmp_ptr_name = make_ssa_name (tmp_ptr_name, stmt);
+
+		  tmp_stmt =
+		    gimple_build_assign (tmp_ptr_name,
+					 TREE_OPERAND (gimple_assign_lhs (stmt), 0));
+		  gsi_insert_before (&si, tmp_stmt, GSI_NEW_STMT);
+		  gsi_next (&si);
+
+		  tmp_mem_name = create_tmp_var (vector_type, "mem_tmp");
+		  tmp_mem_name = make_ssa_name (tmp_mem_name, stmt);
+
+		  mem_ref = build2 (MEM_REF, vector_type, tmp_ptr_name,
+				    build_int_cst (v_ptr_type, 0));
+		  mem_stmt = gimple_build_assign (tmp_mem_name, mem_ref);
+
+		  gsi_insert_before (&si, mem_stmt, GSI_NEW_STMT);
+		  gsi_next (&si);
+
+		  new_stmt = gimple_build_assign_with_ops (VEC_MASK, op1, op2,
+							   vecmask, tmp_mem_name);
+
+		  gsi_insert_before (&si, new_stmt, GSI_NEW_STMT);
+		  gsi_next (&si);
+
+		  gimple_assign_set_rhs1 (stmt, tmp_name);
+		}
+
+	      update_stmt (stmt);
+	    }
+	}
+    }
+  else
+    {
+      /* Update IVs of original loop as if they were advanced
+	 by ratio_mult_vf_name steps.  */
+      vect_update_ivs_after_vectorizer (loop_vinfo, ratio_mult_vf_name, update_e);
+    }
+
 
   /* For vectorization factor N, we need to copy last N-1 values in epilogue
      and this means N-2 loopback edge executions.
@@ -2042,7 +2209,7 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo,
 				   &niters_of_prolog_loop, ni_name, true,
 				   th, check_profitability, NULL_TREE, NULL,
 				   bound,
-				   0);
+				   0, false);
 
   gcc_assert (new_loop);
 #ifdef ENABLE_CHECKING
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2871ba1..1bce617 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -916,6 +916,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_TARGET_COST_DATA (res) = init_cost (loop);
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
+  LOOP_VINFO_VECTORIZABLE_EPILOG (res) = false;
 
   return res;
 }
@@ -1321,6 +1322,9 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo, bool slp)
 
   gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo));
   vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = true;
+
   if (slp)
     {
       /* If all the stmts in the loop can be SLPed, we perform only SLP, and
@@ -5052,6 +5056,9 @@ vectorizable_reduction (gimple stmt, gimple_stmt_iterator *gsi,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      /* FORNOW mkuznets: */
+      LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
       if (!vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies))
         return false;
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
@@ -5627,6 +5634,32 @@ vect_transform_loop (loop_vec_info loop_vinfo)
       check_profitability = false;
     }
 
+  /* Vectorizable epilog.  */
+
+  if (!targetm.vectorize.vecmask_support()
+      || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false; 
+
+  if (LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+    {
+      tree index_type =
+	build_nonstandard_integer_type (current_vector_size * 8
+					/ vectorization_factor, true); /* replace magic 8 with something.  */
+      tree vectype = build_vector_type (index_type, vectorization_factor);
+
+      if (optab_handler (vec_mask_gen_optab,
+			 TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Cannot find mask_gen for tree type:");
+	      dump_generic_expr (MSG_NOTE, TDF_SLIM, vectype);
+	    }
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+	}
+    }
+
   /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
      compile time constant), or it is a constant that doesn't divide by the
      vectorization factor, then an epilog loop needs to be created.
@@ -5635,15 +5668,29 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      will remain scalar and will compute the remaining (n%VF) iterations.
      (VF is the vectorization factor).  */
 
-  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  bool epilog_peeling_needed = false;
+
+  if ((!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
        || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 	   && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0)
-       || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-    vect_do_peeling_for_loop_bound (loop_vinfo, &ratio,
-				    th, check_profitability);
-  else
-    ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
-		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
+       || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
+    epilog_peeling_needed = true;
+
+  if (!epilog_peeling_needed)
+    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
+  if (!LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+    {
+      if (epilog_peeling_needed)
+	vect_do_peeling_for_loop_bound (loop_vinfo, &ratio, th,
+					check_profitability);
+      else
+	{
+	  ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
+				 LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+	}
+    }
 
   /* 1) Make sure the loop header has exactly two entries
      2) Make sure we have a preheader basic block.  */
@@ -5888,6 +5935,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 	}		        /* stmts in BB */
     }				/* BBs in loop */
 
+  if (LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+    {
+      update_ssa (TODO_update_ssa);
+      vect_do_peeling_for_loop_bound (loop_vinfo, &ratio, th,
+				      check_profitability);
+    }
+
   slpeel_make_loop_iterate_ntimes (loop, ratio);
 
   /* Reduce loop iterations by the vectorization factor.  */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 8ed0fc5..bf76dcd 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1093,7 +1093,7 @@ vect_supported_load_permutation_p (slp_instance slp_instn)
 	  FOR_EACH_VEC_ELT (node->load_permutation, j, next)
 	    dump_printf (MSG_NOTE, "%d ", next);
 	else
-	  for (i = 0; i < group_size; ++i)
+	  for (j = 0; j < group_size; ++j)
 	    dump_printf (MSG_NOTE, "%d ", i);
       dump_printf (MSG_NOTE, "\n");
     }
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 16ec6d2..4115421 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1833,11 +1833,19 @@ vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
   nunits_in = TYPE_VECTOR_SUBPARTS (vectype_in);
   nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
   if (nunits_in == nunits_out / 2)
-    modifier = NARROW;
+    {
+      modifier = NARROW;
+      if (loop_vinfo)
+	LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+    }
   else if (nunits_out == nunits_in)
     modifier = NONE;
   else if (nunits_out == nunits_in / 2)
-    modifier = WIDEN;
+    {
+      modifier = WIDEN;
+      if (loop_vinfo)
+	LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+    }
   else
     return false;
 
@@ -1885,6 +1893,10 @@ vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      /* FORNOW.  */
+      if (loop_vinfo)
+	LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
       STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location, "=== vectorizable_call ==="
@@ -2500,6 +2512,11 @@ vectorizable_conversion (gimple stmt, gimple_stmt_iterator *gsi,
   gcc_assert (ncopies >= 1);
 
   /* Supportable by target?  */
+
+  /* FORNOW.  */
+  if (loop_vinfo)
+    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
   switch (modifier)
     {
     case NONE:
@@ -3016,6 +3033,11 @@ vectorizable_assignment (gimple stmt, gimple_stmt_iterator *gsi,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (code == VIEW_CONVERT_EXPR || code == CONVERT_EXPR)
+	/* FORNOW.  */
+	if (loop_vinfo)
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
       STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
@@ -3124,6 +3146,7 @@ vectorizable_shift (gimple stmt, gimple_stmt_iterator *gsi,
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   tree vectype;
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = NULL;
   enum tree_code code;
   enum machine_mode vec_mode;
   tree new_temp;
@@ -3149,6 +3172,9 @@ vectorizable_shift (gimple stmt, gimple_stmt_iterator *gsi,
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
   int vf;
 
+  if (loop_vinfo)
+    loop = LOOP_VINFO_LOOP (loop_vinfo);
+
   if (!STMT_VINFO_RELEVANT_P (stmt_info) && !bb_vinfo)
     return false;
 
@@ -3375,6 +3401,13 @@ vectorizable_shift (gimple stmt, gimple_stmt_iterator *gsi,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (loop && LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+	{
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) &=
+	    optab_handler_has_mask (optab, vec_mode);
+
+	}
+
       STMT_VINFO_TYPE (stmt_info) = shift_vec_info_type;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
@@ -3490,11 +3523,12 @@ vectorizable_operation (gimple stmt, gimple_stmt_iterator *gsi,
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   tree vectype;
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = NULL;
   enum tree_code code;
   enum machine_mode vec_mode;
   tree new_temp;
   int op_type;
-  optab optab;
+  optab optab = unknown_optab;
   int icode;
   tree def;
   gimple def_stmt;
@@ -3514,6 +3548,9 @@ vectorizable_operation (gimple stmt, gimple_stmt_iterator *gsi,
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
   int vf;
 
+  if (loop_vinfo)
+    loop = LOOP_VINFO_LOOP (loop_vinfo);
+
   if (!STMT_VINFO_RELEVANT_P (stmt_info) && !bb_vinfo)
     return false;
 
@@ -3647,6 +3684,10 @@ vectorizable_operation (gimple stmt, gimple_stmt_iterator *gsi,
   vec_mode = TYPE_MODE (vectype);
   if (code == MULT_HIGHPART_EXPR)
     {
+      if (loop_vinfo)
+        /* FORNOW.  */
+	LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
       if (can_mult_highpart_p (vec_mode, TYPE_UNSIGNED (vectype)))
 	icode = LAST_INSN_CODE;
       else
@@ -3692,6 +3733,12 @@ vectorizable_operation (gimple stmt, gimple_stmt_iterator *gsi,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (loop && LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+	{
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) &=
+	    optab_handler_has_mask (optab, vec_mode);
+	}
+
       STMT_VINFO_TYPE (stmt_info) = op_vec_info_type;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
@@ -3908,6 +3955,8 @@ vectorizable_store (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
     ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
 
   gcc_assert (ncopies >= 1);
+  if (ncopies > 1 && loop_vinfo)
+    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
 
   /* FORNOW. This restriction should be relaxed.  */
   if (loop && nested_in_vect_loop_p (loop, stmt) && ncopies > 1)
@@ -3976,6 +4025,10 @@ vectorizable_store (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
+      /* FORNOW.  */
+      if (loop_vinfo)
+	LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+
       grouped_store = true;
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
       if (!slp && !PURE_SLP_STMT (stmt_info))
@@ -4011,6 +4064,10 @@ vectorizable_store (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (loop && LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) &=
+	    optab_handler_has_mask (mov_optab, vec_mode);
+
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt,
 			     NULL, NULL, NULL);
@@ -4606,6 +4663,17 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (loop && LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+	{
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) &=
+	    optab_handler_has_mask (mov_optab, mode);
+
+	  if (STMT_VINFO_GATHER_P (stmt_info)
+	      || STMT_VINFO_STRIDE_LOAD_P (stmt_info))
+	    /* FORNOW.  */
+	    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
+	}
+
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       vect_model_load_cost (stmt_info, ncopies, load_lanes_p, NULL, NULL, NULL);
       return true;
@@ -4637,6 +4705,8 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	{
 	  unsigned char *sel = XALLOCAVEC (unsigned char, gather_off_nunits);
 	  modifier = WIDEN;
+	  if (loop_vinfo)
+	    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
 
 	  for (i = 0; i < gather_off_nunits; ++i)
 	    sel[i] = i | nunits;
@@ -4648,6 +4718,8 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	{
 	  unsigned char *sel = XALLOCAVEC (unsigned char, nunits);
 	  modifier = NARROW;
+	  if (loop_vinfo)
+	    LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) = false;
 
 	  for (i = 0; i < nunits; ++i)
 	    sel[i] = i < gather_off_nunits
@@ -5446,6 +5518,7 @@ vectorizable_condition (gimple stmt, gimple_stmt_iterator *gsi,
   tree vec_compare, vec_cond_expr;
   tree new_temp;
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = NULL;
   tree def;
   enum vect_def_type dt, dts[4];
   int nunits = TYPE_VECTOR_SUBPARTS (vectype);
@@ -5460,6 +5533,9 @@ vectorizable_condition (gimple stmt, gimple_stmt_iterator *gsi,
   vec<tree> vec_oprnds3 = vNULL;
   tree vec_cmp_type;
 
+  if (loop_vinfo)
+    loop = LOOP_VINFO_LOOP (loop_vinfo);
+
   if (slp_node || PURE_SLP_STMT (stmt_info))
     ncopies = 1;
   else
@@ -5540,6 +5616,14 @@ vectorizable_condition (gimple stmt, gimple_stmt_iterator *gsi,
 
   if (!vec_stmt)
     {
+      if (loop && LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo))
+	{
+	  LOOP_VINFO_VECTORIZABLE_EPILOG (loop_vinfo) &=
+	    get_vcond_icode_has_mask (TYPE_MODE (vectype),
+				      TYPE_MODE (comp_vectype),
+				      TYPE_UNSIGNED (comp_vectype));
+	}
+
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       return expand_vec_cond_expr_p (vectype, comp_vectype);
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 695c059..cbce8bb 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -314,6 +314,8 @@ typedef struct _loop_vec_info {
      fix it up.  */
   bool operands_swapped;
 
+  bool vectorizable_epilog;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -345,6 +347,7 @@ typedef struct _loop_vec_info {
 #define LOOP_VINFO_TARGET_COST_DATA(L)     (L)->target_cost_data
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
 #define LOOP_VINFO_OPERANDS_SWAPPED(L)     (L)->operands_swapped
+#define LOOP_VINFO_VECTORIZABLE_EPILOG(L)  (L)->vectorizable_epilog
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
 (L)->may_misalign_stmts.length () > 0
diff --git a/gcc/tree.def b/gcc/tree.def
index f825aad..90578d1 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -508,6 +508,9 @@ DEFTREECODE (COND_EXPR, "cond_expr", tcc_expression, 3)
 */
 DEFTREECODE (VEC_COND_EXPR, "vec_cond_expr", tcc_expression, 3)
 
+DEFTREECODE (MASK_GEN, "mask_gen", tcc_expression, 2)
+DEFTREECODE (VEC_MASK, "vec_mask", tcc_expression, 3)
+
 /* Vector permutation expression.  A = VEC_PERM_EXPR<v0, v1, mask> means
 
    N = length(mask)
