public inbox for gcc-patches@gcc.gnu.org
* [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
From: Kirill Yukhin @ 2013-11-12 13:56 UTC
  To: Richard Henderson; +Cc: Uros Bizjak, Jakub Jelinek, GCC Patches

Hello,
The patch at the bottom extends several vectorizer hooks for AVX-512 support.
It decreases the instruction count (icount) for the SPEC2006 FP suite (ref set):

Optset was: -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays
-Ofast -funroll-loops -flto -march=core-avx2 -mtune=core-avx2

Lower is better.

Test\Arch        Icount	   Icount w/      Icount w/
                 incr, %   -mavx2         -mavx512f
SPECfp 	  	    
410.bwaves	-1.91	   1052892413074  1032764159879
416.gamess	-0.35	   4981923623398  4964404349547
433.milc	-2.21	   613881448195	  600337040966
434.zeusmp	-14.05	   839652873778	  721646740255
435.gromacs	-2.63	   1584456238056  1542724794028
436.cactusADM	-33.35	   621112775417	  413998009835
437.leslie3d	-24.68	   685183512084	  516052497339
444.namd	+0.41	   1639494201826  1646198964416
447.dealII	-0.17	   1090815823369  1088911757625
450.soplex	-8.29	   667084302238	  611769330624
453.povray	+0.36	   854174676846	  857232459761
454.calculix	-7.23	   1716433835093  1592358631108
459.GemsFDTD	-22.82	   682427745937	  526665452119
465.tonto	+2.09	   1527568790770  1559478471810
470.lbm		-4.01	   885166688641	  849702055497
481.wrf		-6.12	   1315087631021  1234615894173
482.sphinx3	+16.33	   2115592654950  2461052825829
Geomean		-7.14			

ChangeLog:
2013-11-12  Alexander Ivchenko  <alexander.ivchenko@intel.com>
	    Maxim Kuznetsov  <maxim.kuznetsov@intel.com>
	    Sergey Lega  <sergey.s.lega@intel.com>
	    Anna Tikhonova  <anna.tikhonova@intel.com>
	    Ilya Tocar  <ilya.tocar@intel.com>
	    Andrey Turetskiy  <andrey.turetskiy@intel.com>
	    Ilya Verbin  <ilya.verbin@intel.com>
	    Kirill Yukhin  <kirill.yukhin@intel.com>
	    Michael Zolotukhin  <michael.v.zolotukhin@intel.com>

	* config/i386/i386.c (MAX_CLASSES): Increase number of classes.
	(classify_argument): Extend for 512 bit vectors.
	(construct_container): Ditto.
	(function_arg_advance_32): Ditto.
	(function_arg_advance_64): Ditto.
	(function_arg_32): Ditto.
	(function_arg_64): Ditto.
	(function_value_32): Ditto.
	(return_in_memory_32): Ditto.
	(ix86_gimplify_va_arg): Ditto.
	(standard_sse_constant_p): Ditto.
	(standard_sse_constant_opcode): Ditto.
	(ix86_expand_vector_convert_uns_vsivsf): Ditto.
	(ix86_build_const_vector): Ditto.
	(ix86_build_signbit_mask): Ditto.
	(ix86_expand_sse_cmp): Extend for AVX512.
	(ix86_expand_sse_movcc): Ditto.
	(ix86_expand_int_vcond): Ditto.
	(ix86_expand_vec_perm): Ditto.
	(ix86_expand_sse_unpack): Ditto.
	(ix86_constant_alignment): Ditto.
	(avx_vpermilp_parallel): Ditto.
	(ix86_rtx_costs): Ditto.
	(ix86_expand_vector_init_duplicate): Ditto.
	(ix86_expand_vector_init_concat): Ditto.
	(ix86_expand_vector_init_general): Ditto.
	(ix86_expand_vector_extract): Ditto.
	(emit_reduc_half): Ditto.
	(ix86_vector_mode_supported_p): Ditto.
	(ix86_emit_swdivsf): Ditto.
	(ix86_emit_swsqrtsf): Ditto.
	(expand_vec_perm_1): Ditto.
	(ix86_vectorize_vec_perm_const_ok): Ditto.
	(ix86_expand_mul_widen_evenodd): Ditto.
	(ix86_expand_sse2_mulvxdi3): Ditto.
	(ix86_preferred_simd_mode): Ditto.
	(ix86_autovectorize_vector_sizes): Ditto.
	(ix86_expand_vec_perm_vpermi2): New.
	(ix86_vector_duplicate_value): Ditto.
	* config/i386/sse.md (fixuns_trunc<mode><sseintvecmodelower>2): Extend
	for AVX512.
	(vec_pack_ufix_trunc_<mode>): Ditto.
	* tree-vect-stmts.c (vectorizable_load): Support AVX512's gathers.
	* tree-vectorizer.h (MAX_VECTORIZATION_FACTOR): Extend for 512 bit
	vectors.
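
For reviewers less familiar with the AVX-512 model behind the
ix86_expand_sse_cmp and ix86_expand_sse_movcc changes above, here is a
minimal intrinsics sketch (illustrative only, not part of the patch):
a 512-bit comparison produces a k-mask register rather than a vector of
all-ones/all-zeros lanes, and element selection is then a masked blend.

  #include <immintrin.h>

  /* GT case of an integer vcond on 16 x int32: vpcmpd writes a 16-bit
     k-mask, vpblendmd consumes it.  */
  static __m512i
  vcond_gt_epi32 (__m512i a, __m512i b, __m512i op_true, __m512i op_false)
  {
    __mmask16 k = _mm512_cmpgt_epi32_mask (a, b);
    /* Lanes whose bit in k is set take op_true, the rest op_false.  */
    return _mm512_mask_blend_epi32 (k, op_false, op_true);
  }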

Testing:
  1. Bootstrap passes.
  2. make check shows no regressions.
  3. SPEC 2000 & 2006 build shows no regressions, both with and without the -mavx512f option.
  4. SPEC 2000 & 2006 run shows no stability regressions without the -mavx512f option.
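
The vectorizable_load change gives the gather call an explicit merge
operand: the value kept in lanes whose mask bit is clear.  A purely
illustrative sketch in intrinsics terms (hypothetical wrapper name, not
the builtin the vectorizer actually emits):

  #include <immintrin.h>

  /* AVX-512 masked gather: lanes with a zero bit in k keep merge
     instead of loading from base + idx*4.  */
  static __m512i
  gather_merge_epi32 (__m512i merge, const int *base, __m512i idx,
                      __mmask16 k)
  {
    return _mm512_mask_i32gather_epi32 (merge, k, idx, base, 4);
  }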

Is it ok to commit to main trunk?

--
Thanks, K

---
 gcc/config/i386/i386.c | 607 +++++++++++++++++++++++++++++++++++++++++++------
 gcc/config/i386/sse.md |  94 ++++++--
 gcc/tree-vect-stmts.c  |  34 ++-
 gcc/tree-vectorizer.h  |   4 +-
 4 files changed, 636 insertions(+), 103 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e172adf..bc52afa 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2209,7 +2209,7 @@ enum x86_64_reg_class
     X86_64_MEMORY_CLASS
   };
 
-#define MAX_CLASSES 4
+#define MAX_CLASSES 8
 
 /* Table of constants used by fldpi, fldln2, etc....  */
 static REAL_VALUE_TYPE ext_80387_constants_table [5];
@@ -6108,7 +6108,7 @@ merge_classes (enum x86_64_reg_class class1, enum x86_64_reg_class class2)
    sized containers, classes[0] will be NO_CLASS and 1 is returned.
 
    BIT_OFFSET is used internally for handling records and specifies offset
-   of the offset in bits modulo 256 to avoid overflow cases.
+   of the offset in bits modulo 512 to avoid overflow cases.
 
    See the x86-64 PS ABI for details.
 */
@@ -6208,7 +6208,7 @@ classify_argument (enum machine_mode mode, const_tree type,
 		      num = classify_argument (TYPE_MODE (type), type,
 					       subclasses,
 					       (int_bit_position (field)
-						+ bit_offset) % 256);
+						+ bit_offset) % 512);
 		      if (!num)
 			return 0;
 		      pos = (int_bit_position (field)
@@ -6458,6 +6458,21 @@ classify_argument (enum machine_mode mode, const_tree type,
       classes[2] = X86_64_SSEUP_CLASS;
       classes[3] = X86_64_SSEUP_CLASS;
       return 4;
+    case V8DFmode:
+    case V16SFmode:
+    case V8DImode:
+    case V16SImode:
+    case V32HImode:
+    case V64QImode:
+      classes[0] = X86_64_SSE_CLASS;
+      classes[1] = X86_64_SSEUP_CLASS;
+      classes[2] = X86_64_SSEUP_CLASS;
+      classes[3] = X86_64_SSEUP_CLASS;
+      classes[4] = X86_64_SSEUP_CLASS;
+      classes[5] = X86_64_SSEUP_CLASS;
+      classes[6] = X86_64_SSEUP_CLASS;
+      classes[7] = X86_64_SSEUP_CLASS;
+      return 8;
     case V4SFmode:
     case V4SImode:
     case V16QImode:
@@ -6643,6 +6658,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
       && mode != BLKmode)
     return gen_reg_or_parallel (mode, orig_mode,
 				SSE_REGNO (sse_regno));
+  if (n == 8
+      && regclass[0] == X86_64_SSE_CLASS
+      && regclass[1] == X86_64_SSEUP_CLASS
+      && regclass[2] == X86_64_SSEUP_CLASS
+      && regclass[3] == X86_64_SSEUP_CLASS
+      && regclass[4] == X86_64_SSEUP_CLASS
+      && regclass[5] == X86_64_SSEUP_CLASS
+      && regclass[6] == X86_64_SSEUP_CLASS
+      && regclass[7] == X86_64_SSEUP_CLASS
+      && mode != BLKmode)
+    return gen_reg_or_parallel (mode, orig_mode,
+				SSE_REGNO (sse_regno));
   if (n == 2
       && regclass[0] == X86_64_X87_CLASS
       && regclass[1] == X86_64_X87UP_CLASS)
@@ -6724,6 +6751,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
 		tmpmode = OImode;
 		i += 3;
 		break;
+	      case 8:
+		gcc_assert (i == 0
+			    && regclass[1] == X86_64_SSEUP_CLASS
+			    && regclass[2] == X86_64_SSEUP_CLASS
+			    && regclass[3] == X86_64_SSEUP_CLASS
+			    && regclass[4] == X86_64_SSEUP_CLASS
+			    && regclass[5] == X86_64_SSEUP_CLASS
+			    && regclass[6] == X86_64_SSEUP_CLASS
+			    && regclass[7] == X86_64_SSEUP_CLASS);
+		tmpmode = XImode;
+		i += 7;
+		break;
 	      default:
 		gcc_unreachable ();
 	      }
@@ -6797,6 +6836,12 @@ function_arg_advance_32 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 
     case V8SFmode:
     case V8SImode:
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V32QImode:
     case V16HImode:
     case V4DFmode:
@@ -6848,8 +6893,9 @@ function_arg_advance_64 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 {
   int int_nregs, sse_nregs;
 
-  /* Unnamed 256bit vector mode parameters are passed on stack.  */
-  if (!named && VALID_AVX256_REG_MODE (mode))
+  /* Unnamed 512 and 256bit vector mode parameters are passed on stack.  */
+  if (!named && (VALID_AVX512F_REG_MODE (mode)
+		 || VALID_AVX256_REG_MODE (mode)))
     return;
 
   if (examine_argument (mode, type, 0, &int_nregs, &sse_nregs)
@@ -7000,9 +7046,16 @@ function_arg_32 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
       break;
 
     case OImode:
-      /* OImode shouldn't be used directly.  */
+    case XImode:
+      /* OImode and XImode shouldn't be used directly.  */
       gcc_unreachable ();
 
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V8SFmode:
     case V8SImode:
     case V32QImode:
@@ -7065,7 +7118,13 @@ function_arg_64 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!named)
 	return NULL;
       break;
@@ -7468,6 +7527,10 @@ function_value_32 (enum machine_mode orig_mode, enum machine_mode mode,
   else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 32)
     regno = FIRST_SSE_REG;
 
+  /* 64-byte vector modes in %zmm0.   */
+  else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+    regno = FIRST_SSE_REG;
+
   /* Floating point return values in %st(0) (unless -mno-fp-ret-in-387).  */
   else if (X87_FLOAT_MODE_P (mode) && TARGET_FLOAT_RETURNS_IN_80387)
     regno = FIRST_FLOAT_REG;
@@ -7675,6 +7738,10 @@ return_in_memory_32 (const_tree type, enum machine_mode mode)
       /* AVX values are returned in YMM0, except when it doesn't exist.  */
       if (size == 32)
 	return !TARGET_AVX;
+
+      /* AVX512F values are returned in ZMM0, except when it doesn't exist.  */
+      if (size == 64)
+	return !TARGET_AVX512F;
     }
 
   if (mode == XFmode)
@@ -8211,7 +8278,13 @@ ix86_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!TARGET_64BIT_MS_ABI)
 	{
 	  container = NULL;
@@ -8626,6 +8699,12 @@ standard_sse_constant_p (rtx x)
       case V4DImode:
 	if (TARGET_AVX2)
 	  return 2;
+      case V64QImode:
+      case V32HImode:
+      case V16SImode:
+      case V8DImode:
+	if (TARGET_AVX512F)
+	  return 2;
       default:
 	break;
       }
@@ -8644,6 +8723,11 @@ standard_sse_constant_opcode (rtx insn, rtx x)
     case 1:
       switch (get_attr_mode (insn))
 	{
+	case MODE_XI:
+	case MODE_V16SF:
+	  return "vpxord\t%g0, %g0, %g0";
+	case MODE_V8DF:
+	  return "vpxorq\t%g0, %g0, %g0";
 	case MODE_TI:
 	  return "%vpxor\t%0, %d0";
 	case MODE_V2DF:
@@ -18443,6 +18527,11 @@ ix86_expand_vector_convert_uns_vsivsf (rtx target, rtx val)
   enum machine_mode fltmode = GET_MODE (target);
   rtx (*cvt) (rtx, rtx);
 
+  if (intmode == V16SImode)
+    {
+      emit_insn (gen_ufloatv16siv16sf2 (target, val));
+      return;
+    }
   if (intmode == V4SImode)
     cvt = gen_floatv4siv4sf2;
   else
@@ -18533,17 +18622,23 @@ ix86_build_const_vector (enum machine_mode mode, bool vect, rtx value)
 
   switch (mode)
     {
+    case V64QImode:
     case V32QImode:
     case V16QImode:
+    case V32HImode:
     case V16HImode:
     case V8HImode:
+    case V16SImode:
     case V8SImode:
     case V4SImode:
+    case V8DImode:
     case V4DImode:
     case V2DImode:
       gcc_assert (vect);
+    case V16SFmode:
     case V8SFmode:
     case V4SFmode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       n_elt = GET_MODE_NUNITS (mode);
@@ -18580,6 +18675,8 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
   /* Find the sign bit, sign extended to 2*HWI.  */
   switch (mode)
     {
+    case V16SImode:
+    case V16SFmode:
     case V8SImode:
     case V4SImode:
     case V8SFmode:
@@ -18590,8 +18687,10 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
       lo = 0x80000000, hi = lo < 0;
       break;
 
+    case V8DImode:
     case V4DImode:
     case V2DImode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       vec_mode = mode;
@@ -20448,22 +20547,63 @@ ix86_expand_sse_cmp (rtx dest, enum rtx_code code, rtx cmp_op0, rtx cmp_op1,
 		     rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
-  enum machine_mode cmp_mode = GET_MODE (cmp_op0);
+  enum machine_mode cmp_ops_mode = GET_MODE (cmp_op0);
+
+  /* In general the result of comparison can differ from the operands' type.  */
+  enum machine_mode cmp_mode;
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = false;
   rtx x;
 
-  cmp_op0 = force_reg (cmp_mode, cmp_op0);
-  if (!nonimmediate_operand (cmp_op1, cmp_mode))
-    cmp_op1 = force_reg (cmp_mode, cmp_op1);
+  if (GET_MODE_SIZE (cmp_ops_mode) == 64)
+    {
+      cmp_mode = mode_for_size (GET_MODE_NUNITS (cmp_ops_mode), MODE_INT, 0);
+      gcc_assert (cmp_mode != BLKmode);
+
+      maskcmp = true;
+    }
+  else
+    cmp_mode = cmp_ops_mode;
+
+
+  cmp_op0 = force_reg (cmp_ops_mode, cmp_op0);
+  if (!nonimmediate_operand (cmp_op1, cmp_ops_mode))
+    cmp_op1 = force_reg (cmp_ops_mode, cmp_op1);
 
   if (optimize
       || reg_overlap_mentioned_p (dest, op_true)
       || reg_overlap_mentioned_p (dest, op_false))
-    dest = gen_reg_rtx (mode);
+    dest = gen_reg_rtx (maskcmp ? cmp_mode : mode);
+
+  /* Compare patterns for int modes are unspec in AVX512F only.  */
+  if (maskcmp && (code == GT || code == EQ))
+    {
+      rtx (*gen)(rtx, rtx, rtx);
+
+      switch (cmp_ops_mode)
+	{
+	case V16SImode:
+	  gen = code == GT ? gen_avx512f_gtv16si3 : gen_avx512f_eqv16si3_1;
+	  break;
+	case V8DImode:
+	  gen = code == GT ? gen_avx512f_gtv8di3 : gen_avx512f_eqv8di3_1;
+	  break;
+	default:
+	  gen = NULL;
+	}
 
+      if (gen)
+	{
+	  emit_insn (gen (dest, cmp_op0, cmp_op1));
+	  return dest;
+	}
+    }
   x = gen_rtx_fmt_ee (code, cmp_mode, cmp_op0, cmp_op1);
-  if (cmp_mode != mode)
+
+  if (cmp_mode != mode && !maskcmp)
     {
-      x = force_reg (cmp_mode, x);
+      x = force_reg (cmp_ops_mode, x);
       convert_move (dest, x, false);
     }
   else
@@ -20479,33 +20619,43 @@ static void
 ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
+  enum machine_mode cmpmode = GET_MODE (cmp);
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = (mode != cmpmode && TARGET_AVX512F);
+
   rtx t2, t3, x;
 
   if (vector_all_ones_operand (op_true, mode)
-      && rtx_equal_p (op_false, CONST0_RTX (mode)))
+      && rtx_equal_p (op_false, CONST0_RTX (mode))
+      && !maskcmp)
     {
       emit_insn (gen_rtx_SET (VOIDmode, dest, cmp));
     }
-  else if (op_false == CONST0_RTX (mode))
+  else if (op_false == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
       x = gen_rtx_AND (mode, cmp, op_true);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (op_true == CONST0_RTX (mode))
+  else if (op_true == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_NOT (mode, cmp);
       x = gen_rtx_AND (mode, x, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode))
+  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_IOR (mode, cmp, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (TARGET_XOP)
+  else if (TARGET_XOP
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
 
@@ -20573,6 +20723,20 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 	      cmp = gen_lowpart (V32QImode, cmp);
 	    }
 	  break;
+
+	case V16SImode:
+	  gen = gen_avx512f_blendmv16si;
+	  break;
+	case V8DImode:
+	  gen = gen_avx512f_blendmv8di;
+	  break;
+	case V8DFmode:
+	  gen = gen_avx512f_blendmv8df;
+	  break;
+	case V16SFmode:
+	  gen = gen_avx512f_blendmv16sf;
+	  break;
+
 	default:
 	  break;
 	}
@@ -20840,6 +21004,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 	  switch (mode)
 	    {
+	    case V16SImode:
+	    case V8DImode:
 	    case V8SImode:
 	    case V4DImode:
 	    case V4SImode:
@@ -20850,6 +21016,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 		  switch (mode)
 		    {
+		    case V16SImode: gen_sub3 = gen_subv16si3; break;
+		    case V8DImode: gen_sub3 = gen_subv8di3; break;
 		    case V8SImode: gen_sub3 = gen_subv8si3; break;
 		    case V4DImode: gen_sub3 = gen_subv4di3; break;
 		    case V4SImode: gen_sub3 = gen_subv4si3; break;
@@ -20905,7 +21073,8 @@ ix86_expand_int_vcond (rtx operands[])
       gcc_assert (GET_MODE_SIZE (data_mode) == GET_MODE_SIZE (mode));
       x = ix86_expand_sse_cmp (gen_reg_rtx (mode), code, cop0, cop1,
 			       operands[1+negate], operands[2-negate]);
-      x = gen_lowpart (data_mode, x);
+      if (GET_MODE (x) == mode)
+	x = gen_lowpart (data_mode, x);
     }
 
   ix86_expand_sse_movcc (operands[0], x, operands[1+negate],
@@ -20913,6 +21082,35 @@ ix86_expand_int_vcond (rtx operands[])
   return true;
 }
 
+static bool
+ix86_expand_vec_perm_vpermi2 (rtx target, rtx op0, rtx mask, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  switch (mode)
+    {
+    case V16SImode:
+      emit_insn (gen_avx512f_vpermi2varv16si3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V16SFmode:
+      emit_insn (gen_avx512f_vpermi2varv16sf3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V8DImode:
+      emit_insn (gen_avx512f_vpermi2varv8di3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    case V8DFmode:
+      emit_insn (gen_avx512f_vpermi2varv8df3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    default:
+      return false;
+    }
+}
+
 /* Expand a variable vector permutation.  */
 
 void
@@ -20931,7 +21129,10 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 32);
+  gcc_assert (w <= 64);
+
+  if (ix86_expand_vec_perm_vpermi2 (target, op0, mask, op1))
+    return;
 
   if (TARGET_AVX2)
     {
@@ -21311,6 +21512,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v32qi : gen_vec_extract_lo_v32qi;
 	  break;
+	case V32HImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv16hiv16si2;
+	  else
+	    unpack = gen_avx512f_sign_extendv16hiv16si2;
+	  halfmode = V16HImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v32hi : gen_vec_extract_lo_v32hi;
+	  break;
 	case V16HImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv8hiv8si2;
@@ -21320,6 +21530,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v16hi : gen_vec_extract_lo_v16hi;
 	  break;
+	case V16SImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv8siv8di2;
+	  else
+	    unpack = gen_avx512f_sign_extendv8siv8di2;
+	  halfmode = V8SImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v16si : gen_vec_extract_lo_v16si;
+	  break;
 	case V8SImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv4siv4di2;
@@ -21351,7 +21570,7 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  gcc_unreachable ();
 	}
 
-      if (GET_MODE_SIZE (imode) == 32)
+      if (GET_MODE_SIZE (imode) >= 32)
 	{
 	  tmp = gen_reg_rtx (halfmode);
 	  emit_insn (extract (tmp, src));
@@ -26070,7 +26289,8 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+  int max_align = optimize_size ? BITS_PER_WORD
+				: MIN (512, MAX_OFILE_ALIGNMENT);
 
   if (opt
       && AGGREGATE_TYPE_P (type)
@@ -34175,7 +34395,7 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 {
   unsigned i, nelt = GET_MODE_NUNITS (mode);
   unsigned mask = 0;
-  unsigned char ipar[8] = {};  /* Silence -Wuninitialized warning.  */
+  unsigned char ipar[16] = {};  /* Silence -Wuninitialized warning.  */
 
   if (XVECLEN (par, 0) != (int) nelt)
     return 0;
@@ -34198,6 +34418,24 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 
   switch (mode)
     {
+    case V8DFmode:
+      /* In the 512-bit DFmode case, we can only move elements within
+         a 128-bit lane.  First fill the second part of the mask,
+	 then fallthru.  */
+      for (i = 4; i < 6; ++i)
+	{
+	  if (ipar[i] < 4 || ipar[i] >= 6)
+	    return 0;
+	  mask |= (ipar[i] - 4) << i;
+	}
+      for (i = 6; i < 8; ++i)
+	{
+	  if (ipar[i] < 6)
+	    return 0;
+	  mask |= (ipar[i] - 6) << i;
+	}
+      /* FALLTHRU */
+
     case V4DFmode:
       /* In the 256-bit DFmode case, we can only move elements within
          a 128-bit lane.  */
@@ -34215,10 +34453,18 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 	}
       break;
 
+    case V16SFmode:
+      /* In the 512-bit SFmode case, the permutation in the upper 256 bits
+	 must mirror the permutation in the lower 256 bits.  */
+      for (i = 0; i < 8; ++i)
+	if (ipar[i] + 8 != ipar[i + 8])
+	  return 0;
+      /* FALLTHRU */
+
     case V8SFmode:
-      /* In the 256-bit SFmode case, we have full freedom of movement
-	 within the low 128-bit lane, but the high 128-bit lane must
-	 mirror the exact same pattern.  */
+      /* In the 256-bit SFmode case, we have full freedom of
+         movement within the low 128-bit lane, but the high 128-bit
+         lane must mirror the exact same pattern.  */
       for (i = 0; i < 4; ++i)
 	if (ipar[i] + 4 != ipar[i + 4])
 	  return 0;
@@ -35172,6 +35418,7 @@ static bool
 ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 		bool speed)
 {
+  rtx mask;
   enum rtx_code code = (enum rtx_code) code_i;
   enum rtx_code outer_code = (enum rtx_code) outer_code_i;
   enum machine_mode mode = GET_MODE (x);
@@ -35648,13 +35895,21 @@ ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 
     case VEC_SELECT:
     case VEC_CONCAT:
-    case VEC_MERGE:
     case VEC_DUPLICATE:
       /* ??? Assume all of these vector manipulation patterns are
 	 recognizable.  In which case they all pretty much have the
 	 same cost.  */
      *total = cost->fabs;
      return true;
+    case VEC_MERGE:
+      mask = XEXP (x, 2);
+      /* This is a masked instruction; assume the same cost
+	 as the non-masked variant.  */
+      if (TARGET_AVX512F && register_operand (mask, GET_MODE (mask)))
+	*total = rtx_cost (XEXP (x, 0), outer_code, opno, speed);
+      else
+	*total = cost->fabs;
+      return true;
 
     default:
       return false;
@@ -36824,6 +37079,36 @@ get_mode_wider_vector (enum machine_mode o)
   return n;
 }
 
+/* A subroutine of ix86_expand_vector_init_duplicate.  Tries to
+   fill target with val via vec_duplicate.  */
+
+static bool
+ix86_vector_duplicate_value (enum machine_mode mode, rtx target, rtx val)
+{
+  bool ok;
+  rtx insn, dup;
+
+  /* First attempt to recognize VAL as-is.  */
+  dup = gen_rtx_VEC_DUPLICATE (mode, val);
+  insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
+  if (recog_memoized (insn) < 0)
+    {
+      rtx seq;
+      /* If that fails, force VAL into a register.  */
+
+      start_sequence ();
+      XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
+      seq = get_insns ();
+      end_sequence ();
+      if (seq)
+	emit_insn_before (seq, insn);
+
+      ok = recog_memoized (insn) >= 0;
+      gcc_assert (ok);
+    }
+  return true;
+}
+
 /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
    with all elements equal to VAR.  Return true if successful.  */
 
@@ -36849,29 +37134,11 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, enum machine_mode mode,
     case V2DImode:
     case V4SFmode:
     case V4SImode:
-      {
-	rtx insn, dup;
-
-	/* First attempt to recognize VAL as-is.  */
-	dup = gen_rtx_VEC_DUPLICATE (mode, val);
-	insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
-	if (recog_memoized (insn) < 0)
-	  {
-	    rtx seq;
-	    /* If that fails, force VAL into a register.  */
-
-	    start_sequence ();
-	    XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
-	    seq = get_insns ();
-	    end_sequence ();
-	    if (seq)
-	      emit_insn_before (seq, insn);
-
-	    ok = recog_memoized (insn) >= 0;
-	    gcc_assert (ok);
-	  }
-      }
-      return true;
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
+      return ix86_vector_duplicate_value (mode, target, val);
 
     case V4HImode:
       if (!mmx_ok)
@@ -37221,8 +37488,8 @@ static void
 ix86_expand_vector_init_concat (enum machine_mode mode,
 				rtx target, rtx *ops, int n)
 {
-  enum machine_mode cmode, hmode = VOIDmode;
-  rtx first[8], second[4];
+  enum machine_mode cmode, hmode = VOIDmode, gmode = VOIDmode;
+  rtx first[16], second[8], third[4];
   rtvec v;
   int i, j;
 
@@ -37231,6 +37498,18 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 2:
       switch (mode)
 	{
+	case V16SImode:
+	  cmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V8SFmode;
+	  break;
+	case V8DImode:
+	  cmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V4SImode;
 	  break;
@@ -37297,6 +37576,14 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 8:
       switch (mode)
 	{
+	case V8DImode:
+	  cmode = V2DImode;
+	  hmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V2DFmode;
+	  hmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V2SImode;
 	  hmode = V4SImode;
@@ -37310,6 +37597,24 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
 	}
       goto half;
 
+    case 16:
+      switch (mode)
+	{
+	case V16SImode:
+	  cmode = V2SImode;
+	  hmode = V4SImode;
+	  gmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V2SFmode;
+	  hmode = V4SFmode;
+	  gmode = V8SFmode;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      goto half;
+
 half:
       /* FIXME: We process inputs backward to help RA.  PR 36222.  */
       i = n - 1;
@@ -37323,7 +37628,27 @@ half:
 	}
 
       n >>= 1;
-      if (n > 2)
+      if (n > 4)
+	{
+	  gcc_assert (hmode != VOIDmode);
+	  gcc_assert (gmode != VOIDmode);
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      second[j] = gen_reg_rtx (hmode);
+	      ix86_expand_vector_init_concat (hmode, second [j],
+					      &first [i], 2);
+	    }
+	  n >>= 1;
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      third[j] = gen_reg_rtx (gmode);
+	      ix86_expand_vector_init_concat (gmode, third[j],
+					      &second[i], 2);
+	    }
+	  n >>= 1;
+	  ix86_expand_vector_init_concat (mode, target, third, n);
+	}
+      else if (n > 2)
 	{
 	  gcc_assert (hmode != VOIDmode);
 	  for (i = j = 0; i < n; i += 2, j++)
@@ -37466,7 +37791,7 @@ static void
 ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 				 rtx target, rtx vals)
 {
-  rtx ops[32], op0, op1;
+  rtx ops[64], op0, op1;
   enum machine_mode half_mode = VOIDmode;
   int n, i;
 
@@ -37478,6 +37803,10 @@ ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 	break;
       /* FALLTHRU */
 
+    case V16SImode:
+    case V16SFmode:
+    case V8DFmode:
+    case V8DImode:
     case V8SFmode:
     case V8SImode:
     case V4DFmode:
@@ -38103,6 +38432,42 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
 	}
       break;
 
+    case V16SFmode:
+      tmp = gen_reg_rtx (V8SFmode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16sf (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16sf (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DFmode:
+      tmp = gen_reg_rtx (V4DFmode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8df (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8df (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
+    case V16SImode:
+      tmp = gen_reg_rtx (V8SImode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16si (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16si (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DImode:
+      tmp = gen_reg_rtx (V4DImode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8di (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8di (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
     case V8QImode:
       /* ??? Could extract the appropriate HImode element and shift.  */
     default:
@@ -38195,6 +38560,44 @@ emit_reduc_half (rtx dest, rtx src, int i)
 				    GEN_INT (i / 2));
 	}
       break;
+    case V16SImode:
+    case V16SFmode:
+    case V8DImode:
+    case V8DFmode:
+      if (i > 128)
+	tem = gen_avx512f_shuf_i32x4_1 (gen_lowpart (V16SImode, dest),
+				      gen_lowpart (V16SImode, src),
+				      gen_lowpart (V16SImode, src),
+				      GEN_INT (0x4 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x5 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x6 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x7 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0xC), GEN_INT (0xD),
+				      GEN_INT (0xE), GEN_INT (0xF),
+				      GEN_INT (0x10), GEN_INT (0x11),
+				      GEN_INT (0x12), GEN_INT (0x13),
+				      GEN_INT (0x14), GEN_INT (0x15),
+				      GEN_INT (0x16), GEN_INT (0x17));
+      else
+	tem = gen_avx512f_pshufd_1 (gen_lowpart (V16SImode, dest),
+				   gen_lowpart (V16SImode, src),
+				   GEN_INT (i == 128 ? 0x2 : 0x1),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (i == 128 ? 0x6 : 0x5),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (i == 128 ? 0xA : 0x9),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (i == 128 ? 0xE : 0xD),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF));
+      break;
     default:
       gcc_unreachable ();
     }
@@ -38259,6 +38662,8 @@ ix86_vector_mode_supported_p (enum machine_mode mode)
     return true;
   if (TARGET_AVX && VALID_AVX256_REG_MODE (mode))
     return true;
+  if (TARGET_AVX512F && VALID_AVX512F_REG_MODE (mode))
+    return true;
   if (TARGET_MMX && VALID_MMX_REG_MODE (mode))
     return true;
   if (TARGET_3DNOW && VALID_MMX_REG_MODE_3DNOW (mode))
@@ -38572,9 +38977,15 @@ void ix86_emit_swdivsf (rtx res, rtx a, rtx b, enum machine_mode mode)
   b = force_reg (mode, b);
 
   /* x0 = rcp(b) estimate */
-  emit_insn (gen_rtx_SET (VOIDmode, x0,
-			  gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
-					  UNSPEC_RCP)));
+  if (mode == V16SFmode || mode == V8DFmode)
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP14)));
+  else
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP)));
+
   /* e0 = x0 * b */
   emit_insn (gen_rtx_SET (VOIDmode, e0,
 			  gen_rtx_MULT (mode, x0, b)));
@@ -38604,6 +39015,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 {
   rtx x0, e0, e1, e2, e3, mthree, mhalf;
   REAL_VALUE_TYPE r;
+  int unspec;
 
   x0 = gen_reg_rtx (mode);
   e0 = gen_reg_rtx (mode);
@@ -38616,11 +39028,15 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 
   real_arithmetic (&r, NEGATE_EXPR, &dconsthalf, NULL);
   mhalf = CONST_DOUBLE_FROM_REAL_VALUE (r, SFmode);
+  unspec = UNSPEC_RSQRT;
 
   if (VECTOR_MODE_P (mode))
     {
       mthree = ix86_build_const_vector (mode, true, mthree);
       mhalf = ix86_build_const_vector (mode, true, mhalf);
+      /* There is no 512-bit rsqrt.  There is however rsqrt14.  */
+      if (GET_MODE_SIZE (mode) == 64)
+	unspec = UNSPEC_RSQRT14;
     }
 
   /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
@@ -38631,7 +39047,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
   /* x0 = rsqrt(a) estimate */
   emit_insn (gen_rtx_SET (VOIDmode, x0,
 			  gen_rtx_UNSPEC (mode, gen_rtvec (1, a),
-					  UNSPEC_RSQRT)));
+					  unspec)));
 
   /* If (a == 0.0) Filter out infinity to prevent NaN for sqrt(0.0).  */
   if (!recip)
@@ -38642,11 +39058,23 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
       mask = gen_reg_rtx (mode);
 
       zero = force_reg (mode, CONST0_RTX(mode));
-      emit_insn (gen_rtx_SET (VOIDmode, mask,
-			      gen_rtx_NE (mode, zero, a)));
 
-      emit_insn (gen_rtx_SET (VOIDmode, x0,
-			      gen_rtx_AND (mode, x0, mask)));
+      /* Handle masked compare.  */
+      if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+	{
+	  mask = gen_reg_rtx (HImode);
+	  /* Imm value 0x4 corresponds to not-equal comparison.  */
+	  emit_insn (gen_avx512f_cmpv16sf3 (mask, zero, a, GEN_INT (0x4)));
+	  emit_insn (gen_avx512f_blendmv16sf (x0, zero, x0, mask));
+	}
+      else
+	{
+	  emit_insn (gen_rtx_SET (VOIDmode, mask,
+				  gen_rtx_NE (mode, zero, a)));
+
+	  emit_insn (gen_rtx_SET (VOIDmode, x0,
+				  gen_rtx_AND (mode, x0, mask)));
+	}
     }
 
   /* e0 = x0 * a */
@@ -40168,6 +40596,19 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_pshufb (d))
     return true;
 
+  /* Try the AVX512F vpermi2 instructions.  */
+  rtx vec[64];
+  enum machine_mode mode = d->vmode;
+  if (mode == V8DFmode)
+    mode = V8DImode;
+  else if (mode == V16SFmode)
+    mode = V16SImode;
+  for (i = 0; i < nelt; ++i)
+    vec[i] = GEN_INT (d->perm[i]);
+  rtx mask = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt, vec));
+  if (ix86_expand_vec_perm_vpermi2 (d->target, d->op0, mask, d->op1))
+    return true;
+
   return false;
 }
 
@@ -41775,6 +42216,10 @@ ix86_vectorize_vec_perm_const_ok (enum machine_mode vmode,
 
   /* Given sufficient ISA support we can just return true here
      for selected vector modes.  */
+  if (d.vmode == V16SImode || d.vmode == V16SFmode
+      || d.vmode == V8DFmode || d.vmode == V8DImode)
+    /* All implementable with a single vpermi2 insn.  */
+    return true;
   if (GET_MODE_SIZE (d.vmode) == 16)
     {
       /* All implementable with a single vpperm insn.  */
@@ -42017,7 +42462,7 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
     op2 = force_reg (mode, op2);
 
   /* We only play even/odd games with vectors of SImode.  */
-  gcc_assert (mode == V4SImode || mode == V8SImode);
+  gcc_assert (mode == V4SImode || mode == V8SImode || mode == V16SImode);
 
   /* If we're looking for the odd results, shift those members down to
      the even slots.  For some cpus this is faster than a PSHUFD.  */
@@ -42043,7 +42488,14 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
       op2 = gen_lowpart (mode, op2);
     }
 
-  if (mode == V8SImode)
+  if (mode == V16SImode)
+    {
+      if (uns_p)
+	x = gen_vec_widen_umult_even_v16si (dest, op1, op2);
+      else
+	x = gen_vec_widen_smult_even_v16si (dest, op1, op2);
+    }
+  else if (mode == V8SImode)
     {
       if (uns_p)
 	x = gen_vec_widen_umult_even_v8si (dest, op1, op2);
@@ -42263,6 +42715,11 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
 	  umul = gen_vec_widen_umult_even_v8si;
 	  nmode = V8SImode;
 	}
+      else if (mode == V8DImode)
+	{
+	  umul = gen_vec_widen_umult_even_v16si;
+	  nmode = V16SImode;
+	}
       else
 	gcc_unreachable ();
 
@@ -43421,12 +43878,16 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case HImode:
       return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V16HImode : V8HImode;
     case SImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
+      return TARGET_AVX512F ? V16SImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
     case DImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
+      return TARGET_AVX512F ? V8DImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
 
     case SFmode:
-      if (TARGET_AVX && !TARGET_PREFER_AVX128)
+      if (TARGET_AVX512F)
+	return V16SFmode;
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
       else
 	return V4SFmode;
@@ -43434,6 +43895,8 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
+      else if (TARGET_AVX512F)
+	return V8DFmode;
       else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
       else if (TARGET_SSE2)
@@ -43446,12 +43909,14 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 }
 
 /* If AVX is enabled then try vectorizing with both 256bit and 128bit
-   vectors.  */
+   vectors.  If AVX512F is enabled then try vectorizing with 512bit,
+   256bit and 128bit vectors.  */
 
 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
+  return TARGET_AVX512F ? 64 | 32 | 16 :
+    (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
 \f
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 321d969..072c27b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1045,6 +1045,7 @@
 {
   switch (get_attr_mode (insn))
     {
+    case MODE_V16SF:
     case MODE_V8SF:
     case MODE_V4SF:
       return "%vmovups\t{%1, %0|%0, %1}";
@@ -3841,11 +3842,17 @@
    (match_operand:VF1 1 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[3];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
-  emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
-  emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+  if (GET_MODE (operands[1]) == V16SFmode)
+    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
+					  operands[1]));
+  else
+    {
+      rtx tmp[3];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
+      emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
+      emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+    }
   DONE;
 })
 
@@ -4771,6 +4778,32 @@
   DONE;
 })
 
+(define_expand "vec_unpacku_float_hi_v16si"
+  [(match_operand:V8DF 0 "register_operand")
+   (match_operand:V16SI 1 "register_operand")]
+  "TARGET_AVX512F"
+{
+  REAL_VALUE_TYPE TWO32r;
+  rtx k, x, tmp[4];
+
+  real_ldexp (&TWO32r, &dconst1, 32);
+  x = const_double_from_real_value (TWO32r, DFmode);
+
+  tmp[0] = force_reg (V8DFmode, CONST0_RTX (V8DFmode));
+  tmp[1] = force_reg (V8DFmode, ix86_build_const_vector (V8DFmode, 1, x));
+  tmp[2] = gen_reg_rtx (V8DFmode);
+  tmp[3] = gen_reg_rtx (V8SImode);
+  k = gen_reg_rtx (QImode);
+
+  emit_insn (gen_vec_extract_hi_v16si (tmp[3], operands[1]));
+  emit_insn (gen_floatv8siv8df2 (tmp[2], tmp[3]));
+  emit_insn (gen_rtx_SET (VOIDmode, k,
+			  gen_rtx_LT (QImode, tmp[2], tmp[0])));
+  emit_insn (gen_addv8df3_mask (tmp[2], tmp[2], tmp[1], tmp[2], k));
+  emit_move_insn (operands[0], tmp[2]);
+  DONE;
+})
+
 (define_expand "vec_unpacku_float_lo_v8si"
   [(match_operand:V4DF 0 "register_operand")
    (match_operand:V8SI 1 "nonimmediate_operand")]
@@ -4936,31 +4969,46 @@
 
 (define_expand "vec_pack_ufix_trunc_<mode>"
   [(match_operand:<ssepackfltmode> 0 "register_operand")
-   (match_operand:VF2_128_256 1 "register_operand")
-   (match_operand:VF2_128_256 2 "register_operand")]
+   (match_operand:VF2 1 "register_operand")
+   (match_operand:VF2 2 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[7];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
-  tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
-  emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
-  if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+  if (GET_MODE (operands[1]) == V8DFmode)
     {
-      tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
-      ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+      rtx r1, r2;
+
+      r1 = gen_reg_rtx (V8SImode);
+      r2 = gen_reg_rtx (V8SImode);
+
+      emit_insn (gen_ufix_truncv8dfv8si2 (r1, operands[1]));
+      emit_insn (gen_ufix_truncv8dfv8si2 (r2, operands[2]));
+      emit_insn (gen_avx_vec_concatv16si (operands[0], r1, r2));
     }
   else
     {
-      tmp[5] = gen_reg_rtx (V8SFmode);
-      ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
-					gen_lowpart (V8SFmode, tmp[3]), 0);
-      tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+      rtx tmp[7];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
+      tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
+      emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
+      if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+	{
+	  tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+	}
+      else
+	{
+	  tmp[5] = gen_reg_rtx (V8SFmode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
+					    gen_lowpart (V8SFmode, tmp[3]), 0);
+	  tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+	}
+      tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
+				    operands[0], 0, OPTAB_DIRECT);
+      if (tmp[6] != operands[0])
+	emit_move_insn (operands[0], tmp[6]);
     }
-  tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
-				operands[0], 0, OPTAB_DIRECT);
-  if (tmp[6] != operands[0])
-    emit_move_insn (operands[0], tmp[6]);
+
   DONE;
 })
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index c1ba3c7..4022eb1 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -4610,7 +4610,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       tree vec_oprnd0 = NULL_TREE, op;
       tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
       tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
-      tree ptr, mask, var, scale, perm_mask = NULL_TREE, prev_res = NULL_TREE;
+      tree ptr, mask, var, scale, merge, perm_mask = NULL_TREE, prev_res = NULL_TREE;
       edge pe = loop_preheader_edge (loop);
       gimple_seq seq;
       basic_block new_bb;
@@ -4652,8 +4652,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       scaletype = TREE_VALUE (arglist);
-      gcc_checking_assert (types_compatible_p (srctype, rettype)
-			   && types_compatible_p (srctype, masktype));
+      gcc_checking_assert (types_compatible_p (srctype, rettype));
 
       vec_dest = vect_create_destination_var (scalar_dest, vectype);
 
@@ -4667,8 +4666,13 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
       /* Currently we support only unconditional gather loads,
 	 so mask should be all ones.  */
-      if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
-	mask = build_int_cst (TREE_TYPE (masktype), -1);
+      if (TREE_CODE (masktype) == INTEGER_TYPE)
+	mask = build_int_cst (masktype, -1);
+      else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+	{
+	  mask = build_int_cst (TREE_TYPE (masktype), -1);
+	  mask = build_vector_from_val (masktype, mask);
+	}
       else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
 	{
 	  REAL_VALUE_TYPE r;
@@ -4677,14 +4681,30 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    tmp[j] = -1;
 	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
 	  mask = build_real (TREE_TYPE (masktype), r);
+	  mask = build_vector_from_val (masktype, mask);
 	}
       else
 	gcc_unreachable ();
-      mask = build_vector_from_val (masktype, mask);
       mask = vect_init_vector (stmt, mask, masktype, NULL);
 
       scale = build_int_cst (scaletype, gather_scale);
 
+      if (TREE_CODE (TREE_TYPE (rettype)) == INTEGER_TYPE)
+	merge = build_int_cst (TREE_TYPE (rettype), 0);
+      else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (rettype)))
+	{
+	  REAL_VALUE_TYPE r;
+	  long tmp[6];
+	  for (j = 0; j < 6; ++j)
+	    tmp[j] = 0;
+	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (rettype)));
+	  merge = build_real (TREE_TYPE (rettype), r);
+	}
+      else
+	gcc_unreachable ();
+      merge = build_vector_from_val (rettype, merge);
+      merge = vect_init_vector (stmt, merge, rettype, NULL);
+
       prev_stmt_info = NULL;
       for (j = 0; j < ncopies; ++j)
 	{
@@ -4713,7 +4733,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    }
 
 	  new_stmt
-	    = gimple_build_call (gather_decl, 5, mask, ptr, op, mask, scale);
+	    = gimple_build_call (gather_decl, 5, merge, ptr, op, mask, scale);
 
 	  if (!useless_type_conversion_p (vectype, rettype))
 	    {
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a2f482d..b384c0d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -641,8 +641,8 @@ struct dataref_aux {
    conversion.  */
 #define MAX_INTERM_CVT_STEPS         3
 
-/* The maximum vectorization factor supported by any target (V32QI).  */
-#define MAX_VECTORIZATION_FACTOR 32
+/* The maximum vectorization factor supported by any target (V64QI).  */
+#define MAX_VECTORIZATION_FACTOR 64
 
 /* Avoid GTY(()) on stmt_vec_info.  */
 typedef void *vec_void_p;
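
For reference, the expression emitted by ix86_emit_swsqrtsf above is a
single Newton-Raphson step on the reciprocal square root estimate
(UNSPEC_RSQRT, or UNSPEC_RSQRT14 for the 512-bit modes): with
x0 ~= 1/sqrt(a),

  x1      =  x0 * (3 - a*x0*x0) / 2
  sqrt(a) ~= a * x1 = -0.5 * a * x0 * (a*x0*x0 - 3.0)

which is exactly the formula in the comment in that function.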

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
From: Kirill Yukhin @ 2013-11-15 18:21 UTC
  To: Richard Henderson; +Cc: Uros Bizjak, Jakub Jelinek, GCC Patches

Hello,
On 12 Nov 15:36, Kirill Yukhin wrote:
> Hello,
> The patch at the bottom extends several vectorizer hooks for AVX-512 support.
> It decreases the instruction count (icount) for the SPEC2006 FP suite (ref set):
> 
> Optset was: -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays
> -Ofast -funroll-loops -flto -march=core-avx2 -mtune=core-avx2
> 
> Lower is better.
> 
> Test\Arch        Icount	   Icount w/      Icount w/
>                  incr, %   -mavx2         -mavx512f
> SPECfp 	  	    
> 410.bwaves	-1.91	   1052892413074  1032764159879
> 416.gamess	-0.35	   4981923623398  4964404349547
> 433.milc	-2.21	   613881448195	  600337040966
> 434.zeusmp	-14.05	   839652873778	  721646740255
> 435.gromacs	-2.63	   1584456238056  1542724794028
> 436.cactusADM	-33.35	   621112775417	  413998009835
> 437.leslie3d	-24.68	   685183512084	  516052497339
> 444.namd	+0.41	   1639494201826  1646198964416
> 447.dealII	-0.17	   1090815823369  1088911757625
> 450.soplex	-8.29	   667084302238	  611769330624
> 453.povray	+0.36	   854174676846	  857232459761
> 454.calculix	-7.23	   1716433835093  1592358631108
> 459.GemsFDTD	-22.82	   682427745937	  526665452119
> 465.tonto	+2.09	   1527568790770  1559478471810
> 470.lbm		-4.01	   885166688641	  849702055497
> 481.wrf		-6.12	   1315087631021  1234615894173
> 482.sphinx3	+16.33	   2115592654950  2461052825829
> Geomean		-7.14			

> Testing:
>   1. Bootstrap passes.
>   2. make check shows no regressions.
>   3. SPEC 2000 & 2006 build shows no regressions, both with and without the -mavx512f option.
>   4. SPEC 2000 & 2006 run shows no stability regressions without the -mavx512f option.
> 
> Is it ok to commit to main trunk?
Ping.

--
Thanks, K

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
From: Kirill Yukhin @ 2013-11-19 10:28 UTC
  To: Richard Henderson; +Cc: Uros Bizjak, Jakub Jelinek, GCC Patches

Hello,
On 15 Nov 20:10, Kirill Yukhin wrote:
> > Is it ok to commit to main trunk?
> Ping.
Ping.

--
Thanks, K

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
From: Kirill Yukhin @ 2013-12-02 13:16 UTC
  To: Richard Henderson; +Cc: Uros Bizjak, Jakub Jelinek, GCC Patches

Hello,
On 19 Nov 12:14, Kirill Yukhin wrote:
> Hello,
> On 15 Nov 20:10, Kirill Yukhin wrote:
> > > Is it ok to commit to main trunk?
> > Ping.
> Ping.
Ping.

--
Thanks, K

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
From: Kirill Yukhin @ 2013-12-18 13:08 UTC
  To: Richard Henderson; +Cc: Uros Bizjak, Jakub Jelinek, GCC Patches

Hello,

On 02 Dec 16:13, Kirill Yukhin wrote:
> Hello,
> On 19 Nov 12:14, Kirill Yukhin wrote:
> > Hello,
> > On 15 Nov 20:10, Kirill Yukhin wrote:
> > > > Is it ok to commit to main trunk?
> > > Ping.
> > Ping.
> Ping.
Ping.

The updated patch is at the bottom.

--
Thanks, K

---
 gcc/config/i386/i386.c | 607 +++++++++++++++++++++++++++++++++++++++++++------
 gcc/config/i386/sse.md |  94 ++++++--
 gcc/tree-vect-stmts.c  |  34 ++-
 gcc/tree-vectorizer.h  |   4 +-
 4 files changed, 636 insertions(+), 103 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a3dd307..c86aa0a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2307,7 +2307,7 @@ enum x86_64_reg_class
     X86_64_MEMORY_CLASS
   };
 
-#define MAX_CLASSES 4
+#define MAX_CLASSES 8
 
 /* Table of constants used by fldpi, fldln2, etc....  */
 static REAL_VALUE_TYPE ext_80387_constants_table [5];
@@ -6293,7 +6293,7 @@ merge_classes (enum x86_64_reg_class class1, enum x86_64_reg_class class2)
    sized containers, classes[0] will be NO_CLASS and 1 is returned.
 
    BIT_OFFSET is used internally for handling records and specifies offset
-   of the offset in bits modulo 256 to avoid overflow cases.
+   of the offset in bits modulo 512 to avoid overflow cases.
 
    See the x86-64 PS ABI for details.
 */
@@ -6393,7 +6393,7 @@ classify_argument (enum machine_mode mode, const_tree type,
 		      num = classify_argument (TYPE_MODE (type), type,
 					       subclasses,
 					       (int_bit_position (field)
-						+ bit_offset) % 256);
+						+ bit_offset) % 512);
 		      if (!num)
 			return 0;
 		      pos = (int_bit_position (field)
@@ -6643,6 +6643,21 @@ classify_argument (enum machine_mode mode, const_tree type,
       classes[2] = X86_64_SSEUP_CLASS;
       classes[3] = X86_64_SSEUP_CLASS;
       return 4;
+    case V8DFmode:
+    case V16SFmode:
+    case V8DImode:
+    case V16SImode:
+    case V32HImode:
+    case V64QImode:
+      classes[0] = X86_64_SSE_CLASS;
+      classes[1] = X86_64_SSEUP_CLASS;
+      classes[2] = X86_64_SSEUP_CLASS;
+      classes[3] = X86_64_SSEUP_CLASS;
+      classes[4] = X86_64_SSEUP_CLASS;
+      classes[5] = X86_64_SSEUP_CLASS;
+      classes[6] = X86_64_SSEUP_CLASS;
+      classes[7] = X86_64_SSEUP_CLASS;
+      return 8;
     case V4SFmode:
     case V4SImode:
     case V16QImode:
@@ -6828,6 +6843,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
       && mode != BLKmode)
     return gen_reg_or_parallel (mode, orig_mode,
 				SSE_REGNO (sse_regno));
+  if (n == 8
+      && regclass[0] == X86_64_SSE_CLASS
+      && regclass[1] == X86_64_SSEUP_CLASS
+      && regclass[2] == X86_64_SSEUP_CLASS
+      && regclass[3] == X86_64_SSEUP_CLASS
+      && regclass[4] == X86_64_SSEUP_CLASS
+      && regclass[5] == X86_64_SSEUP_CLASS
+      && regclass[6] == X86_64_SSEUP_CLASS
+      && regclass[7] == X86_64_SSEUP_CLASS
+      && mode != BLKmode)
+    return gen_reg_or_parallel (mode, orig_mode,
+				SSE_REGNO (sse_regno));
   if (n == 2
       && regclass[0] == X86_64_X87_CLASS
       && regclass[1] == X86_64_X87UP_CLASS)
@@ -6909,6 +6936,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
 		tmpmode = OImode;
 		i += 3;
 		break;
+	      case 8:
+		gcc_assert (i == 0
+			    && regclass[1] == X86_64_SSEUP_CLASS
+			    && regclass[2] == X86_64_SSEUP_CLASS
+			    && regclass[3] == X86_64_SSEUP_CLASS
+			    && regclass[4] == X86_64_SSEUP_CLASS
+			    && regclass[5] == X86_64_SSEUP_CLASS
+			    && regclass[6] == X86_64_SSEUP_CLASS
+			    && regclass[7] == X86_64_SSEUP_CLASS);
+		tmpmode = XImode;
+		i += 7;
+		break;
 	      default:
 		gcc_unreachable ();
 	      }
@@ -6982,6 +7021,12 @@ function_arg_advance_32 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 
     case V8SFmode:
     case V8SImode:
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V32QImode:
     case V16HImode:
     case V4DFmode:
@@ -7033,8 +7078,9 @@ function_arg_advance_64 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 {
   int int_nregs, sse_nregs;
 
-  /* Unnamed 256bit vector mode parameters are passed on stack.  */
-  if (!named && VALID_AVX256_REG_MODE (mode))
+  /* Unnamed 512 and 256bit vector mode parameters are passed on stack.  */
+  if (!named && (VALID_AVX512F_REG_MODE (mode)
+		 || VALID_AVX256_REG_MODE (mode)))
     return;
 
   if (examine_argument (mode, type, 0, &int_nregs, &sse_nregs)
@@ -7185,9 +7231,16 @@ function_arg_32 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
       break;
 
     case OImode:
-      /* OImode shouldn't be used directly.  */
+    case XImode:
+      /* OImode and XImode shouldn't be used directly.  */
       gcc_unreachable ();
 
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V8SFmode:
     case V8SImode:
     case V32QImode:
@@ -7250,7 +7303,13 @@ function_arg_64 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!named)
 	return NULL;
       break;
@@ -7653,6 +7712,10 @@ function_value_32 (enum machine_mode orig_mode, enum machine_mode mode,
   else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 32)
     regno = FIRST_SSE_REG;
 
+  /* 64-byte vector modes in %zmm0.   */
+  else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+    regno = FIRST_SSE_REG;
+
   /* Floating point return values in %st(0) (unless -mno-fp-ret-in-387).  */
   else if (X87_FLOAT_MODE_P (mode) && TARGET_FLOAT_RETURNS_IN_80387)
     regno = FIRST_FLOAT_REG;
@@ -7860,6 +7923,10 @@ return_in_memory_32 (const_tree type, enum machine_mode mode)
       /* AVX values are returned in YMM0, except when it doesn't exist.  */
       if (size == 32)
 	return !TARGET_AVX;
+
+      /* AVX512F values are returned in ZMM0, except when it doesn't exist.  */
+      if (size == 64)
+	return !TARGET_AVX512F;
     }
 
   if (mode == XFmode)
@@ -8396,7 +8463,13 @@ ix86_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!TARGET_64BIT_MS_ABI)
 	{
 	  container = NULL;
@@ -8811,6 +8884,12 @@ standard_sse_constant_p (rtx x)
       case V4DImode:
 	if (TARGET_AVX2)
 	  return 2;
+      case V64QImode:
+      case V32HImode:
+      case V16SImode:
+      case V8DImode:
+	if (TARGET_AVX512F)
+	  return 2;
       default:
 	break;
       }
@@ -8829,6 +8908,11 @@ standard_sse_constant_opcode (rtx insn, rtx x)
     case 1:
       switch (get_attr_mode (insn))
 	{
+	case MODE_XI:
+	case MODE_V16SF:
+	  return "vpxord\t%g0, %g0, %g0";
+	case MODE_V8DF:
+	  return "vpxorq\t%g0, %g0, %g0";
 	case MODE_TI:
 	  return "%vpxor\t%0, %d0";
 	case MODE_V2DF:
@@ -18629,6 +18713,11 @@ ix86_expand_vector_convert_uns_vsivsf (rtx target, rtx val)
   enum machine_mode fltmode = GET_MODE (target);
   rtx (*cvt) (rtx, rtx);
 
+  if (intmode == V16SImode)
+    {
+      emit_insn (gen_ufloatv16siv16sf2 (target, val));
+      return;
+    }
   if (intmode == V4SImode)
     cvt = gen_floatv4siv4sf2;
   else
@@ -18719,17 +18808,23 @@ ix86_build_const_vector (enum machine_mode mode, bool vect, rtx value)
 
   switch (mode)
     {
+    case V64QImode:
     case V32QImode:
     case V16QImode:
+    case V32HImode:
     case V16HImode:
     case V8HImode:
+    case V16SImode:
     case V8SImode:
     case V4SImode:
+    case V8DImode:
     case V4DImode:
     case V2DImode:
       gcc_assert (vect);
+    case V16SFmode:
     case V8SFmode:
     case V4SFmode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       n_elt = GET_MODE_NUNITS (mode);
@@ -18766,6 +18861,8 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
   /* Find the sign bit, sign extended to 2*HWI.  */
   switch (mode)
     {
+    case V16SImode:
+    case V16SFmode:
     case V8SImode:
     case V4SImode:
     case V8SFmode:
@@ -18776,8 +18873,10 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
       lo = 0x80000000, hi = lo < 0;
       break;
 
+    case V8DImode:
     case V4DImode:
     case V2DImode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       vec_mode = mode;
@@ -20634,22 +20733,63 @@ ix86_expand_sse_cmp (rtx dest, enum rtx_code code, rtx cmp_op0, rtx cmp_op1,
 		     rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
-  enum machine_mode cmp_mode = GET_MODE (cmp_op0);
+  enum machine_mode cmp_ops_mode = GET_MODE (cmp_op0);
+
+  /* In the general case, the result of a comparison can differ from the
+     operands' type.  */
+  enum machine_mode cmp_mode;
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = false;
   rtx x;
 
-  cmp_op0 = force_reg (cmp_mode, cmp_op0);
-  if (!nonimmediate_operand (cmp_op1, cmp_mode))
-    cmp_op1 = force_reg (cmp_mode, cmp_op1);
+  if (GET_MODE_SIZE (cmp_ops_mode) == 64)
+    {
+      cmp_mode = mode_for_size (GET_MODE_NUNITS (cmp_ops_mode), MODE_INT, 0);
+      gcc_assert (cmp_mode != BLKmode);
+
+      maskcmp = true;
+    }
+  else
+    cmp_mode = cmp_ops_mode;
+
+  cmp_op0 = force_reg (cmp_ops_mode, cmp_op0);
+  if (!nonimmediate_operand (cmp_op1, cmp_ops_mode))
+    cmp_op1 = force_reg (cmp_ops_mode, cmp_op1);
 
   if (optimize
       || reg_overlap_mentioned_p (dest, op_true)
       || reg_overlap_mentioned_p (dest, op_false))
-    dest = gen_reg_rtx (mode);
+    dest = gen_reg_rtx (maskcmp ? cmp_mode : mode);
+
+  /* Compare patterns for int modes are unspec in AVX512F only.  */
+  if (maskcmp && (code == GT || code == EQ))
+    {
+      rtx (*gen)(rtx, rtx, rtx);
+
+      switch (cmp_ops_mode)
+	{
+	case V16SImode:
+	  gen = code == GT ? gen_avx512f_gtv16si3 : gen_avx512f_eqv16si3_1;
+	  break;
+	case V8DImode:
+	  gen = code == GT ? gen_avx512f_gtv8di3 : gen_avx512f_eqv8di3_1;
+	  break;
+	default:
+	  gen = NULL;
+	}
 
+      if (gen)
+	{
+	  emit_insn (gen (dest, cmp_op0, cmp_op1));
+	  return dest;
+	}
+    }
   x = gen_rtx_fmt_ee (code, cmp_mode, cmp_op0, cmp_op1);
-  if (cmp_mode != mode)
+
+  if (cmp_mode != mode && !maskcmp)
     {
-      x = force_reg (cmp_mode, x);
+      x = force_reg (cmp_ops_mode, x);
       convert_move (dest, x, false);
     }
   else
@@ -20665,33 +20805,43 @@ static void
 ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
+  enum machine_mode cmpmode = GET_MODE (cmp);
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = (mode != cmpmode && TARGET_AVX512F);
+
   rtx t2, t3, x;
 
   if (vector_all_ones_operand (op_true, mode)
-      && rtx_equal_p (op_false, CONST0_RTX (mode)))
+      && rtx_equal_p (op_false, CONST0_RTX (mode))
+      && !maskcmp)
     {
       emit_insn (gen_rtx_SET (VOIDmode, dest, cmp));
     }
-  else if (op_false == CONST0_RTX (mode))
+  else if (op_false == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
       x = gen_rtx_AND (mode, cmp, op_true);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (op_true == CONST0_RTX (mode))
+  else if (op_true == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_NOT (mode, cmp);
       x = gen_rtx_AND (mode, x, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode))
+  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_IOR (mode, cmp, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (TARGET_XOP)
+  else if (TARGET_XOP
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
 
@@ -20759,6 +20909,20 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 	      cmp = gen_lowpart (V32QImode, cmp);
 	    }
 	  break;
+
+	case V16SImode:
+	  gen = gen_avx512f_blendmv16si;
+	  break;
+	case V8DImode:
+	  gen = gen_avx512f_blendmv8di;
+	  break;
+	case V8DFmode:
+	  gen = gen_avx512f_blendmv8df;
+	  break;
+	case V16SFmode:
+	  gen = gen_avx512f_blendmv16sf;
+	  break;
+
 	default:
 	  break;
 	}
@@ -21026,6 +21190,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 	  switch (mode)
 	    {
+	    case V16SImode:
+	    case V8DImode:
 	    case V8SImode:
 	    case V4DImode:
 	    case V4SImode:
@@ -21036,6 +21202,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 		  switch (mode)
 		    {
+		    case V16SImode: gen_sub3 = gen_subv16si3; break;
+		    case V8DImode: gen_sub3 = gen_subv8di3; break;
 		    case V8SImode: gen_sub3 = gen_subv8si3; break;
 		    case V4DImode: gen_sub3 = gen_subv4di3; break;
 		    case V4SImode: gen_sub3 = gen_subv4si3; break;
@@ -21091,7 +21259,8 @@ ix86_expand_int_vcond (rtx operands[])
       gcc_assert (GET_MODE_SIZE (data_mode) == GET_MODE_SIZE (mode));
       x = ix86_expand_sse_cmp (gen_reg_rtx (mode), code, cop0, cop1,
 			       operands[1+negate], operands[2-negate]);
-      x = gen_lowpart (data_mode, x);
+      if (GET_MODE (x) == mode)
+	x = gen_lowpart (data_mode, x);
     }
 
   ix86_expand_sse_movcc (operands[0], x, operands[1+negate],
@@ -21099,6 +21268,35 @@ ix86_expand_int_vcond (rtx operands[])
   return true;
 }
 
+static bool
+ix86_expand_vec_perm_vpermi2 (rtx target, rtx op0, rtx mask, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  switch (mode)
+    {
+    case V16SImode:
+      emit_insn (gen_avx512f_vpermi2varv16si3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V16SFmode:
+      emit_insn (gen_avx512f_vpermi2varv16sf3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V8DImode:
+      emit_insn (gen_avx512f_vpermi2varv8di3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    case V8DFmode:
+      emit_insn (gen_avx512f_vpermi2varv8df3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    default:
+      return false;
+    }
+}
+
 /* Expand a variable vector permutation.  */
 
 void
@@ -21117,7 +21315,10 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 32);
+  gcc_assert (w <= 64);
+
+  if (ix86_expand_vec_perm_vpermi2 (target, op0, mask, op1))
+    return;
 
   if (TARGET_AVX2)
     {
@@ -21497,6 +21698,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v32qi : gen_vec_extract_lo_v32qi;
 	  break;
+	case V32HImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv16hiv16si2;
+	  else
+	    unpack = gen_avx512f_sign_extendv16hiv16si2;
+	  halfmode = V16HImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v32hi : gen_vec_extract_lo_v32hi;
+	  break;
 	case V16HImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv8hiv8si2;
@@ -21506,6 +21716,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v16hi : gen_vec_extract_lo_v16hi;
 	  break;
+	case V16SImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv8siv8di2;
+	  else
+	    unpack = gen_avx512f_sign_extendv8siv8di2;
+	  halfmode = V8SImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v16si : gen_vec_extract_lo_v16si;
+	  break;
 	case V8SImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv4siv4di2;
@@ -21537,7 +21756,7 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  gcc_unreachable ();
 	}
 
-      if (GET_MODE_SIZE (imode) == 32)
+      if (GET_MODE_SIZE (imode) >= 32)
 	{
 	  tmp = gen_reg_rtx (halfmode);
 	  emit_insn (extract (tmp, src));
@@ -26269,7 +26488,8 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+  int max_align = optimize_size ? BITS_PER_WORD
+				: MIN (512, MAX_OFILE_ALIGNMENT);
 
   if (opt
       && AGGREGATE_TYPE_P (type)
@@ -34512,7 +34732,7 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 {
   unsigned i, nelt = GET_MODE_NUNITS (mode);
   unsigned mask = 0;
-  unsigned char ipar[8] = {};  /* Silence -Wuninitialized warning.  */
+  unsigned char ipar[16] = {};  /* Silence -Wuninitialized warning.  */
 
   if (XVECLEN (par, 0) != (int) nelt)
     return 0;
@@ -34535,6 +34755,24 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 
   switch (mode)
     {
+    case V8DFmode:
+      /* In the 512-bit DFmode case, we can only move elements within
+         a 128-bit lane.  First fill the second part of the mask,
+	 then fallthru.  */
+      for (i = 4; i < 6; ++i)
+	{
+	  if (ipar[i] < 4 || ipar[i] >= 6)
+	    return 0;
+	  mask |= (ipar[i] - 4) << i;
+	}
+      for (i = 6; i < 8; ++i)
+	{
+	  if (ipar[i] < 6)
+	    return 0;
+	  mask |= (ipar[i] - 6) << i;
+	}
+      /* FALLTHRU */
+
     case V4DFmode:
       /* In the 256-bit DFmode case, we can only move elements within
          a 128-bit lane.  */
@@ -34552,10 +34790,18 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 	}
       break;
 
+    case V16SFmode:
+      /* In the 512-bit SFmode case, the permutation in the upper 256 bits
+	 must mirror the permutation in the lower 256 bits.  */
+      for (i = 0; i < 8; ++i)
+	if (ipar[i] + 8 != ipar[i + 8])
+	  return 0;
+      /* FALLTHRU */
+
     case V8SFmode:
-      /* In the 256-bit SFmode case, we have full freedom of movement
-	 within the low 128-bit lane, but the high 128-bit lane must
-	 mirror the exact same pattern.  */
+      /* In the 256-bit SFmode case, we have full freedom of
+         movement within the low 128-bit lane, but the high 128-bit
+         lane must mirror the exact same pattern.  */
       for (i = 0; i < 4; ++i)
 	if (ipar[i] + 4 != ipar[i + 4])
 	  return 0;
@@ -35506,6 +35752,7 @@ static bool
 ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 		bool speed)
 {
+  rtx mask;
   enum rtx_code code = (enum rtx_code) code_i;
   enum rtx_code outer_code = (enum rtx_code) outer_code_i;
   enum machine_mode mode = GET_MODE (x);
@@ -35982,13 +36229,21 @@ ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 
     case VEC_SELECT:
     case VEC_CONCAT:
-    case VEC_MERGE:
     case VEC_DUPLICATE:
       /* ??? Assume all of these vector manipulation patterns are
 	 recognizable.  In which case they all pretty much have the
 	 same cost.  */
      *total = cost->fabs;
      return true;
+    case VEC_MERGE:
+      mask = XEXP (x, 2);
+      /* This is a masked instruction; assume the same cost as
+	 the non-masked variant.  */
+      if (TARGET_AVX512F && register_operand (mask, GET_MODE (mask)))
+	*total = rtx_cost (XEXP (x, 0), outer_code, opno, speed);
+      else
+	*total = cost->fabs;
+      return true;
 
     default:
       return false;
@@ -37154,6 +37409,36 @@ get_mode_wider_vector (enum machine_mode o)
   return n;
 }
 
+/* A subroutine of ix86_expand_vector_init_duplicate.  Tries to
+   fill target with val via vec_duplicate.  */
+
+static bool
+ix86_vector_duplicate_value (enum machine_mode mode, rtx target, rtx val)
+{
+  bool ok;
+  rtx insn, dup;
+
+  /* First attempt to recognize VAL as-is.  */
+  dup = gen_rtx_VEC_DUPLICATE (mode, val);
+  insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
+  if (recog_memoized (insn) < 0)
+    {
+      rtx seq;
+      /* If that fails, force VAL into a register.  */
+
+      start_sequence ();
+      XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
+      seq = get_insns ();
+      end_sequence ();
+      if (seq)
+	emit_insn_before (seq, insn);
+
+      ok = recog_memoized (insn) >= 0;
+      gcc_assert (ok);
+    }
+  return true;
+}
+
 /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
    with all elements equal to VAR.  Return true if successful.  */
 
@@ -37179,29 +37464,11 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, enum machine_mode mode,
     case V2DImode:
     case V4SFmode:
     case V4SImode:
-      {
-	rtx insn, dup;
-
-	/* First attempt to recognize VAL as-is.  */
-	dup = gen_rtx_VEC_DUPLICATE (mode, val);
-	insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
-	if (recog_memoized (insn) < 0)
-	  {
-	    rtx seq;
-	    /* If that fails, force VAL into a register.  */
-
-	    start_sequence ();
-	    XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
-	    seq = get_insns ();
-	    end_sequence ();
-	    if (seq)
-	      emit_insn_before (seq, insn);
-
-	    ok = recog_memoized (insn) >= 0;
-	    gcc_assert (ok);
-	  }
-      }
-      return true;
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
+      return ix86_vector_duplicate_value (mode, target, val);
 
     case V4HImode:
       if (!mmx_ok)
@@ -37551,8 +37818,8 @@ static void
 ix86_expand_vector_init_concat (enum machine_mode mode,
 				rtx target, rtx *ops, int n)
 {
-  enum machine_mode cmode, hmode = VOIDmode;
-  rtx first[8], second[4];
+  enum machine_mode cmode, hmode = VOIDmode, gmode = VOIDmode;
+  rtx first[16], second[8], third[4];
   rtvec v;
   int i, j;
 
@@ -37561,6 +37828,18 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 2:
       switch (mode)
 	{
+	case V16SImode:
+	  cmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V8SFmode;
+	  break;
+	case V8DImode:
+	  cmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V4SImode;
 	  break;
@@ -37627,6 +37906,14 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 8:
       switch (mode)
 	{
+	case V8DImode:
+	  cmode = V2DImode;
+	  hmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V2DFmode;
+	  hmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V2SImode;
 	  hmode = V4SImode;
@@ -37640,6 +37927,24 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
 	}
       goto half;
 
+    case 16:
+      switch (mode)
+	{
+	case V16SImode:
+	  cmode = V2SImode;
+	  hmode = V4SImode;
+	  gmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V2SFmode;
+	  hmode = V4SFmode;
+	  gmode = V8SFmode;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      goto half;
+
 half:
       /* FIXME: We process inputs backward to help RA.  PR 36222.  */
       i = n - 1;
@@ -37653,7 +37958,27 @@ half:
 	}
 
       n >>= 1;
-      if (n > 2)
+      if (n > 4)
+	{
+	  gcc_assert (hmode != VOIDmode);
+	  gcc_assert (gmode != VOIDmode);
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      second[j] = gen_reg_rtx (hmode);
+	      ix86_expand_vector_init_concat (hmode, second [j],
+					      &first [i], 2);
+	    }
+	  n >>= 1;
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      third[j] = gen_reg_rtx (gmode);
+	      ix86_expand_vector_init_concat (gmode, third[j],
+					      &second[i], 2);
+	    }
+	  n >>= 1;
+	  ix86_expand_vector_init_concat (mode, target, third, n);
+	}
+      else if (n > 2)
 	{
 	  gcc_assert (hmode != VOIDmode);
 	  for (i = j = 0; i < n; i += 2, j++)
@@ -37796,7 +38121,7 @@ static void
 ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 				 rtx target, rtx vals)
 {
-  rtx ops[32], op0, op1;
+  rtx ops[64], op0, op1;
   enum machine_mode half_mode = VOIDmode;
   int n, i;
 
@@ -37808,6 +38133,10 @@ ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 	break;
       /* FALLTHRU */
 
+    case V16SImode:
+    case V16SFmode:
+    case V8DFmode:
+    case V8DImode:
     case V8SFmode:
     case V8SImode:
     case V4DFmode:
@@ -38433,6 +38762,42 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
 	}
       break;
 
+    case V16SFmode:
+      tmp = gen_reg_rtx (V8SFmode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16sf (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16sf (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DFmode:
+      tmp = gen_reg_rtx (V4DFmode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8df (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8df (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
+    case V16SImode:
+      tmp = gen_reg_rtx (V8SImode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16si (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16si (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DImode:
+      tmp = gen_reg_rtx (V4DImode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8di (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8di (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
     case V8QImode:
       /* ??? Could extract the appropriate HImode element and shift.  */
     default:
@@ -38525,6 +38890,44 @@ emit_reduc_half (rtx dest, rtx src, int i)
 				    GEN_INT (i / 2));
 	}
       break;
+    case V16SImode:
+    case V16SFmode:
+    case V8DImode:
+    case V8DFmode:
+      if (i > 128)
+	tem = gen_avx512f_shuf_i32x4_1 (gen_lowpart (V16SImode, dest),
+				      gen_lowpart (V16SImode, src),
+				      gen_lowpart (V16SImode, src),
+				      GEN_INT (0x4 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x5 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x6 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x7 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0xC), GEN_INT (0xD),
+				      GEN_INT (0xE), GEN_INT (0xF),
+				      GEN_INT (0x10), GEN_INT (0x11),
+				      GEN_INT (0x12), GEN_INT (0x13),
+				      GEN_INT (0x14), GEN_INT (0x15),
+				      GEN_INT (0x16), GEN_INT (0x17));
+      else
+	tem = gen_avx512f_pshufd_1 (gen_lowpart (V16SImode, dest),
+				   gen_lowpart (V16SImode, src),
+				   GEN_INT (i == 128 ? 0x2 : 0x1),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (i == 128 ? 0x6 : 0x5),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (i == 128 ? 0xA : 0x9),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (i == 128 ? 0xE : 0xD),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF));
+      break;
     default:
       gcc_unreachable ();
     }
@@ -38589,6 +38992,8 @@ ix86_vector_mode_supported_p (enum machine_mode mode)
     return true;
   if (TARGET_AVX && VALID_AVX256_REG_MODE (mode))
     return true;
+  if (TARGET_AVX512F && VALID_AVX512F_REG_MODE (mode))
+    return true;
   if (TARGET_MMX && VALID_MMX_REG_MODE (mode))
     return true;
   if (TARGET_3DNOW && VALID_MMX_REG_MODE_3DNOW (mode))
@@ -38902,9 +39307,15 @@ void ix86_emit_swdivsf (rtx res, rtx a, rtx b, enum machine_mode mode)
   b = force_reg (mode, b);
 
   /* x0 = rcp(b) estimate */
-  emit_insn (gen_rtx_SET (VOIDmode, x0,
-			  gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
-					  UNSPEC_RCP)));
+  if (mode == V16SFmode || mode == V8DFmode)
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP14)));
+  else
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP)));
+
   /* e0 = x0 * b */
   emit_insn (gen_rtx_SET (VOIDmode, e0,
 			  gen_rtx_MULT (mode, x0, b)));
@@ -38934,6 +39345,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 {
   rtx x0, e0, e1, e2, e3, mthree, mhalf;
   REAL_VALUE_TYPE r;
+  int unspec;
 
   x0 = gen_reg_rtx (mode);
   e0 = gen_reg_rtx (mode);
@@ -38946,11 +39358,15 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 
   real_arithmetic (&r, NEGATE_EXPR, &dconsthalf, NULL);
   mhalf = CONST_DOUBLE_FROM_REAL_VALUE (r, SFmode);
+  unspec = UNSPEC_RSQRT;
 
   if (VECTOR_MODE_P (mode))
     {
       mthree = ix86_build_const_vector (mode, true, mthree);
       mhalf = ix86_build_const_vector (mode, true, mhalf);
+      /* There is no 512-bit rsqrt; there is, however, rsqrt14.  */
+      if (GET_MODE_SIZE (mode) == 64)
+	unspec = UNSPEC_RSQRT14;
     }
 
   /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
@@ -38961,7 +39377,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
   /* x0 = rsqrt(a) estimate */
   emit_insn (gen_rtx_SET (VOIDmode, x0,
 			  gen_rtx_UNSPEC (mode, gen_rtvec (1, a),
-					  UNSPEC_RSQRT)));
+					  unspec)));
 
   /* If (a == 0.0) Filter out infinity to prevent NaN for sqrt(0.0).  */
   if (!recip)
@@ -38972,11 +39388,23 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
       mask = gen_reg_rtx (mode);
 
       zero = force_reg (mode, CONST0_RTX(mode));
-      emit_insn (gen_rtx_SET (VOIDmode, mask,
-			      gen_rtx_NE (mode, zero, a)));
 
-      emit_insn (gen_rtx_SET (VOIDmode, x0,
-			      gen_rtx_AND (mode, x0, mask)));
+      /* Handle masked compare.  */
+      if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+	{
+	  mask = gen_reg_rtx (HImode);
+	  /* Imm value 0x4 corresponds to not-equal comparison.  */
+	  emit_insn (gen_avx512f_cmpv16sf3 (mask, zero, a, GEN_INT (0x4)));
+	  emit_insn (gen_avx512f_blendmv16sf (x0, zero, x0, mask));
+	}
+      else
+	{
+	  emit_insn (gen_rtx_SET (VOIDmode, mask,
+				  gen_rtx_NE (mode, zero, a)));
+
+	  emit_insn (gen_rtx_SET (VOIDmode, x0,
+				  gen_rtx_AND (mode, x0, mask)));
+	}
     }
 
   /* e0 = x0 * a */
@@ -40498,6 +40926,19 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_pshufb (d))
     return true;
 
+  /* Try the AVX512F vpermi2 instructions.  */
+  rtx vec[64];
+  enum machine_mode mode = d->vmode;
+  if (mode == V8DFmode)
+    mode = V8DImode;
+  else if (mode == V16SFmode)
+    mode = V16SImode;
+  for (i = 0; i < nelt; ++i)
+    vec[i] = GEN_INT (d->perm[i]);
+  rtx mask = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt, vec));
+  if (ix86_expand_vec_perm_vpermi2 (d->target, d->op0, mask, d->op1))
+    return true;
+
   return false;
 }
 
@@ -42105,6 +42546,10 @@ ix86_vectorize_vec_perm_const_ok (enum machine_mode vmode,
 
   /* Given sufficient ISA support we can just return true here
      for selected vector modes.  */
+  if (d.vmode == V16SImode || d.vmode == V16SFmode
+      || d.vmode == V8DFmode || d.vmode == V8DImode)
+    /* All implementable with a single vpermi2 insn.  */
+    return true;
   if (GET_MODE_SIZE (d.vmode) == 16)
     {
       /* All implementable with a single vpperm insn.  */
@@ -42347,7 +42792,7 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
     op2 = force_reg (mode, op2);
 
   /* We only play even/odd games with vectors of SImode.  */
-  gcc_assert (mode == V4SImode || mode == V8SImode);
+  gcc_assert (mode == V4SImode || mode == V8SImode || mode == V16SImode);
 
   /* If we're looking for the odd results, shift those members down to
      the even slots.  For some cpus this is faster than a PSHUFD.  */
@@ -42373,7 +42818,14 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
       op2 = gen_lowpart (mode, op2);
     }
 
-  if (mode == V8SImode)
+  if (mode == V16SImode)
+    {
+      if (uns_p)
+	x = gen_vec_widen_umult_even_v16si (dest, op1, op2);
+      else
+	x = gen_vec_widen_smult_even_v16si (dest, op1, op2);
+    }
+  else if (mode == V8SImode)
     {
       if (uns_p)
 	x = gen_vec_widen_umult_even_v8si (dest, op1, op2);
@@ -42593,6 +43045,11 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
 	  umul = gen_vec_widen_umult_even_v8si;
 	  nmode = V8SImode;
 	}
+      else if (mode == V8DImode)
+	{
+	  umul = gen_vec_widen_umult_even_v16si;
+	  nmode = V16SImode;
+	}
       else
 	gcc_unreachable ();
 
@@ -43739,12 +44196,16 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case HImode:
       return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V16HImode : V8HImode;
     case SImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
+      return TARGET_AVX512F ? V16SImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
     case DImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
+      return TARGET_AVX512F ? V8DImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
 
     case SFmode:
-      if (TARGET_AVX && !TARGET_PREFER_AVX128)
+      if (TARGET_AVX512F)
+	return V16SFmode;
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
       else
 	return V4SFmode;
@@ -43752,6 +44213,8 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
+      else if (TARGET_AVX512F)
+	return V8DFmode;
       else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
       else if (TARGET_SSE2)
@@ -43764,12 +44227,14 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 }
 
 /* If AVX is enabled then try vectorizing with both 256bit and 128bit
-   vectors.  */
+   vectors.  If AVX512F is enabled then try vectorizing with 512bit,
+   256bit and 128bit vectors.  */
 
 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
+  return TARGET_AVX512F ? 64 | 32 | 16 :
+    (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
 \f
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8620541..a59b5c5 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1048,6 +1048,7 @@
 {
   switch (get_attr_mode (insn))
     {
+    case MODE_V16SF:
     case MODE_V8SF:
     case MODE_V4SF:
       return "%vmovups\t{%1, %0|%0, %1}";
@@ -3555,11 +3556,17 @@
    (match_operand:VF1 1 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[3];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
-  emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
-  emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+  if (GET_MODE (operands[1]) == V16SFmode)
+    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
+					  operands[1]));
+  else
+    {
+      rtx tmp[3];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
+      emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
+      emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+    }
   DONE;
 })
 
@@ -4486,6 +4493,32 @@
   DONE;
 })
 
+(define_expand "vec_unpacku_float_hi_v16si"
+  [(match_operand:V8DF 0 "register_operand")
+   (match_operand:V16SI 1 "register_operand")]
+  "TARGET_AVX512F"
+{
+  REAL_VALUE_TYPE TWO32r;
+  rtx k, x, tmp[4];
+
+  real_ldexp (&TWO32r, &dconst1, 32);
+  x = const_double_from_real_value (TWO32r, DFmode);
+
+  tmp[0] = force_reg (V8DFmode, CONST0_RTX (V8DFmode));
+  tmp[1] = force_reg (V8DFmode, ix86_build_const_vector (V8DFmode, 1, x));
+  tmp[2] = gen_reg_rtx (V8DFmode);
+  tmp[3] = gen_reg_rtx (V8SImode);
+  k = gen_reg_rtx (QImode);
+
+  emit_insn (gen_vec_extract_hi_v16si (tmp[3], operands[1]));
+  emit_insn (gen_floatv8siv8df2 (tmp[2], tmp[3]));
+  emit_insn (gen_rtx_SET (VOIDmode, k,
+			  gen_rtx_LT (QImode, tmp[2], tmp[0])));
+  emit_insn (gen_addv8df3_mask (tmp[2], tmp[2], tmp[1], tmp[2], k));
+  emit_move_insn (operands[0], tmp[2]);
+  DONE;
+})
+
 (define_expand "vec_unpacku_float_lo_v8si"
   [(match_operand:V4DF 0 "register_operand")
    (match_operand:V8SI 1 "nonimmediate_operand")]
@@ -4651,31 +4684,46 @@
 
 (define_expand "vec_pack_ufix_trunc_<mode>"
   [(match_operand:<ssepackfltmode> 0 "register_operand")
-   (match_operand:VF2_128_256 1 "register_operand")
-   (match_operand:VF2_128_256 2 "register_operand")]
+   (match_operand:VF2 1 "register_operand")
+   (match_operand:VF2 2 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[7];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
-  tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
-  emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
-  if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+  if (GET_MODE (operands[1]) == V8DFmode)
     {
-      tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
-      ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+      rtx r1, r2;
+
+      r1 = gen_reg_rtx (V8SImode);
+      r2 = gen_reg_rtx (V8SImode);
+
+      emit_insn (gen_ufix_truncv8dfv8si2 (r1, operands[1]));
+      emit_insn (gen_ufix_truncv8dfv8si2 (r2, operands[2]));
+      emit_insn (gen_avx_vec_concatv16si (operands[0], r1, r2));
     }
   else
     {
-      tmp[5] = gen_reg_rtx (V8SFmode);
-      ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
-					gen_lowpart (V8SFmode, tmp[3]), 0);
-      tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+      rtx tmp[7];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
+      tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
+      emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
+      if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+	{
+	  tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+	}
+      else
+	{
+	  tmp[5] = gen_reg_rtx (V8SFmode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
+					    gen_lowpart (V8SFmode, tmp[3]), 0);
+	  tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+	}
+      tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
+				    operands[0], 0, OPTAB_DIRECT);
+      if (tmp[6] != operands[0])
+	emit_move_insn (operands[0], tmp[6]);
     }
-  tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
-				operands[0], 0, OPTAB_DIRECT);
-  if (tmp[6] != operands[0])
-    emit_move_insn (operands[0], tmp[6]);
+
   DONE;
 })
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 99f6b1f..da08020 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -5636,7 +5636,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       tree vec_oprnd0 = NULL_TREE, op;
       tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
       tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
-      tree ptr, mask, var, scale, perm_mask = NULL_TREE, prev_res = NULL_TREE;
+      tree ptr, mask, var, scale, merge, perm_mask = NULL_TREE, prev_res = NULL_TREE;
       edge pe = loop_preheader_edge (loop);
       gimple_seq seq;
       basic_block new_bb;
@@ -5678,8 +5678,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       scaletype = TREE_VALUE (arglist);
-      gcc_checking_assert (types_compatible_p (srctype, rettype)
-			   && types_compatible_p (srctype, masktype));
+      gcc_checking_assert (types_compatible_p (srctype, rettype));
 
       vec_dest = vect_create_destination_var (scalar_dest, vectype);
 
@@ -5693,8 +5692,13 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
       /* Currently we support only unconditional gather loads,
 	 so mask should be all ones.  */
-      if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
-	mask = build_int_cst (TREE_TYPE (masktype), -1);
+      if (TREE_CODE (masktype) == INTEGER_TYPE)
+	mask = build_int_cst (masktype, -1);
+      else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+	{
+	  mask = build_int_cst (TREE_TYPE (masktype), -1);
+	  mask = build_vector_from_val (masktype, mask);
+	}
       else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
 	{
 	  REAL_VALUE_TYPE r;
@@ -5703,14 +5707,30 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    tmp[j] = -1;
 	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
 	  mask = build_real (TREE_TYPE (masktype), r);
+	  mask = build_vector_from_val (masktype, mask);
 	}
       else
 	gcc_unreachable ();
-      mask = build_vector_from_val (masktype, mask);
       mask = vect_init_vector (stmt, mask, masktype, NULL);
 
       scale = build_int_cst (scaletype, gather_scale);
 
+      if (TREE_CODE (TREE_TYPE (rettype)) == INTEGER_TYPE)
+	merge = build_int_cst (TREE_TYPE (rettype), 0);
+      else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (rettype)))
+	{
+	  REAL_VALUE_TYPE r;
+	  long tmp[6];
+	  for (j = 0; j < 6; ++j)
+	    tmp[j] = 0;
+	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (rettype)));
+	  merge = build_real (TREE_TYPE (rettype), r);
+	}
+      else
+	gcc_unreachable ();
+      merge = build_vector_from_val (rettype, merge);
+      merge = vect_init_vector (stmt, merge, rettype, NULL);
+
       prev_stmt_info = NULL;
       for (j = 0; j < ncopies; ++j)
 	{
@@ -5739,7 +5759,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    }
 
 	  new_stmt
-	    = gimple_build_call (gather_decl, 5, mask, ptr, op, mask, scale);
+	    = gimple_build_call (gather_decl, 5, merge, ptr, op, mask, scale);
 
 	  if (!useless_type_conversion_p (vectype, rettype))
 	    {
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 54e73c8..00e56dc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -683,8 +683,8 @@ struct dataref_aux {
    conversion.  */
 #define MAX_INTERM_CVT_STEPS         3
 
-/* The maximum vectorization factor supported by any target (V32QI).  */
-#define MAX_VECTORIZATION_FACTOR 32
+/* The maximum vectorization factor supported by any target (V64QI).  */
+#define MAX_VECTORIZATION_FACTOR 64
 
 /* Avoid GTY(()) on stmt_vec_info.  */
 typedef void *vec_void_p;

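A note on the ix86_expand_sse_cmp hunk above: the integer mask mode it
derives via mode_for_size can be worked out by hand (a sketch; the
__mmask names are the intrinsics-level spelling, not part of the patch):

  /* One mask bit per vector element, so the mask mode is the integer
     mode whose bit width equals the lane count:
       V16SF/V16SI: 16 lanes -> mode_for_size (16, MODE_INT, 0) == HImode
                    (__mmask16)
       V8DF/V8DI:    8 lanes -> mode_for_size (8, MODE_INT, 0)  == QImode
                    (__mmask8)  */

The same HImode mask appears in ix86_emit_swsqrtsf above, where the
V16SFmode compare result is allocated with gen_reg_rtx (HImode).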

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2013-12-18 13:08       ` Kirill Yukhin
@ 2013-12-22 10:47         ` Uros Bizjak
  2013-12-22 12:52           ` Jakub Jelinek
  2013-12-30 11:00           ` Kirill Yukhin
  0 siblings, 2 replies; 43+ messages in thread
From: Uros Bizjak @ 2013-12-22 10:47 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Richard Henderson, Jakub Jelinek, GCC Patches

On Wed, Dec 18, 2013 at 2:07 PM, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
> Hello,
>
> On 02 Dec 16:13, Kirill Yukhin wrote:
>> Hello,
>> On 19 Nov 12:14, Kirill Yukhin wrote:
>> > Hello,
>> > On 15 Nov 20:10, Kirill Yukhin wrote:
>> > > > Is it ok to commit to main trunk?
>> > > Ping.
>> > Ping.
>> Ping.
> Ping.
>
> Updated patch in the bottom.

This patch actually implements the AVX512F argument-passing ABI and insn
sequence generation for basic vector functionality (I guess these
expanders were referred to as "hooks"). Looking at the title, I expected
most changes in ix86_builtin_vectorized_function, but IIRC I have seen
those in another patch.

@@ -18629,6 +18713,11 @@ ix86_expand_vector_convert_uns_vsivsf (rtx target, rtx val)
   enum machine_mode fltmode = GET_MODE (target);
   rtx (*cvt) (rtx, rtx);

+  if (intmode == V16SImode)
+    {
+      emit_insn (gen_ufloatv16siv16sf2 (target, val));
+      return;
+    }
   if (intmode == V4SImode)

Please put the above directly in the sse.md expander. Use static mode
checks in .md files (see below).
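
Something along these lines would do (an illustrative sketch only; the
expander name and enabling condition are taken from the existing VF1
floatuns pattern in sse.md, assuming VF1 already covers V16SF at this
point in the series):

(define_expand "floatuns<sseintvecmodelower><mode>2"
  [(match_operand:VF1 0 "register_operand")
   (match_operand:<sseintvecmode> 1 "register_operand")]
  "TARGET_SSE2 && (<MODE>mode == V4SFmode || TARGET_AVX2)"
{
  /* AVX512F provides a native unsigned SI->SF conversion, so the
     generic expansion helper is only needed for narrower modes.  */
  if (<MODE>mode == V16SFmode)
    emit_insn (gen_ufloatv16siv16sf2 (operands[0], operands[1]));
  else
    ix86_expand_vector_convert_uns_vsivsf (operands[0], operands[1]);
  DONE;
})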

@@ -3555,11 +3556,17 @@
    (match_operand:VF1 1 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[3];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
-  emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
-  emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+  if (GET_MODE (operands[1]) == V16SFmode)
+    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
+					  operands[1]));
+  else

Please use static mode checks in the form of "if (<MODE>mode ==
V16SFmode)" in the .md files. The <MODE> will be substituted from the
VF1 mode iterator in the C source before compilation.
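
That is, the hunk above could read as follows (a sketch with unchanged
behavior; the mode check now folds to a compile-time constant when the
VF1 iterator is unrolled, one expander copy per mode):

  if (<MODE>mode == V16SFmode)
    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0], operands[1]));
  else
    {
      rtx tmp[3];
      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
      tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
      emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
      emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
    }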

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 99f6b1f..da08020 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -5636,7 +5636,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,

This (minor) change should be reviewed by a middle-end reviewer.

BTW: I didn't find an (updated?) ChangeLog anywhere in the message, so I
took the one from the original submission:

2013-11-12  Alexander Ivchenko  <alexander.ivchenko@intel.com>
            Maxim Kuznetsov  <maxim.kuznetsov@intel.com>
            Sergey Lega  <sergey.s.lega@intel.com>
            Anna Tikhonova  <anna.tikhonova@intel.com>
            Ilya Tocar  <ilya.tocar@intel.com>
            Andrey Turetskiy  <andrey.turetskiy@intel.com>
            Ilya Verbin  <ilya.verbin@intel.com>
            Kirill Yukhin  <kirill.yukhin@intel.com>
            Michael Zolotukhin  <michael.v.zolotukhin@intel.com>

        * config/i386/i386.c (MAX_CLASSES): Increase number of classes.
        (classify_argument): Extend for 512 bit vectors.
        (construct_container): Ditto.
        (function_arg_advance_32): Ditto.
        (function_arg_advance_64): Ditto.
        (function_arg_32): Ditto.
        (function_arg_64): Ditto.
        (function_value_32): Ditto.
        (return_in_memory_32): Ditto.
        (ix86_gimplify_va_arg): Ditto.
        (standard_sse_constant_p): Ditto.
        (standard_sse_constant_opcode): Ditto.
        (ix86_expand_vector_convert_uns_vsivsf): Ditto.
        (ix86_build_const_vector): Ditto.
        (ix86_build_signbit_mask): Ditto.
        (ix86_expand_sse_cmp): Extend for AVX512.
        (ix86_expand_sse_movcc): Ditto.
        (ix86_expand_int_vcond): Ditto.
        (ix86_expand_vec_perm): Ditto.
        (ix86_expand_sse_unpack): Ditto.
        (ix86_constant_alignment): Ditto.
        (avx_vpermilp_parallel): Ditto.
        (ix86_rtx_costs): Ditto.
        (ix86_expand_vector_init_duplicate): Ditto.
        (ix86_expand_vector_init_concat): Ditto.
        (ix86_expand_vector_init_general): Ditto.
        (ix86_expand_vector_extract): Ditto.
        (emit_reduc_half): Ditto.
        (ix86_vector_mode_supported_p): Ditto.
        (ix86_emit_swdivsf): Ditto.
        (ix86_emit_swsqrtsf): Ditto.
        (expand_vec_perm_1): Ditto.
        (ix86_vectorize_vec_perm_const_ok): Ditto.
        (ix86_expand_mul_widen_evenodd): Ditto.
        (ix86_expand_sse2_mulvxdi3): Ditto.
        (ix86_preferred_simd_mode): Ditto.
        (ix86_autovectorize_vector_sizes): Ditto.
        (ix86_expand_vec_perm_vpermi2): New.
        (ix86_vector_duplicate_value): Ditto.
        * config/i386/sse.md (fixuns_trunc<mode><sseintvecmodelower>2):
        Extend for AVX512.
        (vec_pack_ufix_trunc_<mode>): Ditto.
        * tree-vect-stmts.c (vectorizable_load): Support AVX512's gathers.
        * tree-vectorizer.h (MAX_VECTORIZATION_FACTOR): Extend for
        512 bit vectors.

I assumed the same testing procedure as described in the original submission:

Testing:
  1. Bootstrap pass.
  2. make check shows no regressions.
  3. Spec 2000 & 2006 build show no regressions both with and without
-mavx512f option.
  4. Spec 2000 & 2006 run shows no stability regressions without
-mavx512f option.

The x86 part is OK for mainline. You will also need approval from the
middle-end reviewer for tree-* parts.

Thanks,
Uros.


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2013-12-22 10:47         ` Uros Bizjak
@ 2013-12-22 12:52           ` Jakub Jelinek
  2013-12-30 11:00           ` Kirill Yukhin
  1 sibling, 0 replies; 43+ messages in thread
From: Jakub Jelinek @ 2013-12-22 12:52 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, Richard Henderson, GCC Patches

On Sun, Dec 22, 2013 at 11:47:52AM +0100, Uros Bizjak wrote:
>         * tree-vect-stmts.c (vectorizable_load): Support AVX512's gathers.
>         * tree-vectorizer.h (MAX_VECTORIZATION_FACTOR): Extend for 512
> bit vectors.
> 
> I assumed the same testing procedure as described in the original submission:
> 
> Testing:
>   1. Bootstrap pass.
>   2. make check shows no regressions.
>   3. Spec 2000 & 2006 build show no regressions both with and without
> -mavx512f option.
>   4. Spec 2000 & 2006 run shows no stability regressions without
> -mavx512f option.
> 
> The x86 part is OK for mainline. You will also need approval from the
> middle-end reviewer for tree-* parts.

The tree parts are ok for trunk, but likely insufficient by now: you
need similar changes for vectorizable_mask_load_store (which also
handles gathers), and please verify whether even the non-gather mask
load/store needs any tweaking for AVX512F (integer masks rather than
vector ones?).
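
To illustrate the distinction with the gather code from the patch itself
(a sketch: AVX2 gathers take a vector mask, while the AVX512F gathers
take a scalar integer mask such as __mmask16):

      if (TREE_CODE (masktype) == INTEGER_TYPE)
	/* AVX512F: the all-ones mask is a plain integer constant.  */
	mask = build_int_cst (masktype, -1);
      else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
	{
	  /* AVX2 and earlier: the all-ones mask is a vector of -1s.  */
	  mask = build_int_cst (TREE_TYPE (masktype), -1);
	  mask = build_vector_from_val (masktype, mask);
	}

The same shape would presumably be needed in vectorizable_mask_load_store.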

	Jakub


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2013-12-22 10:47         ` Uros Bizjak
  2013-12-22 12:52           ` Jakub Jelinek
@ 2013-12-30 11:00           ` Kirill Yukhin
  2014-01-01 23:08             ` Eric Botcazou
  1 sibling, 1 reply; 43+ messages in thread
From: Kirill Yukhin @ 2013-12-30 11:00 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Henderson, Jakub Jelinek, GCC Patches

Hello Uroš, Jakub,
On 22 Dec 11:47, Uros Bizjak wrote:
> The x86 part is OK for mainline. You will also need approval from the
> middle-end reviewer for tree-* parts.

Thanks, I'm testing the patch at the bottom (in the agreed volume;
bootstrap has passed so far).

If there are no more comments, I'll check it in to main trunk tomorrow
(Moscow time) once testing is over.

Jakub, I've filed: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59617
But I'm not sure the fix (once one exists) can go to main trunk, since
it is a performance issue.

gcc/
2013-12-30  Alexander Ivchenko  <alexander.ivchenko@intel.com>
	    Maxim Kuznetsov  <maxim.kuznetsov@intel.com>
	    Sergey Lega  <sergey.s.lega@intel.com>
	    Anna Tikhonova  <anna.tikhonova@intel.com>
	    Ilya Tocar  <ilya.tocar@intel.com>
	    Andrey Turetskiy  <andrey.turetskiy@intel.com>
	    Ilya Verbin  <ilya.verbin@intel.com>
	    Kirill Yukhin  <kirill.yukhin@intel.com>
	    Michael Zolotukhin  <michael.v.zolotukhin@intel.com>

	* config/i386/i386.c (MAX_CLASSES): Increase number of classes.
	(classify_argument): Extend for 512 bit vectors.
	(construct_container): Ditto.
	(function_arg_advance_32): Ditto.
	(function_arg_advance_64): Ditto.
	(function_arg_32): Ditto.
	(function_arg_64): Ditto.
	(function_value_32): Ditto.
	(return_in_memory_32): Ditto.
	(ix86_gimplify_va_arg): Ditto.
	(standard_sse_constant_p): Ditto.
	(standard_sse_constant_opcode): Ditto.
	(ix86_expand_vector_convert_uns_vsivsf): Ditto.
	(ix86_build_const_vector): Ditto.
	(ix86_build_signbit_mask): Ditto.
	(ix86_expand_sse_cmp): Extend for AVX512.
	(ix86_expand_sse_movcc): Ditto.
	(ix86_expand_int_vcond): Ditto.
	(ix86_expand_vec_perm): Ditto.
	(ix86_expand_sse_unpack): Ditto.
	(ix86_constant_alignment): Ditto.
	(ix86_builtin_vectorized_function): Ditto.
	(ix86_vectorize_builtin_gather): Ditto.
	(avx_vpermilp_parallel): Ditto.
	(ix86_rtx_costs): Ditto.
	(ix86_expand_vector_init_duplicate): Ditto.
	(ix86_expand_vector_init_concat): Ditto.
	(ix86_expand_vector_init_general): Ditto.
	(ix86_expand_vector_extract): Ditto.
	(emit_reduc_half): Ditto.
	(ix86_vector_mode_supported_p): Ditto.
	(ix86_emit_swdivsf): Ditto.
	(ix86_emit_swsqrtsf): Ditto.
	(expand_vec_perm_1): Ditto.
	(ix86_vectorize_vec_perm_const_ok): Ditto.
	(ix86_expand_mul_widen_evenodd): Ditto.
	(ix86_expand_sse2_mulvxdi3): Ditto.
	(ix86_preferred_simd_mode): Ditto.
	(ix86_autovectorize_vector_sizes): Ditto.
	(ix86_expand_vec_perm_vpermi2): New.
	(ix86_vector_duplicate_value): Ditto.
	(IX86_BUILTIN_SQRTPD512, IX86_BUILTIN_EXP2PS, IX86_BUILTIN_SQRTPS_NR512,
	IX86_BUILTIN_GATHER3ALTDIV16SF, IX86_BUILTIN_GATHER3ALTDIV16SI,
	IX86_BUILTIN_GATHER3ALTSIV8DF, IX86_BUILTIN_GATHER3ALTSIV8DI,
	IX86_BUILTIN_GATHER3DIV16SF, IX86_BUILTIN_GATHER3DIV16SI,
	IX86_BUILTIN_GATHER3DIV8DF, IX86_BUILTIN_GATHER3DIV8DI,
	IX86_BUILTIN_GATHER3SIV16SF, IX86_BUILTIN_GATHER3SIV16SI,
	IX86_BUILTIN_GATHER3SIV8DF, IX86_BUILTIN_CEILPD_VEC_PACK_SFIX512,
	IX86_BUILTIN_CPYSGNPS512, IX86_BUILTIN_CPYSGNPD512,
	IX86_BUILTIN_FLOORPD_VEC_PACK_SFIX512,
	IX86_BUILTIN_ROUNDPD_AZ_VEC_PACK_SFIX512): Ditto.
	* config/i386/sse.md (*mov<mode>_internal): Disable SSE typeless
	stores vectors > 128bit (AVX*).
	(<sse>_storeu<ssemodesuffix><avxsizesuffix>): Ditto.
	(<sse2_avx_avx512f>_storedqu<mode>): Extend for AVX-512, disable
	SSE typeless stores vectors > 128bit (AVX*).
	(fixuns_trunc<mode><sseintvecmodelower>2): Extend for AVX-512.
	(vec_pack_ufix_trunc_<mode>): Ditto.
	(vec_unpacku_float_hi_v16si): New.
	* tree-vect-stmts.c (vectorizable_load): Support AVX512's gathers.
	* tree-vectorizer.h (MAX_VECTORIZATION_FACTOR): Extend for 512 bit
	vectors.

testsuite/
2013-12-30  Alexander Ivchenko  <alexander.ivchenko@intel.com>
	    Maxim Kuznetsov  <maxim.kuznetsov@intel.com>
	    Sergey Lega  <sergey.s.lega@intel.com>
	    Anna Tikhonova  <anna.tikhonova@intel.com>
	    Ilya Tocar  <ilya.tocar@intel.com>
	    Andrey Turetskiy  <andrey.turetskiy@intel.com>
	    Ilya Verbin  <ilya.verbin@intel.com>
	    Kirill Yukhin  <kirill.yukhin@intel.com>
	    Michael Zolotukhin  <michael.v.zolotukhin@intel.com>

	* gcc.target/i386/pr49002-2.c: Allow vmovapd generation.

--
Thanks, K

---
 gcc/config/i386/i386.c                    | 673 ++++++++++++++++++++++++++----
 gcc/config/i386/sse.md                    | 115 +++--
 gcc/testsuite/gcc.target/i386/pr49002-2.c |   2 +-
 gcc/tree-vect-stmts.c                     |  34 +-
 gcc/tree-vectorizer.h                     |   4 +-
 5 files changed, 717 insertions(+), 111 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2fc9b80..b0002ff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2308,7 +2308,7 @@ enum x86_64_reg_class
     X86_64_MEMORY_CLASS
   };
 
-#define MAX_CLASSES 4
+#define MAX_CLASSES 8
 
 /* Table of constants used by fldpi, fldln2, etc....  */
 static REAL_VALUE_TYPE ext_80387_constants_table [5];
@@ -6242,7 +6242,7 @@ merge_classes (enum x86_64_reg_class class1, enum x86_64_reg_class class2)
    sized containers, classes[0] will be NO_CLASS and 1 is returned.
 
    BIT_OFFSET is used internally for handling records and specifies offset
-   of the offset in bits modulo 256 to avoid overflow cases.
+   of the offset in bits modulo 512 to avoid overflow cases.
 
    See the x86-64 PS ABI for details.
 */
@@ -6342,7 +6342,7 @@ classify_argument (enum machine_mode mode, const_tree type,
 		      num = classify_argument (TYPE_MODE (type), type,
 					       subclasses,
 					       (int_bit_position (field)
-						+ bit_offset) % 256);
+						+ bit_offset) % 512);
 		      if (!num)
 			return 0;
 		      pos = (int_bit_position (field)
@@ -6592,6 +6592,21 @@ classify_argument (enum machine_mode mode, const_tree type,
       classes[2] = X86_64_SSEUP_CLASS;
       classes[3] = X86_64_SSEUP_CLASS;
       return 4;
+    case V8DFmode:
+    case V16SFmode:
+    case V8DImode:
+    case V16SImode:
+    case V32HImode:
+    case V64QImode:
+      classes[0] = X86_64_SSE_CLASS;
+      classes[1] = X86_64_SSEUP_CLASS;
+      classes[2] = X86_64_SSEUP_CLASS;
+      classes[3] = X86_64_SSEUP_CLASS;
+      classes[4] = X86_64_SSEUP_CLASS;
+      classes[5] = X86_64_SSEUP_CLASS;
+      classes[6] = X86_64_SSEUP_CLASS;
+      classes[7] = X86_64_SSEUP_CLASS;
+      return 8;
     case V4SFmode:
     case V4SImode:
     case V16QImode:
@@ -6777,6 +6792,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
       && mode != BLKmode)
     return gen_reg_or_parallel (mode, orig_mode,
 				SSE_REGNO (sse_regno));
+  if (n == 8
+      && regclass[0] == X86_64_SSE_CLASS
+      && regclass[1] == X86_64_SSEUP_CLASS
+      && regclass[2] == X86_64_SSEUP_CLASS
+      && regclass[3] == X86_64_SSEUP_CLASS
+      && regclass[4] == X86_64_SSEUP_CLASS
+      && regclass[5] == X86_64_SSEUP_CLASS
+      && regclass[6] == X86_64_SSEUP_CLASS
+      && regclass[7] == X86_64_SSEUP_CLASS
+      && mode != BLKmode)
+    return gen_reg_or_parallel (mode, orig_mode,
+				SSE_REGNO (sse_regno));
   if (n == 2
       && regclass[0] == X86_64_X87_CLASS
       && regclass[1] == X86_64_X87UP_CLASS)
@@ -6858,6 +6885,18 @@ construct_container (enum machine_mode mode, enum machine_mode orig_mode,
 		tmpmode = OImode;
 		i += 3;
 		break;
+	      case 8:
+		gcc_assert (i == 0
+			    && regclass[1] == X86_64_SSEUP_CLASS
+			    && regclass[2] == X86_64_SSEUP_CLASS
+			    && regclass[3] == X86_64_SSEUP_CLASS
+			    && regclass[4] == X86_64_SSEUP_CLASS
+			    && regclass[5] == X86_64_SSEUP_CLASS
+			    && regclass[6] == X86_64_SSEUP_CLASS
+			    && regclass[7] == X86_64_SSEUP_CLASS);
+		tmpmode = XImode;
+		i += 7;
+		break;
 	      default:
 		gcc_unreachable ();
 	      }
@@ -6931,6 +6970,12 @@ function_arg_advance_32 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 
     case V8SFmode:
     case V8SImode:
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V32QImode:
     case V16HImode:
     case V4DFmode:
@@ -6982,8 +7027,9 @@ function_arg_advance_64 (CUMULATIVE_ARGS *cum, enum machine_mode mode,
 {
   int int_nregs, sse_nregs;
 
-  /* Unnamed 256bit vector mode parameters are passed on stack.  */
-  if (!named && VALID_AVX256_REG_MODE (mode))
+  /* Unnamed 512 and 256bit vector mode parameters are passed on stack.  */
+  if (!named && (VALID_AVX512F_REG_MODE (mode)
+		 || VALID_AVX256_REG_MODE (mode)))
     return;
 
   if (examine_argument (mode, type, 0, &int_nregs, &sse_nregs)
@@ -7134,9 +7180,16 @@ function_arg_32 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
       break;
 
     case OImode:
-      /* OImode shouldn't be used directly.  */
+    case XImode:
+      /* OImode and XImode shouldn't be used directly.  */
       gcc_unreachable ();
 
+    case V64QImode:
+    case V32HImode:
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
     case V8SFmode:
     case V8SImode:
     case V32QImode:
@@ -7199,7 +7252,13 @@ function_arg_64 (const CUMULATIVE_ARGS *cum, enum machine_mode mode,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!named)
 	return NULL;
       break;
@@ -7602,6 +7661,10 @@ function_value_32 (enum machine_mode orig_mode, enum machine_mode mode,
   else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 32)
     regno = FIRST_SSE_REG;
 
+  /* 64-byte vector modes in %zmm0.  */
+  else if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+    regno = FIRST_SSE_REG;
+
   /* Floating point return values in %st(0) (unless -mno-fp-ret-in-387).  */
   else if (X87_FLOAT_MODE_P (mode) && TARGET_FLOAT_RETURNS_IN_80387)
     regno = FIRST_FLOAT_REG;
@@ -7809,6 +7872,10 @@ return_in_memory_32 (const_tree type, enum machine_mode mode)
       /* AVX values are returned in YMM0, except when it doesn't exist.  */
       if (size == 32)
 	return !TARGET_AVX;
+
+      /* AVX512F values are returned in ZMM0, except when it doesn't exist.  */
+      if (size == 64)
+	return !TARGET_AVX512F;
     }
 
   if (mode == XFmode)
@@ -8345,7 +8412,13 @@ ix86_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
     case V16HImode:
     case V4DFmode:
     case V4DImode:
-      /* Unnamed 256bit vector mode parameters are passed on stack.  */
+    case V16SFmode:
+    case V16SImode:
+    case V64QImode:
+    case V32HImode:
+    case V8DFmode:
+    case V8DImode:
+      /* Unnamed 256 and 512bit vector mode parameters are passed on stack.  */
       if (!TARGET_64BIT_MS_ABI)
 	{
 	  container = NULL;
@@ -8760,6 +8833,12 @@ standard_sse_constant_p (rtx x)
       case V4DImode:
 	if (TARGET_AVX2)
 	  return 2;
+      case V64QImode:
+      case V32HImode:
+      case V16SImode:
+      case V8DImode:
+	if (TARGET_AVX512F)
+	  return 2;
       default:
 	break;
       }
@@ -8778,6 +8857,11 @@ standard_sse_constant_opcode (rtx insn, rtx x)
     case 1:
       switch (get_attr_mode (insn))
 	{
+	case MODE_XI:
+	case MODE_V16SF:
+	  return "vpxord\t%g0, %g0, %g0";
+	case MODE_V8DF:
+	  return "vpxorq\t%g0, %g0, %g0";
 	case MODE_TI:
 	  return "%vpxor\t%0, %d0";
 	case MODE_V2DF:
@@ -18668,17 +18752,23 @@ ix86_build_const_vector (enum machine_mode mode, bool vect, rtx value)
 
   switch (mode)
     {
+    case V64QImode:
     case V32QImode:
     case V16QImode:
+    case V32HImode:
     case V16HImode:
     case V8HImode:
+    case V16SImode:
     case V8SImode:
     case V4SImode:
+    case V8DImode:
     case V4DImode:
     case V2DImode:
       gcc_assert (vect);
+    case V16SFmode:
     case V8SFmode:
     case V4SFmode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       n_elt = GET_MODE_NUNITS (mode);
@@ -18715,6 +18805,8 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
   /* Find the sign bit, sign extended to 2*HWI.  */
   switch (mode)
     {
+    case V16SImode:
+    case V16SFmode:
     case V8SImode:
     case V4SImode:
     case V8SFmode:
@@ -18725,8 +18817,10 @@ ix86_build_signbit_mask (enum machine_mode mode, bool vect, bool invert)
       lo = 0x80000000, hi = lo < 0;
       break;
 
+    case V8DImode:
     case V4DImode:
     case V2DImode:
+    case V8DFmode:
     case V4DFmode:
     case V2DFmode:
       vec_mode = mode;
@@ -20583,22 +20677,63 @@ ix86_expand_sse_cmp (rtx dest, enum rtx_code code, rtx cmp_op0, rtx cmp_op1,
 		     rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
-  enum machine_mode cmp_mode = GET_MODE (cmp_op0);
+  enum machine_mode cmp_ops_mode = GET_MODE (cmp_op0);
+
+  /* In the general case, the result of a comparison can differ from the
+     operands' type.  */
+  enum machine_mode cmp_mode;
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = false;
   rtx x;
 
-  cmp_op0 = force_reg (cmp_mode, cmp_op0);
-  if (!nonimmediate_operand (cmp_op1, cmp_mode))
-    cmp_op1 = force_reg (cmp_mode, cmp_op1);
+  if (GET_MODE_SIZE (cmp_ops_mode) == 64)
+    {
+      cmp_mode = mode_for_size (GET_MODE_NUNITS (cmp_ops_mode), MODE_INT, 0);
+      gcc_assert (cmp_mode != BLKmode);
+
+      maskcmp = true;
+    }
+  else
+    cmp_mode = cmp_ops_mode;
+
+  cmp_op0 = force_reg (cmp_ops_mode, cmp_op0);
+  if (!nonimmediate_operand (cmp_op1, cmp_ops_mode))
+    cmp_op1 = force_reg (cmp_ops_mode, cmp_op1);
 
   if (optimize
       || reg_overlap_mentioned_p (dest, op_true)
       || reg_overlap_mentioned_p (dest, op_false))
-    dest = gen_reg_rtx (mode);
+    dest = gen_reg_rtx (maskcmp ? cmp_mode : mode);
+
+  /* Compare patterns for int modes are unspec in AVX512F only.  */
+  if (maskcmp && (code == GT || code == EQ))
+    {
+      rtx (*gen)(rtx, rtx, rtx);
 
+      switch (cmp_ops_mode)
+	{
+	case V16SImode:
+	  gen = code == GT ? gen_avx512f_gtv16si3 : gen_avx512f_eqv16si3_1;
+	  break;
+	case V8DImode:
+	  gen = code == GT ? gen_avx512f_gtv8di3 : gen_avx512f_eqv8di3_1;
+	  break;
+	default:
+	  gen = NULL;
+	}
+
+      if (gen)
+	{
+	  emit_insn (gen (dest, cmp_op0, cmp_op1));
+	  return dest;
+	}
+    }
   x = gen_rtx_fmt_ee (code, cmp_mode, cmp_op0, cmp_op1);
-  if (cmp_mode != mode)
+
+  if (cmp_mode != mode && !maskcmp)
     {
-      x = force_reg (cmp_mode, x);
+      x = force_reg (cmp_ops_mode, x);
       convert_move (dest, x, false);
     }
   else
@@ -20614,33 +20749,43 @@ static void
 ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 {
   enum machine_mode mode = GET_MODE (dest);
+  enum machine_mode cmpmode = GET_MODE (cmp);
+
+  /* In AVX512F the result of comparison is an integer mask.  */
+  bool maskcmp = (mode != cmpmode && TARGET_AVX512F);
+
   rtx t2, t3, x;
 
   if (vector_all_ones_operand (op_true, mode)
-      && rtx_equal_p (op_false, CONST0_RTX (mode)))
+      && rtx_equal_p (op_false, CONST0_RTX (mode))
+      && !maskcmp)
     {
       emit_insn (gen_rtx_SET (VOIDmode, dest, cmp));
     }
-  else if (op_false == CONST0_RTX (mode))
+  else if (op_false == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
       x = gen_rtx_AND (mode, cmp, op_true);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (op_true == CONST0_RTX (mode))
+  else if (op_true == CONST0_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_NOT (mode, cmp);
       x = gen_rtx_AND (mode, x, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode))
+  else if (INTEGRAL_MODE_P (mode) && op_true == CONSTM1_RTX (mode)
+      && !maskcmp)
     {
       op_false = force_reg (mode, op_false);
       x = gen_rtx_IOR (mode, cmp, op_false);
       emit_insn (gen_rtx_SET (VOIDmode, dest, x));
     }
-  else if (TARGET_XOP)
+  else if (TARGET_XOP
+      && !maskcmp)
     {
       op_true = force_reg (mode, op_true);
 
@@ -20708,6 +20853,20 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
 	      cmp = gen_lowpart (V32QImode, cmp);
 	    }
 	  break;
+
+	case V16SImode:
+	  gen = gen_avx512f_blendmv16si;
+	  break;
+	case V8DImode:
+	  gen = gen_avx512f_blendmv8di;
+	  break;
+	case V8DFmode:
+	  gen = gen_avx512f_blendmv8df;
+	  break;
+	case V16SFmode:
+	  gen = gen_avx512f_blendmv16sf;
+	  break;
+
 	default:
 	  break;
 	}
@@ -20975,6 +21134,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 	  switch (mode)
 	    {
+	    case V16SImode:
+	    case V8DImode:
 	    case V8SImode:
 	    case V4DImode:
 	    case V4SImode:
@@ -20985,6 +21146,8 @@ ix86_expand_int_vcond (rtx operands[])
 
 		  switch (mode)
 		    {
+		    case V16SImode: gen_sub3 = gen_subv16si3; break;
+		    case V8DImode: gen_sub3 = gen_subv8di3; break;
 		    case V8SImode: gen_sub3 = gen_subv8si3; break;
 		    case V4DImode: gen_sub3 = gen_subv4di3; break;
 		    case V4SImode: gen_sub3 = gen_subv4si3; break;
@@ -21040,7 +21203,8 @@ ix86_expand_int_vcond (rtx operands[])
       gcc_assert (GET_MODE_SIZE (data_mode) == GET_MODE_SIZE (mode));
       x = ix86_expand_sse_cmp (gen_reg_rtx (mode), code, cop0, cop1,
 			       operands[1+negate], operands[2-negate]);
-      x = gen_lowpart (data_mode, x);
+      if (GET_MODE (x) == mode)
+	x = gen_lowpart (data_mode, x);
     }
 
   ix86_expand_sse_movcc (operands[0], x, operands[1+negate],
@@ -21048,6 +21212,35 @@ ix86_expand_int_vcond (rtx operands[])
   return true;
 }
 
+static bool
+ix86_expand_vec_perm_vpermi2 (rtx target, rtx op0, rtx mask, rtx op1)
+{
+  enum machine_mode mode = GET_MODE (op0);
+  switch (mode)
+    {
+    case V16SImode:
+      emit_insn (gen_avx512f_vpermi2varv16si3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V16SFmode:
+      emit_insn (gen_avx512f_vpermi2varv16sf3 (target, op0,
+					      force_reg (V16SImode, mask),
+					      op1));
+      return true;
+    case V8DImode:
+      emit_insn (gen_avx512f_vpermi2varv8di3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    case V8DFmode:
+      emit_insn (gen_avx512f_vpermi2varv8df3 (target, op0,
+					     force_reg (V8DImode, mask), op1));
+      return true;
+    default:
+      return false;
+    }
+}
+
 /* Expand a variable vector permutation.  */
 
 void
@@ -21066,7 +21259,10 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 32);
+  gcc_assert (w <= 64);
+
+  if (ix86_expand_vec_perm_vpermi2 (target, op0, mask, op1))
+    return;
 
   if (TARGET_AVX2)
     {
@@ -21446,6 +21642,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v32qi : gen_vec_extract_lo_v32qi;
 	  break;
+	case V32HImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv16hiv16si2;
+	  else
+	    unpack = gen_avx512f_sign_extendv16hiv16si2;
+	  halfmode = V16HImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v32hi : gen_vec_extract_lo_v32hi;
+	  break;
 	case V16HImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv8hiv8si2;
@@ -21455,6 +21660,15 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  extract
 	    = high_p ? gen_vec_extract_hi_v16hi : gen_vec_extract_lo_v16hi;
 	  break;
+	case V16SImode:
+	  if (unsigned_p)
+	    unpack = gen_avx512f_zero_extendv8siv8di2;
+	  else
+	    unpack = gen_avx512f_sign_extendv8siv8di2;
+	  halfmode = V8SImode;
+	  extract
+	    = high_p ? gen_vec_extract_hi_v16si : gen_vec_extract_lo_v16si;
+	  break;
 	case V8SImode:
 	  if (unsigned_p)
 	    unpack = gen_avx2_zero_extendv4siv4di2;
@@ -21486,7 +21700,7 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
 	  gcc_unreachable ();
 	}
 
-      if (GET_MODE_SIZE (imode) == 32)
+      if (GET_MODE_SIZE (imode) >= 32)
 	{
 	  tmp = gen_reg_rtx (halfmode);
 	  emit_insn (extract (tmp, src));
@@ -26219,7 +26433,8 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+  int max_align = optimize_size ? BITS_PER_WORD
+				: MIN (512, MAX_OFILE_ALIGNMENT);
 
   if (opt
       && AGGREGATE_TYPE_P (type)
@@ -27681,12 +27896,27 @@ enum ix86_builtins
   IX86_BUILTIN_GATHERDIV4SI,
   IX86_BUILTIN_GATHERDIV8SI,
 
+  IX86_BUILTIN_SQRTPD512,
+  IX86_BUILTIN_EXP2PS,
+  IX86_BUILTIN_SQRTPS_NR512,
+
   /* Alternate 4 element gather for the vectorizer where
      all operands are 32-byte wide.  */
   IX86_BUILTIN_GATHERALTSIV4DF,
   IX86_BUILTIN_GATHERALTDIV8SF,
   IX86_BUILTIN_GATHERALTSIV4DI,
   IX86_BUILTIN_GATHERALTDIV8SI,
+  IX86_BUILTIN_GATHER3ALTDIV16SF,
+  IX86_BUILTIN_GATHER3ALTDIV16SI,
+  IX86_BUILTIN_GATHER3ALTSIV8DF,
+  IX86_BUILTIN_GATHER3ALTSIV8DI,
+  IX86_BUILTIN_GATHER3DIV16SF,
+  IX86_BUILTIN_GATHER3DIV16SI,
+  IX86_BUILTIN_GATHER3DIV8DF,
+  IX86_BUILTIN_GATHER3DIV8DI,
+  IX86_BUILTIN_GATHER3SIV16SF,
+  IX86_BUILTIN_GATHER3SIV16SI,
+  IX86_BUILTIN_GATHER3SIV8DF,
 
   /* TFmode support builtins.  */
   IX86_BUILTIN_INFQ,
@@ -27695,10 +27925,16 @@ enum ix86_builtins
   IX86_BUILTIN_COPYSIGNQ,
 
   /* Vectorizer support builtins.  */
+  IX86_BUILTIN_CEILPD_VEC_PACK_SFIX512,
   IX86_BUILTIN_CPYSGNPS,
   IX86_BUILTIN_CPYSGNPD,
   IX86_BUILTIN_CPYSGNPS256,
+  IX86_BUILTIN_CPYSGNPS512,
   IX86_BUILTIN_CPYSGNPD256,
+  IX86_BUILTIN_CPYSGNPD512,
+  IX86_BUILTIN_FLOORPD_VEC_PACK_SFIX512,
+  IX86_BUILTIN_ROUNDPD_AZ_VEC_PACK_SFIX512,
+
 
   /* FMA4 instructions.  */
   IX86_BUILTIN_VFMADDSS,
@@ -33876,6 +34112,16 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_SQRTPD);
 	  else if (out_n == 4 && in_n == 4)
 	    return ix86_get_builtin (IX86_BUILTIN_SQRTPD256);
+	  else if (out_n == 8 && in_n == 8)
+	    return ix86_get_builtin (IX86_BUILTIN_SQRTPD512);
+	}
+      break;
+
+    case BUILT_IN_EXP2F:
+      if (out_mode == SFmode && in_mode == SFmode)
+	{
+	  if (out_n == 16 && in_n == 16)
+	    return ix86_get_builtin (IX86_BUILTIN_EXP2PS);
 	}
       break;
 
@@ -33886,6 +34132,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_SQRTPS_NR);
 	  else if (out_n == 8 && in_n == 8)
 	    return ix86_get_builtin (IX86_BUILTIN_SQRTPS_NR256);
+	  else if (out_n == 16 && in_n == 16)
+	    return ix86_get_builtin (IX86_BUILTIN_SQRTPS_NR512);
 	}
       break;
 
@@ -33902,6 +34150,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_FLOORPD_VEC_PACK_SFIX);
 	  else if (out_n == 8 && in_n == 4)
 	    return ix86_get_builtin (IX86_BUILTIN_FLOORPD_VEC_PACK_SFIX256);
+	  else if (out_n == 16 && in_n == 8)
+	    return ix86_get_builtin (IX86_BUILTIN_FLOORPD_VEC_PACK_SFIX512);
 	}
       break;
 
@@ -33934,6 +34184,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_CEILPD_VEC_PACK_SFIX);
 	  else if (out_n == 8 && in_n == 4)
 	    return ix86_get_builtin (IX86_BUILTIN_CEILPD_VEC_PACK_SFIX256);
+	  else if (out_n == 16 && in_n == 8)
+	    return ix86_get_builtin (IX86_BUILTIN_CEILPD_VEC_PACK_SFIX512);
 	}
       break;
 
@@ -33990,6 +34242,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_ROUNDPD_AZ_VEC_PACK_SFIX);
 	  else if (out_n == 8 && in_n == 4)
 	    return ix86_get_builtin (IX86_BUILTIN_ROUNDPD_AZ_VEC_PACK_SFIX256);
+	  else if (out_n == 16 && in_n == 8)
+	    return ix86_get_builtin (IX86_BUILTIN_ROUNDPD_AZ_VEC_PACK_SFIX512);
 	}
       break;
 
@@ -34016,6 +34270,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPD);
 	  else if (out_n == 4 && in_n == 4)
 	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPD256);
+	  else if (out_n == 8 && in_n == 8)
+	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPD512);
 	}
       break;
 
@@ -34026,6 +34282,8 @@ ix86_builtin_vectorized_function (tree fndecl, tree type_out,
 	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPS);
 	  else if (out_n == 8 && in_n == 8)
 	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPS256);
+	  else if (out_n == 16 && in_n == 16)
+	    return ix86_get_builtin (IX86_BUILTIN_CPYSGNPS512);
 	}
       break;
 
@@ -34461,6 +34719,34 @@ ix86_vectorize_builtin_gather (const_tree mem_vectype,
     case V8SImode:
       code = si ? IX86_BUILTIN_GATHERSIV8SI : IX86_BUILTIN_GATHERALTDIV8SI;
       break;
+#if 0
+    /* FIXME: Commented out until the vectorizer can work with
+       (mask_type != src_type).  PR59617.  */
+    case V8DFmode:
+      if (TARGET_AVX512F)
+	code = si ? IX86_BUILTIN_GATHER3ALTSIV8DF : IX86_BUILTIN_GATHER3DIV8DF;
+      else
+	return NULL_TREE;
+      break;
+    case V8DImode:
+      if (TARGET_AVX512F)
+	code = si ? IX86_BUILTIN_GATHER3ALTSIV8DI : IX86_BUILTIN_GATHER3DIV8DI;
+      else
+	return NULL_TREE;
+      break;
+    case V16SFmode:
+      if (TARGET_AVX512F)
+	code = si ? IX86_BUILTIN_GATHER3SIV16SF : IX86_BUILTIN_GATHER3ALTDIV16SF;
+      else
+	return NULL_TREE;
+      break;
+    case V16SImode:
+      if (TARGET_AVX512F)
+	code = si ? IX86_BUILTIN_GATHER3SIV16SI : IX86_BUILTIN_GATHER3ALTDIV16SI;
+      else
+	return NULL_TREE;
+      break;
+#endif
     default:
       return NULL_TREE;
     }
@@ -34516,7 +34802,7 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 {
   unsigned i, nelt = GET_MODE_NUNITS (mode);
   unsigned mask = 0;
-  unsigned char ipar[8] = {};  /* Silence -Wuninitialized warning.  */
+  unsigned char ipar[16] = {};  /* Silence -Wuninitialized warning.  */
 
   if (XVECLEN (par, 0) != (int) nelt)
     return 0;
@@ -34539,6 +34825,24 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 
   switch (mode)
     {
+    case V8DFmode:
+      /* In the 512-bit DFmode case, we can only move elements within
+         a 128-bit lane.  First fill the second part of the mask,
+	 then fallthru.  */
+      for (i = 4; i < 6; ++i)
+	{
+	  if (ipar[i] < 4 || ipar[i] >= 6)
+	    return 0;
+	  mask |= (ipar[i] - 4) << i;
+	}
+      for (i = 6; i < 8; ++i)
+	{
+	  if (ipar[i] < 6)
+	    return 0;
+	  mask |= (ipar[i] - 6) << i;
+	}
+      /* FALLTHRU */
+
     case V4DFmode:
       /* In the 256-bit DFmode case, we can only move elements within
          a 128-bit lane.  */
@@ -34556,10 +34860,18 @@ avx_vpermilp_parallel (rtx par, enum machine_mode mode)
 	}
       break;
 
+    case V16SFmode:
+      /* In the 512-bit SFmode case, the permutation in the upper 256 bits
+	 must mirror the permutation in the lower 256 bits.  */
+      for (i = 0; i < 8; ++i)
+	if (ipar[i] + 8 != ipar[i + 8])
+	  return 0;
+      /* FALLTHRU */
+
     case V8SFmode:
-      /* In the 256-bit SFmode case, we have full freedom of movement
-	 within the low 128-bit lane, but the high 128-bit lane must
-	 mirror the exact same pattern.  */
+      /* In 256 bit SFmode case, we have full freedom of
+         movement within the low 128-bit lane, but the high 128-bit
+         lane must mirror the exact same pattern.  */
       for (i = 0; i < 4; ++i)
 	if (ipar[i] + 4 != ipar[i + 4])
 	  return 0;
@@ -35510,6 +35822,7 @@ static bool
 ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 		bool speed)
 {
+  rtx mask;
   enum rtx_code code = (enum rtx_code) code_i;
   enum rtx_code outer_code = (enum rtx_code) outer_code_i;
   enum machine_mode mode = GET_MODE (x);
@@ -35986,13 +36299,21 @@ ix86_rtx_costs (rtx x, int code_i, int outer_code_i, int opno, int *total,
 
     case VEC_SELECT:
     case VEC_CONCAT:
-    case VEC_MERGE:
     case VEC_DUPLICATE:
       /* ??? Assume all of these vector manipulation patterns are
 	 recognizable.  In which case they all pretty much have the
 	 same cost.  */
      *total = cost->fabs;
      return true;
+    case VEC_MERGE:
+      mask = XEXP (x, 2);
+      /* This is a masked instruction; assume the same cost
+	 as the non-masked variant.  */
+      if (TARGET_AVX512F && register_operand (mask, GET_MODE (mask)))
+	*total = rtx_cost (XEXP (x, 0), outer_code, opno, speed);
+      else
+	*total = cost->fabs;
+      return true;
 
     default:
       return false;
@@ -37158,6 +37479,36 @@ get_mode_wider_vector (enum machine_mode o)
   return n;
 }
 
+/* A subroutine of ix86_expand_vector_init_duplicate.  Tries to
+   fill target with val via vec_duplicate.  */
+
+static bool
+ix86_vector_duplicate_value (enum machine_mode mode, rtx target, rtx val)
+{
+  bool ok;
+  rtx insn, dup;
+
+  /* First attempt to recognize VAL as-is.  */
+  dup = gen_rtx_VEC_DUPLICATE (mode, val);
+  insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
+  if (recog_memoized (insn) < 0)
+    {
+      rtx seq;
+      /* If that fails, force VAL into a register.  */
+
+      start_sequence ();
+      XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
+      seq = get_insns ();
+      end_sequence ();
+      if (seq)
+	emit_insn_before (seq, insn);
+
+      ok = recog_memoized (insn) >= 0;
+      gcc_assert (ok);
+    }
+  return true;
+}
+
 /* A subroutine of ix86_expand_vector_init.  Store into TARGET a vector
    with all elements equal to VAR.  Return true if successful.  */
 
@@ -37183,29 +37534,11 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, enum machine_mode mode,
     case V2DImode:
     case V4SFmode:
     case V4SImode:
-      {
-	rtx insn, dup;
-
-	/* First attempt to recognize VAL as-is.  */
-	dup = gen_rtx_VEC_DUPLICATE (mode, val);
-	insn = emit_insn (gen_rtx_SET (VOIDmode, target, dup));
-	if (recog_memoized (insn) < 0)
-	  {
-	    rtx seq;
-	    /* If that fails, force VAL into a register.  */
-
-	    start_sequence ();
-	    XEXP (dup, 0) = force_reg (GET_MODE_INNER (mode), val);
-	    seq = get_insns ();
-	    end_sequence ();
-	    if (seq)
-	      emit_insn_before (seq, insn);
-
-	    ok = recog_memoized (insn) >= 0;
-	    gcc_assert (ok);
-	  }
-      }
-      return true;
+    case V16SImode:
+    case V8DImode:
+    case V16SFmode:
+    case V8DFmode:
+      return ix86_vector_duplicate_value (mode, target, val);
 
     case V4HImode:
       if (!mmx_ok)
@@ -37555,8 +37888,8 @@ static void
 ix86_expand_vector_init_concat (enum machine_mode mode,
 				rtx target, rtx *ops, int n)
 {
-  enum machine_mode cmode, hmode = VOIDmode;
-  rtx first[8], second[4];
+  enum machine_mode cmode, hmode = VOIDmode, gmode = VOIDmode;
+  rtx first[16], second[8], third[4];
   rtvec v;
   int i, j;
 
@@ -37565,6 +37898,18 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 2:
       switch (mode)
 	{
+	case V16SImode:
+	  cmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V8SFmode;
+	  break;
+	case V8DImode:
+	  cmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V4SImode;
 	  break;
@@ -37631,6 +37976,14 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
     case 8:
       switch (mode)
 	{
+	case V8DImode:
+	  cmode = V2DImode;
+	  hmode = V4DImode;
+	  break;
+	case V8DFmode:
+	  cmode = V2DFmode;
+	  hmode = V4DFmode;
+	  break;
 	case V8SImode:
 	  cmode = V2SImode;
 	  hmode = V4SImode;
@@ -37644,6 +37997,24 @@ ix86_expand_vector_init_concat (enum machine_mode mode,
 	}
       goto half;
 
+    case 16:
+      switch (mode)
+	{
+	case V16SImode:
+	  cmode = V2SImode;
+	  hmode = V4SImode;
+	  gmode = V8SImode;
+	  break;
+	case V16SFmode:
+	  cmode = V2SFmode;
+	  hmode = V4SFmode;
+	  gmode = V8SFmode;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      goto half;
+
 half:
       /* FIXME: We process inputs backward to help RA.  PR 36222.  */
       i = n - 1;
@@ -37657,7 +38028,27 @@ half:
 	}
 
       n >>= 1;
-      if (n > 2)
+      if (n > 4)
+	{
+	  gcc_assert (hmode != VOIDmode);
+	  gcc_assert (gmode != VOIDmode);
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      second[j] = gen_reg_rtx (hmode);
+	      ix86_expand_vector_init_concat (hmode, second [j],
+					      &first [i], 2);
+	    }
+	  n >>= 1;
+	  for (i = j = 0; i < n; i += 2, j++)
+	    {
+	      third[j] = gen_reg_rtx (gmode);
+	      ix86_expand_vector_init_concat (gmode, third[j],
+					      &second[i], 2);
+	    }
+	  n >>= 1;
+	  ix86_expand_vector_init_concat (mode, target, third, n);
+	}
+      else if (n > 2)
 	{
 	  gcc_assert (hmode != VOIDmode);
 	  for (i = j = 0; i < n; i += 2, j++)
@@ -37800,7 +38191,7 @@ static void
 ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 				 rtx target, rtx vals)
 {
-  rtx ops[32], op0, op1;
+  rtx ops[64], op0, op1;
   enum machine_mode half_mode = VOIDmode;
   int n, i;
 
@@ -37812,6 +38203,10 @@ ix86_expand_vector_init_general (bool mmx_ok, enum machine_mode mode,
 	break;
       /* FALLTHRU */
 
+    case V16SImode:
+    case V16SFmode:
+    case V8DFmode:
+    case V8DImode:
     case V8SFmode:
     case V8SImode:
     case V4DFmode:
@@ -38437,6 +38832,42 @@ ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
 	}
       break;
 
+    case V16SFmode:
+      tmp = gen_reg_rtx (V8SFmode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16sf (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16sf (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DFmode:
+      tmp = gen_reg_rtx (V4DFmode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8df (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8df (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
+    case V16SImode:
+      tmp = gen_reg_rtx (V8SImode);
+      if (elt < 8)
+	emit_insn (gen_vec_extract_lo_v16si (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v16si (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 7);
+      return;
+
+    case V8DImode:
+      tmp = gen_reg_rtx (V4DImode);
+      if (elt < 4)
+	emit_insn (gen_vec_extract_lo_v8di (tmp, vec));
+      else
+	emit_insn (gen_vec_extract_hi_v8di (tmp, vec));
+      ix86_expand_vector_extract (false, target, tmp, elt & 3);
+      return;
+
     case V8QImode:
       /* ??? Could extract the appropriate HImode element and shift.  */
     default:
@@ -38529,6 +38960,44 @@ emit_reduc_half (rtx dest, rtx src, int i)
 				    GEN_INT (i / 2));
 	}
       break;
+    case V16SImode:
+    case V16SFmode:
+    case V8DImode:
+    case V8DFmode:
+      if (i > 128)
+	tem = gen_avx512f_shuf_i32x4_1 (gen_lowpart (V16SImode, dest),
+				      gen_lowpart (V16SImode, src),
+				      gen_lowpart (V16SImode, src),
+				      GEN_INT (0x4 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x5 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x6 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0x7 + (i == 512 ? 4 : 0)),
+				      GEN_INT (0xC), GEN_INT (0xD),
+				      GEN_INT (0xE), GEN_INT (0xF),
+				      GEN_INT (0x10), GEN_INT (0x11),
+				      GEN_INT (0x12), GEN_INT (0x13),
+				      GEN_INT (0x14), GEN_INT (0x15),
+				      GEN_INT (0x16), GEN_INT (0x17));
+      else
+	tem = gen_avx512f_pshufd_1 (gen_lowpart (V16SImode, dest),
+				   gen_lowpart (V16SImode, src),
+				   GEN_INT (i == 128 ? 0x2 : 0x1),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (0x3),
+				   GEN_INT (i == 128 ? 0x6 : 0x5),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (0x7),
+				   GEN_INT (i == 128 ? 0xA : 0x9),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (0xB),
+				   GEN_INT (i == 128 ? 0xE : 0xD),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF),
+				   GEN_INT (0xF));
+      break;
     default:
       gcc_unreachable ();
     }
@@ -38593,6 +39062,8 @@ ix86_vector_mode_supported_p (enum machine_mode mode)
     return true;
   if (TARGET_AVX && VALID_AVX256_REG_MODE (mode))
     return true;
+  if (TARGET_AVX512F && VALID_AVX512F_REG_MODE (mode))
+    return true;
   if (TARGET_MMX && VALID_MMX_REG_MODE (mode))
     return true;
   if (TARGET_3DNOW && VALID_MMX_REG_MODE_3DNOW (mode))
@@ -38906,9 +39377,15 @@ void ix86_emit_swdivsf (rtx res, rtx a, rtx b, enum machine_mode mode)
   b = force_reg (mode, b);
 
   /* x0 = rcp(b) estimate */
-  emit_insn (gen_rtx_SET (VOIDmode, x0,
-			  gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
-					  UNSPEC_RCP)));
+  if (mode == V16SFmode || mode == V8DFmode)
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP14)));
+  else
+    emit_insn (gen_rtx_SET (VOIDmode, x0,
+			    gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
+					    UNSPEC_RCP)));
+
   /* e0 = x0 * b */
   emit_insn (gen_rtx_SET (VOIDmode, e0,
 			  gen_rtx_MULT (mode, x0, b)));
@@ -38938,6 +39415,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 {
   rtx x0, e0, e1, e2, e3, mthree, mhalf;
   REAL_VALUE_TYPE r;
+  int unspec;
 
   x0 = gen_reg_rtx (mode);
   e0 = gen_reg_rtx (mode);
@@ -38950,11 +39428,15 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
 
   real_arithmetic (&r, NEGATE_EXPR, &dconsthalf, NULL);
   mhalf = CONST_DOUBLE_FROM_REAL_VALUE (r, SFmode);
+  unspec = UNSPEC_RSQRT;
 
   if (VECTOR_MODE_P (mode))
     {
       mthree = ix86_build_const_vector (mode, true, mthree);
       mhalf = ix86_build_const_vector (mode, true, mhalf);
+      /* There is no 512-bit rsqrt.  There is however rsqrt14.  */
+      if (GET_MODE_SIZE (mode) == 64)
+	unspec = UNSPEC_RSQRT14;
     }
 
   /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
@@ -38965,7 +39447,7 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
   /* x0 = rsqrt(a) estimate */
   emit_insn (gen_rtx_SET (VOIDmode, x0,
 			  gen_rtx_UNSPEC (mode, gen_rtvec (1, a),
-					  UNSPEC_RSQRT)));
+					  unspec)));
 
   /* If (a == 0.0) Filter out infinity to prevent NaN for sqrt(0.0).  */
   if (!recip)
@@ -38976,11 +39458,23 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, enum machine_mode mode,
       mask = gen_reg_rtx (mode);
 
       zero = force_reg (mode, CONST0_RTX(mode));
-      emit_insn (gen_rtx_SET (VOIDmode, mask,
-			      gen_rtx_NE (mode, zero, a)));
 
-      emit_insn (gen_rtx_SET (VOIDmode, x0,
-			      gen_rtx_AND (mode, x0, mask)));
+      /* Handle masked compare.  */
+      if (VECTOR_MODE_P (mode) && GET_MODE_SIZE (mode) == 64)
+	{
+	  mask = gen_reg_rtx (HImode);
+	  /* Imm value 0x4 corresponds to not-equal comparison.  */
+	  emit_insn (gen_avx512f_cmpv16sf3 (mask, zero, a, GEN_INT (0x4)));
+	  emit_insn (gen_avx512f_blendmv16sf (x0, zero, x0, mask));
+	}
+      else
+	{
+	  emit_insn (gen_rtx_SET (VOIDmode, mask,
+				  gen_rtx_NE (mode, zero, a)));
+
+	  emit_insn (gen_rtx_SET (VOIDmode, x0,
+				  gen_rtx_AND (mode, x0, mask)));
+	}
     }
 
   /* e0 = x0 * a */
@@ -40502,6 +40996,19 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_pshufb (d))
     return true;
 
+  /* Try the AVX512F vpermi2 instructions.  */
+  rtx vec[64];
+  enum machine_mode mode = d->vmode;
+  if (mode == V8DFmode)
+    mode = V8DImode;
+  else if (mode == V16SFmode)
+    mode = V16SImode;
+  for (i = 0; i < nelt; ++i)
+    vec[i] = GEN_INT (d->perm[i]);
+  rtx mask = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt, vec));
+  if (ix86_expand_vec_perm_vpermi2 (d->target, d->op0, mask, d->op1))
+    return true;
+
   return false;
 }
 
@@ -42109,6 +42616,10 @@ ix86_vectorize_vec_perm_const_ok (enum machine_mode vmode,
 
   /* Given sufficient ISA support we can just return true here
      for selected vector modes.  */
+  if (d.vmode == V16SImode || d.vmode == V16SFmode
+      || d.vmode == V8DFmode || d.vmode == V8DImode)
+    /* All implementable with a single vpermi2 insn.  */
+    return true;
   if (GET_MODE_SIZE (d.vmode) == 16)
     {
       /* All implementable with a single vpperm insn.  */
@@ -42351,7 +42862,7 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
     op2 = force_reg (mode, op2);
 
   /* We only play even/odd games with vectors of SImode.  */
-  gcc_assert (mode == V4SImode || mode == V8SImode);
+  gcc_assert (mode == V4SImode || mode == V8SImode || mode == V16SImode);
 
   /* If we're looking for the odd results, shift those members down to
      the even slots.  For some cpus this is faster than a PSHUFD.  */
@@ -42377,7 +42888,14 @@ ix86_expand_mul_widen_evenodd (rtx dest, rtx op1, rtx op2,
       op2 = gen_lowpart (mode, op2);
     }
 
-  if (mode == V8SImode)
+  if (mode == V16SImode)
+    {
+      if (uns_p)
+	x = gen_vec_widen_umult_even_v16si (dest, op1, op2);
+      else
+	x = gen_vec_widen_smult_even_v16si (dest, op1, op2);
+    }
+  else if (mode == V8SImode)
     {
       if (uns_p)
 	x = gen_vec_widen_umult_even_v8si (dest, op1, op2);
@@ -42597,6 +43115,11 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
 	  umul = gen_vec_widen_umult_even_v8si;
 	  nmode = V8SImode;
 	}
+      else if (mode == V8DImode)
+	{
+	  umul = gen_vec_widen_umult_even_v16si;
+	  nmode = V16SImode;
+	}
       else
 	gcc_unreachable ();
 
@@ -43743,12 +44266,16 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case HImode:
       return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V16HImode : V8HImode;
     case SImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
+      return TARGET_AVX512F ? V16SImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V8SImode : V4SImode;
     case DImode:
-      return (TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
+      return TARGET_AVX512F ? V8DImode :
+	(TARGET_AVX && !TARGET_PREFER_AVX128) ? V4DImode : V2DImode;
 
     case SFmode:
-      if (TARGET_AVX && !TARGET_PREFER_AVX128)
+      if (TARGET_AVX512F)
+	return V16SFmode;
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
       else
 	return V4SFmode;
@@ -43756,6 +44283,8 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
+      else if (TARGET_AVX512F)
+	return V8DFmode;
       else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
       else if (TARGET_SSE2)
@@ -43768,12 +44297,14 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 }
 
 /* If AVX is enabled then try vectorizing with both 256bit and 128bit
-   vectors.  */
+   vectors.  If AVX512F is enabled then try vectorizing with 512bit,
+   256bit and 128bit vectors.  */
 
 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
+  return TARGET_AVX512F ? 64 | 32 | 16 :
+    (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
 \f
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7beb245..a3c0e0c 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -748,8 +748,9 @@
    (set (attr "mode")
 	(cond [(match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
 		 (const_string "<ssePSmode>")
-	       (and (eq_attr "alternative" "2")
-		    (match_test "TARGET_SSE_TYPELESS_STORES"))
+	       (and (match_test "GET_MODE_SIZE (<MODE>mode) == 16")
+		    (and (eq_attr "alternative" "2")
+			 (match_test "TARGET_SSE_TYPELESS_STORES")))
 		 (const_string "<ssePSmode>")
 	       (match_test "TARGET_AVX")
 		 (const_string "<sseinsnmode>")
@@ -986,8 +987,9 @@
    (set_attr "ssememalign" "8")
    (set_attr "prefix" "maybe_vex")
    (set (attr "mode")
-	(cond [(ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
-		    (match_test "TARGET_SSE_TYPELESS_STORES"))
+        (cond [(and (match_test "GET_MODE_SIZE (<MODE>mode) == 16")
+                    (ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
+                         (match_test "TARGET_SSE_TYPELESS_STORES")))
 		 (const_string "<ssePSmode>")
 	       (match_test "TARGET_AVX")
 		 (const_string "<MODE>")
@@ -1091,6 +1093,7 @@
 {
   switch (get_attr_mode (insn))
     {
+    case MODE_V16SF:
     case MODE_V8SF:
     case MODE_V4SF:
       return "%vmovups\t{%1, %0|%0, %1}";
@@ -1113,8 +1116,9 @@
      (const_string "1")))
    (set_attr "prefix" "maybe_vex")
    (set (attr "mode")
-	(cond [(ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
-		    (match_test "TARGET_SSE_TYPELESS_STORES"))
+	(cond [(and (match_test "GET_MODE_SIZE (<MODE>mode) == 16")
+		    (ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
+			 (match_test "TARGET_SSE_TYPELESS_STORES")))
 		 (const_string "<ssePSmode>")
 	       (match_test "TARGET_AVX")
 		 (const_string "<sseinsnmode>")
@@ -3492,7 +3496,11 @@
    (match_operand:<sseintvecmode> 1 "register_operand")]
   "TARGET_SSE2 && (<MODE>mode == V4SFmode || TARGET_AVX2)"
 {
-  ix86_expand_vector_convert_uns_vsivsf (operands[0], operands[1]);
+  if (<MODE>mode == V16SFmode)
+    emit_insn (gen_ufloatv16siv16sf2 (operands[0], operands[1]));
+  else
+    ix86_expand_vector_convert_uns_vsivsf (operands[0], operands[1]);
+
   DONE;
 })
 
@@ -3583,11 +3591,17 @@
    (match_operand:VF1 1 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[3];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
-  emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
-  emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+  if (<MODE>mode == V16SFmode)
+    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
+					  operands[1]));
+  else
+    {
+      rtx tmp[3];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
+      emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
+      emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
+    }
   DONE;
 })
 
@@ -4514,6 +4528,32 @@
   DONE;
 })
 
+(define_expand "vec_unpacku_float_hi_v16si"
+  [(match_operand:V8DF 0 "register_operand")
+   (match_operand:V16SI 1 "register_operand")]
+  "TARGET_AVX512F"
+{
+  REAL_VALUE_TYPE TWO32r;
+  rtx k, x, tmp[4];
+
+  real_ldexp (&TWO32r, &dconst1, 32);
+  x = const_double_from_real_value (TWO32r, DFmode);
+
+  tmp[0] = force_reg (V8DFmode, CONST0_RTX (V8DFmode));
+  tmp[1] = force_reg (V8DFmode, ix86_build_const_vector (V8DFmode, 1, x));
+  tmp[2] = gen_reg_rtx (V8DFmode);
+  tmp[3] = gen_reg_rtx (V8SImode);
+  k = gen_reg_rtx (QImode);
+
+  emit_insn (gen_vec_extract_hi_v16si (tmp[3], operands[1]));
+  emit_insn (gen_floatv8siv8df2 (tmp[2], tmp[3]));
+  emit_insn (gen_rtx_SET (VOIDmode, k,
+			  gen_rtx_LT (QImode, tmp[2], tmp[0])));
+  emit_insn (gen_addv8df3_mask (tmp[2], tmp[2], tmp[1], tmp[2], k));
+  emit_move_insn (operands[0], tmp[2]);
+  DONE;
+})
+
 (define_expand "vec_unpacku_float_lo_v8si"
   [(match_operand:V4DF 0 "register_operand")
    (match_operand:V8SI 1 "nonimmediate_operand")]
@@ -4679,31 +4719,46 @@
 
 (define_expand "vec_pack_ufix_trunc_<mode>"
   [(match_operand:<ssepackfltmode> 0 "register_operand")
-   (match_operand:VF2_128_256 1 "register_operand")
-   (match_operand:VF2_128_256 2 "register_operand")]
+   (match_operand:VF2 1 "register_operand")
+   (match_operand:VF2 2 "register_operand")]
   "TARGET_SSE2"
 {
-  rtx tmp[7];
-  tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
-  tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
-  tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
-  emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
-  if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+  if (<MODE>mode == V8DFmode)
     {
-      tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
-      ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+      rtx r1, r2;
+
+      r1 = gen_reg_rtx (V8SImode);
+      r2 = gen_reg_rtx (V8SImode);
+
+      emit_insn (gen_ufix_truncv8dfv8si2 (r1, operands[1]));
+      emit_insn (gen_ufix_truncv8dfv8si2 (r2, operands[2]));
+      emit_insn (gen_avx_vec_concatv16si (operands[0], r1, r2));
     }
   else
     {
-      tmp[5] = gen_reg_rtx (V8SFmode);
-      ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
-					gen_lowpart (V8SFmode, tmp[3]), 0);
-      tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+      rtx tmp[7];
+      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
+      tmp[1] = ix86_expand_adjust_ufix_to_sfix_si (operands[2], &tmp[3]);
+      tmp[4] = gen_reg_rtx (<ssepackfltmode>mode);
+      emit_insn (gen_vec_pack_sfix_trunc_<mode> (tmp[4], tmp[0], tmp[1]));
+      if (<ssepackfltmode>mode == V4SImode || TARGET_AVX2)
+	{
+	  tmp[5] = gen_reg_rtx (<ssepackfltmode>mode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], tmp[2], tmp[3], 0);
+	}
+      else
+	{
+	  tmp[5] = gen_reg_rtx (V8SFmode);
+	  ix86_expand_vec_extract_even_odd (tmp[5], gen_lowpart (V8SFmode, tmp[2]),
+					    gen_lowpart (V8SFmode, tmp[3]), 0);
+	  tmp[5] = gen_lowpart (V8SImode, tmp[5]);
+	}
+      tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
+				    operands[0], 0, OPTAB_DIRECT);
+      if (tmp[6] != operands[0])
+	emit_move_insn (operands[0], tmp[6]);
     }
-  tmp[6] = expand_simple_binop (<ssepackfltmode>mode, XOR, tmp[4], tmp[5],
-				operands[0], 0, OPTAB_DIRECT);
-  if (tmp[6] != operands[0])
-    emit_move_insn (operands[0], tmp[6]);
+
   DONE;
 })
 
diff --git a/gcc/testsuite/gcc.target/i386/pr49002-2.c b/gcc/testsuite/gcc.target/i386/pr49002-2.c
index 9f21a2d..dfb83b4 100644
--- a/gcc/testsuite/gcc.target/i386/pr49002-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr49002-2.c
@@ -12,4 +12,4 @@ void foo(const __m128d from, __m256d *to)
 /* Ensure we store ymm, not xmm.  */
 /* { dg-final { scan-assembler-not "vmovapd\[\t \]*%xmm\[0-9\]\+,\[^,\]*" } } */
 /* { dg-final { scan-assembler-not "vmovaps\[\t \]*%xmm\[0-9\]\+,\[^,\]*" } } */
-/* { dg-final { scan-assembler "vmovaps\[\t \]*%ymm\[0-9\]\+,\[^,\]*" } } */
+/* { dg-final { scan-assembler "vmovap\[sd\]\[\t \]*%ymm\[0-9\]\+,\[^,\]*" } } */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e3009d9..a3aaa6e 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -5687,7 +5687,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       tree vec_oprnd0 = NULL_TREE, op;
       tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
       tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
-      tree ptr, mask, var, scale, perm_mask = NULL_TREE, prev_res = NULL_TREE;
+      tree ptr, mask, var, scale, merge, perm_mask = NULL_TREE, prev_res = NULL_TREE;
       edge pe = loop_preheader_edge (loop);
       gimple_seq seq;
       basic_block new_bb;
@@ -5729,8 +5729,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
       scaletype = TREE_VALUE (arglist);
-      gcc_checking_assert (types_compatible_p (srctype, rettype)
-			   && types_compatible_p (srctype, masktype));
+      gcc_checking_assert (types_compatible_p (srctype, rettype));
 
       vec_dest = vect_create_destination_var (scalar_dest, vectype);
 
@@ -5744,8 +5743,13 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 
       /* Currently we support only unconditional gather loads,
 	 so mask should be all ones.  */
-      if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
-	mask = build_int_cst (TREE_TYPE (masktype), -1);
+      if (TREE_CODE (masktype) == INTEGER_TYPE)
+	mask = build_int_cst (masktype, -1);
+      else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+	{
+	  mask = build_int_cst (TREE_TYPE (masktype), -1);
+	  mask = build_vector_from_val (masktype, mask);
+	}
       else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
 	{
 	  REAL_VALUE_TYPE r;
@@ -5754,14 +5758,30 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    tmp[j] = -1;
 	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
 	  mask = build_real (TREE_TYPE (masktype), r);
+	  mask = build_vector_from_val (masktype, mask);
 	}
       else
 	gcc_unreachable ();
-      mask = build_vector_from_val (masktype, mask);
       mask = vect_init_vector (stmt, mask, masktype, NULL);
 
       scale = build_int_cst (scaletype, gather_scale);
 
+      if (TREE_CODE (TREE_TYPE (rettype)) == INTEGER_TYPE)
+	merge = build_int_cst (TREE_TYPE (rettype), 0);
+      else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (rettype)))
+	{
+	  REAL_VALUE_TYPE r;
+	  long tmp[6];
+	  for (j = 0; j < 6; ++j)
+	    tmp[j] = 0;
+	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (rettype)));
+	  merge = build_real (TREE_TYPE (rettype), r);
+	}
+      else
+	gcc_unreachable ();
+      merge = build_vector_from_val (rettype, merge);
+      merge = vect_init_vector (stmt, merge, rettype, NULL);
+
       prev_stmt_info = NULL;
       for (j = 0; j < ncopies; ++j)
 	{
@@ -5790,7 +5810,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    }
 
 	  new_stmt
-	    = gimple_build_call (gather_decl, 5, mask, ptr, op, mask, scale);
+	    = gimple_build_call (gather_decl, 5, merge, ptr, op, mask, scale);
 
 	  if (!useless_type_conversion_p (vectype, rettype))
 	    {
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 54e73c8..00e56dc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -683,8 +683,8 @@ struct dataref_aux {
    conversion.  */
 #define MAX_INTERM_CVT_STEPS         3
 
-/* The maximum vectorization factor supported by any target (V32QI).  */
-#define MAX_VECTORIZATION_FACTOR 32
+/* The maximum vectorization factor supported by any target (V64QI).  */
+#define MAX_VECTORIZATION_FACTOR 64
 
 /* Avoid GTY(()) on stmt_vec_info.  */
 typedef void *vec_void_p;


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2013-12-30 11:00           ` Kirill Yukhin
@ 2014-01-01 23:08             ` Eric Botcazou
  2014-01-02 10:53               ` Kirill Yukhin
  2014-01-02 22:18               ` Eric Botcazou
  0 siblings, 2 replies; 43+ messages in thread
From: Eric Botcazou @ 2014-01-01 23:08 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: gcc-patches, Uros Bizjak, Richard Henderson, Jakub Jelinek

> gcc/
> 2013-12-30  Alexander Ivchenko  <alexander.ivchenko@intel.com>
> 	    Maxim Kuznetsov  <maxim.kuznetsov@intel.com>
> 	    Sergey Lega  <sergey.s.lega@intel.com>
> 	    Anna Tikhonova  <anna.tikhonova@intel.com>
> 	    Ilya Tocar  <ilya.tocar@intel.com>
> 	    Andrey Turetskiy  <andrey.turetskiy@intel.com>
> 	    Ilya Verbin  <ilya.verbin@intel.com>
> 	    Kirill Yukhin  <kirill.yukhin@intel.com>
> 	    Michael Zolotukhin  <michael.v.zolotukhin@intel.com>
> 
> 	* config/i386/i386.c (MAX_CLASSES): Increase number of classes.
> 	(classify_argument): Extend for 512 bit vectors.
> 	(construct_container): Ditto.
> 	(function_arg_advance_32): Ditto.
> 	(function_arg_advance_64): Ditto.
> 	(function_arg_32): Ditto.
> 	(function_arg_64): Ditto.
> 	(function_value_32): Ditto.
> 	(return_in_memory_32): Ditto.
> 	(ix86_gimplify_va_arg): Ditto.
> 	(standard_sse_constant_p): Ditto.
> 	(standard_sse_constant_opcode): Ditto.
> 	(ix86_expand_vector_convert_uns_vsivsf): Ditto.
> 	(ix86_build_const_vector): Ditto.
> 	(ix86_build_signbit_mask): Ditto.
> 	(ix86_expand_sse_cmp): Extend for AVX512.
> 	(ix86_expand_sse_movcc): Ditto.
> 	(ix86_expand_int_vcond): Ditto.
> 	(ix86_expand_vec_perm): Ditto.
> 	(ix86_expand_sse_unpack): Ditto.
> 	(ix86_constant_alignment): Ditto.

The change is actually to ix86_data_alignment, not to ix86_constant_alignment:

@@ -26219,7 +26433,8 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+  int max_align = optimize_size ? BITS_PER_WORD
+                               : MIN (512, MAX_OFILE_ALIGNMENT);
 
   if (opt
       && AGGREGATE_TYPE_P (type)


Note that it has unexpected side-effects: previously, in 32-bit mode, 256-bit 
aggregate objects would have been given 256-bit alignment; now, they will fall
back to default alignment, for example 32-bit only.

-- 
Eric Botcazou


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-01 23:08             ` Eric Botcazou
@ 2014-01-02 10:53               ` Kirill Yukhin
  2014-01-02 15:12                 ` Eric Botcazou
  2014-01-02 22:18               ` Eric Botcazou
  1 sibling, 1 reply; 43+ messages in thread
From: Kirill Yukhin @ 2014-01-02 10:53 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc-patches, Uros Bizjak, Richard Henderson, Jakub Jelinek

Hello Eric,
On 02 Jan 00:07, Eric Botcazou wrote:
> The change is actually to ix86_data_alignment, not to ix86_constant_alignment:
> 
> @@ -26219,7 +26433,8 @@ ix86_constant_alignment (tree exp, int align)
>  int
>  ix86_data_alignment (tree type, int align, bool opt)
>  {
> -  int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
> +  int max_align = optimize_size ? BITS_PER_WORD
> +                               : MIN (512, MAX_OFILE_ALIGNMENT);
>  
>    if (opt
>        && AGGREGATE_TYPE_P (type)
> Note that it has unexpected side-effects: previously, in 32-bit mode, 256-bit 
> aggregate objects would have been given 256-bit alignment; now, they will fall
> back to default alignment, for example 32-bit only.
Frankly speaking, I do not understand what's wrong here.
IMHO, this change is pretty mechanical: we just extend the maximal alignment
available.  Because of 512-bit data types, we now extend the maximal
alignment to 512 bits.

I suspect that an issue is here:
  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
      && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
          || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
      && align < max_align)
    align = max_align;

Maybe we can split it and handle 256-bit aggregates separately?

--
Thanks, K


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-02 10:53               ` Kirill Yukhin
@ 2014-01-02 15:12                 ` Eric Botcazou
  2014-01-02 19:50                   ` Jan Hubicka
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Botcazou @ 2014-01-02 15:12 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: gcc-patches, Uros Bizjak, Richard Henderson, Jakub Jelinek

> Frankly speaking, I do not understand what's wrong here.
> IMHO, this change is pretty mechanical: we just extend the maximal alignment
> available.  Because of 512-bit data types, we now extend the maximal
> alignment to 512 bits.

Nothing wrong per se, but...

> I suspect that an issue is here:
>   if (opt
>       && AGGREGATE_TYPE_P (type)
>       && TYPE_SIZE (type)
>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
>           || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>       && align < max_align)
>     align = max_align;

...yes, bumping max_align has the unexpected side effect of changing the 
behavior for sizes between the old value and the new value because of this 
code.  I'm no x86 specialist, but I think that this should be fixed.
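 
Concretely, take a 32-byte aggregate: TYPE_SIZE is 256 bits.  With the old
max_align of 256, the test 256 >= 256 held and align was bumped to 256;
with the new max_align of 512, 256 >= 512 is false, so the object keeps
whatever smaller default alignment it had.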

> Maybe we can split it and handle 256-bit aggregates separately?

Probably, and we should also add a warning just before the declaration of 
max_align, as well as investigate whether this didn't already happen when 
max_align was bumped from 128 to 256.

-- 
Eric Botcazou


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-02 15:12                 ` Eric Botcazou
@ 2014-01-02 19:50                   ` Jan Hubicka
  2014-01-02 21:56                     ` Eric Botcazou
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Hubicka @ 2014-01-02 19:50 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Kirill Yukhin, gcc-patches, Uros Bizjak, Richard Henderson,
	Jakub Jelinek

> > Frankly speaking, I do not understand what's wrong here.
> > IMHO, this change is pretty mechanical: we just extend the maximal alignment
> > available.  Because of 512-bit data types, we now extend the maximal
> > alignment to 512 bits.
> 
> Nothing wrong per se, but...
> 
> > I suspect that an issue is here:
> >   if (opt
> >       && AGGREGATE_TYPE_P (type)
> >       && TYPE_SIZE (type)
> >       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
> >       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
> >           || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
> >       && align < max_align)
> >     align = max_align;
> 
> ...yes, bumping max_align has the unexpected side effect of changing the 
> behavior for sizes between the old value and the new value because of this 
> code.  I'm no x86 specialist, but I think that this should be fixed.
> 
> > Maybe we can split it and handle 256-bit aggregates separately?
> 
> Probably, and we should also add a warning just before the declaration of 
> max_align, as well as investigate whether this didn't already happen when 
> max_align was bumped from 128 to 256.

The x86-64 ABI has a clause about aligning static vars of at least a given
size to a 128-bit boundary.  This was introduced to help the compiler
generate aligned vector stores/loads even if the object may bind to another
object file.  This is set in stone and cannot be changed for AVX/SSE.
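 
For example (a sketch; see the psABI for the precise wording of the clause):
 
  /* A global array of 16 bytes or more gets at least 16-byte alignment
     under the x86-64 ABI, so an aligned vector access to it is safe even
     if this definition binds to another object file.  */
  float z[8];  /* 32 bytes -> at least .align 16 in 64-bit mode.  */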

For other objects that are fully under local control we can bump the
alignment up further.  I remember this code was originally supposed to bump
up to 128 bits, since it was written long before AVX.  I suppose it would
make sense to do so when AVX is enabled and we anticipate using it.

I am not quite sure, however, how important this is, given that we have a
pass to increase alignment for vectorizable arrays.  The other case where
we can autogenerate SSE is memcpy/memset, but sadly only for the variably
sized case, and we don't do that by default yet (I hope to teach
move_by_pieces/store_by_pieces about SSE soonish, but not for 4.9).

This logic all comes from a time when vectorization was in its infancy.
Honza


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-02 19:50                   ` Jan Hubicka
@ 2014-01-02 21:56                     ` Eric Botcazou
  2014-01-02 22:16                       ` Jan Hubicka
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Botcazou @ 2014-01-02 21:56 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Kirill Yukhin, gcc-patches, Uros Bizjak, Richard Henderson,
	Jakub Jelinek

> The x86-64 ABI has a clause about aligning static vars of at least a given
> size to a 128-bit boundary.  This was introduced to help the compiler
> generate aligned vector stores/loads even if the object may bind to another
> object file.  This is set in stone and cannot be changed for AVX/SSE.

Yes, but that's irrelevant in 32-bit mode.

> For other objects that are fully under local control we can bump the
> alignment up further.  I remember this code was originally supposed to bump
> up to 128 bits, since it was written long before AVX.  I suppose it would
> make sense to do so when AVX is enabled and we anticipate using it.

So the same unexpected side-effect (decreasing the alignment) probably happened 
when the maximum alignment was bumped from 128 to 256.

-- 
Eric Botcazou


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-02 21:56                     ` Eric Botcazou
@ 2014-01-02 22:16                       ` Jan Hubicka
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Hubicka @ 2014-01-02 22:16 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Jan Hubicka, Kirill Yukhin, gcc-patches, Uros Bizjak,
	Richard Henderson, Jakub Jelinek

> > The x86-64 ABI has a clause about aligning static vars of at least a given
> > size to a 128-bit boundary.  This was introduced to help the compiler
> > generate aligned vector stores/loads even if the object may bind to another
> > object file.  This is set in stone and cannot be changed for AVX/SSE.
> 
> Yes, but that's irrelevant in 32-bit mode.
> 
> > For other objects that are fully under local control we can bump the
> > alignment up further.  I remember this code was originally supposed to bump
> > up to 128 bits, since it was written long before AVX.  I suppose it would
> > make sense to do so when AVX is enabled and we anticipate using it.
> 
> So the same unexpected side-effect (decreasing the alignment) probably happened 
> when the maximum alignment was bumped from 128 to 256.

Yes, that code was written with only one vector mode in mind.
It would be nice to have some data on whether the code helps at all, though;
I guess it would be saner to bump the alignment up to the largest enabled
vector mode that is smaller than the object size, instead of what we do now.
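 
Something along these lines, say (an untested sketch; size here stands for
TREE_INT_CST_LOW (TYPE_SIZE (type)), i.e. the object size in bits, and the
TARGET_* checks reuse the existing target macros):
 
  int vec_align = 0;
 
  /* Pick the largest enabled vector size that still fits in the object,
     instead of a single global max_align.  */
  if (TARGET_AVX512F && size >= 512)
    vec_align = 512;
  else if (TARGET_AVX && size >= 256)
    vec_align = 256;
  else if (TARGET_SSE && size >= 128)
    vec_align = 128;
 
  if (align < vec_align)
    align = vec_align;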

Honza


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-01 23:08             ` Eric Botcazou
  2014-01-02 10:53               ` Kirill Yukhin
@ 2014-01-02 22:18               ` Eric Botcazou
  2014-01-03 11:03                 ` Uros Bizjak
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Botcazou @ 2014-01-02 22:18 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: gcc-patches, Uros Bizjak, Richard Henderson, Jakub Jelinek

> Note that it has unexpected side-effects: previously, in 32-bit mode,
> 256-bit aggregate objects would have been given 256-bit alignment; now,
> they will fall back to default alignment, for example 32-bit only.

In case this wasn't clear enough, just compile in 32-bit mode:

int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};

-- 
Eric Botcazou


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-02 22:18               ` Eric Botcazou
@ 2014-01-03 11:03                 ` Uros Bizjak
  2014-01-03 11:20                   ` Eric Botcazou
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 11:03 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Kirill Yukhin, gcc-patches, Richard Henderson, Jakub Jelinek

On Thu, Jan 2, 2014 at 11:18 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Note that it has unexpected side-effects: previously, in 32-bit mode,
>> 256-bit aggregate objects would have been given 256-bit alignment; now,
>> they will fall back to default alignment, for example 32-bit only.
>
> In case this wasn't clear enough, just compile in 32-bit mode:
>
> int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};

It looks to me like we don't need to adjust anything with max_align.
Using the following patch:

--cut here--
Index: i386.c
===================================================================
--- i386.c      (revision 206311)
+++ i386.c      (working copy)
@@ -26465,6 +26465,7 @@
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
+#if 0
   int max_align = optimize_size ? BITS_PER_WORD
                                : MIN (512, MAX_OFILE_ALIGNMENT);

@@ -26476,6 +26477,7 @@
          || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
       && align < max_align)
     align = max_align;
+#endif

   /* x86-64 ABI requires arrays greater than 16 bytes to be aligned
      to 16byte boundary.  */
--cut here--

and the following testcase:

-- cut here--
float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};

extern float z[8];

void t (void)
{
  int i;

  for (i = 0; i < 8; i++)
    z[i] = z[i] + a[i];
}
--cut here--

When compiled with -m32 -mavx, we get:

        .align 32
        .type   a, @object
        .size   a, 32
a:

so, the alignment was already raised elsewhere. We get .align 16 for
-msse -m32 when vectorizing.

without -msse (and consequently without vectorizing), we get for -m32:

        .align 4
        .type   a, @object
        .size   a, 32
a:

which corresponds to the 32-bit ABI rules (we still get .align 16 for the 64-bit ABI).

What bothers me in this testcase is the (unrelated) alignment of the z[8]
array.  Even for 64-bit targets, we get:

#(insn:TI 6 5 8 2 (set (reg:V4SF 21 xmm0 [orig:90 vect__4.5 ] [90])
#        (unspec:V4SF [
#                (mem/c:V4SF (reg/f:DI 0 ax [89]) [2 MEM[(float
*)&z]+0 S16 A32])
#            ] UNSPEC_LOADU)) al.c:10 1195 {*sse_loadups}
#     (nil))
        movups  (%rax), %xmm0   # 6     *sse_loadups    [length = 3]

The ABI guarantees 16-byte alignment of z[8], but we fail to communicate
this to the compiler.

Uros.


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 11:03                 ` Uros Bizjak
@ 2014-01-03 11:20                   ` Eric Botcazou
  2014-01-03 11:25                     ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Botcazou @ 2014-01-03 11:20 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches, Richard Henderson, Jakub Jelinek

> When compiled with -m32 -mavx, we get:
> 
>         .align 32
>         .type   a, @object
>         .size   a, 32
> a:
> 
> so, the alignment was already raised elsewhere. We get .align 16 for
> -msse -m32 when vectorizing.
> 
> without -msse (and consequently without vectorizing), we get for -m32:
> 
>         .align 4
>         .type   a, @object
>         .size   a, 32
> a:
> 
> which corresponds to the 32-bit ABI rules (we still get .align 16 for the 64-bit ABI).

Yes, but the issue is that it was 32 before so the alignment decrease is weird.

-- 
Eric Botcazou


* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 11:20                   ` Eric Botcazou
@ 2014-01-03 11:25                     ` Uros Bizjak
  2014-01-03 12:00                       ` Jakub Jelinek
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 11:25 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Kirill Yukhin, gcc-patches, Richard Henderson, Jakub Jelinek

On Fri, Jan 3, 2014 at 12:20 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> When compiled with -m32 -mavx, we get:
>>
>>         .align 32
>>         .type   a, @object
>>         .size   a, 32
>> a:
>>
>> so, the alignment was already raised elsewhere. We get .align 16 for
>> -msse -m32 when vectorizing.
>>
>> without -msse (and consequently without vectorizing), we get for -m32:
>>
>>         .align 4
>>         .type   a, @object
>>         .size   a, 32
>> a:
>>
>> which corresponds to 32bit ABI rules (we still get .align16 for 64bit ABI).
>
> Yes, but the issue is that it was 32 before so the alignment decrease is weird.

Yes, but this change is benign, and as shown, I don't think we need
this functionality at all. The data from other TUs is accessed with a
4 byte alignment (on 32bit targets, and unfortunately also on 64bit
targets), and the alignment of local data is increased elsewhere.

I am testing a patch that removes "max_align" part from ix86_data_alignment.

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 11:25                     ` Uros Bizjak
@ 2014-01-03 12:00                       ` Jakub Jelinek
  2014-01-03 12:27                         ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-03 12:00 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
> I am testing a patch that removes "max_align" part from ix86_data_alignment.

That looks like unnecessary pessimization.  Note the hunk in question is
guarded with opt, which means it is an optimization rather than an ABI
issue: it can increase alignment, but the compiler can only assume the
increased alignment if the symbol is not public, or if it is public but
can't be preempted by another TU's definition.  Even in that case it can
be worthwhile to increase the alignment, say when doing versioning for
alignment, or when some AVX/AVX2/AVX512F version of memcpy/memset can
handle the data faster if it is sufficiently aligned, testing that at
runtime.

So, IMHO we shouldn't drop it, just improve it.
Perhaps:

  int max_align = optimize_size ? BITS_PER_WORD
                                : MIN (512, MAX_OFILE_ALIGNMENT);

  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
      && align < max_align)
    {
      int this_align = 256;
      for (this_align = 256; this_align <= max_align; this_align *= 2)
	if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
	    && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  break;
	else if (align < this_align)
	  align = this_align;
    }

which will handle both the 256 bit alignment for >= 256 bit objects and
the 512 bit alignment for >= 512 bit objects, and will be prepared for
the future.
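
As a sanity check, here is a standalone sketch of that loop with plain
integers instead of TYPE_SIZE (the function name and the bit-counted
size parameter are invented for illustration, not GCC internals):

static int
proposed_align (unsigned size, int align, int max_align)
{
  int this_align;
  /* SIZE, ALIGN and MAX_ALIGN are all in bits, as in the hunk above.  */
  for (this_align = 256; this_align <= max_align; this_align *= 2)
    if (size < (unsigned) this_align)
      break;
    else if (align < this_align)
      align = this_align;
  return align;
}

For example, proposed_align (320, 32, 512) == 256, i.e. a 40-byte
aggregate gets 256-bit alignment, and proposed_align (576, 32, 512)
== 512 for a 72-byte one.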

The 128-bit case I think doesn't need to be handled; DATA_ALIGNMENT has
been using the 256-bit test since its introduction in 1998:
http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 12:00                       ` Jakub Jelinek
@ 2014-01-03 12:27                         ` Uros Bizjak
  2014-01-03 13:35                           ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 12:27 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 3, 2014 at 12:59 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
>> I am testing a patch that removes "max_align" part from ix86_data_alignment.
>
> That looks like unnecessary pessimization.  Note the hunk in question is
> guarded with opt, which means it is an optimization rather than ABI issue,
> it can increase alignment, but the compiler can only assume the increased
> alignment if the symbol is not public or if it is public, but can't be
> preempted by another TU's definition.  Even in that case it can be
> worthwhile to increase the alignment, say if doing versioning for alignment,
> or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
> handle it faster if it is sufficiently aligned by testing it at runtime.
>
> So, IMHO we shouldn't drop it, just improve it.
> Perhaps:
>
>   int max_align = optimize_size ? BITS_PER_WORD
>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>
>   if (opt
>       && AGGREGATE_TYPE_P (type)
>       && TYPE_SIZE (type)
>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>       && align < max_align)
>     {
>       int this_align = 256;
>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>           break;
>         else if (align < this_align)
>           align = this_align;
>     }
>
> which will handle both the 256 bit alignment for >= 256 bit objects,
> 512 bit alignment for >= 512 bit objects and will be prepared for the
> future.
>
> 128 bit I think doesn't need to be handled, DATA_ALIGNMENT has been
> using 256-bit test already since it's introduction in 1998:
> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html

Thanks for the pointer; there is indeed a recommendation in the
optimization manual [1], section 3.6.4, where it is said:

--quote--
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.

...

A 64-byte or greater data structure or array should be aligned so that
its base address is a multiple of 64. Sorting data in decreasing size
order is one heuristic for assisting with natural alignment. As long
as 16-byte boundaries (and cache lines) are never crossed, natural
alignment is not strictly necessary (though it is an easy way to
enforce this).
--/quote--

So, this part has nothing to do with AVX512, but with the cache line
width. And we do have a --param "l1-cache-line-size=64", detected with
-march=native, that could come in handy here.
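
For reference, the driver-expanded value can be inspected from the
command line; a possible invocation (output and exact option spelling
may vary by GCC version):

$ gcc -march=native -### -E -x c /dev/null 2>&1 | grep l1-cache-line-size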

This part should be rewritten (and commented) with the information
above in mind.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 12:27                         ` Uros Bizjak
@ 2014-01-03 13:35                           ` Uros Bizjak
  2014-01-03 13:43                             ` Jakub Jelinek
  2014-05-19  4:48                             ` Jan Hubicka
  0 siblings, 2 replies; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 13:35 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 4021 bytes --]

On Fri, Jan 3, 2014 at 1:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

>>> I am testing a patch that removes "max_align" part from ix86_data_alignment.
>>
>> That looks like unnecessary pessimization.  Note the hunk in question is
>> guarded with opt, which means it is an optimization rather than ABI issue,
>> it can increase alignment, but the compiler can only assume the increased
>> alignment if the symbol is not public or if it is public, but can't be
>> preempted by another TU's definition.  Even in that case it can be
>> worthwhile to increase the alignment, say if doing versioning for alignment,
>> or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
>> handle it faster if it is sufficiently aligned by testing it at runtime.
>>
>> So, IMHO we shouldn't drop it, just improve it.
>> Perhaps:
>>
>>   int max_align = optimize_size ? BITS_PER_WORD
>>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>>
>>   if (opt
>>       && AGGREGATE_TYPE_P (type)
>>       && TYPE_SIZE (type)
>>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>>       && align < max_align)
>>     {
>>       int this_align = 256;
>>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>>           break;
>>         else if (align < this_align)
>>           align = this_align;
>>     }
>>
>> which will handle both the 256 bit alignment for >= 256 bit objects,
>> 512 bit alignment for >= 512 bit objects and will be prepared for the
>> future.
>>
>> 128 bit I think doesn't need to be handled, DATA_ALIGNMENT has been
>> using 256-bit test already since it's introduction in 1998:
>> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html
>
> Thanks for the pointer, there is indeed the recommendation in
> optimization manual [1], section 3.6.4, where it is said:
>
> --quote--
> Misaligned data access can incur significant performance penalties.
> This is particularly true for cache line
> splits. The size of a cache line is 64 bytes in the Pentium 4 and
> other recent Intel processors, including
> processors based on Intel Core microarchitecture.
> An access to data unaligned on 64-byte boundary leads to two memory
> accesses and requires several
> μops to be executed (instead of one). Accesses that span 64-byte
> boundaries are likely to incur a large
> performance penalty, the cost of each stall generally are greater on
> machines with longer pipelines.
>
> ...
>
> A 64-byte or greater data structure or array should be aligned so that
> its base address is a multiple of 64.
> Sorting data in decreasing size order is one heuristic for assisting
> with natural alignment. As long as 16-
> byte boundaries (and cache lines) are never crossed, natural alignment
> is not strictly necessary (though
> it is an easy way to enforce this).
> --/quote--
>
> So, this part has nothing to do with AVX512, but with cache line
> width. And we do have a --param "l1-cache-line-size=64", detected with
> -march=native that could come handy here.
>
> This part should be rewritten (and commented) with the information
> above in mind.

Like in the patch below. Please note that the block_tune setting for
nocona is wrong; -march=native on my trusted old P4 returns:

--param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
"l2-cache-size=2048" "-mtune=nocona"

which is consistent with the above quote from the manual.

2014-01-02  Uros Bizjak  <ubizjak@gmail.com>

    * config/i386/i386.c (ix86_data_alignment): Calculate max_align
    from prefetch_block tune setting.
    (nocona_cost): Correct size of prefetch block to 64.

The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
regression testing. If there are no comments, I will commit it to
mainline and release branches after a couple of days.

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 1560 bytes --]

Index: i386.c
===================================================================
--- i386.c	(revision 206311)
+++ i386.c	(working copy)
@@ -1568,7 +1568,7 @@ struct processor_costs nocona_cost = {
   8,					/* MMX or SSE register to integer */
   8,					/* size of l1 cache.  */
   1024,					/* size of l2 cache.  */
-  128,					/* size of prefetch block */
+  64,					/* size of prefetch block */
   8,					/* number of parallel prefetches */
   1,					/* Branch cost */
   COSTS_N_INSNS (6),			/* cost of FADD and FSUB insns.  */
@@ -26465,9 +26465,21 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD
-				: MIN (512, MAX_OFILE_ALIGNMENT);
+  /* Misaligned data access can incur significant performance penalties.
+     This is particularly true for cache line splits. The size of a cache
+     line is 64 bytes in the Pentium 4 and other recent Intel processors,
+     including processors based on Intel Core microarchitecture.
+     An access to data unaligned on 64-byte boundary leads to two memory
+     accesses and requires several μops to be executed (instead of one).
+     A 64-byte or greater data structure or array should be aligned so that
+     its base address is a multiple of 64.  */
 
+  int max_align
+    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
+
+  if (max_align < BITS_PER_WORD)
+    max_align = BITS_PER_WORD;
+
   if (opt
       && AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 13:35                           ` Uros Bizjak
@ 2014-01-03 13:43                             ` Jakub Jelinek
  2014-01-03 14:02                               ` Uros Bizjak
  2014-05-19  4:48                             ` Jan Hubicka
  1 sibling, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-03 13:43 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 03, 2014 at 02:35:36PM +0100, Uros Bizjak wrote:
> Like in the patch below. Please note, that the block_tune setting for
> the nocona is wrong, -march=native on my trusted old P4 returns:
> 
> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
> "l2-cache-size=2048" "-mtune=nocona"
> 
> which is consistent with the above quote from manual.
> 
> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
> 
>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>     from prefetch_block tune setting.
>     (nocona_cost): Correct size of prefetch block to 64.
> 
> The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
> regression testing. If there are no comments, I will commit it to
> mainline and release branches after a couple of days.

That still has the effect of not aligning (for most tunings) 32 to 63
byte long aggregates to 32 bytes, while previously they were aligned.
Forcing 32 byte long aggregates to 64 bytes would be overkill; 32 byte
alignment is just fine for those (and ensures they never cross a 64 byte
boundary).  For 33 to 63 bytes, perhaps using 64 byte alignment wouldn't
be that bad, it just wouldn't match what we have done before.
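
A concrete example of that regression (size chosen for illustration):

/* 40 bytes == 320 bits.  Old code: 320 >= 256 (the old max_align), so
   the alignment was raised to 32 bytes.  Patched code on most tunings:
   max_align is 64 * 8 == 512 bits and 320 < 512, so this hunk no
   longer raises the alignment.  */
static char buf[40];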

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 13:43                             ` Jakub Jelinek
@ 2014-01-03 14:02                               ` Uros Bizjak
  2014-01-03 14:13                                 ` Jakub Jelinek
  2014-01-03 16:04                                 ` Uros Bizjak
  0 siblings, 2 replies; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 14:02 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 3, 2014 at 2:43 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 03, 2014 at 02:35:36PM +0100, Uros Bizjak wrote:
>> Like in the patch below. Please note, that the block_tune setting for
>> the nocona is wrong, -march=native on my trusted old P4 returns:
>>
>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>> "l2-cache-size=2048" "-mtune=nocona"
>>
>> which is consistent with the above quote from manual.
>>
>> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
>>
>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>     from prefetch_block tune setting.
>>     (nocona_cost): Correct size of prefetch block to 64.
>>
>> The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
>> regression testing. If there are no comments, I will commit it to
>> mainline and release branches after a couple of days.
>
> That still has the effect of not aligning (for most tunings) 32 to 63 bytes
> long aggregates to 32 bytes, while previously they were aligned.  Forcing
> aligning 32 byte long aggregates to 64 bytes would be overkill, 32 byte
> alignment is just fine for those (and ensures it never crosses 64 byte
> boundary), for 33 to 63 bytes perhaps using 64 bytes alignment wouldn't
> be that bad, just wouldn't match what we have done before.

Please note that the previous value was based on an earlier (pre-P4)
recommendation and was appropriate for older chips with a 32-byte cache
line. The value should have been updated long ago, when 64-byte cache
lines were introduced, but that was probably missed due to the use of a
magic value without a comment.

Ah, I see. My patch deals only with structures larger than a cache
line. Following the "As long as 16-byte boundaries (and cache lines)
are never crossed, natural alignment is not strictly necessary (though
it is an easy way to enforce this)." part of the manual, we should
align smaller structures to 16 or 32 bytes.

Yes, I agree. Can you please merge your patch together with the proposed patch?

Thanks,
Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 14:02                               ` Uros Bizjak
@ 2014-01-03 14:13                                 ` Jakub Jelinek
  2014-01-03 14:35                                   ` Uros Bizjak
  2014-01-03 16:04                                 ` Uros Bizjak
  1 sibling, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-03 14:13 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 03, 2014 at 03:02:51PM +0100, Uros Bizjak wrote:
> Please note that previous value was based on earlier (pre P4)
> recommendation and it was appropriate for older chips with 32byte
> cache line. The value should be updated long ago, when 64bit cache
> lines were introduced, but was probably missed due to usage of magic
> value without comment.
> 
> Ah, I see. My patch deals only with structures, larger than cache
> line. As recommended in "As long as 16-byte boundaries (and cache
> lines) are never crossed, natural alignment is not strictly necessary
> (though it is an easy way to enforce this)." part of the manual, we
> should align smaller structures to 16 or 32 bytes.
> 
> Yes, I agree. Can you please merge your patch together with the proposed patch?

How do we want to treat the 33-63 resp. 17-31 byte long aggregates though?
32 byte and 16 byte long aggregates can surely be aligned just to 32
resp. 16 bytes; they then never cross a 64 byte boundary and don't waste
space in padding unnecessarily (still an opt thing, the ABI can override).
But do we want to spend some extra bytes to ensure that 17-31 resp. 33-63
byte long objects don't cross 64 byte boundaries, by aligning those to 32
resp. 64 bytes, or do we align them to 16 resp. 32 bytes instead?
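
For reference, the two options tabulated (alignment in bytes, a 64-byte
cache line assumed; "crossing-free" spends padding to guarantee no
object straddles a cache line):

  size (bytes)    smaller padding    crossing-free
  16              16                 16
  17-31           16                 32
  32              32                 32
  33-63           32                 64
  >= 64           64                 64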

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 14:13                                 ` Jakub Jelinek
@ 2014-01-03 14:35                                   ` Uros Bizjak
  2014-01-03 14:42                                     ` Jakub Jelinek
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 14:35 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 3, 2014 at 3:13 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 03, 2014 at 03:02:51PM +0100, Uros Bizjak wrote:
>> Please note that previous value was based on earlier (pre P4)
>> recommendation and it was appropriate for older chips with 32byte
>> cache line. The value should be updated long ago, when 64bit cache
>> lines were introduced, but was probably missed due to usage of magic
>> value without comment.
>>
>> Ah, I see. My patch deals only with structures, larger than cache
>> line. As recommended in "As long as 16-byte boundaries (and cache
>> lines) are never crossed, natural alignment is not strictly necessary
>> (though it is an easy way to enforce this)." part of the manual, we
>> should align smaller structures to 16 or 32 bytes.
>>
>> Yes, I agree. Can you please merge your patch together with the proposed patch?
>
> How do we want to treat the 33-63 resp. 17-31 bytes long aggregates though?
> 32 byte long and 16 byte long aggregates can surely be aligned just to 32
> resp. 16 bytes and never crosses 64 byte boundary then and doesn't waste
> space in paddings unnecessarily (still opt thing, ABI can override),
> but do we want to waste some extra bytes to ensure that 17-31 resp. 33-63
> bytes long objects don't cross 64 byte boundaries by aligning those to 32
> resp. 64 bytes, or do align them to 16 resp. 32 bytes instead?

Looking at "significant performance penalties" part of the above
recommendation, I'd say to align it to 32/64 byte boundaries.
Hopefully, the linker is able to put other data in the hole?

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 14:35                                   ` Uros Bizjak
@ 2014-01-03 14:42                                     ` Jakub Jelinek
  0 siblings, 0 replies; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-03 14:42 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 03, 2014 at 03:35:11PM +0100, Uros Bizjak wrote:
> On Fri, Jan 3, 2014 at 3:13 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Fri, Jan 03, 2014 at 03:02:51PM +0100, Uros Bizjak wrote:
> >> Please note that previous value was based on earlier (pre P4)
> >> recommendation and it was appropriate for older chips with 32byte
> >> cache line. The value should be updated long ago, when 64bit cache
> >> lines were introduced, but was probably missed due to usage of magic
> >> value without comment.
> >>
> >> Ah, I see. My patch deals only with structures, larger than cache
> >> line. As recommended in "As long as 16-byte boundaries (and cache
> >> lines) are never crossed, natural alignment is not strictly necessary
> >> (though it is an easy way to enforce this)." part of the manual, we
> >> should align smaller structures to 16 or 32 bytes.
> >>
> >> Yes, I agree. Can you please merge your patch together with the proposed patch?
> >
> > How do we want to treat the 33-63 resp. 17-31 bytes long aggregates though?
> > 32 byte long and 16 byte long aggregates can surely be aligned just to 32
> > resp. 16 bytes and never crosses 64 byte boundary then and doesn't waste
> > space in paddings unnecessarily (still opt thing, ABI can override),
> > but do we want to waste some extra bytes to ensure that 17-31 resp. 33-63
> > bytes long objects don't cross 64 byte boundaries by aligning those to 32
> > resp. 64 bytes, or do align them to 16 resp. 32 bytes instead?
> 
> Looking at "significant performance penalties" part of the above
> recommendation, I'd say to align it to 32/64 byte boundaries.
> Hopefully, the linker is able to put other data in the hole?

Without -fdata-sections the linker doesn't affect this, except for the
very first or very last object in the TU in the particular section.
GCC itself would need to (supposedly unless -fno-toplevel-reorder)
sort the varpool constants that are going to be emitted prior to
emitting them: compute what section each decl would be emitted to, and
within each section start by putting the variables with the biggest
alignment first and then try to pack them nicely.  Kind of similar to
what is done for -fsection-anchors, just don't emit everything as a
single block, just sort them in the varpool queue.
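
A hypothetical sketch of that packing heuristic, on plain (size,
alignment) records rather than varpool nodes (all names here are
invented for illustration):

#include <stdlib.h>

/* One emitted object: SIZE and ALIGN in bytes, ALIGN a power of two.  */
struct obj { unsigned size; unsigned align; };

/* qsort comparator: biggest alignment first, so later, laxer objects
   can sit in the gaps the stricter ones would otherwise create.  */
static int
cmp_align (const void *a, const void *b)
{
  const struct obj *x = (const struct obj *) a;
  const struct obj *y = (const struct obj *) b;
  return (int) y->align - (int) x->align;
}

/* Total padding needed when laying OBJS out in the given order.  */
static unsigned
total_padding (const struct obj *objs, unsigned n)
{
  unsigned off = 0, pad = 0, i;
  for (i = 0; i < n; i++)
    {
      /* Round OFF up to the next multiple of the (power of two) alignment.  */
      unsigned aligned = (off + objs[i].align - 1) & ~(objs[i].align - 1);
      pad += aligned - off;
      off = aligned + objs[i].size;
    }
  return pad;
}

Sorting with qsort (objs, n, sizeof *objs, cmp_align) before emission
should usually reduce total_padding; it is only a heuristic, not an
optimal packing.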

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 14:02                               ` Uros Bizjak
  2014-01-03 14:13                                 ` Jakub Jelinek
@ 2014-01-03 16:04                                 ` Uros Bizjak
  2014-01-14 17:09                                   ` Jakub Jelinek
  1 sibling, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-01-03 16:04 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 3, 2014 at 3:02 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

>>> Like in the patch below. Please note, that the block_tune setting for
>>> the nocona is wrong, -march=native on my trusted old P4 returns:
>>>
>>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>>> "l2-cache-size=2048" "-mtune=nocona"
>>>
>>> which is consistent with the above quote from manual.
>>>
>>> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
>>>
>>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>>     from prefetch_block tune setting.
>>>     (nocona_cost): Correct size of prefetch block to 64.
>>>
>>> The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
>>> regression testing. If there are no comments, I will commit it to
>>> mainline and release branches after a couple of days.
>>
>> That still has the effect of not aligning (for most tunings) 32 to 63 bytes
>> long aggregates to 32 bytes, while previously they were aligned.  Forcing
>> aligning 32 byte long aggregates to 64 bytes would be overkill, 32 byte
>> alignment is just fine for those (and ensures it never crosses 64 byte
>> boundary), for 33 to 63 bytes perhaps using 64 bytes alignment wouldn't
>> be that bad, just wouldn't match what we have done before.
>
> Please note that previous value was based on earlier (pre P4)
> recommendation and it was appropriate for older chips with 32byte
> cache line. The value should be updated long ago, when 64bit cache
> lines were introduced, but was probably missed due to usage of magic
> value without comment.
>
> Ah, I see. My patch deals only with structures, larger than cache
> line. As recommended in "As long as 16-byte boundaries (and cache
> lines) are never crossed, natural alignment is not strictly necessary
> (though it is an easy way to enforce this)." part of the manual, we
> should align smaller structures to 16 or 32 bytes.
>
> Yes, I agree. Can you please merge your patch together with the proposed patch?

On second thought, the crossing of 16-byte boundaries is mentioned
for the data *access* (the instruction itself) when it is not naturally
aligned (please see example 3-40 and fig 3-2), which is *NOT* the case
here.

So, we don't have to align 32 byte structures in any way for newer
processors, since this optimization applies only to structures of 64+
bytes (larger than or equal to the cache line size). Older processors
are handled correctly, modulo nocona, whose cache line size value has
to be corrected.

Following that, my original patch implements this optimization in the
correct way.

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 16:04                                 ` Uros Bizjak
@ 2014-01-14 17:09                                   ` Jakub Jelinek
  2014-01-14 18:37                                     ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-14 17:09 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 03, 2014 at 05:04:39PM +0100, Uros Bizjak wrote:
> On a second thought, the crossing of 16-byte boundaries is mentioned
> for the data *access* (the instruction itself) if it is not naturally
> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our
> case.
> 
> So, we don't have to align 32 byte structures in any way for newer
> processors, since this optimization applies to 64+ byte (larger or
> equal to cache line size) structures only. Older processors are
> handled correctly, modulo nocona, where its cache line size value has
> to be corrected.
> 
> Following that, my original patch implements this optimization in the
> correct way.

Sorry for catching this late, but on the 4.8 and earlier branches
there is no opt argument and thus any ix86_data_alignment change is
unfortunately an ABI change.  So I'd think we should revert
r206433 and r206436.  And for the trunk we need to ensure that even for
opt we never return a smaller number from ix86_data_alignment than
we did in 4.8 and earlier: if you have 4.8-compiled code that assumes
the alignment 4.8 would use for something that is defined in a
compilation unit built by gcc 4.9+, and we don't align it at least as
much as we did in the past, the linked mix of the 4.8 user and the 4.9
definition could misbehave.
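
A minimal two-TU sketch of that hazard (file names and the array size
are invented for illustration):

/* def.c -- built with GCC 4.9+.  If 4.9 no longer raises the alignment
   of this 64-byte array, it may end up with only its ABI alignment.  */
int arr[16] = { 0 };

/* use.c -- built with GCC 4.8, which incorrectly assumed the raised
   alignment even for extern symbols, so it may access arr with
   32-byte-aligned vector instructions and misbehave.  */
extern int arr[16];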

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-14 17:09                                   ` Jakub Jelinek
@ 2014-01-14 18:37                                     ` Uros Bizjak
  2014-01-14 19:00                                       ` H.J. Lu
  2014-01-14 19:12                                       ` Jakub Jelinek
  0 siblings, 2 replies; 43+ messages in thread
From: Uros Bizjak @ 2014-01-14 18:37 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek <jakub@redhat.com> wrote:

>> On a second thought, the crossing of 16-byte boundaries is mentioned
>> for the data *access* (the instruction itself) if it is not naturally
>> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our
>> case.
>>
>> So, we don't have to align 32 byte structures in any way for newer
>> processors, since this optimization applies to 64+ byte (larger or
>> equal to cache line size) structures only. Older processors are
>> handled correctly, modulo nocona, where its cache line size value has
>> to be corrected.
>>
>> Following that, my original patch implements this optimization in the
>> correct way.
>
> Sorry for catching this late, but on the 4.8 and earlier branches
> there is no opt argument and thus any ix86_data_alignment change is
> unfortunately an ABI change.  So I'd think we should revert
> r206433 and r206436.  And for the trunk we need to ensure even for

OK, let's play safe. I'll revert these two changes (modulo size of
nocona prefetch block).

> opt we never return a smaller number from ix86_data_alignment than
> we did in 4.8 and earlier, because otherwise if you have 4.8 compiled
> code that assumes the alignment 4.8 would use for something that is defined
> in a compilation unit built by gcc 4.9+, if we don't align it at least
> as much as we did in the past, the linked mix of 4.8 user and 4.9 definition
> could misbehave.

From 4.9 onwards, we would like to align >= 64-byte structures on a
64-byte boundary. Should we add a compatibility rule to align >= 32-byte
structures to 32 bytes?

Please also note that in 4.7 and 4.8, we have

int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);

so, in effect -Os code will be incompatible with other optimization levels.

I guess that for 4.7 and 4.8, we should revert to this anyway, but
what to do with 4.9?

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-14 18:37                                     ` Uros Bizjak
@ 2014-01-14 19:00                                       ` H.J. Lu
  2014-01-14 19:12                                       ` Jakub Jelinek
  1 sibling, 0 replies; 43+ messages in thread
From: H.J. Lu @ 2014-01-14 19:00 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Jakub Jelinek, Eric Botcazou, Kirill Yukhin, gcc-patches,
	Richard Henderson

On Tue, Jan 14, 2014 at 10:37 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Jan 14, 2014 at 6:09 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>
>>> On a second thought, the crossing of 16-byte boundaries is mentioned
>>> for the data *access* (the instruction itself) if it is not naturally
>>> aligned (please see example 3-40 and fig 3-2), which is *NOT* in our
>>> case.
>>>
>>> So, we don't have to align 32 byte structures in any way for newer
>>> processors, since this optimization applies to 64+ byte (larger or
>>> equal to cache line size) structures only. Older processors are
>>> handled correctly, modulo nocona, where its cache line size value has
>>> to be corrected.
>>>
>>> Following that, my original patch implements this optimization in the
>>> correct way.
>>
>> Sorry for catching this late, but on the 4.8 and earlier branches
>> there is no opt argument and thus any ix86_data_alignment change is
>> unfortunately an ABI change.  So I'd think we should revert
>> r206433 and r206436.  And for the trunk we need to ensure even for
>
> OK, let's play safe. I'll revert these two changes (modulo size of
> nocona prefetch block).
>
>> opt we never return a smaller number from ix86_data_alignment than
>> we did in 4.8 and earlier, because otherwise if you have 4.8 compiled
>> code that assumes the alignment 4.8 would use for something that is defined
>> in a compilation unit built by gcc 4.9+, if we don't align it at least
>> as much as we did in the past, the linked mix of 4.8 user and 4.9 definition
>> could misbehave.
>
> From 4.9 onwards, we would like to align >= 64byte structures on
> 64byte boundary. Should we add a compatibility rule to align >= 32byte
> structures to 32 bytes?

That is why we issue a warning when the alignment was changed
with AVX support:

[hjl@gnu-6 tmp]$ cat a1.i
typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
extern __m256i y;
void
f1(__m256i x)
{
  y = x;
}
[hjl@gnu-6 tmp]$ gcc -S a1.i
a1.i: In function ‘f1’:
a1.i:4:1: note: The ABI for passing parameters with 32-byte alignment
has changed in GCC 4.6
 f1(__m256i x)
 ^
a1.i:4:1: warning: AVX vector argument without AVX enabled changes the
ABI [enabled by default]
[hjl@gnu-6 tmp]$

> Please also note that in 4.7 and 4.8, we have
>
> int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
>
> so, in effect -Os code will be incompatible with other optimization levels.
>
> I guess that for 4.7 and 4.8, we should revert to this anyway, but
> what to do with 4.9?
>
> Uros.



-- 
H.J.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-14 18:37                                     ` Uros Bizjak
  2014-01-14 19:00                                       ` H.J. Lu
@ 2014-01-14 19:12                                       ` Jakub Jelinek
  2014-01-17 14:15                                         ` Jakub Jelinek
  1 sibling, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-14 19:12 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Tue, Jan 14, 2014 at 07:37:33PM +0100, Uros Bizjak wrote:
> OK, let's play safe. I'll revert these two changes (modulo size of
> nocona prefetch block).

Thanks.

> > opt we never return a smaller number from ix86_data_alignment than
> > we did in 4.8 and earlier, because otherwise if you have 4.8 compiled
> > code that assumes the alignment 4.8 would use for something that is defined
> > in a compilation unit built by gcc 4.9+, if we don't align it at least
> > as much as we did in the past, the linked mix of 4.8 user and 4.9 definition
> > could misbehave.
> 
> From 4.9 onwards, we would like to align >= 64byte structures on
> 64byte boundary. Should we add a compatibility rule to align >= 32byte
> structures to 32 bytes?
> 
> Please also note that in 4.7 and 4.8, we have
> 
> int max_align = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
> 
> so, in effect -Os code will be incompatible with other optimization levels.

Well, max_align is only one of several possible sources of alignment
increases, but yes, there is an ABI issue; see e.g. PR56564 for details.

> I guess that for 4.7 and 4.8, we should revert to this anyway, but
> what to do with 4.9?

For 4.9, if what you've added is what you want to do for performance
reasons, then I'd do something like:

  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
     for symbols from other compilation units or symbols that don't need
     to bind locally.  In order to preserve some ABI compatibility with
     those compilers, ensure we don't decrease alignment from what we
     used to assume.  */

  int max_align_compat
    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);

  /* A data structure, equal or greater than the size of a cache line
     (64 bytes in the Pentium 4 and other recent Intel processors, including
     processors based on Intel Core microarchitecture) should be aligned
     so that its base address is a multiple of a cache line size.  */
  
  int max_align
    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);

  if (max_align < BITS_PER_WORD)
    max_align = BITS_PER_WORD;

  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST)
    {
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align_compat
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align_compat)
	align = max_align_compat;
      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  && align < max_align)
	align = max_align;
    }

That way, max_align will be purely optimization and can be changed as
anyone wishes in the future, max_align_compat compatibility with
pre-4.9 (beyond ABI) assumptions and !opt stuff the ABI mandated alignment.
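
Worked through on two assumed sizes (without -Os and with a 64-byte
prefetch block, so max_align_compat is 256 bits and max_align is 512
bits): a 40-byte aggregate is 320 bits, which passes the 256-bit compat
test but not the 512-bit tuning test, so it keeps the 4.8-compatible
32-byte alignment; a 72-byte aggregate is 576 bits, which passes both
tests and ends up aligned to 64 bytes, i.e. one cache line.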

	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-14 19:12                                       ` Jakub Jelinek
@ 2014-01-17 14:15                                         ` Jakub Jelinek
  2014-01-17 14:25                                           ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: Jakub Jelinek @ 2014-01-17 14:15 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Tue, Jan 14, 2014 at 08:12:41PM +0100, Jakub Jelinek wrote:
> For 4.9, if what you've added is what you want to do for performance
> reasons, then I'd do something like:

Ok, here it is in the form of a patch, bootstrapped/regtested on
x86_64-linux and i686-linux. Ok for trunk?

2014-01-17  Jakub Jelinek  <jakub@redhat.com>

	* config/i386/i386.c (ix86_data_alignment): For compatibility with
	(incorrect) GCC 4.8 and earlier alignment assumptions ensure we align
	decls to at least the GCC 4.8 used alignments.

--- gcc/config/i386/i386.c.jj	2014-01-16 20:22:50.000000000 +0100
+++ gcc/config/i386/i386.c	2014-01-17 11:56:51.183501322 +0100
@@ -26433,6 +26433,15 @@ ix86_constant_alignment (tree exp, int a
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
+  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
+     for symbols from other compilation units or symbols that don't need
+     to bind locally.  In order to preserve some ABI compatibility with
+     those compilers, ensure we don't decrease alignment from what we
+     used to assume.  */
+
+  int max_align_compat
+    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+
   /* A data structure, equal or greater than the size of a cache line
      (64 bytes in the Pentium 4 and other recent Intel processors, including
      processors based on Intel Core microarchitecture) should be aligned
@@ -26447,11 +26456,17 @@ ix86_data_alignment (tree type, int alig
   if (opt
       && AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
-      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
-      && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
-	  || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
-      && align < max_align)
-    align = max_align;
+      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST)
+    {
+      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align_compat
+	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
+	  && align < max_align_compat)
+	align = max_align_compat;
+      if ((TREE_INT_CST_LOW (TYPE_SIZE (type)) >= (unsigned) max_align
+	   || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
+	  && align < max_align)
+	align = max_align;
+    }
 
   /* x86-64 ABI requires arrays greater than 16 bytes to be aligned
      to 16byte boundary.  */


	Jakub

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-17 14:15                                         ` Jakub Jelinek
@ 2014-01-17 14:25                                           ` Uros Bizjak
  0 siblings, 0 replies; 43+ messages in thread
From: Uros Bizjak @ 2014-01-17 14:25 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Eric Botcazou, Kirill Yukhin, gcc-patches, Richard Henderson

On Fri, Jan 17, 2014 at 3:15 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jan 14, 2014 at 08:12:41PM +0100, Jakub Jelinek wrote:
>> For 4.9, if what you've added is what you want to do for performance
>> reasons, then I'd do something like:
>
> Ok, here it is in a form of patch, bootstrapped/regtested on x86_64-linux
> and i686-linux, ok for trunk?
>
> 2014-01-17  Jakub Jelinek  <jakub@redhat.com>
>
>         * config/i386/i386.c (ix86_data_alignment): For compatibility with
>         (incorrect) GCC 4.8 and earlier alignment assumptions ensure we align
>         decls to at least the GCC 4.8 used alignments.

This is OK for mainline.

Thanks,
Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-01-03 13:35                           ` Uros Bizjak
  2014-01-03 13:43                             ` Jakub Jelinek
@ 2014-05-19  4:48                             ` Jan Hubicka
  2014-05-19 16:14                               ` Uros Bizjak
  1 sibling, 1 reply; 43+ messages in thread
From: Jan Hubicka @ 2014-05-19  4:48 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Jakub Jelinek, Eric Botcazou, Kirill Yukhin, gcc-patches,
	Richard Henderson

> > Thanks for the pointer, there is indeed the recommendation in
> > optimization manual [1], section 3.6.4, where it is said:
> >
> > --quote--
> > Misaligned data access can incur significant performance penalties.
> > This is particularly true for cache line
> > splits. The size of a cache line is 64 bytes in the Pentium 4 and
> > other recent Intel processors, including
> > processors based on Intel Core microarchitecture.
> > An access to data unaligned on 64-byte boundary leads to two memory
> > accesses and requires several
> > μops to be executed (instead of one). Accesses that span 64-byte
> > boundaries are likely to incur a large
> > performance penalty, the cost of each stall generally are greater on
> > machines with longer pipelines.
> >
> > ...
> >
> > A 64-byte or greater data structure or array should be aligned so that
> > its base address is a multiple of 64.
> > Sorting data in decreasing size order is one heuristic for assisting
> > with natural alignment. As long as 16-
> > byte boundaries (and cache lines) are never crossed, natural alignment
> > is not strictly necessary (though
> > it is an easy way to enforce this).
> > --/quote--
> >
> > So, this part has nothing to do with AVX512, but with cache line
> > width. And we do have a --param "l1-cache-line-size=64", detected with
> > -march=native that could come handy here.
> >
> > This part should be rewritten (and commented) with the information
> > above in mind.
> 
> Like in the patch below. Please note, that the block_tune setting for
> the nocona is wrong, -march=native on my trusted old P4 returns:
> 
> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
> "l2-cache-size=2048" "-mtune=nocona"
> 
> which is consistent with the above quote from manual.
> 
> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
> 
>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>     from prefetch_block tune setting.
>     (nocona_cost): Correct size of prefetch block to 64.
> 
Uros,
I am looking into libreoffice size, and the data alignment seems to make a
huge difference. The data section has grown from 5.8MB to 6.3MB between
GCC 4.8 and 4.9, while clang produces 5.2MB.

The two patches I posted to not align vtables and RTTI reduce it to 5.7MB,
but perhaps we want to revisit the alignment rules.  The optimization
manuals usually care only about performance critical loops.  Perhaps we
can make the rules align only bigger data structures, at least for -O2.

Honza

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-19  4:48                             ` Jan Hubicka
@ 2014-05-19 16:14                               ` Uros Bizjak
  2014-05-19 16:42                                 ` H.J. Lu
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-05-19 16:14 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Eric Botcazou, Kirill Yukhin, gcc-patches,
	Richard Henderson

On Mon, May 19, 2014 at 6:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Thanks for the pointer, there is indeed the recommendation in
>> > optimization manual [1], section 3.6.4, where it is said:
>> >
>> > --quote--
>> > Misaligned data access can incur significant performance penalties.
>> > This is particularly true for cache line
>> > splits. The size of a cache line is 64 bytes in the Pentium 4 and
>> > other recent Intel processors, including
>> > processors based on Intel Core microarchitecture.
>> > An access to data unaligned on 64-byte boundary leads to two memory
>> > accesses and requires several
>> > μops to be executed (instead of one). Accesses that span 64-byte
>> > boundaries are likely to incur a large
>> > performance penalty, the cost of each stall generally are greater on
>> > machines with longer pipelines.
>> >
>> > ...
>> >
>> > A 64-byte or greater data structure or array should be aligned so that
>> > its base address is a multiple of 64.
>> > Sorting data in decreasing size order is one heuristic for assisting
>> > with natural alignment. As long as 16-
>> > byte boundaries (and cache lines) are never crossed, natural alignment
>> > is not strictly necessary (though
>> > it is an easy way to enforce this).
>> > --/quote--
>> >
>> > So, this part has nothing to do with AVX512, but with cache line
>> > width. And we do have a --param "l1-cache-line-size=64", detected with
>> > -march=native that could come handy here.
>> >
>> > This part should be rewritten (and commented) with the information
>> > above in mind.
>>
>> Like in the patch below. Please note, that the block_tune setting for
>> the nocona is wrong, -march=native on my trusted old P4 returns:
>>
>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>> "l2-cache-size=2048" "-mtune=nocona"
>>
>> which is consistent with the above quote from manual.
>>
>> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
>>
>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>     from prefetch_block tune setting.
>>     (nocona_cost): Correct size of prefetch block to 64.
>>
> Uros,
> I am looking into libreoffice size and the data alignment seems to make huge
> difference. Data section has grown from 5.8MB to 6.3MB in between GCC 4.8 and 4.9,
> while clang produces 5.2MB.
>
> The two patches I posted to not align vtables and RTTI reduces it to 5.7MB, but
> But perhaps we want to revisit the alignment rules.  The optimization manuals
> usually care only about performance critical loops.  Perhaps we can make the
> rules to align only bigger datastructures, or so at least for -O2.

Based on the above quote ("Misaligned data access can incur
significant performance penalties.") and the fact that this particular
alignment rule had some compatibility issues with previous versions of
gcc (these were later fixed by Jakub), I'd rather leave this rule as
is. However, if the access is from the cold section, we can perhaps
avoid the extra alignment while still avoiding those compatibility
issues.

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-19 16:14                               ` Uros Bizjak
@ 2014-05-19 16:42                                 ` H.J. Lu
  2014-05-19 16:45                                   ` Uros Bizjak
  0 siblings, 1 reply; 43+ messages in thread
From: H.J. Lu @ 2014-05-19 16:42 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Jan Hubicka, Jakub Jelinek, Eric Botcazou, Kirill Yukhin,
	gcc-patches, Richard Henderson

On Mon, May 19, 2014 at 9:14 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Mon, May 19, 2014 at 6:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Thanks for the pointer, there is indeed the recommendation in
>>> > optimization manual [1], section 3.6.4, where it is said:
>>> >
>>> > --quote--
>>> > Misaligned data access can incur significant performance penalties.
>>> > This is particularly true for cache line
>>> > splits. The size of a cache line is 64 bytes in the Pentium 4 and
>>> > other recent Intel processors, including
>>> > processors based on Intel Core microarchitecture.
>>> > An access to data unaligned on 64-byte boundary leads to two memory
>>> > accesses and requires several
>>> > μops to be executed (instead of one). Accesses that span 64-byte
>>> > boundaries are likely to incur a large
>>> > performance penalty, the cost of each stall generally are greater on
>>> > machines with longer pipelines.
>>> >
>>> > ...
>>> >
>>> > A 64-byte or greater data structure or array should be aligned so that
>>> > its base address is a multiple of 64.
>>> > Sorting data in decreasing size order is one heuristic for assisting
>>> > with natural alignment. As long as 16-
>>> > byte boundaries (and cache lines) are never crossed, natural alignment
>>> > is not strictly necessary (though
>>> > it is an easy way to enforce this).
>>> > --/quote--
>>> >
>>> > So, this part has nothing to do with AVX512, but with cache line
>>> > width. And we do have a --param "l1-cache-line-size=64", detected with
>>> > -march=native that could come handy here.
>>> >
>>> > This part should be rewritten (and commented) with the information
>>> > above in mind.
>>>
>>> Like in the patch below. Please note, that the block_tune setting for
>>> the nocona is wrong, -march=native on my trusted old P4 returns:
>>>
>>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>>> "l2-cache-size=2048" "-mtune=nocona"
>>>
>>> which is consistent with the above quote from manual.
>>>
>>> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
>>>
>>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>>     from prefetch_block tune setting.
>>>     (nocona_cost): Correct size of prefetch block to 64.
>>>
>> Uros,
>> I am looking into libreoffice size and the data alignment seems to make huge
>> difference. Data section has grown from 5.8MB to 6.3MB in between GCC 4.8 and 4.9,
>> while clang produces 5.2MB.
>>
>> The two patches I posted to not align vtables and RTTI reduces it to 5.7MB, but
>> But perhaps we want to revisit the alignment rules.  The optimization manuals
>> usually care only about performance critical loops.  Perhaps we can make the
>> rules to align only bigger datastructures, or so at least for -O2.
>
> Based on the above quote, "Misaligned data access can incur
> significant performance penalties." and the fact that this particular
> alignment rule has some compatibility issues with previous versions of
> gcc (these were later fixed by Jakub), I'd rather leave this rule as
> is. However, if the access is from the cold section, we can perhaps
> avoid extra alignment, while avoiding those compatibility issues.
>

It is excessive to align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7;
  int x8;
};

to 32 bytes and align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7[9];
  int x8;
};

to 64 bytes.  What performance gain does it provide?

-- 
H.J.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-19 16:42                                 ` H.J. Lu
@ 2014-05-19 16:45                                   ` Uros Bizjak
  2014-05-19 16:58                                     ` H.J. Lu
  0 siblings, 1 reply; 43+ messages in thread
From: Uros Bizjak @ 2014-05-19 16:45 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jakub Jelinek, Eric Botcazou, Kirill Yukhin,
	gcc-patches, Richard Henderson

On Mon, May 19, 2014 at 6:42 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>>> Uros,
>>> I am looking into libreoffice size and the data alignment seems to make huge
>>> difference. Data section has grown from 5.8MB to 6.3MB in between GCC 4.8 and 4.9,
>>> while clang produces 5.2MB.
>>>
>>> The two patches I posted to not align vtables and RTTI reduces it to 5.7MB, but
>>> But perhaps we want to revisit the alignment rules.  The optimization manuals
>>> usually care only about performance critical loops.  Perhaps we can make the
>>> rules to align only bigger datastructures, or so at least for -O2.
>>
>> Based on the above quote, "Misaligned data access can incur
>> significant performance penalties." and the fact that this particular
>> alignment rule has some compatibility issues with previous versions of
>> gcc (these were later fixed by Jakub), I'd rather leave this rule as
>> is. However, if the access is from the cold section, we can perhaps
>> avoid extra alignment, while avoiding those compatibility issues.
>>
>
> It is excessive to align
>
> struct foo
> {
>   int x1;
>   int x2;
>   char x3;
>   int x4;
>   int x5;
>   char x6;
>   int x7;
>   int x8;
> };
>
> to 32 bytes and align
>
> struct foo
> {
>   int x1;
>   int x2;
>   char x3;
>   int x4;
>   int x5;
>   char x6;
>   int x7[9];
>   int x8;
> };
>
> to 64 bytes.  What performance gain does it provide?

Avoids "significant performance penalties," perhaps?

Uros.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-19 16:45                                   ` Uros Bizjak
@ 2014-05-19 16:58                                     ` H.J. Lu
  2014-05-20 12:00                                       ` Kirill Yukhin
  0 siblings, 1 reply; 43+ messages in thread
From: H.J. Lu @ 2014-05-19 16:58 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Jan Hubicka, Jakub Jelinek, Eric Botcazou, Kirill Yukhin,
	gcc-patches, Richard Henderson

On Mon, May 19, 2014 at 9:45 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Mon, May 19, 2014 at 6:42 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>>> Uros,
>>>> I am looking into libreoffice size and the data alignment seems to make huge
>>>> difference. Data section has grown from 5.8MB to 6.3MB in between GCC 4.8 and 4.9,
>>>> while clang produces 5.2MB.
>>>>
>>>> The two patches I posted to not align vtables and RTTI reduces it to 5.7MB, but
>>>> But perhaps we want to revisit the alignment rules.  The optimization manuals
>>>> usually care only about performance critical loops.  Perhaps we can make the
>>>> rules to align only bigger datastructures, or so at least for -O2.
>>>
>>> Based on the above quote, "Misaligned data access can incur
>>> significant performance penalties." and the fact that this particular
>>> alignment rule has some compatibility issues with previous versions of
>>> gcc (these were later fixed by Jakub), I'd rather leave this rule as
>>> is. However, if the access is from the cold section, we can perhaps
>>> avoid extra alignment, while avoiding those compatibility issues.
>>>
>>
>> It is excessive to align
>>
>> struct foo
>> {
>>   int x1;
>>   int x2;
>>   char x3;
>>   int x4;
>>   int x5;
>>   char x6;
>>   int x7;
>>   int x8;
>> };
>>
>> to 32 bytes and align
>>
>> struct foo
>> {
>>   int x1;
>>   int x2;
>>   char x3;
>>   int x4;
>>   int x5;
>>   char x6;
>>   int x7[9];
>>   int x8;
>> };
>>
>> to 64 bytes.  What performance gain does it provide?
>
> Avoids "significant performance penalties," perhaps?
>

Kirill, do we have performance data for excessive alignment
vs ABI alignment?


-- 
H.J.

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-19 16:58                                     ` H.J. Lu
@ 2014-05-20 12:00                                       ` Kirill Yukhin
  2014-05-20 15:24                                         ` H.J. Lu
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Yukhin @ 2014-05-20 12:00 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, Jan Hubicka, Jakub Jelinek, Eric Botcazou,
	gcc-patches, Richard Henderson

Hello,
On 19 May 09:58, H.J. Lu wrote:
> On Mon, May 19, 2014 at 9:45 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> > On Mon, May 19, 2014 at 6:42 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> >>>> Uros,
> >>>> I am looking into LibreOffice size, and data alignment seems to make a huge
> >>>> difference.  The data section has grown from 5.8MB to 6.3MB between GCC 4.8
> >>>> and 4.9, while clang produces 5.2MB.
> >>>>
> >>>> The two patches I posted to not align vtables and RTTI reduce it to 5.7MB,
> >>>> but perhaps we want to revisit the alignment rules.  The optimization manuals
> >>>> usually care only about performance-critical loops.  Perhaps we can make the
> >>>> rules align only bigger data structures, or do so at least for -O2.
> >>>
> >>> Based on the above quote, "Misaligned data access can incur
> >>> significant performance penalties," and the fact that this particular
> >>> alignment rule had some compatibility issues with previous versions of
> >>> gcc (these were later fixed by Jakub), I'd rather leave this rule as
> >>> is.  However, if the access is from the cold section, we can perhaps
> >>> avoid the extra alignment while still avoiding those compatibility issues.
> >>>
> >>
> >> It is excessive to align
> >>
> >> struct foo
> >> {
> >>   int x1;
> >>   int x2;
> >>   char x3;
> >>   int x4;
> >>   int x5;
> >>   char x6;
> >>   int x7;
> >>   int x8;
> >> };
> >>
> >> to 32 bytes and align
> >>
> >> struct foo
> >> {
> >>   int x1;
> >>   int x2;
> >>   char x3;
> >>   int x4;
> >>   int x5;
> >>   char x6;
> >>   int x7[9];
> >>   int x8;
> >> };
> >>
> >> to 64 bytes.  What performance gain does it provide?
> >
> > Avoids "significant performance penalties," perhaps?
> >
> 
> Kirill, do we have performance data for excessive alignment
> vs ABI alignment?
Nope, we have no actual data showing the performance impact of such changes,
sorry.

We could try such a change on an HSW machine (on SPEC 2006); would that be
useful?

--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -26576,7 +26576,7 @@ ix86_data_alignment (tree type, int align, bool opt)
      used to assume.  */

   int max_align_compat
-    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
+    = optimize_size ? BITS_PER_WORD : MIN (128, MAX_OFILE_ALIGNMENT);

   /* A data structure, equal or greater than the size of a cache line
      (64 bytes in the Pentium 4 and other recent Intel processors, including
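
The one-liner above lowers the compatibility cap from 256 bits (32-byte
alignment) to 128 bits (16-byte alignment).  As a hedged, self-contained
sketch of the capped computation, with stand-in values for the GCC
macros (they are illustrative, not the real definitions):

#define BITS_PER_WORD       64            /* stand-in for the x86-64 value */
#define MAX_OFILE_ALIGNMENT (256 * 8)     /* stand-in; the real value is larger */
#define MIN(a, b)           ((a) < (b) ? (a) : (b))

static int
max_align_compat_bits (int optimize_size)
{
  /* Before the patch: MIN (256, ...) caps the bump at 32-byte alignment.
     After the patch:  MIN (128, ...) caps the bump at 16-byte alignment.  */
  return optimize_size ? BITS_PER_WORD : MIN (128, MAX_OFILE_ALIGNMENT);
}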


> -- 
> H.J.

--
Thanks, K

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-20 12:00                                       ` Kirill Yukhin
@ 2014-05-20 15:24                                         ` H.J. Lu
  2014-05-22  9:02                                           ` Kirill Yukhin
  0 siblings, 1 reply; 43+ messages in thread
From: H.J. Lu @ 2014-05-20 15:24 UTC (permalink / raw)
  To: Kirill Yukhin
  Cc: Uros Bizjak, Jan Hubicka, Jakub Jelinek, Eric Botcazou,
	gcc-patches, Richard Henderson

On Tue, May 20, 2014 at 5:00 AM, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
> Hello,
> On 19 May 09:58, H.J. Lu wrote:
>> On Mon, May 19, 2014 at 9:45 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> > On Mon, May 19, 2014 at 6:42 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >
>> >>>> Uros,
>> >>>> I am looking into LibreOffice size, and data alignment seems to make a huge
>> >>>> difference.  The data section has grown from 5.8MB to 6.3MB between GCC 4.8
>> >>>> and 4.9, while clang produces 5.2MB.
>> >>>>
>> >>>> The two patches I posted to not align vtables and RTTI reduce it to 5.7MB,
>> >>>> but perhaps we want to revisit the alignment rules.  The optimization manuals
>> >>>> usually care only about performance-critical loops.  Perhaps we can make the
>> >>>> rules align only bigger data structures, or do so at least for -O2.
>> >>>
>> >>> Based on the above quote, "Misaligned data access can incur
>> >>> significant performance penalties," and the fact that this particular
>> >>> alignment rule had some compatibility issues with previous versions of
>> >>> gcc (these were later fixed by Jakub), I'd rather leave this rule as
>> >>> is.  However, if the access is from the cold section, we can perhaps
>> >>> avoid the extra alignment while still avoiding those compatibility issues.
>> >>>
>> >>
>> >> It is excessive to align
>> >>
>> >> struct foo
>> >> {
>> >>   int x1;
>> >>   int x2;
>> >>   char x3;
>> >>   int x4;
>> >>   int x5;
>> >>   char x6;
>> >>   int x7;
>> >>   int x8;
>> >> };
>> >>
>> >> to 32 bytes and align
>> >>
>> >> struct foo
>> >> {
>> >>   int x1;
>> >>   int x2;
>> >>   char x3;
>> >>   int x4;
>> >>   int x5;
>> >>   char x6;
>> >>   int x7[9];
>> >>   int x8;
>> >> };
>> >>
>> >> to 64 bytes.  What performance gain does it provide?
>> >
>> > Avoids "significant performance penalties," perhaps?
>> >
>>
>> Kirill, do we have performance data for excessive alignment
>> vs ABI alignment?
> Nope, we have no actual data showing the performance impact of such changes,
> sorry.
>
> We could try such a change on an HSW machine (on SPEC 2006); would that be
> useful?
>
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -26576,7 +26576,7 @@ ix86_data_alignment (tree type, int align, bool opt)
>       used to assume.  */
>
>    int max_align_compat
> -    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
> +    = optimize_size ? BITS_PER_WORD : MIN (128, MAX_OFILE_ALIGNMENT);
>
>    /* A data structure, equal or greater than the size of a cache line
>       (64 bytes in the Pentium 4 and other recent Intel processors, including
>
>

ABI alignment should be sufficient for correctness. Bigger alignments
are supposed to give better performance.  Can you try this patch on
HSW and SLM to see if it has any impact on performance?

-- 
H.J.
----
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index c0a46ed..4879110 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -26568,39 +26568,6 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
-     for symbols from other compilation units or symbols that don't need
-     to bind locally.  In order to preserve some ABI compatibility with
-     those compilers, ensure we don't decrease alignment from what we
-     used to assume.  */
-
-  int max_align_compat
-    = optimize_size ? BITS_PER_WORD : MIN (256, MAX_OFILE_ALIGNMENT);
-
-  /* A data structure, equal or greater than the size of a cache line
-     (64 bytes in the Pentium 4 and other recent Intel processors, including
-     processors based on Intel Core microarchitecture) should be aligned
-     so that its base address is a multiple of a cache line size.  */
-
-  int max_align
-    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
-
-  if (max_align < BITS_PER_WORD)
-    max_align = BITS_PER_WORD;
-
-  if (opt
-      && AGGREGATE_TYPE_P (type)
-      && TYPE_SIZE (type)
-      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST)
-    {
-      if (wi::geu_p (TYPE_SIZE (type), max_align_compat)
-          && align < max_align_compat)
-        align = max_align_compat;
-      if (wi::geu_p (TYPE_SIZE (type), max_align)
-          && align < max_align)
-        align = max_align;
-    }
-
   /* x86-64 ABI requires arrays greater than 16 bytes to be aligned
      to 16byte boundary.  */
   if (TARGET_64BIT)
@@ -26616,6 +26583,9 @@ ix86_data_alignment (tree type, int align, bool opt)
   if (!opt)
     return align;

+  if (align < BITS_PER_WORD)
+    align = BITS_PER_WORD;
+
   if (TREE_CODE (type) == ARRAY_TYPE)
     {
       if (TYPE_MODE (TREE_TYPE (type)) == DFmode && align < 64)
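
In effect the patch removes both optimization-driven bumps, the GCC 4.8
compatibility cap and the cache-line rule, leaving only the ABI-mandated
and word-size adjustments.  Roughly, the hook then reduces to something
like the following self-contained paraphrase (illustrative only; the
stand-in parameters and predicate replace the TREE_CODE/TYPE_SIZE checks
the real code performs on 'tree' nodes):

#include <stdbool.h>

#define TARGET_64BIT   1
#define BITS_PER_WORD  64

static bool
is_large_array_p (bool is_array, unsigned size_bytes)
{
  return is_array && size_bytes > 16;   /* stand-in predicate */
}

static int
data_alignment_sketch (bool is_array, unsigned size_bytes, int align, bool opt)
{
  /* x86-64 psABI: arrays greater than 16 bytes get 16-byte alignment.  */
  if (TARGET_64BIT && is_large_array_p (is_array, size_bytes) && align < 128)
    align = 128;

  if (!opt)
    return align;                 /* only required alignment remains */

  if (align < BITS_PER_WORD)      /* optimization floor: one word */
    align = BITS_PER_WORD;

  /* ... the existing ARRAY_TYPE mode-specific tweaks follow here ...  */
  return align;
}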

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-20 15:24                                         ` H.J. Lu
@ 2014-05-22  9:02                                           ` Kirill Yukhin
  2014-05-22 17:38                                             ` H.J. Lu
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Yukhin @ 2014-05-22  9:02 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, Jan Hubicka, Jakub Jelinek, Eric Botcazou,
	gcc-patches, Richard Henderson

Hello,
On 20 May 08:24, H.J. Lu wrote:
> ABI alignment should be sufficient for correctness. Bigger alignments
> are supposed to give better performance.  Can you try this patch on
> HSW and SLM to see if it has any impact on performance?

Here is the perf data for your patch.

Only HSW results so far.

HSW, 64 bits, base

Test Previous Current Ratio(%)
400.perlbench     37.7000    37.7000 +0%
401.bzip2         24.8000    24.7000 -0.40%
403.gcc           35.1000    35.2000 +0.28%
429.mcf           41.7000    42.0000 +0.71%
445.gobmk         26.9000    27.0000 +0.37%
456.hmmer         27.2000    27.2000 +0%
458.sjeng         30.2000    30.1000 -0.33%
462.libquantum    77.4000    76.7000 -0.90%
464.h264ref       52.5000    52.8000 +0.57%
471.omnetpp       23.8000    23.7000 -0.42%
473.astar         23.2000    23.1000 -0.43%
483.xalancbmk     39.8000    40.1000 +0.75%
410.bwaves        78.4000    78.5000 +0.12%
416.gamess        33.9000    33.9000 +0%
433.milc          34.7000    34.8000 +0.28%
434.zeusmp        38.6000    38.4000 -0.51%
435.gromacs       26.9000    27.0000 +0.37%
436.cactusADM     54.7000    62.0000 +13.34%
437.leslie3d      45.3000    45.3000 +0%
444.namd          27.2000    27.1000 -0.36%
447.dealII        56.7000    56.7000 +0%
450.soplex        39.3000    39.3000 +0%
453.povray        49.0000    49.1000 +0.20%
454.calculix      28.8000    29.3000 +1.73%
459.GemsFDTD      38.9000    39.0000 +0.25%
465.tonto         23.1000    23.3000 +0.86%
470.lbm           55.3000    55.6000 +0.54%
481.wrf           40.8000    40.8000 +0%
482.sphinx3       47.8000    47.9000 +0.20%

HSW, 64 bits, o2

Test Previous Current Ratio(%)
400.perlbench     39.7000    39.7000 +0%
401.bzip2         25.1000    25.1000 +0%
403.gcc           33.7000    33.7000 +0%
429.mcf           40.1000    39.9000 -0.49%
445.gobmk         26.5000    26.4000 -0.37%
456.hmmer         24.8000    24.8000 +0%
458.sjeng         28.4000    28.5000 +0.35%
462.libquantum    74.4000    74.4000 +0%
464.h264ref       50.1000    50.3000 +0.39%
471.omnetpp       22.6000    22.5000 -0.44%
473.astar         20.7000    21.0000 +1.44%
483.xalancbmk     37.0000    37.4000 +1.08%
410.bwaves        60.1000    60.1000 +0%
416.gamess        35.5000    35.4000 -0.28%
433.milc          29.9000    29.8000 -0.33%
434.zeusmp        34.8000    34.6000 -0.57%
435.gromacs       27.4000    27.5000 +0.36%
436.cactusADM     32.3000    33.8000 +4.64%
437.leslie3d      32.6000    32.6000 +0%
444.namd          26.9000    26.9000 +0%
447.dealII        45.2000    45.1000 -0.22%
450.soplex        42.5000    42.6000 +0.23%
453.povray        45.9000    46.1000 +0.43%
454.calculix      12.9000    12.9000 +0%
459.GemsFDTD      38.6000    38.7000 +0.25%
465.tonto         23.7000    23.8000 +0.42%
470.lbm           56.7000    56.7000 +0%
481.wrf           28.9000    28.9000 +0%
482.sphinx3       43.9000    43.8000 -0.22%

--
Thanks, K

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-22  9:02                                           ` Kirill Yukhin
@ 2014-05-22 17:38                                             ` H.J. Lu
  2014-05-30 17:23                                               ` H.J. Lu
  0 siblings, 1 reply; 43+ messages in thread
From: H.J. Lu @ 2014-05-22 17:38 UTC (permalink / raw)
  To: Kirill Yukhin
  Cc: Uros Bizjak, Jan Hubicka, Jakub Jelinek, Eric Botcazou,
	gcc-patches, Richard Henderson

On Thu, May 22, 2014 at 2:01 AM, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
> Hello,
> On 20 May 08:24, H.J. Lu wrote:
>> ABI alignment should be sufficient for correctness. Bigger alignments
>> are supposed to give better performance.  Can you try this patch on
>> HSW and SLM to see if it has any impact on performance?
>
> Here is the perf data for your patch.
>
> Only HSW results so far.
>
> HSW, 64 bits, base
>
> Test Previous Current Ratio(%)
> 400.perlbench     37.7000    37.7000 +0%
> 401.bzip2         24.8000    24.7000 -0.40%
> 403.gcc           35.1000    35.2000 +0.28%
> 429.mcf           41.7000    42.0000 +0.71%
> 445.gobmk         26.9000    27.0000 +0.37%
> 456.hmmer         27.2000    27.2000 +0%
> 458.sjeng         30.2000    30.1000 -0.33%
> 462.libquantum    77.4000    76.7000 -0.90%
> 464.h264ref       52.5000    52.8000 +0.57%
> 471.omnetpp       23.8000    23.7000 -0.42%
> 473.astar         23.2000    23.1000 -0.43%
> 483.xalancbmk     39.8000    40.1000 +0.75%
> 410.bwaves        78.4000    78.5000 +0.12%
> 416.gamess        33.9000    33.9000 +0%
> 433.milc          34.7000    34.8000 +0.28%
> 434.zeusmp        38.6000    38.4000 -0.51%
> 435.gromacs       26.9000    27.0000 +0.37%
> 436.cactusADM     54.7000    62.0000 +13.34%
> 437.leslie3d      45.3000    45.3000 +0%
> 444.namd          27.2000    27.1000 -0.36%
> 447.dealII        56.7000    56.7000 +0%
> 450.soplex        39.3000    39.3000 +0%
> 453.povray        49.0000    49.1000 +0.20%
> 454.calculix      28.8000    29.3000 +1.73%
> 459.GemsFDTD      38.9000    39.0000 +0.25%
> 465.tonto         23.1000    23.3000 +0.86%
> 470.lbm           55.3000    55.6000 +0.54%
> 481.wrf           40.8000    40.8000 +0%
> 482.sphinx3       47.8000    47.9000 +0.20%
>
> HSW, 64 bits, o2
>
> Test Previous Current Ratio(%)
> 400.perlbench     39.7000    39.7000 +0%
> 401.bzip2         25.1000    25.1000 +0%
> 403.gcc           33.7000    33.7000 +0%
> 429.mcf           40.1000    39.9000 -0.49%
> 445.gobmk         26.5000    26.4000 -0.37%
> 456.hmmer         24.8000    24.8000 +0%
> 458.sjeng         28.4000    28.5000 +0.35%
> 462.libquantum    74.4000    74.4000 +0%
> 464.h264ref       50.1000    50.3000 +0.39%
> 471.omnetpp       22.6000    22.5000 -0.44%
> 473.astar         20.7000    21.0000 +1.44%
> 483.xalancbmk     37.0000    37.4000 +1.08%
> 410.bwaves        60.1000    60.1000 +0%
> 416.gamess        35.5000    35.4000 -0.28%
> 433.milc          29.9000    29.8000 -0.33%
> 434.zeusmp        34.8000    34.6000 -0.57%
> 435.gromacs       27.4000    27.5000 +0.36%
> 436.cactusADM     32.3000    33.8000 +4.64%
> 437.leslie3d      32.6000    32.6000 +0%
> 444.namd          26.9000    26.9000 +0%
> 447.dealII        45.2000    45.1000 -0.22%
> 450.soplex        42.5000    42.6000 +0.23%
> 453.povray        45.9000    46.1000 +0.43%
> 454.calculix      12.9000    12.9000 +0%
> 459.GemsFDTD      38.6000    38.7000 +0.25%
> 465.tonto         23.7000    23.8000 +0.42%
> 470.lbm           56.7000    56.7000 +0%
> 481.wrf           28.9000    28.9000 +0%
> 482.sphinx3       43.9000    43.8000 -0.22%
>

So extra alignment doesn't have any performance impact
on HSW with SPEC CPU 2006.  The question, then, is whether using the
alignment specified by the ABI would cause any correctness issues.
I am concerned about this comment:

  /* GCC 4.8 and earlier used to incorrectly assume this alignment even
     for symbols from other compilation units or symbols that don't need
     to bind locally.  In order to preserve some ABI compatibility with
     those compilers, ensure we don't decrease alignment from what we
     used to assume.  */

Jakub, will we run into any correctness issues if ix86_data_alignment
always returns the ABI alignment?


-- 
H.J.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.
  2014-05-22 17:38                                             ` H.J. Lu
@ 2014-05-30 17:23                                               ` H.J. Lu
  0 siblings, 0 replies; 43+ messages in thread
From: H.J. Lu @ 2014-05-30 17:23 UTC (permalink / raw)
  To: Kirill Yukhin
  Cc: Uros Bizjak, Jan Hubicka, Jakub Jelinek, Eric Botcazou,
	gcc-patches, Richard Henderson

On Thu, May 22, 2014 at 10:38 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, May 22, 2014 at 2:01 AM, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
>> Hello,
>> On 20 May 08:24, H.J. Lu wrote:
>>> ABI alignment should be sufficient for correctness. Bigger alignments
>>> are supposed to give better performance.  Can you try this patch on
>>> HSW and SLM to see if it has any impact on performance?
>>
>> Here is the perf data for your patch.
>>
>> Only HSW results so far.
>>
>> HSW, 64 bits, base
>>
>> Test Previous Current Ratio(%)
>> 400.perlbench     37.7000    37.7000 +0%
>> 401.bzip2         24.8000    24.7000 -0.40%
>> 403.gcc           35.1000    35.2000 +0.28%
>> 429.mcf           41.7000    42.0000 +0.71%
>> 445.gobmk         26.9000    27.0000 +0.37%
>> 456.hmmer         27.2000    27.2000 +0%
>> 458.sjeng         30.2000    30.1000 -0.33%
>> 462.libquantum    77.4000    76.7000 -0.90%
>> 464.h264ref       52.5000    52.8000 +0.57%
>> 471.omnetpp       23.8000    23.7000 -0.42%
>> 473.astar         23.2000    23.1000 -0.43%
>> 483.xalancbmk     39.8000    40.1000 +0.75%
>> 410.bwaves        78.4000    78.5000 +0.12%
>> 416.gamess        33.9000    33.9000 +0%
>> 433.milc          34.7000    34.8000 +0.28%
>> 434.zeusmp        38.6000    38.4000 -0.51%
>> 435.gromacs       26.9000    27.0000 +0.37%
>> 436.cactusADM     54.7000    62.0000 +13.34%
>> 437.leslie3d      45.3000    45.3000 +0%
>> 444.namd          27.2000    27.1000 -0.36%
>> 447.dealII        56.7000    56.7000 +0%
>> 450.soplex        39.3000    39.3000 +0%
>> 453.povray        49.0000    49.1000 +0.20%
>> 454.calculix      28.8000    29.3000 +1.73%
>> 459.GemsFDTD      38.9000    39.0000 +0.25%
>> 465.tonto         23.1000    23.3000 +0.86%
>> 470.lbm           55.3000    55.6000 +0.54%
>> 481.wrf           40.8000    40.8000 +0%
>> 482.sphinx3       47.8000    47.9000 +0.20%
>>
>> HSW, 64 bits, o2
>>
>> Test Previous Current Ratio(%)
>> 400.perlbench     39.7000    39.7000 +0%
>> 401.bzip2         25.1000    25.1000 +0%
>> 403.gcc           33.7000    33.7000 +0%
>> 429.mcf           40.1000    39.9000 -0.49%
>> 445.gobmk         26.5000    26.4000 -0.37%
>> 456.hmmer         24.8000    24.8000 +0%
>> 458.sjeng         28.4000    28.5000 +0.35%
>> 462.libquantum    74.4000    74.4000 +0%
>> 464.h264ref       50.1000    50.3000 +0.39%
>> 471.omnetpp       22.6000    22.5000 -0.44%
>> 473.astar         20.7000    21.0000 +1.44%
>> 483.xalancbmk     37.0000    37.4000 +1.08%
>> 410.bwaves        60.1000    60.1000 +0%
>> 416.gamess        35.5000    35.4000 -0.28%
>> 433.milc          29.9000    29.8000 -0.33%
>> 434.zeusmp        34.8000    34.6000 -0.57%
>> 435.gromacs       27.4000    27.5000 +0.36%
>> 436.cactusADM     32.3000    33.8000 +4.64%
>> 437.leslie3d      32.6000    32.6000 +0%
>> 444.namd          26.9000    26.9000 +0%
>> 447.dealII        45.2000    45.1000 -0.22%
>> 450.soplex        42.5000    42.6000 +0.23%
>> 453.povray        45.9000    46.1000 +0.43%
>> 454.calculix      12.9000    12.9000 +0%
>> 459.GemsFDTD      38.6000    38.7000 +0.25%
>> 465.tonto         23.7000    23.8000 +0.42%
>> 470.lbm           56.7000    56.7000 +0%
>> 481.wrf           28.9000    28.9000 +0%
>> 482.sphinx3       43.9000    43.8000 -0.22%
>>
>
> So extra alignment doesn't have any performance impact
> on HSW with SPEC CPU 2006.  The question, then, is whether using the
> alignment specified by the ABI would cause any correctness issues.
> I am concerned about this comment:
>
>   /* GCC 4.8 and earlier used to incorrectly assume this alignment even
>      for symbols from other compilation units or symbols that don't need
>      to bind locally.  In order to preserve some ABI compatibility with
>      those compilers, ensure we don't decrease alignment from what we
>      used to assume.  */
>
> Jakub, will we run into any correctness issues if ix86_data_alignment
> always returns the ABI alignment?
>

The ABI issue is

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56564

That is, we shouldn't use alignment beyond the ABI if the definition
may come from a different compilation unit.  But DATA_ALIGNMENT
is only for optimization purposes.  I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61296

for excessive x86 data alignment.  It looks like that started as
an accident with:

https://gcc.gnu.org/ml/gcc-patches/2000-06/msg00871.html

which aligned structs >= 32 bytes to 32 bytes.  But alignment beyond
natural alignment doesn't provide a performance benefit.  Should
we limit DATA_ALIGNMENT to MAX (ABI alignment, natural alignment)?
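
For illustration, the proposed cap could look like the sketch below;
the names are hypothetical, not actual GCC code:

/* Hypothetical sketch of the proposed rule: the optimization-only
   alignment bump never exceeds MAX (ABI alignment, natural alignment).  */
static int
capped_data_alignment (int abi_align, int natural_align, int computed)
{
  int cap = abi_align > natural_align ? abi_align : natural_align;
  return computed > cap ? cap : computed;
}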


-- 
H.J.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread (last message: 2014-05-30 17:23 UTC)

Thread overview: 43+ messages
2013-11-12 13:56 [PATCH i386 5/8] [AVX-512] Extend vectorizer hooks Kirill Yukhin
2013-11-15 18:21 ` Kirill Yukhin
2013-11-19 10:28   ` Kirill Yukhin
2013-12-02 13:16     ` Kirill Yukhin
2013-12-18 13:08       ` Kirill Yukhin
2013-12-22 10:47         ` Uros Bizjak
2013-12-22 12:52           ` Jakub Jelinek
2013-12-30 11:00           ` Kirill Yukhin
2014-01-01 23:08             ` Eric Botcazou
2014-01-02 10:53               ` Kirill Yukhin
2014-01-02 15:12                 ` Eric Botcazou
2014-01-02 19:50                   ` Jan Hubicka
2014-01-02 21:56                     ` Eric Botcazou
2014-01-02 22:16                       ` Jan Hubicka
2014-01-02 22:18               ` Eric Botcazou
2014-01-03 11:03                 ` Uros Bizjak
2014-01-03 11:20                   ` Eric Botcazou
2014-01-03 11:25                     ` Uros Bizjak
2014-01-03 12:00                       ` Jakub Jelinek
2014-01-03 12:27                         ` Uros Bizjak
2014-01-03 13:35                           ` Uros Bizjak
2014-01-03 13:43                             ` Jakub Jelinek
2014-01-03 14:02                               ` Uros Bizjak
2014-01-03 14:13                                 ` Jakub Jelinek
2014-01-03 14:35                                   ` Uros Bizjak
2014-01-03 14:42                                     ` Jakub Jelinek
2014-01-03 16:04                                 ` Uros Bizjak
2014-01-14 17:09                                   ` Jakub Jelinek
2014-01-14 18:37                                     ` Uros Bizjak
2014-01-14 19:00                                       ` H.J. Lu
2014-01-14 19:12                                       ` Jakub Jelinek
2014-01-17 14:15                                         ` Jakub Jelinek
2014-01-17 14:25                                           ` Uros Bizjak
2014-05-19  4:48                             ` Jan Hubicka
2014-05-19 16:14                               ` Uros Bizjak
2014-05-19 16:42                                 ` H.J. Lu
2014-05-19 16:45                                   ` Uros Bizjak
2014-05-19 16:58                                     ` H.J. Lu
2014-05-20 12:00                                       ` Kirill Yukhin
2014-05-20 15:24                                         ` H.J. Lu
2014-05-22  9:02                                           ` Kirill Yukhin
2014-05-22 17:38                                             ` H.J. Lu
2014-05-30 17:23                                               ` H.J. Lu
