public inbox for gcc-patches@gcc.gnu.org
* [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
@ 2020-08-31  9:06 Xiong Hu Luo
  2020-08-31 12:43 ` Richard Biener
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Xiong Hu Luo @ 2020-08-31  9:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, Xiong Hu Luo

vec_insert takes three arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the position at which to insert arg1 into arg0.
This patch adds __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for
vec_insert so that it is not expanded too early at the gimple stage when arg2
is a variable, avoiding the generation of store-hit-load sequences.
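The store-hit-load pattern being avoided can be sketched in C using GCC's
generic vector extension (a simplified model; the function name and the
union-based spill are illustrative, not what the compiler literally builds):

```c
typedef int v4si __attribute__ ((vector_size (16)));

/* Sketch of what early gimple expansion of a variable-index vec_insert
   lowers to: spill the vector to the stack, store the scalar element,
   then reload the whole vector.  The wide reload depends on the
   just-executed narrow store, causing a store-hit-load stall.  */
v4si
naive_insert (v4si v, int val, unsigned long idx)
{
  union { v4si v; int a[4]; } u;  /* stack temporary          */
  u.v = v;                        /* stxv  34,-16(1)          */
  u.a[idx & 3] = val;             /* stwx  5,9,6              */
  return u.v;                     /* lxv   34,-16(1) -- stall */
}
```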

For Power9 V4SI:
	addi 9,1,-16
	rldic 6,6,2,60
	stxv 34,-16(1)
	stwx 5,9,6
	lxv 34,-16(1)
=>
	addis 9,2,.LC0@toc@ha
	addi 9,9,.LC0@toc@l
	mtvsrwz 33,5
	lxv 32,0(9)
	sradi 9,6,2
	addze 9,9
	sldi 9,9,2
	subf 9,9,6
	subfic 9,9,3
	sldi 9,9,2
	subfic 9,9,20
	lvsl 13,0,9
	xxperm 33,33,45
	xxperm 32,32,45
	xxsel 34,34,33,32

Though the instruction count increases from 5 to 15, performance improves by
60% in typical cases.
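The branch-free replacement sequence can be modeled in portable scalar C:
build a byte mask over the target element, place the scalar's bytes in the
same slot, then do a bytewise select.  This is a sketch of the idea only --
byte positions here are plain memory order, not the exact Power register
numbering the lvsl/xxperm constants encode, and insert_u32 is a hypothetical
helper name:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the sequence above: the mask plays the role of the
   permuted lxv constant, placing val's bytes mimics mtvsrwz + xxperm,
   and the final loop is the xxsel merge.  */
static void
insert_u32 (uint32_t vec[4], uint32_t val, unsigned idx)
{
  uint8_t mask[16] = { 0 };
  uint8_t valb[16] = { 0 };
  idx &= 3;                          /* same wrap the patch applies via BIT_AND_EXPR */
  memset (mask + 4 * idx, 0xff, 4);  /* 0xff over element idx's four bytes */
  memcpy (valb + 4 * idx, &val, 4);  /* scalar rotated into the element slot */
  uint8_t *vb = (uint8_t *) vec;
  for (int i = 0; i < 16; i++)       /* xxsel: mask ? valb : vec */
    vb[i] = (uint8_t) ((mask[i] & valb[i]) | (~mask[i] & vb[i]));
}
```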

gcc/ChangeLog:

	* config/rs6000/altivec.md (altivec_lvsl_reg_<mode>2): Extend to
	SDI mode.
	* config/rs6000/rs6000-builtin.def (BU_VSX_X): Add support
	macros for vec_insert built-in functions.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Generate built-in calls for vec_insert.
	* config/rs6000/rs6000-call.c (altivec_expand_vec_insert_builtin):
	New function.
	(altivec_expand_builtin): Add case entry for
	VSX_BUILTIN_VEC_INSERT_V16QI, VSX_BUILTIN_VEC_INSERT_V8HI,
	VSX_BUILTIN_VEC_INSERT_V4SF,  VSX_BUILTIN_VEC_INSERT_V4SI,
	VSX_BUILTIN_VEC_INSERT_V2DF,  VSX_BUILTIN_VEC_INSERT_V2DI.
	(altivec_init_builtins): Define the vec_insert built-in functions.
	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_insert):
	New declaration.
	* config/rs6000/rs6000.c (rs6000_expand_vector_insert):
	New function.
	* config/rs6000/rs6000.md (FQHS): New mode iterator.
	(FD): New mode iterator.
	p8_mtvsrwz_v16qi<mode>2: New define_insn.
	p8_mtvsrd_v16qi<mode>2: New define_insn.
	* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/pr79251.c: New test.
---
 gcc/config/rs6000/altivec.md               |   4 +-
 gcc/config/rs6000/rs6000-builtin.def       |   6 +
 gcc/config/rs6000/rs6000-c.c               |  61 +++++++++
 gcc/config/rs6000/rs6000-call.c            |  74 +++++++++++
 gcc/config/rs6000/rs6000-protos.h          |   1 +
 gcc/config/rs6000/rs6000.c                 | 146 +++++++++++++++++++++
 gcc/config/rs6000/rs6000.md                |  19 +++
 gcc/config/rs6000/vsx.md                   |   2 +-
 gcc/testsuite/gcc.target/powerpc/pr79251.c |  23 ++++
 9 files changed, 333 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.c

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 0a2e634d6b0..66b636059a6 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2772,10 +2772,10 @@
   DONE;
 })
 
-(define_insn "altivec_lvsl_reg"
+(define_insn "altivec_lvsl_reg_<mode>2"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
 	(unspec:V16QI
-	[(match_operand:DI 1 "gpc_reg_operand" "b")]
+	[(match_operand:SDI 1 "gpc_reg_operand" "b")]
 	UNSPEC_LVSL_REG))]
   "TARGET_ALTIVEC"
   "lvsl %0,0,%1"
diff --git a/gcc/config/rs6000/rs6000-builtin.def b/gcc/config/rs6000/rs6000-builtin.def
index f9f0fece549..d095b365c14 100644
--- a/gcc/config/rs6000/rs6000-builtin.def
+++ b/gcc/config/rs6000/rs6000-builtin.def
@@ -2047,6 +2047,12 @@ BU_VSX_X (VEC_INIT_V2DI,      "vec_init_v2di",	CONST)
 BU_VSX_X (VEC_SET_V1TI,	      "vec_set_v1ti",	CONST)
 BU_VSX_X (VEC_SET_V2DF,	      "vec_set_v2df",	CONST)
 BU_VSX_X (VEC_SET_V2DI,	      "vec_set_v2di",	CONST)
+BU_VSX_X (VEC_INSERT_V16QI,	      "vec_insert_v16qi",	CONST)
+BU_VSX_X (VEC_INSERT_V8HI,	      "vec_insert_v8hi",	CONST)
+BU_VSX_X (VEC_INSERT_V4SI,	      "vec_insert_v4si",	CONST)
+BU_VSX_X (VEC_INSERT_V4SF,	      "vec_insert_v4sf",	CONST)
+BU_VSX_X (VEC_INSERT_V2DI,	      "vec_insert_v2di",	CONST)
+BU_VSX_X (VEC_INSERT_V2DF,	      "vec_insert_v2df",	CONST)
 BU_VSX_X (VEC_EXT_V1TI,	      "vec_ext_v1ti",	CONST)
 BU_VSX_X (VEC_EXT_V2DF,	      "vec_ext_v2df",	CONST)
 BU_VSX_X (VEC_EXT_V2DI,	      "vec_ext_v2di",	CONST)
diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 2fad3d94706..03b00738a5e 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1563,6 +1563,67 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
 	  return build_call_expr (call, 3, arg1, arg0, arg2);
 	}
 
+      else if (VECTOR_MEM_VSX_P (mode))
+	{
+	  tree call = NULL_TREE;
+
+	  arg2 = fold_for_warn (arg2);
+
+	  /* If the second argument is variable, we can optimize it if we are
+	     generating 64-bit code on a machine with direct move.  */
+	  if (TREE_CODE (arg2) != INTEGER_CST && TARGET_DIRECT_MOVE_64BIT)
+	    {
+	      switch (mode)
+		{
+		default:
+		  break;
+
+		case E_V2DImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V2DI];
+		  break;
+
+		case E_V2DFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V2DF];
+		  break;
+
+		case E_V4SFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V4SF];
+		  break;
+
+		case E_V4SImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V4SI];
+		  break;
+
+		case E_V8HImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V8HI];
+		  break;
+
+		case E_V16QImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V16QI];
+		  break;
+		}
+	    }
+
+	  if (call)
+	    {
+	      if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
+		arg2 = build_int_cst (TREE_TYPE (arg2), 0);
+	      else
+		arg2 = build_binary_op (
+		  loc, BIT_AND_EXPR, arg2,
+		  build_int_cst (TREE_TYPE (arg2),
+				 TYPE_VECTOR_SUBPARTS (arg1_type) - 1),
+		  0);
+	      tree result
+		= build_call_expr (call, 3, arg1,
+				   convert (TREE_TYPE (arg1_type), arg0),
+				   convert (integer_type_node, arg2));
+	      /* Coerce the result to vector element type.  May be no-op.  */
+	      result = fold_convert (TREE_TYPE (arg1), result);
+	      return result;
+	    }
+	}
+
       /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
       arg1_inner_type = TREE_TYPE (arg1_type);
       if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
index e39cfcf672b..339e9ae87e3 100644
--- a/gcc/config/rs6000/rs6000-call.c
+++ b/gcc/config/rs6000/rs6000-call.c
@@ -10660,6 +10660,40 @@ altivec_expand_vec_set_builtin (tree exp)
   return op0;
 }
 
+/* Expand vec_insert builtin.  */
+static rtx
+altivec_expand_vec_insert_builtin (tree exp, rtx target)
+{
+  machine_mode tmode, mode1, mode2;
+  tree arg0, arg1, arg2;
+  rtx op0 = NULL_RTX, op1, op2;
+
+  arg0 = CALL_EXPR_ARG (exp, 0);
+  arg1 = CALL_EXPR_ARG (exp, 1);
+  arg2 = CALL_EXPR_ARG (exp, 2);
+
+  tmode = TYPE_MODE (TREE_TYPE (arg0));
+  mode1 = TYPE_MODE (TREE_TYPE (TREE_TYPE (arg0)));
+  mode2 = TYPE_MODE ((TREE_TYPE (arg2)));
+  gcc_assert (VECTOR_MODE_P (tmode));
+
+  op0 = expand_expr (arg0, NULL_RTX, tmode, EXPAND_NORMAL);
+  op1 = expand_expr (arg1, NULL_RTX, mode1, EXPAND_NORMAL);
+  op2 = expand_expr (arg2, NULL_RTX, mode2, EXPAND_NORMAL);
+
+  if (GET_MODE (op1) != mode1 && GET_MODE (op1) != VOIDmode)
+    op1 = convert_modes (mode1, GET_MODE (op1), op1, true);
+
+  op0 = force_reg (tmode, op0);
+  op1 = force_reg (mode1, op1);
+  op2 = force_reg (mode2, op2);
+
+  target = gen_reg_rtx (V16QImode);
+  rs6000_expand_vector_insert (target, op0, op1, op2);
+
+  return target;
+}
+
 /* Expand vec_ext builtin.  */
 static rtx
 altivec_expand_vec_ext_builtin (tree exp, rtx target)
@@ -10922,6 +10956,14 @@ altivec_expand_builtin (tree exp, rtx target, bool *expandedp)
     case VSX_BUILTIN_VEC_SET_V1TI:
       return altivec_expand_vec_set_builtin (exp);
 
+    case VSX_BUILTIN_VEC_INSERT_V16QI:
+    case VSX_BUILTIN_VEC_INSERT_V8HI:
+    case VSX_BUILTIN_VEC_INSERT_V4SF:
+    case VSX_BUILTIN_VEC_INSERT_V4SI:
+    case VSX_BUILTIN_VEC_INSERT_V2DF:
+    case VSX_BUILTIN_VEC_INSERT_V2DI:
+      return altivec_expand_vec_insert_builtin (exp, target);
+
     case ALTIVEC_BUILTIN_VEC_EXT_V4SI:
     case ALTIVEC_BUILTIN_VEC_EXT_V8HI:
     case ALTIVEC_BUILTIN_VEC_EXT_V16QI:
@@ -13681,6 +13723,38 @@ altivec_init_builtins (void)
 				    integer_type_node, NULL_TREE);
   def_builtin ("__builtin_vec_set_v2di", ftype, VSX_BUILTIN_VEC_SET_V2DI);
 
+  /* Access to the vec_insert patterns.  */
+  ftype = build_function_type_list (V16QI_type_node, V16QI_type_node,
+				    intQI_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v16qi", ftype,
+	       VSX_BUILTIN_VEC_INSERT_V16QI);
+
+  ftype = build_function_type_list (V8HI_type_node, V8HI_type_node,
+				    intHI_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v8hi", ftype, VSX_BUILTIN_VEC_INSERT_V8HI);
+
+  ftype = build_function_type_list (V4SI_type_node, V4SI_type_node,
+				    integer_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v4si", ftype, VSX_BUILTIN_VEC_INSERT_V4SI);
+
+  ftype = build_function_type_list (V4SF_type_node, V4SF_type_node,
+				    float_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v4sf", ftype, VSX_BUILTIN_VEC_INSERT_V4SF);
+
+  ftype = build_function_type_list (V2DI_type_node, V2DI_type_node,
+				    intDI_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v2di", ftype, VSX_BUILTIN_VEC_INSERT_V2DI);
+
+  ftype = build_function_type_list (V2DF_type_node, V2DF_type_node,
+				    double_type_node,
+				    integer_type_node, NULL_TREE);
+  def_builtin ("__builtin_vec_insert_v2df", ftype, VSX_BUILTIN_VEC_INSERT_V2DF);
+
   /* Access to the vec_extract patterns.  */
   ftype = build_function_type_list (intSI_type_node, V4SI_type_node,
 				    integer_type_node, NULL_TREE);
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 28e859f4381..78b5b31d79f 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -58,6 +58,7 @@ extern bool rs6000_split_128bit_ok_p (rtx []);
 extern void rs6000_expand_float128_convert (rtx, rtx, bool);
 extern void rs6000_expand_vector_init (rtx, rtx);
 extern void rs6000_expand_vector_set (rtx, rtx, int);
+extern void rs6000_expand_vector_insert (rtx, rtx, rtx, rtx);
 extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
 extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
 extern rtx rs6000_adjust_vec_address (rtx, rtx, rtx, rtx, machine_mode);
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index fe93cf6ff2b..afa845f3dff 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -6788,6 +6788,152 @@ rs6000_expand_vector_set (rtx target, rtx val, int elt)
   emit_insn (gen_rtx_SET (target, x));
 }
 
+/* Insert value from VEC into idx of TARGET.  */
+
+void
+rs6000_expand_vector_insert (rtx target, rtx vec, rtx val, rtx idx)
+{
+  machine_mode mode = GET_MODE (vec);
+
+  if (VECTOR_MEM_VSX_P (mode) && CONST_INT_P (idx))
+      gcc_unreachable ();
+  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx)
+	   && TARGET_DIRECT_MOVE_64BIT)
+    {
+      gcc_assert (GET_MODE (idx) == E_SImode);
+      machine_mode inner_mode = GET_MODE (val);
+      HOST_WIDE_INT mode_mask = GET_MODE_MASK (inner_mode);
+
+      rtx tmp = gen_reg_rtx (GET_MODE (idx));
+      if (GET_MODE_SIZE (inner_mode) == 8)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 1 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (1), idx));
+	      /*  idx = idx * 8.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (3)));
+	      /*  idx = 16 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
+	    }
+	  else
+	    {
+	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (3)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
+	    }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 4)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 3 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (3), idx));
+	      /*  idx = idx * 4.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (2)));
+	      /*  idx = 20 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
+	    }
+	  else
+	  {
+	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (2)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
+	  }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 2)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 7 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (7), idx));
+	      /*  idx = idx * 2.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
+	      /*  idx = 22 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), tmp));
+	    }
+	  else
+	    {
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), idx));
+	    }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 1)
+	if (!BYTES_BIG_ENDIAN)
+	  emit_insn (gen_addsi3 (tmp, idx, GEN_INT (8)));
+	else
+	  emit_insn (gen_subsi3 (tmp, GEN_INT (23), idx));
+      else
+	gcc_unreachable ();
+
+      /*  lxv vs32, mask.
+	  DImode: 0xffffffffffffffff0000000000000000
+	  SImode: 0x00000000ffffffff0000000000000000
+	  HImode: 0x000000000000ffff0000000000000000.
+	  QImode: 0x00000000000000ff0000000000000000.  */
+      rtx mask = gen_reg_rtx (V16QImode);
+      rtx mask_v2di = gen_reg_rtx (V2DImode);
+      rtvec v = rtvec_alloc (2);
+      if (!BYTES_BIG_ENDIAN)
+	{
+	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, 0);
+	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, mode_mask);
+	}
+      else
+      {
+	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, mode_mask);
+	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, 0);
+	}
+      emit_insn (
+	gen_vec_initv2didi (mask_v2di, gen_rtx_PARALLEL (V2DImode, v)));
+      rtx sub_mask = simplify_gen_subreg (V16QImode, mask_v2di, V2DImode, 0);
+      emit_insn (gen_rtx_SET (mask, sub_mask));
+
+      /*  mtvsrd[wz] f0,val.  */
+      rtx val_v16qi = gen_reg_rtx (V16QImode);
+      switch (inner_mode)
+	{
+	default:
+	  gcc_unreachable ();
+	  break;
+	case E_QImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qiqi2 (val_v16qi, val));
+	  break;
+	case E_HImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qihi2 (val_v16qi, val));
+	  break;
+	case E_SImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qisi2 (val_v16qi, val));
+	  break;
+	case E_SFmode:
+	  emit_insn (gen_p8_mtvsrwz_v16qisf2 (val_v16qi, val));
+	  break;
+	case E_DImode:
+	  emit_insn (gen_p8_mtvsrd_v16qidi2 (val_v16qi, val));
+	  break;
+	case E_DFmode:
+	  emit_insn (gen_p8_mtvsrd_v16qidf2 (val_v16qi, val));
+	  break;
+	}
+
+      /*  lvsl    v1,0,idx.  */
+      rtx pcv = gen_reg_rtx (V16QImode);
+      emit_insn (gen_altivec_lvsl_reg_si2 (pcv, tmp));
+
+      /*  xxperm  vs0,vs0,vs33.  */
+      /*  xxperm  vs32,vs32,vs33.  */
+      rtx val_perm = gen_reg_rtx (V16QImode);
+      rtx mask_perm = gen_reg_rtx (V16QImode);
+      emit_insn (
+	gen_altivec_vperm_v8hiv16qi (val_perm, val_v16qi, val_v16qi, pcv));
+      emit_insn (gen_altivec_vperm_v8hiv16qi (mask_perm, mask, mask, pcv));
+
+      rtx sub_target = simplify_gen_subreg (V16QImode, vec, mode, 0);
+      emit_insn (gen_rtx_SET (target, sub_target));
+
+      /*  xxsel   vs34,vs34,vs0,vs32.  */
+      emit_insn (gen_vector_select_v16qi (target, target, val_perm, mask_perm));
+    }
+}
+
 /* Extract field ELT from VEC into TARGET.  */
 
 void
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 43b620ae1c0..b02fda836d4 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -8713,6 +8713,25 @@
   "mtvsrwz %x0,%1"
   [(set_attr "type" "mftgpr")])
 
+(define_mode_iterator FQHS [SF QI HI SI])
+(define_mode_iterator FD [DF DI])
+
+(define_insn "p8_mtvsrwz_v16qi<mode>2"
+  [(set (match_operand:V16QI 0 "register_operand" "=wa")
+	(unspec:V16QI [(match_operand:FQHS 1 "register_operand" "r")]
+		   UNSPEC_P8V_MTVSRWZ))]
+  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
+  "mtvsrwz %x0,%1"
+  [(set_attr "type" "mftgpr")])
+
+(define_insn "p8_mtvsrd_v16qi<mode>2"
+  [(set (match_operand:V16QI 0 "register_operand" "=wa")
+	(unspec:V16QI [(match_operand:FD 1 "register_operand" "r")]
+		   UNSPEC_P8V_MTVSRD))]
+  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
+  "mtvsrd %x0,%1"
+  [(set_attr "type" "mftgpr")])
+
 (define_insn_and_split "reload_fpr_from_gpr<mode>"
   [(set (match_operand:FMOVE64X 0 "register_operand" "=d")
 	(unspec:FMOVE64X [(match_operand:FMOVE64X 1 "register_operand" "r")]
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index dd750210758..7e82690d12d 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5349,7 +5349,7 @@
   rtx rtx_vtmp = gen_reg_rtx (V16QImode);
   rtx tmp = gen_reg_rtx (DImode);
 
-  emit_insn (gen_altivec_lvsl_reg (shift_mask, operands[2]));
+  emit_insn (gen_altivec_lvsl_reg_di2 (shift_mask, operands[2]));
   emit_insn (gen_ashldi3 (tmp, operands[2], GEN_INT (56)));
   emit_insn (gen_lxvll (rtx_vtmp, operands[1], tmp));
   emit_insn (gen_altivec_vperm_v8hiv16qi (operands[0], rtx_vtmp, rtx_vtmp,
diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251.c b/gcc/testsuite/gcc.target/powerpc/pr79251.c
new file mode 100644
index 00000000000..877659a0146
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr79251.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -maltivec" } */
+
+#include <stddef.h>
+#include <altivec.h>
+
+#define TYPE int
+  
+__attribute__ ((noinline))
+vector TYPE test (vector TYPE v, TYPE i, size_t n)
+{
+  vector TYPE v1 = v;
+  v1 = vec_insert (i, v, n);
+
+  return v1;
+}
+
+/* { dg-final { scan-assembler-not {\mstxw\M} } } */
+/* { dg-final { scan-assembler-times {\mlvsl\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mxxperm\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxxsel\M} 1 } } */
-- 
2.27.0.90.geebb51ba8c



* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-08-31  9:06 [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251] Xiong Hu Luo
@ 2020-08-31 12:43 ` Richard Biener
  2020-08-31 16:47 ` will schmidt
  2020-08-31 17:04 ` Segher Boessenkool
  2 siblings, 0 replies; 43+ messages in thread
From: Richard Biener @ 2020-08-31 12:43 UTC (permalink / raw)
  To: Xiong Hu Luo, Richard Sandiford
  Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, linkw, David Edelsohn

On Mon, Aug 31, 2020 at 11:09 AM Xiong Hu Luo via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
> to be insert, arg2 is the place to insert arg1 to arg0.  This patch adds
> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to
> not expand too early in gimple stage if arg2 is variable, to avoid generate
> store hit load instructions.
>
> For Power9 V4SI:
>         addi 9,1,-16
>         rldic 6,6,2,60
>         stxv 34,-16(1)
>         stwx 5,9,6
>         lxv 34,-16(1)
> =>
>         addis 9,2,.LC0@toc@ha
>         addi 9,9,.LC0@toc@l
>         mtvsrwz 33,5
>         lxv 32,0(9)
>         sradi 9,6,2
>         addze 9,9
>         sldi 9,9,2
>         subf 9,9,6
>         subfic 9,9,3
>         sldi 9,9,2
>         subfic 9,9,20
>         lvsl 13,0,9
>         xxperm 33,33,45
>         xxperm 32,32,45
>         xxsel 34,34,33,32
>
> Though instructions increase from 5 to 15, the performance is improved
> 60% in typical cases.

Not sure if it is already possible but maybe use internal functions for
those purely internal builtins instead?  That makes it possible
to overload with a single IFN.

Richard.

> +             /*  idx = 22 - idx.  */
> +             emit_insn (gen_subsi3 (tmp, GEN_INT (22), tmp));
> +           }
> +         else
> +           {
> +             emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
> +             emit_insn (gen_subsi3 (tmp, GEN_INT (22), idx));
> +           }
> +       }
> +      else if (GET_MODE_SIZE (inner_mode) == 1)
> +       if (!BYTES_BIG_ENDIAN)
> +         emit_insn (gen_addsi3 (tmp, idx, GEN_INT (8)));
> +       else
> +         emit_insn (gen_subsi3 (tmp, GEN_INT (23), idx));
> +      else
> +       gcc_unreachable ();
> +
> +      /*  lxv vs32, mask.
> +         DImode: 0xffffffffffffffff0000000000000000
> +         SImode: 0x00000000ffffffff0000000000000000
> +         HImode: 0x000000000000ffff0000000000000000.
> +         QImode: 0x00000000000000ff0000000000000000.  */
> +      rtx mask = gen_reg_rtx (V16QImode);
> +      rtx mask_v2di = gen_reg_rtx (V2DImode);
> +      rtvec v = rtvec_alloc (2);
> +      if (!BYTES_BIG_ENDIAN)
> +       {
> +         RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, 0);
> +         RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, mode_mask);
> +       }
> +      else
> +      {
> +         RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, mode_mask);
> +         RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, 0);
> +       }
> +      emit_insn (
> +       gen_vec_initv2didi (mask_v2di, gen_rtx_PARALLEL (V2DImode, v)));
> +      rtx sub_mask = simplify_gen_subreg (V16QImode, mask_v2di, V2DImode, 0);
> +      emit_insn (gen_rtx_SET (mask, sub_mask));
> +
> +      /*  mtvsrd[wz] f0,val.  */
> +      rtx val_v16qi = gen_reg_rtx (V16QImode);
> +      switch (inner_mode)
> +       {
> +       default:
> +         gcc_unreachable ();
> +         break;
> +       case E_QImode:
> +         emit_insn (gen_p8_mtvsrwz_v16qiqi2 (val_v16qi, val));
> +         break;
> +       case E_HImode:
> +         emit_insn (gen_p8_mtvsrwz_v16qihi2 (val_v16qi, val));
> +         break;
> +       case E_SImode:
> +         emit_insn (gen_p8_mtvsrwz_v16qisi2 (val_v16qi, val));
> +         break;
> +       case E_SFmode:
> +         emit_insn (gen_p8_mtvsrwz_v16qisf2 (val_v16qi, val));
> +         break;
> +       case E_DImode:
> +         emit_insn (gen_p8_mtvsrd_v16qidi2 (val_v16qi, val));
> +         break;
> +       case E_DFmode:
> +         emit_insn (gen_p8_mtvsrd_v16qidf2 (val_v16qi, val));
> +         break;
> +       }
> +
> +      /*  lvsl    v1,0,idx.  */
> +      rtx pcv = gen_reg_rtx (V16QImode);
> +      emit_insn (gen_altivec_lvsl_reg_si2 (pcv, tmp));
> +
> +      /*  xxperm  vs0,vs0,vs33.  */
> +      /*  xxperm  vs32,vs32,vs33.  */
> +      rtx val_perm = gen_reg_rtx (V16QImode);
> +      rtx mask_perm = gen_reg_rtx (V16QImode);
> +      emit_insn (
> +       gen_altivec_vperm_v8hiv16qi (val_perm, val_v16qi, val_v16qi, pcv));
> +      emit_insn (gen_altivec_vperm_v8hiv16qi (mask_perm, mask, mask, pcv));
> +
> +      rtx sub_target = simplify_gen_subreg (V16QImode, vec, mode, 0);
> +      emit_insn (gen_rtx_SET (target, sub_target));
> +
> +      /*  xxsel   vs34,vs34,vs0,vs32.  */
> +      emit_insn (gen_vector_select_v16qi (target, target, val_perm, mask_perm));
> +    }
> +}
> +
>  /* Extract field ELT from VEC into TARGET.  */
>
>  void
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index 43b620ae1c0..b02fda836d4 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -8713,6 +8713,25 @@
>    "mtvsrwz %x0,%1"
>    [(set_attr "type" "mftgpr")])
>
> +(define_mode_iterator FQHS [SF QI HI SI])
> +(define_mode_iterator FD [DF DI])
> +
> +(define_insn "p8_mtvsrwz_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +       (unspec:V16QI [(match_operand:FQHS 1 "register_operand" "r")]
> +                  UNSPEC_P8V_MTVSRWZ))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrwz %x0,%1"
> +  [(set_attr "type" "mftgpr")])
> +
> +(define_insn "p8_mtvsrd_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +       (unspec:V16QI [(match_operand:FD 1 "register_operand" "r")]
> +                  UNSPEC_P8V_MTVSRD))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrd %x0,%1"
> +  [(set_attr "type" "mftgpr")])
> +
>  (define_insn_and_split "reload_fpr_from_gpr<mode>"
>    [(set (match_operand:FMOVE64X 0 "register_operand" "=d")
>         (unspec:FMOVE64X [(match_operand:FMOVE64X 1 "register_operand" "r")]
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index dd750210758..7e82690d12d 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5349,7 +5349,7 @@
>    rtx rtx_vtmp = gen_reg_rtx (V16QImode);
>    rtx tmp = gen_reg_rtx (DImode);
>
> -  emit_insn (gen_altivec_lvsl_reg (shift_mask, operands[2]));
> +  emit_insn (gen_altivec_lvsl_reg_di2 (shift_mask, operands[2]));
>    emit_insn (gen_ashldi3 (tmp, operands[2], GEN_INT (56)));
>    emit_insn (gen_lxvll (rtx_vtmp, operands[1], tmp));
>    emit_insn (gen_altivec_vperm_v8hiv16qi (operands[0], rtx_vtmp, rtx_vtmp,
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251.c b/gcc/testsuite/gcc.target/powerpc/pr79251.c
> new file mode 100644
> index 00000000000..877659a0146
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr79251.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-require-effective-target lp64 } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power9 -maltivec" } */
> +
> +#include <stddef.h>
> +#include <altivec.h>
> +
> +#define TYPE int
> +
> +__attribute__ ((noinline))
> +vector TYPE test (vector TYPE v, TYPE i, size_t n)
> +{
> +  vector TYPE v1 = v;
> +  v1 = vec_insert (i, v, n);
> +
> +  return v1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\mstxw\M} } } */
> +/* { dg-final { scan-assembler-times {\mlvsl\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxxperm\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxxsel\M} 1 } } */
> --
> 2.27.0.90.geebb51ba8c
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-08-31  9:06 [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251] Xiong Hu Luo
  2020-08-31 12:43 ` Richard Biener
@ 2020-08-31 16:47 ` will schmidt
  2020-09-01 11:43   ` luoxhu
  2020-08-31 17:04 ` Segher Boessenkool
  2 siblings, 1 reply; 43+ messages in thread
From: will schmidt @ 2020-08-31 16:47 UTC (permalink / raw)
  To: Xiong Hu Luo, gcc-patches; +Cc: segher, wschmidt, linkw, dje.gcc

On Mon, 2020-08-31 at 04:06 -0500, Xiong Hu Luo via Gcc-patches wrote:
> vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
> to be inserted, and arg2 is the position at which to insert arg1 into arg0.
> This patch adds __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for
> vec_insert so that it is not expanded too early in the gimple stage when
> arg2 is variable, to avoid generating store-hit-load instructions.
> 
> For Power9 V4SI:
> 	addi 9,1,-16
> 	rldic 6,6,2,60
> 	stxv 34,-16(1)
> 	stwx 5,9,6
> 	lxv 34,-16(1)
> =>
> 	addis 9,2,.LC0@toc@ha
> 	addi 9,9,.LC0@toc@l
> 	mtvsrwz 33,5
> 	lxv 32,0(9)
> 	sradi 9,6,2
> 	addze 9,9
> 	sldi 9,9,2
> 	subf 9,9,6
> 	subfic 9,9,3
> 	sldi 9,9,2
> 	subfic 9,9,20
> 	lvsl 13,0,9
> 	xxperm 33,33,45
> 	xxperm 32,32,45
> 	xxsel 34,34,33,32
> 
> Though the instruction count increases from 5 to 15, performance is
> improved by 60% in typical cases.

Ok.  :-)


(bunch of nits below, no issues with the gist of the patch).


> gcc/ChangeLog:
> 
> 	* config/rs6000/altivec.md (altivec_lvsl_reg_<mode>2): Extend to
> 	SDI mode.

(altivec_lvsl_reg): Rename to (altivec_lvsl_reg_<mode>2) and extend to SDI mode.


> 	* config/rs6000/rs6000-builtin.def (BU_VSX_X): Add support
> 	macros for vec_insert built-in functions.

Should that list the VEC_INSERT_V16QI, VEC_INSERT_V8HI, ... values instead of the BU_VSX_X ?  (need second opinion.. )


> 	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
> 	Generate built-in calls for vec_insert.

> 	* config/rs6000/rs6000-call.c (altivec_expand_vec_insert_builtin):
> 	New function.

> 	(altivec_expand_builtin): Add case entry for
> 	VSX_BUILTIN_VEC_INSERT_V16QI, VSX_BUILTIN_VEC_INSERT_V8HI,
> 	VSX_BUILTIN_VEC_INSERT_V4SF,  VSX_BUILTIN_VEC_INSERT_V4SI,
> 	VSX_BUILTIN_VEC_INSERT_V2DF,  VSX_BUILTIN_VEC_INSERT_V2DI.

plural entries :-) 


> 	(altivec_init_builtins):

Add defines for __builtin_vec_insert_v16qi, __builtin_vec_insert_v8hi, ...


> 	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_insert):
> 	New declear.

declare

> 	* config/rs6000/rs6000.c (rs6000_expand_vector_insert):
> 	New function.



> 	* config/rs6000/rs6000.md (FQHS): New mode iterator.
> 	(FD): New mode iterator.
> 	p8_mtvsrwz_v16qi<mode>2: New define_insn.
> 	p8_mtvsrd_v16qi<mode>2: New define_insn.
> 	* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.
ok

> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/powerpc/pr79251.c: New test.
> ---
>  gcc/config/rs6000/altivec.md               |   4 +-
>  gcc/config/rs6000/rs6000-builtin.def       |   6 +
>  gcc/config/rs6000/rs6000-c.c               |  61 +++++++++
>  gcc/config/rs6000/rs6000-call.c            |  74 +++++++++++
>  gcc/config/rs6000/rs6000-protos.h          |   1 +
>  gcc/config/rs6000/rs6000.c                 | 146 +++++++++++++++++++++
>  gcc/config/rs6000/rs6000.md                |  19 +++
>  gcc/config/rs6000/vsx.md                   |   2 +-
>  gcc/testsuite/gcc.target/powerpc/pr79251.c |  23 ++++
>  9 files changed, 333 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.c
> 
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index 0a2e634d6b0..66b636059a6 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2772,10 +2772,10 @@
>    DONE;
>  })
> 
> -(define_insn "altivec_lvsl_reg"
> +(define_insn "altivec_lvsl_reg_<mode>2"
>    [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
>  	(unspec:V16QI
> -	[(match_operand:DI 1 "gpc_reg_operand" "b")]
> +	[(match_operand:SDI 1 "gpc_reg_operand" "b")]
>  	UNSPEC_LVSL_REG))]
>    "TARGET_ALTIVEC"
>    "lvsl %0,0,%1"
> diff --git a/gcc/config/rs6000/rs6000-builtin.def b/gcc/config/rs6000/rs6000-builtin.def
> index f9f0fece549..d095b365c14 100644
> --- a/gcc/config/rs6000/rs6000-builtin.def
> +++ b/gcc/config/rs6000/rs6000-builtin.def
> @@ -2047,6 +2047,12 @@ BU_VSX_X (VEC_INIT_V2DI,      "vec_init_v2di",	CONST)
>  BU_VSX_X (VEC_SET_V1TI,	      "vec_set_v1ti",	CONST)
>  BU_VSX_X (VEC_SET_V2DF,	      "vec_set_v2df",	CONST)
>  BU_VSX_X (VEC_SET_V2DI,	      "vec_set_v2di",	CONST)
> +BU_VSX_X (VEC_INSERT_V16QI,	      "vec_insert_v16qi",	CONST)
> +BU_VSX_X (VEC_INSERT_V8HI,	      "vec_insert_v8hi",	CONST)
> +BU_VSX_X (VEC_INSERT_V4SI,	      "vec_insert_v4si",	CONST)
> +BU_VSX_X (VEC_INSERT_V4SF,	      "vec_insert_v4sf",	CONST)
> +BU_VSX_X (VEC_INSERT_V2DI,	      "vec_insert_v2di",	CONST)
> +BU_VSX_X (VEC_INSERT_V2DF,	      "vec_insert_v2df",	CONST)
>  BU_VSX_X (VEC_EXT_V1TI,	      "vec_ext_v1ti",	CONST)
>  BU_VSX_X (VEC_EXT_V2DF,	      "vec_ext_v2df",	CONST)
>  BU_VSX_X (VEC_EXT_V2DI,	      "vec_ext_v2di",	CONST)
> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
> index 2fad3d94706..03b00738a5e 100644
> --- a/gcc/config/rs6000/rs6000-c.c
> +++ b/gcc/config/rs6000/rs6000-c.c
> @@ -1563,6 +1563,67 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
>  	  return build_call_expr (call, 3, arg1, arg0, arg2);
>  	}
> 
> +      else if (VECTOR_MEM_VSX_P (mode))
> +	{
> +	  tree call = NULL_TREE;
> +
> +	  arg2 = fold_for_warn (arg2);
> +
> +	  /* If the second argument is variable, we can optimize it if we are
> +	     generating 64-bit code on a machine with direct move.  */
> +	  if (TREE_CODE (arg2) != INTEGER_CST && TARGET_DIRECT_MOVE_64BIT)
> +	    {
> +	      switch (mode)
> +		{
> +		default:
> +		  break;
> +
> +		case E_V2DImode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V2DI];
> +		  break;
> +
> +		case E_V2DFmode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V2DF];
> +		  break;
> +
> +		case E_V4SFmode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V4SF];
> +		  break;
> +
> +		case E_V4SImode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V4SI];
> +		  break;
> +
> +		case E_V8HImode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V8HI];
> +		  break;
> +
> +		case E_V16QImode:
> +		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_INSERT_V16QI];
> +		  break;
> +		}
> +	    }
> +
> +	  if (call)
> +	    {
> +	      if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
> +		arg2 = build_int_cst (TREE_TYPE (arg2), 0);
> +	      else
> +		arg2 = build_binary_op (
> +		  loc, BIT_AND_EXPR, arg2,
> +		  build_int_cst (TREE_TYPE (arg2),
> +				 TYPE_VECTOR_SUBPARTS (arg1_type) - 1),
> +		  0);

						
Indentation nit there, the "loc, BIT_AND_EXPR, ..." line should go on
the previous line.   If that greatly messes up the indentation of the
rest of the statement, use your judgement.  


> +	      tree result
> +		= build_call_expr (call, 3, arg1,
> +				   convert (TREE_TYPE (arg1_type), arg0),
> +				   convert (integer_type_node, arg2));
> +	      /* Coerce the result to vector element type.  May be no-op.  */
> +	      result = fold_convert (TREE_TYPE (arg1), result);
> +	      return result;
> +	    }
> +	}
> +
>        /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
>        arg1_inner_type = TREE_TYPE (arg1_type);
>        if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
> diff --git a/gcc/config/rs6000/rs6000-call.c b/gcc/config/rs6000/rs6000-call.c
> index e39cfcf672b..339e9ae87e3 100644
> --- a/gcc/config/rs6000/rs6000-call.c
> +++ b/gcc/config/rs6000/rs6000-call.c
> @@ -10660,6 +10660,40 @@ altivec_expand_vec_set_builtin (tree exp)
>    return op0;
>  }
> 
> +/* Expand vec_insert builtin.  */
> +static rtx
> +altivec_expand_vec_insert_builtin (tree exp, rtx target)
> +{
> +  machine_mode tmode, mode1, mode2;
> +  tree arg0, arg1, arg2;
> +  rtx op0 = NULL_RTX, op1, op2;
> +
> +  arg0 = CALL_EXPR_ARG (exp, 0);
> +  arg1 = CALL_EXPR_ARG (exp, 1);
> +  arg2 = CALL_EXPR_ARG (exp, 2);
> +
> +  tmode = TYPE_MODE (TREE_TYPE (arg0));
> +  mode1 = TYPE_MODE (TREE_TYPE (TREE_TYPE (arg0)));
> +  mode2 = TYPE_MODE ((TREE_TYPE (arg2)));
> +  gcc_assert (VECTOR_MODE_P (tmode));
> +
> +  op0 = expand_expr (arg0, NULL_RTX, tmode, EXPAND_NORMAL);
> +  op1 = expand_expr (arg1, NULL_RTX, mode1, EXPAND_NORMAL);
> +  op2 = expand_expr (arg2, NULL_RTX, mode2, EXPAND_NORMAL);
> +
> +  if (GET_MODE (op1) != mode1 && GET_MODE (op1) != VOIDmode)
> +    op1 = convert_modes (mode1, GET_MODE (op1), op1, true);
> +
> +  op0 = force_reg (tmode, op0);
> +  op1 = force_reg (mode1, op1);
> +  op2 = force_reg (mode2, op2);
> +
> +  target = gen_reg_rtx (V16QImode);

Should that be tmode, or is V16QImode always correct here?

> +  rs6000_expand_vector_insert (target, op0, op1, op2);
> +
> +  return target;
> +}
> +
>  /* Expand vec_ext builtin.  */
>  static rtx
>  altivec_expand_vec_ext_builtin (tree exp, rtx target)
> @@ -10922,6 +10956,14 @@ altivec_expand_builtin (tree exp, rtx target, bool *expandedp)
>      case VSX_BUILTIN_VEC_SET_V1TI:
>        return altivec_expand_vec_set_builtin (exp);
> 
> +    case VSX_BUILTIN_VEC_INSERT_V16QI:
> +    case VSX_BUILTIN_VEC_INSERT_V8HI:
> +    case VSX_BUILTIN_VEC_INSERT_V4SF:
> +    case VSX_BUILTIN_VEC_INSERT_V4SI:
> +    case VSX_BUILTIN_VEC_INSERT_V2DF:
> +    case VSX_BUILTIN_VEC_INSERT_V2DI:
> +      return altivec_expand_vec_insert_builtin (exp, target);
> +
>      case ALTIVEC_BUILTIN_VEC_EXT_V4SI:
>      case ALTIVEC_BUILTIN_VEC_EXT_V8HI:
>      case ALTIVEC_BUILTIN_VEC_EXT_V16QI:
> @@ -13681,6 +13723,38 @@ altivec_init_builtins (void)
>  				    integer_type_node, NULL_TREE);
>    def_builtin ("__builtin_vec_set_v2di", ftype, VSX_BUILTIN_VEC_SET_V2DI);
> 
> +  /* Access to the vec_insert patterns.  */
> +  ftype = build_function_type_list (V16QI_type_node, V16QI_type_node,
> +				    intQI_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v16qi", ftype,
> +	       VSX_BUILTIN_VEC_INSERT_V16QI);
> +
> +  ftype = build_function_type_list (V8HI_type_node, V8HI_type_node,
> +				    intHI_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v8hi", ftype, VSX_BUILTIN_VEC_INSERT_V8HI);
> +
> +  ftype = build_function_type_list (V4SI_type_node, V4SI_type_node,
> +				    integer_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v4si", ftype, VSX_BUILTIN_VEC_INSERT_V4SI);
> +
> +  ftype = build_function_type_list (V4SF_type_node, V4SF_type_node,
> +				    float_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v4sf", ftype, VSX_BUILTIN_VEC_INSERT_V4SF);
> +
> +  ftype = build_function_type_list (V2DI_type_node, V2DI_type_node,
> +				    intDI_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v2di", ftype, VSX_BUILTIN_VEC_INSERT_V2DI);
> +
> +  ftype = build_function_type_list (V2DF_type_node, V2DF_type_node,
> +				    double_type_node,
> +				    integer_type_node, NULL_TREE);
> +  def_builtin ("__builtin_vec_insert_v2df", ftype, VSX_BUILTIN_VEC_INSERT_V2DF);
> +
>    /* Access to the vec_extract patterns.  */
>    ftype = build_function_type_list (intSI_type_node, V4SI_type_node,
>  				    integer_type_node, NULL_TREE);
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index 28e859f4381..78b5b31d79f 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -58,6 +58,7 @@ extern bool rs6000_split_128bit_ok_p (rtx []);
>  extern void rs6000_expand_float128_convert (rtx, rtx, bool);
>  extern void rs6000_expand_vector_init (rtx, rtx);
>  extern void rs6000_expand_vector_set (rtx, rtx, int);
> +extern void rs6000_expand_vector_insert (rtx, rtx, rtx, rtx);
>  extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
>  extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
>  extern rtx rs6000_adjust_vec_address (rtx, rtx, rtx, rtx, machine_mode);
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index fe93cf6ff2b..afa845f3dff 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -6788,6 +6788,152 @@ rs6000_expand_vector_set (rtx target, rtx val, int elt)
>    emit_insn (gen_rtx_SET (target, x));
>  }
> 
> +/* Insert VAL into position IDX of TARGET, whose other elements come from VEC.  */
> +
> +void
> +rs6000_expand_vector_insert (rtx target, rtx vec, rtx val, rtx idx)
> +{
> +  machine_mode mode = GET_MODE (vec);
> +
> +  if (VECTOR_MEM_VSX_P (mode) && CONST_INT_P (idx))
> +      gcc_unreachable ();

only 2 spaces indent.

(My mailer has suddenly gotten confused with tabs and spaces,..  here
and below may need spaces replaced with tabs, or may just be a problem
on my end.. )

> +  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx)
> +	   && TARGET_DIRECT_MOVE_64BIT)
> +    {
> +      gcc_assert (GET_MODE (idx) == E_SImode);
> +      machine_mode inner_mode = GET_MODE (val);
> +      HOST_WIDE_INT mode_mask = GET_MODE_MASK (inner_mode);
> +
> +      rtx tmp = gen_reg_rtx (GET_MODE (idx));
> +      if (GET_MODE_SIZE (inner_mode) == 8)
> +	{
> +	  if (!BYTES_BIG_ENDIAN)
> +	    {
> +	      /*  idx = 1 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (1), idx));
> +	      /*  idx = idx * 8.  */
> +	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (3)));
> +	      /*  idx = 16 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
> +	    }
> +	  else
> +	    {
> +	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (3)));
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
> +	    }
> +	}
> +      else if (GET_MODE_SIZE (inner_mode) == 4)
> +	{
> +	  if (!BYTES_BIG_ENDIAN)
> +	    {
> +	      /*  idx = 3 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (3), idx));
> +	      /*  idx = idx * 4.  */
> +	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (2)));
> +	      /*  idx = 20 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
> +	    }
> +	  else
> +	  {
> +	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (2)));
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
> +	  }
> +	}
> +      else if (GET_MODE_SIZE (inner_mode) == 2)
> +	{
> +	  if (!BYTES_BIG_ENDIAN)
> +	    {
> +	      /*  idx = 7 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (7), idx));
> +	      /*  idx = idx * 2.  */
> +	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
> +	      /*  idx = 22 - idx.  */
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), tmp));
> +	    }
> +	  else
> +	    {
> +	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
> +	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), idx));
> +	    }
> +	}
> +      else if (GET_MODE_SIZE (inner_mode) == 1)
> +	if (!BYTES_BIG_ENDIAN)
> +	  emit_insn (gen_addsi3 (tmp, idx, GEN_INT (8)));
> +	else
> +	  emit_insn (gen_subsi3 (tmp, GEN_INT (23), idx));
> +      else
> +	gcc_unreachable ();
> +
> +      /*  lxv vs32, mask.
> +	  DImode: 0xffffffffffffffff0000000000000000
> +	  SImode: 0x00000000ffffffff0000000000000000
> +	  HImode: 0x000000000000ffff0000000000000000.
> +	  QImode: 0x00000000000000ff0000000000000000.  */
good. :-)

> +      rtx mask = gen_reg_rtx (V16QImode);
> +      rtx mask_v2di = gen_reg_rtx (V2DImode);
> +      rtvec v = rtvec_alloc (2);
> +      if (!BYTES_BIG_ENDIAN)
> +	{
> +	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, 0);
> +	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, mode_mask);
> +	}
> +      else
> +      {
> +	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, mode_mask);
> +	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, 0);
> +	}
> +      emit_insn (
> +	gen_vec_initv2didi (mask_v2di, gen_rtx_PARALLEL (V2DImode, v)));
> +      rtx sub_mask = simplify_gen_subreg (V16QImode, mask_v2di, V2DImode, 0);
> +      emit_insn (gen_rtx_SET (mask, sub_mask));
> +
> +      /*  mtvsrd[wz] f0,val.  */
> +      rtx val_v16qi = gen_reg_rtx (V16QImode);
> +      switch (inner_mode)
> +	{
> +	default:
> +	  gcc_unreachable ();
> +	  break;
> +	case E_QImode:
> +	  emit_insn (gen_p8_mtvsrwz_v16qiqi2 (val_v16qi, val));
> +	  break;
> +	case E_HImode:
> +	  emit_insn (gen_p8_mtvsrwz_v16qihi2 (val_v16qi, val));
> +	  break;
> +	case E_SImode:
> +	  emit_insn (gen_p8_mtvsrwz_v16qisi2 (val_v16qi, val));
> +	  break;
> +	case E_SFmode:
> +	  emit_insn (gen_p8_mtvsrwz_v16qisf2 (val_v16qi, val));
> +	  break;
> +	case E_DImode:
> +	  emit_insn (gen_p8_mtvsrd_v16qidi2 (val_v16qi, val));
> +	  break;
> +	case E_DFmode:
> +	  emit_insn (gen_p8_mtvsrd_v16qidf2 (val_v16qi, val));
> +	  break;
> +	}
> +
> +      /*  lvsl    v1,0,idx.  */
> +      rtx pcv = gen_reg_rtx (V16QImode);
> +      emit_insn (gen_altivec_lvsl_reg_si2 (pcv, tmp));
> +
> +      /*  xxperm  vs0,vs0,vs33.  */
> +      /*  xxperm  vs32,vs32,vs33.  */
> +      rtx val_perm = gen_reg_rtx (V16QImode);
> +      rtx mask_perm = gen_reg_rtx (V16QImode);
> +      emit_insn (
> +	gen_altivec_vperm_v8hiv16qi (val_perm, val_v16qi, val_v16qi, pcv));
> +      emit_insn (gen_altivec_vperm_v8hiv16qi (mask_perm, mask, mask, pcv));
> +
> +      rtx sub_target = simplify_gen_subreg (V16QImode, vec, mode, 0);
> +      emit_insn (gen_rtx_SET (target, sub_target));
> +
> +      /*  xxsel   vs34,vs34,vs0,vs32.  */
> +      emit_insn (gen_vector_select_v16qi (target, target, val_perm, mask_perm));
> +    }
> +}
> +
>  /* Extract field ELT from VEC into TARGET.  */
> 
>  void
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index 43b620ae1c0..b02fda836d4 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -8713,6 +8713,25 @@
>    "mtvsrwz %x0,%1"
>    [(set_attr "type" "mftgpr")])
> 
> +(define_mode_iterator FQHS [SF QI HI SI])
> +(define_mode_iterator FD [DF DI])
> +
> +(define_insn "p8_mtvsrwz_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +	(unspec:V16QI [(match_operand:FQHS 1 "register_operand" "r")]
> +		   UNSPEC_P8V_MTVSRWZ))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrwz %x0,%1"
> +  [(set_attr "type" "mftgpr")])
> +
> +(define_insn "p8_mtvsrd_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +	(unspec:V16QI [(match_operand:FD 1 "register_operand" "r")]
> +		   UNSPEC_P8V_MTVSRD))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrd %x0,%1"
> +  [(set_attr "type" "mftgpr")])
> +
>  (define_insn_and_split "reload_fpr_from_gpr<mode>"
>    [(set (match_operand:FMOVE64X 0 "register_operand" "=d")
>  	(unspec:FMOVE64X [(match_operand:FMOVE64X 1 "register_operand" "r")]
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index dd750210758..7e82690d12d 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5349,7 +5349,7 @@
>    rtx rtx_vtmp = gen_reg_rtx (V16QImode);
>    rtx tmp = gen_reg_rtx (DImode);
> 
> -  emit_insn (gen_altivec_lvsl_reg (shift_mask, operands[2]));
> +  emit_insn (gen_altivec_lvsl_reg_di2 (shift_mask, operands[2]));
>    emit_insn (gen_ashldi3 (tmp, operands[2], GEN_INT (56)));
>    emit_insn (gen_lxvll (rtx_vtmp, operands[1], tmp));
>    emit_insn (gen_altivec_vperm_v8hiv16qi (operands[0], rtx_vtmp, rtx_vtmp,
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251.c b/gcc/testsuite/gcc.target/powerpc/pr79251.c
> new file mode 100644
> index 00000000000..877659a0146
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr79251.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-require-effective-target lp64 } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power9 -maltivec" } */
> +
> +#include <stddef.h>
> +#include <altivec.h>
> +
> +#define TYPE int

Is testing against int types sufficient coverage? (are there other
existing tests?)

thanks,
-Will

> +  
> +__attribute__ ((noinline))
> +vector TYPE test (vector TYPE v, TYPE i, size_t n)
> +{
> +  vector TYPE v1 = v;
> +  v1 = vec_insert (i, v, n);
> +
> +  return v1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\mstxw\M} } } */
> +/* { dg-final { scan-assembler-times {\mlvsl\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxxperm\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxxsel\M} 1 } } */




* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-08-31  9:06 [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251] Xiong Hu Luo
  2020-08-31 12:43 ` Richard Biener
  2020-08-31 16:47 ` will schmidt
@ 2020-08-31 17:04 ` Segher Boessenkool
  2020-09-01  8:09   ` luoxhu
  2 siblings, 1 reply; 43+ messages in thread
From: Segher Boessenkool @ 2020-08-31 17:04 UTC (permalink / raw)
  To: Xiong Hu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, guojiufu, linkw

Hi!

On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
> vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
> to be inserted, and arg2 is the position at which to insert arg1 into arg0.
> This patch adds __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for
> vec_insert so that it is not expanded too early in the gimple stage when
> arg2 is variable, to avoid generating store-hit-load instructions.
> 
> For Power9 V4SI:
> 	addi 9,1,-16
> 	rldic 6,6,2,60
> 	stxv 34,-16(1)
> 	stwx 5,9,6
> 	lxv 34,-16(1)
> =>
> 	addis 9,2,.LC0@toc@ha
> 	addi 9,9,.LC0@toc@l
> 	mtvsrwz 33,5
> 	lxv 32,0(9)
> 	sradi 9,6,2
> 	addze 9,9
> 	sldi 9,9,2
> 	subf 9,9,6
> 	subfic 9,9,3
> 	sldi 9,9,2
> 	subfic 9,9,20
> 	lvsl 13,0,9
> 	xxperm 33,33,45
> 	xxperm 32,32,45
> 	xxsel 34,34,33,32

For v a V4SI, x a SI, j some int,  what do we generate for
  v[j&3] = x;
?
This should be exactly the same as we generate for
  vec_insert(x, v, j);
(the builtin does a modulo 4 automatically).


Segher


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-08-31 17:04 ` Segher Boessenkool
@ 2020-09-01  8:09   ` luoxhu
  2020-09-01 13:07     ` Richard Biener
  2020-09-01 14:02     ` [PATCH] " Segher Boessenkool
  0 siblings, 2 replies; 43+ messages in thread
From: luoxhu @ 2020-09-01  8:09 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, dje.gcc, wschmidt, guojiufu, linkw

Hi,

On 2020/9/1 01:04, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
>> vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
>> to be inserted, and arg2 is the position at which to insert arg1 into arg0.
>> This patch adds __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for
>> vec_insert so that it is not expanded too early in the gimple stage when
>> arg2 is variable, to avoid generating store-hit-load instructions.
>>
>> For Power9 V4SI:
>> 	addi 9,1,-16
>> 	rldic 6,6,2,60
>> 	stxv 34,-16(1)
>> 	stwx 5,9,6
>> 	lxv 34,-16(1)
>> =>
>> 	addis 9,2,.LC0@toc@ha
>> 	addi 9,9,.LC0@toc@l
>> 	mtvsrwz 33,5
>> 	lxv 32,0(9)
>> 	sradi 9,6,2
>> 	addze 9,9
>> 	sldi 9,9,2
>> 	subf 9,9,6
>> 	subfic 9,9,3
>> 	sldi 9,9,2
>> 	subfic 9,9,20
>> 	lvsl 13,0,9
>> 	xxperm 33,33,45
>> 	xxperm 32,32,45
>> 	xxsel 34,34,33,32
> 
> For v a V4SI, x a SI, j some int,  what do we generate for
>    v[j&3] = x;
> ?
> This should be exactly the same as we generate for
>    vec_insert(x, v, j);
> (the builtin does a modulo 4 automatically).

No, even with my patch, "stxv 34,-16(1); stwx 5,9,6; lxv 34,-16(1)" is still generated
currently.  Is it feasible and acceptable to expand some kind of pattern in the expander
directly, without going through a builtin?

I borrowed some of the implementation from vec_extract.  For vec_extract, the same issue also exists:

           source:                             gimple:                             expand:                                        asm:
1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);   => {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}     => slwi 9,6,2   vextuwrx 3,9,2   
2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);   => {r120:SI=vec_select(r118:V4SI,parallel)...}     =>  li 9,12      vextuwrx 3,9,2
3) i = v[n%4];   =>  _1 = n & 3;  i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>        ...        =>     stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
4) i = v[3];     =>          i = BIT_FIELD_REF <v, 32, 96>;              =>  {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12;   vextuwrx 3,9,2 

Case 3) also can't handle the similar usage, and case 4) doesn't generate a builtin as expected;
it just expands to vec_select by coincidence.  So does this mean that vec_insert, vec_extract,
and all other similar vector builtins should use an IFN, as Richard Biener suggested, to match the
pattern in gimple and expand both constant and variable indices in the expander?  Would this also be
beneficial for targets other than Power?  Or should we do that gradually after this patch is
approved, since it seems to be an independent issue?  Thanks:)
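As a sanity check on the semantics in the table above, the modulo behaviour of vec_extract can be modelled in plain scalar C (a stand-in for the altivec.h intrinsic on an ordinary array, not the real implementation):

```c
#include <assert.h>

/* Scalar model of vec_extract on a V4SI: the builtin reduces the
   index modulo the number of elements, so a variable index n behaves
   exactly like the source form i = v[n & 3].  */
static int model_vec_extract_v4si (const int v[4], unsigned long n)
{
  return v[n & 3];
}
```

So cases 1), 2) and 3) should all be able to produce the same result for the same index.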


Xionghu


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-08-31 16:47 ` will schmidt
@ 2020-09-01 11:43   ` luoxhu
  0 siblings, 0 replies; 43+ messages in thread
From: luoxhu @ 2020-09-01 11:43 UTC (permalink / raw)
  To: will schmidt, gcc-patches; +Cc: segher, wschmidt, linkw, dje.gcc

Hi,

On 2020/9/1 00:47, will schmidt wrote:
>> +  tmode = TYPE_MODE (TREE_TYPE (arg0));
>> +  mode1 = TYPE_MODE (TREE_TYPE (TREE_TYPE (arg0)));
>> +  mode2 = TYPE_MODE ((TREE_TYPE (arg2)));
>> +  gcc_assert (VECTOR_MODE_P (tmode));
>> +
>> +  op0 = expand_expr (arg0, NULL_RTX, tmode, EXPAND_NORMAL);
>> +  op1 = expand_expr (arg1, NULL_RTX, mode1, EXPAND_NORMAL);
>> +  op2 = expand_expr (arg2, NULL_RTX, mode2, EXPAND_NORMAL);
>> +
>> +  if (GET_MODE (op1) != mode1 && GET_MODE (op1) != VOIDmode)
>> +    op1 = convert_modes (mode1, GET_MODE (op1), op1, true);
>> +
>> +  op0 = force_reg (tmode, op0);
>> +  op1 = force_reg (mode1, op1);
>> +  op2 = force_reg (mode2, op2);
>> +
>> +  target = gen_reg_rtx (V16QImode);
> Should that be tmode, or is V16QImode always correct here?


Thanks for the review.
Yes, the target should be TImode here, but the subsequent call to
rs6000_expand_vector_insert needs a lot of emit_insns in it, and
using V16QImode lets it reuse most of the patterns in the existing md files;
after returning from this function, there is a conversion from
V16QImode to TImode to make the types match:

expr.c: convert_move (target, temp, TYPE_UNSIGNED (TREE_TYPE (exp)));

I've tested this with V2DI, V2DF, V4SI, V4SF, V8HI and V16QI on
Power9-LE and Power8-BE, and correctness of the results is ensured.
The other comments are addressed.  I will update the patch later if there are
no disagreements about the implementation.


Thanks,
Xionghu

> 
>> +  rs6000_expand_vector_insert (target, op0, op1, op2);
>> +
>> +  return target;
>> +}


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-01  8:09   ` luoxhu
@ 2020-09-01 13:07     ` Richard Biener
  2020-09-02  9:26       ` luoxhu
  2020-09-01 14:02     ` [PATCH] " Segher Boessenkool
  1 sibling, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-01 13:07 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn

On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> On 2020/9/1 01:04, Segher Boessenkool wrote:
> > Hi!
> >
> > On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
> >> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
> >> to be insert, arg2 is the place to insert arg1 to arg0.  This patch adds
> >> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to
> >> not expand too early in gimple stage if arg2 is variable, to avoid generate
> >> store hit load instructions.
> >>
> >> For Power9 V4SI:
> >>      addi 9,1,-16
> >>      rldic 6,6,2,60
> >>      stxv 34,-16(1)
> >>      stwx 5,9,6
> >>      lxv 34,-16(1)
> >> =>
> >>      addis 9,2,.LC0@toc@ha
> >>      addi 9,9,.LC0@toc@l
> >>      mtvsrwz 33,5
> >>      lxv 32,0(9)
> >>      sradi 9,6,2
> >>      addze 9,9
> >>      sldi 9,9,2
> >>      subf 9,9,6
> >>      subfic 9,9,3
> >>      sldi 9,9,2
> >>      subfic 9,9,20
> >>      lvsl 13,0,9
> >>      xxperm 33,33,45
> >>      xxperm 32,32,45
> >>      xxsel 34,34,33,32
> >
> > For v a V4SI, x a SI, j some int,  what do we generate for
> >    v[j&3] = x;
> > ?
> > This should be exactly the same as we generate for
> >    vec_insert(x, v, j);
> > (the builtin does a modulo 4 automatically).
>
> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated currently.
> Is it feasible and acceptable to expand some kind of pattern in expander directly without
> builtin transition?
>
> I borrowed some of implementation from vec_extract.  For vec_extract, the issue also exists:
>
>            source:                             gimple:                             expand:                                        asm:
> 1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);   => {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}     => slwi 9,6,2   vextuwrx 3,9,2
> 2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);   => {r120:SI=vec_select(r118:V4SI,parallel)...}     =>  li 9,12      vextuwrx 3,9,2
> 3) i = v[n%4];   =>  _1 = n & 3;  i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>        ...        =>     stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
> 4) i = v[3];     =>          i = BIT_FIELD_REF <v, 32, 96>;              =>  {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12;   vextuwrx 3,9,2

Why are 1) and 2) handled differently than 3)/4)?

> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate builtin as expected,
> it just expand to vec_select by coincidence.  So does this mean both vec_insert and vec_extract
> and all other similar vector builtins should use IFN as suggested by Richard Biener, to match the
> pattern in gimple and expand both constant and variable index in expander?  Will this also be
> beneficial for other targets except power?  Or we should do that gradually after this patch
> approved as it seems another independent issue?  Thanks:)

If the code generated for 3)/4) isn't optimal, you have to figure out why
by tracing the RTL expansion code and looking for missing optabs.

Consider the amount of backend code you need to write if you ever want to use
those in a constexpr context ...

Richard.

>
>
> Xionghu


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-01  8:09   ` luoxhu
  2020-09-01 13:07     ` Richard Biener
@ 2020-09-01 14:02     ` Segher Boessenkool
  1 sibling, 0 replies; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-01 14:02 UTC (permalink / raw)
  To: luoxhu; +Cc: gcc-patches, dje.gcc, wschmidt, guojiufu, linkw

On Tue, Sep 01, 2020 at 04:09:53PM +0800, luoxhu wrote:
> On 2020/9/1 01:04, Segher Boessenkool wrote:
> > For v a V4SI, x a SI, j some int,  what do we generate for
> >    v[j&3] = x;
> > ?
> > This should be exactly the same as we generate for
> >    vec_insert(x, v, j);
> > (the builtin does a modulo 4 automatically).
> 
> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated currently.

I think you should solve the problem in the generic case, then, since it
is (presumably) much more frequent.

> Is it feasible and acceptable to expand some kind of pattern in expander directly without
> builtin transition?

I don't understand what you mean?

> I borrowed some of implementation from vec_extract.  For vec_extract, the issue also exists:
> 
>            source:                             gimple:                             expand:                                        asm:
> 1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);   => {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}     => slwi 9,6,2   vextuwrx 3,9,2   
> 2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);   => {r120:SI=vec_select(r118:V4SI,parallel)...}     =>  li 9,12      vextuwrx 3,9,2
> 3) i = v[n%4];   =>  _1 = n & 3;  i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>        ...        =>     stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
> 4) i = v[3];     =>          i = BIT_FIELD_REF <v, 32, 96>;              =>  {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12;   vextuwrx 3,9,2 
> 
> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate builtin as expected,
> it just expand to vec_select by coincidence.  So does this mean both vec_insert and vec_extract
> and all other similar vector builtins should use IFN as suggested by Richard Biener, to match the
> pattern in gimple and expand both constant and variable index in expander?  Will this also be
> beneficial for other targets except power?  Or we should do that gradually after this patch
> approved as it seems another independent issue?  Thanks:)

I don't think we should do that at all?  IFNs just complicate
everything, and there is no added value here: this is just data movement!
We need to avoid the data aliasing in some generic way.


Segher


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-01 13:07     ` Richard Biener
@ 2020-09-02  9:26       ` luoxhu
  2020-09-02  9:30         ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-02  9:26 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn

Hi,

On 2020/9/1 21:07, Richard Biener wrote:
> On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> On 2020/9/1 01:04, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
>>>> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
>>>> to be insert, arg2 is the place to insert arg1 to arg0.  This patch adds
>>>> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to
>>>> not expand too early in gimple stage if arg2 is variable, to avoid generate
>>>> store hit load instructions.
>>>>
>>>> For Power9 V4SI:
>>>>       addi 9,1,-16
>>>>       rldic 6,6,2,60
>>>>       stxv 34,-16(1)
>>>>       stwx 5,9,6
>>>>       lxv 34,-16(1)
>>>> =>
>>>>       addis 9,2,.LC0@toc@ha
>>>>       addi 9,9,.LC0@toc@l
>>>>       mtvsrwz 33,5
>>>>       lxv 32,0(9)
>>>>       sradi 9,6,2
>>>>       addze 9,9
>>>>       sldi 9,9,2
>>>>       subf 9,9,6
>>>>       subfic 9,9,3
>>>>       sldi 9,9,2
>>>>       subfic 9,9,20
>>>>       lvsl 13,0,9
>>>>       xxperm 33,33,45
>>>>       xxperm 32,32,45
>>>>       xxsel 34,34,33,32
>>>
>>> For v a V4SI, x a SI, j some int,  what do we generate for
>>>     v[j&3] = x;
>>> ?
>>> This should be exactly the same as we generate for
>>>     vec_insert(x, v, j);
>>> (the builtin does a modulo 4 automatically).
>>
>> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated currently.
>> Is it feasible and acceptable to expand some kind of pattern in expander directly without
>> builtin transition?
>>
>> I borrowed some of implementation from vec_extract.  For vec_extract, the issue also exists:
>>
>>             source:                             gimple:                             expand:                                        asm:
>> 1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);   => {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}     => slwi 9,6,2   vextuwrx 3,9,2
>> 2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);   => {r120:SI=vec_select(r118:V4SI,parallel)...}     =>  li 9,12      vextuwrx 3,9,2
>> 3) i = v[n%4];   =>  _1 = n & 3;  i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>        ...        =>     stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
>> 4) i = v[3];     =>          i = BIT_FIELD_REF <v, 32, 96>;              =>  {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12;   vextuwrx 3,9,2
> 
> Why are 1) and 2) handled differently than 3)/4)?

1) and 2) call the builtin function vec_extract; it is defined as
 __builtin_vec_extract and is resolved to ALTIVEC_BUILTIN_VEC_EXTRACT
 by resolve_overloaded_builtin, which generates a call to __builtin_vec_ext_v4si
 that is expanded only at RTL time.
3) accesses the variable v as an array type via VIEW_CONVERT_EXPR; I
 guess we should also generate a builtin call instead of calling
 convert_vector_to_array_for_subscript to generate a VIEW_CONVERT_EXPR
 expression for this kind of usage.
4) is translated to a BIT_FIELD_REF with constant bitstart and bitsize;
the variable v can also be accessed through a register instead of the stack, so optabs
can match rs6000_expand_vector_insert to generate the expected instructions
through extract_bit_field.

> 
>> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate builtin as expected,
>> it just expand to vec_select by coincidence.  So does this mean both vec_insert and vec_extract
>> and all other similar vector builtins should use IFN as suggested by Richard Biener, to match the
>> pattern in gimple and expand both constant and variable index in expander?  Will this also be
>> beneficial for other targets except power?  Or we should do that gradually after this patch
>> approved as it seems another independent issue?  Thanks:)
> 
> If the code generated for 3)/4) isn't optimal you have to figure why
> by tracing the RTL
> expansion code and looking for missing optabs.
> 
> Consider the amount of backend code you need to write if ever using
> those in constexpr
> context ...

It seems too complicated to expand "i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];"
or "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);" to
rs6000_expand_vector_insert/rs6000_expand_vector_extract in RTL, as:
1) Vector v is stored to the stack with array type, which needs extra load and store operations.
2) It requires a fair amount of code to decompose the VIEW_CONVERT_EXPR to extract the vector and
index and then call rs6000_expand_vector_insert/rs6000_expand_vector_extract.

This means replacing instructions #9~#12 below with a new instruction sequence:
    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    2: r119:V4SI=%2:V4SI
    3: r120:DI=%5:DI
    4: r121:DI=%6:DI
    5: NOTE_INSN_FUNCTION_BEG
    8: [r112:DI]=r119:V4SI

    9: r122:DI=r121:DI&0x3
   10: r123:DI=r122:DI<<0x2
   11: r124:DI=r112:DI+r123:DI
   12: [r124:DI]=r120:DI#0

   13: r118:V4SI=[r112:DI]
   17: %2:V4SI=r118:V4SI
   18: use %2:V4SI

=>

    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    2: r119:V4SI=%2:V4SI
    3: r120:DI=%5:DI
    4: r121:DI=%6:DI
    5: NOTE_INSN_FUNCTION_BEG
    8: [r112:DI]=r119:V4SI

r130:V4SI=[r112:DI]
rs6000_expand_vector_insert (r130,  r121:DI&0x3, r120:DI#0)
[r112:DI]=r130:V4SI

   13: r118:V4SI=[r112:DI]
   17: %2:V4SI=r118:V4SI
   18: use %2:V4SI

so would bypassing convert_vector_to_array_for_subscript in special circumstances
like "i = v[n%4]" or "v[n&3] = i", to generate a vec_extract or vec_insert builtin
call, be a relatively simpler method?
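For illustration of why the branch-free sequence above works: lvsl builds a permute control, xxperm rotates a splat of the new element into the target lane, and xxsel merges it into the vector under a mask.  A scalar C sketch of that idea (an illustrative helper, not the emitted RTL):

```c
#include <assert.h>

/* Scalar model of a variable-index V4SI insert without a store/reload
   round trip: splat the new element, build an all-ones mask for the
   target lane, and select lane-wise (this mirrors xxsel).  */
static void model_v4si_insert (int v[4], int x, unsigned long n)
{
  unsigned idx = n & 3;                /* builtin index is taken modulo 4 */
  for (unsigned i = 0; i < 4; i++)
    {
      int mask = (i == idx) ? -1 : 0;  /* per-lane select mask */
      v[i] = (v[i] & ~mask) | (x & mask);
    }
}
```

Every step is a data-parallel register operation, so no store-hit-load is involved.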

Xionghu


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-02  9:26       ` luoxhu
@ 2020-09-02  9:30         ` Richard Biener
  2020-09-03  9:20           ` luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-02  9:30 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn

On Wed, Sep 2, 2020 at 11:26 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2020/9/1 21:07, Richard Biener wrote:
> > On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> On 2020/9/1 01:04, Segher Boessenkool wrote:
> >>> Hi!
> >>>
> >>> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote:
> >>>> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
> >>>> to be insert, arg2 is the place to insert arg1 to arg0.  This patch adds
> >>>> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to
> >>>> not expand too early in gimple stage if arg2 is variable, to avoid generate
> >>>> store hit load instructions.
> >>>>
> >>>> For Power9 V4SI:
> >>>>       addi 9,1,-16
> >>>>       rldic 6,6,2,60
> >>>>       stxv 34,-16(1)
> >>>>       stwx 5,9,6
> >>>>       lxv 34,-16(1)
> >>>> =>
> >>>>       addis 9,2,.LC0@toc@ha
> >>>>       addi 9,9,.LC0@toc@l
> >>>>       mtvsrwz 33,5
> >>>>       lxv 32,0(9)
> >>>>       sradi 9,6,2
> >>>>       addze 9,9
> >>>>       sldi 9,9,2
> >>>>       subf 9,9,6
> >>>>       subfic 9,9,3
> >>>>       sldi 9,9,2
> >>>>       subfic 9,9,20
> >>>>       lvsl 13,0,9
> >>>>       xxperm 33,33,45
> >>>>       xxperm 32,32,45
> >>>>       xxsel 34,34,33,32
> >>>
> >>> For v a V4SI, x a SI, j some int,  what do we generate for
> >>>     v[j&3] = x;
> >>> ?
> >>> This should be exactly the same as we generate for
> >>>     vec_insert(x, v, j);
> >>> (the builtin does a modulo 4 automatically).
> >>
> >> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated currently.
> >> Is it feasible and acceptable to expand some kind of pattern in expander directly without
> >> builtin transition?
> >>
> >> I borrowed some of implementation from vec_extract.  For vec_extract, the issue also exists:
> >>
> >>             source:                             gimple:                             expand:                                        asm:
> >> 1) i = vec_extract (v, n);  =>  i = __builtin_vec_ext_v4si (v, n);   => {r120:SI=unspec[r118:V4SI,r119:DI] 134;...}     => slwi 9,6,2   vextuwrx 3,9,2
> >> 2) i = vec_extract (v, 3);  =>  i = __builtin_vec_ext_v4si (v, 3);   => {r120:SI=vec_select(r118:V4SI,parallel)...}     =>  li 9,12      vextuwrx 3,9,2
> >> 3) i = v[n%4];   =>  _1 = n & 3;  i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];  =>        ...        =>     stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax 3,9,5
> >> 4) i = v[3];     =>          i = BIT_FIELD_REF <v, 32, 96>;              =>  {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12;   vextuwrx 3,9,2
> >
> > Why are 1) and 2) handled differently than 3)/4)?
>
> 1) and 2) are calling builtin function vec_extract, it is defined to
>  __builtin_vec_extract and will be resolved to ALTIVEC_BUILTIN_VEC_EXTRACT
>  by resolve_overloaded_builtin, to generate a call __builtin_vec_ext_v4si
>  to be expanded only in RTL.
> 3) is access variable v as array type with opcode VIEW_CONVERT_EXPR, I
>  guess we should also generate builtin call instead of calling
>  convert_vector_to_array_for_subscript to generate VIEW_CONVERT_EXPR
>  expression for such kind of usage.
> 4) is translated to BIT_FIELD_REF with constant bitstart and bitsize,
> variable v could also be accessed by register instead of stack, so optabs
> could match the rs6000_expand_vector_insert to generate expected instruction
> through extract_bit_field.
>
> >
> >> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate builtin as expected,
> >> it just expand to vec_select by coincidence.  So does this mean both vec_insert and vec_extract
> >> and all other similar vector builtins should use IFN as suggested by Richard Biener, to match the
> >> pattern in gimple and expand both constant and variable index in expander?  Will this also be
> >> beneficial for other targets except power?  Or we should do that gradually after this patch
> >> approved as it seems another independent issue?  Thanks:)
> >
> > If the code generated for 3)/4) isn't optimal you have to figure why
> > by tracing the RTL
> > expansion code and looking for missing optabs.
> >
> > Consider the amount of backend code you need to write if ever using
> > those in constexpr
> > context ...
>
> It seems too complicated to expand the "i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];"
> or "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);" to
> rs6000_expand_vector_insert/rs6000_expand_vector_extract in RTL, as:
> 1) Vector v is stored to stack with array type; need extra load and store operation.
> 2) Requires amount of code to decompose VIEW_CONVERT_EXPR to extract the vector and
> index then call rs6000_expand_vector_insert/rs6000_expand_vector_extract.
>
> which means replace following instructions #9~#12 to new instruction sequences:
>     1: NOTE_INSN_DELETED
>     6: NOTE_INSN_BASIC_BLOCK 2
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>
>     9: r122:DI=r121:DI&0x3
>    10: r123:DI=r122:DI<<0x2
>    11: r124:DI=r112:DI+r123:DI
>    12: [r124:DI]=r120:DI#0
>
>    13: r118:V4SI=[r112:DI]
>    17: %2:V4SI=r118:V4SI
>    18: use %2:V4SI
>
> =>
>
>     1: NOTE_INSN_DELETED
>     6: NOTE_INSN_BASIC_BLOCK 2
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>
> r130:V4SI=[r112:DI]
> rs6000_expand_vector_insert (r130,  r121:DI&0x3, r120:DI#0)
> [r112:DI]=r130:V4SI
>
>    13: r118:V4SI=[r112:DI]
>    17: %2:V4SI=r118:V4SI
>    18: use %2:V4SI
>
> so maybe bypass convert_vector_to_array_for_subscript for special circumstance
> like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin
> call a relative simpler method?

I think you have it backward.  You need to work with what
convert_vector_to_array_for_subscript gives and deal with it during RTL
expansion / optimization to generate more optimal code for power.  The goal
is to have as few target-specific builtins as possible during the GIMPLE
optimization phase (because we cannot work out their semantics in the optimizers).

Richard.

>
> Xionghu


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-02  9:30         ` Richard Biener
@ 2020-09-03  9:20           ` luoxhu
  2020-09-03 10:29             ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-03  9:20 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn



On 2020/9/2 17:30, Richard Biener wrote:
>> so maybe bypass convert_vector_to_array_for_subscript for special circumstance
>> like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin
>> call a relative simpler method?
> I think you have it backward.  You need to work with what
> convert_vector_to_array_for_subscript
> gives and deal with it during RTL expansion / optimization to generate
> more optimal
> code for power.  The goal is to have as little target specific
> builtins during the GIMPLE
> optimization phase (because we cannot work out its semantics in optimizers).

OK, got it; I will add a vec_insert optab and expand "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);"
expressions through rs6000_expand_vector_insert instead of a builtin call.
vec_extract already has an optab, and "i = v[n%4]" should be handled in another patch
after this one.


Thanks,
Xionghu



* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-03  9:20           ` luoxhu
@ 2020-09-03 10:29             ` Richard Biener
  2020-09-04  6:16               ` luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-03 10:29 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn

On Thu, Sep 3, 2020 at 11:20 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/2 17:30, Richard Biener wrote:
> >> so maybe bypass convert_vector_to_array_for_subscript for special circumstance
> >> like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin
> >> call a relative simpler method?
> > I think you have it backward.  You need to work with what
> > convert_vector_to_array_for_subscript
> > gives and deal with it during RTL expansion / optimization to generate
> > more optimal
> > code for power.  The goal is to have as little target specific
> > builtins during the GIMPLE
> > optimization phase (because we cannot work out its semantics in optimizers).
>
> OK, got it, will add optabs vec_insert and expand "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);"
> expressions to rs6000_expand_vector_insert instead of builtin call.
> vec_extract already has optabs and "i = v[n%4]" should be in another patch
> after this.

There is already vec_set and vec_extract - the question is whether the expander
tries those for variable index.

Richard.

>
> Thanks,
> Xionghu
>


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-03 10:29             ` Richard Biener
@ 2020-09-04  6:16               ` luoxhu
  2020-09-04  6:38                 ` luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-04  6:16 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Bill Schmidt, GCC Patches, linkw, David Edelsohn

Hi,

On 2020/9/3 18:29, Richard Biener wrote:
> On Thu, Sep 3, 2020 at 11:20 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2020/9/2 17:30, Richard Biener wrote:
>>>> so maybe bypass convert_vector_to_array_for_subscript for special circumstance
>>>> like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin
>>>> call a relative simpler method?
>>> I think you have it backward.  You need to work with what
>>> convert_vector_to_array_for_subscript
>>> gives and deal with it during RTL expansion / optimization to generate
>>> more optimal
>>> code for power.  The goal is to have as little target specific
>>> builtins during the GIMPLE
>>> optimization phase (because we cannot work out its semantics in optimizers).
>>
>> OK, got it, will add optabs vec_insert and expand "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);"
>> expressions to rs6000_expand_vector_insert instead of builtin call.
>> vec_extract already has optabs and "i = v[n%4]" should be in another patch
>> after this.
> 
> There is already vec_set and vec_extract - the question is whether the expander
> tries those for variable index.
> 

Yes, I checked and found that neither vec_set nor vec_extract supports a
variable index on most targets; store_bit_field_1 and extract_bit_field_1
only consider using the optabs when the index is a constant integer.  Anyway, it
shouldn't be hard to extend, depending on target requirements.

Another problem is that v[n&3] = i and vec_insert(v, i, n) generate
different gimple code:

{
_1 = n & 3;
VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
}

vs:

{
  __vector signed int v1;
  __vector signed int D.3192;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;

  <bb 2> [local count: 1073741824]:
  D.3192 = v_4(D);
  _1 = n_7(D) & 3;
  _2 = _1 * 4;
  _3 = &D.3192 + _2; 
  *_3 = i_8(D);
  v1_10 = D.3192;
  return v1_10;
}
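The two spellings should still behave identically at run time; a minimal scalar model of both entry points (plain-array stand-ins for the vector type and the altivec.h builtin):

```c
#include <assert.h>

/* v[n&3] = i as written in the source.  */
static void subscript_store (int v[4], int i, unsigned long n)
{
  v[n & 3] = i;
}

/* vec_insert (i, v, n): the builtin applies the modulo itself.  */
static void model_vec_insert (int v[4], int i, unsigned long n)
{
  v[n % 4] = i;
}
```

So ideally both should expand to the same fast instruction sequence, even though their gimple differs.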

If we do not use a builtin for "vec_insert(v, i, n)", the pointer is "int *" instead
of a vector type; will it be difficult for the expander to capture that many
statements and then call the optabs?  So shall we keep the builtin style
for "vec_insert(v, i, n)" and only expand "v[n&3] = i" with optabs, or expand both
with optabs?

I drafted a quick patch to expand "v[n&3] = i" with optabs as below; sorry that it does not
use the existing vec_set yet, as I was not quite sure about it.  Together with the first patch, both
cases can be handled as expected:


[PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index

v[n%4] = i has the same semantics as vec_insert (i, v, n), but it is
optimized to "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" in gimple.  This
patch tries to recognize that pattern in the expander and uses optabs to
expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.

gcc/ChangeLog:

	* config/rs6000/vector.md (vec_insert<VEC_base_l><mode>): New expander.
	* expr.c (expand_assignment): Expand assignments to a vector element
	with variable index via the vec_insert optab.
	* optabs.def (vec_insert_optab): New optab.
---
 gcc/config/rs6000/vector.md | 13 +++++++++++
 gcc/expr.c                  | 46 +++++++++++++++++++++++++++++++++++++
 gcc/optabs.def              |  1 +
 3 files changed, 60 insertions(+)

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 796345c80d3..46d21271e17 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1244,6 +1244,19 @@ (define_expand "vec_extract<mode><VEC_base_l>"
   DONE;
 })
 
+(define_expand "vec_insert<VEC_base_l><mode>"
+  [(match_operand:VEC_E 0 "vlogical_operand")
+   (match_operand:<VEC_base> 1 "register_operand")
+   (match_operand 2 "register_operand")]
+  "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  rtx target = gen_reg_rtx (V16QImode);
+  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
+  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, V16QImode, 0);
+  emit_insn (gen_rtx_SET (operands[0], sub_target));
+  DONE;
+})
+
 ;; Convert double word types to single word types
 (define_expand "vec_pack_trunc_v2df"
   [(match_operand:V4SF 0 "vfloat_operand")
diff --git a/gcc/expr.c b/gcc/expr.c
index dd2200ddea8..ce2890c1a2d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
 
       to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
 
+      tree type = TREE_TYPE (to);
+      if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
+	  && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+	  && tree_to_uhwi (TYPE_SIZE (type))
+		 * tree_to_uhwi (TYPE_SIZE_UNIT (type))
+	       == 128)
+	{
+	  tree op0 = TREE_OPERAND (to, 0);
+	  tree op1 = TREE_OPERAND (to, 1);
+	  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
+	    {
+	      tree view_op0 = TREE_OPERAND (op0, 0);
+	      mode = TYPE_MODE (TREE_TYPE (view_op0));
+	      if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
+		{
+		  rtx value
+		    = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+		  rtx pos
+		    = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+		  rtx temp_target = gen_reg_rtx (mode);
+		  emit_move_insn (temp_target, to_rtx);
+
+		  machine_mode outermode = mode;
+		  scalar_mode innermode = GET_MODE_INNER (outermode);
+		  class expand_operand ops[3];
+		  enum insn_code icode
+		    = convert_optab_handler (vec_insert_optab, innermode,
+					     outermode);
+
+		  if (icode != CODE_FOR_nothing)
+		    {
+		      pos = convert_to_mode (E_SImode, pos, 0);
+
+		      create_fixed_operand (&ops[0], temp_target);
+		      create_input_operand (&ops[1], value, innermode);
+		      create_input_operand (&ops[2], pos, GET_MODE (pos));
+		      if (maybe_expand_insn (icode, 3, ops))
+			{
+			  emit_move_insn (to_rtx, temp_target);
+			  pop_temp_slots ();
+			  return;
+			}
+		    }
+		}
+	    }
+	}
       /* If the field has a mode, we want to access it in the
 	 field's mode, not the computed mode.
 	 If a MEM has VOIDmode (external with incomplete type),
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 78409aa1453..21b163a969e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -96,6 +96,7 @@ OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
 OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
 OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
 OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
+OPTAB_CD(vec_insert_optab, "vec_insert$a$b")
 OPTAB_CD(vec_init_optab, "vec_init$a$b")
 
 OPTAB_CD (while_ult_optab, "while_ult$a$b")
-- 
2.27.0.90.geebb51ba8c






* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04  6:16               ` luoxhu
@ 2020-09-04  6:38                 ` luoxhu
  2020-09-04  7:19                   ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-04  6:38 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, David Edelsohn, Bill Schmidt, linkw, Segher Boessenkool



On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:
> Hi,
> 
> 
> Yes, I checked and found that both vec_set and vec_extract doesn't support
> variable index for most targets, store_bit_field_1 and extract_bit_field_1
> would only consider use optabs when index is integer value.  Anyway, it
> shouldn't be hard to extend depending on target requirements.
> 
> Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
> different gimple code:
> 
> {
> _1 = n & 3;
> VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
> }
> 
> vs:
> 
> {
>    __vector signed int v1;
>    __vector signed int D.3192;
>    long unsigned int _1;
>    long unsigned int _2;
>    int * _3;
> 
>    <bb 2> [local count: 1073741824]:
>    D.3192 = v_4(D);
>    _1 = n_7(D) & 3;
>    _2 = _1 * 4;
>    _3 = &D.3192 + _2;
>    *_3 = i_8(D);
>    v1_10 = D.3192;
>    return v1_10;
> }

Just realized that convert_vector_to_array_for_subscript would generate
"VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" before producing those instructions.
Your confirmation and comments would be highly appreciated...  Thanks in
advance. :)

Xionghu

> 
> If not use builtin for "vec_insert(v, i, n)", the pointer is "int*" instead
> of vector type, will this be difficult for expander to capture so many
> statements then call the optabs?  So shall we still keep the builtin style
> for "vec_insert(v, i, n)" and expand "v[n&3]=i" with optabs or expand both
> with optabs???
> 
> Drafted a fast patch to expand "v[n&3]=i" with optabs as below, sorry that not
> using existed vec_set yet as not quite sure, together with the first patch, both
> cases could be handled as expected:
> 
> 
> [PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index
> 
> v[n%4] = i has same semantic with vec_insert (i, v, n), but it will be
> optimized to "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" in gimple, this
> patch tries to recognize the pattern in expander and use optabs to
> expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.
> 
> gcc/ChangeLog:
> 
> 	* config/rs6000/vector.md:
> 	* expr.c (expand_assignment):
> 	* optabs.def (OPTAB_CD):
> ---
>   gcc/config/rs6000/vector.md | 13 +++++++++++
>   gcc/expr.c                  | 46 +++++++++++++++++++++++++++++++++++++
>   gcc/optabs.def              |  1 +
>   3 files changed, 60 insertions(+)
> 
> diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
> index 796345c80d3..46d21271e17 100644
> --- a/gcc/config/rs6000/vector.md
> +++ b/gcc/config/rs6000/vector.md
> @@ -1244,6 +1244,19 @@ (define_expand "vec_extract<mode><VEC_base_l>"
>     DONE;
>   })
> 
> +(define_expand "vec_insert<VEC_base_l><mode>"
> +  [(match_operand:VEC_E 0 "vlogical_operand")
> +   (match_operand:<VEC_base> 1 "register_operand")
> +   (match_operand 2 "register_operand")]
> +  "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
> +{
> +  rtx target = gen_reg_rtx (V16QImode);
> +  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
> +  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, V16QImode, 0);
> +  emit_insn (gen_rtx_SET (operands[0], sub_target));
> +  DONE;
> +})
> +
>   ;; Convert double word types to single word types
>   (define_expand "vec_pack_trunc_v2df"
>     [(match_operand:V4SF 0 "vfloat_operand")
> diff --git a/gcc/expr.c b/gcc/expr.c
> index dd2200ddea8..ce2890c1a2d 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
> 
>         to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
> 
> +      tree type = TREE_TYPE (to);
> +      if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
> +	  && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> +	  && tree_to_uhwi (TYPE_SIZE (type))
> +		 * tree_to_uhwi (TYPE_SIZE_UNIT (type))
> +	       == 128)
> +	{
> +	  tree op0 = TREE_OPERAND (to, 0);
> +	  tree op1 = TREE_OPERAND (to, 1);
> +	  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
> +	    {
> +	      tree view_op0 = TREE_OPERAND (op0, 0);
> +	      mode = TYPE_MODE (TREE_TYPE (view_op0));
> +	      if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
> +		{
> +		  rtx value
> +		    = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> +		  rtx pos
> +		    = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> +		  rtx temp_target = gen_reg_rtx (mode);
> +		  emit_move_insn (temp_target, to_rtx);
> +
> +		  machine_mode outermode = mode;
> +		  scalar_mode innermode = GET_MODE_INNER (outermode);
> +		  class expand_operand ops[3];
> +		  enum insn_code icode
> +		    = convert_optab_handler (vec_insert_optab, innermode,
> +					     outermode);
> +
> +		  if (icode != CODE_FOR_nothing)
> +		    {
> +		      pos = convert_to_mode (E_SImode, pos, 0);
> +
> +		      create_fixed_operand (&ops[0], temp_target);
> +		      create_input_operand (&ops[1], value, innermode);
> +		      create_input_operand (&ops[2], pos, GET_MODE (pos));
> +		      if (maybe_expand_insn (icode, 3, ops))
> +			{
> +			  emit_move_insn (to_rtx, temp_target);
> +			  pop_temp_slots ();
> +			  return;
> +			}
> +		    }
> +		}
> +	    }
> +	}
>         /* If the field has a mode, we want to access it in the
>   	 field's mode, not the computed mode.
>   	 If a MEM has VOIDmode (external with incomplete type),
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 78409aa1453..21b163a969e 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -96,6 +96,7 @@ OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
>   OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
>   OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
>   OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> +OPTAB_CD(vec_insert_optab, "vec_insert$a$b")
>   OPTAB_CD(vec_init_optab, "vec_init$a$b")
> 
>   OPTAB_CD (while_ult_optab, "while_ult$a$b")
> 


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04  6:38                 ` luoxhu
@ 2020-09-04  7:19                   ` Richard Biener
  2020-09-04  7:23                     ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-04  7:19 UTC (permalink / raw)
  To: luoxhu
  Cc: GCC Patches, David Edelsohn, Bill Schmidt, linkw, Segher Boessenkool

On Fri, Sep 4, 2020 at 8:38 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:
> > Hi,
> >
> >
> > Yes, I checked and found that both vec_set and vec_extract doesn't support
> > variable index for most targets, store_bit_field_1 and extract_bit_field_1
> > would only consider use optabs when index is integer value.  Anyway, it
> > shouldn't be hard to extend depending on target requirements.
> >
> > Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
> > different gimple code:
> >
> > {
> > _1 = n & 3;
> > VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
> > }
> >
> > vs:
> >
> > {
> >    __vector signed int v1;
> >    __vector signed int D.3192;
> >    long unsigned int _1;
> >    long unsigned int _2;
> >    int * _3;
> >
> >    <bb 2> [local count: 1073741824]:
> >    D.3192 = v_4(D);
> >    _1 = n_7(D) & 3;
> >    _2 = _1 * 4;
> >    _3 = &D.3192 + _2;
> >    *_3 = i_8(D);
> >    v1_10 = D.3192;
> >    return v1_10;
> > }
>
> Just realized use convert_vector_to_array_for_subscript would generate
> "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" before produce those instructions,
>   your confirmation and comments will be highly appreciated...  Thanks in
> advance. :)

I think what the GCC vector extensions produce is generally better
so wherever "code generation" for vec_insert resides it should be
adjusted to produce the same code.  Same for vec_extract.

> Xionghu
>
> >
> > If not use builtin for "vec_insert(v, i, n)", the pointer is "int*" instead
> > of vector type, will this be difficult for expander to capture so many
> > statements then call the optabs?  So shall we still keep the builtin style
> > for "vec_insert(v, i, n)" and expand "v[n&3]=i" with optabs or expand both
> > with optabs???
> >
> > Drafted a fast patch to expand "v[n&3]=i" with optabs as below, sorry that not
> > using existed vec_set yet as not quite sure, together with the first patch, both
> > cases could be handled as expected:
> >
> >
> > [PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index
> >
> > v[n%4] = i has same semantic with vec_insert (i, v, n), but it will be
> > optimized to "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" in gimple, this
> > patch tries to recognize the pattern in expander and use optabs to
> > expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.
> >
> > gcc/ChangeLog:
> >
> >       * config/rs6000/vector.md:
> >       * expr.c (expand_assignment):
> >       * optabs.def (OPTAB_CD):
> > ---
> >   gcc/config/rs6000/vector.md | 13 +++++++++++
> >   gcc/expr.c                  | 46 +++++++++++++++++++++++++++++++++++++
> >   gcc/optabs.def              |  1 +
> >   3 files changed, 60 insertions(+)
> >
> > diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
> > index 796345c80d3..46d21271e17 100644
> > --- a/gcc/config/rs6000/vector.md
> > +++ b/gcc/config/rs6000/vector.md
> > @@ -1244,6 +1244,19 @@ (define_expand "vec_extract<mode><VEC_base_l>"
> >     DONE;
> >   })
> >
> > +(define_expand "vec_insert<VEC_base_l><mode>"
> > +  [(match_operand:VEC_E 0 "vlogical_operand")
> > +   (match_operand:<VEC_base> 1 "register_operand")
> > +   (match_operand 2 "register_operand")]
> > +  "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
> > +{
> > +  rtx target = gen_reg_rtx (V16QImode);
> > +  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
> > +  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, V16QImode, 0);
> > +  emit_insn (gen_rtx_SET (operands[0], sub_target));
> > +  DONE;
> > +})
> > +
> >   ;; Convert double word types to single word types
> >   (define_expand "vec_pack_trunc_v2df"
> >     [(match_operand:V4SF 0 "vfloat_operand")
> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dd2200ddea8..ce2890c1a2d 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
> >
> >         to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
> >
> > +      tree type = TREE_TYPE (to);
> > +      if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
> > +       && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > +       && tree_to_uhwi (TYPE_SIZE (type))
> > +              * tree_to_uhwi (TYPE_SIZE_UNIT (type))
> > +            == 128)
> > +     {
> > +       tree op0 = TREE_OPERAND (to, 0);
> > +       tree op1 = TREE_OPERAND (to, 1);
> > +       if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
> > +         {
> > +           tree view_op0 = TREE_OPERAND (op0, 0);
> > +           mode = TYPE_MODE (TREE_TYPE (view_op0));
> > +           if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
> > +             {
> > +               rtx value
> > +                 = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> > +               rtx pos
> > +                 = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> > +               rtx temp_target = gen_reg_rtx (mode);
> > +               emit_move_insn (temp_target, to_rtx);
> > +
> > +               machine_mode outermode = mode;
> > +               scalar_mode innermode = GET_MODE_INNER (outermode);
> > +               class expand_operand ops[3];
> > +               enum insn_code icode
> > +                 = convert_optab_handler (vec_insert_optab, innermode,
> > +                                          outermode);
> > +
> > +               if (icode != CODE_FOR_nothing)
> > +                 {
> > +                   pos = convert_to_mode (E_SImode, pos, 0);
> > +
> > +                   create_fixed_operand (&ops[0], temp_target);
> > +                   create_input_operand (&ops[1], value, innermode);
> > +                   create_input_operand (&ops[2], pos, GET_MODE (pos));
> > +                   if (maybe_expand_insn (icode, 3, ops))
> > +                     {
> > +                       emit_move_insn (to_rtx, temp_target);
> > +                       pop_temp_slots ();
> > +                       return;
> > +                     }
> > +                 }
> > +             }
> > +         }
> > +     }
> >         /* If the field has a mode, we want to access it in the
> >        field's mode, not the computed mode.
> >        If a MEM has VOIDmode (external with incomplete type),
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 78409aa1453..21b163a969e 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -96,6 +96,7 @@ OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> >   OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> >   OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> >   OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> > +OPTAB_CD(vec_insert_optab, "vec_insert$a$b")
> >   OPTAB_CD(vec_init_optab, "vec_init$a$b")
> >
> >   OPTAB_CD (while_ult_optab, "while_ult$a$b")
> >


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04  7:19                   ` Richard Biener
@ 2020-09-04  7:23                     ` Richard Biener
  2020-09-04  9:18                       ` luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-04  7:23 UTC (permalink / raw)
  To: luoxhu
  Cc: GCC Patches, David Edelsohn, Bill Schmidt, linkw, Segher Boessenkool

On Fri, Sep 4, 2020 at 9:19 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Sep 4, 2020 at 8:38 AM luoxhu <luoxhu@linux.ibm.com> wrote:
> >
> >
> >
> > On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:
> > > Hi,
> > >
> > >
> > > Yes, I checked and found that both vec_set and vec_extract doesn't support
> > > variable index for most targets, store_bit_field_1 and extract_bit_field_1
> > > would only consider use optabs when index is integer value.  Anyway, it
> > > shouldn't be hard to extend depending on target requirements.
> > >
> > > Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
> > > different gimple code:
> > >
> > > {
> > > _1 = n & 3;
> > > VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
> > > }
> > >
> > > vs:
> > >
> > > {
> > >    __vector signed int v1;
> > >    __vector signed int D.3192;
> > >    long unsigned int _1;
> > >    long unsigned int _2;
> > >    int * _3;
> > >
> > >    <bb 2> [local count: 1073741824]:
> > >    D.3192 = v_4(D);
> > >    _1 = n_7(D) & 3;
> > >    _2 = _1 * 4;
> > >    _3 = &D.3192 + _2;
> > >    *_3 = i_8(D);
> > >    v1_10 = D.3192;
> > >    return v1_10;
> > > }
> >
> > Just realized use convert_vector_to_array_for_subscript would generate
> > "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" before produce those instructions,
> >   your confirmation and comments will be highly appreciated...  Thanks in
> > advance. :)
>
> I think what the GCC vector extensions produce is generally better
> so wherever "code generation" for vec_insert resides it should be
> adjusted to produce the same code.  Same for vec_extract.

Guess altivec.h, dispatching to __builtin_vec_insert.  Wonder why it wasn't

#define vec_insert(a,b,c) (a)[c]=(b)

anyway, you obviously have some lowering of the builtin somewhere in rs6000.c
and thus can adjust that.

> > Xionghu
> >
> > >
> > > If not use builtin for "vec_insert(v, i, n)", the pointer is "int*" instead
> > > of vector type, will this be difficult for expander to capture so many
> > > statements then call the optabs?  So shall we still keep the builtin style
> > > for "vec_insert(v, i, n)" and expand "v[n&3]=i" with optabs or expand both
> > > with optabs???
> > >
> > > Drafted a fast patch to expand "v[n&3]=i" with optabs as below, sorry that not
> > > using existed vec_set yet as not quite sure, together with the first patch, both
> > > cases could be handled as expected:
> > >
> > >
> > > [PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index
> > >
> > > v[n%4] = i has same semantic with vec_insert (i, v, n), but it will be
> > > optimized to "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" in gimple, this
> > > patch tries to recognize the pattern in expander and use optabs to
> > > expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.
> > >
> > > gcc/ChangeLog:
> > >
> > >       * config/rs6000/vector.md:
> > >       * expr.c (expand_assignment):
> > >       * optabs.def (OPTAB_CD):
> > > ---
> > >   gcc/config/rs6000/vector.md | 13 +++++++++++
> > >   gcc/expr.c                  | 46 +++++++++++++++++++++++++++++++++++++
> > >   gcc/optabs.def              |  1 +
> > >   3 files changed, 60 insertions(+)
> > >
> > > diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
> > > index 796345c80d3..46d21271e17 100644
> > > --- a/gcc/config/rs6000/vector.md
> > > +++ b/gcc/config/rs6000/vector.md
> > > @@ -1244,6 +1244,19 @@ (define_expand "vec_extract<mode><VEC_base_l>"
> > >     DONE;
> > >   })
> > >
> > > +(define_expand "vec_insert<VEC_base_l><mode>"
> > > +  [(match_operand:VEC_E 0 "vlogical_operand")
> > > +   (match_operand:<VEC_base> 1 "register_operand")
> > > +   (match_operand 2 "register_operand")]
> > > +  "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
> > > +{
> > > +  rtx target = gen_reg_rtx (V16QImode);
> > > +  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
> > > +  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, V16QImode, 0);
> > > +  emit_insn (gen_rtx_SET (operands[0], sub_target));
> > > +  DONE;
> > > +})
> > > +
> > >   ;; Convert double word types to single word types
> > >   (define_expand "vec_pack_trunc_v2df"
> > >     [(match_operand:V4SF 0 "vfloat_operand")
> > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > index dd2200ddea8..ce2890c1a2d 100644
> > > --- a/gcc/expr.c
> > > +++ b/gcc/expr.c
> > > @@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
> > >
> > >         to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
> > >
> > > +      tree type = TREE_TYPE (to);
> > > +      if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
> > > +       && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
> > > +       && tree_to_uhwi (TYPE_SIZE (type))
> > > +              * tree_to_uhwi (TYPE_SIZE_UNIT (type))
> > > +            == 128)
> > > +     {
> > > +       tree op0 = TREE_OPERAND (to, 0);
> > > +       tree op1 = TREE_OPERAND (to, 1);
> > > +       if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
> > > +         {
> > > +           tree view_op0 = TREE_OPERAND (op0, 0);
> > > +           mode = TYPE_MODE (TREE_TYPE (view_op0));
> > > +           if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
> > > +             {
> > > +               rtx value
> > > +                 = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> > > +               rtx pos
> > > +                 = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> > > +               rtx temp_target = gen_reg_rtx (mode);
> > > +               emit_move_insn (temp_target, to_rtx);
> > > +
> > > +               machine_mode outermode = mode;
> > > +               scalar_mode innermode = GET_MODE_INNER (outermode);
> > > +               class expand_operand ops[3];
> > > +               enum insn_code icode
> > > +                 = convert_optab_handler (vec_insert_optab, innermode,
> > > +                                          outermode);
> > > +
> > > +               if (icode != CODE_FOR_nothing)
> > > +                 {
> > > +                   pos = convert_to_mode (E_SImode, pos, 0);
> > > +
> > > +                   create_fixed_operand (&ops[0], temp_target);
> > > +                   create_input_operand (&ops[1], value, innermode);
> > > +                   create_input_operand (&ops[2], pos, GET_MODE (pos));
> > > +                   if (maybe_expand_insn (icode, 3, ops))
> > > +                     {
> > > +                       emit_move_insn (to_rtx, temp_target);
> > > +                       pop_temp_slots ();
> > > +                       return;
> > > +                     }
> > > +                 }
> > > +             }
> > > +         }
> > > +     }
> > >         /* If the field has a mode, we want to access it in the
> > >        field's mode, not the computed mode.
> > >        If a MEM has VOIDmode (external with incomplete type),
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index 78409aa1453..21b163a969e 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -96,6 +96,7 @@ OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> > >   OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> > >   OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> > >   OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> > > +OPTAB_CD(vec_insert_optab, "vec_insert$a$b")
> > >   OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > >
> > >   OPTAB_CD (while_ult_optab, "while_ult$a$b")
> > >


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04  7:23                     ` Richard Biener
@ 2020-09-04  9:18                       ` luoxhu
  2020-09-04 10:23                         ` Segher Boessenkool
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-04  9:18 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, David Edelsohn, Bill Schmidt, linkw, Segher Boessenkool



On 2020/9/4 15:23, Richard Biener wrote:
> On Fri, Sep 4, 2020 at 9:19 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Fri, Sep 4, 2020 at 8:38 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>>
>>>> Yes, I checked and found that both vec_set and vec_extract doesn't support
>>>> variable index for most targets, store_bit_field_1 and extract_bit_field_1
>>>> would only consider use optabs when index is integer value.  Anyway, it
>>>> shouldn't be hard to extend depending on target requirements.
>>>>
>>>> Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
>>>> different gimple code:
>>>>
>>>> {
>>>> _1 = n & 3;
>>>> VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
>>>> }
>>>>
>>>> vs:
>>>>
>>>> {
>>>>     __vector signed int v1;
>>>>     __vector signed int D.3192;
>>>>     long unsigned int _1;
>>>>     long unsigned int _2;
>>>>     int * _3;
>>>>
>>>>     <bb 2> [local count: 1073741824]:
>>>>     D.3192 = v_4(D);
>>>>     _1 = n_7(D) & 3;
>>>>     _2 = _1 * 4;
>>>>     _3 = &D.3192 + _2;
>>>>     *_3 = i_8(D);
>>>>     v1_10 = D.3192;
>>>>     return v1_10;
>>>> }
>>>
>>> Just realized use convert_vector_to_array_for_subscript would generate
>>> "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;" before produce those instructions,
>>>    your confirmation and comments will be highly appreciated...  Thanks in
>>> advance. :)
>>
>> I think what the GCC vector extensions produce is generally better
>> so wherever "code generation" for vec_insert resides it should be
>> adjusted to produce the same code.  Same for vec_extract.
> 
> Guess altivec.h, dispatching to __builtin_vec_insert.  Wonder why it wasn't
> 
> #define vec_insert(a,b,c) (a)[c]=(b)
> 
> anyway, you obviously have some lowering of the builtin somewhere in rs6000.c
> and thus can adjust that.
> 

Yes, altivec.h uses that style for all vector functions; not sure why.
But this could be adjusted by the patch below during front-end parsing,
which would also generate "VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;"
in gimple, so that both v[n&3]=i and vec_insert(v, i, n) could use optabs
in the expander:


diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 03b00738a5e..00c65311f76 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
       /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
@@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
          SET_EXPR_LOCATION (stmt, loc);
          stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
        }
-
-      innerptrtype = build_pointer_type (arg1_inner_type);
-
-      stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
-      stmt = convert (innerptrtype, stmt);
-      stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
-      stmt = build_indirect_ref (loc, stmt, RO_NULL);
-      stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
-                    convert (TREE_TYPE (stmt), arg0));
+      stmt = build_array_ref (loc, stmt, arg2);
+      stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
       stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
       return stmt;
     }


* Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04  9:18                       ` luoxhu
@ 2020-09-04 10:23                         ` Segher Boessenkool
  2020-09-07  5:43                           ` [PATCH v2] " luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-04 10:23 UTC (permalink / raw)
  To: luoxhu; +Cc: Richard Biener, GCC Patches, David Edelsohn, Bill Schmidt, linkw

Hi!

On Fri, Sep 04, 2020 at 05:18:49PM +0800, luoxhu wrote:
> On 2020/9/4 15:23, Richard Biener wrote:
> > On Fri, Sep 4, 2020 at 9:19 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >> On Fri, Sep 4, 2020 at 8:38 AM luoxhu <luoxhu@linux.ibm.com> wrote:
> >>> On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:
> >>>> Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
> >>>> different gimple code:
> >>>>
> >>>> {
> >>>> _1 = n & 3;
> >>>> VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i;
> >>>> }
> >>>>
> >>>> vs:
> >>>>
> >>>> {
> >>>>     __vector signed int v1;
> >>>>     __vector signed int D.3192;
> >>>>     long unsigned int _1;
> >>>>     long unsigned int _2;
> >>>>     int * _3;
> >>>>
> >>>>     <bb 2> [local count: 1073741824]:
> >>>>     D.3192 = v_4(D);
> >>>>     _1 = n_7(D) & 3;
> >>>>     _2 = _1 * 4;
> >>>>     _3 = &D.3192 + _2;
> >>>>     *_3 = i_8(D);
> >>>>     v1_10 = D.3192;
> >>>>     return v1_10;
> >>>> }

No, the semantics of vec_insert aren't exactly that.  It doesn't modify
the vector in place; it returns a copy with the modification.  But yes,
it could/should just use this same VIEW_CONVERT_EXPR(...)[...] thing for
that.

> >> I think what the GCC vector extensions produce is generally better
> >> so wherever "code generation" for vec_insert resides it should be
> >> adjusted to produce the same code.  Same for vec_extract.

Yup.

> > Guess altivec.h, dispatching to __builtin_vec_insert.  Wonder why it wasn't
> > 
> > #define vec_insert(a,b,c) (a)[c]=(b)
> > 
> > anyway, you obviously have some lowering of the builtin somewhere in rs6000.c
> > and thus can adjust that.
> > 
> 
> Yes, altivec.h use that style for all vector functions, not sure why.

Probably simply because pretty much everything in there is just calling
builtins, so everything new follows suit.  It is contagious ;-)

> But this could be adjusted by below patch during front end parsing,
> which could also generate  "VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;"
> in gimple, then both v[n&3]=i and vec_insert(v, i, n) could use optabs
> in expander:
> 
> 
> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
> index 03b00738a5e..00c65311f76 100644
> --- a/gcc/config/rs6000/rs6000-c.c
> +++ b/gcc/config/rs6000/rs6000-c.c
>        /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
>           SET_EXPR_LOCATION (stmt, loc);
>           stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
>         }
> -
> -      innerptrtype = build_pointer_type (arg1_inner_type);
> -
> -      stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
> -      stmt = convert (innerptrtype, stmt);
> -      stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
> -      stmt = build_indirect_ref (loc, stmt, RO_NULL);
> -      stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
> -                    convert (TREE_TYPE (stmt), arg0));
> +      stmt = build_array_ref (loc, stmt, arg2);
> +      stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
>        stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
>        return stmt;
>      }

You should make a copy of the vector, not modify the original one in
place?  (If I read that correctly, heh.)  Looks good otherwise.


Segher


* [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-04 10:23                         ` Segher Boessenkool
@ 2020-09-07  5:43                           ` luoxhu
  2020-09-07 11:57                             ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-07  5:43 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, GCC Patches, David Edelsohn, Bill Schmidt, linkw

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

Hi,

On 2020/9/4 18:23, Segher Boessenkool wrote:
>> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
>> index 03b00738a5e..00c65311f76 100644
>> --- a/gcc/config/rs6000/rs6000-c.c
>> +++ b/gcc/config/rs6000/rs6000-c.c
>>         /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
>> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
>>            SET_EXPR_LOCATION (stmt, loc);
>>            stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
>>          }
>> -
>> -      innerptrtype = build_pointer_type (arg1_inner_type);
>> -
>> -      stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
>> -      stmt = convert (innerptrtype, stmt);
>> -      stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
>> -      stmt = build_indirect_ref (loc, stmt, RO_NULL);
>> -      stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
>> -                    convert (TREE_TYPE (stmt), arg0));
>> +      stmt = build_array_ref (loc, stmt, arg2);
>> +      stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
>>         stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
>>         return stmt;
>>       }
> You should make a copy of the vector, not modify the original one in
> place?  (If I read that correctly, heh.)  Looks good otherwise.
> 

Segher, there is already existing code that makes a copy of the vector,
as we discussed offline.  Thanks for the reminder.

cat pr79251.c.006t.gimple
__attribute__((noinline))
test (__vector signed int v, int i, size_t n)
{
  __vector signed int D.3192;
  __vector signed int D.3194;
  __vector signed int v1;
  v1 = v;
  D.3192 = v1;
  _1 = n & 3;
  VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;
  v1 = D.3192;
  D.3194 = v1;
  return D.3194;
}

Attached is the v2 patch, which does:
1) Build a VIEW_CONVERT_EXPR for vec_insert (i, v, n), like v[n%4] = i,
to unify the gimple code, so the expander can use vec_set_optab to
expand it.
2) Recognize the pattern in the expander and use optabs to expand the
VIEW_CONVERT_EXPR with a variable index to fast instructions:
lvsl+xxperm+xxsel.

Thanks,
Xionghu

[-- Attachment #2: v2-0001-rs6000-Expand-vec_insert-in-expander-instead-of-g.patch --]
[-- Type: text/plain, Size: 20784 bytes --]

From 9a7a47b086579e26ae13f378732cea067b64a4e6 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luoxhu@linux.ibm.com>
Date: Wed, 19 Aug 2020 03:54:17 -0500
Subject: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple
 [PR79251]

vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the position at which to insert arg1 into arg0.
The current expander generates stxv+stwx+lxv if arg2 is a variable instead
of a constant, which causes a serious store-hit-load performance issue on
Power.  This patch:
 1) Builds a VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n%4] = i to
unify the gimple code, so the expander can use vec_set_optab to expand it.
 2) Recognizes the pattern in the expander and uses optabs to expand the
VIEW_CONVERT_EXPR vec_insert with a variable index into fast instructions:
lvsl+xxperm+xxsel.
This way, "vec_insert (i, v, n)" and "v[n%4] = i" are not expanded too early
in the gimple stage if arg2 is a variable, avoiding the generation of
store-hit-load instructions.

For Power9 V4SI:
	addi 9,1,-16
	rldic 6,6,2,60
	stxv 34,-16(1)
	stwx 5,9,6
	lxv 34,-16(1)
=>
	addis 9,2,.LC0@toc@ha
	addi 9,9,.LC0@toc@l
	mtvsrwz 33,5
	lxv 32,0(9)
	sradi 9,6,2
	addze 9,9
	sldi 9,9,2
	subf 9,9,6
	subfic 9,9,3
	sldi 9,9,2
	subfic 9,9,20
	lvsl 13,0,9
	xxperm 33,33,45
	xxperm 32,32,45
	xxsel 34,34,33,32

Though instructions increase from 5 to 15, the performance is improved
60% in typical cases.
Tested with V2DI, V2DF, V4SI, V4SF, V8HI and V16QI on Power9-LE and
Power8-BE; bootstrap and regression tests pass.

gcc/ChangeLog:

2020-09-07  Xionghu Luo  <luoxhu@linux.ibm.com>

	* config/rs6000/altivec.md (altivec_lvsl_reg): Rename to
	altivec_lvsl_reg_<mode>2 and extend to SDI mode.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Adjust variable-index vec_insert to generate VIEW_CONVERT_EXPR.
	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_set_var):
	New declaration.
	* config/rs6000/rs6000.c (rs6000_expand_vector_set_var):
	New function.
	* config/rs6000/rs6000.md (FQHS): New mode iterator.
	(FD): New mode iterator.
	(p8_mtvsrwz_v16qi<mode>2): New define_insn.
	(p8_mtvsrd_v16qi<mode>2): New define_insn.
	* config/rs6000/vector.md (vec_set<mode>): Accept a register
	operand for the index.
	* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.
	* expr.c (expand_view_convert_to_vec_set): New function.
	(expand_assignment): Call expand_view_convert_to_vec_set.

gcc/testsuite/ChangeLog:

2020-09-07  Xionghu Luo  <luoxhu@linux.ibm.com>

	* gcc.target/powerpc/pr79251.c: New test.
	* gcc.target/powerpc/pr79251-run.c: New test.
	* gcc.target/powerpc/pr79251.h: New header.
---
 gcc/config/rs6000/altivec.md                  |   4 +-
 gcc/config/rs6000/rs6000-c.c                  |  22 ++-
 gcc/config/rs6000/rs6000-protos.h             |   1 +
 gcc/config/rs6000/rs6000.c                    | 146 ++++++++++++++++++
 gcc/config/rs6000/rs6000.md                   |  19 +++
 gcc/config/rs6000/vector.md                   |  19 ++-
 gcc/config/rs6000/vsx.md                      |   2 +-
 gcc/expr.c                                    |  61 ++++++++
 .../gcc.target/powerpc/pr79251-run.c          |  29 ++++
 gcc/testsuite/gcc.target/powerpc/pr79251.c    |  15 ++
 gcc/testsuite/gcc.target/powerpc/pr79251.h    |  19 +++
 11 files changed, 318 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251-run.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr79251.h

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 0a2e634d6b0..66b636059a6 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2772,10 +2772,10 @@ (define_expand "altivec_lvsl"
   DONE;
 })
 
-(define_insn "altivec_lvsl_reg"
+(define_insn "altivec_lvsl_reg_<mode>2"
   [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
 	(unspec:V16QI
-	[(match_operand:DI 1 "gpc_reg_operand" "b")]
+	[(match_operand:SDI 1 "gpc_reg_operand" "b")]
 	UNSPEC_LVSL_REG))]
   "TARGET_ALTIVEC"
   "lvsl %0,0,%1"
diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 2fad3d94706..78abe49c833 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1509,9 +1509,7 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
       tree arg1;
       tree arg2;
       tree arg1_type;
-      tree arg1_inner_type;
       tree decl, stmt;
-      tree innerptrtype;
       machine_mode mode;
 
       /* No second or third arguments. */
@@ -1563,8 +1561,13 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
 	  return build_call_expr (call, 3, arg1, arg0, arg2);
 	}
 
-      /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
-      arg1_inner_type = TREE_TYPE (arg1_type);
+      /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0 with
+	 VIEW_CONVERT_EXPR.  i.e.:
+	 D.3192 = v1;
+	 _1 = n & 3;
+	 VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;
+	 v1 = D.3192;
+	 D.3194 = v1;  */
       if (TYPE_VECTOR_SUBPARTS (arg1_type) == 1)
 	arg2 = build_int_cst (TREE_TYPE (arg2), 0);
       else
@@ -1593,15 +1596,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
 	  SET_EXPR_LOCATION (stmt, loc);
 	  stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
 	}
-
-      innerptrtype = build_pointer_type (arg1_inner_type);
-
-      stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
-      stmt = convert (innerptrtype, stmt);
-      stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
-      stmt = build_indirect_ref (loc, stmt, RO_NULL);
-      stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
-		     convert (TREE_TYPE (stmt), arg0));
+      stmt = build_array_ref (loc, stmt, arg2);
+      stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
       stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
       return stmt;
     }
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 28e859f4381..f6f8bd65c2f 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -58,6 +58,7 @@ extern bool rs6000_split_128bit_ok_p (rtx []);
 extern void rs6000_expand_float128_convert (rtx, rtx, bool);
 extern void rs6000_expand_vector_init (rtx, rtx);
 extern void rs6000_expand_vector_set (rtx, rtx, int);
+extern void rs6000_expand_vector_set_var (rtx, rtx, rtx, rtx);
 extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
 extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
 extern rtx rs6000_adjust_vec_address (rtx, rtx, rtx, rtx, machine_mode);
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index fe93cf6ff2b..d22d7999a61 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -6788,6 +6788,152 @@ rs6000_expand_vector_set (rtx target, rtx val, int elt)
   emit_insn (gen_rtx_SET (target, x));
 }
 
+/* Insert VAL at position IDX of VEC; put the result in TARGET.  */
+
+void
+rs6000_expand_vector_set_var (rtx target, rtx vec, rtx val, rtx idx)
+{
+  machine_mode mode = GET_MODE (vec);
+
+  if (VECTOR_MEM_VSX_P (mode) && CONST_INT_P (idx))
+    gcc_unreachable ();
+  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx)
+	   && TARGET_DIRECT_MOVE_64BIT)
+    {
+      gcc_assert (GET_MODE (idx) == E_SImode);
+      machine_mode inner_mode = GET_MODE (val);
+      HOST_WIDE_INT mode_mask = GET_MODE_MASK (inner_mode);
+
+      rtx tmp = gen_reg_rtx (GET_MODE (idx));
+      if (GET_MODE_SIZE (inner_mode) == 8)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 1 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (1), idx));
+	      /*  idx = idx * 8.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (3)));
+	      /*  idx = 16 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
+	    }
+	  else
+	    {
+	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (3)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
+	    }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 4)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 3 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (3), idx));
+	      /*  idx = idx * 4.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (2)));
+	      /*  idx = 20 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
+	    }
+	  else
+	    {
+	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (2)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
+	    }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 2)
+	{
+	  if (!BYTES_BIG_ENDIAN)
+	    {
+	      /*  idx = 7 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (7), idx));
+	      /*  idx = idx * 2.  */
+	      emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (1)));
+	      /*  idx = 22 - idx.  */
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), tmp));
+	    }
+	  else
+	    {
+	      emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (1)));
+	      emit_insn (gen_subsi3 (tmp, GEN_INT (22), tmp));
+	    }
+	}
+      else if (GET_MODE_SIZE (inner_mode) == 1)
+	if (!BYTES_BIG_ENDIAN)
+	  emit_insn (gen_addsi3 (tmp, idx, GEN_INT (8)));
+	else
+	  emit_insn (gen_subsi3 (tmp, GEN_INT (23), idx));
+      else
+	gcc_unreachable ();
+
+      /*  lxv vs32, mask.
+	  DImode: 0xffffffffffffffff0000000000000000
+	  SImode: 0x00000000ffffffff0000000000000000
+	  HImode: 0x000000000000ffff0000000000000000.
+	  QImode: 0x00000000000000ff0000000000000000.  */
+      rtx mask = gen_reg_rtx (V16QImode);
+      rtx mask_v2di = gen_reg_rtx (V2DImode);
+      rtvec v = rtvec_alloc (2);
+      if (!BYTES_BIG_ENDIAN)
+	{
+	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, 0);
+	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, mode_mask);
+	}
+      else
+	{
+	  RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, mode_mask);
+	  RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, 0);
+	}
+      emit_insn (
+	gen_vec_initv2didi (mask_v2di, gen_rtx_PARALLEL (V2DImode, v)));
+      rtx sub_mask = simplify_gen_subreg (V16QImode, mask_v2di, V2DImode, 0);
+      emit_insn (gen_rtx_SET (mask, sub_mask));
+
+      /*  mtvsrd[wz] f0,val.  */
+      rtx val_v16qi = gen_reg_rtx (V16QImode);
+      switch (inner_mode)
+	{
+	default:
+	  gcc_unreachable ();
+	  break;
+	case E_QImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qiqi2 (val_v16qi, val));
+	  break;
+	case E_HImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qihi2 (val_v16qi, val));
+	  break;
+	case E_SImode:
+	  emit_insn (gen_p8_mtvsrwz_v16qisi2 (val_v16qi, val));
+	  break;
+	case E_SFmode:
+	  emit_insn (gen_p8_mtvsrwz_v16qisf2 (val_v16qi, val));
+	  break;
+	case E_DImode:
+	  emit_insn (gen_p8_mtvsrd_v16qidi2 (val_v16qi, val));
+	  break;
+	case E_DFmode:
+	  emit_insn (gen_p8_mtvsrd_v16qidf2 (val_v16qi, val));
+	  break;
+	}
+
+      /*  lvsl    v1,0,idx.  */
+      rtx pcv = gen_reg_rtx (V16QImode);
+      emit_insn (gen_altivec_lvsl_reg_si2 (pcv, tmp));
+
+      /*  xxperm  vs0,vs0,vs33.  */
+      /*  xxperm  vs32,vs32,vs33.  */
+      rtx val_perm = gen_reg_rtx (V16QImode);
+      rtx mask_perm = gen_reg_rtx (V16QImode);
+      emit_insn (
+	gen_altivec_vperm_v8hiv16qi (val_perm, val_v16qi, val_v16qi, pcv));
+      emit_insn (gen_altivec_vperm_v8hiv16qi (mask_perm, mask, mask, pcv));
+
+      rtx sub_target = simplify_gen_subreg (V16QImode, vec, mode, 0);
+      emit_insn (gen_rtx_SET (target, sub_target));
+
+      /*  xxsel   vs34,vs34,vs0,vs32.  */
+      emit_insn (gen_vector_select_v16qi (target, target, val_perm, mask_perm));
+    }
+}
+
 /* Extract field ELT from VEC into TARGET.  */
 
 void
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 43b620ae1c0..b02fda836d4 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -8713,6 +8713,25 @@ (define_insn "p8_mtvsrwz"
   "mtvsrwz %x0,%1"
   [(set_attr "type" "mftgpr")])
 
+(define_mode_iterator FQHS [SF QI HI SI])
+(define_mode_iterator FD [DF DI])
+
+(define_insn "p8_mtvsrwz_v16qi<mode>2"
+  [(set (match_operand:V16QI 0 "register_operand" "=wa")
+	(unspec:V16QI [(match_operand:FQHS 1 "register_operand" "r")]
+		   UNSPEC_P8V_MTVSRWZ))]
+  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
+  "mtvsrwz %x0,%1"
+  [(set_attr "type" "mftgpr")])
+
+(define_insn "p8_mtvsrd_v16qi<mode>2"
+  [(set (match_operand:V16QI 0 "register_operand" "=wa")
+	(unspec:V16QI [(match_operand:FD 1 "register_operand" "r")]
+		   UNSPEC_P8V_MTVSRD))]
+  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
+  "mtvsrd %x0,%1"
+  [(set_attr "type" "mftgpr")])
+
 (define_insn_and_split "reload_fpr_from_gpr<mode>"
   [(set (match_operand:FMOVE64X 0 "register_operand" "=d")
 	(unspec:FMOVE64X [(match_operand:FMOVE64X 1 "register_operand" "r")]
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 796345c80d3..28e59c1c995 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1227,11 +1227,24 @@ (define_expand "vec_init<mode><VEC_base_l>"
 (define_expand "vec_set<mode>"
   [(match_operand:VEC_E 0 "vlogical_operand")
    (match_operand:<VEC_base> 1 "register_operand")
-   (match_operand 2 "const_int_operand")]
+   (match_operand 2 "reg_or_cint_operand")]
   "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
 {
-  rs6000_expand_vector_set (operands[0], operands[1], INTVAL (operands[2]));
-  DONE;
+  if (CONST_INT_P (operands[2]))
+    {
+      rs6000_expand_vector_set (operands[0], operands[1], INTVAL (operands[2]));
+      DONE;
+    }
+  else
+    {
+      rtx target = gen_reg_rtx (V16QImode);
+      rs6000_expand_vector_set_var (target, operands[0], operands[1],
+				    operands[2]);
+      rtx sub_target
+	= simplify_gen_subreg (GET_MODE (operands[0]), target, V16QImode, 0);
+      emit_insn (gen_rtx_SET (operands[0], sub_target));
+      DONE;
+    }
 })
 
 (define_expand "vec_extract<mode><VEC_base_l>"
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index dd750210758..7e82690d12d 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5349,7 +5349,7 @@ (define_expand "xl_len_r"
   rtx rtx_vtmp = gen_reg_rtx (V16QImode);
   rtx tmp = gen_reg_rtx (DImode);
 
-  emit_insn (gen_altivec_lvsl_reg (shift_mask, operands[2]));
+  emit_insn (gen_altivec_lvsl_reg_di2 (shift_mask, operands[2]));
   emit_insn (gen_ashldi3 (tmp, operands[2], GEN_INT (56)));
   emit_insn (gen_lxvll (rtx_vtmp, operands[1], tmp));
   emit_insn (gen_altivec_vperm_v8hiv16qi (operands[0], rtx_vtmp, rtx_vtmp,
diff --git a/gcc/expr.c b/gcc/expr.c
index dd2200ddea8..31545891262 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -5129,6 +5129,57 @@ mem_ref_refers_to_non_mem_p (tree ref)
   return non_mem_decl_p (base);
 }
 
+/* Expand a VIEW_CONVERT_EXPR to vec_set with variable index position.
+   For V4SI:
+   _1 = n_5 & 3;
+   VIEW_CONVERT_EXPR<int[4]>(D.3186)[_1] = i_6;  */
+
+static inline bool
+expand_view_convert_to_vec_set (tree to, tree from, rtx to_rtx)
+{
+  tree type = TREE_TYPE (to);
+  tree op0 = TREE_OPERAND (to, 0);
+  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+      && tree_fits_uhwi_p (TYPE_SIZE (type)))
+    {
+      tree op1 = TREE_OPERAND (to, 1);
+      tree view_op0 = TREE_OPERAND (op0, 0);
+      machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+      scalar_mode innermode = GET_MODE_INNER (outermode);
+      wide_int minv, maxv;
+      if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+	  && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+	  && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+	  && determine_value_range (op1, &minv, &maxv) == VR_RANGE
+	  && wi::geu_p (minv, 0)
+	  && wi::ltu_p (maxv, 128 / GET_MODE_BITSIZE (innermode)))
+	{
+	  rtx value = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+	  rtx pos = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+	  rtx temp_target = gen_reg_rtx (outermode);
+	  emit_move_insn (temp_target, to_rtx);
+
+	  class expand_operand ops[3];
+	  enum insn_code icode = optab_handler (vec_set_optab, outermode);
+
+	  if (icode != CODE_FOR_nothing)
+	    {
+	      pos = convert_to_mode (E_SImode, pos, 0);
+
+	      create_fixed_operand (&ops[0], temp_target);
+	      create_input_operand (&ops[1], value, innermode);
+	      create_input_operand (&ops[2], pos, GET_MODE (pos));
+	      if (maybe_expand_insn (icode, 3, ops))
+		{
+		  emit_move_insn (to_rtx, temp_target);
+		  return true;
+		}
+	    }
+	}
+    }
+  return false;
+}
+
 /* Expand an assignment that stores the value of FROM into TO.  If NONTEMPORAL
    is true, try generating a nontemporal store.  */
 
@@ -5237,6 +5288,16 @@ expand_assignment (tree to, tree from, bool nontemporal)
 
       to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
 
+      if (TREE_CODE (to) == ARRAY_REF)
+	{
+	  tree op0 = TREE_OPERAND (to, 0);
+	  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+	      && expand_view_convert_to_vec_set (to, from, to_rtx))
+	    {
+	      pop_temp_slots ();
+	      return;
+	    }
+	}
       /* If the field has a mode, we want to access it in the
 	 field's mode, not the computed mode.
 	 If a MEM has VOIDmode (external with incomplete type),
diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251-run.c b/gcc/testsuite/gcc.target/powerpc/pr79251-run.c
new file mode 100644
index 00000000000..840f6712ad2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr79251-run.c
@@ -0,0 +1,29 @@
+/* { dg-do run { target { lp64 && p9vector_hw } } } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -maltivec" } */
+
+#include <stddef.h>
+#include <altivec.h>
+#include "pr79251.h"
+
+TEST_VEC_INSERT_ALL (test)
+
+#define run_test(TYPE, num)                                                    \
+  {                                                                            \
+    vector TYPE v;                                                             \
+    vector TYPE u = {0x0};                                                     \
+    for (long k = 0; k < 16 / sizeof (TYPE); k++)                              \
+      v[k] = 0xaa;                                                             \
+    for (long k = 0; k < 16 / sizeof (TYPE); k++)                              \
+      {                                                                        \
+	u = test##num (v, 254, k);                                             \
+	if (u[k] != (TYPE) 254)                                                \
+	  __builtin_abort ();                                                  \
+      }                                                                        \
+  }
+
+int
+main (void)
+{
+  TEST_VEC_INSERT_ALL (run_test)
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251.c b/gcc/testsuite/gcc.target/powerpc/pr79251.c
new file mode 100644
index 00000000000..8124f503df9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr79251.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -maltivec" } */
+
+#include <stddef.h>
+#include <altivec.h>
+#include "pr79251.h"
+
+TEST_VEC_INSERT_ALL (test)
+
+/* { dg-final { scan-assembler-not {\mstxw\M} } } */
+/* { dg-final { scan-assembler-times {\mlvsl\M} 10 } } */
+/* { dg-final { scan-assembler-times {\mxxperm\M} 20 } } */
+/* { dg-final { scan-assembler-times {\mxxsel\M} 10 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr79251.h b/gcc/testsuite/gcc.target/powerpc/pr79251.h
new file mode 100644
index 00000000000..609371c96cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr79251.h
@@ -0,0 +1,19 @@
+
+#define test(TYPE, num)                                                        \
+  __attribute__ ((noinline, noclone))                                          \
+    vector TYPE test##num (vector TYPE v, TYPE i, unsigned int n)              \
+  {                                                                            \
+    return vec_insert (i, v, n);                                               \
+  }
+
+#define TEST_VEC_INSERT_ALL(T)                                                 \
+  T (char, 0)                                                                  \
+  T (unsigned char, 1)                                                         \
+  T (short, 2)                                                                 \
+  T (unsigned short, 3)                                                        \
+  T (int, 4)                                                                   \
+  T (unsigned int, 5)                                                          \
+  T (long long, 6)                                                             \
+  T (unsigned long long, 7)                                                    \
+  T (float, 8)                                                                 \
+  T (double, 9)
-- 
2.27.0.90.geebb51ba8c


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-07  5:43                           ` [PATCH v2] " luoxhu
@ 2020-09-07 11:57                             ` Richard Biener
  2020-09-08  8:11                               ` luoxhu
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-07 11:57 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Mon, Sep 7, 2020 at 7:44 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
> Hi,
>
> On 2020/9/4 18:23, Segher Boessenkool wrote:
> >> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
> >> index 03b00738a5e..00c65311f76 100644
> >> --- a/gcc/config/rs6000/rs6000-c.c
> >> +++ b/gcc/config/rs6000/rs6000-c.c
> >>         /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
> >> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
> >>            SET_EXPR_LOCATION (stmt, loc);
> >>            stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
> >>          }
> >> -
> >> -      innerptrtype = build_pointer_type (arg1_inner_type);
> >> -
> >> -      stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
> >> -      stmt = convert (innerptrtype, stmt);
> >> -      stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
> >> -      stmt = build_indirect_ref (loc, stmt, RO_NULL);
> >> -      stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
> >> -                    convert (TREE_TYPE (stmt), arg0));
> >> +      stmt = build_array_ref (loc, stmt, arg2);
> >> +      stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
> >>         stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
> >>         return stmt;
> >>       }
> > You should make a copy of the vector, not modify the original one in
> > place?  (If I read that correctly, heh.)  Looks good otherwise.
> >
>
> Segher, there  is already existed code to make a copy of vector as we
> discussed offline.  Thanks for the reminder.
>
> cat pr79251.c.006t.gimple
> __attribute__((noinline))
> test (__vector signed int v, int i, size_t n)
> {
>   __vector signed int D.3192;
>   __vector signed int D.3194;
>   __vector signed int v1;
>   v1 = v;
>   D.3192 = v1;
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR<int[4]>(D.3192)[_1] = i;
>   v1 = D.3192;
>   D.3194 = v1;
>   return D.3194;
> }
>
> Attached the v2 patch which does:
> 1) Build VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n%4] = i to
> unify the gimple code, then expander could use vec_set_optab to expand.
> 2) Recognize the pattern in expander and use optabs to expand
> VIEW_CONVERT_EXPR to vec_insert with variable index to fast instructions:
> lvsl+xxperm+xxsel.

Looking at the RTL expander side I see several issues.


@@ -5237,6 +5288,16 @@ expand_assignment (tree to, tree from, bool nontemporal)

       to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

+      if (TREE_CODE (to) == ARRAY_REF)
+       {
+         tree op0 = TREE_OPERAND (to, 0);
+         if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+             && expand_view_convert_to_vec_set (to, from, to_rtx))
+           {
+             pop_temp_slots ();
+             return;
+           }
+       }

you're placing this at an awkward spot IMHO, after to_rtx expansion
but disregarding parts of it and compensating just with 'to' matching.
Is the pieces (offset, bitpos) really too awkward to work with for
matching?

Because as written you'll miscompile

struct X { __vector signed int v; __vector signed int u; } x;

test(int i, int a)
{
  x.u[i] = a;
}

as I think you'll end up assigning to x.v.

Are we just interested in the case were we store to a
pseudo or also when the destination is memory?  I guess
only when it's a pseudo - correct?  In that case
handling this all in optimize_bitfield_assignment_op
is probably the best thing to try.

Note we possibly refrain from assigning a pseudo to
such vector because we see a variable array-ref to it.

Richard.

> Thanks,
> Xionghu


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-07 11:57                             ` Richard Biener
@ 2020-09-08  8:11                               ` luoxhu
  2020-09-08  8:26                                 ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-08  8:11 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

Hi Richi,

On 2020/9/7 19:57, Richard Biener wrote:
> +      if (TREE_CODE (to) == ARRAY_REF)
> +       {
> +         tree op0 = TREE_OPERAND (to, 0);
> +         if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
> +             && expand_view_convert_to_vec_set (to, from, to_rtx))
> +           {
> +             pop_temp_slots ();
> +             return;
> +           }
> +       }
> 
> you're placing this at an awkward spot IMHO, after to_rtx expansion
> but disregarding parts of it and compensating just with 'to' matching.
> Is the pieces (offset, bitpos) really too awkward to work with for
> matching?
> 
> Because as written you'll miscompile
> 
> > struct X { __vector signed int v; __vector signed int u; } x;
> 
> test(int i, int a)
> {
>    x.u[i] = a;
> }
> 
> as I think you'll end up assigning to x.v.

Thanks for pointing that out; this case would indeed be a problem for the
patch.  I checked optimize_bitfield_assignment_op: it returns very early
because mode1 is not VOIDmode, and this is actually not a "FIELD op= VAL"
operation, is it?

To be honest, I am not very familiar with this part of the code.  I put the
new function expand_view_convert_to_vec_set just after the to_rtx expansion
because adjust_address changes the V4SImode memory to SImode memory, but I
need to keep the target to_rtx in V4SImode to save the vector after calling
rs6000_vector_set_var, so it seems paradoxical here?

 p to_rtx
$264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S16 A128])

=> to_rtx = adjust_address (to_rtx, mode1, 0);

p to_rtx
$265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 A128])


> 
> Are we just interested in the case were we store to a
> pseudo or also when the destination is memory?  I guess
> only when it's a pseudo - correct?  In that case
> handling this all in optimize_bitfield_assignment_op
> is probably the best thing to try.
> 
> Note we possibly refrain from assigning a pseudo to
> such vector because we see a variable array-ref to it.

It seems to be not only pseudos: for example, with "v = vec_insert (i, v, n);"
the vector variable is stored to the stack first, so [r112:DI] here is a
memory operand to be processed.  The patch therefore loads it from the stack
(insn #10) into a temporary vector register first, and stores it back to the
stack (insn #24) after rs6000_vector_set_var.

optimized:

D.3185 = v_3(D);
_1 = n_5(D) & 3;
VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D);
v_8 = D.3185;
return v_8;

=> expand without the patch:

    2: r119:V4SI=%2:V4SI
    3: r120:DI=%5:DI
    4: r121:DI=%6:DI
    5: NOTE_INSN_FUNCTION_BEG
    8: [r112:DI]=r119:V4SI

    9: r122:DI=r121:DI&0x3
   10: r123:DI=r122:DI<<0x2
   11: r124:DI=r112:DI+r123:DI
   12: [r124:DI]=r120:DI#0

   13: r126:V4SI=[r112:DI]
   14: r118:V4SI=r126:V4SI
   18: %2:V4SI=r118:V4SI
   19: use %2:V4SI

=> expand with the patch (replace #9~#12 to #10~#24):

    2: r119:V4SI=%2:V4SI
    3: r120:DI=%5:DI
    4: r121:DI=%6:DI
    5: NOTE_INSN_FUNCTION_BEG
    8: [r112:DI]=r119:V4SI
    9: r122:DI=r121:DI&0x3

   10: r123:V4SI=[r112:DI]         // load from stack
   11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;}
   12: r125:SI=r125:SI<<0x2
   13: {r125:SI=0x14-r125:SI;clobber ca:SI;}
   14: r128:DI=unspec[`*.LC0',%2:DI] 47
      REG_EQUAL `*.LC0'
   15: r127:V2DI=[r128:DI]
      REG_EQUAL const_vector
   16: r126:V16QI=r127:V2DI#0
   17: r129:V16QI=unspec[r120:DI#0] 61
   18: r130:V16QI=unspec[r125:SI] 151
   19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232
   20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232
   21: r124:V16QI=r123:V4SI#0
   22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI}
   23: r123:V4SI=r124:V16QI#0
   24: [r112:DI]=r123:V4SI       // store to stack.

   25: r134:V4SI=[r112:DI]
   26: r118:V4SI=r134:V4SI
   30: %2:V4SI=r118:V4SI
   31: use %2:V4SI


Thanks,
Xionghu


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-08  8:11                               ` luoxhu
@ 2020-09-08  8:26                                 ` Richard Biener
  2020-09-09  1:47                                   ` luoxhu
  2020-09-09 13:47                                   ` Segher Boessenkool
  0 siblings, 2 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-08  8:26 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Tue, Sep 8, 2020 at 10:11 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
> Hi Richi,
>
> On 2020/9/7 19:57, Richard Biener wrote:
> > +      if (TREE_CODE (to) == ARRAY_REF)
> > +       {
> > +         tree op0 = TREE_OPERAND (to, 0);
> > +         if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
> > +             && expand_view_convert_to_vec_set (to, from, to_rtx))
> > +           {
> > +             pop_temp_slots ();
> > +             return;
> > +           }
> > +       }
> >
> > you're placing this at an awkward spot IMHO, after to_rtx expansion
> > but disregarding parts of it and compensating just with 'to' matching.
> > Is the pieces (offset, bitpos) really too awkward to work with for
> > matching?
> >
> > Because as written you'll miscompile
> >
> > > struct X { __vector signed int v; __vector signed int u; } x;
> >
> > test(int i, int a)
> > {
> >    x.u[i] = a;
> > }
> >
> > as I think you'll end up assigning to x.v.
>
> Thanks for pointing out, this case will be a problem for the patch.
> I checked with optimize_bitfield_assignment_op, it will return very early
> as the mode1 is not VOIDmode, and this is actually not "FIELD op= VAL"
> operation?
>
> To be honest, I am not quite familiar with this part of code, I put the new
> function expand_view_convert_to_vec_set just after to_rtx expansion because
> adjust_address will change the V4SImode memory to SImode memory, but I need
> keep target to_rtx V4SImode to save the vector after calling
> rs6000_vector_set_var, so seems paradoxical here?
>
>  p to_rtx
> $264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S16 A128])
>
> => to_rtx = adjust_address (to_rtx, mode1, 0);
>
> p to_rtx
> $265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 A128])
>
>
> >
> > Are we just interested in the case were we store to a
> > pseudo or also when the destination is memory?  I guess
> > only when it's a pseudo - correct?  In that case
> > handling this all in optimize_bitfield_assignment_op
> > is probably the best thing to try.
> >
> > Note we possibly refrain from assigning a pseudo to
> > such vector because we see a variable array-ref to it.
>
> Seems not only pseudo, for example "v = vec_insert (i, v, n);"
> the vector variable will be store to stack first, then [r112:DI] is a
> memory here to be processed.  So the patch loads it from stack(insn #10) to
> temp vector register first, and store to stack again(insn #24) after
> rs6000_vector_set_var.

Hmm, yeah - I guess that's what should be addressed first then.
I'm quite sure that in case 'v' is not on the stack but in memory like
in my case a SImode store is better than what we get from
vec_insert - in fact vec_insert will likely introduce a RMW cycle
which is prone to inserting store-data-races?

So - what we need to "fix" is cfgexpand.c marking variably-indexed
decls as not to be expanded as registers (see
discover_nonconstant_array_refs).

I guess one way forward would be to perform instruction
selection on GIMPLE here and transform

VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D)

to a (direct) internal function based on the vec_set optab.  But then
in GIMPLE D.3185 is also still memory (we don't have a variable
index partial register set operation - BIT_INSERT_EXPR is
currently specified to receive a constant bit position only).

At which point after your patch is the stack storage elided?

>
> optimized:
>
> D.3185 = v_3(D);
> _1 = n_5(D) & 3;
> VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D);
> v_8 = D.3185;
> return v_8;
>
> => expand without the patch:
>
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>
>     9: r122:DI=r121:DI&0x3
>    10: r123:DI=r122:DI<<0x2
>    11: r124:DI=r112:DI+r123:DI
>    12: [r124:DI]=r120:DI#0
>
>    13: r126:V4SI=[r112:DI]
>    14: r118:V4SI=r126:V4SI
>    18: %2:V4SI=r118:V4SI
>    19: use %2:V4SI
>
> => expand with the patch (replace #9~#12 to #10~#24):
>
>     2: r119:V4SI=%2:V4SI
>     3: r120:DI=%5:DI
>     4: r121:DI=%6:DI
>     5: NOTE_INSN_FUNCTION_BEG
>     8: [r112:DI]=r119:V4SI
>     9: r122:DI=r121:DI&0x3
>
>    10: r123:V4SI=[r112:DI]         // load from stack
>    11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;}
>    12: r125:SI=r125:SI<<0x2
>    13: {r125:SI=0x14-r125:SI;clobber ca:SI;}
>    14: r128:DI=unspec[`*.LC0',%2:DI] 47
>       REG_EQUAL `*.LC0'
>    15: r127:V2DI=[r128:DI]
>       REG_EQUAL const_vector
>    16: r126:V16QI=r127:V2DI#0
>    17: r129:V16QI=unspec[r120:DI#0] 61
>    18: r130:V16QI=unspec[r125:SI] 151
>    19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232
>    20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232
>    21: r124:V16QI=r123:V4SI#0
>    22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI}
>    23: r123:V4SI=r124:V16QI#0
>    24: [r112:DI]=r123:V4SI       // store to stack.
>
>    25: r134:V4SI=[r112:DI]
>    26: r118:V4SI=r134:V4SI
>    30: %2:V4SI=r118:V4SI
>    31: use %2:V4SI
>
>
> Thanks,
> Xionghu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-08  8:26                                 ` Richard Biener
@ 2020-09-09  1:47                                   ` luoxhu
  2020-09-09  7:30                                     ` Richard Biener
  2020-09-09 13:47                                   ` Segher Boessenkool
  1 sibling, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-09  1:47 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw



On 2020/9/8 16:26, Richard Biener wrote:
>> Seems not only pseudo, for example "v = vec_insert (i, v, n);"
>> the vector variable will be store to stack first, then [r112:DI] is a
>> memory here to be processed.  So the patch loads it from stack(insn #10) to
>> temp vector register first, and store to stack again(insn #24) after
>> rs6000_vector_set_var.
> Hmm, yeah - I guess that's what should be addressed first then.
> I'm quite sure that in case 'v' is not on the stack but in memory like
> in my case a SImode store is better than what we get from
> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> which is prone to inserting store-data-races?

Yes, for your case there is no stack operation, and to_rtx is expanded
with BLKmode instead of V4SImode.  Adding the to_rtx mode check could work
around it.  The asm doesn't show a store-hit-load issue.

optimized:

_1 = i_2(D) % 4;
VIEW_CONVERT_EXPR<int[4]>(x.u)[_1] = a_4(D);

expand:
    2: r118:DI=%3:DI
    3: r119:DI=%4:DI
    4: NOTE_INSN_FUNCTION_BEG
    7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47
      REG_EQUAL `*.LANCHOR0'
    8: r122:SI=r118:DI#0
    9: {r124:SI=r122:SI/0x4;clobber ca:SI;}
   10: r125:SI=r124:SI<<0x2
   11: r123:SI=r122:SI-r125:SI
      REG_EQUAL r122:SI%0x4
   12: r126:DI=sign_extend(r123:SI)
   13: r127:DI=r126:DI+0x4
   14: r128:DI=r127:DI<<0x2
   15: r129:DI=r120:DI+r128:DI
   16: [r129:DI]=r119:DI#0

 p to_rtx
$319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128])

asm:
        addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .localentry     test,.-test
        srawi 9,3,2
        addze 9,9
        addis 10,2,.LANCHOR0@toc@ha
        addi 10,10,.LANCHOR0@toc@l
        slwi 9,9,2
        subf 9,9,3
        extsw 9,9
        addi 9,9,4
        sldi 9,9,2
        stwx 4,10,9
        blr


> 
> So - what we need to "fix" is cfgexpand.c marking variably-indexed
> decls as not to be expanded as registers (see
> discover_nonconstant_array_refs).
> 
> I guess one way forward would be to perform instruction
> selection on GIMPLE here and transform
> 
> VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D)
> 
> to a (direct) internal function based on the vec_set optab.  

I don't quite understand what you mean here.  Do you mean:
ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set
or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set?
And which pass would be an acceptable place for the selection and transform?
Why call it *based on* the vec_set optab?  The VIEW_CONVERT_EXPR or internal function
is expanded to the vec_set optab.

I guess you suggest adding an internal function for VIEW_CONVERT_EXPR in gimple,
and doing the transform from the internal function to the vec_set optab in the expander?
I doubt my understanding, as this looks really over-complicated, since so far we
transform from VIEW_CONVERT_EXPR to the vec_set optab directly...
IIUC, an internal function doesn't seem to help much here, as Segher said before.


> But then in GIMPLE D.3185 is also still memory (we don't have a variable
> index partial register set operation - BIT_INSERT_EXPR is
> currently specified to receive a constant bit position only).
> 
> At which point after your patch is the stack storage elided?
> 

Stack storage is elided by the register reload pass in RTL.


Thanks,
Xionghu


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-09  1:47                                   ` luoxhu
@ 2020-09-09  7:30                                     ` Richard Biener
  0 siblings, 0 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-09  7:30 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 9, 2020 at 3:47 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/8 16:26, Richard Biener wrote:
> >> Seems not only pseudo, for example "v = vec_insert (i, v, n);"
> >> the vector variable will be store to stack first, then [r112:DI] is a
> >> memory here to be processed.  So the patch loads it from stack(insn #10) to
> >> temp vector register first, and store to stack again(insn #24) after
> >> rs6000_vector_set_var.
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
>
> Yes, for your case, there is no stack operation and to_rtx is expanded
> with BLKmode instead of V4SImode.  Add the to_rtx mode check could workaround
> it.  ASM doesn't show store hit load issue.
>
> optimized:
>
> _1 = i_2(D) % 4;
> VIEW_CONVERT_EXPR<int[4]>(x.u)[_1] = a_4(D);
>
> expand:
>     2: r118:DI=%3:DI
>     3: r119:DI=%4:DI
>     4: NOTE_INSN_FUNCTION_BEG
>     7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47
>       REG_EQUAL `*.LANCHOR0'
>     8: r122:SI=r118:DI#0
>     9: {r124:SI=r122:SI/0x4;clobber ca:SI;}
>    10: r125:SI=r124:SI<<0x2
>    11: r123:SI=r122:SI-r125:SI
>       REG_EQUAL r122:SI%0x4
>    12: r126:DI=sign_extend(r123:SI)
>    13: r127:DI=r126:DI+0x4
>    14: r128:DI=r127:DI<<0x2
>    15: r129:DI=r120:DI+r128:DI
>    16: [r129:DI]=r119:DI#0
>
>  p to_rtx
> $319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128])
>
> asm:
>         addis 2,12,.TOC.-.LCF0@ha
>         addi 2,2,.TOC.-.LCF0@l
>         .localentry     test,.-test
>         srawi 9,3,2
>         addze 9,9
>         addis 10,2,.LANCHOR0@toc@ha
>         addi 10,10,.LANCHOR0@toc@l
>         slwi 9,9,2
>         subf 9,9,3
>         extsw 9,9
>         addi 9,9,4
>         sldi 9,9,2
>         stwx 4,10,9
>         blr
>
>
> >
> > So - what we need to "fix" is cfgexpand.c marking variably-indexed
> > decls as not to be expanded as registers (see
> > discover_nonconstant_array_refs).
> >
> > I guess one way forward would be to perform instruction
> > selection on GIMPLE here and transform
> >
> > VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D)
> >
> > to a (direct) internal function based on the vec_set optab.
>
> I don't quite understand what you mean here.  Do you mean:
> ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set

You're writing VIEW_CONVERT_EXPR here but the outermost component
is an ARRAY_REF.  But yes, this is what I meant.

> or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set?
> And which pass to put the selection and transform is acceptable?

Close to RTL expansion.  There's gimple-isel.cc which does instruction selection
for VEC_COND_EXPRs.

> Why call it *based on* vec_set optab?  The VIEW_CONVERT_EXPR or internal function
> is expanded to vec_set optab.

Based on because we have the convenient capability to represent optabs to be
used for RTL expansion as internal function calls on GIMPLE, called
"direct internal function".

> I guess you suggest adding internal function for VIEW_CONVERT_EXPR in gimple,
> and do the transform from internal function to vec_set optab in expander?

No, I suggest to "add" an internal function for the vec_set optab, see
DEF_INTERNAL_OPTAB_FN in internal-fn.def
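
Concretely, the suggestion amounts to a one-line entry in gcc/internal-fn.def
(shown here as a sketch; the ECF flags and the expander-type name are
assumptions, not necessarily the form a final patch would use):

```c
/* Hypothetical direct internal function wired to the vec_set optab.
   Signature: DEF_INTERNAL_OPTAB_FN (NAME, FLAGS, OPTAB, TYPE).  */
DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
```

GIMPLE could then carry a call such as `.VEC_SET (vec, val, idx)` which the
expander maps straight onto the optab.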

> I doubt my understanding as this looks really over complicated since we
> transform from VIEW_CONVERT_EXPR to vec_set optab directly so far...
> IIUC, Internal function seems doesn't help much here as Segher said before.

The advantage would be to circumvent GIMPLEs forcing of memory here.
But as I said here:

> > But then in GIMPLE D.3185 is also still memory (we don't have a variable
> > index partial register set operation - BIT_INSERT_EXPR is
> > currently specified to receive a constant bit position only).

it might not work out so easy.  Going down the rathole to avoid forcing
memory during RTL expansion for select cases (vector type bases
with a supported vector mode) might be something to try.

That at least would make the approach of dealing with this
in expand_assignment or siblings sensible.

> > At which point after your patch is the stack storage elided?
> >
>
> Stack storage is elided by register reload pass in RTL.
>
>
> Thanks,
> Xionghu


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-08  8:26                                 ` Richard Biener
  2020-09-09  1:47                                   ` luoxhu
@ 2020-09-09 13:47                                   ` Segher Boessenkool
  2020-09-09 14:28                                     ` Richard Biener
  1 sibling, 1 reply; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-09 13:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

Hi!

On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> Hmm, yeah - I guess that's what should be addressed first then.
> I'm quite sure that in case 'v' is not on the stack but in memory like
> in my case a SImode store is better than what we get from
> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> which is prone to inserting store-data-races?

The other way around -- if it is in memory, and was stored as vector
recently, then reading back something shorter from it is prone to
SHL/LHS problems.  There is nothing special about the stack here, except
of course it is more likely to have been stored recently if on the
stack.  So it depends how often it has been stored recently which option
is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
often, the penalty is relatively bigger, so memory does not often win.

I.e.: it needs to be measured.  Intuition is often wrong here.


Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-09 13:47                                   ` Segher Boessenkool
@ 2020-09-09 14:28                                     ` Richard Biener
  2020-09-09 16:00                                       ` Segher Boessenkool
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-09 14:28 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
>
> The other way around -- if it is in memory, and was stored as vector
> recently, then reading back something shorter from it is prone to
> SHL/LHS problems.  There is nothing special about the stack here, except
> of course it is more likely to have been stored recently if on the
> stack.  So it depends how often it has been stored recently which option
> is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> often, the penalty is relatively bigger, so memory does not often win.
>
> I.e.: it needs to be measured.  Intuition is often wrong here.

But the current method would simply do a direct store to memory
without a preceding read of the whole vector.

Richard.

>
>
> Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-09 14:28                                     ` Richard Biener
@ 2020-09-09 16:00                                       ` Segher Boessenkool
  2020-09-10 10:08                                         ` Richard Biener
  0 siblings, 1 reply; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-09 16:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > Hi!
> >
> > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > > Hmm, yeah - I guess that's what should be addressed first then.
> > > I'm quite sure that in case 'v' is not on the stack but in memory like
> > > in my case a SImode store is better than what we get from
> > > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > > which is prone to inserting store-data-races?
> >
> > The other way around -- if it is in memory, and was stored as vector
> > recently, then reading back something shorter from it is prone to
> > SHL/LHS problems.  There is nothing special about the stack here, except
> > of course it is more likely to have been stored recently if on the
> > stack.  So it depends how often it has been stored recently which option
> > is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> > often, the penalty is relatively bigger, so memory does not often win.
> >
> > I.e.: it needs to be measured.  Intuition is often wrong here.
> 
> But the current method would simply do a direct store to memory
> without a preceeding read of the whole vector.

The problem is even worse the other way: you do a short store here, but
then a full vector read later.  If the store and read are far apart, that
is fine, but if they are close (that is on the order of fifty or more
insns), there can be problems.

There often are problems over function calls (where the compiler cannot
usually *see* how something is used).


Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-09 16:00                                       ` Segher Boessenkool
@ 2020-09-10 10:08                                         ` Richard Biener
  2020-09-14  8:05                                           ` luoxhu
  2020-09-14 20:21                                           ` Segher Boessenkool
  0 siblings, 2 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-10 10:08 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > >
> > > Hi!
> > >
> > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > > > Hmm, yeah - I guess that's what should be addressed first then.
> > > > I'm quite sure that in case 'v' is not on the stack but in memory like
> > > > in my case a SImode store is better than what we get from
> > > > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > > > which is prone to inserting store-data-races?
> > >
> > > The other way around -- if it is in memory, and was stored as vector
> > > recently, then reading back something shorter from it is prone to
> > > SHL/LHS problems.  There is nothing special about the stack here, except
> > > of course it is more likely to have been stored recently if on the
> > > stack.  So it depends how often it has been stored recently which option
> > > is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> > > often, the penalty is relatively bigger, so memory does not often win.
> > >
> > > I.e.: it needs to be measured.  Intuition is often wrong here.
> >
> > But the current method would simply do a direct store to memory
> > without a preceeding read of the whole vector.
>
> The problem is even worse the other way: you do a short store here, but
> so a full vector read later.  If the store and read are far apart, that
> is fine, but if they are close (that is on the order of fifty or more
> insns), there can be problems.

Sure, but you can't simply load/store a whole vector when the code didn't,
unless you know it will not introduce data races and it will not trap (thus
the whole vector needs to be at least naturally aligned).

Also, if there's a preceding short store you will now load the whole vector
to avoid the short store ... catch-22

> There often are problems over function calls (where the compiler cannot
> usually *see* how something is used).

Yep.  The best way would be to use small loads and larger stores
which is what CPUs usually tend to handle fine (with alignment
constraints, etc.).  Of course that's not what either of the "solutions"
can do.

That said, since you seem to be "first" in having an instruction
to insert into a vector at a variable position the idea that we'd
have to spill anyway for this to be expanded and thus we expand
the vector to a stack location in the first place falls down.  And
that's where I'd first try to improve things.

So what can the CPU actually do?

Richard.

>
>
> Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-10 10:08                                         ` Richard Biener
@ 2020-09-14  8:05                                           ` luoxhu
  2020-09-14  9:47                                             ` Richard Biener
  2020-09-14 20:21                                           ` Segher Boessenkool
  1 sibling, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-14  8:05 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool
  Cc: GCC Patches, David Edelsohn, Bill Schmidt, linkw



On 2020/9/10 18:08, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>>
>> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
>>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
>>> <segher@kernel.crashing.org> wrote:
>>>>
>>>> Hi!
>>>>
>>>> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
>>>>> Hmm, yeah - I guess that's what should be addressed first then.
>>>>> I'm quite sure that in case 'v' is not on the stack but in memory like
>>>>> in my case a SImode store is better than what we get from
>>>>> vec_insert - in fact vec_insert will likely introduce a RMW cycle
>>>>> which is prone to inserting store-data-races?
>>>>
>>>> The other way around -- if it is in memory, and was stored as vector
>>>> recently, then reading back something shorter from it is prone to
>>>> SHL/LHS problems.  There is nothing special about the stack here, except
>>>> of course it is more likely to have been stored recently if on the
>>>> stack.  So it depends how often it has been stored recently which option
>>>> is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
>>>> often, the penalty is relatively bigger, so memory does not often win.
>>>>
>>>> I.e.: it needs to be measured.  Intuition is often wrong here.
>>>
>>> But the current method would simply do a direct store to memory
>>> without a preceeding read of the whole vector.
>>
>> The problem is even worse the other way: you do a short store here, but
>> so a full vector read later.  If the store and read are far apart, that
>> is fine, but if they are close (that is on the order of fifty or more
>> insns), there can be problems.
> 
> Sure, but you can't simply load/store a whole vector when the code didn't
> unless you know it will not introduce data races and it will not trap (thus
> the whole vector needs to be at least naturally aligned).
> 
> Also if there's a preceeding short store you will now load the whole vector
> to avoid the short store ... catch-22
> 
>> There often are problems over function calls (where the compiler cannot
>> usually *see* how something is used).
> 
> Yep.  The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.).  Of course that's not what either of the "solutions"
> can do.
> 
> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position the idea that we'd
> have to spill anyway for this to be expanded and thus we expand
> the vector to a stack location in the first place falls down.  And
> that's where I'd first try to improve things.
> 
> So what can the CPU actually do?
> 

Not sure whether this reflects the issues you discussed above.
I constructed the test cases below and tested with and without this patch;
only for "a+c" (which means store only) does performance get worse with
this patch, due to the extra load/store (about 50% slower).
For the "v = vec_insert (i, v, n);" usage, v is always loaded after the store,
so this patch always gives a big benefit.
But for the "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
inside the function or soon after outside of it.  Does this mean that unifying
the two usages into the same gimple code is sometimes not a good idea?  Or is it
possible to detect whether the destination "v" of the generated IFN
".VEC_INSERT (&v, i_4(D), _1);" is used nearby, inside or outside of the function?


#define TYPE int

vector TYPE v = {1, 2, 3, 4};   // a. global vector.
vector TYPE s = {4, 3, 2, 1};

__attribute__ ((noinline))
vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  v[n % 4] = i;
  return u;
}

int main ()
{
  //vector TYPE v = {1, 2, 3, 4};    // b. local vector.
  //vector TYPE s = {4, 3, 2, 1};
  vector TYPE r = {0};
  for (long k = 0; k < 3989478945; k++)
    {
        r += test (s, 254.0f, k);  // c. access different vector. store only.
       //r += test (v, 254.0f, k);  // d. access same vector. load after store in callee function.
      //test (s, 254.0f, k); r += v;   //e. load after store out of function.
    }
  return r[0];
}


Thanks,
Xionghu



* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14  8:05                                           ` luoxhu
@ 2020-09-14  9:47                                             ` Richard Biener
  2020-09-14 10:47                                               ` Richard Sandiford
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-14  9:47 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/10 18:08, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> >>
> >> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> >>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> >>> <segher@kernel.crashing.org> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> >>>>> Hmm, yeah - I guess that's what should be addressed first then.
> >>>>> I'm quite sure that in case 'v' is not on the stack but in memory like
> >>>>> in my case a SImode store is better than what we get from
> >>>>> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> >>>>> which is prone to inserting store-data-races?
> >>>>
> >>>> The other way around -- if it is in memory, and was stored as vector
> >>>> recently, then reading back something shorter from it is prone to
> >>>> SHL/LHS problems.  There is nothing special about the stack here, except
> >>>> of course it is more likely to have been stored recently if on the
> >>>> stack.  So it depends how often it has been stored recently which option
> >>>> is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> >>>> often, the penalty is relatively bigger, so memory does not often win.
> >>>>
> >>>> I.e.: it needs to be measured.  Intuition is often wrong here.
> >>>
> >>> But the current method would simply do a direct store to memory
> >>> without a preceeding read of the whole vector.
> >>
> >> The problem is even worse the other way: you do a short store here, but
> >> so a full vector read later.  If the store and read are far apart, that
> >> is fine, but if they are close (that is on the order of fifty or more
> >> insns), there can be problems.
> >
> > Sure, but you can't simply load/store a whole vector when the code didn't
> > unless you know it will not introduce data races and it will not trap (thus
> > the whole vector needs to be at least naturally aligned).
> >
> > Also if there's a preceeding short store you will now load the whole vector
> > to avoid the short store ... catch-22
> >
> >> There often are problems over function calls (where the compiler cannot
> >> usually *see* how something is used).
> >
> > Yep.  The best way would be to use small loads and larger stores
> > which is what CPUs usually tend to handle fine (with alignment
> > constraints, etc.).  Of course that's not what either of the "solutions"
> > can do.
> >
> > That said, since you seem to be "first" in having an instruction
> > to insert into a vector at a variable position the idea that we'd
> > have to spill anyway for this to be expanded and thus we expand
> > the vector to a stack location in the first place falls down.  And
> > that's where I'd first try to improve things.
> >
> > So what can the CPU actually do?
> >
>
> Not sure whether this reflects the issues you discussed above.
> I constructed below test cases and tested with and without this patch,
> only if "a+c"(which means store only), the performance is getting bad with
> this patch due to extra load/store(about 50% slower).
> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
> patch will always get big benefit.
> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
> inside the function or out of the function soon.  Does this mean unify the two
> usage to same gimple code not a good idea sometimes?  Or is it possible to
> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
> used not far away inside or out side of the function?
>
>
> #define TYPE int
>
> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> vector TYPE s = {4, 3, 2, 1};
>
> __attribute__ ((noinline))
> vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   v[n % 4] = i;

^^^

this should be

   u[n % 4] = i;

I guess.  Is the % 4 mandated by the vec_insert semantics btw?

If you tested with the above error you probably need to re-measure.

On gimple the above function (after fixing it) looks like

  VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only - transform
this to

  vector_register_2 = u;
  vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
  u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a
vec_set on a register, and if that was the only variable indexing of 'u'
it will also cause 'u' to be expanded as a register rather than stack memory.

Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0 which isn't possible in SSA form.  That might hint
at that we eventually want to extend BIT_INSERT_EXPR to handle
a non-constant bit position but for experiments using an alternate
internal function is certainly easier.

Richard.

>   return u;
> }
>
> int main ()
> {
>   //vector TYPE v = {1, 2, 3, 4};    // b. local vector.
>   //vector TYPE s = {4, 3, 2, 1};
>   vector TYPE r = {0};
>   for (long k = 0; k < 3989478945; k++)
>     {
>         r += test (s, 254.0f, k);  // c. access different vector. store only.
>        //r += test (v, 254.0f, k);  // d. access same vector. load after store in callee function.
>       //test (s, 254.0f, k); r += v;   //e. load after store out of function.
>     }
>   return r[0];
> }
>
>
> Thanks,
> Xionghu
>


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14  9:47                                             ` Richard Biener
@ 2020-09-14 10:47                                               ` Richard Sandiford
  2020-09-14 11:22                                                 ` Richard Biener
  2020-09-14 21:06                                                 ` Segher Boessenkool
  2020-09-14 20:59                                               ` Segher Boessenkool
  2020-09-15  3:56                                               ` luoxhu
  2 siblings, 2 replies; 43+ messages in thread
From: Richard Sandiford @ 2020-09-14 10:47 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches
  Cc: luoxhu, Richard Biener, Bill Schmidt, David Edelsohn,
	Segher Boessenkool, linkw

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On gimple the above function (after fixing it) looks like
>
>   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
>
> and the IFN idea I had would - for non-global memory 'u' only - transform
> this to
>
>   vector_register_2 = u;
>   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>   u = vector_register_3;
>
> if vec_set can handle variable indexes.  This then becomes a
> vec_set on a register and if that was the only variable indexing of 'u'
> will also cause 'u' to be expanded as register rather than stack memory.
>
> Note we can't use the direct-optab method here since the vec_set optab
> modifies operand 0 which isn't possible in SSA form.

Would it be worth changing the optab so that the input and output are
separate?  Having a single operand would be justified if the operation
was only supposed to touch the selected bytes, but most targets wouldn't
guarantee that for memory operands, even as things stand.

Or maybe the idea was to force the RA's hand by making the input and
output tied even before RA, with separate moves where necessary.
But I'm not sure why vec_set is so special that it requires this
treatment and other optabs don't.

Thanks,
Richard


> That might hint at that we eventually want to extend BIT_INSERT_EXPR
> to handle a non-constant bit position but for experiments using an
> alternate internal function is certainly easier.
>
> Richard.


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14 10:47                                               ` Richard Sandiford
@ 2020-09-14 11:22                                                 ` Richard Biener
  2020-09-14 11:49                                                   ` Richard Sandiford
  2020-09-14 21:06                                                 ` Segher Boessenkool
  1 sibling, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-14 11:22 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches, luoxhu, Richard Biener,
	Bill Schmidt, David Edelsohn, Segher Boessenkool, linkw,
	Richard Sandiford

On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On gimple the above function (after fixing it) looks like
> >
> >   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >   vector_register_2 = u;
> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >   u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.
>
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

I thought about this as well, just didn't want to bother Xiong Hu Luo with
it for the experiments.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Certainly the define_expand does not have to be a define_insn, so the target
can force the RA's hand just fine if it likes to.

The more interesting question of course is how to query vec_set whether
it accepts variable indices w/o building too much garbage RTL.

Richard.

>
> Thanks,
> Richard
>
>
> > That might hint at that we eventually want to extend BIT_INSERT_EXPR
> > to handle a non-constant bit position but for experiments using an
> > alternate internal function is certainly easier.
> >
> > Richard.


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14 11:22                                                 ` Richard Biener
@ 2020-09-14 11:49                                                   ` Richard Sandiford
  0 siblings, 0 replies; 43+ messages in thread
From: Richard Sandiford @ 2020-09-14 11:49 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener via Gcc-patches, luoxhu, Bill Schmidt,
	David Edelsohn, Segher Boessenkool, linkw

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > On gimple the above function (after fixing it) looks like
>> >
>> >   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
>> >
>> > and the IFN idea I had would - for non-global memory 'u' only - transform
>> > this to
>> >
>> >   vector_register_2 = u;
>> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>> >   u = vector_register_3;
>> >
>> > if vec_set can handle variable indexes.  This then becomes a
>> > vec_set on a register and if that was the only variable indexing of 'u'
>> > will also cause 'u' to be expanded as register rather than stack memory.
>> >
>> > Note we can't use the direct-optab method here since the vec_set optab
>> > modifies operand 0 which isn't possible in SSA form.
>>
>> Would it be worth changing the optab so that the input and output are
>> separate?  Having a single operand would be justified if the operation
>> was only supposed to touch the selected bytes, but most targets wouldn't
>> guarantee that for memory operands, even as things stand.
>
> I thought about this as well, just didn't want to bother Xiong Hu Luo with
> it for the experiments.
>
>> Or maybe the idea was to force the RA's hand by making the input and
>> output tied even before RA, with separate moves where necessary.
>> But I'm not sure why vec_set is so special that it requires this
>> treatment and other optabs don't.
>
> Certainly the define_expand does not have to be a define_insn, so the target
> can force the RA's hand just fine if it likes to.
>
> The more interesting question of course is how to query vec_set whether
> it accepts variable indices w/o building too much garbage RTL.

Probably easiest to do something like can_vcond_compare_p, i.e.:

  machine_mode mode = insn_data[icode].operand[N].mode;
  rtx reg = alloca_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1);
  … insn_operand_matches (icode, N, reg) …

Thanks,
Richard


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-10 10:08                                         ` Richard Biener
  2020-09-14  8:05                                           ` luoxhu
@ 2020-09-14 20:21                                           ` Segher Boessenkool
  1 sibling, 0 replies; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-14 20:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Thu, Sep 10, 2020 at 12:08:44PM +0200, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > There often are problems over function calls (where the compiler cannot
> > usually *see* how something is used).
> 
> Yep.  The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.).  Of course that's not what either of the "solutions"
> can do.

Yes, and yes.

> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position the idea that we'd
> have to spill anyway for this to be expanded and thus we expand
> the vector to a stack location in the first place falls down.  And
> that's where I'd first try to improve things.
> 
> So what can the CPU actually do?

Both immediate and variable inserts, of 1, 2, 4, or 8 bytes.  The
inserted part is not allowed to cross the 16B boundary (all aligned
stuff never has that problem).  Variable inserts look at only the low
bits of the GPR that says where to insert (4 bits for bytes, 3 bits
for halves, etc.)


Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14  9:47                                             ` Richard Biener
  2020-09-14 10:47                                               ` Richard Sandiford
@ 2020-09-14 20:59                                               ` Segher Boessenkool
  2020-09-15  3:56                                               ` luoxhu
  2 siblings, 0 replies; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-14 20:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Mon, Sep 14, 2020 at 11:47:56AM +0200, Richard Biener wrote:
> this should be
> 
>    u[n % 4] = i;
> 
> I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes:

  VEC_INSERT (ARG1, ARG2, ARG3)
  Purpose: Returns a copy of vector ARG2 with element ARG3 replaced by
  the value of ARG1.  Result value: A copy of vector ARG2 with element
  ARG3 replaced by the value of ARG1. This function uses modular
  arithmetic on ARG3 to determine the element number. For example, if
  ARG3 is out of range, the compiler uses ARG3 modulo the number of
  elements in the vector to determine the element position.

The builtin requires it.  The machine insns work like that, too, e.g.:

  vinswlx VRT,RA,RB

  if MSR.VEC=0 then Vector_Unavailable()
  index ← GPR[RA].bit[60:63]
  VSR[VRT+32].byte[index:index+3] ← GPR[RB].bit[32:63]

  Let index be the contents of bits 60:63 of GPR[RA].

  The contents of bits 32:63 of GPR[RB] are placed into
  byte elements index:index+3 of VSR[VRT+32].

  All other byte elements of VSR[VRT+32] are not
  modified.

  If index is greater than 12, the result is undefined.


Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14 10:47                                               ` Richard Sandiford
  2020-09-14 11:22                                                 ` Richard Biener
@ 2020-09-14 21:06                                                 ` Segher Boessenkool
  1 sibling, 0 replies; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-14 21:06 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches, luoxhu, Richard Biener,
	Bill Schmidt, David Edelsohn, linkw, richard.sandiford

On Mon, Sep 14, 2020 at 11:47:52AM +0100, Richard Sandiford wrote:
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

You have my vote.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Yeah.  The register allocator is normally very good in using the same
reg in both places, if that is useful.  And it also handles the case
where your machine insns require the two to be the same pretty well.
Not restricting this stuff before RA should be a win.


Segher


* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-14  9:47                                             ` Richard Biener
  2020-09-14 10:47                                               ` Richard Sandiford
  2020-09-14 20:59                                               ` Segher Boessenkool
@ 2020-09-15  3:56                                               ` luoxhu
  2020-09-15  6:51                                                 ` Richard Biener
  2 siblings, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-15  3:56 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

[-- Attachment #1: Type: text/plain, Size: 5427 bytes --]



On 2020/9/14 17:47, Richard Biener wrote:
> On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luoxhu@linux.ibm.com> wrote:

>> Not sure whether this reflects the issues you discussed above.
>> I constructed below test cases and tested with and without this patch,
>> only if "a+c"(which means store only), the performance is getting bad with
>> this patch due to extra load/store(about 50% slower).
>> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
>> patch will always get big benefit.
>> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
>> inside the function or out of the function soon.  Does this mean unify the two
>> usage to same gimple code not a good idea sometimes?  Or is it possible to
>> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
>> used not far away inside or out side of the function?
>>
>>
>> #define TYPE int
>>
>> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
>> vector TYPE s = {4, 3, 2, 1};
>>
>> __attribute__ ((noinline))
>> vector TYPE
>> test (vector TYPE u, TYPE i, size_t n)
>> {
>>    v[n % 4] = i;
> 
> ^^^
> 
> this should be
> 
>     u[n % 4] = i;
> 
> I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert (i, u, n);"
is a bit different from "u[n % 4] = i;" since it returns a copy of u instead of
modifying u.  The adjustment when lowering __builtin_vec_insert_xxx to gimple
makes a copy of u first to meet that requirement.

> 
> If you tested with the above error you probably need to re-measure.

No, I did test with u as a local instead of a global before.  If u is not used
again soon, the performance is almost the same whether we generate a single
store or an IFN_SET insert with a variable index.

source:
__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  u[n % 4] = i;
  vector TYPE r = {0};
  for (long k = 0; k < 100; k++)
    {
      r += v;
    }
  return u+r;
}

=> store hit load is relieved due to long distance.

ASM:
0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .localentry     test,.-test
        addis 10,2,.LANCHOR0@toc@ha
        li 8,50
        xxspltib 32,0
        addi 9,1,-16
        rldic 6,6,2,60
        stxv 34,-16(1)
        addi 10,10,.LANCHOR0@toc@l
        mtctr 8
        xxlor 33,32,32
        stwx 5,9,6      // short store
        lxv 45,0(10)
        .p2align 4,,15
.L2:
        vadduwm 0,0,13
        vadduwm 1,1,13
        bdnz .L2
        vadduwm 0,0,1
        lxv 33,-16(1)   // wide load
        vadduwm 2,0,1
        blr


Then I intended to use "v" there as a global-memory insert test, to separate
the store and the load into different functions ("v" is stored in the function
but loaded outside of it; a single store and an lxv+xxperm+xxsel+stxv sequence
then perform differently, and the inner function doesn't know the distance
between the store and the load across functions).

Do you mean that we don't need to generate IFN_SET if "v" is global memory?
I only see VAR_DECL and PARM_DECL; is there any function to check whether a
tree variable is global?  I added DECL_REGISTER, but the RTL still expands to
the stack:

gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);

(gdb) p view_op0
$584 = <var_decl 0x7ffff7f705a0>
(gdb) p DECL_REGISTER(view_op0)
$585 = 1
(gdb) p to_rtx
$586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 S16 A128])


> 
> On gimple the above function (after fixing it) looks like
> 
>    VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
> 
> and the IFN idea I had would - for non-global memory 'u' only - transform
> this to
> 
>    vector_register_2 = u;
>    vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>    u = vector_register_3;
> 
> if vec_set can handle variable indexes.  This then becomes a
> vec_set on a register and if that was the only variable indexing of 'u'
> will also cause 'u' to be expanded as register rather than stack memory.
> 
> Note we can't use the direct-optab method here since the vec_set optab
> modifies operand 0 which isn't possible in SSA form.  That might hint
> at that we eventually want to extend BIT_INSERT_EXPR to handle
> a non-constant bit position but for experiments using an alternate
> internal function is certainly easier.
> 

My current implementation does:

1)  v = vec_insert (i, u, n);

=>gimple:
{
  register __vector signed int D.3190;
  D.3190 = u;            // *new decl and copy u first.*
  _1 = n & 3;
  VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
  _2 = D.3190;
  ...
}

=>isel:
{
  register __vector signed int D.3190;
  D.3190 = u_4(D);
  _1 = n_6(D) & 3;
  .VEC_SET (&D.3190, i_7(D), _1);
  _2 = D.3190;
  ...
}


2) u[n % 4] = i;

=>gimple:
{
  __vector signed int D.3191;
  _1 = n & 3;
  VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
  D.3191 = u;   
 ...
}

=>isel:
{
  D.3190 = u_4(D);
  _1 = n_6(D) & 3;
  .VEC_SET (&D.3191, i_7(D), _1);
  _2 = D.3190;
  v = _2;
}

The IFN ".VEC_SET" behaves quite like the other IFN store functions and doesn't
require a destination operand to be set?  Both 1) and 2) modify operand 0 of
the VIEW_CONVERT_EXPR, just like the vec_set optab.

Attached is the IFN vec_set patch part; the expand part is now moved from
expr.c:expand_assignment to internal-fn.c:expand_vec_set_optab_fn.


Thanks,
Xionghu

[-- Attachment #2: 0001-IFN-Implement-IFN_VEC_SET-for-vec_insert.patch --]
[-- Type: text/plain, Size: 7312 bytes --]

From ae64e4903ebf945995501cd57569cfe4939bc574 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo <luoxhu@linux.ibm.com>
Date: Mon, 14 Sep 2020 21:08:11 -0500
Subject: [PATCH] IFN: Implement IFN_VEC_SET for vec_insert

gcc/ChangeLog:

	* gimple-isel.cc (gimple_expand_vec_set_expr):
	(gimple_expand_vec_cond_exprs):
	* internal-fn.c (vec_set_direct):
	(expand_vec_set_optab_fn):
	(direct_vec_set_optab_supported_p):
	* internal-fn.def (VEC_SET):
---
 gcc/gimple-isel.cc  | 117 +++++++++++++++++++++++++++++++++++++++++++-
 gcc/internal-fn.c   |  38 ++++++++++++++
 gcc/internal-fn.def |   2 +
 3 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..92afe0306fa 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
@@ -35,6 +35,76 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-cfg.h"
 #include "bitmap.h"
 #include "tree-ssa-dce.h"
+#include "fold-const.h"
+
+static gimple *
+gimple_expand_vec_extract_expr (
+  gimple_stmt_iterator *gsi,
+  hash_map<tree, unsigned int> *vec_cond_ssa_name_uses)
+{
+  enum tree_code code;
+
+  /* Only consider code == GIMPLE_ASSIGN.  */
+  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (*gsi));
+  if (!stmt)
+    return NULL;
+
+  code = gimple_assign_rhs_code (stmt);
+  if (code != ARRAY_REF)
+    return NULL;
+
+  return NULL;
+}
+
+static gimple *
+gimple_expand_vec_set_expr (gimple_stmt_iterator *gsi)
+{
+  enum tree_code code;
+  enum insn_code icode;
+  gcall *new_stmt = NULL;
+
+  /* Only consider code == GIMPLE_ASSIGN.  */
+  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (*gsi));
+  if (!stmt)
+    return NULL;
+
+  code = TREE_CODE (gimple_assign_lhs (stmt));
+  if (code != ARRAY_REF)
+    return NULL;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  tree val = gimple_assign_rhs1 (stmt);
+
+  tree type = TREE_TYPE (lhs);
+  tree op0 = TREE_OPERAND (lhs, 0);
+  if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+      && tree_fits_uhwi_p (TYPE_SIZE (type)))
+    {
+      tree pos = TREE_OPERAND (lhs, 1);
+      tree view_op0 = TREE_OPERAND (op0, 0);
+      mark_addressable (view_op0);
+      machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+      scalar_mode innermode = GET_MODE_INNER (outermode);
+      wide_int minv, maxv;
+      if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+	  && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+	  && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+	  && determine_value_range (pos, &minv, &maxv) == VR_RANGE
+	  && wi::geu_p (minv, 0)
+	  && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode))))
+	{
+	  tree addr
+	    = force_gimple_operand_gsi (gsi, build_fold_addr_expr (view_op0),
+					true, NULL_TREE, true, GSI_SAME_STMT);
+
+	  new_stmt
+	    = gimple_build_call_internal (IFN_VEC_SET, 3, addr, val, pos);
+	  gimple_move_vops (new_stmt, stmt);
+	}
+    }
+
+  return new_stmt;
+}
 
 /* Expand all VEC_COND_EXPR gimple assignments into calls to internal
    function based on type of selected expansion.  */
@@ -187,8 +257,26 @@ gimple_expand_vec_cond_exprs (void)
     {
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
 	{
-	  gimple *g = gimple_expand_vec_cond_expr (&gsi,
-						   &vec_cond_ssa_name_uses);
+	  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (gsi));
+	  if (!stmt)
+	    continue;
+
+	  enum tree_code code;
+	  gimple *g = NULL;
+	  code = gimple_assign_rhs_code (stmt);
+	  switch (code)
+	    {
+	    case VEC_COND_EXPR:
+	      g = gimple_expand_vec_cond_expr (&gsi, &vec_cond_ssa_name_uses);
+	      break;
+	    case ARRAY_REF:
+	      g = gimple_expand_vec_extract_expr (&gsi,
+						  &vec_cond_ssa_name_uses);
+	      break;
+	    default:
+	      break;
+	    }
+
 	  if (g != NULL)
 	    {
 	      tree lhs = gimple_assign_lhs (gsi_stmt (gsi));
@@ -204,6 +292,31 @@ gimple_expand_vec_cond_exprs (void)
 
   simple_dce_from_worklist (dce_ssa_names);
 
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gassign *stmt = dyn_cast<gassign *> (gsi_stmt (gsi));
+	  if (!stmt)
+	    continue;
+
+	  enum tree_code code;
+	  gimple *g = NULL;
+	  code = TREE_CODE (gimple_assign_lhs (stmt));
+	  switch (code)
+	    {
+	    case ARRAY_REF:
+	      g = gimple_expand_vec_set_expr (&gsi);
+	      break;
+	    default:
+	      break;
+	    }
+
+	  if (g != NULL)
+	    gsi_replace (&gsi, g, false);
+	}
+    }
+
   return 0;
 }
 
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 8efc77d986b..490d2caa582 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -115,6 +115,7 @@ init_internal_fns ()
 #define vec_condeq_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
 #define len_store_direct { 3, 3, false }
+#define vec_set_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2658,6 +2659,42 @@ expand_vect_cond_mask_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
 #define expand_vec_cond_mask_optab_fn expand_vect_cond_mask_optab_fn
 
+static void
+expand_vec_set_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+{
+  tree op0 = gimple_call_arg (stmt, 0);
+  tree op1 = gimple_call_arg (stmt, 1);
+  tree op2 = gimple_call_arg (stmt, 2);
+  tree view_op0 = TREE_OPERAND (op0, 0);
+  rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+  machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
+  scalar_mode innermode = GET_MODE_INNER (outermode);
+
+  rtx value = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  rtx pos = expand_expr (op2, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+
+  rtx temp_target = gen_reg_rtx (outermode);
+  emit_move_insn (temp_target, to_rtx);
+
+  class expand_operand ops[3];
+  enum insn_code icode = optab_handler (optab, outermode);
+
+  if (icode != CODE_FOR_nothing)
+    {
+      pos = convert_to_mode (E_SImode, pos, 0);
+
+      create_fixed_operand (&ops[0], temp_target);
+      create_input_operand (&ops[1], value, innermode);
+      create_input_operand (&ops[2], pos, GET_MODE (pos));
+      if (maybe_expand_insn (icode, 3, ops))
+	{
+	  emit_move_insn (to_rtx, temp_target);
+	  return;
+	}
+    }
+}
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -3253,6 +3290,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
 #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
 #define direct_check_ptrs_optab_supported_p direct_optab_supported_p
+#define direct_vec_set_optab_supported_p direct_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 13e60828fcf..e6cfe1b6159 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -145,6 +145,8 @@ DEF_INTERNAL_OPTAB_FN (VCONDU, 0, vcondu, vec_condu)
 DEF_INTERNAL_OPTAB_FN (VCONDEQ, 0, vcondeq, vec_condeq)
 DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
 
+DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
+
 DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
 
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
-- 
2.27.0.90.geebb51ba8c



* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-15  3:56                                               ` luoxhu
@ 2020-09-15  6:51                                                 ` Richard Biener
  2020-09-15 16:16                                                   ` Segher Boessenkool
  2020-09-16  6:15                                                   ` luoxhu
  0 siblings, 2 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-15  6:51 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/14 17:47, Richard Biener wrote:
> > On Mon, Sep 14, 2020 at 10:05 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
> >> Not sure whether this reflects the issues you discussed above.
> >> I constructed below test cases and tested with and without this patch,
> >> only if "a+c"(which means store only), the performance is getting bad with
> >> this patch due to extra load/store(about 50% slower).
> >> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
> >> patch will always get big benefit.
> >> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
> >> inside the function or out of the function soon.  Does this mean unify the two
> >> usage to same gimple code not a good idea sometimes?  Or is it possible to
> >> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
> >> used not far away inside or out side of the function?
> >>
> >>
> >> #define TYPE int
> >>
> >> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> >> vector TYPE s = {4, 3, 2, 1};
> >>
> >> __attribute__ ((noinline))
> >> vector TYPE
> >> test (vector TYPE u, TYPE i, size_t n)
> >> {
> >>    v[n % 4] = i;
> >
> > ^^^
> >
> > this should be
> >
> >     u[n % 4] = i;
> >
> > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert (i, u, n);"
> is a bit different from "u[n % 4] = i;" since it returns a copy of u instead of
> modifying u.  The adjustment when lowering __builtin_vec_insert_xxx to gimple
> makes a copy of u first to meet that requirement.
>
> >
> > If you tested with the above error you probably need to re-measure.
>
> No, I did test with u as a local instead of a global before.  If u is not used
> again soon, the performance is almost the same whether we generate a single
> store or an IFN_SET insert with a variable index.
>
> source:
> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   u[n % 4] = i;
>   vector TYPE r = {0};
>   for (long k = 0; k < 100; k++)
>     {
>       r += v;
>     }
>   return u+r;
> }
>
> => store hit load is relieved due to long distance.
>
> ASM:
> 0:      addis 2,12,.TOC.-.LCF0@ha
>         addi 2,2,.TOC.-.LCF0@l
>         .localentry     test,.-test
>         addis 10,2,.LANCHOR0@toc@ha
>         li 8,50
>         xxspltib 32,0
>         addi 9,1,-16
>         rldic 6,6,2,60
>         stxv 34,-16(1)
>         addi 10,10,.LANCHOR0@toc@l
>         mtctr 8
>         xxlor 33,32,32
>         stwx 5,9,6      // short store
>         lxv 45,0(10)
>         .p2align 4,,15
> .L2:
>         vadduwm 0,0,13
>         vadduwm 1,1,13
>         bdnz .L2
>         vadduwm 0,0,1
>         lxv 33,-16(1)   // wide load
>         vadduwm 2,0,1
>         blr
>
>
> Then I intended to use "v" there as a global-memory insert test, to separate
> the store and the load into different functions ("v" is stored in the function
> but loaded outside of it; a single store and an lxv+xxperm+xxsel+stxv sequence
> then perform differently, and the inner function doesn't know the distance
> between the store and the load across functions).
>
> Do you mean that we don't need to generate IFN_SET if "v" is global memory?

Yes.

> I only see VAR_DECL and PARM_DECL; is there any function to check whether a
> tree variable is global?  I added DECL_REGISTER, but the RTL still expands to the stack:

is_global_var () or alternatively !auto_var_in_fn_p ().  I think doing
IFN_SET only makes sense if there's a chance we can promote the variable
to a register.  But it would be an incorrect transform (it stores the
whole vector) if the vector storage could "escape" to another thread -
which means you probably have to check !TREE_ADDRESSABLE as well.

>
> gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);
>
> (gdb) p view_op0
> $584 = <var_decl 0x7ffff7f705a0>
> (gdb) p DECL_REGISTER(view_op0)
> $585 = 1
> (gdb) p to_rtx
> $586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 S16 A128])
>
>
> >
> > On gimple the above function (after fixing it) looks like
> >
> >    VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >    vector_register_2 = u;
> >    vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >    u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.  That might hint
> > at that we eventually want to extend BIT_INSERT_EXPR to handle
> > a non-constant bit position but for experiments using an alternate
> > internal function is certainly easier.
> >
>
> My current implementation does:
>
> 1)  v = vec_insert (i, u, n);
>
> =>gimple:
> {
>   register __vector signed int D.3190;
>   D.3190 = u;            // *new decl and copy u first.*
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>   _2 = D.3190;
>   ...
> }
>
> =>isel:
> {
>   register __vector signed int D.3190;
>   D.3190 = u_4(D);
>   _1 = n_6(D) & 3;
>   .VEC_SET (&D.3190, i_7(D), _1);

why are you passing the address of D.3190 to .VEC_SET?  That will not
make D.3190 be expanded to a pseudo.  You really need to have GIMPLE
registers here (SSA names) and thus a return value, leaving the argument
unmodified:

  D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);

note this is why I asked about the actual CPU instruction - as I read
Segher's mail, the instruction modifies a vector register, not memory.

>   _2 = D.3190;
>   ...
> }
>
>
> 2) u[n % 4] = i;
>
> =>gimple:
> {
>   __vector signed int D.3191;
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>   D.3191 = u;
>  ...
> }
>
> =>isel:
> {
>   D.3190 = u_4(D);
>   _1 = n_6(D) & 3;
>   .VEC_SET (&D.3191, i_7(D), _1);
>   _2 = D.3190;
>   v = _2;
> }
>
> The IFN ".VEC_SET" behaves quite like the other IFN STORE functions and doesn't
> require a dest operand to be set?  Both 1) and 2) modify operand 0 of
> VIEW_CONVERT_EXPR, just like the vec_set optab.
>
> Attached is the IFN vec_set patch part; the expand part is now moved from
> expr.c:expand_assignment to internal-fn.c:expand_vec_set_optab_fn.
>
>
> Thanks,
> Xionghu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-15  6:51                                                 ` Richard Biener
@ 2020-09-15 16:16                                                   ` Segher Boessenkool
  2020-09-16  8:31                                                     ` Richard Biener
  2020-09-16  6:15                                                   ` luoxhu
  1 sibling, 1 reply; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-15 16:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luoxhu@linux.ibm.com> wrote:
> > >     u[n % 4] = i;
> > >
> > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?

(As an aside -- please use "& 3" instead: that works fine if n is signed
as well, but modulo doesn't.  Maybe that is in the patch already, I
didn't check, sorry.)

> note this is why I asked about the actual CPU instruction - as I read
> Segher's mail
> the instruction modifies a vector register, not memory.

But note that the builtin is not the same as the machine instruction --
here there shouldn't be a difference if compiling for a new enough ISA,
but the builtin is available on anything with at least AltiVec.


Segher

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-15  6:51                                                 ` Richard Biener
  2020-09-15 16:16                                                   ` Segher Boessenkool
@ 2020-09-16  6:15                                                   ` luoxhu
  2020-09-16  8:41                                                     ` Richard Biener
  1 sibling, 1 reply; 43+ messages in thread
From: luoxhu @ 2020-09-16  6:15 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw



On 2020/9/15 14:51, Richard Biener wrote:


>> I only see VAR_DECL and PARM_DECL, is there any function to check whether
>> the tree variable is global?  I added DECL_REGISTER, but the RTL still
>> expands to stack:
> 
> is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> IFN_SET only
> makes sense if there's the chance we can promote the variable to a
> register.  But it
> would be an incorrect transform (it stores the whole vector) if the
> vector storage
> could "escape" to another thread - which means you probably have to check
> !TREE_ADDRESSABLE as well.
> 

The tree of param "u" will be marked ADDRESSABLE when generating
"VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;", so if we check
!TREE_ADDRESSABLE, no IFN_SET will be produced in gimple-isel.


#1  0x000000001066c700 in convert_vector_to_array_for_subscript (loc=5307072, vecp=0x7fffffffc5d0,
index=<trunc_mod_expr 0x7ffff59c73a0>) at ../../gcc/gcc/c-family/c-common.c:8169
#2  0x0000000010553b54 in build_array_ref (loc=5307072, array=<parm_decl 0x7ffff5ad0100 u>, index=<
trunc_mod_expr 0x7ffff59c73a0>) at ../../gcc/gcc/c/c-typeck.c:2668
#3  0x00000000105c8824 in c_parser_postfix_expression_after_primary (parser=0x7ffff7f703f0, expr_lo
c=5307040, expr=...) at ../../gcc/gcc/c/c-parser.c:10494
#4  0x00000000105c7570 in c_parser_postfix_expression (parser=0x7ffff7f703f0) at ../../gcc/gcc/c/c-
parser.c:10216

>>
>> My current implementation does:
>>
>> 1)  v = vec_insert (i, u, n);
>>
>> =>gimple:
>> {
>>    register __vector signed int D.3190;
>>    D.3190 = u;            // *new decl and copy u first.*
>>    _1 = n & 3;
>>    VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>>    _2 = D.3190;
>>    ...
>> }
>>
>> =>isel:
>> {
>>    register __vector signed int D.3190;
>>    D.3190 = u_4(D);
>>    _1 = n_6(D) & 3;
>>    .VEC_SET (&D.3190, i_7(D), _1);
> 
> why are you passing the address of D.3190 to .VEC_SET?  That will not
> make D.3190 be expanded to a pseudo.  You really need to have GIMPLE
> registers here (SSA name) and thus a return value, leaving the argument
> unmodified
> 
>    D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
> 
> note this is why I asked about the actual CPU instruction - as I read
> Segher's mail
> the instruction modifies a vector register, not memory.
> 

Updated the code and got the expected gimple-isel output and ASM for both cases:

pr79251.c.236t.isel:

__attribute__((noinline))
test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  __vector signed int _6;
  vector(4) int _7;
  vector(4) int vec_set_dst_8;

  <bb 2> [local count: 1073741824]:
  _1 = n_2(D) & 3;
  _7 = u;
  vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
  u = vec_set_dst_8;
  _6 = u;
  return _6;

}

But the tree variable "u" needs "TREE_ADDRESSABLE (view_op0) = 0;" set
after generating the IFN VEC_SET (maybe checking for IFN VEC_SET and
clearing the bit in discover_nonconstant_array_refs would be better
later); otherwise "u" will still be expanded to the stack, since the
expander reports:
 "Replacing Expressions:  _7 replace with --> _7 = u;".

Setting "u" to non-addressable also seems unreasonable for the case
below, where u[n+1] needs to be ADDRESSABLE:

__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
 u[n % 4] = i;
 u[n+1] = i+1;
 return u;
}

=> gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:

test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  long unsigned int _2;
  int _3;
  __vector signed int _9;
  vector(4) int _10;
  vector(4) int vec_set_dst_11;

  <bb 2> [local count: 1073741824]:
  _1 = n_4(D) & 3;
  _10 = u;
  vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
  u = vec_set_dst_11;
  _2 = n_4(D) + 1;
  _3 = i_6(D) + 1;
  VIEW_CONVERT_EXPR<int[4]>(u)[_2] = _3;
  _9 = u;
  return _9;

}


The code below generates the IFN call; create_tmp_reg or
create_tmp_reg_ssa_name seems not to create a variable that will be
allocated to a register?


diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..a699022cd09 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
...
+      if (!is_global_var (view_op0)
+         && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+         && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+         && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+         && determine_value_range (pos, &minv, &maxv) == VR_RANGE
+         && wi::geu_p (minv, 0)
+         && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode))))
+       {
+         location_t loc = gimple_location (stmt);
+         tree var_src = create_tmp_reg (TREE_TYPE (view_op0));
+         tree var_dst
+           = make_temp_ssa_name (TREE_TYPE (view_op0), NULL, "vec_set_dst");
+         TREE_ADDRESSABLE (view_op0) = 0;
+
+         ass_stmt = gimple_build_assign (var_src, view_op0);
+         gimple_set_location (ass_stmt, loc);
+         gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+         new_stmt
+           = gimple_build_call_internal (IFN_VEC_SET, 3, var_src, val, pos);
+
+         gimple_call_set_lhs (new_stmt, var_dst);
+
+         gimple_set_location (new_stmt, loc);
+         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+
+         ass_stmt = gimple_build_assign (view_op0, var_dst);
+
+         gimple_set_location (ass_stmt, loc);
+         gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+         gimple_move_vops (ass_stmt, stmt);
+       }
+    }
+
+  return ass_stmt;


Thanks,
Xionghu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-15 16:16                                                   ` Segher Boessenkool
@ 2020-09-16  8:31                                                     ` Richard Biener
  2020-09-16 11:11                                                       ` Segher Boessenkool
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Biener @ 2020-09-16  8:31 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luoxhu@linux.ibm.com> wrote:
> > > >     u[n % 4] = i;
> > > >
> > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> (As an aside -- please use "& 3" instead: that works fine if n is signed
> as well, but modulo doesn't.  Maybe that is in the patch already, I
> didn't check, sorry.)
>
> > note this is why I asked about the actual CPU instruction - as I read
> > Segher's mail
> > the instruction modifies a vector register, not memory.
>
> But note that the builtin is not the same as the machine instruction --
> here there shouldn't be a difference if compiling for a new enough ISA,
> but the builtin is available on anything with at least AltiVec.

Well, given we're trying to improve instruction selection, that very much
should depend on the ISA.  Thus if the target says it cannot vec_set
to a register in a variable position then we don't want to pretend it can.

Richard.

>
> Segher

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-16  6:15                                                   ` luoxhu
@ 2020-09-16  8:41                                                     ` Richard Biener
  0 siblings, 0 replies; 43+ messages in thread
From: Richard Biener @ 2020-09-16  8:41 UTC (permalink / raw)
  To: luoxhu
  Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 16, 2020 at 8:15 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/15 14:51, Richard Biener wrote:
>
>
> >> I only see VAR_DECL and PARM_DECL, is there any function to check whether
> >> the tree variable is global?  I added DECL_REGISTER, but the RTL still
> >> expands to stack:
> >
> > is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> > IFN_SET only
> > makes sense if there's the chance we can promote the variable to a
> > register.  But it
> > would be an incorrect transform (it stores the whole vector) if the
> > vector storage
> > could "escape" to another thread - which means you probably have to check
> > !TREE_ADDRESSABLE as well.
> >
>
> The tree of param "u" will be marked ADDRESSABLE when generating
> "VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;", so if we check
> !TREE_ADDRESSABLE, no IFN_SET will be produced in gimple-isel.

The TREE_ADDRESSABLE bit should be cleared later on GIMPLE during
execute_update_address_taken.  If it is not, we should fix that.

> #1  0x000000001066c700 in convert_vector_to_array_for_subscript (loc=5307072, vecp=0x7fffffffc5d0,
> index=<trunc_mod_expr 0x7ffff59c73a0>) at ../../gcc/gcc/c-family/c-common.c:8169
> #2  0x0000000010553b54 in build_array_ref (loc=5307072, array=<parm_decl 0x7ffff5ad0100 u>, index=<
> trunc_mod_expr 0x7ffff59c73a0>) at ../../gcc/gcc/c/c-typeck.c:2668
> #3  0x00000000105c8824 in c_parser_postfix_expression_after_primary (parser=0x7ffff7f703f0, expr_lo
> c=5307040, expr=...) at ../../gcc/gcc/c/c-parser.c:10494
> #4  0x00000000105c7570 in c_parser_postfix_expression (parser=0x7ffff7f703f0) at ../../gcc/gcc/c/c-
> parser.c:10216
>
> >>
> >> My current implementation does:
> >>
> >> 1)  v = vec_insert (i, u, n);
> >>
> >> =>gimple:
> >> {
> >>    register __vector signed int D.3190;
> >>    D.3190 = u;            // *new decl and copy u first.*
> >>    _1 = n & 3;
> >>    VIEW_CONVERT_EXPR<int[4]>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
> >>    _2 = D.3190;
> >>    ...
> >> }
> >>
> >> =>isel:
> >> {
> >>    register __vector signed int D.3190;
> >>    D.3190 = u_4(D);
> >>    _1 = n_6(D) & 3;
> >>    .VEC_SET (&D.3190, i_7(D), _1);
> >
> > why are you passing the address of D.3190 to .VEC_SET?  That will not
> > make D.3190 be expanded to a pseudo.  You really need to have GIMPLE
> > registers here (SSA name) and thus a return value, leaving the argument
> > unmodified
> >
> >    D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
> >
> > note this is why I asked about the actual CPU instruction - as I read
> > Segher's mail
> > the instruction modifies a vector register, not memory.
> >
>
> Updated the code and got the expected gimple-isel output and ASM for both cases:
>
> pr79251.c.236t.isel:
>
> __attribute__((noinline))
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   __vector signed int _6;
>   vector(4) int _7;
>   vector(4) int vec_set_dst_8;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_2(D) & 3;
>   _7 = u;
>   vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
>   u = vec_set_dst_8;
>   _6 = u;
>   return _6;
>
> }
>
> But the tree variable "u" needs "TREE_ADDRESSABLE (view_op0) = 0;" set
> after generating the IFN VEC_SET (maybe checking for IFN VEC_SET and
> clearing the bit in discover_nonconstant_array_refs would be better
> later); otherwise "u" will still be expanded to the stack, since the
> expander reports:
>  "Replacing Expressions:  _7 replace with --> _7 = u;".
>
> Setting "u" to non-addressable also seems unreasonable for the case
> below, where u[n+1] needs to be ADDRESSABLE:

Why should u be addressable for u[n+1]?

> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>  u[n % 4] = i;
>  u[n+1] = i+1;
>  return u;
> }
>
> => gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:
>
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   long unsigned int _2;
>   int _3;
>   __vector signed int _9;
>   vector(4) int _10;
>   vector(4) int vec_set_dst_11;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_4(D) & 3;
>   _10 = u;
>   vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
>   u = vec_set_dst_11;
>   _2 = n_4(D) + 1;
>   _3 = i_6(D) + 1;
>   VIEW_CONVERT_EXPR<int[4]>(u)[_2] = _3;

why is .VEC_SET not used here?

>   _9 = u;
>   return _9;
>
> }
>
>
> The code below generates the IFN call; create_tmp_reg or
> create_tmp_reg_ssa_name seems not to create a variable that will be
> allocated to a register?

Yes, they create decls; use make_ssa_name (TREE_TYPE (view_op0)) to create
an SSA name of the desired type.

>
> diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
> index b330cf4c20e..a699022cd09 100644
> --- a/gcc/gimple-isel.cc
> +++ b/gcc/gimple-isel.cc
> ...
> +      if (!is_global_var (view_op0)
> +         && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
> +         && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
> +         && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
> +         && determine_value_range (pos, &minv, &maxv) == VR_RANGE
> +         && wi::geu_p (minv, 0)
> +         && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode))))
> +       {
> +         location_t loc = gimple_location (stmt);
> +         tree var_src = create_tmp_reg (TREE_TYPE (view_op0));
> +         tree var_dst
> +           = make_temp_ssa_name (TREE_TYPE (view_op0), NULL, "vec_set_dst");
> +         TREE_ADDRESSABLE (view_op0) = 0;
> +
> +         ass_stmt = gimple_build_assign (var_src, view_op0);

you should set the gimple_vuse to the vuse of 'stmt'

> +         gimple_set_location (ass_stmt, loc);
> +         gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
> +
> +         new_stmt
> +           = gimple_build_call_internal (IFN_VEC_SET, 3, var_src, val, pos);
> +
> +         gimple_call_set_lhs (new_stmt, var_dst);
> +
> +         gimple_set_location (new_stmt, loc);
> +         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +
> +         ass_stmt = gimple_build_assign (view_op0, var_dst);
> +
> +         gimple_set_location (ass_stmt, loc);
> +         gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
> +
> +         gimple_move_vops (ass_stmt, stmt);
> +       }
> +    }
> +
> +  return ass_stmt;
>
>
> Thanks,
> Xionghu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
  2020-09-16  8:31                                                     ` Richard Biener
@ 2020-09-16 11:11                                                       ` Segher Boessenkool
  0 siblings, 0 replies; 43+ messages in thread
From: Segher Boessenkool @ 2020-09-16 11:11 UTC (permalink / raw)
  To: Richard Biener; +Cc: luoxhu, GCC Patches, David Edelsohn, Bill Schmidt, linkw

On Wed, Sep 16, 2020 at 10:31:54AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > > On Tue, Sep 15, 2020 at 5:56 AM luoxhu <luoxhu@linux.ibm.com> wrote:
> > > > >     u[n % 4] = i;
> > > > >
> > > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
> >
> > (As an aside -- please use "& 3" instead: that works fine if n is signed
> > as well, but modulo doesn't.  Maybe that is in the patch already, I
> > didn't check, sorry.)
> >
> > > note this is why I asked about the actual CPU instruction - as I read
> > > Segher's mail
> > > the instruction modifies a vector register, not memory.
> >
> > But note that the builtin is not the same as the machine instruction --
> > here there shouldn't be a difference if compiling for a new enough ISA,
> > but the builtin is available on anything with at least AltiVec.
> 
> Well, given we're trying to improve instruction selection, that very much
> should depend on the ISA.  Thus if the target says it cannot vec_set
> to a register in a variable position then we don't want to pretend it can.

Of course :-)  We shouldn't look at what the machine insn can do, nor
at what the source-level constructs can; we should just ask the target
code itself.


Segher

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2020-09-16 11:12 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-31  9:06 [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251] Xiong Hu Luo
2020-08-31 12:43 ` Richard Biener
2020-08-31 16:47 ` will schmidt
2020-09-01 11:43   ` luoxhu
2020-08-31 17:04 ` Segher Boessenkool
2020-09-01  8:09   ` luoxhu
2020-09-01 13:07     ` Richard Biener
2020-09-02  9:26       ` luoxhu
2020-09-02  9:30         ` Richard Biener
2020-09-03  9:20           ` luoxhu
2020-09-03 10:29             ` Richard Biener
2020-09-04  6:16               ` luoxhu
2020-09-04  6:38                 ` luoxhu
2020-09-04  7:19                   ` Richard Biener
2020-09-04  7:23                     ` Richard Biener
2020-09-04  9:18                       ` luoxhu
2020-09-04 10:23                         ` Segher Boessenkool
2020-09-07  5:43                           ` [PATCH v2] " luoxhu
2020-09-07 11:57                             ` Richard Biener
2020-09-08  8:11                               ` luoxhu
2020-09-08  8:26                                 ` Richard Biener
2020-09-09  1:47                                   ` luoxhu
2020-09-09  7:30                                     ` Richard Biener
2020-09-09 13:47                                   ` Segher Boessenkool
2020-09-09 14:28                                     ` Richard Biener
2020-09-09 16:00                                       ` Segher Boessenkool
2020-09-10 10:08                                         ` Richard Biener
2020-09-14  8:05                                           ` luoxhu
2020-09-14  9:47                                             ` Richard Biener
2020-09-14 10:47                                               ` Richard Sandiford
2020-09-14 11:22                                                 ` Richard Biener
2020-09-14 11:49                                                   ` Richard Sandiford
2020-09-14 21:06                                                 ` Segher Boessenkool
2020-09-14 20:59                                               ` Segher Boessenkool
2020-09-15  3:56                                               ` luoxhu
2020-09-15  6:51                                                 ` Richard Biener
2020-09-15 16:16                                                   ` Segher Boessenkool
2020-09-16  8:31                                                     ` Richard Biener
2020-09-16 11:11                                                       ` Segher Boessenkool
2020-09-16  6:15                                                   ` luoxhu
2020-09-16  8:41                                                     ` Richard Biener
2020-09-14 20:21                                           ` Segher Boessenkool
2020-09-01 14:02     ` [PATCH] " Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).