public inbox for gcc-patches@gcc.gnu.org
* [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines
@ 2016-07-27 14:33 Michael Meissner
  2016-07-27 19:35 ` Segher Boessenkool
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Michael Meissner @ 2016-07-27 14:33 UTC (permalink / raw)
  To: gcc-patches, Segher Boessenkool, David Edelsohn, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 3467 bytes --]

These patches enhance the vec_extract built-in on modern PowerPC server
systems.  Currently, vec_extract is optimized for constant element numbers for
vector double/vector long on any VSX system, and constant element numbers for
vector char/vector short/vector int on ISA 3.0 (power9) systems.

If the vec_extract is not handled, the compiler currently stores the vector
into memory, and then indexes the element via normal memory addressing.  This
creates a load-hit-store penalty.
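The fallback can be pictured in plain C (a hypothetical model of the generated
sequence, not actual compiler code):

```c
#include <string.h>

/* Model of the current fallback for an unhandled vec_extract: spill
   the whole vector to a stack slot, then load one element back.  The
   dependent load issued right after the store is what stalls.  */
typedef struct { double e[2]; } v2df_model;   /* stands in for vector double */

static double
extract_via_memory (v2df_model vec, int elt)
{
  double buf[2];
  memcpy (buf, &vec, sizeof buf);   /* store the vector (e.g. stxvd2x) */
  return buf[elt];                  /* indexed scalar load (e.g. lfdx) */
}
```

The names here (`v2df_model`, `extract_via_memory`) are illustrative only; the
point is the store immediately followed by a dependent load.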

This patch and the subsequent patches will enable better code generation for
vec_extract on 64-bit systems with direct move (power8) and above.

This particular patch changes the infrastructure so that in the next patch, I
can add support for extracting a variable element of vector double or vector
long.  This particular patch is just infrastructure, and does not change the
code generation.

In addition, I discovered that a previous change for ISA 3.0 extraction
misspelled an instruction name, and that is fixed here.  It turns out that I
also messed up the constraints, so the register allocator would never generate
this instruction.  This patch only corrects the name; the constraints are not
fixed until the next patch, which allows the instruction to be generated.

I have tested this patch and there are no regressions.  Can I apply this to the
trunk?  This set of patches depends on the DImode-in-Altivec-registers patches,
which have not been backported to GCC 6.2, so it is for trunk only.

The next patch will enhance vec_extract to allow the element number to be
variable for vec_extract of vector double/vector long on 64-bit systems with
direct move, using the VSLO instruction.
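Conceptually, the variable extract converts the element number into a shift
count, lets VSLO slide the selected element into the leftmost doubleword, and
then moves that doubleword to a GPR with a direct move.  A plain-C model of
that data movement for a two-element vector, assuming big-endian element
numbering (a sketch, not the compiler's code):

```c
#include <string.h>

/* Model of the VSLO-based variable extract on a 16-byte vector of two
   64-bit elements, big-endian element order.  The element number
   becomes a byte offset (elt * 8); VSLO shifts the whole vector left
   by that many bytes, and the wanted element lands in the leftmost
   doubleword, which a direct move (mfvsrd) can read into a GPR.  */
static unsigned long long
vslo_extract_model (const unsigned char vec[16], int elt)
{
  unsigned char shifted[16] = { 0 };
  int bytes = elt * 8;                          /* element -> byte shift */

  memcpy (shifted, vec + bytes, 16 - bytes);    /* the "vslo" step */

  unsigned long long hi = 0;                    /* the "mfvsrd" step */
  for (int i = 0; i < 8; i++)
    hi = (hi << 8) | shifted[i];
  return hi;
}
```

In the real sequence the byte offset is further shifted into a bit count
(the "adding 3" the patch comments mention) because of where VSLO reads its
shift amount; the model above only shows the byte-level effect.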

The third patch will enhance vec_extract to better optimize extracting elements
if the vector is in memory.  Right now, the code only optimizes extracting
element 0.  The new patch will allow both constant and variable element
numbers.

The fourth patch will enhance vec_extract for the other vector types.  I might
split it up into two patches, one for vector float, and the other for vector
char, vector short, and vector long.

I built SPEC 2006 with all of the patches, and there were some benchmarks that
generated a few changes, and a few benchmarks that generated a lot (gamess had
over 500 places that were optimized).  I ran a comparison between the old
compiler and one with the patches installed on several of the benchmarks that
showed the most changes.  I did not see any performance changes on the
benchmarks that I ran.  I believe this is because vec_extract is typically
generated at the end of vector reductions, and it does not account for much
time in the whole benchmark.  User written code that uses vec_extract would
hopefully see speed improvements.

2016-07-27  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/vector.md (vec_extract<mode>): Change the calling
	signature of rs6000_expand_vector_extract so that the element
	number is an RTX instead of a constant integer.
	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_extract):
	Likewise.
	* config/rs6000/rs6000.c (rs6000_expand_vector_extract): Likewise.
	(altivec_expand_vec_ext_builtin): Likewise.
	* config/rs6000/altivec.md (reduc_plus_scal_<mode>): Likewise.
	* config/rs6000/vsx.md (vsx_extract_<mode>): Fix spelling of the
	MFVSRLD instruction.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

[-- Attachment #2: gcc-stage7.extract004b --]
[-- Type: text/plain, Size: 5934 bytes --]

Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 238772)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -858,8 +858,7 @@ (define_expand "vec_extract<mode>"
    (match_operand 2 "const_int_operand" "")]
   "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
 {
-  rs6000_expand_vector_extract (operands[0], operands[1],
-				INTVAL (operands[2]));
+  rs6000_expand_vector_extract (operands[0], operands[1], operands[2]);
   DONE;
 })
 \f
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 238772)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -61,7 +61,7 @@ extern void convert_int_to_float128 (rtx
 extern void rs6000_expand_vector_init (rtx, rtx);
 extern void paired_expand_vector_init (rtx, rtx);
 extern void rs6000_expand_vector_set (rtx, rtx, int);
-extern void rs6000_expand_vector_extract (rtx, rtx, int);
+extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
 extern bool altivec_expand_vec_perm_const (rtx op[4]);
 extern void altivec_expand_vec_perm_le (rtx op[4]);
 extern bool rs6000_expand_vec_perm_const (rtx op[4]);
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 238772)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6911,35 +6911,35 @@ rs6000_expand_vector_set (rtx target, rt
 /* Extract field ELT from VEC into TARGET.  */
 
 void
-rs6000_expand_vector_extract (rtx target, rtx vec, int elt)
+rs6000_expand_vector_extract (rtx target, rtx vec, rtx elt)
 {
   machine_mode mode = GET_MODE (vec);
   machine_mode inner_mode = GET_MODE_INNER (mode);
   rtx mem;
 
-  if (VECTOR_MEM_VSX_P (mode))
+  if (VECTOR_MEM_VSX_P (mode) && CONST_INT_P (elt))
     {
       switch (mode)
 	{
 	default:
 	  break;
 	case V1TImode:
-	  gcc_assert (elt == 0 && inner_mode == TImode);
+	  gcc_assert (INTVAL (elt) == 0 && inner_mode == TImode);
 	  emit_move_insn (target, gen_lowpart (TImode, vec));
 	  break;
 	case V2DFmode:
-	  emit_insn (gen_vsx_extract_v2df (target, vec, GEN_INT (elt)));
+	  emit_insn (gen_vsx_extract_v2df (target, vec, elt));
 	  return;
 	case V2DImode:
-	  emit_insn (gen_vsx_extract_v2di (target, vec, GEN_INT (elt)));
+	  emit_insn (gen_vsx_extract_v2di (target, vec, elt));
 	  return;
 	case V4SFmode:
-	  emit_insn (gen_vsx_extract_v4sf (target, vec, GEN_INT (elt)));
+	  emit_insn (gen_vsx_extract_v4sf (target, vec, elt));
 	  return;
 	case V16QImode:
 	  if (TARGET_VEXTRACTUB)
 	    {
-	      emit_insn (gen_vsx_extract_v16qi (target, vec, GEN_INT (elt)));
+	      emit_insn (gen_vsx_extract_v16qi (target, vec, elt));
 	      return;
 	    }
 	  else
@@ -6947,7 +6947,7 @@ rs6000_expand_vector_extract (rtx target
 	case V8HImode:
 	  if (TARGET_VEXTRACTUB)
 	    {
-	      emit_insn (gen_vsx_extract_v8hi (target, vec, GEN_INT (elt)));
+	      emit_insn (gen_vsx_extract_v8hi (target, vec, elt));
 	      return;
 	    }
 	  else
@@ -6955,7 +6955,7 @@ rs6000_expand_vector_extract (rtx target
 	case V4SImode:
 	  if (TARGET_VEXTRACTUB)
 	    {
-	      emit_insn (gen_vsx_extract_v4si (target, vec, GEN_INT (elt)));
+	      emit_insn (gen_vsx_extract_v4si (target, vec, elt));
 	      return;
 	    }
 	  else
@@ -6963,13 +6963,16 @@ rs6000_expand_vector_extract (rtx target
 	}
     }
 
+  gcc_assert (CONST_INT_P (elt));
+
   /* Allocate mode-sized buffer.  */
   mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));
 
   emit_move_insn (mem, vec);
 
   /* Add offset to field within buffer matching vector element.  */
-  mem = adjust_address_nv (mem, inner_mode, elt * GET_MODE_SIZE (inner_mode));
+  mem = adjust_address_nv (mem, inner_mode,
+			   INTVAL (elt) * GET_MODE_SIZE (inner_mode));
 
   emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
 }
@@ -14658,14 +14661,18 @@ altivec_expand_vec_ext_builtin (tree exp
 {
   machine_mode tmode, mode0;
   tree arg0, arg1;
-  int elt;
   rtx op0;
+  rtx op1;
 
   arg0 = CALL_EXPR_ARG (exp, 0);
   arg1 = CALL_EXPR_ARG (exp, 1);
 
   op0 = expand_normal (arg0);
-  elt = get_element_number (TREE_TYPE (arg0), arg1);
+  op1 = expand_normal (arg1);
+
+  /* Call get_element_number to validate arg1 if it is a constant.  */
+  if (TREE_CODE (arg1) == INTEGER_CST)
+    (void) get_element_number (TREE_TYPE (arg0), arg1);
 
   tmode = TYPE_MODE (TREE_TYPE (TREE_TYPE (arg0)));
   mode0 = TYPE_MODE (TREE_TYPE (arg0));
@@ -14676,7 +14683,7 @@ altivec_expand_vec_ext_builtin (tree exp
   if (optimize || !target || !register_operand (target, tmode))
     target = gen_reg_rtx (tmode);
 
-  rs6000_expand_vector_extract (target, op0, elt);
+  rs6000_expand_vector_extract (target, op0, op1);
 
   return target;
 }
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 238772)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -2159,7 +2159,7 @@ (define_insn "vsx_extract_<mode>"
 
   else if (element == VECTOR_ELEMENT_MFVSRLD_64BIT && INT_REGNO_P (op0_regno)
 	   && TARGET_P9_VECTOR && TARGET_POWERPC64 && TARGET_DIRECT_MOVE)
-    return "mfvsrdl %0,%x1";
+    return "mfvsrld %0,%x1";
 
   else if (VSX_REGNO_P (op0_regno))
     {
Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 238772)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -2781,7 +2781,7 @@ (define_expand "reduc_plus_scal_<mode>"
   emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
   emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
   emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero));
-  rs6000_expand_vector_extract (operands[0], vtmp2, elt);
+  rs6000_expand_vector_extract (operands[0], vtmp2, GEN_INT (elt));
   DONE;
 })
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 14:33 [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines Michael Meissner
@ 2016-07-27 19:35 ` Segher Boessenkool
  2016-07-27 20:06   ` Michael Meissner
  2016-07-27 21:16 ` [PATCH, 2 of 4], " Michael Meissner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Segher Boessenkool @ 2016-07-27 19:35 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Wed, Jul 27, 2016 at 10:32:21AM -0400, Michael Meissner wrote:
> 2016-07-27  Michael Meissner  <meissner@linux.vnet.ibm.com>
> 
> 	* config/rs6000/vector.md (vec_extract<mode>): Change the calling
> 	signature of rs6000_expand_vector_extract so that the element
> 	number is an RTX instead of a constant integer.
> 	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_extract):
> 	Likewise.
> 	* config/rs6000/rs6000.c (rs6000_expand_vector_extract): Likewise.
> 	(altivec_expand_vec_ext_builtin): Likewise.
> 	* config/rs6000/altivec.md (reduc_plus_scal_<mode>): Likewise.
> 	* config/rs6000/vsx.md (vsx_extract_<mode>): Fix spelling of the
> 	MFVSRLD instruction.

> @@ -14658,14 +14661,18 @@ altivec_expand_vec_ext_builtin (tree exp
>  {
>    machine_mode tmode, mode0;
>    tree arg0, arg1;
> -  int elt;
>    rtx op0;
> +  rtx op1;

You could put op0, op1 on one line, or better yet, declare them where
they are first initialised.

> --- gcc/config/rs6000/vsx.md	(revision 238772)
> +++ gcc/config/rs6000/vsx.md	(working copy)
> @@ -2159,7 +2159,7 @@ (define_insn "vsx_extract_<mode>"
>  
>    else if (element == VECTOR_ELEMENT_MFVSRLD_64BIT && INT_REGNO_P (op0_regno)
>  	   && TARGET_P9_VECTOR && TARGET_POWERPC64 && TARGET_DIRECT_MOVE)
> -    return "mfvsrdl %0,%x1";
> +    return "mfvsrld %0,%x1";

Later patches have some testcases?

This is okay for trunk, with or without the cosmetic change.  Thanks,


Segher


* Re: [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 19:35 ` Segher Boessenkool
@ 2016-07-27 20:06   ` Michael Meissner
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Meissner @ 2016-07-27 20:06 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Wed, Jul 27, 2016 at 02:35:18PM -0500, Segher Boessenkool wrote:
> Later patches have some testcases?

Yes, as I said, since the first patch just changed the internal infrastructure
to allow for variable vec_extracts (in patch #2), it didn't change any
generated code.

There will be new test cases for the subsequent patches.

> This is okay for trunk, with or without the cosmetic change.  Thanks,

Thanks.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


* Re: [PATCH, 2 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 14:33 [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines Michael Meissner
  2016-07-27 19:35 ` Segher Boessenkool
@ 2016-07-27 21:16 ` Michael Meissner
  2016-07-28  9:58   ` Segher Boessenkool
  2016-07-30 15:29 ` [PATCH, 3 " Michael Meissner
  2016-08-01 22:38 ` [PATCH, 4 " Michael Meissner
  3 siblings, 1 reply; 13+ messages in thread
From: Michael Meissner @ 2016-07-27 21:16 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 2315 bytes --]

Next patch in the vec_extract series.

This patch adds support for vec_extract with a variable argument element number
for vector double or vector long on 64-bit ISA 2.07 (power8) or ISA 3.0
(power9) systems.  It needs 64-bit ISA 2.07 for the direct move support.
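On little-endian targets the element number has to be remapped before the
shift is formed; the split code uses an XOR with 1 for the two-element
vectors, and an AND plus a subtract for wider ones.  A plain-C model of that
index adjustment (the function name is mine, not from the patch):

```c
/* Little-endian element renumbering done before the VSLO shift is
   computed: for 2-element vectors, swapping elements 0 and 1 is an
   XOR with 1; for wider vectors it is (nunits - 1) - elt, done as an
   AND (to mask the element number) followed by a subtract.  */
static int
le_adjust_element (int elt, int nunits)
{
  if (nunits == 2)
    return elt ^ 1;
  return (nunits - 1) - (elt & (nunits - 1));
}
```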

I have tested this on a little-endian 64-bit power8 system and a big-endian
power7 system with both 32-bit and 64-bit targets.  Note, the power7 system
uses the existing method of saving the vector and addressing the vector
elements in memory, because it doesn't have the direct move instructions.
There were no regressions in the tests.  Can I install these patches on the
trunk?

[gcc, patch #2]
2016-07-27  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-protos.h (rs6000_split_vec_extract_var):
	New declaration.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Add support for vec_extract of vector double or vector long having
	a variable element number on 64-bit ISA 2.07 systems or newer.
	* config/rs6000/rs6000.c (rs6000_expand_vector_extract):
	Likewise.
	(rs6000_split_vec_extract_var): New function to split a
	vec_extract built-in function with variable element number.
	(rtx_is_swappable_p): Variable vec_extracts and shifts are not
	swappable.
	* config/rs6000/vsx.md (UNSPEC_VSX_VSLO): New unspecs.
	(UNSPEC_VSX_EXTRACT): Likewise.
	(vsx_extract_<mode>, VSX_D iterator): Fix constraints to allow
	direct move instructions to be generated on 64-bit ISA 2.07
	systems and newer, and to take advantage of the ISA 3.0 MFVSRLD
	instruction.
	(vsx_vslo_<mode>): New insn to do VSLO on V2DFmode and V2DImode
	arguments for vec_extract variable element.
	(vsx_extract_<mode>_var, VSX_D iterator): New insn to support
	vec_extract with variable element on V2DFmode and V2DImode
	vectors.
	* config/rs6000/rs6000.h (TARGET_VEXTRACTUB): Remove
	-mupper-regs-df requirement, since it isn't needed.
	(VEC_EXTRACT_OPTIMIZE_P): New macro to say whether we can optmize
	vec_extract on 64-bit ISA 2.07 systems and newer.

[gcc/testsuite, patch #2]
2016-07-27  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-extract-1.c: New test.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

[-- Attachment #2: gcc-stage7.extract004c --]
[-- Type: text/plain, Size: 11769 bytes --]

Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 238775)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -62,6 +62,7 @@ extern void rs6000_expand_vector_init (r
 extern void paired_expand_vector_init (rtx, rtx);
 extern void rs6000_expand_vector_set (rtx, rtx, int);
 extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
+extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
 extern bool altivec_expand_vec_perm_const (rtx op[4]);
 extern void altivec_expand_vec_perm_le (rtx op[4]);
 extern bool rs6000_expand_vec_perm_const (rtx op[4]);
Index: gcc/config/rs6000/rs6000-c.c
===================================================================
--- gcc/config/rs6000/rs6000-c.c	(revision 238772)
+++ gcc/config/rs6000/rs6000-c.c	(working copy)
@@ -5105,29 +5105,61 @@ altivec_resolve_overloaded_builtin (loca
 				  arg2);
 	}
 
-      /* If we can use the VSX xxpermdi instruction, use that for extract.  */
+      /* See if we can optimize vec_extracts with the current VSX instruction
+	 set.  */
       mode = TYPE_MODE (arg1_type);
-      if ((mode == V2DFmode || mode == V2DImode) && VECTOR_MEM_VSX_P (mode)
-	  && TREE_CODE (arg2) == INTEGER_CST
-	  && wi::ltu_p (arg2, 2))
+      if (VECTOR_MEM_VSX_P (mode))
+
 	{
 	  tree call = NULL_TREE;
+	  int nunits = GET_MODE_NUNITS (mode);
 
-	  if (mode == V2DFmode)
-	    call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
-	  else if (mode == V2DImode)
-	    call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+	  /* If the second argument is an integer constant, if the value is in
+	     the expected range, generate the built-in code if we can.  We need
+	     64-bit and direct move to extract the small integer vectors.  */
+	  if (TREE_CODE (arg2) == INTEGER_CST && wi::ltu_p (arg2, nunits))
+	    {
+	      switch (mode)
+		{
+		default:
+		  break;
+
+		case V1TImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V1TI];
+		  break;
+
+		case V2DFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
+		  break;
+
+		case V2DImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+		  break;
+		}
+	    }
+
+	  /* If the second argument is variable, we can optimize it if we are
+	     generating 64-bit code on a machine with direct move.  */
+	  else if (TREE_CODE (arg2) != INTEGER_CST && VEC_EXTRACT_OPTIMIZE_P)
+	    {
+	      switch (mode)
+		{
+		default:
+		  break;
+
+		case V2DFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
+		  break;
+
+		case V2DImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+		  break;
+		}
+	    }
 
 	  if (call)
 	    return build_call_expr (call, 2, arg1, arg2);
 	}
-      else if (mode == V1TImode && VECTOR_MEM_VSX_P (mode)
-	       && TREE_CODE (arg2) == INTEGER_CST
-	       && wi::eq_p (arg2, 0))
-	{
-	  tree call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V1TI];
-	  return build_call_expr (call, 2, arg1, arg2);
-	}
 
       /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2). */
       arg1_inner_type = TREE_TYPE (arg1_type);
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 238775)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6958,8 +6958,31 @@ rs6000_expand_vector_extract (rtx target
 	      emit_insn (gen_vsx_extract_v4si (target, vec, elt));
 	      return;
 	    }
-	  else
-	    break;
+	  break;
+	}
+    }
+  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (elt)
+	   && VEC_EXTRACT_OPTIMIZE_P)
+    {
+      if (GET_MODE (elt) != DImode)
+	{
+	  rtx tmp = gen_reg_rtx (DImode);
+	  convert_move (tmp, elt, 0);
+	  elt = tmp;
+	}
+
+      switch (mode)
+	{
+	case V2DFmode:
+	  emit_insn (gen_vsx_extract_v2df_var (target, vec, elt));
+	  return;
+
+	case V2DImode:
+	  emit_insn (gen_vsx_extract_v2di_var (target, vec, elt));
+	  return;
+
+	default:
+	  gcc_unreachable ();
 	}
     }
 
@@ -6977,6 +7000,99 @@ rs6000_expand_vector_extract (rtx target
   emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
 }
 
+/* Split a variable vec_extract operation into the component instructions.  */
+
+void
+rs6000_split_vec_extract_var (rtx dest, rtx src, rtx element, rtx tmp_gpr,
+			      rtx tmp_altivec)
+{
+  machine_mode mode = GET_MODE (src);
+  machine_mode scalar_mode = GET_MODE (dest);
+  unsigned scalar_size = GET_MODE_SIZE (scalar_mode);
+  int byte_shift = exact_log2 (scalar_size);
+
+  gcc_assert (byte_shift >= 0);
+
+  if (REG_P (src) || SUBREG_P (src))
+    {
+      int bit_shift = byte_shift + 3;
+      rtx element2;
+
+      gcc_assert (REG_P (tmp_gpr) && REG_P (tmp_altivec));
+
+      /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
+	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
+	 will shift the element into the upper position (adding 3 to convert a
+	 byte shift into a bit shift). */
+      if (scalar_size == 8)
+	{
+	  if (!VECTOR_ELT_ORDER_BIG)
+	    {
+	      emit_insn (gen_xordi3 (tmp_gpr, element, const1_rtx));
+	      element2 = tmp_gpr;
+	    }
+	  else
+	    element2 = element;
+
+	  /* Generate RLDIC directly to shift left 6 bits and retrieve 1
+	     bit.  */
+	  emit_insn (gen_rtx_SET (tmp_gpr,
+				  gen_rtx_AND (DImode,
+					       gen_rtx_ASHIFT (DImode,
+							       element2,
+							       GEN_INT (6)),
+					       GEN_INT (64))));
+	}
+      else
+	{
+	  if (!VECTOR_ELT_ORDER_BIG)
+	    {
+	      rtx num_ele_m1 = GEN_INT (GET_MODE_NUNITS (mode) - 1);
+
+	      emit_insn (gen_anddi3 (tmp_gpr, element, num_ele_m1));
+	      emit_insn (gen_subdi3 (tmp_gpr, num_ele_m1, tmp_gpr));
+	      element2 = tmp_gpr;
+	    }
+	  else
+	    element2 = element;
+
+	  emit_insn (gen_ashldi3 (tmp_gpr, element2, GEN_INT (bit_shift)));
+	}
+
+      /* Get the value into the lower byte of the Altivec register where VSLO
+	 expects it.  */
+      if (TARGET_P9_VECTOR)
+	emit_insn (gen_vsx_splat_v2di (tmp_altivec, tmp_gpr));
+      else if (can_create_pseudo_p ())
+	emit_insn (gen_vsx_concat_v2di (tmp_altivec, tmp_gpr, tmp_gpr));
+      else
+	{
+	  rtx tmp_di = gen_rtx_REG (DImode, REGNO (tmp_altivec));
+	  emit_move_insn (tmp_di, tmp_gpr);
+	  emit_insn (gen_vsx_concat_v2di (tmp_altivec, tmp_di, tmp_di));
+	}
+
+      /* Do the VSLO to get the value into the final location.  */
+      switch (mode)
+	{
+	case V2DFmode:
+	  emit_insn (gen_vsx_vslo_v2df (dest, src, tmp_altivec));
+	  return;
+
+	case V2DImode:
+	  emit_insn (gen_vsx_vslo_v2di (dest, src, tmp_altivec));
+	  return;
+
+	default:
+	  gcc_unreachable ();
+	}
+
+      return;
+    }
+  else
+    gcc_unreachable ();
+ }
+
 /* Return TRUE if OP is an invalid SUBREG operation on the e500.  */
 
 bool
@@ -38601,6 +38717,7 @@ rtx_is_swappable_p (rtx op, unsigned int
 	  case UNSPEC_VSX_CVDPSPN:
 	  case UNSPEC_VSX_CVSPDP:
 	  case UNSPEC_VSX_CVSPDPN:
+	  case UNSPEC_VSX_EXTRACT:
 	    return 0;
 	  case UNSPEC_VSPLT_DIRECT:
 	    *special = SH_SPLAT;
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 238775)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -309,6 +309,8 @@ (define_c_enum "unspec"
    UNSPEC_VSX_XVCVDPUXDS
    UNSPEC_VSX_SIGN_EXTEND
    UNSPEC_P9_MEMORY
+   UNSPEC_VSX_VSLO
+   UNSPEC_VSX_EXTRACT
   ])
 
 ;; VSX moves
@@ -2118,16 +2120,13 @@ (define_insn "vsx_set_<mode>"
 ;; register was picked.  Limit the scalar value to FPRs for now.
 
 (define_insn "vsx_extract_<mode>"
-  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand"
-            "=d,     wm,      wo,    d")
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=d,    d,     wr, wr")
 
 	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "gpc_reg_operand"
-            "<VSa>, <VSa>,  <VSa>,  <VSa>")
+	 (match_operand:VSX_D 1 "gpc_reg_operand"      "<VSa>, <VSa>, wm, wo")
 
 	 (parallel
-	  [(match_operand:QI 2 "const_0_to_1_operand"
-            "wD,    wD,     wL,     n")])))]
+	  [(match_operand:QI 2 "const_0_to_1_operand"  "wD,    n,     wD, n")])))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
 {
   int element = INTVAL (operands[2]);
@@ -2205,6 +2204,34 @@ (define_insn "*vsx_extract_<mode>_store"
   [(set_attr "type" "fpstore")
    (set_attr "length" "4")])
 
+;; Variable V2DI/V2DF extract shift
+(define_insn "vsx_vslo_<mode>"
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
+	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "gpc_reg_operand" "v")
+			     (match_operand:V2DI 2 "gpc_reg_operand" "v")]
+			    UNSPEC_VSX_VSLO))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && VEC_EXTRACT_OPTIMIZE_P"
+  "vslo %0,%1,%2"
+  [(set_attr "type" "vecperm")])
+
+;; Variable V2DI/V2DF extract
+(define_insn_and_split "vsx_extract_<mode>_var"
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
+	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v")
+			     (match_operand:DI 2 "gpc_reg_operand" "r")]
+			    UNSPEC_VSX_EXTRACT))
+   (clobber (match_scratch:DI 3 "=r"))
+   (clobber (match_scratch:V2DI 4 "=&v"))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && VEC_EXTRACT_OPTIMIZE_P"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
+				operands[3], operands[4]);
+  DONE;
+})
+
 ;; Extract a SF element from V4SF
 (define_insn_and_split "vsx_extract_v4sf"
   [(set (match_operand:SF 0 "vsx_register_operand" "=f,f")
Index: gcc/config/rs6000/rs6000.h
===================================================================
--- gcc/config/rs6000/rs6000.h	(revision 238772)
+++ gcc/config/rs6000/rs6000.h	(working copy)
@@ -602,7 +602,6 @@ extern int rs6000_vector_align[];
 #define TARGET_DIRECT_MOVE_128	(TARGET_P9_VECTOR && TARGET_DIRECT_MOVE \
 				 && TARGET_POWERPC64)
 #define TARGET_VEXTRACTUB	(TARGET_P9_VECTOR && TARGET_DIRECT_MOVE \
-				 && TARGET_UPPER_REGS_DF \
 				 && TARGET_UPPER_REGS_DI && TARGET_POWERPC64)
 
 /* Byte/char syncs were added as phased in for ISA 2.06B, but are not present
@@ -761,6 +760,11 @@ extern int rs6000_vector_align[];
 				 && TARGET_SINGLE_FLOAT			\
 				 && TARGET_DOUBLE_FLOAT)
 
+/* Macro to say whether we can optimize vector extracts.  */
+#define VEC_EXTRACT_OPTIMIZE_P	(TARGET_DIRECT_MOVE			\
+				 && TARGET_POWERPC64			\
+				 && TARGET_UPPER_REGS_DI)
+
 /* Whether the various reciprocal divide/square root estimate instructions
    exist, and whether we should automatically generate code for the instruction
    by default.  */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O2 -mupper-regs-df -mupper-regs-di" } */
+
+#include <altivec.h>
+
+double
+add_double (vector double a, int n)
+{
+  return vec_extract (a, n) + 1.0;
+}
+
+long
+add_long (vector long a, int n)
+{
+  return vec_extract (a, n) + 1;
+}
+
+/* { dg-final { scan-assembler     "vslo"    } } */
+/* { dg-final { scan-assembler     "mtvsrd"  } } */
+/* { dg-final { scan-assembler     "mfvsrd"  } } */
+/* { dg-final { scan-assembler-not "stxvd2x" } } */
+/* { dg-final { scan-assembler-not "stxvx"   } } */
+/* { dg-final { scan-assembler-not "stxv"    } } */
+/* { dg-final { scan-assembler-not "ldx"     } } */


* Re: [PATCH, 2 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 21:16 ` [PATCH, 2 of 4], " Michael Meissner
@ 2016-07-28  9:58   ` Segher Boessenkool
  2016-07-28 19:44     ` Michael Meissner
  0 siblings, 1 reply; 13+ messages in thread
From: Segher Boessenkool @ 2016-07-28  9:58 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Wed, Jul 27, 2016 at 05:16:28PM -0400, Michael Meissner wrote:
> 	* config/rs6000/vsx.md (UNSPEC_VSX_VSLO): New unspecs.
> 	(UNSPEC_VSX_EXTRACT): Likewise.

"New unspec".

> 	(VEC_EXTRACT_OPTIMIZE_P): New macro to say whether we can optmize
> 	vec_extract on 64-bit ISA 2.07 systems and newer.

"optimize".

> --- gcc/config/rs6000/rs6000-protos.h	(revision 238775)
> +++ gcc/config/rs6000/rs6000-protos.h	(working copy)
> @@ -62,6 +62,7 @@ extern void rs6000_expand_vector_init (r
>  extern void paired_expand_vector_init (rtx, rtx);
>  extern void rs6000_expand_vector_set (rtx, rtx, int);
>  extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
> +extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
>  extern bool altivec_expand_vec_perm_const (rtx op[4]);
>  extern void altivec_expand_vec_perm_le (rtx op[4]);
>  extern bool rs6000_expand_vec_perm_const (rtx op[4]);

This isn't in the changelog.

> +      /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
> +	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
> +	 will shift the element into the upper position (adding 3 to convert a
> +	 byte shift into a bit shift). */

Two spaces after dot.

> +;; Variable V2DI/V2DF extract
> +(define_insn_and_split "vsx_extract_<mode>_var"
> +  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
> +	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v")
> +			     (match_operand:DI 2 "gpc_reg_operand" "r")]
> +			    UNSPEC_VSX_EXTRACT))
> +   (clobber (match_scratch:DI 3 "=r"))
> +   (clobber (match_scratch:V2DI 4 "=&v"))]
> +  "VECTOR_MEM_VSX_P (<MODE>mode) && VEC_EXTRACT_OPTIMIZE_P"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
> +				operands[3], operands[4]);
> +  DONE;
> +})

Why reload_completed?  Can it not run earlier?

> +/* Macro to say whether we can optimize vector extracts.  */
> +#define VEC_EXTRACT_OPTIMIZE_P	(TARGET_DIRECT_MOVE			\
> +				 && TARGET_POWERPC64			\
> +				 && TARGET_UPPER_REGS_DI)

I'm not a big fan of this name.  "Optimize" will quickly become dated,
everyone will take the current new hot thing for granted, and then when
you want to optimise even more (say, for ISA 6.0 or whatever) the name
is really out of place.

But I don't know a much better name either.

> --- gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
> +++ gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */

Maybe you can add a "run" test as well?

Looks good otherwise, okay for trunk with those nits fixed.

Thanks,


Segher


* Re: [PATCH, 2 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-28  9:58   ` Segher Boessenkool
@ 2016-07-28 19:44     ` Michael Meissner
  2016-07-29  6:50       ` Segher Boessenkool
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Meissner @ 2016-07-28 19:44 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 5770 bytes --]

On Thu, Jul 28, 2016 at 04:57:53AM -0500, Segher Boessenkool wrote:
> On Wed, Jul 27, 2016 at 05:16:28PM -0400, Michael Meissner wrote:
> > 	* config/rs6000/vsx.md (UNSPEC_VSX_VSLO): New unspecs.
> > 	(UNSPEC_VSX_EXTRACT): Likewise.
> 
> "New unspec".

Thanks.

> > 	(VEC_EXTRACT_OPTIMIZE_P): New macro to say whether we can optmize
> > 	vec_extract on 64-bit ISA 2.07 systems and newer.
> 
> "optimize".

Thanks.

> > --- gcc/config/rs6000/rs6000-protos.h	(revision 238775)
> > +++ gcc/config/rs6000/rs6000-protos.h	(working copy)
> > @@ -62,6 +62,7 @@ extern void rs6000_expand_vector_init (r
> >  extern void paired_expand_vector_init (rtx, rtx);
> >  extern void rs6000_expand_vector_set (rtx, rtx, int);
> >  extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
> > +extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
> >  extern bool altivec_expand_vec_perm_const (rtx op[4]);
> >  extern void altivec_expand_vec_perm_le (rtx op[4]);
> >  extern bool rs6000_expand_vec_perm_const (rtx op[4]);
> 
> This isn't in the changelog.

Yes it is.

2016-07-27  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-protos.h (rs6000_split_vec_extract_var):
	New declaration.

> > +      /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
> > +	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
> > +	 will shift the element into the upper position (adding 3 to convert a
> > +	 byte shift into a bit shift). */
> 
> Two spaces after dot.

Thanks.

> > +;; Variable V2DI/V2DF extract
> > +(define_insn_and_split "vsx_extract_<mode>_var"
> > +  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
> > +	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v")
> > +			     (match_operand:DI 2 "gpc_reg_operand" "r")]
> > +			    UNSPEC_VSX_EXTRACT))
> > +   (clobber (match_scratch:DI 3 "=r"))
> > +   (clobber (match_scratch:V2DI 4 "=&v"))]
> > +  "VECTOR_MEM_VSX_P (<MODE>mode) && VEC_EXTRACT_OPTIMIZE_P"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(const_int 0)]
> > +{
> > +  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
> > +				operands[3], operands[4]);
> > +  DONE;
> > +})
> 
> Why reload_completed?  Can it not run earlier?

This patch could perhaps run earlier, but the next patch, which adds optimization
of memory references, cannot.

In order to change memory addresses, I need to know exactly which register set
is involved (GPR, traditional floating point, or traditional Altivec registers)
and what address modes each supports.  Keeping the split until after reload also
allows for some flexibility: if the vector is not in a register, the split can
access the memory value directly instead of forcing it into a register.
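
To illustrate why the register class matters, here is a simplified, purely
illustrative model (not GCC's actual reg_addr tables; the function name is
made up): pre-ISA 3.0, GPR and FPR scalar loads have reg+offset forms, but
Altivec loads are indexed-only, which is what the post-reload split has to
account for.

```c
#include <assert.h>
#include <stdbool.h>

enum reg_class { GPR, FPR, VMX };

/* Illustrative sketch: can a 64-bit scalar load into the given register
   class use a reg+offset address?  GPRs have ld (DS-form), FPRs have lfd
   (D-form); Altivec registers only gain a D-form scalar load (lxsd) with
   ISA 3.0.  */
static bool
offset_addr_ok (enum reg_class rc, bool isa_3_0)
{
  switch (rc)
    {
    case GPR: return true;
    case FPR: return true;
    case VMX: return isa_3_0;
    }
  return false;
}
```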

> > +/* Macro to say whether we can optimize vector extracts.  */
> > +#define VEC_EXTRACT_OPTIMIZE_P	(TARGET_DIRECT_MOVE			\
> > +				 && TARGET_POWERPC64			\
> > +				 && TARGET_UPPER_REGS_DI)
> 
> I'm not a big fan of this name.  "Optimize" will quickly become dated,
> everyone will take the current new hot thing for granted, and then when
> you want to optimise even more (say, for ISA 6.0 or whatever) the name
> is really out of place.
> 
> But I don't know a much better name either.

I changed it to TARGET_DIRECT_MOVE_64BIT, which hopefully makes it clearer what
exactly we need.  In particular, the calculation of the bit shift is done in a
GPR, and a direct move creates the vector used by VSLO to do a byte shift.
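
As a rough model of that GPR calculation (illustrative plain C, combining the
little-endian index adjustment and the byte-to-bit conversion from the patch;
the function name is made up):

```c
#include <assert.h>

/* Sketch of the shift amount computed in a GPR before the direct move.
   VSLO shifts by bytes encoded as a bit count, so the element index is
   shifted left by byte_shift + 3.  For little endian, the element order
   is adjusted first: XOR with 1 for two-element vectors, a subtract
   otherwise.  */
static unsigned
vslo_shift_bits (unsigned elt, unsigned scalar_size, int little_endian)
{
  unsigned nunits = 16 / scalar_size;               /* 128-bit vector.  */
  unsigned byte_shift = __builtin_ctz (scalar_size); /* log2 of size.  */

  if (little_endian)
    elt = (scalar_size == 8) ? (elt ^ 1) : (nunits - 1) - elt;

  return elt << (byte_shift + 3);
}
```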

> > --- gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
> > +++ gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
> > @@ -0,0 +1,27 @@
> > +/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
> 
> Maybe you can add a "run" test as well?

I added the run tests on July 21st that explicitly check just about every
combination being optimized by these patches, to make sure the correct code is
generated.
 
> Looks good otherwise, okay for trunk with those nits fixed.

Here is the revised patch that I will check in after the tests are rerun:

[gcc, patch #2]
2016-07-28  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-protos.h (rs6000_split_vec_extract_var):
	New declaration.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Add support for vec_extract of vector double or vector long having
	a variable element number on 64-bit ISA 2.07 systems or newer.
	* config/rs6000/rs6000.c (rs6000_expand_vector_extract):
	Likewise.
	(rs6000_split_vec_extract_var): New function to split a
	vec_extract built-in function with variable element number.
	(rtx_is_swappable_p): Variable vec_extracts and shifts are not
	swappable.
	* config/rs6000/vsx.md (UNSPEC_VSX_VSLO): New unspec.
	(UNSPEC_VSX_EXTRACT): Likewise.
	(vsx_extract_<mode>, VSX_D iterator): Fix constraints to allow
	direct move instructions to be generated on 64-bit ISA 2.07
	systems and newer, and to take advantage of the ISA 3.0 MFVSRLD
	instruction.
	(vsx_vslo_<mode>): New insn to do VSLO on V2DFmode and V2DImode
	arguments for vec_extract variable element.
	(vsx_extract_<mode>_var, VSX_D iterator): New insn to support
	vec_extract with variable element on V2DFmode and V2DImode
	vectors.
	* config/rs6000/rs6000.h (TARGET_VEXTRACTUB): Remove
	-mupper-regs-df requirement, since it isn't needed.
	(TARGET_DIRECT_MOVE_64BIT): New macro to say whether we can
	do direct moves on 64-bit systems, which allows optimization of
	vec_extract on 64-bit ISA 2.07 systems and newer.

[gcc/testsuite, patch #2]
2016-07-28  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-extract-1.c: New test.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

[-- Attachment #2: gcc-stage7.extract005b --]
[-- Type: text/plain, Size: 11916 bytes --]

Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 238826)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -62,6 +62,7 @@ extern void rs6000_expand_vector_init (r
 extern void paired_expand_vector_init (rtx, rtx);
 extern void rs6000_expand_vector_set (rtx, rtx, int);
 extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
+extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
 extern bool altivec_expand_vec_perm_const (rtx op[4]);
 extern void altivec_expand_vec_perm_le (rtx op[4]);
 extern bool rs6000_expand_vec_perm_const (rtx op[4]);
Index: gcc/config/rs6000/rs6000-c.c
===================================================================
--- gcc/config/rs6000/rs6000-c.c	(revision 238826)
+++ gcc/config/rs6000/rs6000-c.c	(working copy)
@@ -5105,29 +5105,61 @@ altivec_resolve_overloaded_builtin (loca
 				  arg2);
 	}
 
-      /* If we can use the VSX xxpermdi instruction, use that for extract.  */
+      /* See if we can optimize vec_extracts with the current VSX instruction
+	 set.  */
       mode = TYPE_MODE (arg1_type);
-      if ((mode == V2DFmode || mode == V2DImode) && VECTOR_MEM_VSX_P (mode)
-	  && TREE_CODE (arg2) == INTEGER_CST
-	  && wi::ltu_p (arg2, 2))
+      if (VECTOR_MEM_VSX_P (mode))
+
 	{
 	  tree call = NULL_TREE;
+	  int nunits = GET_MODE_NUNITS (mode);
 
-	  if (mode == V2DFmode)
-	    call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
-	  else if (mode == V2DImode)
-	    call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+	  /* If the second argument is an integer constant, if the value is in
+	     the expected range, generate the built-in code if we can.  We need
+	     64-bit and direct move to extract the small integer vectors.  */
+	  if (TREE_CODE (arg2) == INTEGER_CST && wi::ltu_p (arg2, nunits))
+	    {
+	      switch (mode)
+		{
+		default:
+		  break;
+
+		case V1TImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V1TI];
+		  break;
+
+		case V2DFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
+		  break;
+
+		case V2DImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+		  break;
+		}
+	    }
+
+	  /* If the second argument is variable, we can optimize it if we are
+	     generating 64-bit code on a machine with direct move.  */
+	  else if (TREE_CODE (arg2) != INTEGER_CST && TARGET_DIRECT_MOVE_64BIT)
+	    {
+	      switch (mode)
+		{
+		default:
+		  break;
+
+		case V2DFmode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DF];
+		  break;
+
+		case V2DImode:
+		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
+		  break;
+		}
+	    }
 
 	  if (call)
 	    return build_call_expr (call, 2, arg1, arg2);
 	}
-      else if (mode == V1TImode && VECTOR_MEM_VSX_P (mode)
-	       && TREE_CODE (arg2) == INTEGER_CST
-	       && wi::eq_p (arg2, 0))
-	{
-	  tree call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V1TI];
-	  return build_call_expr (call, 2, arg1, arg2);
-	}
 
       /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2). */
       arg1_inner_type = TREE_TYPE (arg1_type);
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 238826)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6958,8 +6958,31 @@ rs6000_expand_vector_extract (rtx target
 	      emit_insn (gen_vsx_extract_v4si (target, vec, elt));
 	      return;
 	    }
-	  else
-	    break;
+	  break;
+	}
+    }
+  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (elt)
+	   && TARGET_DIRECT_MOVE_64BIT)
+    {
+      if (GET_MODE (elt) != DImode)
+	{
+	  rtx tmp = gen_reg_rtx (DImode);
+	  convert_move (tmp, elt, 0);
+	  elt = tmp;
+	}
+
+      switch (mode)
+	{
+	case V2DFmode:
+	  emit_insn (gen_vsx_extract_v2df_var (target, vec, elt));
+	  return;
+
+	case V2DImode:
+	  emit_insn (gen_vsx_extract_v2di_var (target, vec, elt));
+	  return;
+
+	default:
+	  gcc_unreachable ();
 	}
     }
 
@@ -6977,6 +7000,99 @@ rs6000_expand_vector_extract (rtx target
   emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
 }
 
+/* Split a variable vec_extract operation into the component instructions.  */
+
+void
+rs6000_split_vec_extract_var (rtx dest, rtx src, rtx element, rtx tmp_gpr,
+			      rtx tmp_altivec)
+{
+  machine_mode mode = GET_MODE (src);
+  machine_mode scalar_mode = GET_MODE (dest);
+  unsigned scalar_size = GET_MODE_SIZE (scalar_mode);
+  int byte_shift = exact_log2 (scalar_size);
+
+  gcc_assert (byte_shift >= 0);
+
+  if (REG_P (src) || SUBREG_P (src))
+    {
+      int bit_shift = byte_shift + 3;
+      rtx element2;
+
+      gcc_assert (REG_P (tmp_gpr) && REG_P (tmp_altivec));
+
+      /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
+	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
+	 will shift the element into the upper position (adding 3 to convert a
+	 byte shift into a bit shift).  */
+      if (scalar_size == 8)
+	{
+	  if (!VECTOR_ELT_ORDER_BIG)
+	    {
+	      emit_insn (gen_xordi3 (tmp_gpr, element, const1_rtx));
+	      element2 = tmp_gpr;
+	    }
+	  else
+	    element2 = element;
+
+	  /* Generate RLDIC directly to shift left 6 bits and retrieve 1
+	     bit.  */
+	  emit_insn (gen_rtx_SET (tmp_gpr,
+				  gen_rtx_AND (DImode,
+					       gen_rtx_ASHIFT (DImode,
+							       element2,
+							       GEN_INT (6)),
+					       GEN_INT (64))));
+	}
+      else
+	{
+	  if (!VECTOR_ELT_ORDER_BIG)
+	    {
+	      rtx num_ele_m1 = GEN_INT (GET_MODE_NUNITS (mode) - 1);
+
+	      emit_insn (gen_anddi3 (tmp_gpr, element, num_ele_m1));
+	      emit_insn (gen_subdi3 (tmp_gpr, num_ele_m1, tmp_gpr));
+	      element2 = tmp_gpr;
+	    }
+	  else
+	    element2 = element;
+
+	  emit_insn (gen_ashldi3 (tmp_gpr, element2, GEN_INT (bit_shift)));
+	}
+
+      /* Get the value into the lower byte of the Altivec register where VSLO
+	 expects it.  */
+      if (TARGET_P9_VECTOR)
+	emit_insn (gen_vsx_splat_v2di (tmp_altivec, tmp_gpr));
+      else if (can_create_pseudo_p ())
+	emit_insn (gen_vsx_concat_v2di (tmp_altivec, tmp_gpr, tmp_gpr));
+      else
+	{
+	  rtx tmp_di = gen_rtx_REG (DImode, REGNO (tmp_altivec));
+	  emit_move_insn (tmp_di, tmp_gpr);
+	  emit_insn (gen_vsx_concat_v2di (tmp_altivec, tmp_di, tmp_di));
+	}
+
+      /* Do the VSLO to get the value into the final location.  */
+      switch (mode)
+	{
+	case V2DFmode:
+	  emit_insn (gen_vsx_vslo_v2df (dest, src, tmp_altivec));
+	  return;
+
+	case V2DImode:
+	  emit_insn (gen_vsx_vslo_v2di (dest, src, tmp_altivec));
+	  return;
+
+	default:
+	  gcc_unreachable ();
+	}
+
+      return;
+    }
+  else
+    gcc_unreachable ();
+ }
+
 /* Return TRUE if OP is an invalid SUBREG operation on the e500.  */
 
 bool
@@ -38601,6 +38717,7 @@ rtx_is_swappable_p (rtx op, unsigned int
 	  case UNSPEC_VSX_CVDPSPN:
 	  case UNSPEC_VSX_CVSPDP:
 	  case UNSPEC_VSX_CVSPDPN:
+	  case UNSPEC_VSX_EXTRACT:
 	    return 0;
 	  case UNSPEC_VSPLT_DIRECT:
 	    *special = SH_SPLAT;
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 238826)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -309,6 +309,8 @@ (define_c_enum "unspec"
    UNSPEC_VSX_XVCVDPUXDS
    UNSPEC_VSX_SIGN_EXTEND
    UNSPEC_P9_MEMORY
+   UNSPEC_VSX_VSLO
+   UNSPEC_VSX_EXTRACT
   ])
 
 ;; VSX moves
@@ -2118,16 +2120,13 @@ (define_insn "vsx_set_<mode>"
 ;; register was picked.  Limit the scalar value to FPRs for now.
 
 (define_insn "vsx_extract_<mode>"
-  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand"
-            "=d,     wm,      wo,    d")
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=d,    d,     wr, wr")
 
 	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "gpc_reg_operand"
-            "<VSa>, <VSa>,  <VSa>,  <VSa>")
+	 (match_operand:VSX_D 1 "gpc_reg_operand"      "<VSa>, <VSa>, wm, wo")
 
 	 (parallel
-	  [(match_operand:QI 2 "const_0_to_1_operand"
-            "wD,    wD,     wL,     n")])))]
+	  [(match_operand:QI 2 "const_0_to_1_operand"  "wD,    n,     wD, n")])))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
 {
   int element = INTVAL (operands[2]);
@@ -2205,6 +2204,34 @@ (define_insn "*vsx_extract_<mode>_store"
   [(set_attr "type" "fpstore")
    (set_attr "length" "4")])
 
+;; Variable V2DI/V2DF extract shift
+(define_insn "vsx_vslo_<mode>"
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
+	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "gpc_reg_operand" "v")
+			     (match_operand:V2DI 2 "gpc_reg_operand" "v")]
+			    UNSPEC_VSX_VSLO))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+  "vslo %0,%1,%2"
+  [(set_attr "type" "vecperm")])
+
+;; Variable V2DI/V2DF extract
+(define_insn_and_split "vsx_extract_<mode>_var"
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
+	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v")
+			     (match_operand:DI 2 "gpc_reg_operand" "r")]
+			    UNSPEC_VSX_EXTRACT))
+   (clobber (match_scratch:DI 3 "=r"))
+   (clobber (match_scratch:V2DI 4 "=&v"))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
+				operands[3], operands[4]);
+  DONE;
+})
+
 ;; Extract a SF element from V4SF
 (define_insn_and_split "vsx_extract_v4sf"
   [(set (match_operand:SF 0 "vsx_register_operand" "=f,f")
Index: gcc/config/rs6000/rs6000.h
===================================================================
--- gcc/config/rs6000/rs6000.h	(revision 238826)
+++ gcc/config/rs6000/rs6000.h	(working copy)
@@ -602,7 +602,6 @@ extern int rs6000_vector_align[];
 #define TARGET_DIRECT_MOVE_128	(TARGET_P9_VECTOR && TARGET_DIRECT_MOVE \
 				 && TARGET_POWERPC64)
 #define TARGET_VEXTRACTUB	(TARGET_P9_VECTOR && TARGET_DIRECT_MOVE \
-				 && TARGET_UPPER_REGS_DF \
 				 && TARGET_UPPER_REGS_DI && TARGET_POWERPC64)
 
 /* Byte/char syncs were added as phased in for ISA 2.06B, but are not present
@@ -761,6 +760,14 @@ extern int rs6000_vector_align[];
 				 && TARGET_SINGLE_FLOAT			\
 				 && TARGET_DOUBLE_FLOAT)
 
+/* Macro to say whether we can do optimization where we need to do parts of the
+   calculation in 64-bit GPRs and then transferred to the vector
+   registers.  */
+#define TARGET_DIRECT_MOVE_64BIT	(TARGET_DIRECT_MOVE		\
+					 && TARGET_P8_VECTOR		\
+					 && TARGET_POWERPC64		\
+					 && TARGET_UPPER_REGS_DI)
+
 /* Whether the various reciprocal divide/square root estimate instructions
    exist, and whether we should automatically generate code for the instruction
    by default.  */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-1.c	(revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O2 -mupper-regs-df -mupper-regs-di" } */
+
+#include <altivec.h>
+
+double
+add_double (vector double a, int n)
+{
+  return vec_extract (a, n) + 1.0;
+}
+
+long
+add_long (vector long a, int n)
+{
+  return vec_extract (a, n) + 1;
+}
+
+/* { dg-final { scan-assembler     "vslo"    } } */
+/* { dg-final { scan-assembler     "mtvsrd"  } } */
+/* { dg-final { scan-assembler     "mfvsrd"  } } */
+/* { dg-final { scan-assembler-not "stxvd2x" } } */
+/* { dg-final { scan-assembler-not "stxvx"   } } */
+/* { dg-final { scan-assembler-not "stxv"    } } */
+/* { dg-final { scan-assembler-not "ldx"     } } */


* Re: [PATCH, 2 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-28 19:44     ` Michael Meissner
@ 2016-07-29  6:50       ` Segher Boessenkool
  0 siblings, 0 replies; 13+ messages in thread
From: Segher Boessenkool @ 2016-07-29  6:50 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Thu, Jul 28, 2016 at 03:44:25PM -0400, Michael Meissner wrote:
> > This isn't in the changelog.
> 
> Yes it is.

I need new glasses.

> > > +/* Macro to say whether we can optimize vector extracts.  */
> > > +#define VEC_EXTRACT_OPTIMIZE_P	(TARGET_DIRECT_MOVE			\
> > > +				 && TARGET_POWERPC64			\
> > > +				 && TARGET_UPPER_REGS_DI)
> > 
> > I'm not a big fan of this name.  "Optimize" will quickly become dated,
> > everyone will take the current new hot thing for granted, and then when
> > you want to optimise even more (say, for ISA 6.0 or whatever) the name
> > is really out of place.
> > 
> > But I don't know a much better name either.
> 
> I changed it to TARGET_DIRECT_MOVE_64BIT, which hopefully is clearer of what
> exactly we need.  In particular, the calculation of the bit shift is done in
> the GPR and direct move creates teh vector used by VSLO to do a byte shift.

That is a much better name, thanks!


Segher


* Re: [PATCH, 3 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 14:33 [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines Michael Meissner
  2016-07-27 19:35 ` Segher Boessenkool
  2016-07-27 21:16 ` [PATCH, 2 of 4], " Michael Meissner
@ 2016-07-30 15:29 ` Michael Meissner
  2016-07-30 15:38   ` Michael Meissner
  2016-07-30 16:01   ` Segher Boessenkool
  2016-08-01 22:38 ` [PATCH, 4 " Michael Meissner
  3 siblings, 2 replies; 13+ messages in thread
From: Michael Meissner @ 2016-07-30 15:29 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 2905 bytes --]

This patch adds better support for optimizing vec_extract of vector double or
vector long on 64-bit machines with direct move when the vector is in memory.
It converts the memory address from a vector address to the address of the
element, paying attention to the address mode allowed by the register being
loaded.  This patch adds support for vec_extract of the 2nd element and for
variable element number (previously the code only optimized accessing element 0
from memory).

I also added ISA 3.0 d-form memory addressing support for the vec_extract store
optimization.  This optimization applies when you do a vec_extract of element 0
(big endian) or 1 (little endian): the element can be stored directly without
doing an extract.
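
A rough sketch of the constant-offset case of the address adjustment
(illustrative only; the real logic is rs6000_adjust_vec_address in the patch
below, and the function name here is made up): the element offset can be folded
into a reg+offset address when the sum fits a signed 16-bit immediate and, for
8-byte scalars, satisfies DS-form alignment.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: can (base_offset + element * scalar_size) remain a
   D/DS-form displacement?  It must fit in 16 bits signed, and 8-byte
   scalars additionally need the low two bits clear (DS-form).  */
static bool
folds_to_dform (long base_offset, unsigned element, unsigned scalar_size)
{
  long off = base_offset + (long) element * scalar_size;
  return off >= -32768 && off <= 32767
	 && (scalar_size < 8 || (off & 0x3) == 0);
}
```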

I also noticed that the ISA 2.07 alternatives for the same optimization could
use some tuning: use a constraint that targets just the Altivec registers when
stxsd is used instead of stfd, and eliminate a redundant alternative.

I have tested this on a big endian power7 system (both 32-bit and 64-bit
tests), a big endian power8 system (only 64-bit tests), and a little endian
64-bit system with bootstrap builds and no regressions.  Can I apply these
patches to the trunk?

I have one more patch to go in the vec_extract series, that will add similar
optimizations to vector float, vector int, vector short, and vector char
vectors.

[gcc]
2016-07-30  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-protos.h (rs6000_adjust_vec_address): New
	function that takes a vector memory address, a hard register, an
	element number and a temporary base register, and recreates an
	address that points to the appropriate element within the vector.
	* config/rs6000/rs6000.c (rs6000_adjust_vec_address): Likewise.
	(rs6000_split_vec_extract_var): Add support for the target of a
	vec_extract with variable element number being a scalar memory
	location.
	* config/rs6000/vsx.md (vsx_extract_<mode>_load): Replace
	vsx_extract_<mode>_load insn with a new insn that optimizes
	storing either element to a memory location, using scratch
	registers to pick apart the vector and reconstruct the address.
	(vsx_extract_<P:mode>_<VSX_D:mode>_load): Likewise.
	(vsx_extract_<mode>_store): Rework alternatives to more correctly
	support Altivec registers.  Add support for ISA 3.0 Altivec d-form
	store instruction.
	(vsx_extract_<mode>_var): Add support for extracting a variable
	element number from memory.

[gcc/testsuite]
2016-07-30  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-extract-2.c: New tests for vec_extract of
	vector double or vector long where the vector is in memory.
	* gcc.target/powerpc/vec-extract-3.c: Likewise.
	* gcc.target/powerpc/vec-extract-4.c: Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

[-- Attachment #2: gcc-stage7.extract006b --]
[-- Type: text/plain, Size: 14008 bytes --]

Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 238892)
+++ gcc/config/rs6000/rs6000-protos.h	(.../gcc/config/rs6000)	(working copy)
@@ -63,6 +63,7 @@ extern void paired_expand_vector_init (r
 extern void rs6000_expand_vector_set (rtx, rtx, int);
 extern void rs6000_expand_vector_extract (rtx, rtx, rtx);
 extern void rs6000_split_vec_extract_var (rtx, rtx, rtx, rtx, rtx);
+extern rtx rs6000_adjust_vec_address (rtx, rtx, rtx, rtx, machine_mode);
 extern bool altivec_expand_vec_perm_const (rtx op[4]);
 extern void altivec_expand_vec_perm_le (rtx op[4]);
 extern bool rs6000_expand_vec_perm_const (rtx op[4]);
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 238892)
+++ gcc/config/rs6000/rs6000.c	(.../gcc/config/rs6000)	(working copy)
@@ -7001,6 +7001,164 @@ rs6000_expand_vector_extract (rtx target
   emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
 }
 
+/* Adjust a memory address (MEM) of a vector type to point to a scalar field
+   within the vector (ELEMENT) with a type (SCALAR_TYPE).  Use a base register
+   temporary (BASE_TMP) to fixup the address.  Return the new memory address
+   that is valid for reads or writes to a given register (SCALAR_REG).  */
+
+rtx
+rs6000_adjust_vec_address (rtx scalar_reg,
+			   rtx mem,
+			   rtx element,
+			   rtx base_tmp,
+			   machine_mode scalar_mode)
+{
+  unsigned scalar_size = GET_MODE_SIZE (scalar_mode);
+  rtx addr = XEXP (mem, 0);
+  rtx element_offset;
+  rtx new_addr;
+  bool valid_addr_p;
+
+  /* Vector addresses should not have PRE_INC, PRE_DEC, or PRE_MODIFY.  */
+  gcc_assert (GET_RTX_CLASS (GET_CODE (addr)) != RTX_AUTOINC);
+ 
+  /* Calculate what we need to add to the address to get the element
+     address.  */
+  if (CONST_INT_P (element))
+    element_offset = GEN_INT (INTVAL (element) * scalar_size);
+  else
+    {
+      int byte_shift = exact_log2 (scalar_size);
+      gcc_assert (byte_shift >= 0);
+
+      if (byte_shift == 0)
+	element_offset = element;
+
+      else
+	{
+	  if (TARGET_POWERPC64)
+	    emit_insn (gen_ashldi3 (base_tmp, element, GEN_INT (byte_shift)));
+	  else
+	    emit_insn (gen_ashlsi3 (base_tmp, element, GEN_INT (byte_shift)));
+
+	  element_offset = base_tmp;
+	}
+    }
+
+  /* Create the new address pointing to the element within the vector.  If we
+     are adding 0, we don't have to change the address.  */
+  if (element_offset == const0_rtx)
+    new_addr = addr;
+
+  /* A simple indirect address can be converted into a reg + offset
+     address.  */
+  else if (REG_P (addr) || SUBREG_P (addr))
+    new_addr = gen_rtx_PLUS (Pmode, addr, element_offset);
+
+  /* Optimize D-FORM addresses with constant offset with a constant element, to
+     include the element offset in the address directly.  */
+  else if (GET_CODE (addr) == PLUS)
+    {
+      rtx op0 = XEXP (addr, 0);
+      rtx op1 = XEXP (addr, 1);
+      rtx insn;
+
+      gcc_assert (REG_P (op0) || SUBREG_P (op0));
+      if (CONST_INT_P (op1) && CONST_INT_P (element_offset))
+	{
+	  HOST_WIDE_INT offset = INTVAL (op1) + INTVAL (element_offset);
+	  rtx offset_rtx = GEN_INT (offset);
+
+	  if (IN_RANGE (offset, -32768, 32767)
+	      && (scalar_size < 8 || (offset & 0x3) == 0))
+	    new_addr = gen_rtx_PLUS (Pmode, op0, offset_rtx);
+	  else
+	    {
+	      emit_move_insn (base_tmp, offset_rtx);
+	      new_addr = gen_rtx_PLUS (Pmode, op0, base_tmp);
+	    }
+	}
+      else
+	{
+	  if (REG_P (op1) || SUBREG_P (op1))
+	    {
+	      insn = gen_add3_insn (base_tmp, op1, element_offset);
+	      gcc_assert (insn != NULL_RTX);
+	      emit_insn (insn);
+	    }
+
+	  else if (REG_P (element_offset) || SUBREG_P (element_offset))
+	    {
+	      insn = gen_add3_insn (base_tmp, element_offset, op1);
+	      gcc_assert (insn != NULL_RTX);
+	      emit_insn (insn);
+	    }
+
+	  else
+	    {
+	      emit_move_insn (base_tmp, op1);
+	      emit_insn (gen_add2_insn (base_tmp, element_offset));
+	    }
+
+	  new_addr = gen_rtx_PLUS (Pmode, op0, base_tmp);
+	}
+    }
+
+  else
+    {
+      emit_move_insn (base_tmp, addr);
+      new_addr = gen_rtx_PLUS (Pmode, base_tmp, element_offset);
+    }
+
+  /* If we have a PLUS, we need to see whether the particular register class
+     allows for D-FORM or X-FORM addressing.  */
+  if (GET_CODE (new_addr) == PLUS)
+    {
+      rtx op1 = XEXP (new_addr, 1);
+      addr_mask_type addr_mask;
+      int scalar_regno;
+
+      if (REG_P (scalar_reg))
+	scalar_regno = REGNO (scalar_reg);
+      else if (SUBREG_P (scalar_reg))
+	scalar_regno = subreg_regno (scalar_reg);
+      else
+	gcc_unreachable ();
+
+      gcc_assert (scalar_regno < FIRST_PSEUDO_REGISTER);
+      if (INT_REGNO_P (scalar_regno))
+	addr_mask = reg_addr[scalar_mode].addr_mask[RELOAD_REG_GPR];
+
+      else if (FP_REGNO_P (scalar_regno))
+	addr_mask = reg_addr[scalar_mode].addr_mask[RELOAD_REG_FPR];
+
+      else if (ALTIVEC_REGNO_P (scalar_regno))
+	addr_mask = reg_addr[scalar_mode].addr_mask[RELOAD_REG_VMX];
+
+      else
+	gcc_unreachable ();
+
+      if (REG_P (op1) || SUBREG_P (op1))
+	valid_addr_p = (addr_mask & RELOAD_REG_INDEXED) != 0;
+      else
+	valid_addr_p = (addr_mask & RELOAD_REG_OFFSET) != 0;
+    }
+
+  else if (REG_P (new_addr) || SUBREG_P (new_addr))
+    valid_addr_p = true;
+
+  else
+    valid_addr_p = false;
+
+  if (!valid_addr_p)
+    {
+      emit_move_insn (base_tmp, new_addr);
+      new_addr = base_tmp;
+    }
+
+  return change_address (mem, scalar_mode, new_addr);
+}
+
 /* Split a variable vec_extract operation into the component instructions.  */
 
 void
@@ -7014,7 +7172,18 @@ rs6000_split_vec_extract_var (rtx dest, 
 
   gcc_assert (byte_shift >= 0);
 
-  if (REG_P (src) || SUBREG_P (src))
+  /* If we are given a memory address, optimize to load just the element.  We
+     don't have to adjust the vector element number on little endian
+     systems.  */
+  if (MEM_P (src))
+    {
+      gcc_assert (REG_P (tmp_gpr));
+      emit_move_insn (dest, rs6000_adjust_vec_address (dest, src, element,
+						       tmp_gpr, scalar_mode));
+      return;
+    }
+
+  else if (REG_P (src) || SUBREG_P (src))
     {
       int bit_shift = byte_shift + 3;
       rtx element2;
@@ -7024,7 +7193,7 @@ rs6000_split_vec_extract_var (rtx dest, 
       /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
 	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
 	 will shift the element into the upper position (adding 3 to convert a
-	 byte shift into a bit shift).  */
+	 byte shift into a bit shift). */
       if (scalar_size == 8)
 	{
 	  if (!VECTOR_ELT_ORDER_BIG)
@@ -38759,6 +38928,7 @@ rtx_is_swappable_p (rtx op, unsigned int
 	  case UNSPEC_VSX_CVSPDP:
 	  case UNSPEC_VSX_CVSPDPN:
 	  case UNSPEC_VSX_EXTRACT:
+	  case UNSPEC_VSX_VSLO:
 	    return 0;
 	  case UNSPEC_VSPLT_DIRECT:
 	    *special = SH_SPLAT;
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 238892)
+++ gcc/config/rs6000/vsx.md	(.../gcc/config/rs6000)	(working copy)
@@ -2174,33 +2174,36 @@ (define_insn "vsx_extract_<mode>"
 }
   [(set_attr "type" "veclogical,mftgpr,mftgpr,vecperm")])
 
-;; Optimize extracting a single scalar element from memory if the scalar is in
-;; the correct location to use a single load.
-(define_insn "*vsx_extract_<mode>_load"
-  [(set (match_operand:<VS_scalar> 0 "register_operand" "=d,wv,wr")
-	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "memory_operand" "m,Z,m")
-	 (parallel [(const_int 0)])))]
-  "VECTOR_MEM_VSX_P (<MODE>mode)"
-  "@
-   lfd%U1%X1 %0,%1
-   lxsd%U1x %x0,%y1
-   ld%U1%X1 %0,%1"
-  [(set_attr "type" "fpload,fpload,load")
-   (set_attr "length" "4")])
+;; Optimize extracting a single scalar element from memory.
+(define_insn_and_split "*vsx_extract_<P:mode>_<VSX_D:mode>_load"
+  [(set (match_operand:<VS_scalar> 0 "register_operand" "=<VSX_D:VS_64reg>,wr")
+	(vec_select:<VSX_D:VS_scalar>
+	 (match_operand:VSX_D 1 "memory_operand" "m,m")
+	 (parallel [(match_operand:QI 2 "const_0_to_1_operand" "n,n")])))
+   (clobber (match_scratch:P 3 "=&b,&b"))]
+  "VECTOR_MEM_VSX_P (<VSX_D:MODE>mode)"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (match_dup 4))]
+{
+  operands[4] = rs6000_adjust_vec_address (operands[0], operands[1], operands[2],
+					   operands[3], <VSX_D:VS_scalar>mode);
+}
+  [(set_attr "type" "fpload,load")
+   (set_attr "length" "8")])
 
 ;; Optimize storing a single scalar element that is the right location to
 ;; memory
 (define_insn "*vsx_extract_<mode>_store"
-  [(set (match_operand:<VS_scalar> 0 "memory_operand" "=m,Z,?Z")
+  [(set (match_operand:<VS_scalar> 0 "memory_operand" "=m,Z,o")
 	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "register_operand" "d,wd,<VSa>")
+	 (match_operand:VSX_D 1 "register_operand" "d,wv,wb")
 	 (parallel [(match_operand:QI 2 "vsx_scalar_64bit" "wD,wD,wD")])))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
   "@
    stfd%U0%X0 %1,%0
    stxsd%U0x %x1,%y0
-   stxsd%U0x %x1,%y0"
+   stxsd %1,%0"
   [(set_attr "type" "fpstore")
    (set_attr "length" "4")])
 
@@ -2216,12 +2219,12 @@ (define_insn "vsx_vslo_<mode>"
 
 ;; Variable V2DI/V2DF extract
 (define_insn_and_split "vsx_extract_<mode>_var"
-  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v")
-	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v")
-			     (match_operand:DI 2 "gpc_reg_operand" "r")]
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=v,<VSa>,r")
+	(unspec:<VS_scalar> [(match_operand:VSX_D 1 "input_operand" "v,m,m")
+			     (match_operand:DI 2 "gpc_reg_operand" "r,r,r")]
 			    UNSPEC_VSX_EXTRACT))
-   (clobber (match_scratch:DI 3 "=r"))
-   (clobber (match_scratch:V2DI 4 "=&v"))]
+   (clobber (match_scratch:DI 3 "=r,&b,&b"))
+   (clobber (match_scratch:V2DI 4 "=&v,X,X"))]
   "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
   "#"
   "&& reload_completed"
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-2.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-2.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 0)
@@ -0,0 +1,37 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include <altivec.h>
+
+double
+add_double_0 (vector double *p, double x)
+{
+  return vec_extract (*p, 0) + x;
+}
+
+double
+add_double_1 (vector double *p, double x)
+{
+  return vec_extract (*p, 1) + x;
+}
+
+long
+add_long_0 (vector long *p, long x)
+{
+  return vec_extract (*p, 0) + x;
+}
+
+long
+add_long_1 (vector long *p, long x)
+{
+  return vec_extract (*p, 1) + x;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-3.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-3.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 0)
@@ -0,0 +1,26 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+double
+add_double_n (vector double *p, double x, long n)
+{
+  return vec_extract (*p, n) + x;
+}
+
+long
+add_long_n (vector long *p, long x, long n)
+{
+  return vec_extract (*p, n) + x;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-4.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-4.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-options "-O2 -mcpu=power9" } */
+
+#include <altivec.h>
+
+#ifdef __LITTLE_ENDIAN__
+#define ELEMENT 1
+#else
+#define ELEMENT 0
+#endif
+
+void foo (double *p, vector double v)
+{
+  p[10] = vec_extract (v, ELEMENT);
+}
+
+/* { dg-final { scan-assembler     "stxsd "   } } */
+/* { dg-final { scan-assembler-not "stxsdx"   } } */
+/* { dg-final { scan-assembler-not "stfd"     } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, 3 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-30 15:29 ` [PATCH, 3 " Michael Meissner
@ 2016-07-30 15:38   ` Michael Meissner
  2016-07-30 16:04     ` Segher Boessenkool
  2016-07-30 16:01   ` Segher Boessenkool
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Meissner @ 2016-07-30 15:38 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Bill Schmidt

I noticed I forgot to include one of the changes in the ChangeLog file
(rtx_is_swappable_p).  This change was originally meant to be in the previous
patch, and it got left out.  The corrected ChangeLog is:

[gcc]
2016-07-30  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-protos.h (rs6000_adjust_vec_address): New
	function that takes a vector memory address, a hard register, an
	element number and a temporary base register, and recreates an
	address that points to the appropriate element within the vector.
	* config/rs6000/rs6000.c (rs6000_adjust_vec_address): Likewise.
	(rs6000_split_vec_extract_var): Add support for the target of a
	vec_extract with variable element number being a scalar memory
	location.
	* config/rs6000/vsx.md (vsx_extract_<mode>_load): Replace
	vsx_extract_<mode>_load insn with a new insn that optimizes
	storing either element to a memory location, using scratch
	registers to pick apart the vector and reconstruct the address.
	(vsx_extract_<P:mode>_<VSX_D:mode>_load): Likewise.
	(vsx_extract_<mode>_store): Rework alternatives to more correctly
	support Altivec registers.  Add support for ISA 3.0 Altivec d-form
	store instruction.
	(vsx_extract_<mode>_var): Add support for extracting a variable
	element number from memory.
	(rtx_is_swappable_p): VLSO insns (UNSPEC_VSX_VSLOW) are not
	swappable.

[gcc/testsuite]
2016-07-30  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-extract-2.c: New tests for vec_extract of
	vector double or vector long where the vector is in memory.
	* gcc.target/powerpc/vec-extract-3.c: Likewise.
	* gcc.target/powerpc/vec-extract-4.c: Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


* Re: [PATCH, 3 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-30 15:29 ` [PATCH, 3 " Michael Meissner
  2016-07-30 15:38   ` Michael Meissner
@ 2016-07-30 16:01   ` Segher Boessenkool
  1 sibling, 0 replies; 13+ messages in thread
From: Segher Boessenkool @ 2016-07-30 16:01 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Sat, Jul 30, 2016 at 11:29:25AM -0400, Michael Meissner wrote:
> This patch adds better support for optimizing vec_extract of vector double or
> vector long on 64-bit machines with direct move when the vector is in memory.

> I have tested this on a big endian power7 system (both 32-bit and 64-bit
> tests), a big endian power8 system (only 64-bit tests), and a little endian
> 64-bit system with bootstrap builds and no regressions.  Can I apply these
> patches to the trunk?

Yes please.  Some nits below.

> I have one more patch to go in the vec_extract series, that will add similar
> optimizations to vector float, vector int, vector short, and vector char
> vectors.

Looking forward to it!

> --- gcc/config/rs6000/rs6000.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 238892)
> +++ gcc/config/rs6000/rs6000.c	(.../gcc/config/rs6000)	(working copy)
> @@ -7001,6 +7001,164 @@ rs6000_expand_vector_extract (rtx target
>    emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
>  }
>  
> +/* Adjust a memory address (MEM) of a vector type to point to a scalar field
> +   within the vector (ELEMENT) with a type (SCALAR_TYPE).  Use a base register

Mode, not type (and SCALAR_MODE)..

> +  /* Vector addresses should not have PRE_INC, PRE_DEC, or PRE_MODIFY.  */
> +  gcc_assert (GET_RTX_CLASS (GET_CODE (addr)) != RTX_AUTOINC);
> + 

Stray space on this last line?

> @@ -7024,7 +7193,7 @@ rs6000_split_vec_extract_var (rtx dest, 
>        /* For little endian, adjust element ordering.  For V2DI/V2DF, we can use
>  	 an XOR, otherwise we need to subtract.  The shift amount is so VSLO
>  	 will shift the element into the upper position (adding 3 to convert a
> -	 byte shift into a bit shift).  */
> +	 byte shift into a bit shift). */

Let's not :-)

Thanks,


Segher


* Re: [PATCH, 3 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-30 15:38   ` Michael Meissner
@ 2016-07-30 16:04     ` Segher Boessenkool
  0 siblings, 0 replies; 13+ messages in thread
From: Segher Boessenkool @ 2016-07-30 16:04 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Sat, Jul 30, 2016 at 11:37:46AM -0400, Michael Meissner wrote:
> I noticed I forgot to include one of the changes in the ChangeLog file
> (rtx_is_swappable_p).  This change was originally meant to be in the previous
> patch, and it got left out.  The corrected ChangeLog is:

But put it on the correct file please (rs6000.c) ;-)


Segher


> 	* config/rs6000/rs6000-protos.h (rs6000_adjust_vec_address): New
> 	function that takes a vector memory address, a hard register, an
> 	element number and a temporary base register, and recreates an
> 	address that points to the appropriate element within the vector.
> 	* config/rs6000/rs6000.c (rs6000_adjust_vec_address): Likewise.
> 	(rs6000_split_vec_extract_var): Add support for the target of a
> 	vec_extract with variable element number being a scalar memory
> 	location.
> 	* config/rs6000/vsx.md (vsx_extract_<mode>_load): Replace
> 	vsx_extract_<mode>_load insn with a new insn that optimizes
> 	storing either element to a memory location, using scratch
> 	registers to pick apart the vector and reconstruct the address.
> 	(vsx_extract_<P:mode>_<VSX_D:mode>_load): Likewise.
> 	(vsx_extract_<mode>_store): Rework alternatives to more correctly
> 	support Altivec registers.  Add support for ISA 3.0 Altivec d-form
> 	store instruction.
> 	(vsx_extract_<mode>_var): Add support for extracting a variable
> 	element number from memory.
> 	(rtx_is_swappable_p): VLSO insns (UNSPEC_VSX_VSLOW) are not
> 	swappable.


* Re: [PATCH, 4 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-07-27 14:33 [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines Michael Meissner
                   ` (2 preceding siblings ...)
  2016-07-30 15:29 ` [PATCH, 3 " Michael Meissner
@ 2016-08-01 22:38 ` Michael Meissner
  2016-08-01 22:55   ` Segher Boessenkool
  3 siblings, 1 reply; 13+ messages in thread
From: Michael Meissner @ 2016-08-01 22:38 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]

This is the 4th patch to enhance vec_extract on 64-bit power8/power9 machines.

This patch builds on the load-from-memory and variable-element support that the
previous patches added for vector long/vector double, and adds the same support
for vector float, vector int, vector short, and vector char.

I have tested these patches with bootstrap builds and make check on:

    1) Big endian power7 (both -m32 and -m64 tests done)
    2) Big endian power8 (only -m64 tests were done)
    3) Little endian power8

There were no regressions.  Can I check these patches into the trunk?

One further optimization would be to add support for constant element extracts
if the vector is currently in GPRs rather than vector registers on 64-bit
systems.  I'm not sure if it would be a win in general, or if it would cause
the register allocators to generate more moves between the GPR and vector
register banks.

[gcc]
2016-08-01  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Add support for vec_extract on vector float, vector int, vector
	short, and vector char vector types.
	* config/rs6000/rs6000.c (rs6000_expand_vector_extract): Add
	vector float, vector int, vector short, and vector char
	optimizations on 64-bit ISA 2.07 systems for both constant and
	variable element numbers.
	(rs6000_split_vec_extract_var): Likewise.
	* config/rs6000/vsx.md (vsx_xscvspdp_scalar2): Allow SFmode to be
	Altivec registers on ISA 2.07 and above.
	(vsx_extract_v4sf): Delete alternative that hard coded element 0,
	which never was matched due to the split occuring before register
	allocation (and the code would not have worked on little endian
	systems if it did match).  Allow extracts to go to the Altivec
	registers if ISA 2.07 (power8).  Change from using "" around the
	C++ code to using {}'s.
	(vsx_extract_v4sf_<mode>_load): New insn to optimize vector float
	vec_extracts when the vector is in memory.
	(vsx_extract_v4sf_var): New insn to optimize vector float
	vec_extracts when the element number is variable on 64-bit ISA
	2.07 systems.
	(vsx_extract_<mode>, VSX_EXTRACT_I iterator): Add optimizations
	for 64-bit ISA 2.07 as well as ISA 3.0.
	(vsx_extract_<mode>_p9, VSX_EXTRACT_I iterator): Likewise.
	(vsx_extract_<mode>_p8, VSX_EXTRACT_I iterator): Likewise.
	(vsx_extract_<mode>_load, VSX_EXTRACT_I iterator): New insn to
	optimize vector int, vector short, and vector char vec_extracts
	when the vector is in memory.
	(vsx_extract_<mode>_var, VSX_EXTRACT_I iterator): New insn to
	optimize vector int, vector short, and vector char vec_extracts
	when the element number is variable.

[gcc/testsuite]
2016-08-01  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-extract-5.c: New tests to test
	vec_extract for vector float, vector int, vector short, and vector
	char.
	* gcc.target/powerpc/vec-extract-6.c: Likewise.
	* gcc.target/powerpc/vec-extract-7.c: Likewise.
	* gcc.target/powerpc/vec-extract-8.c: Likewise.
	* gcc.target/powerpc/vec-extract-9.c: Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

[-- Attachment #2: gcc-stage7.extract007b --]
[-- Type: text/plain, Size: 19361 bytes --]

Index: gcc/config/rs6000/rs6000-c.c
===================================================================
--- gcc/config/rs6000/rs6000-c.c	(revision 238892)
+++ gcc/config/rs6000/rs6000-c.c	(working copy)
@@ -5135,6 +5135,25 @@ altivec_resolve_overloaded_builtin (loca
 		case V2DImode:
 		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
 		  break;
+
+		case V4SFmode:
+		  call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V4SF];
+		  break;
+
+		case V4SImode:
+		  if (TARGET_DIRECT_MOVE_64BIT)
+		    call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V4SI];
+		  break;
+
+		case V8HImode:
+		  if (TARGET_DIRECT_MOVE_64BIT)
+		    call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V8HI];
+		  break;
+
+		case V16QImode:
+		  if (TARGET_DIRECT_MOVE_64BIT)
+		    call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V16QI];
+		  break;
 		}
 	    }
 
@@ -5154,6 +5173,22 @@ altivec_resolve_overloaded_builtin (loca
 		case V2DImode:
 		  call = rs6000_builtin_decls[VSX_BUILTIN_VEC_EXT_V2DI];
 		  break;
+
+		case V4SFmode:
+		  call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V4SF];
+		  break;
+
+		case V4SImode:
+		  call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V4SI];
+		  break;
+
+		case V8HImode:
+		  call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V8HI];
+		  break;
+
+		case V16QImode:
+		  call = rs6000_builtin_decls[ALTIVEC_BUILTIN_VEC_EXT_V16QI];
+		  break;
 		}
 	    }
 
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 238899)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6938,7 +6938,7 @@ rs6000_expand_vector_extract (rtx target
 	  emit_insn (gen_vsx_extract_v4sf (target, vec, elt));
 	  return;
 	case V16QImode:
-	  if (TARGET_VEXTRACTUB)
+	  if (TARGET_DIRECT_MOVE_64BIT)
 	    {
 	      emit_insn (gen_vsx_extract_v16qi (target, vec, elt));
 	      return;
@@ -6946,7 +6946,7 @@ rs6000_expand_vector_extract (rtx target
 	  else
 	    break;
 	case V8HImode:
-	  if (TARGET_VEXTRACTUB)
+	  if (TARGET_DIRECT_MOVE_64BIT)
 	    {
 	      emit_insn (gen_vsx_extract_v8hi (target, vec, elt));
 	      return;
@@ -6954,7 +6954,7 @@ rs6000_expand_vector_extract (rtx target
 	  else
 	    break;
 	case V4SImode:
-	  if (TARGET_VEXTRACTUB)
+	  if (TARGET_DIRECT_MOVE_64BIT)
 	    {
 	      emit_insn (gen_vsx_extract_v4si (target, vec, elt));
 	      return;
@@ -6982,6 +6982,26 @@ rs6000_expand_vector_extract (rtx target
 	  emit_insn (gen_vsx_extract_v2di_var (target, vec, elt));
 	  return;
 
+	case V4SFmode:
+	  if (TARGET_UPPER_REGS_SF)
+	    {
+	      emit_insn (gen_vsx_extract_v4sf_var (target, vec, elt));
+	      return;
+	    }
+	  break;
+
+	case V4SImode:
+	  emit_insn (gen_vsx_extract_v4si_var (target, vec, elt));
+	  return;
+
+	case V8HImode:
+	  emit_insn (gen_vsx_extract_v8hi_var (target, vec, elt));
+	  return;
+
+	case V16QImode:
+	  emit_insn (gen_vsx_extract_v16qi_var (target, vec, elt));
+	  return;
+
 	default:
 	  gcc_unreachable ();
 	}
@@ -7253,6 +7273,33 @@ rs6000_split_vec_extract_var (rtx dest, 
 	  emit_insn (gen_vsx_vslo_v2di (dest, src, tmp_altivec));
 	  return;
 
+	case V4SFmode:
+	  {
+	    rtx tmp_altivec_di = gen_rtx_REG (DImode, REGNO (tmp_altivec));
+	    rtx tmp_altivec_v4sf = gen_rtx_REG (V4SFmode, REGNO (tmp_altivec));
+	    rtx src_v2di = gen_rtx_REG (V2DImode, REGNO (src));
+	    emit_insn (gen_vsx_vslo_v2di (tmp_altivec_di, src_v2di,
+					  tmp_altivec));
+
+	    emit_insn (gen_vsx_xscvspdp_scalar2 (dest, tmp_altivec_v4sf));
+	    return;
+	  }
+
+	case V4SImode:
+	case V8HImode:
+	case V16QImode:
+	  {
+	    rtx tmp_altivec_di = gen_rtx_REG (DImode, REGNO (tmp_altivec));
+	    rtx src_v2di = gen_rtx_REG (V2DImode, REGNO (src));
+	    rtx tmp_gpr_di = gen_rtx_REG (DImode, REGNO (dest));
+	    emit_insn (gen_vsx_vslo_v2di (tmp_altivec_di, src_v2di,
+					  tmp_altivec));
+	    emit_move_insn (tmp_gpr_di, tmp_altivec_di);
+	    emit_insn (gen_ashrdi3 (tmp_gpr_di, tmp_gpr_di,
+				    GEN_INT (64 - (8 * scalar_size))));
+	    return;
+	  }
+
 	default:
 	  gcc_unreachable ();
 	}
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 238899)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -1663,7 +1663,7 @@ (define_insn "vsx_xscvdpsp_scalar"
 
 ;; Same as vsx_xscvspdp, but use SF as the type
 (define_insn "vsx_xscvspdp_scalar2"
-  [(set (match_operand:SF 0 "vsx_register_operand" "=f")
+  [(set (match_operand:SF 0 "vsx_register_operand" "=ww")
 	(unspec:SF [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
 		   UNSPEC_VSX_CVSPDP))]
   "VECTOR_UNIT_VSX_P (V4SFmode)"
@@ -2237,18 +2237,15 @@ (define_insn_and_split "vsx_extract_<mod
 
 ;; Extract a SF element from V4SF
 (define_insn_and_split "vsx_extract_v4sf"
-  [(set (match_operand:SF 0 "vsx_register_operand" "=f,f")
+  [(set (match_operand:SF 0 "vsx_register_operand" "=ww")
 	(vec_select:SF
-	 (match_operand:V4SF 1 "vsx_register_operand" "wa,wa")
-	 (parallel [(match_operand:QI 2 "u5bit_cint_operand" "O,i")])))
-   (clobber (match_scratch:V4SF 3 "=X,0"))]
+	 (match_operand:V4SF 1 "vsx_register_operand" "wa")
+	 (parallel [(match_operand:QI 2 "u5bit_cint_operand" "n")])))
+   (clobber (match_scratch:V4SF 3 "=0"))]
   "VECTOR_UNIT_VSX_P (V4SFmode)"
-  "@
-   xscvspdp %x0,%x1
-   #"
-  ""
+  "#"
+  "&& 1"
   [(const_int 0)]
-  "
 {
   rtx op0 = operands[0];
   rtx op1 = operands[1];
@@ -2268,10 +2265,46 @@ (define_insn_and_split "vsx_extract_v4sf
     }
   emit_insn (gen_vsx_xscvspdp_scalar2 (op0, tmp));
   DONE;
-}"
-  [(set_attr "length" "4,8")
+}
+  [(set_attr "length" "8")
    (set_attr "type" "fp")])
 
+(define_insn_and_split "*vsx_extract_v4sf_<mode>_load"
+  [(set (match_operand:SF 0 "register_operand" "=f,wv,wb,?r")
+	(vec_select:SF
+	 (match_operand:V4SF 1 "memory_operand" "m,Z,m,m")
+	 (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n,n")])))
+   (clobber (match_scratch:P 3 "=&b,&b,&b,&b"))]
+  "VECTOR_MEM_VSX_P (V4SFmode)"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (match_dup 4))]
+{
+  operands[4] = rs6000_adjust_vec_address (operands[0], operands[1], operands[2],
+					   operands[3], SFmode);
+}
+  [(set_attr "type" "fpload,fpload,fpload,load")
+   (set_attr "length" "8")])
+
+;; Variable V4SF extract
+(define_insn_and_split "vsx_extract_v4sf_var"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "=ww,ww,?r")
+	(unspec:SF [(match_operand:V4SF 1 "input_operand" "v,m,m")
+		    (match_operand:DI 2 "gpc_reg_operand" "r,r,r")]
+		   UNSPEC_VSX_EXTRACT))
+   (clobber (match_scratch:DI 3 "=r,&b,&b"))
+   (clobber (match_scratch:V2DI 4 "=&v,X,X"))]
+  "VECTOR_MEM_VSX_P (V4SFmode) && TARGET_DIRECT_MOVE_64BIT
+   && TARGET_UPPER_REGS_SF"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
+				operands[3], operands[4]);
+  DONE;
+})
+
 ;; Expand the builtin form of xxpermdi to canonical rtl.
 (define_expand "vsx_xxpermdi_<mode>"
   [(match_operand:VSX_L 0 "vsx_register_operand" "")
@@ -2370,7 +2403,21 @@ (define_expand "vec_perm_const<mode>"
 ;; Extraction of a single element in a small integer vector.  None of the small
 ;; types are currently allowed in a vector register, so we extract to a DImode
 ;; and either do a direct move or store.
-(define_insn_and_split  "vsx_extract_<mode>"
+(define_expand  "vsx_extract_<mode>"
+  [(parallel [(set (match_operand:<VS_scalar> 0 "nonimmediate_operand" "")
+		   (vec_select:<VS_scalar>
+		    (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "")
+		    (parallel [(match_operand:QI 2 "const_int_operand" "")])))
+	      (clobber (match_dup 3))])]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+{
+  operands[3] = gen_rtx_SCRATCH ((TARGET_VEXTRACTUB) ? DImode : <MODE>mode);
+})
+
+;; Under ISA 3.0, we can use the byte/half-word/word integer stores if we are
+;; extracting a vector element and storing it to memory, rather than using
+;; direct move to a GPR and a GPR store.
+(define_insn_and_split  "*vsx_extract_<mode>_p9"
   [(set (match_operand:<VS_scalar> 0 "nonimmediate_operand" "=r,Z")
 	(vec_select:<VS_scalar>
 	 (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "<VSX_EX>,<VSX_EX>")
@@ -2438,6 +2485,95 @@ (define_insn  "vsx_extract_<mode>_di"
 }
   [(set_attr "type" "vecsimple")])
 
+(define_insn_and_split  "*vsx_extract_<mode>_p8"
+  [(set (match_operand:<VS_scalar> 0 "nonimmediate_operand" "=r")
+	(vec_select:<VS_scalar>
+	 (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "v")
+	 (parallel [(match_operand:QI 2 "<VSX_EXTRACT_PREDICATE>" "n")])))
+   (clobber (match_scratch:VSX_EXTRACT_I 3 "=v"))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx element = operands[2];
+  rtx vec_tmp = operands[3];
+  int value;
+
+  if (!VECTOR_ELT_ORDER_BIG)
+    element = GEN_INT (GET_MODE_NUNITS (<MODE>mode) - 1 - INTVAL (element));
+
+  /* If the value is in the correct position, we can avoid doing the VSPLT<x>
+     instruction.  */
+  value = INTVAL (element);
+  if (<MODE>mode == V16QImode)
+    {
+      if (value != 7)
+	emit_insn (gen_altivec_vspltb_direct (vec_tmp, src, element));
+      else
+	vec_tmp = src;
+    }
+  else if (<MODE>mode == V8HImode)
+    {
+      if (value != 3)
+	emit_insn (gen_altivec_vsplth_direct (vec_tmp, src, element));
+      else
+	vec_tmp = src;
+    }
+  else if (<MODE>mode == V4SImode)
+    {
+      if (value != 1)
+	emit_insn (gen_altivec_vspltw_direct (vec_tmp, src, element));
+      else
+	vec_tmp = src;
+    }
+  else
+    gcc_unreachable ();
+
+  emit_move_insn (gen_rtx_REG (DImode, REGNO (dest)),
+		  gen_rtx_REG (DImode, REGNO (vec_tmp)));
+  DONE;
+}
+  [(set_attr "type" "mftgpr")])
+
+;; Optimize extracting a single scalar element from memory.
+(define_insn_and_split "*vsx_extract_<mode>_load"
+  [(set (match_operand:<VS_scalar> 0 "register_operand" "=r")
+	(vec_select:<VS_scalar>
+	 (match_operand:VSX_EXTRACT_I 1 "memory_operand" "m")
+	 (parallel [(match_operand:QI 2 "<VSX_EXTRACT_PREDICATE>" "n")])))
+   (clobber (match_scratch:DI 3 "=&b"))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (match_dup 4))]
+{
+  operands[4] = rs6000_adjust_vec_address (operands[0], operands[1], operands[2],
+					   operands[3], <VS_scalar>mode);
+}
+  [(set_attr "type" "load")
+   (set_attr "length" "8")])
+
+;; Variable V16QI/V8HI/V4SI extract
+(define_insn_and_split "vsx_extract_<mode>_var"
+  [(set (match_operand:<VS_scalar> 0 "gpc_reg_operand" "=r,r")
+	(unspec:<VS_scalar>
+	 [(match_operand:VSX_EXTRACT_I 1 "input_operand" "v,m")
+	  (match_operand:DI 2 "gpc_reg_operand" "r,r")]
+	 UNSPEC_VSX_EXTRACT))
+   (clobber (match_scratch:DI 3 "=r,&b"))
+   (clobber (match_scratch:V2DI 4 "=&v,X"))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rs6000_split_vec_extract_var (operands[0], operands[1], operands[2],
+				operands[3], operands[4]);
+  DONE;
+})
 
 ;; Expanders for builtins
 (define_expand "vsx_mergel_<mode>"
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-6.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-6.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+unsigned char
+add_unsigned_char_0 (vector unsigned char *p)
+{
+  return vec_extract (*p, 0) + 1;
+}
+
+unsigned char
+add_unsigned_char_1 (vector unsigned char *p)
+{
+  return vec_extract (*p, 1) + 1;
+}
+
+unsigned char
+add_unsigned_char_2 (vector unsigned char *p)
+{
+  return vec_extract (*p, 2) + 1;
+}
+
+unsigned char
+add_unsigned_char_3 (vector unsigned char *p)
+{
+  return vec_extract (*p, 3) + 1;
+}
+
+unsigned char
+add_unsigned_char_4 (vector unsigned char *p)
+{
+  return vec_extract (*p, 4) + 1;
+}
+
+unsigned char
+add_unsigned_char_5 (vector unsigned char *p)
+{
+  return vec_extract (*p, 5) + 1;
+}
+
+unsigned char
+add_unsigned_char_6 (vector unsigned char *p)
+{
+  return vec_extract (*p, 6) + 1;
+}
+
+unsigned char
+add_unsigned_char_7 (vector unsigned char *p)
+{
+  return vec_extract (*p, 7) + 1;
+}
+
+unsigned char
+add_unsigned_char_n (vector unsigned char *p, int n)
+{
+  return vec_extract (*p, n) + 1;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-7.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-7.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+float
+add_float_0 (vector float *p)
+{
+  return vec_extract (*p, 0) + 1.0f;
+}
+
+float
+add_float_1 (vector float *p)
+{
+  return vec_extract (*p, 1) + 1.0f;
+}
+
+float
+add_float_2 (vector float *p)
+{
+  return vec_extract (*p, 2) + 1.0f;
+}
+
+float
+add_float_3 (vector float *p)
+{
+  return vec_extract (*p, 3) + 1.0f;
+}
+
+float
+add_float_n (vector float *p, long n)
+{
+  return vec_extract (*p, n) + 1.0f;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-8.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-8.c	(revision 0)
@@ -0,0 +1,44 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+int
+add_int_0 (vector int *p)
+{
+  return vec_extract (*p, 0) + 1;
+}
+
+int
+add_int_1 (vector int *p)
+{
+  return vec_extract (*p, 1) + 1;
+}
+
+int
+add_int_2 (vector int *p)
+{
+  return vec_extract (*p, 2) + 1;
+}
+
+int
+add_int_3 (vector int *p)
+{
+  return vec_extract (*p, 3) + 1;
+}
+
+int
+add_int_n (vector int *p, int n)
+{
+  return vec_extract (*p, n) + 1;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-9.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-9.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+short
+add_short_0 (vector short *p)
+{
+  return vec_extract (*p, 0) + 1;
+}
+
+short
+add_short_1 (vector short *p)
+{
+  return vec_extract (*p, 1) + 1;
+}
+
+short
+add_short_2 (vector short *p)
+{
+  return vec_extract (*p, 2) + 1;
+}
+
+short
+add_short_3 (vector short *p)
+{
+  return vec_extract (*p, 3) + 1;
+}
+
+short
+add_short_4 (vector short *p)
+{
+  return vec_extract (*p, 4) + 1;
+}
+
+short
+add_short_5 (vector short *p)
+{
+  return vec_extract (*p, 5) + 1;
+}
+
+short
+add_short_6 (vector short *p)
+{
+  return vec_extract (*p, 6) + 1;
+}
+
+short
+add_short_7 (vector short *p)
+{
+  return vec_extract (*p, 7) + 1;
+}
+
+short
+add_short_n (vector short *p, int n)
+{
+  return vec_extract (*p, n) + 1;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
Index: gcc/testsuite/gcc.target/powerpc/vec-extract-5.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-extract-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-extract-5.c	(revision 0)
@@ -0,0 +1,68 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-O2 -mcpu=power8" } */
+
+#include <altivec.h>
+
+signed char
+add_signed_char_0 (vector signed char *p)
+{
+  return vec_extract (*p, 0) + 1;
+}
+
+signed char
+add_signed_char_1 (vector signed char *p)
+{
+  return vec_extract (*p, 1) + 1;
+}
+
+signed char
+add_signed_char_2 (vector signed char *p)
+{
+  return vec_extract (*p, 2) + 1;
+}
+
+signed char
+add_signed_char_3 (vector signed char *p)
+{
+  return vec_extract (*p, 3) + 1;
+}
+
+signed char
+add_signed_char_4 (vector signed char *p)
+{
+  return vec_extract (*p, 4) + 1;
+}
+
+signed char
+add_signed_char_5 (vector signed char *p)
+{
+  return vec_extract (*p, 5) + 1;
+}
+
+signed char
+add_signed_char_6 (vector signed char *p)
+{
+  return vec_extract (*p, 6) + 1;
+}
+
+signed char
+add_signed_char_7 (vector signed char *p)
+{
+  return vec_extract (*p, 7) + 1;
+}
+
+signed char
+add_signed_char_n (vector signed char *p, int n)
+{
+  return vec_extract (*p, n) + 1;
+}
+
+/* { dg-final { scan-assembler-not "lxvd2x"   } } */
+/* { dg-final { scan-assembler-not "lxvw4x"   } } */
+/* { dg-final { scan-assembler-not "lxvx"     } } */
+/* { dg-final { scan-assembler-not "lxv"      } } */
+/* { dg-final { scan-assembler-not "lvx"      } } */
+/* { dg-final { scan-assembler-not "xxpermdi" } } */


* Re: [PATCH, 4 of 4], Enhance PowerPC vec_extract support for power8/power9 machines
  2016-08-01 22:38 ` [PATCH, 4 " Michael Meissner
@ 2016-08-01 22:55   ` Segher Boessenkool
  0 siblings, 0 replies; 13+ messages in thread
From: Segher Boessenkool @ 2016-08-01 22:55 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, David Edelsohn, Bill Schmidt

On Mon, Aug 01, 2016 at 06:37:42PM -0400, Michael Meissner wrote:
> One further optimization would be to add support for constant element extracts
> if the vector is currently in GPRs rather than vector registers on 64-bit
> systems.  I'm not sure if it would be a win in general, or if it would cause
> the register allocators to generate more moves between the GPR and vector
> register banks.

I don't know if it'll help either; you'll have to try it to make sure.
I don't think it will be terribly important, either way.

One nit:

>  ;; Extraction of a single element in a small integer vector.  None of the small
>  ;; types are currently allowed in a vector register, so we extract to a DImode
>  ;; and either do a direct move or store.
> -(define_insn_and_split  "vsx_extract_<mode>"
> +(define_expand  "vsx_extract_<mode>"
> +  [(parallel [(set (match_operand:<VS_scalar> 0 "nonimmediate_operand" "")
> +		   (vec_select:<VS_scalar>
> +		    (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "")
> +		    (parallel [(match_operand:QI 2 "const_int_operand" "")])))
> +	      (clobber (match_dup 3))])]

Drop the superfluous ""s?  And the predicates are never used either I think?
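For reference, a sketch of the expander with the empty constraint strings dropped (the predicates kept here, pending the question of whether they are needed at all) might look like:

```lisp
;; Same pattern as above, minus the superfluous "" constraints,
;; which define_expand templates do not use.
(define_expand  "vsx_extract_<mode>"
  [(parallel [(set (match_operand:<VS_scalar> 0 "nonimmediate_operand")
		   (vec_select:<VS_scalar>
		    (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand")
		    (parallel [(match_operand:QI 2 "const_int_operand")])))
	      (clobber (match_dup 3))])]
```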

This is okay for trunk.  Thanks,


Segher



Thread overview: 13+ messages
2016-07-27 14:33 [PATCH, 1 of 4 or 5], Enhance PowerPC vec_extract support for power8/power9 machines Michael Meissner
2016-07-27 19:35 ` Segher Boessenkool
2016-07-27 20:06   ` Michael Meissner
2016-07-27 21:16 ` [PATCH, 2 of 4], " Michael Meissner
2016-07-28  9:58   ` Segher Boessenkool
2016-07-28 19:44     ` Michael Meissner
2016-07-29  6:50       ` Segher Boessenkool
2016-07-30 15:29 ` [PATCH, 3 " Michael Meissner
2016-07-30 15:38   ` Michael Meissner
2016-07-30 16:04     ` Segher Boessenkool
2016-07-30 16:01   ` Segher Boessenkool
2016-08-01 22:38 ` [PATCH, 4 " Michael Meissner
2016-08-01 22:55   ` Segher Boessenkool
