RFC/A: Add a targetm.vectorize.related

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* RFC/A: Add a targetm.vectorize.related_mode hook
@ 2019-10-23 11:01 Richard Sandiford
  2019-10-23 11:16 ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2019-10-23 11:01 UTC (permalink / raw)
  To: gcc-patches

This patch is the first of a series that tries to remove two
assumptions:

(1) that all vectors involved in vectorisation must be the same size

(2) that there is only one vector mode for a given element mode and
    number of elements

Relaxing (1) helps with targets that support multiple vector sizes or
that require the number of elements to stay the same.  E.g. if we're
vectorising code that operates on narrow and wide elements, and the
narrow elements use 64-bit vectors, then on AArch64 it would normally
be better to use 128-bit vectors rather than pairs of 64-bit vectors
for the wide elements.

Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
fixed-length code for SVE.  It also allows unpacked/half-size SVE
vectors to work with -msve-vector-bits=256.

The patch adds a new hook that targets can use to control how we
move from one vector mode to another.  The hook takes a starting vector
mode, a new element mode, and (optionally) a new number of elements.
The flexibility needed for (1) comes in when the number of elements
isn't specified.

All callers in this patch specify the number of elements, but a later
vectoriser patch doesn't.  I won't be posting the vectoriser patch
for a few days, hence the RFC/A tag.

Tested individually on aarch64-linux-gnu and as a series on
x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
look OK?

I'll post some follow-up patches too.

Richard


2019-10-23  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* target.def (related_mode): New hook.
	* doc/tm.texi.in (TARGET_VECTORIZE_RELATED_MODE): New hook.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_vectorize_related_mode): Declare.
	* targhooks.c (default_vectorize_related_mode): New function.
	* machmode.h (related_vector_mode): Declare.
	* stor-layout.c (related_vector_mode): New function.
	* expmed.c (extract_bit_field_1): Use it instead of mode_for_vector.
	* optabs-query.c (qimode_for_vec_perm): Likewise.
	* tree-vect-stmts.c (get_group_load_store_type): Likewise.
	(vectorizable_store, vectorizable_load): Likewise

Index: gcc/target.def
===================================================================
--- gcc/target.def	2019-09-30 17:20:57.370607986 +0100
+++ gcc/target.def	2019-10-23 11:33:01.568510253 +0100
@@ -1909,6 +1909,33 @@ for autovectorization.  The default impl
  (vector_sizes *sizes, bool all),
  default_autovectorize_vector_sizes)
 
+DEFHOOK
+(related_mode,
+ "If a piece of code is using vector mode @var{vector_mode} and also wants\n\
+to operate on elements of mode @var{element_mode}, return the vector mode\n\
+it should use for those elements.  If @var{nunits} is nonzero, ensure that\n\
+the mode has exactly @var{nunits} elements, otherwise pick whichever vector\n\
+size pairs the most naturally with @var{vector_mode}.  Return an empty\n\
+@code{opt_machine_mode} if there is no supported vector mode with the\n\
+required properties.\n\
+\n\
+There is no prescribed way of handling the case in which @var{nunits}\n\
+is zero.  One common choice is to pick a vector mode with the same size\n\
+as @var{vector_mode}; this is the natural choice if the target has a\n\
+fixed vector size.  Another option is to choose a vector mode with the\n\
+same number of elements as @var{vector_mode}; this is the natural choice\n\
+if the target has a fixed number of elements.  Alternatively, the hook\n\
+might choose a middle ground, such as trying to keep the number of\n\
+elements as similar as possible while applying maximum and minimum\n\
+vector sizes.\n\
+\n\
+The default implementation uses @code{mode_for_vector} to find the\n\
+requested mode, returning a mode with the same size as @var{vector_mode}\n\
+when @var{nunits} is zero.  This is the correct behavior for most targets.",
+ opt_machine_mode,
+ (machine_mode vector_mode, scalar_mode element_mode, poly_uint64 nunits),
+ default_vectorize_related_mode)
+
 /* Function to get a target mode for a vector mask.  */
 DEFHOOK
 (get_mask_mode,
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	2019-09-30 17:20:57.350608130 +0100
+++ gcc/doc/tm.texi.in	2019-10-23 11:33:01.564510281 +0100
@@ -4181,6 +4181,8 @@ address;  but often a machine-dependent
 
 @hook TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
 
+@hook TARGET_VECTORIZE_RELATED_MODE
+
 @hook TARGET_VECTORIZE_GET_MASK_MODE
 
 @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	2019-09-30 17:20:57.350608130 +0100
+++ gcc/doc/tm.texi	2019-10-23 11:33:01.560510309 +0100
@@ -6029,6 +6029,30 @@ The hook does not need to do anything if
 for autovectorization.  The default implementation does nothing.
 @end deftypefn
 
+@deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits})
+If a piece of code is using vector mode @var{vector_mode} and also wants
+to operate on elements of mode @var{element_mode}, return the vector mode
+it should use for those elements.  If @var{nunits} is nonzero, ensure that
+the mode has exactly @var{nunits} elements, otherwise pick whichever vector
+size pairs the most naturally with @var{vector_mode}.  Return an empty
+@code{opt_machine_mode} if there is no supported vector mode with the
+required properties.
+
+There is no prescribed way of handling the case in which @var{nunits}
+is zero.  One common choice is to pick a vector mode with the same size
+as @var{vector_mode}; this is the natural choice if the target has a
+fixed vector size.  Another option is to choose a vector mode with the
+same number of elements as @var{vector_mode}; this is the natural choice
+if the target has a fixed number of elements.  Alternatively, the hook
+might choose a middle ground, such as trying to keep the number of
+elements as similar as possible while applying maximum and minimum
+vector sizes.
+
+The default implementation uses @code{mode_for_vector} to find the
+requested mode, returning a mode with the same size as @var{vector_mode}
+when @var{nunits} is zero.  This is the correct behavior for most targets.
+@end deftypefn
+
 @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_GET_MASK_MODE (poly_uint64 @var{nunits}, poly_uint64 @var{length})
 A vector mask is a value that holds one boolean result for every element
 in a vector.  This hook returns the machine mode that should be used to
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	2019-09-30 17:19:45.051128625 +0100
+++ gcc/targhooks.h	2019-10-23 11:33:01.568510253 +0100
@@ -114,6 +114,9 @@ default_builtin_support_vector_misalignm
 extern machine_mode default_preferred_simd_mode (scalar_mode mode);
 extern machine_mode default_split_reduction (machine_mode);
 extern void default_autovectorize_vector_sizes (vector_sizes *, bool);
+extern opt_machine_mode default_vectorize_related_mode (machine_mode,
+							scalar_mode,
+							poly_uint64);
 extern opt_machine_mode default_get_mask_mode (poly_uint64, poly_uint64);
 extern bool default_empty_mask_is_expensive (unsigned);
 extern void *default_init_cost (class loop *);
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	2019-10-20 13:58:01.283640189 +0100
+++ gcc/targhooks.c	2019-10-23 11:33:01.568510253 +0100
@@ -1307,6 +1307,25 @@ default_autovectorize_vector_sizes (vect
 {
 }
 
+/* The default implementation of TARGET_VECTORIZE_RELATED_MODE.  */
+
+opt_machine_mode
+default_vectorize_related_mode (machine_mode vector_mode,
+				scalar_mode element_mode,
+				poly_uint64 nunits)
+{
+  machine_mode result_mode;
+  if ((maybe_ne (nunits, 0U)
+       || multiple_p (GET_MODE_SIZE (vector_mode),
+		      GET_MODE_SIZE (element_mode), &nunits))
+      && mode_for_vector (element_mode, nunits).exists (&result_mode)
+      && VECTOR_MODE_P (result_mode)
+      && targetm.vector_mode_supported_p (result_mode))
+    return result_mode;
+
+  return opt_machine_mode ();
+}
+
 /* By default a vector of integers is used as a mask.  */
 
 opt_machine_mode
Index: gcc/machmode.h
===================================================================
--- gcc/machmode.h	2019-10-07 09:38:45.627684403 +0100
+++ gcc/machmode.h	2019-10-23 11:33:01.564510281 +0100
@@ -880,6 +880,8 @@ extern opt_scalar_int_mode int_mode_for_
 extern opt_machine_mode bitwise_mode_for_mode (machine_mode);
 extern opt_machine_mode mode_for_vector (scalar_mode, poly_uint64);
 extern opt_machine_mode mode_for_int_vector (unsigned int, poly_uint64);
+extern opt_machine_mode related_vector_mode (machine_mode, scalar_mode,
+					     poly_uint64 = 0);
 
 /* Return the integer vector equivalent of MODE, if one exists.  In other
    words, return the mode for an integer vector that has the same number
Index: gcc/stor-layout.c
===================================================================
--- gcc/stor-layout.c	2019-09-18 10:43:11.562335822 +0100
+++ gcc/stor-layout.c	2019-10-23 11:33:01.564510281 +0100
@@ -530,6 +530,26 @@ mode_for_int_vector (unsigned int int_bi
   return opt_machine_mode ();
 }
 
+/* If a piece of code is using vector mode VECTOR_MODE and also wants
+   to operate on elements of mode ELEMENT_MODE, return the vector mode
+   it should use for those elements.  If NUNITS is nonzero, ensure that
+   the mode has exactly NUNITS elements, otherwise pick whichever vector
+   size pairs the most naturally with VECTOR_MODE; this may mean choosing
+   a mode with a different size and/or number of elements, depending on
+   what the target prefers.  Return an empty opt_machine_mode if there
+   is no supported vector mode with the required properties.
+
+   Unlike mode_for_vector. any returned mode is guaranteed to satisfy
+   both VECTOR_MODE_P and targetm.vector_mode_supported_p.  */
+
+opt_machine_mode
+related_vector_mode (machine_mode vector_mode, scalar_mode element_mode,
+		     poly_uint64 nunits)
+{
+  gcc_assert (VECTOR_MODE_P (vector_mode));
+  return targetm.vectorize.related_mode (vector_mode, element_mode, nunits);
+}
+
 /* Return the alignment of MODE. This will be bounded by 1 and
    BIGGEST_ALIGNMENT.  */
 
Index: gcc/expmed.c
===================================================================
--- gcc/expmed.c	2019-09-10 17:18:39.992121613 +0100
+++ gcc/expmed.c	2019-10-23 11:33:01.564510281 +0100
@@ -1641,12 +1641,10 @@ extract_bit_field_1 (rtx str_rtx, poly_u
 	  poly_uint64 nunits;
 	  if (!multiple_p (GET_MODE_BITSIZE (GET_MODE (op0)),
 			   GET_MODE_UNIT_BITSIZE (tmode), &nunits)
-	      || !mode_for_vector (inner_mode, nunits).exists (&new_mode)
-	      || !VECTOR_MODE_P (new_mode)
+	      || !related_vector_mode (tmode, inner_mode,
+				       nunits).exists (&new_mode)
 	      || maybe_ne (GET_MODE_SIZE (new_mode),
-			   GET_MODE_SIZE (GET_MODE (op0)))
-	      || GET_MODE_INNER (new_mode) != GET_MODE_INNER (tmode)
-	      || !targetm.vector_mode_supported_p (new_mode))
+			   GET_MODE_SIZE (GET_MODE (op0))))
 	    new_mode = VOIDmode;
 	}
       poly_uint64 pos;
Index: gcc/optabs-query.c
===================================================================
--- gcc/optabs-query.c	2019-07-10 19:41:26.387898094 +0100
+++ gcc/optabs-query.c	2019-10-23 11:33:01.564510281 +0100
@@ -354,11 +354,8 @@ can_conditionally_move_p (machine_mode m
 opt_machine_mode
 qimode_for_vec_perm (machine_mode mode)
 {
-  machine_mode qimode;
-  if (GET_MODE_INNER (mode) != QImode
-      && mode_for_vector (QImode, GET_MODE_SIZE (mode)).exists (&qimode)
-      && VECTOR_MODE_P (qimode))
-    return qimode;
+  if (GET_MODE_INNER (mode) != QImode)
+    return related_vector_mode (mode, QImode, GET_MODE_SIZE (mode));
   return opt_machine_mode ();
 }
 
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2019-10-23 11:30:54.953408377 +0100
+++ gcc/tree-vect-stmts.c	2019-10-23 11:33:01.572510226 +0100
@@ -2276,9 +2276,8 @@ get_group_load_store_type (stmt_vec_info
 		  || alignment_support_scheme == dr_unaligned_supported)
 	      && known_eq (nunits, (group_size - gap) * 2)
 	      && known_eq (nunits, group_size)
-	      && mode_for_vector (elmode, (group_size - gap)).exists (&vmode)
-	      && VECTOR_MODE_P (vmode)
-	      && targetm.vector_mode_supported_p (vmode)
+	      && related_vector_mode (TYPE_MODE (vectype), elmode,
+				      group_size - gap).exists (&vmode)
 	      && (convert_optab_handler (vec_init_optab,
 					 TYPE_MODE (vectype), vmode)
 		  != CODE_FOR_nothing))
@@ -7736,9 +7735,8 @@ vectorizable_store (stmt_vec_info stmt_i
 		 of vector elts directly.  */
 	      scalar_mode elmode = SCALAR_TYPE_MODE (elem_type);
 	      machine_mode vmode;
-	      if (!mode_for_vector (elmode, group_size).exists (&vmode)
-		  || !VECTOR_MODE_P (vmode)
-		  || !targetm.vector_mode_supported_p (vmode)
+	      if (!related_vector_mode (TYPE_MODE (vectype), elmode,
+					group_size).exists (&vmode)
 		  || (convert_optab_handler (vec_extract_optab,
 					     TYPE_MODE (vectype), vmode)
 		      == CODE_FOR_nothing))
@@ -7755,9 +7753,8 @@ vectorizable_store (stmt_vec_info stmt_i
 		     element extracts from the original vector type and
 		     element size stores.  */
 		  if (int_mode_for_size (lsize, 0).exists (&elmode)
-		      && mode_for_vector (elmode, lnunits).exists (&vmode)
-		      && VECTOR_MODE_P (vmode)
-		      && targetm.vector_mode_supported_p (vmode)
+		      && related_vector_mode (TYPE_MODE (vectype), elmode,
+					      lnunits).exists (&vmode)
 		      && (convert_optab_handler (vec_extract_optab,
 						 vmode, elmode)
 			  != CODE_FOR_nothing))
@@ -8838,9 +8835,8 @@ vectorizable_load (stmt_vec_info stmt_in
 		 vector elts directly.  */
 	      scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vectype));
 	      machine_mode vmode;
-	      if (mode_for_vector (elmode, group_size).exists (&vmode)
-		  && VECTOR_MODE_P (vmode)
-		  && targetm.vector_mode_supported_p (vmode)
+	      if (related_vector_mode (TYPE_MODE (vectype), elmode,
+				       group_size).exists (&vmode)
 		  && (convert_optab_handler (vec_init_optab,
 					     TYPE_MODE (vectype), vmode)
 		      != CODE_FOR_nothing))
@@ -8864,9 +8860,8 @@ vectorizable_load (stmt_vec_info stmt_in
 		  /* If we can't construct such a vector fall back to
 		     element loads of the original vector type.  */
 		  if (int_mode_for_size (lsize, 0).exists (&elmode)
-		      && mode_for_vector (elmode, lnunits).exists (&vmode)
-		      && VECTOR_MODE_P (vmode)
-		      && targetm.vector_mode_supported_p (vmode)
+		      && related_vector_mode (TYPE_MODE (vectype), elmode,
+					      lnunits).exists (&vmode)
 		      && (convert_optab_handler (vec_init_optab, vmode, elmode)
 			  != CODE_FOR_nothing))
 		    {

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 11:01 RFC/A: Add a targetm.vectorize.related_mode hook Richard Sandiford
@ 2019-10-23 11:16 ` Richard Biener
  2019-10-23 12:00   ` Richard Sandiford
  2019-10-23 21:42   ` Jim Wilson
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Biener @ 2019-10-23 11:16 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: GCC Patches

On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch is the first of a series that tries to remove two
> assumptions:
>
> (1) that all vectors involved in vectorisation must be the same size
>
> (2) that there is only one vector mode for a given element mode and
>     number of elements
>
> Relaxing (1) helps with targets that support multiple vector sizes or
> that require the number of elements to stay the same.  E.g. if we're
> vectorising code that operates on narrow and wide elements, and the
> narrow elements use 64-bit vectors, then on AArch64 it would normally
> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> for the wide elements.
>
> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> vectors to work with -msve-vector-bits=256.
>
> The patch adds a new hook that targets can use to control how we
> move from one vector mode to another.  The hook takes a starting vector
> mode, a new element mode, and (optionally) a new number of elements.
> The flexibility needed for (1) comes in when the number of elements
> isn't specified.
>
> All callers in this patch specify the number of elements, but a later
> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> for a few days, hence the RFC/A tag.
>
> Tested individually on aarch64-linux-gnu and as a series on
> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> look OK?

In isolation the idea looks good but maybe a bit limited?  I see
how it works for the same-size case but if you consider x86
where we have SSE, AVX256 and AVX512 what would it return
for related_vector_mode (V4SImode, SImode, 0)?  Or is this
kind of query not intended (where the component modes match
but nunits is zero)?  How do you get from SVE fixed 128bit
to NEON fixed 128bit then?  Or is it just used to stay in the
same register set for different component modes?

As said, it looks good but I'd like to see the followups.

Note I delayed thinking about relaxing the single-vector-size
constraint in the vectorizer until after we're SLP only because
that looked more easily done there.  I also remember patches
relaxing this a bit from RISCV folks.

Thanks,
Richard.

> I'll post some follow-up patches too.
>
> Richard
>
>
> 2019-10-23  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * target.def (related_mode): New hook.
>         * doc/tm.texi.in (TARGET_VECTORIZE_RELATED_MODE): New hook.
>         * doc/tm.texi: Regenerate.
>         * targhooks.h (default_vectorize_related_mode): Declare.
>         * targhooks.c (default_vectorize_related_mode): New function.
>         * machmode.h (related_vector_mode): Declare.
>         * stor-layout.c (related_vector_mode): New function.
>         * expmed.c (extract_bit_field_1): Use it instead of mode_for_vector.
>         * optabs-query.c (qimode_for_vec_perm): Likewise.
>         * tree-vect-stmts.c (get_group_load_store_type): Likewise.
>         (vectorizable_store, vectorizable_load): Likewise
>
> Index: gcc/target.def
> ===================================================================
> --- gcc/target.def      2019-09-30 17:20:57.370607986 +0100
> +++ gcc/target.def      2019-10-23 11:33:01.568510253 +0100
> @@ -1909,6 +1909,33 @@ for autovectorization.  The default impl
>   (vector_sizes *sizes, bool all),
>   default_autovectorize_vector_sizes)
>
> +DEFHOOK
> +(related_mode,
> + "If a piece of code is using vector mode @var{vector_mode} and also wants\n\
> +to operate on elements of mode @var{element_mode}, return the vector mode\n\
> +it should use for those elements.  If @var{nunits} is nonzero, ensure that\n\
> +the mode has exactly @var{nunits} elements, otherwise pick whichever vector\n\
> +size pairs the most naturally with @var{vector_mode}.  Return an empty\n\
> +@code{opt_machine_mode} if there is no supported vector mode with the\n\
> +required properties.\n\
> +\n\
> +There is no prescribed way of handling the case in which @var{nunits}\n\
> +is zero.  One common choice is to pick a vector mode with the same size\n\
> +as @var{vector_mode}; this is the natural choice if the target has a\n\
> +fixed vector size.  Another option is to choose a vector mode with the\n\
> +same number of elements as @var{vector_mode}; this is the natural choice\n\
> +if the target has a fixed number of elements.  Alternatively, the hook\n\
> +might choose a middle ground, such as trying to keep the number of\n\
> +elements as similar as possible while applying maximum and minimum\n\
> +vector sizes.\n\
> +\n\
> +The default implementation uses @code{mode_for_vector} to find the\n\
> +requested mode, returning a mode with the same size as @var{vector_mode}\n\
> +when @var{nunits} is zero.  This is the correct behavior for most targets.",
> + opt_machine_mode,
> + (machine_mode vector_mode, scalar_mode element_mode, poly_uint64 nunits),
> + default_vectorize_related_mode)
> +
>  /* Function to get a target mode for a vector mask.  */
>  DEFHOOK
>  (get_mask_mode,
> Index: gcc/doc/tm.texi.in
> ===================================================================
> --- gcc/doc/tm.texi.in  2019-09-30 17:20:57.350608130 +0100
> +++ gcc/doc/tm.texi.in  2019-10-23 11:33:01.564510281 +0100
> @@ -4181,6 +4181,8 @@ address;  but often a machine-dependent
>
>  @hook TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
>
> +@hook TARGET_VECTORIZE_RELATED_MODE
> +
>  @hook TARGET_VECTORIZE_GET_MASK_MODE
>
>  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi     2019-09-30 17:20:57.350608130 +0100
> +++ gcc/doc/tm.texi     2019-10-23 11:33:01.560510309 +0100
> @@ -6029,6 +6029,30 @@ The hook does not need to do anything if
>  for autovectorization.  The default implementation does nothing.
>  @end deftypefn
>
> +@deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits})
> +If a piece of code is using vector mode @var{vector_mode} and also wants
> +to operate on elements of mode @var{element_mode}, return the vector mode
> +it should use for those elements.  If @var{nunits} is nonzero, ensure that
> +the mode has exactly @var{nunits} elements, otherwise pick whichever vector
> +size pairs the most naturally with @var{vector_mode}.  Return an empty
> +@code{opt_machine_mode} if there is no supported vector mode with the
> +required properties.
> +
> +There is no prescribed way of handling the case in which @var{nunits}
> +is zero.  One common choice is to pick a vector mode with the same size
> +as @var{vector_mode}; this is the natural choice if the target has a
> +fixed vector size.  Another option is to choose a vector mode with the
> +same number of elements as @var{vector_mode}; this is the natural choice
> +if the target has a fixed number of elements.  Alternatively, the hook
> +might choose a middle ground, such as trying to keep the number of
> +elements as similar as possible while applying maximum and minimum
> +vector sizes.
> +
> +The default implementation uses @code{mode_for_vector} to find the
> +requested mode, returning a mode with the same size as @var{vector_mode}
> +when @var{nunits} is zero.  This is the correct behavior for most targets.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_GET_MASK_MODE (poly_uint64 @var{nunits}, poly_uint64 @var{length})
>  A vector mask is a value that holds one boolean result for every element
>  in a vector.  This hook returns the machine mode that should be used to
> Index: gcc/targhooks.h
> ===================================================================
> --- gcc/targhooks.h     2019-09-30 17:19:45.051128625 +0100
> +++ gcc/targhooks.h     2019-10-23 11:33:01.568510253 +0100
> @@ -114,6 +114,9 @@ default_builtin_support_vector_misalignm
>  extern machine_mode default_preferred_simd_mode (scalar_mode mode);
>  extern machine_mode default_split_reduction (machine_mode);
>  extern void default_autovectorize_vector_sizes (vector_sizes *, bool);
> +extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> +                                                       scalar_mode,
> +                                                       poly_uint64);
>  extern opt_machine_mode default_get_mask_mode (poly_uint64, poly_uint64);
>  extern bool default_empty_mask_is_expensive (unsigned);
>  extern void *default_init_cost (class loop *);
> Index: gcc/targhooks.c
> ===================================================================
> --- gcc/targhooks.c     2019-10-20 13:58:01.283640189 +0100
> +++ gcc/targhooks.c     2019-10-23 11:33:01.568510253 +0100
> @@ -1307,6 +1307,25 @@ default_autovectorize_vector_sizes (vect
>  {
>  }
>
> +/* The default implementation of TARGET_VECTORIZE_RELATED_MODE.  */
> +
> +opt_machine_mode
> +default_vectorize_related_mode (machine_mode vector_mode,
> +                               scalar_mode element_mode,
> +                               poly_uint64 nunits)
> +{
> +  machine_mode result_mode;
> +  if ((maybe_ne (nunits, 0U)
> +       || multiple_p (GET_MODE_SIZE (vector_mode),
> +                     GET_MODE_SIZE (element_mode), &nunits))
> +      && mode_for_vector (element_mode, nunits).exists (&result_mode)
> +      && VECTOR_MODE_P (result_mode)
> +      && targetm.vector_mode_supported_p (result_mode))
> +    return result_mode;
> +
> +  return opt_machine_mode ();
> +}
> +
>  /* By default a vector of integers is used as a mask.  */
>
>  opt_machine_mode
> Index: gcc/machmode.h
> ===================================================================
> --- gcc/machmode.h      2019-10-07 09:38:45.627684403 +0100
> +++ gcc/machmode.h      2019-10-23 11:33:01.564510281 +0100
> @@ -880,6 +880,8 @@ extern opt_scalar_int_mode int_mode_for_
>  extern opt_machine_mode bitwise_mode_for_mode (machine_mode);
>  extern opt_machine_mode mode_for_vector (scalar_mode, poly_uint64);
>  extern opt_machine_mode mode_for_int_vector (unsigned int, poly_uint64);
> +extern opt_machine_mode related_vector_mode (machine_mode, scalar_mode,
> +                                            poly_uint64 = 0);
>
>  /* Return the integer vector equivalent of MODE, if one exists.  In other
>     words, return the mode for an integer vector that has the same number
> Index: gcc/stor-layout.c
> ===================================================================
> --- gcc/stor-layout.c   2019-09-18 10:43:11.562335822 +0100
> +++ gcc/stor-layout.c   2019-10-23 11:33:01.564510281 +0100
> @@ -530,6 +530,26 @@ mode_for_int_vector (unsigned int int_bi
>    return opt_machine_mode ();
>  }
>
> +/* If a piece of code is using vector mode VECTOR_MODE and also wants
> +   to operate on elements of mode ELEMENT_MODE, return the vector mode
> +   it should use for those elements.  If NUNITS is nonzero, ensure that
> +   the mode has exactly NUNITS elements, otherwise pick whichever vector
> +   size pairs the most naturally with VECTOR_MODE; this may mean choosing
> +   a mode with a different size and/or number of elements, depending on
> +   what the target prefers.  Return an empty opt_machine_mode if there
> +   is no supported vector mode with the required properties.
> +
> +   Unlike mode_for_vector. any returned mode is guaranteed to satisfy
> +   both VECTOR_MODE_P and targetm.vector_mode_supported_p.  */
> +
> +opt_machine_mode
> +related_vector_mode (machine_mode vector_mode, scalar_mode element_mode,
> +                    poly_uint64 nunits)
> +{
> +  gcc_assert (VECTOR_MODE_P (vector_mode));
> +  return targetm.vectorize.related_mode (vector_mode, element_mode, nunits);
> +}
> +
>  /* Return the alignment of MODE. This will be bounded by 1 and
>     BIGGEST_ALIGNMENT.  */
>
> Index: gcc/expmed.c
> ===================================================================
> --- gcc/expmed.c        2019-09-10 17:18:39.992121613 +0100
> +++ gcc/expmed.c        2019-10-23 11:33:01.564510281 +0100
> @@ -1641,12 +1641,10 @@ extract_bit_field_1 (rtx str_rtx, poly_u
>           poly_uint64 nunits;
>           if (!multiple_p (GET_MODE_BITSIZE (GET_MODE (op0)),
>                            GET_MODE_UNIT_BITSIZE (tmode), &nunits)
> -             || !mode_for_vector (inner_mode, nunits).exists (&new_mode)
> -             || !VECTOR_MODE_P (new_mode)
> +             || !related_vector_mode (tmode, inner_mode,
> +                                      nunits).exists (&new_mode)
>               || maybe_ne (GET_MODE_SIZE (new_mode),
> -                          GET_MODE_SIZE (GET_MODE (op0)))
> -             || GET_MODE_INNER (new_mode) != GET_MODE_INNER (tmode)
> -             || !targetm.vector_mode_supported_p (new_mode))
> +                          GET_MODE_SIZE (GET_MODE (op0))))
>             new_mode = VOIDmode;
>         }
>        poly_uint64 pos;
> Index: gcc/optabs-query.c
> ===================================================================
> --- gcc/optabs-query.c  2019-07-10 19:41:26.387898094 +0100
> +++ gcc/optabs-query.c  2019-10-23 11:33:01.564510281 +0100
> @@ -354,11 +354,8 @@ can_conditionally_move_p (machine_mode m
>  opt_machine_mode
>  qimode_for_vec_perm (machine_mode mode)
>  {
> -  machine_mode qimode;
> -  if (GET_MODE_INNER (mode) != QImode
> -      && mode_for_vector (QImode, GET_MODE_SIZE (mode)).exists (&qimode)
> -      && VECTOR_MODE_P (qimode))
> -    return qimode;
> +  if (GET_MODE_INNER (mode) != QImode)
> +    return related_vector_mode (mode, QImode, GET_MODE_SIZE (mode));
>    return opt_machine_mode ();
>  }
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       2019-10-23 11:30:54.953408377 +0100
> +++ gcc/tree-vect-stmts.c       2019-10-23 11:33:01.572510226 +0100
> @@ -2276,9 +2276,8 @@ get_group_load_store_type (stmt_vec_info
>                   || alignment_support_scheme == dr_unaligned_supported)
>               && known_eq (nunits, (group_size - gap) * 2)
>               && known_eq (nunits, group_size)
> -             && mode_for_vector (elmode, (group_size - gap)).exists (&vmode)
> -             && VECTOR_MODE_P (vmode)
> -             && targetm.vector_mode_supported_p (vmode)
> +             && related_vector_mode (TYPE_MODE (vectype), elmode,
> +                                     group_size - gap).exists (&vmode)
>               && (convert_optab_handler (vec_init_optab,
>                                          TYPE_MODE (vectype), vmode)
>                   != CODE_FOR_nothing))
> @@ -7736,9 +7735,8 @@ vectorizable_store (stmt_vec_info stmt_i
>                  of vector elts directly.  */
>               scalar_mode elmode = SCALAR_TYPE_MODE (elem_type);
>               machine_mode vmode;
> -             if (!mode_for_vector (elmode, group_size).exists (&vmode)
> -                 || !VECTOR_MODE_P (vmode)
> -                 || !targetm.vector_mode_supported_p (vmode)
> +             if (!related_vector_mode (TYPE_MODE (vectype), elmode,
> +                                       group_size).exists (&vmode)
>                   || (convert_optab_handler (vec_extract_optab,
>                                              TYPE_MODE (vectype), vmode)
>                       == CODE_FOR_nothing))
> @@ -7755,9 +7753,8 @@ vectorizable_store (stmt_vec_info stmt_i
>                      element extracts from the original vector type and
>                      element size stores.  */
>                   if (int_mode_for_size (lsize, 0).exists (&elmode)
> -                     && mode_for_vector (elmode, lnunits).exists (&vmode)
> -                     && VECTOR_MODE_P (vmode)
> -                     && targetm.vector_mode_supported_p (vmode)
> +                     && related_vector_mode (TYPE_MODE (vectype), elmode,
> +                                             lnunits).exists (&vmode)
>                       && (convert_optab_handler (vec_extract_optab,
>                                                  vmode, elmode)
>                           != CODE_FOR_nothing))
> @@ -8838,9 +8835,8 @@ vectorizable_load (stmt_vec_info stmt_in
>                  vector elts directly.  */
>               scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vectype));
>               machine_mode vmode;
> -             if (mode_for_vector (elmode, group_size).exists (&vmode)
> -                 && VECTOR_MODE_P (vmode)
> -                 && targetm.vector_mode_supported_p (vmode)
> +             if (related_vector_mode (TYPE_MODE (vectype), elmode,
> +                                      group_size).exists (&vmode)
>                   && (convert_optab_handler (vec_init_optab,
>                                              TYPE_MODE (vectype), vmode)
>                       != CODE_FOR_nothing))
> @@ -8864,9 +8860,8 @@ vectorizable_load (stmt_vec_info stmt_in
>                   /* If we can't construct such a vector fall back to
>                      element loads of the original vector type.  */
>                   if (int_mode_for_size (lsize, 0).exists (&elmode)
> -                     && mode_for_vector (elmode, lnunits).exists (&vmode)
> -                     && VECTOR_MODE_P (vmode)
> -                     && targetm.vector_mode_supported_p (vmode)
> +                     && related_vector_mode (TYPE_MODE (vectype), elmode,
> +                                             lnunits).exists (&vmode)
>                       && (convert_optab_handler (vec_init_optab, vmode, elmode)
>                           != CODE_FOR_nothing))
>                     {

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 11:16 ` Richard Biener
@ 2019-10-23 12:00   ` Richard Sandiford
  2019-10-23 12:07     ` Richard Biener
  2019-10-23 22:35     ` H.J. Lu
  2019-10-23 21:42   ` Jim Wilson
  1 sibling, 2 replies; 13+ messages in thread
From: Richard Sandiford @ 2019-10-23 12:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> This patch is the first of a series that tries to remove two
>> assumptions:
>>
>> (1) that all vectors involved in vectorisation must be the same size
>>
>> (2) that there is only one vector mode for a given element mode and
>>     number of elements
>>
>> Relaxing (1) helps with targets that support multiple vector sizes or
>> that require the number of elements to stay the same.  E.g. if we're
>> vectorising code that operates on narrow and wide elements, and the
>> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> for the wide elements.
>>
>> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
>> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>> vectors to work with -msve-vector-bits=256.
>>
>> The patch adds a new hook that targets can use to control how we
>> move from one vector mode to another.  The hook takes a starting vector
>> mode, a new element mode, and (optionally) a new number of elements.
>> The flexibility needed for (1) comes in when the number of elements
>> isn't specified.
>>
>> All callers in this patch specify the number of elements, but a later
>> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>> for a few days, hence the RFC/A tag.
>>
>> Tested individually on aarch64-linux-gnu and as a series on
>> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>> look OK?
>
> In isolation the idea looks good but maybe a bit limited?  I see
> how it works for the same-size case but if you consider x86
> where we have SSE, AVX256 and AVX512 what would it return
> for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> kind of query not intended (where the component modes match
> but nunits is zero)?

In that case we'd normally get V4SImode back.  It's an allowed
combination, but not very useful.

> How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> it just used to stay in the same register set for different component
> modes?

Yeah, the idea is to use the original vector mode as essentially
a base architecture.

The follow-on patches replace vec_info::vector_size with
vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
with targetm.vectorize.autovectorize_vector_modes.  These are the
starting modes that would be passed to the hook in the nunits==0 case.

E.g. for Advanced SIMD on AArch64, it would make more sense for
related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
I think things would work in a similar way for the x86_64 vector archs.

For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
Advanced SIMD mode) to autovectorize_vector_modes, even though they
happen to be the same size for 128-bit SVE.  We can then compare
128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
that we consistently use all-SVE modes or all-Advanced SIMD modes
for each attempt.

The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:

- VNx16QImode (full vector)
- VNx8QImode (half vector)
- VNx4QImode (quarter vector)
- VNx2QImode (eighth vector)

and then pick the one with the lowest cost.  related_mode would
keep the number of units the same for nunits==0, within the limit
of the vector size.  E.g.:

- related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
- related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
- related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
- related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)

and:

- related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
- related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
- related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
- related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)

So when operating on multiple element sizes, the tradeoff is between
trying to make full use of the vector size (higher base nunits) vs.
trying to remove packs and unpacks between multiple vector copies
(lower base nunits).  The latter is useful because extending within
a vector is an in-lane rather than cross-lane operation and truncating
within a vector is a no-op.

With a couple of tweaks, we seem to do a good job of guessing which
version has the lowest cost, at least for the simple cases I've tried
so far.

Obviously there's going to be a bit of a compile-time cost
for SVE targets, but I think it's worth paying for.

> As said, it looks good but I'd like to see the followups.
>
> Note I delayed thinking about relaxing the single-vector-size
> constraint in the vectorizer until after we're SLP only because
> that looked more easily done there.  I also remember patches
> relaxing this a bit from RISCV folks.

That side seemed easier than I'd expected TBH, at least after the
mostly mechanical changes above.  The main missing thing was support
for extension and truncation between integer vector modes with the same
number of elements but different element sizes.  But that's really just
a natural extension of the scalar extend and truncate optabs.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 12:00   ` Richard Sandiford
@ 2019-10-23 12:07     ` Richard Biener
  2019-10-23 12:29       ` Richard Sandiford
  2019-10-23 22:35     ` H.J. Lu
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Biener @ 2019-10-23 12:07 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: GCC Patches

On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> This patch is the first of a series that tries to remove two
> >> assumptions:
> >>
> >> (1) that all vectors involved in vectorisation must be the same size
> >>
> >> (2) that there is only one vector mode for a given element mode and
> >>     number of elements
> >>
> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> that require the number of elements to stay the same.  E.g. if we're
> >> vectorising code that operates on narrow and wide elements, and the
> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> for the wide elements.
> >>
> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> vectors to work with -msve-vector-bits=256.
> >>
> >> The patch adds a new hook that targets can use to control how we
> >> move from one vector mode to another.  The hook takes a starting vector
> >> mode, a new element mode, and (optionally) a new number of elements.
> >> The flexibility needed for (1) comes in when the number of elements
> >> isn't specified.
> >>
> >> All callers in this patch specify the number of elements, but a later
> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> for a few days, hence the RFC/A tag.
> >>
> >> Tested individually on aarch64-linux-gnu and as a series on
> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> look OK?
> >
> > In isolation the idea looks good but maybe a bit limited?  I see
> > how it works for the same-size case but if you consider x86
> > where we have SSE, AVX256 and AVX512 what would it return
> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> > kind of query not intended (where the component modes match
> > but nunits is zero)?
>
> In that case we'd normally get V4SImode back.  It's an allowed
> combination, but not very useful.
>
> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> > it just used to stay in the same register set for different component
> > modes?
>
> Yeah, the idea is to use the original vector mode as essentially
> a base architecture.
>
> The follow-on patches replace vec_info::vector_size with
> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> with targetm.vectorize.autovectorize_vector_modes.  These are the
> starting modes that would be passed to the hook in the nunits==0 case.
>
> E.g. for Advanced SIMD on AArch64, it would make more sense for
> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
> I think things would work in a similar way for the x86_64 vector archs.
>
> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
> Advanced SIMD mode) to autovectorize_vector_modes, even though they
> happen to be the same size for 128-bit SVE.  We can then compare
> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
> that we consistently use all-SVE modes or all-Advanced SIMD modes
> for each attempt.
>
> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>
> - VNx16QImode (full vector)
> - VNx8QImode (half vector)
> - VNx4QImode (quarter vector)
> - VNx2QImode (eighth vector)
>
> and then pick the one with the lowest cost.  related_mode would
> keep the number of units the same for nunits==0, within the limit
> of the vector size.  E.g.:
>
> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>
> and:
>
> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>
> So when operating on multiple element sizes, the tradeoff is between
> trying to make full use of the vector size (higher base nunits) vs.
> trying to remove packs and unpacks between multiple vector copies
> (lower base nunits).  The latter is useful because extending within
> a vector is an in-lane rather than cross-lane operation and truncating
> within a vector is a no-op.
>
> With a couple of tweaks, we seem to do a good job of guessing which
> version has the lowest cost, at least for the simple cases I've tried
> so far.
>
> Obviously there's going to be a bit of a compile-time cost
> for SVE targets, but I think it's worth paying for.

I would guess that immediate benefit could be seen with
basic-block vectorization which simply fails when conversions
are involved.  x86_64 should now always support V4SImode
and V2SImode so eventually a testcase can be crafted for that
as well.

> > As said, it looks good but I'd like to see the followups.
> >
> > Note I delayed thinking about relaxing the single-vector-size
> > constraint in the vectorizer until after we're SLP only because
> > that looked more easily done there.  I also remember patches
> > relaxing this a bit from RISCV folks.
>
> That side seemed easier than I'd expected TBH, at least after the
> mostly mechanical changes above.  The main missing thing was support
> for extension and truncation between integer vector modes with the same
> number of elements but different element sizes.  But that's really just
> a natural extension of the scalar extend and truncate optabs.

Ah, I see - that's good to hear.

Richard.

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 12:07     ` Richard Biener
@ 2019-10-23 12:29       ` Richard Sandiford
  2019-10-25  7:21         ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2019-10-23 12:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> This patch is the first of a series that tries to remove two
>> >> assumptions:
>> >>
>> >> (1) that all vectors involved in vectorisation must be the same size
>> >>
>> >> (2) that there is only one vector mode for a given element mode and
>> >>     number of elements
>> >>
>> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> that require the number of elements to stay the same.  E.g. if we're
>> >> vectorising code that operates on narrow and wide elements, and the
>> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> for the wide elements.
>> >>
>> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
>> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>> >> vectors to work with -msve-vector-bits=256.
>> >>
>> >> The patch adds a new hook that targets can use to control how we
>> >> move from one vector mode to another.  The hook takes a starting vector
>> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> The flexibility needed for (1) comes in when the number of elements
>> >> isn't specified.
>> >>
>> >> All callers in this patch specify the number of elements, but a later
>> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>> >> for a few days, hence the RFC/A tag.
>> >>
>> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>> >> look OK?
>> >
>> > In isolation the idea looks good but maybe a bit limited?  I see
>> > how it works for the same-size case but if you consider x86
>> > where we have SSE, AVX256 and AVX512 what would it return
>> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
>> > kind of query not intended (where the component modes match
>> > but nunits is zero)?
>>
>> In that case we'd normally get V4SImode back.  It's an allowed
>> combination, but not very useful.
>>
>> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
>> > it just used to stay in the same register set for different component
>> > modes?
>>
>> Yeah, the idea is to use the original vector mode as essentially
>> a base architecture.
>>
>> The follow-on patches replace vec_info::vector_size with
>> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> with targetm.vectorize.autovectorize_vector_modes.  These are the
>> starting modes that would be passed to the hook in the nunits==0 case.
>>
>> E.g. for Advanced SIMD on AArch64, it would make more sense for
>> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>> I think things would work in a similar way for the x86_64 vector archs.
>>
>> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>> happen to be the same size for 128-bit SVE.  We can then compare
>> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>> that we consistently use all-SVE modes or all-Advanced SIMD modes
>> for each attempt.
>>
>> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>>
>> - VNx16QImode (full vector)
>> - VNx8QImode (half vector)
>> - VNx4QImode (quarter vector)
>> - VNx2QImode (eighth vector)
>>
>> and then pick the one with the lowest cost.  related_mode would
>> keep the number of units the same for nunits==0, within the limit
>> of the vector size.  E.g.:
>>
>> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>>
>> and:
>>
>> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>>
>> So when operating on multiple element sizes, the tradeoff is between
>> trying to make full use of the vector size (higher base nunits) vs.
>> trying to remove packs and unpacks between multiple vector copies
>> (lower base nunits).  The latter is useful because extending within
>> a vector is an in-lane rather than cross-lane operation and truncating
>> within a vector is a no-op.
>>
>> With a couple of tweaks, we seem to do a good job of guessing which
>> version has the lowest cost, at least for the simple cases I've tried
>> so far.
>>
>> Obviously there's going to be a bit of a compile-time cost
>> for SVE targets, but I think it's worth paying for.
>
> I would guess that immediate benefit could be seen with
> basic-block vectorization which simply fails when conversions
> are involved.  x86_64 should now always support V4SImode
> and V2SImode so eventually a testcase can be crafted for that
> as well.

I'd hoped so too, but the problem is that if the cost saving is good
enough, BB vectorisation simply stops at the conversion frontiers and
vectorises the rest, rather than considering other vector mode
combinations that might be able to do more.

One way of fixing that would be to try all vector modes and pick the
one that works best (as the WIP patches do for loop vectorisation).
But a better way might be to allow the vectorisation as now, but
then try to revectorise the new region with other vector modes.
I was hoping Joel's (unrelated) constructor patch might help,
since we could recognise constructors built from conversions
and try to revectorise from there.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 11:16 ` Richard Biener
  2019-10-23 12:00   ` Richard Sandiford
@ 2019-10-23 21:42   ` Jim Wilson
  1 sibling, 0 replies; 13+ messages in thread
From: Jim Wilson @ 2019-10-23 21:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, GCC Patches

On Wed, Oct 23, 2019 at 4:16 AM Richard Biener
<richard.guenther@gmail.com> wrote:
> Note I delayed thinking about relaxing the single-vector-size
> constraint in the vectorizer until after we're SLP only because
> that looked more easily done there.  I also remember patches
> relaxing this a bit from RISCV folks.

Probably not from us RISC-V folks.  I don't know of anyone looking at
RISC-V vector support in gcc other than me, and I've only had time for
some initial exploration.

Jim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 12:00   ` Richard Sandiford
  2019-10-23 12:07     ` Richard Biener
@ 2019-10-23 22:35     ` H.J. Lu
  2019-10-24  8:14       ` Richard Sandiford
  1 sibling, 1 reply; 13+ messages in thread
From: H.J. Lu @ 2019-10-23 22:35 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, GCC Patches

On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> This patch is the first of a series that tries to remove two
> >> assumptions:
> >>
> >> (1) that all vectors involved in vectorisation must be the same size
> >>
> >> (2) that there is only one vector mode for a given element mode and
> >>     number of elements
> >>
> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> that require the number of elements to stay the same.  E.g. if we're
> >> vectorising code that operates on narrow and wide elements, and the
> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> for the wide elements.
> >>
> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> vectors to work with -msve-vector-bits=256.
> >>
> >> The patch adds a new hook that targets can use to control how we
> >> move from one vector mode to another.  The hook takes a starting vector
> >> mode, a new element mode, and (optionally) a new number of elements.
> >> The flexibility needed for (1) comes in when the number of elements
> >> isn't specified.
> >>
> >> All callers in this patch specify the number of elements, but a later
> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> for a few days, hence the RFC/A tag.
> >>
> >> Tested individually on aarch64-linux-gnu and as a series on
> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> look OK?
> >
> > In isolation the idea looks good but maybe a bit limited?  I see
> > how it works for the same-size case but if you consider x86
> > where we have SSE, AVX256 and AVX512 what would it return
> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> > kind of query not intended (where the component modes match
> > but nunits is zero)?
>
> In that case we'd normally get V4SImode back.  It's an allowed
> combination, but not very useful.
>
> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> > it just used to stay in the same register set for different component
> > modes?
>
> Yeah, the idea is to use the original vector mode as essentially
> a base architecture.
>
> The follow-on patches replace vec_info::vector_size with
> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> with targetm.vectorize.autovectorize_vector_modes.  These are the
> starting modes that would be passed to the hook in the nunits==0 case.
>

For a target with different vector sizes,
targetm.vectorize.autovectorize_vector_sizes
doesn't return the optimal vector sizes for known trip count and
unknown trip count.
For a target with 128-bit and 256-bit vectors, 256-bit followed by
128-bit works well for
known trip count since vectorizer knows the maximum usable vector size.  But for
unknown trip count, we may want to use 128-bit vector when 256-bit
code path won't
be used at run-time, but 128-bit vector will.  At the moment, we can
only use one
set of vector sizes for both known trip count and unknown trip count.
  Can vectorizer
support 2 sets of vector sizes, one for known trip count and the other
for unknown
trip count?

H.J.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 22:35     ` H.J. Lu
@ 2019-10-24  8:14       ` Richard Sandiford
  2019-10-24 15:18         ` H.J. Lu
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2019-10-24  8:14 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Biener, GCC Patches

"H.J. Lu" <hjl.tools@gmail.com> writes:
> On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> This patch is the first of a series that tries to remove two
>> >> assumptions:
>> >>
>> >> (1) that all vectors involved in vectorisation must be the same size
>> >>
>> >> (2) that there is only one vector mode for a given element mode and
>> >>     number of elements
>> >>
>> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> that require the number of elements to stay the same.  E.g. if we're
>> >> vectorising code that operates on narrow and wide elements, and the
>> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> for the wide elements.
>> >>
>> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
>> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>> >> vectors to work with -msve-vector-bits=256.
>> >>
>> >> The patch adds a new hook that targets can use to control how we
>> >> move from one vector mode to another.  The hook takes a starting vector
>> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> The flexibility needed for (1) comes in when the number of elements
>> >> isn't specified.
>> >>
>> >> All callers in this patch specify the number of elements, but a later
>> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>> >> for a few days, hence the RFC/A tag.
>> >>
>> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>> >> look OK?
>> >
>> > In isolation the idea looks good but maybe a bit limited?  I see
>> > how it works for the same-size case but if you consider x86
>> > where we have SSE, AVX256 and AVX512 what would it return
>> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
>> > kind of query not intended (where the component modes match
>> > but nunits is zero)?
>>
>> In that case we'd normally get V4SImode back.  It's an allowed
>> combination, but not very useful.
>>
>> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
>> > it just used to stay in the same register set for different component
>> > modes?
>>
>> Yeah, the idea is to use the original vector mode as essentially
>> a base architecture.
>>
>> The follow-on patches replace vec_info::vector_size with
>> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> with targetm.vectorize.autovectorize_vector_modes.  These are the
>> starting modes that would be passed to the hook in the nunits==0 case.
>>
>
> For a target with different vector sizes,
> targetm.vectorize.autovectorize_vector_sizes
> doesn't return the optimal vector sizes for known trip count and
> unknown trip count.
> For a target with 128-bit and 256-bit vectors, 256-bit followed by
> 128-bit works well for
> known trip count since vectorizer knows the maximum usable vector size.  But for
> unknown trip count, we may want to use 128-bit vector when 256-bit
> code path won't
> be used at run-time, but 128-bit vector will.  At the moment, we can
> only use one
> set of vector sizes for both known trip count and unknown trip count.

Yeah, we're hit by this for AArch64 too.  Andre's recent patches:

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01564.html
https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00205.html

should help.

>   Can vectorizer
> support 2 sets of vector sizes, one for known trip count and the other
> for unknown
> trip count?

The approach Andre's taking is to continue to use the wider vector size
for unknown trip counts, and instead ensure that the epilogue loop is
vectorised at the narrower vector size if possible.  The patches then
use this vectorised epilogue as a fallback "main" loop if the runtime
trip count is too low for the wide vectors.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-24  8:14       ` Richard Sandiford
@ 2019-10-24 15:18         ` H.J. Lu
  0 siblings, 0 replies; 13+ messages in thread
From: H.J. Lu @ 2019-10-24 15:18 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener, GCC Patches

On Thu, Oct 24, 2019 at 12:56 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "H.J. Lu" <hjl.tools@gmail.com> writes:
> > On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> This patch is the first of a series that tries to remove two
> >> >> assumptions:
> >> >>
> >> >> (1) that all vectors involved in vectorisation must be the same size
> >> >>
> >> >> (2) that there is only one vector mode for a given element mode and
> >> >>     number of elements
> >> >>
> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> >> that require the number of elements to stay the same.  E.g. if we're
> >> >> vectorising code that operates on narrow and wide elements, and the
> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> >> for the wide elements.
> >> >>
> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> >> vectors to work with -msve-vector-bits=256.
> >> >>
> >> >> The patch adds a new hook that targets can use to control how we
> >> >> move from one vector mode to another.  The hook takes a starting vector
> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >> >> The flexibility needed for (1) comes in when the number of elements
> >> >> isn't specified.
> >> >>
> >> >> All callers in this patch specify the number of elements, but a later
> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> >> for a few days, hence the RFC/A tag.
> >> >>
> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> >> look OK?
> >> >
> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >> > how it works for the same-size case but if you consider x86
> >> > where we have SSE, AVX256 and AVX512 what would it return
> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >> > kind of query not intended (where the component modes match
> >> > but nunits is zero)?
> >>
> >> In that case we'd normally get V4SImode back.  It's an allowed
> >> combination, but not very useful.
> >>
> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >> > it just used to stay in the same register set for different component
> >> > modes?
> >>
> >> Yeah, the idea is to use the original vector mode as essentially
> >> a base architecture.
> >>
> >> The follow-on patches replace vec_info::vector_size with
> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>
> >
> > For a target with different vector sizes,
> > targetm.vectorize.autovectorize_vector_sizes
> > doesn't return the optimal vector sizes for known trip count and
> > unknown trip count.
> > For a target with 128-bit and 256-bit vectors, 256-bit followed by
> > 128-bit works well for
> > known trip count since vectorizer knows the maximum usable vector size.  But for
> > unknown trip count, we may want to use 128-bit vector when 256-bit
> > code path won't
> > be used at run-time, but 128-bit vector will.  At the moment, we can
> > only use one
> > set of vector sizes for both known trip count and unknown trip count.
>
> Yeah, we're hit by this for AArch64 too.  Andre's recent patches:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01564.html
> https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00205.html
>
> should help.
>
> >   Can vectorizer
> > support 2 sets of vector sizes, one for known trip count and the other
> > for unknown
> > trip count?
>
> The approach Andre's taking is to continue to use the wider vector size
> for unknown trip counts, and instead ensure that the epilogue loop is
> vectorised at the narrower vector size if possible.  The patches then
> use this vectorised epilogue as a fallback "main" loop if the runtime
> trip count is too low for the wide vectors.

I tried it on 548.exchange2_r in SPEC CPU 2017.  There is short cut
to vectorized epilogue for low trip count.

-- 
H.J.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-23 12:29       ` Richard Sandiford
@ 2019-10-25  7:21         ` Richard Biener
  2019-10-25  8:01           ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2019-10-25  7:21 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: GCC Patches

On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> This patch is the first of a series that tries to remove two
> >> >> assumptions:
> >> >>
> >> >> (1) that all vectors involved in vectorisation must be the same size
> >> >>
> >> >> (2) that there is only one vector mode for a given element mode and
> >> >>     number of elements
> >> >>
> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> >> that require the number of elements to stay the same.  E.g. if we're
> >> >> vectorising code that operates on narrow and wide elements, and the
> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> >> for the wide elements.
> >> >>
> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> >> vectors to work with -msve-vector-bits=256.
> >> >>
> >> >> The patch adds a new hook that targets can use to control how we
> >> >> move from one vector mode to another.  The hook takes a starting vector
> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >> >> The flexibility needed for (1) comes in when the number of elements
> >> >> isn't specified.
> >> >>
> >> >> All callers in this patch specify the number of elements, but a later
> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> >> for a few days, hence the RFC/A tag.
> >> >>
> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> >> look OK?
> >> >
> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >> > how it works for the same-size case but if you consider x86
> >> > where we have SSE, AVX256 and AVX512 what would it return
> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >> > kind of query not intended (where the component modes match
> >> > but nunits is zero)?
> >>
> >> In that case we'd normally get V4SImode back.  It's an allowed
> >> combination, but not very useful.
> >>
> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >> > it just used to stay in the same register set for different component
> >> > modes?
> >>
> >> Yeah, the idea is to use the original vector mode as essentially
> >> a base architecture.
> >>
> >> The follow-on patches replace vec_info::vector_size with
> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>
> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
> >> I think things would work in a similar way for the x86_64 vector archs.
> >>
> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
> >> happen to be the same size for 128-bit SVE.  We can then compare
> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
> >> for each attempt.
> >>
> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
> >>
> >> - VNx16QImode (full vector)
> >> - VNx8QImode (half vector)
> >> - VNx4QImode (quarter vector)
> >> - VNx2QImode (eighth vector)
> >>
> >> and then pick the one with the lowest cost.  related_mode would
> >> keep the number of units the same for nunits==0, within the limit
> >> of the vector size.  E.g.:
> >>
> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
> >>
> >> and:
> >>
> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
> >>
> >> So when operating on multiple element sizes, the tradeoff is between
> >> trying to make full use of the vector size (higher base nunits) vs.
> >> trying to remove packs and unpacks between multiple vector copies
> >> (lower base nunits).  The latter is useful because extending within
> >> a vector is an in-lane rather than cross-lane operation and truncating
> >> within a vector is a no-op.
> >>
> >> With a couple of tweaks, we seem to do a good job of guessing which
> >> version has the lowest cost, at least for the simple cases I've tried
> >> so far.
> >>
> >> Obviously there's going to be a bit of a compile-time cost
> >> for SVE targets, but I think it's worth paying for.
> >
> > I would guess that immediate benefit could be seen with
> > basic-block vectorization which simply fails when conversions
> > are involved.  x86_64 should now always support V4SImode
> > and V2SImode so eventually a testcase can be crafted for that
> > as well.
>
> I'd hoped so too, but the problem is that if the cost saving is good
> enough, BB vectorisation simply stops at the conversion frontiers and
> vectorises the rest, rather than considering other vector mode
> combinations that might be able to do more.

Sure, but when SLP build fails because it thinks it needs to unroll
it could ask for a vector type with the same number of lanes
(that's probably what we should do from the start - get same number
of lane vector types in BB vectorization).

It's when you introduce multiple sizes then the outer loop over all
sizes comparing costs becomes somewhat obsolete...  this outer
loop should then instead compare different VFs (which also means
possible extra unrolling beyond the vector size need?).

Richard.

>
> One way of fixing that would be to try all vector modes and pick the
> one that works best (as the WIP patches do for loop vectorisation).
> But a better way might be to allow the vectorisation as now, but
> then try to revectorise the new region with other vector modes.
> I was hoping Joel's (unrelated) constructor patch might help,
> since we could recognise constructors built from conversions
> and try to revectorise from there.
>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-25  7:21         ` Richard Biener
@ 2019-10-25  8:01           ` Richard Sandiford
  2019-10-30  9:58             ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2019-10-25  8:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guenther@gmail.com> writes:
>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> This patch is the first of a series that tries to remove two
>> >> >> assumptions:
>> >> >>
>> >> >> (1) that all vectors involved in vectorisation must be the same size
>> >> >>
>> >> >> (2) that there is only one vector mode for a given element mode and
>> >> >>     number of elements
>> >> >>
>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> >> that require the number of elements to stay the same.  E.g. if we're
>> >> >> vectorising code that operates on narrow and wide elements, and the
>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> >> for the wide elements.
>> >> >>
>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>> >> >> vectors to work with -msve-vector-bits=256.
>> >> >>
>> >> >> The patch adds a new hook that targets can use to control how we
>> >> >> move from one vector mode to another.  The hook takes a starting vector
>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> >> The flexibility needed for (1) comes in when the number of elements
>> >> >> isn't specified.
>> >> >>
>> >> >> All callers in this patch specify the number of elements, but a later
>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>> >> >> for a few days, hence the RFC/A tag.
>> >> >>
>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>> >> >> look OK?
>> >> >
>> >> > In isolation the idea looks good but maybe a bit limited?  I see
>> >> > how it works for the same-size case but if you consider x86
>> >> > where we have SSE, AVX256 and AVX512 what would it return
>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
>> >> > kind of query not intended (where the component modes match
>> >> > but nunits is zero)?
>> >>
>> >> In that case we'd normally get V4SImode back.  It's an allowed
>> >> combination, but not very useful.
>> >>
>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
>> >> > it just used to stay in the same register set for different component
>> >> > modes?
>> >>
>> >> Yeah, the idea is to use the original vector mode as essentially
>> >> a base architecture.
>> >>
>> >> The follow-on patches replace vec_info::vector_size with
>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
>> >> starting modes that would be passed to the hook in the nunits==0 case.
>> >>
>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>> >> I think things would work in a similar way for the x86_64 vector archs.
>> >>
>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>> >> happen to be the same size for 128-bit SVE.  We can then compare
>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>> >> for each attempt.
>> >>
>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>> >>
>> >> - VNx16QImode (full vector)
>> >> - VNx8QImode (half vector)
>> >> - VNx4QImode (quarter vector)
>> >> - VNx2QImode (eighth vector)
>> >>
>> >> and then pick the one with the lowest cost.  related_mode would
>> >> keep the number of units the same for nunits==0, within the limit
>> >> of the vector size.  E.g.:
>> >>
>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>> >>
>> >> and:
>> >>
>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>> >>
>> >> So when operating on multiple element sizes, the tradeoff is between
>> >> trying to make full use of the vector size (higher base nunits) vs.
>> >> trying to remove packs and unpacks between multiple vector copies
>> >> (lower base nunits).  The latter is useful because extending within
>> >> a vector is an in-lane rather than cross-lane operation and truncating
>> >> within a vector is a no-op.
>> >>
>> >> With a couple of tweaks, we seem to do a good job of guessing which
>> >> version has the lowest cost, at least for the simple cases I've tried
>> >> so far.
>> >>
>> >> Obviously there's going to be a bit of a compile-time cost
>> >> for SVE targets, but I think it's worth paying for.
>> >
>> > I would guess that immediate benefit could be seen with
>> > basic-block vectorization which simply fails when conversions
>> > are involved.  x86_64 should now always support V4SImode
>> > and V2SImode so eventually a testcase can be crafted for that
>> > as well.
>>
>> I'd hoped so too, but the problem is that if the cost saving is good
>> enough, BB vectorisation simply stops at the conversion frontiers and
>> vectorises the rest, rather than considering other vector mode
>> combinations that might be able to do more.
>
> Sure, but when SLP build fails because it thinks it needs to unroll
> it could ask for a vector type with the same number of lanes

Do you mean for loop or BB vectorisation?  For loop vectorisation
the outer loop iterating over vector sizes/modes works fine: if we
need to unroll beyond the maximum VF, we'll just retry vectorisation
with a different vector size/mode combination, with the narrowest
element having smaller vectors.

But yeah, I guess if get_vectype_for_scalar_type returns a vector
with too many units for BB vectorisation, we could try asking for
a different type with the "right" number of units, rather than
failing and falling back to the next iteration of the outer loop.
I'll give that a go.

> (that's probably what we should do from the start - get same number
> of lane vector types in BB vectorization).

It's still useful to have different numbers of lanes for both loop
and BB vectorisation.  E.g. if we're applying SLP to an operation on
8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
vector archs.

> It's when you introduce multiple sizes then the outer loop over all
> sizes comparing costs becomes somewhat obsolete...  this outer
> loop should then instead compare different VFs (which also means
> possible extra unrolling beyond the vector size need?).

This is effectively what we're doing for the SVE arrangement above.
But replacing targetm.vectorize.autovectorize_vector_sizes with
targetm.vectorize.autovectorize_vector_modes means that we can
also try multiple vector archs with the same VF (e.g. 128-bit SVE
vs. 128-bit Advanced SIMD).

Thanks,
Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-25  8:01           ` Richard Sandiford
@ 2019-10-30  9:58             ` Richard Sandiford
  2019-10-30 14:26               ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2019-10-30  9:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

The series posted so far now shows how the hook would be used in practice.
Just wanted to follow up on some points here:

Richard Sandiford <richard.sandiford@arm.com> writes:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>>> > <richard.sandiford@arm.com> wrote:
>>> >>
>>> >> Richard Biener <richard.guenther@gmail.com> writes:
>>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>>> >> > <richard.sandiford@arm.com> wrote:
>>> >> >>
>>> >> >> This patch is the first of a series that tries to remove two
>>> >> >> assumptions:
>>> >> >>
>>> >> >> (1) that all vectors involved in vectorisation must be the same size
>>> >> >>
>>> >> >> (2) that there is only one vector mode for a given element mode and
>>> >> >>     number of elements
>>> >> >>
>>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>>> >> >> that require the number of elements to stay the same.  E.g. if we're
>>> >> >> vectorising code that operates on narrow and wide elements, and the
>>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>>> >> >> for the wide elements.
>>> >> >>
>>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
>>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>>> >> >> vectors to work with -msve-vector-bits=256.
>>> >> >>
>>> >> >> The patch adds a new hook that targets can use to control how we
>>> >> >> move from one vector mode to another.  The hook takes a starting vector
>>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>>> >> >> The flexibility needed for (1) comes in when the number of elements
>>> >> >> isn't specified.
>>> >> >>
>>> >> >> All callers in this patch specify the number of elements, but a later
>>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>>> >> >> for a few days, hence the RFC/A tag.
>>> >> >>
>>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>>> >> >> look OK?
>>> >> >
>>> >> > In isolation the idea looks good but maybe a bit limited?  I see
>>> >> > how it works for the same-size case but if you consider x86
>>> >> > where we have SSE, AVX256 and AVX512 what would it return
>>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
>>> >> > kind of query not intended (where the component modes match
>>> >> > but nunits is zero)?
>>> >>
>>> >> In that case we'd normally get V4SImode back.  It's an allowed
>>> >> combination, but not very useful.
>>> >>
>>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
>>> >> > it just used to stay in the same register set for different component
>>> >> > modes?
>>> >>
>>> >> Yeah, the idea is to use the original vector mode as essentially
>>> >> a base architecture.
>>> >>
>>> >> The follow-on patches replace vec_info::vector_size with
>>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
>>> >> starting modes that would be passed to the hook in the nunits==0 case.
>>> >>
>>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>>> >> I think things would work in a similar way for the x86_64 vector archs.
>>> >>
>>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>>> >> happen to be the same size for 128-bit SVE.  We can then compare
>>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>>> >> for each attempt.
>>> >>
>>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>>> >>
>>> >> - VNx16QImode (full vector)
>>> >> - VNx8QImode (half vector)
>>> >> - VNx4QImode (quarter vector)
>>> >> - VNx2QImode (eighth vector)
>>> >>
>>> >> and then pick the one with the lowest cost.  related_mode would
>>> >> keep the number of units the same for nunits==0, within the limit
>>> >> of the vector size.  E.g.:
>>> >>
>>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>>> >>
>>> >> and:
>>> >>
>>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>>> >>
>>> >> So when operating on multiple element sizes, the tradeoff is between
>>> >> trying to make full use of the vector size (higher base nunits) vs.
>>> >> trying to remove packs and unpacks between multiple vector copies
>>> >> (lower base nunits).  The latter is useful because extending within
>>> >> a vector is an in-lane rather than cross-lane operation and truncating
>>> >> within a vector is a no-op.
>>> >>
>>> >> With a couple of tweaks, we seem to do a good job of guessing which
>>> >> version has the lowest cost, at least for the simple cases I've tried
>>> >> so far.
>>> >>
>>> >> Obviously there's going to be a bit of a compile-time cost
>>> >> for SVE targets, but I think it's worth paying for.
>>> >
>>> > I would guess that immediate benefit could be seen with
>>> > basic-block vectorization which simply fails when conversions
>>> > are involved.  x86_64 should now always support V4SImode
>>> > and V2SImode so eventually a testcase can be crafted for that
>>> > as well.
>>>
>>> I'd hoped so too, but the problem is that if the cost saving is good
>>> enough, BB vectorisation simply stops at the conversion frontiers and
>>> vectorises the rest, rather than considering other vector mode
>>> combinations that might be able to do more.
>>
>> Sure, but when SLP build fails because it thinks it needs to unroll
>> it could ask for a vector type with the same number of lanes
>
> Do you mean for loop or BB vectorisation?  For loop vectorisation
> the outer loop iterating over vector sizes/modes works fine: if we
> need to unroll beyond the maximum VF, we'll just retry vectorisation
> with a different vector size/mode combination, with the narrowest
> element having smaller vectors.
>
> But yeah, I guess if get_vectype_for_scalar_type returns a vector
> with too many units for BB vectorisation, we could try asking for
> a different type with the "right" number of units, rather than
> failing and falling back to the next iteration of the outer loop.
> I'll give that a go.

This is 16/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02063.html
It was easier than I'd expected and cleans up some nasty codegen for
mixed-size SLP on AArch64, so thanks :-)

On iterating over VFs:

>> (that's probably what we should do from the start - get same number
>> of lane vector types in BB vectorization).
>
> It's still useful to have different numbers of lanes for both loop
> and BB vectorisation.  E.g. if we're applying SLP to an operation on
> 8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
> vector archs.
>
>> It's when you introduce multiple sizes then the outer loop over all
>> sizes comparing costs becomes somewhat obsolete...  this outer
>> loop should then instead compare different VFs (which also means
>> possible extra unrolling beyond the vector size need?).
>
> This is effectively what we're doing for the SVE arrangement above.
> But replacing targetm.vectorize.autovectorize_vector_sizes with
> targetm.vectorize.autovectorize_vector_modes means that we can
> also try multiple vector archs with the same VF (e.g. 128-bit SVE
> vs. 128-bit Advanced SIMD).

See 12/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01831.html
for how this iteration works in practice.  The AArch64 port specifies
four ways of picking vector types, depending on which elements (if any)
use 64 bits instead of 128 bits.  Each of these approaches then determines
a VF, based on the element types that are actually used.  Sometimes we
can end up trying the same VF and vector types multiple times, but that's
easy to fix -- will post a patch soon as preparation for picking the
"best" size.

I think iterating on these approaches is better than iterating directly
on candidate VFs for the reason mentioned above: full-vector 128-bit SVE
and full-vector 128-bit Advanced SIMD both provide the same VF, but we
want to try them both.

And like you say, if base vector mode M1 gives a VF of X and base vector
mode M2 gives a VF of X/2, and if 2*M2 appears cheaper than M1, we could
in future have the option of doubling the VF for M2 artificially as a
way of getting interleave-based unrolling.  So in that case we'd have
two different approaches for vectorising at VF X even if we stick to the
same vector architecture.

So I don't think starting with the VF and calculating other things from
that is intrinsically better.  We'd then have to have a subloop that
iterates over Advanced SIMD vs. SVE, and potentially a further subloop
that iterates over unroll factors.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: RFC/A: Add a targetm.vectorize.related_mode hook
  2019-10-30  9:58             ` Richard Sandiford
@ 2019-10-30 14:26               ` Richard Biener
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2019-10-30 14:26 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: GCC Patches

On Wed, Oct 30, 2019 at 10:43 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> The series posted so far now shows how the hook would be used in practice.
> Just wanted to follow up on some points here:
>
> Richard Sandiford <richard.sandiford@arm.com> writes:
> > Richard Biener <richard.guenther@gmail.com> writes:
> >> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >>>
> >>> Richard Biener <richard.guenther@gmail.com> writes:
> >>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
> >>> > <richard.sandiford@arm.com> wrote:
> >>> >>
> >>> >> Richard Biener <richard.guenther@gmail.com> writes:
> >>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >>> >> > <richard.sandiford@arm.com> wrote:
> >>> >> >>
> >>> >> >> This patch is the first of a series that tries to remove two
> >>> >> >> assumptions:
> >>> >> >>
> >>> >> >> (1) that all vectors involved in vectorisation must be the same size
> >>> >> >>
> >>> >> >> (2) that there is only one vector mode for a given element mode and
> >>> >> >>     number of elements
> >>> >> >>
> >>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >>> >> >> that require the number of elements to stay the same.  E.g. if we're
> >>> >> >> vectorising code that operates on narrow and wide elements, and the
> >>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >>> >> >> for the wide elements.
> >>> >> >>
> >>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce
> >>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >>> >> >> vectors to work with -msve-vector-bits=256.
> >>> >> >>
> >>> >> >> The patch adds a new hook that targets can use to control how we
> >>> >> >> move from one vector mode to another.  The hook takes a starting vector
> >>> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >>> >> >> The flexibility needed for (1) comes in when the number of elements
> >>> >> >> isn't specified.
> >>> >> >>
> >>> >> >> All callers in this patch specify the number of elements, but a later
> >>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >>> >> >> for a few days, hence the RFC/A tag.
> >>> >> >>
> >>> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >>> >> >> look OK?
> >>> >> >
> >>> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >>> >> > how it works for the same-size case but if you consider x86
> >>> >> > where we have SSE, AVX256 and AVX512 what would it return
> >>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >>> >> > kind of query not intended (where the component modes match
> >>> >> > but nunits is zero)?
> >>> >>
> >>> >> In that case we'd normally get V4SImode back.  It's an allowed
> >>> >> combination, but not very useful.
> >>> >>
> >>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >>> >> > it just used to stay in the same register set for different component
> >>> >> > modes?
> >>> >>
> >>> >> Yeah, the idea is to use the original vector mode as essentially
> >>> >> a base architecture.
> >>> >>
> >>> >> The follow-on patches replace vec_info::vector_size with
> >>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >>> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>> >>
> >>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
> >>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
> >>> >> I think things would work in a similar way for the x86_64 vector archs.
> >>> >>
> >>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
> >>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
> >>> >> happen to be the same size for 128-bit SVE.  We can then compare
> >>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
> >>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
> >>> >> for each attempt.
> >>> >>
> >>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
> >>> >>
> >>> >> - VNx16QImode (full vector)
> >>> >> - VNx8QImode (half vector)
> >>> >> - VNx4QImode (quarter vector)
> >>> >> - VNx2QImode (eighth vector)
> >>> >>
> >>> >> and then pick the one with the lowest cost.  related_mode would
> >>> >> keep the number of units the same for nunits==0, within the limit
> >>> >> of the vector size.  E.g.:
> >>> >>
> >>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
> >>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
> >>> >>
> >>> >> and:
> >>> >>
> >>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
> >>> >>
> >>> >> So when operating on multiple element sizes, the tradeoff is between
> >>> >> trying to make full use of the vector size (higher base nunits) vs.
> >>> >> trying to remove packs and unpacks between multiple vector copies
> >>> >> (lower base nunits).  The latter is useful because extending within
> >>> >> a vector is an in-lane rather than cross-lane operation and truncating
> >>> >> within a vector is a no-op.
> >>> >>
> >>> >> With a couple of tweaks, we seem to do a good job of guessing which
> >>> >> version has the lowest cost, at least for the simple cases I've tried
> >>> >> so far.
> >>> >>
> >>> >> Obviously there's going to be a bit of a compile-time cost
> >>> >> for SVE targets, but I think it's worth paying for.
> >>> >
> >>> > I would guess that immediate benefit could be seen with
> >>> > basic-block vectorization which simply fails when conversions
> >>> > are involved.  x86_64 should now always support V4SImode
> >>> > and V2SImode so eventually a testcase can be crafted for that
> >>> > as well.
> >>>
> >>> I'd hoped so too, but the problem is that if the cost saving is good
> >>> enough, BB vectorisation simply stops at the conversion frontiers and
> >>> vectorises the rest, rather than considering other vector mode
> >>> combinations that might be able to do more.
> >>
> >> Sure, but when SLP build fails because it thinks it needs to unroll
> >> it could ask for a vector type with the same number of lanes
> >
> > Do you mean for loop or BB vectorisation?  For loop vectorisation
> > the outer loop iterating over vector sizes/modes works fine: if we
> > need to unroll beyond the maximum VF, we'll just retry vectorisation
> > with a different vector size/mode combination, with the narrowest
> > element having smaller vectors.
> >
> > But yeah, I guess if get_vectype_for_scalar_type returns a vector
> > with too many units for BB vectorisation, we could try asking for
> > a different type with the "right" number of units, rather than
> > failing and falling back to the next iteration of the outer loop.
> > I'll give that a go.
>
> This is 16/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02063.html
> It was easier than I'd expected and cleans up some nasty codegen for
> mixed-size SLP on AArch64, so thanks :-)
>
> On iterating over VFs:
>
> >> (that's probably what we should do from the start - get same number
> >> of lane vector types in BB vectorization).
> >
> > It's still useful to have different numbers of lanes for both loop
> > and BB vectorisation.  E.g. if we're applying SLP to an operation on
> > 8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
> > vector archs.
> >
> >> It's when you introduce multiple sizes then the outer loop over all
> >> sizes comparing costs becomes somewhat obsolete...  this outer
> >> loop should then instead compare different VFs (which also means
> >> possible extra unrolling beyond the vector size need?).
> >
> > This is effectively what we're doing for the SVE arrangement above.
> > But replacing targetm.vectorize.autovectorize_vector_sizes with
> > targetm.vectorize.autovectorize_vector_modes means that we can
> > also try multiple vector archs with the same VF (e.g. 128-bit SVE
> > vs. 128-bit Advanced SIMD).
>
> See 12/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01831.html
> for how this iteration works in practice.  The AArch64 port specifies
> four ways of picking vector types, depending on which elements (if any)
> use 64 bits instead of 128 bits.  Each of these approaches then determines
> a VF, based on the element types that are actually used.  Sometimes we
> can end up trying the same VF and vector types multiple times, but that's
> easy to fix -- will post a patch soon as preparation for picking the
> "best" size.
>
> I think iterating on these approaches is better than iterating directly
> on candidate VFs for the reason mentioned above: full-vector 128-bit SVE
> and full-vector 128-bit Advanced SIMD both provide the same VF, but we
> want to try them both.
>
> And like you say, if base vector mode M1 gives a VF of X and base vector
> mode M2 gives a VF of X/2, and if 2*M2 appears cheaper than M1, we could
> in future have the option of doubling the VF for M2 artificially as a
> way of getting interleave-based unrolling.  So in that case we'd have
> two different approaches for vectorising at VF X even if we stick to the
> same vector architecture.
>
> So I don't think starting with the VF and calculating other things from
> that is intrinsically better.  We'd then have to have a subloop that
> iterates over Advanced SIMD vs. SVE, and potentially a further subloop
> that iterates over unroll factors.

OK, fair enough.

The patch is OK.

Thanks,
Richard.

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-10-30 14:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-23 11:01 RFC/A: Add a targetm.vectorize.related_mode hook Richard Sandiford
2019-10-23 11:16 ` Richard Biener
2019-10-23 12:00   ` Richard Sandiford
2019-10-23 12:07     ` Richard Biener
2019-10-23 12:29       ` Richard Sandiford
2019-10-25  7:21         ` Richard Biener
2019-10-25  8:01           ` Richard Sandiford
2019-10-30  9:58             ` Richard Sandiford
2019-10-30 14:26               ` Richard Biener
2019-10-23 22:35     ` H.J. Lu
2019-10-24  8:14       ` Richard Sandiford
2019-10-24 15:18         ` H.J. Lu
2019-10-23 21:42   ` Jim Wilson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).