public inbox for gcc-patches@gcc.gnu.org
* [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
@ 2023-10-31  9:59 Juzhe-Zhong
  2023-10-31 14:08 ` Li, Pan2
  2023-11-02 13:52 ` Richard Biener
  0 siblings, 2 replies; 13+ messages in thread
From: Juzhe-Zhong @ 2023-10-31  9:59 UTC (permalink / raw)
  To: gcc-patches
  Cc: rguenther, jeffreyalaw, richard.sandiford, rdapp.gcc, Juzhe-Zhong

As Richard previously suggested, we should support strided load/store in the
loop vectorizer instead of hacking the RISC-V backend.

This patch adds the MASK_LEN_STRIDED_LOAD/MASK_LEN_STRIDED_STORE optabs/IFNs.

The GIMPLE IR is the same as for mask_len_gather_load/mask_len_scatter_store,
but with the vector offset changed into a scalar stride.

We don't add strided_load/strided_store or mask_strided_load/mask_strided_store
since it's unlikely RVV will ever have such optabs, and we can't add patterns
that we are unable to test.
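
For illustration, this is the kind of scalar access the new IFN is meant to
cover, together with a sketch of the GIMPLE call the vectorizer would emit
(the names and the exact argument order below are hypothetical; they simply
mirror .MASK_LEN_GATHER_LOAD with the vector offset replaced by a scalar
stride):

  /* A load whose address advances by a runtime stride.  */
  void
  foo (int *restrict dst, int *restrict src, long stride, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = src[i * stride];
  }

  /* Hypothetical vectorized GIMPLE, mirroring
     .MASK_LEN_GATHER_LOAD (base, offset, scale, else, mask, len, bias):  */
  vect__1 = .MASK_LEN_STRIDED_LOAD (src_ptr, _stride, 4, 0, _mask, _len, _bias);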


gcc/ChangeLog:

	* doc/md.texi: Add mask_len_strided_load/mask_len_strided_store.
	* internal-fn.cc (internal_load_fn_p): Ditto.
	(internal_strided_fn_p): Ditto.
	(internal_fn_len_index): Ditto.
	(internal_fn_mask_index): Ditto.
	(internal_fn_stored_value_index): Ditto.
	(internal_strided_fn_supported_p): Ditto.
	* internal-fn.def (MASK_LEN_STRIDED_LOAD): Ditto.
	(MASK_LEN_STRIDED_STORE): Ditto.
	* internal-fn.h (internal_strided_fn_p): Ditto.
	(internal_strided_fn_supported_p): Ditto.
	* optabs.def (OPTAB_CD): Ditto.

---
 gcc/doc/md.texi     | 51 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/internal-fn.cc  | 44 ++++++++++++++++++++++++++++++++++++++
 gcc/internal-fn.def |  4 ++++
 gcc/internal-fn.h   |  2 ++
 gcc/optabs.def      |  2 ++
 5 files changed, 103 insertions(+)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index fab2513105a..5bac713a0dd 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5094,6 +5094,32 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
 be loaded from memory and clear if element @var{i} of the result should be undefined.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_load@var{m}@var{n}} instruction pattern
+@item @samp{mask_len_strided_load@var{m}@var{n}}
+Load several separate memory locations into a destination vector of mode @var{m}.
+Operand 0 is the destination vector.
+Operand 1 is a scalar base address and operand 2 is a scalar stride of mode @var{n}.
+The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with 0 as base and operand 2 as step.
+For each element index @var{i}:
+
+@itemize @bullet
+@item
+extend the stride to address width, using zero
+extension if operand 3 is 1 and sign extension if operand 3 is zero;
+@item
+multiply the extended stride by operand 4;
+@item
+add @var{i} times the result to the base; and
+@item
+load the value at that address into element @var{i} of operand 0.
+@end itemize
+
+Similar to mask_len_load, the instruction loads at most (operand 6 + operand 7) elements from memory.
+Bit @var{i} of the mask is set if element @var{i} of the result should
+be loaded from memory and clear if element @var{i} of the result should be undefined.
+Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
+
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
 Store a vector of mode @var{m} into several distinct memory locations.
@@ -5131,6 +5157,31 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
 Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_store@var{m}@var{n}} instruction pattern
+@item @samp{mask_len_strided_store@var{m}@var{n}}
+Store a vector of mode @var{m} into several distinct memory locations.
+Operand 0 is a scalar base address and operand 1 is a scalar stride of mode @var{n}.
+Operand 4 is the vector of values that should be stored, which is of mode @var{m}.
+The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with 0 as base and operand 1 as step.
+For each element index @var{i}:
+
+@itemize @bullet
+@item
+extend the stride to address width, using zero
+extension if operand 2 is 1 and sign extension if operand 2 is zero;
+@item
+multiply the extended stride by operand 3;
+@item
+add @var{i} times the result to the base; and
+@item
+store element @var{i} of operand 4 to that address.
+@end itemize
+
+Similar to mask_len_store, the instruction stores at most (operand 6 + operand 7) elements of (operand 4) to memory.
+Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
+Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index e7451b96353..f7f85aa7dde 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -4596,6 +4596,7 @@ internal_load_fn_p (internal_fn fn)
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
     case IFN_MASK_LEN_GATHER_LOAD:
+    case IFN_MASK_LEN_STRIDED_LOAD:
     case IFN_LEN_LOAD:
     case IFN_MASK_LEN_LOAD:
       return true;
@@ -4648,6 +4649,22 @@ internal_gather_scatter_fn_p (internal_fn fn)
     }
 }
 
+/* Return true if IFN is some form of strided load or strided store.  */
+
+bool
+internal_strided_fn_p (internal_fn fn)
+{
+  switch (fn)
+    {
+    case IFN_MASK_LEN_STRIDED_LOAD:
+    case IFN_MASK_LEN_STRIDED_STORE:
+      return true;
+
+    default:
+      return false;
+    }
+}
+
 /* If FN takes a vector len argument, return the index of that argument,
    otherwise return -1.  */
 
@@ -4662,6 +4679,8 @@ internal_fn_len_index (internal_fn fn)
 
     case IFN_MASK_LEN_GATHER_LOAD:
     case IFN_MASK_LEN_SCATTER_STORE:
+    case IFN_MASK_LEN_STRIDED_LOAD:
+    case IFN_MASK_LEN_STRIDED_STORE:
     case IFN_COND_LEN_FMA:
     case IFN_COND_LEN_FMS:
     case IFN_COND_LEN_FNMA:
@@ -4719,6 +4738,8 @@ internal_fn_mask_index (internal_fn fn)
     case IFN_MASK_SCATTER_STORE:
     case IFN_MASK_LEN_GATHER_LOAD:
     case IFN_MASK_LEN_SCATTER_STORE:
+    case IFN_MASK_LEN_STRIDED_LOAD:
+    case IFN_MASK_LEN_STRIDED_STORE:
       return 4;
 
     default:
@@ -4740,6 +4761,7 @@ internal_fn_stored_value_index (internal_fn fn)
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
     case IFN_MASK_LEN_SCATTER_STORE:
+    case IFN_MASK_LEN_STRIDED_STORE:
       return 3;
 
     case IFN_LEN_STORE:
@@ -4784,6 +4806,28 @@ internal_gather_scatter_fn_supported_p (internal_fn ifn, tree vector_type,
 	  && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
 }
 
+/* Return true if the target supports strided load or strided store function
+   IFN.  For loads, VECTOR_TYPE is the vector type of the load result,
+   while for stores it is the vector type of the stored data argument.
+   STRIDE_TYPE is the type of the scalar stride, i.e. the distance in
+   memory between the addresses of two consecutive loaded or stored
+   elements.  SCALE is the amount by which the stride should be multiplied
+   *after* it has been extended to address width.  */
+
+bool
+internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
+				 tree stride_type, int scale)
+{
+  optab optab = direct_internal_fn_optab (ifn);
+  insn_code icode = convert_optab_handler (optab, TYPE_MODE (vector_type),
+					   TYPE_MODE (stride_type));
+  int output_ops = internal_load_fn_p (ifn) ? 1 : 0;
+  bool unsigned_p = TYPE_UNSIGNED (stride_type);
+  return (icode != CODE_FOR_nothing
+	  && insn_operand_matches (icode, 2 + output_ops, GEN_INT (unsigned_p))
+	  && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
+}
+
 /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
    for pointers of type TYPE when the accesses have LENGTH bytes and their
    common byte alignment is ALIGN.  */
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index a2023ab9c3d..0fa532e8f6b 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -199,6 +199,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
 		       mask_len_gather_load, gather_load)
+DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
+		       mask_len_strided_load, gather_load)
 
 DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
@@ -208,6 +210,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
 		       mask_len_scatter_store, scatter_store)
+DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
+		       mask_len_strided_store, scatter_store)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
index 99de13a0199..8379c61dff7 100644
--- a/gcc/internal-fn.h
+++ b/gcc/internal-fn.h
@@ -235,11 +235,13 @@ extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
 extern bool internal_load_fn_p (internal_fn);
 extern bool internal_store_fn_p (internal_fn);
 extern bool internal_gather_scatter_fn_p (internal_fn);
+extern bool internal_strided_fn_p (internal_fn);
 extern int internal_fn_mask_index (internal_fn);
 extern int internal_fn_len_index (internal_fn);
 extern int internal_fn_stored_value_index (internal_fn);
 extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
 						    tree, tree, int);
+extern bool internal_strided_fn_supported_p (internal_fn, tree, tree, int);
 extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
 						poly_uint64, unsigned int);
 #define VECT_PARTIAL_BIAS_UNSUPPORTED 127
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 2ccbe4197b7..3d85ac5f678 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -98,9 +98,11 @@ OPTAB_CD(mask_len_store_optab, "mask_len_store$a$b")
 OPTAB_CD(gather_load_optab, "gather_load$a$b")
 OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
 OPTAB_CD(mask_len_gather_load_optab, "mask_len_gather_load$a$b")
+OPTAB_CD(mask_len_strided_load_optab, "mask_len_strided_load$a$b")
 OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
 OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
 OPTAB_CD(mask_len_scatter_store_optab, "mask_len_scatter_store$a$b")
+OPTAB_CD(mask_len_strided_store_optab, "mask_len_strided_store$a$b")
 OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
 OPTAB_CD(vec_init_optab, "vec_init$a$b")
 
-- 
2.36.3



* RE: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-10-31  9:59 [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN Juzhe-Zhong
@ 2023-10-31 14:08 ` Li, Pan2
  2023-11-02 13:52 ` Richard Biener
  1 sibling, 0 replies; 13+ messages in thread
From: Li, Pan2 @ 2023-10-31 14:08 UTC (permalink / raw)
  To: Juzhe-Zhong, gcc-patches
  Cc: rguenther, jeffreyalaw, richard.sandiford, rdapp.gcc

Passed the x86 bootstrap and regression tests.

Pan


* Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-10-31  9:59 [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN Juzhe-Zhong
  2023-10-31 14:08 ` Li, Pan2
@ 2023-11-02 13:52 ` Richard Biener
  2023-11-02 14:23   ` 钟居哲
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Biener @ 2023-11-02 13:52 UTC (permalink / raw)
  To: Juzhe-Zhong; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, rdapp.gcc

On Tue, 31 Oct 2023, Juzhe-Zhong wrote:

> As Richard previously suggested, we should support strided load/store in the
> loop vectorizer instead of hacking the RISC-V backend.
> 
> This patch adds the MASK_LEN_STRIDED_LOAD/MASK_LEN_STRIDED_STORE optabs/IFNs.
> 
> The GIMPLE IR is the same as for mask_len_gather_load/mask_len_scatter_store,
> but with the vector offset changed into a scalar stride.

I see that it follows gather/scatter.  I'll note that when introducing
those we failed to add a specifier for TBAA and alignment info for the
data access.  That means we have to use alias-set zero for the accesses
(I see existing targets use UNSPECs with a not very elaborate MEM anyway,
but TBAA info might have been the "easy" and obvious property to 
preserve).  For alignment we either have to assume unaligned or reject
vectorization of accesses that do not have their original scalar accesses
naturally aligned (aligned according to their mode).  We don't seem
to check that though.
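
Something along the following lines would be the missing check (just a
sketch with generic helpers, not existing vectorizer code):

  /* Return true if the original scalar access REF is naturally aligned,
     i.e. aligned according to its mode.  Both helpers work in bits.  */
  static bool
  scalar_access_naturally_aligned_p (tree ref)
  {
    machine_mode mode = TYPE_MODE (TREE_TYPE (ref));
    return get_object_alignment (ref) >= GET_MODE_ALIGNMENT (mode);
  }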

It might be fine to go forward with this since gather/scatter are broken
in a similar way.

Do we really need to have two modes for the optab though or could we
simply require the target to support arbitrary offset modes (given it
is implicitly constrained to ptr_mode for the base already)?  Or
properly extend/truncate the offset at expansion time, say to ptr_mode
or to the mode of sizetype.
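
At expansion time that could be as simple as the following (a sketch only;
the helper name is made up and the surrounding expander plumbing is omitted):

  /* Canonicalize the scalar STRIDE, currently in STRIDE_MODE, to Pmode
     so the optab only ever sees one stride mode.  UNSIGNED_P selects
     zero rather than sign extension.  */
  static rtx
  canonicalize_stride (rtx stride, machine_mode stride_mode, bool unsigned_p)
  {
    if (stride_mode != Pmode)
      stride = convert_to_mode (Pmode, stride, unsigned_p);
    return stride;
  }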

Thanks,
Richard.
 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-02 13:52 ` Richard Biener
@ 2023-11-02 14:23   ` 钟居哲
  2023-11-02 14:27     ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: 钟居哲 @ 2023-11-02 14:23 UTC (permalink / raw)
  To: rguenther; +Cc: gcc-patches, Jeff Law, richard.sandiford, rdapp.gcc


Hi, Richi.

>> Do we really need to have two modes for the optab though or could we
>> simply require the target to support arbitrary offset modes (given it
>> is implicitly constrained to ptr_mode for the base already)?  Or
>> properly extend/truncate the offset at expansion time, say to ptr_mode
>> or to the mode of sizetype.

For RVV, it's OK to set the stride type to ptr_mode/size_type by default.
Is it OK if I define strided load/store as a single-mode optab with Pmode as the default stride operand mode?
How about the scale and signed/unsigned operands?
It seems the scale operand can be removed, since we can pass DR_STEP directly to the stride argument.
But I think we can't remove the signed/unsigned operand since, for a stride of SImode, the unsigned
maximum stride is 2^31 whereas the signed one is 2^30.
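
For reference, the single-mode variant would look something like this in
optabs.def (hypothetical; with the stride implicitly Pmode, a direct optab
would suffice):

  OPTAB_D (mask_len_strided_load_optab, "mask_len_strided_load$a")
  OPTAB_D (mask_len_strided_store_optab, "mask_len_strided_store$a")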




juzhe.zhong@rivai.ai
 

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-02 14:23   ` 钟居哲
@ 2023-11-02 14:27     ` Richard Biener
  2023-11-02 14:29       ` 钟居哲
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2023-11-02 14:27 UTC (permalink / raw)
  To: 钟居哲
  Cc: gcc-patches, Jeff Law, richard.sandiford, rdapp.gcc

On Thu, 2 Nov 2023, 钟居哲 wrote:

> Hi, Richi.
> 
> >> Do we really need to have two modes for the optab though or could we
> >> simply require the target to support arbitrary offset modes (given it
> >> is implicitly constrained to ptr_mode for the base already)?  Or
> >> properly extend/truncate the offset at expansion time, say to ptr_mode
> >> or to the mode of sizetype.
> 
> For RVV, it's OK to set the stride type to ptr_mode/size_type by default.
> Is it OK if I define strided load/store as a single-mode optab with Pmode as the default stride operand mode?
> How about the scale and signed/unsigned operands?
> It seems the scale operand can be removed, since we can pass DR_STEP directly to the stride argument.
> But I think we can't remove the signed/unsigned operand since, for a stride of SImode, the unsigned
> maximum stride is 2^31 whereas the signed one is 2^30.

On the GIMPLE side I think we want to have a sizetype operand and
indeed drop 'scale', the sizetype operand should be readily available.
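
Concretely, the calls would then look something like this (hypothetical
GIMPLE; the byte stride in sizetype absorbs the old scale operand):

  _v = .MASK_LEN_STRIDED_LOAD (_base, _byte_stride, _mask, _len, _bias);
  .MASK_LEN_STRIDED_STORE (_base, _byte_stride, _v, _mask, _len, _bias);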

> > +}
> > +
> >  /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
> >     for pointers of type TYPE when the accesses have LENGTH bytes and their
> >     common byte alignment is ALIGN.  */
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index a2023ab9c3d..0fa532e8f6b 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -199,6 +199,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> >         mask_gather_load, gather_load)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
> >         mask_len_gather_load, gather_load)
> > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> > +        mask_len_strided_load, gather_load)
> >  
> >  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> > @@ -208,6 +210,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> >         mask_scatter_store, scatter_store)
> >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
> >         mask_len_scatter_store, scatter_store)
> > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> > +        mask_len_strided_store, scatter_store)
> >  
> >  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
> >  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > index 99de13a0199..8379c61dff7 100644
> > --- a/gcc/internal-fn.h
> > +++ b/gcc/internal-fn.h
> > @@ -235,11 +235,13 @@ extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
> >  extern bool internal_load_fn_p (internal_fn);
> >  extern bool internal_store_fn_p (internal_fn);
> >  extern bool internal_gather_scatter_fn_p (internal_fn);
> > +extern bool internal_strided_fn_p (internal_fn);
> >  extern int internal_fn_mask_index (internal_fn);
> >  extern int internal_fn_len_index (internal_fn);
> >  extern int internal_fn_stored_value_index (internal_fn);
> >  extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
> >      tree, tree, int);
> > +extern bool internal_strided_fn_supported_p (internal_fn, tree, tree, int);
> >  extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
> >  poly_uint64, unsigned int);
> >  #define VECT_PARTIAL_BIAS_UNSUPPORTED 127
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 2ccbe4197b7..3d85ac5f678 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -98,9 +98,11 @@ OPTAB_CD(mask_len_store_optab, "mask_len_store$a$b")
> >  OPTAB_CD(gather_load_optab, "gather_load$a$b")
> >  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> >  OPTAB_CD(mask_len_gather_load_optab, "mask_len_gather_load$a$b")
> > +OPTAB_CD(mask_len_strided_load_optab, "mask_len_strided_load$a$b")
> >  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> >  OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> >  OPTAB_CD(mask_len_scatter_store_optab, "mask_len_scatter_store$a$b")
> > +OPTAB_CD(mask_len_strided_store_optab, "mask_len_strided_store$a$b")
> >  OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> >  OPTAB_CD(vec_init_optab, "vec_init$a$b")
> >  
> > 
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-02 14:27     ` Richard Biener
@ 2023-11-02 14:29       ` 钟居哲
  2023-11-02 14:34         ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: 钟居哲 @ 2023-11-02 14:29 UTC (permalink / raw)
  To: rguenther; +Cc: gcc-patches, Jeff Law, richard.sandiford, rdapp.gcc

[-- Attachment #1: Type: text/plain, Size: 14898 bytes --]

Ok. So drop 'scale' and keep the signed/unsigned argument, is that right?
And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
Which is preferable?

Thanks.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-11-02 22:27
To: 钟居哲
CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
On Thu, 2 Nov 2023, 钟居哲 wrote:
 
> Hi, Richi.
> 
> >> Do we really need to have two modes for the optab though or could we
> >> simply require the target to support arbitrary offset modes (give it
> >> is implicitly constrained to ptr_mode for the base already)?  Or
> >> properly extend/truncate the offset at expansion time, say to ptr_mode
> >> or to the mode of sizetype.
> 
> For RVV, it's ok by default set stride type as ptr_mode/size_type by default.
> Is it ok that I define strided load/store as single mode optab and default Pmode as stride operand?
> How about scale and signed/unsigned operand ?
> It seems scale operand can be removed ? Since we can pass DR_STEP directly to the stride arguments.
> But I think we can't remove signed/unsigned operand since for strided mode = SI mode, the unsigned
> maximum stride = 2^31 wheras signed is 2 ^ 30.
 
On the GIMPLE side I think we want to have a sizetype operand and
indeed drop 'scale', the sizetype operand should be readily available.
 
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 21:52
> To: Juzhe-Zhong
> CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
>  
> > As previous Richard's suggested, we should support strided load/store in
> > loop vectorizer instead hacking RISC-V backend.
> > 
> > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > 
> > The GIMPLE IR is same as mask_len_gather_load/mask_len_scatter_store but with
> > changing vector offset into scalar stride.
>  
> I see that it follows gather/scatter.  I'll note that when introducing
> those we failed to add a specifier for TBAA and alignment info for the
> data access.  That means we have to use alias-set zero for the accesses
> (I see existing targets use UNSPECs with some not elaborate MEM anyway,
> but TBAA info might have been the "easy" and obvious property to 
> preserve).  For alignment we either have to assume unaligned or reject
> vectorization of accesses that do not have their original scalar accesses
> naturally aligned (aligned according to their mode).  We don't seem
> to check that though.
>  
> It might be fine to go forward with this since gather/scatter are broken
> in a similar way.
>  
> Do we really need to have two modes for the optab though or could we
> simply require the target to support arbitrary offset modes (give it
> is implicitly constrained to ptr_mode for the base already)?  Or
> properly extend/truncate the offset at expansion time, say to ptr_mode
> or to the mode of sizetype.
>  
> Thanks,
> Richard.
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-02 14:29       ` 钟居哲
@ 2023-11-02 14:34         ` Richard Biener
  2023-11-03  7:10           ` juzhe.zhong
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2023-11-02 14:34 UTC (permalink / raw)
  To: 钟居哲
  Cc: gcc-patches, Jeff Law, richard.sandiford, rdapp.gcc

On Thu, 2 Nov 2023, 钟居哲 wrote:

> Ok. So drop 'scale' and keep the signed/unsigned argument, is that right?

I don't think we need signed/unsigned.  RTL expansion has the signedness
of the offset argument there and can just extend to the appropriate mode
to offset a pointer.
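
A minimal sketch of that expansion-time handling (hypothetical helper
code with made-up names, not code from the patch):

  /* STMT is the internal-fn call statement.  Widen or truncate the
     stride to Pmode using the signedness of its GIMPLE type, so the
     optab itself needs no signed/unsigned operand.  */
  tree stride = gimple_call_arg (stmt, 1);
  rtx stride_rtx = expand_normal (stride);
  stride_rtx = convert_to_mode (Pmode, stride_rtx,
                                TYPE_UNSIGNED (TREE_TYPE (stride)));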

> And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
> Which is preferable?

'sizetype' - that's the type we require to be used for 
the POINTER_PLUS_EXPR offset operand.
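
On the GIMPLE side the stride then just feeds the usual pointer
arithmetic, roughly (an illustrative dump, not from a testcase):

  stride.0_2 = (sizetype) stride_1(D);
  addr_3 = base_4(D) + stride.0_2;  /* POINTER_PLUS_EXPR */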


> Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 22:27
> To: 钟居哲
> CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Thu, 2 Nov 2023, 钟居哲 wrote:
>  
> > Hi, Richi.
> > 
> > >> Do we really need to have two modes for the optab though or could we
> > >> simply require the target to support arbitrary offset modes (give it
> > >> is implicitly constrained to ptr_mode for the base already)?  Or
> > >> properly extend/truncate the offset at expansion time, say to ptr_mode
> > >> or to the mode of sizetype.
> > 
> > For RVV, it's ok by default set stride type as ptr_mode/size_type by default.
> > Is it ok that I define strided load/store as single mode optab and default Pmode as stride operand?
> > How about scale and signed/unsigned operand ?
> > It seems scale operand can be removed ? Since we can pass DR_STEP directly to the stride arguments.
> > But I think we can't remove signed/unsigned operand since for strided mode = SI mode, the unsigned
> > maximum stride = 2^31 wheras signed is 2 ^ 30.
>  
> On the GIMPLE side I think we want to have a sizetype operand and
> indeed drop 'scale', the sizetype operand should be readily available.
>  
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-11-02 21:52
> > To: Juzhe-Zhong
> > CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> > Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
> >  
> > > As previous Richard's suggested, we should support strided load/store in
> > > loop vectorizer instead hacking RISC-V backend.
> > > 
> > > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > > 
> > > The GIMPLE IR is same as mask_len_gather_load/mask_len_scatter_store but with
> > > changing vector offset into scalar stride.
> >  
> > I see that it follows gather/scatter.  I'll note that when introducing
> > those we failed to add a specifier for TBAA and alignment info for the
> > data access.  That means we have to use alias-set zero for the accesses
> > (I see existing targets use UNSPECs with some not elaborate MEM anyway,
> > but TBAA info might have been the "easy" and obvious property to 
> > preserve).  For alignment we either have to assume unaligned or reject
> > vectorization of accesses that do not have their original scalar accesses
> > naturally aligned (aligned according to their mode).  We don't seem
> > to check that though.
> >  
> > It might be fine to go forward with this since gather/scatter are broken
> > in a similar way.
> >  
> > Do we really need to have two modes for the optab though or could we
> > simply require the target to support arbitrary offset modes (give it
> > is implicitly constrained to ptr_mode for the base already)?  Or
> > properly extend/truncate the offset at expansion time, say to ptr_mode
> > or to the mode of sizetype.
> >  
> > Thanks,
> > Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-02 14:34         ` Richard Biener
@ 2023-11-03  7:10           ` juzhe.zhong
  2023-11-03  7:40             ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: juzhe.zhong @ 2023-11-03  7:10 UTC (permalink / raw)
  To: rguenther; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 19457 bytes --]

Hi, Richi.

The following is the strided load/store doc:

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index fab2513105a..4f0821a291d 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5094,6 +5094,22 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
 be loaded from memory and clear if element @var{i} of the result should be undefined.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.

+@cindex @code{mask_len_strided_load@var{m}} instruction pattern
+@item @samp{mask_len_strided_load@var{m}}
+Load several separate memory locations into a destination vector of mode @var{m}.
+Operand 0 is a destination vector of mode @var{m}.
+Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
+Operand 3 indicates whether the stride operand is a signed or unsigned number and operand 4 is the mask operand.
+Operand 5 is the length operand and operand 6 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
+For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
+Operand 2 can be a negative stride if operand 3 is 0.
+Similar to mask_len_load, the instruction loads at most (operand 5 + operand 6) elements from memory.
+Bit @var{i} of the mask (operand 4) is set if element @var{i} of the result should
+be loaded from memory and clear if element @var{i} of the result should be undefined.
+Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
+
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
 Store a vector of mode @var{m} into several distinct memory locations.
@@ -5131,6 +5147,21 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
 Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.

+@cindex @code{mask_len_strided_store@var{m}} instruction pattern
+@item @samp{mask_len_strided_store@var{m}}
+Store a vector of mode @var{m} into several distinct memory locations.
+Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
+Operand 2 indicates whether the stride operand is a signed or unsigned number.
+Operand 3 is the vector of values that should be stored, which is of mode @var{m}.
+Operand 4 is the mask operand, operand 5 is the length operand and operand 6 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
+For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
+Operand 1 can be a negative stride if operand 2 is 0.
+Similar to mask_len_store, the instruction stores at most (operand 5 + operand 6) elements of (operand 3) to memory.
+Bit @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
+Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
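
For concreteness, with this operand layout the internal-fn calls would
look roughly like the following GIMPLE (illustrative SSA names, not a
real dump):

  vect_5 = .MASK_LEN_STRIDED_LOAD (base_1, stride_2, sign_3, mask_4, len_6, bias_7);
  .MASK_LEN_STRIDED_STORE (base_1, stride_2, sign_3, vect_5, mask_4, len_6, bias_7);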

Does it look reasonable? Thanks.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-11-02 22:34
To: 钟居哲
CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
On Thu, 2 Nov 2023, 钟居哲 wrote:
 
> Ok. So drop 'scale' and keep the signed/unsigned argument, is that right?
 
I don't think we need signed/unsigned.  RTL expansion has the signedness
of the offset argument there and can just extend to the appropriate mode
to offset a pointer.
 
> And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
> Which is preferable?
 
'sizetype' - that's the type we require to be used for 
the POINTER_PLUS_EXPR offset operand.
 
 
> Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 22:27
> To: 钟居哲
> CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Thu, 2 Nov 2023, 钟居哲 wrote:
>  
> > Hi, Richi.
> > 
> > >> Do we really need to have two modes for the optab though or could we
> > >> simply require the target to support arbitrary offset modes (give it
> > >> is implicitly constrained to ptr_mode for the base already)?  Or
> > >> properly extend/truncate the offset at expansion time, say to ptr_mode
> > >> or to the mode of sizetype.
> > 
> > For RVV, it's ok by default set stride type as ptr_mode/size_type by default.
> > Is it ok that I define strided load/store as single mode optab and default Pmode as stride operand?
> > How about scale and signed/unsigned operand ?
> > It seems scale operand can be removed ? Since we can pass DR_STEP directly to the stride arguments.
> > But I think we can't remove signed/unsigned operand since for strided mode = SI mode, the unsigned
> > maximum stride = 2^31 wheras signed is 2 ^ 30.
>  
> On the GIMPLE side I think we want to have a sizetype operand and
> indeed drop 'scale', the sizetype operand should be readily available.
>  
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-11-02 21:52
> > To: Juzhe-Zhong
> > CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> > Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
> >  
> > > As previous Richard's suggested, we should support strided load/store in
> > > loop vectorizer instead hacking RISC-V backend.
> > > 
> > > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > > 
> > > The GIMPLE IR is same as mask_len_gather_load/mask_len_scatter_store but with
> > > changing vector offset into scalar stride.
> >  
> > I see that it follows gather/scatter.  I'll note that when introducing
> > those we failed to add a specifier for TBAA and alignment info for the
> > data access.  That means we have to use alias-set zero for the accesses
> > (I see existing targets use UNSPECs with some not elaborate MEM anyway,
> > but TBAA info might have been the "easy" and obvious property to 
> > preserve).  For alignment we either have to assume unaligned or reject
> > vectorization of accesses that do not have their original scalar accesses
> > naturally aligned (aligned according to their mode).  We don't seem
> > to check that though.
> >  
> > It might be fine to go forward with this since gather/scatter are broken
> > in a similar way.
> >  
> > Do we really need to have two modes for the optab though or could we
> > simply require the target to support arbitrary offset modes (give it
> > is implicitly constrained to ptr_mode for the base already)?  Or
> > properly extend/truncate the offset at expansion time, say to ptr_mode
> > or to the mode of sizetype.
> >  
> > Thanks,
> > Richard.
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 13+ messages in thread
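
(For reference, the element-wise semantics spelled out in the patch above can be modeled by a scalar loop.  The following C sketch is illustrative only and is not part of the patch; the element type is fixed to int32_t for concreteness and all names are hypothetical.  The review below converges on dropping the signedness and scale operands, which would remove the first two statements of this model.)

#include <stdint.h>
#include <stddef.h>

/* Scalar model of mask_len_strided_load: dest, base, stride, unsigned_p,
   scale, mask, len and bias mirror operands 0-7 of the optab.  Elements
   whose mask bit is clear are left untouched, matching the "undefined"
   wording under discussion.  */
static void
mask_len_strided_load_model (int32_t *dest, const char *base,
                             int32_t stride, int unsigned_p, int scale,
                             const unsigned char *mask,
                             ptrdiff_t len, ptrdiff_t bias)
{
  /* Extend the stride to address width (operand 3), then multiply it
     by the scale (operand 4).  */
  int64_t step = unsigned_p ? (int64_t) (uint32_t) stride
                            : (int64_t) stride;
  step *= scale;
  for (ptrdiff_t i = 0; i < len + bias; i++)
    if (mask[i])
      dest[i] = *(const int32_t *) (base + i * step);
}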

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-03  7:10           ` juzhe.zhong
@ 2023-11-03  7:40             ` Richard Biener
  2023-11-03  7:49               ` juzhe.zhong
  2023-11-03  8:18               ` juzhe.zhong
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Biener @ 2023-11-03  7:40 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richi.
> 
> The following is strided load/store doc:
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index fab2513105a..4f0821a291d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5094,6 +5094,22 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
>  be loaded from memory and clear if element @var{i} of the result should be undefined.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> +@item @samp{mask_len_strided_load@var{m}}
> +Load several separate memory locations into a destination vector of mode @var{m}.
> +Operand 0 is a destination vector of mode @var{m}.
> +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> +Operand 3 indicates whether the stride operand is a signed or unsigned number, and operand 4 is the mask operand.

we don't need operand 3

> +Operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> +For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
> +Operand 2 can be a negative stride if operand 3 is 0.
> +Similar to mask_len_load, the instruction loads at most (operand 5 + operand 6) elements from memory.
> +Bit @var{i} of the mask (operand 4) is set if element @var{i} of the result should

Element @var{i}.  I think we don't want to document 'undefined' but
rather match mask_gather_load and say the result should be zero.

Similar adjustments below.

> +be loaded from memory and clear if element @var{i} of the result should be undefined.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
>  @item @samp{scatter_store@var{m}@var{n}}
>  Store a vector of mode @var{m} into several distinct memory locations.
> @@ -5131,6 +5147,21 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
>  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> +@item @samp{mask_len_strided_store@var{m}}
> +Store a vector of mode @var{m} into several distinct memory locations.
> +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> +Operand 2 indicates whether the stride operand is a signed or unsigned number.
> +Operand 3 is the vector of values that should be stored, which is of mode @var{m}.
> +Operand 4 is the mask operand, operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> +For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
> +Operand 1 can be a negative stride if operand 2 is 0.
> +Similar to mask_len_store, the instruction stores at most (operand 5 + operand 6) elements of (operand 3) to memory.
> +Bit @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> 
> Does it look reasonable? Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 22:34
> To: juzhe.zhong@rivai.ai
> CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Ok. So drop 'scale' and keep signed/unsigned argument, is that right?
>  
> I don't think we need signed/unsigned.  RTL expansion has the signedness
> of the offset argument there and can just extend to the appropriate mode
> to offset a pointer.
>  
> > And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
> > Which is preferable?
>  
> 'sizetype' - that's the type we require to be used for 
> the POINTER_PLUS_EXPR offset operand.
>  
>  
> > Thanks.
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-11-02 22:27
> > To: juzhe.zhong@rivai.ai
> > CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> > Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richi.
> > > 
> > > >> Do we really need to have two modes for the optab though or could we
> > > >> simply require the target to support arbitrary offset modes (given it
> > > >> is implicitly constrained to ptr_mode for the base already)?  Or
> > > >> properly extend/truncate the offset at expansion time, say to ptr_mode
> > > >> or to the mode of sizetype.
> > > 
> > > For RVV, it's OK to set the stride type to ptr_mode/size_type by default.
> > > Is it OK if I define strided load/store as a single-mode optab with Pmode as the stride operand?
> > > What about the scale and signed/unsigned operands?
> > > It seems the scale operand can be removed, since we can pass DR_STEP directly as the stride argument.
> > > But I think we can't remove the signed/unsigned operand, since for stride mode = SImode the unsigned
> > > maximum stride is 2^31 whereas the signed one is 2^30.
> >  
> > On the GIMPLE side I think we want to have a sizetype operand and
> > indeed drop 'scale', the sizetype operand should be readily available.
> >  
> > > 
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > >  
> > > From: Richard Biener
> > > Date: 2023-11-02 21:52
> > > To: Juzhe-Zhong
> > > CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> > > Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > > On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
> > >  
> > > > As Richard previously suggested, we should support strided load/store in the
> > > > loop vectorizer instead of hacking the RISC-V backend.
> > > > 
> > > > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > > > 
> > > > The GIMPLE IR is the same as mask_len_gather_load/mask_len_scatter_store, but with
> > > > the vector offset changed into a scalar stride.
> > >  
> > > I see that it follows gather/scatter.  I'll note that when introducing
> > > those we failed to add a specifier for TBAA and alignment info for the
> > > data access.  That means we have to use alias-set zero for the accesses
> > > (I see existing targets use UNSPECs with a not further elaborated MEM anyway,
> > > but TBAA info might have been the "easy" and obvious property to 
> > > preserve).  For alignment we either have to assume unaligned or reject
> > > vectorization of accesses that do not have their original scalar accesses
> > > naturally aligned (aligned according to their mode).  We don't seem
> > > to check that though.
> > >  
> > > It might be fine to go forward with this since gather/scatter are broken
> > > in a similar way.
> > >  
> > > Do we really need to have two modes for the optab though or could we
> > > simply require the target to support arbitrary offset modes (give it
> > > is implicitly constrained to ptr_mode for the base already)?  Or
> > > properly extend/truncate the offset at expansion time, say to ptr_mode
> > > or to the mode of sizetype.
> > >  
> > > Thanks,
> > > Richard.
> > > > We don't have strided_load/strided_store and mask_strided_load/mask_strided_store since
> > > > it's unlikely RVV will have such optabs, and we can't add patterns that we can't test.
> > > >
> > > > 
> > > > gcc/ChangeLog:
> > > > 
> > > > * doc/md.texi: Add mask_len_strided_load/mask_len_strided_store.
> > > > * internal-fn.cc (internal_load_fn_p): Ditto.
> > > > (internal_strided_fn_p): Ditto.
> > > > (internal_fn_len_index): Ditto.
> > > > (internal_fn_mask_index): Ditto.
> > > > (internal_fn_stored_value_index): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * internal-fn.def (MASK_LEN_STRIDED_LOAD): Ditto.
> > > > (MASK_LEN_STRIDED_STORE): Ditto.
> > > > * internal-fn.h (internal_strided_fn_p): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * optabs.def (OPTAB_CD): Ditto.
> > > > 
> > > > ---
> > > >  gcc/doc/md.texi     | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.cc  | 44 ++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.def |  4 ++++
> > > >  gcc/internal-fn.h   |  2 ++
> > > >  gcc/optabs.def      |  2 ++
> > > >  5 files changed, 103 insertions(+)
> > > > 
> > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > index fab2513105a..5bac713a0dd 100644
> > > > --- a/gcc/doc/md.texi
> > > > +++ b/gcc/doc/md.texi
> > > > @@ -5094,6 +5094,32 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
> > > >  be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_load@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_load@var{m}@var{n}}
> > > > +Load several separate memory locations into a destination vector of mode @var{m}.
> > > > +Operand 0 is a destination vector of mode @var{m}.
> > > > +Operand 1 is a scalar base address and operand 2 is a scalar stride of mode @var{n}.
> > > > +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 3 is 1 and sign extension if operand 3 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 4;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +load the value at that address (operand 1 + @var{i} * multiplied and extended stride) into element @var{i} of operand 0.
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_load, the instruction loads at most (operand 6 + operand 7) elements from memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of the result should
> > > > +be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
> > > >  @item @samp{scatter_store@var{m}@var{n}}
> > > >  Store a vector of mode @var{m} into several distinct memory locations.
> > > > @@ -5131,6 +5157,31 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > >  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_store@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_store@var{m}@var{n}}
> > > > +Store a vector of mode @var{m} into several distinct memory locations.
> > > > +Operand 0 is a scalar base address and operand 1 is a scalar stride of mode @var{n}.
> > > > +Operand 4 is the vector of values that should be stored, which is of mode @var{m}.
> > > > +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 2 is 1 and sign extension if operand 2 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 3;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +store element @var{i} of operand 4 to that address (operand 0 + @var{i} * multiplied and extended stride).
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_store, the instruction stores at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{vec_set@var{m}} instruction pattern
> > > >  @item @samp{vec_set@var{m}}
> > > >  Set given field in the vector value.  Operand 0 is the vector to modify,
> > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > index e7451b96353..f7f85aa7dde 100644
> > > > --- a/gcc/internal-fn.cc
> > > > +++ b/gcc/internal-fn.cc
> > > > @@ -4596,6 +4596,7 @@ internal_load_fn_p (internal_fn fn)
> > > >      case IFN_GATHER_LOAD:
> > > >      case IFN_MASK_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > >      case IFN_LEN_LOAD:
> > > >      case IFN_MASK_LEN_LOAD:
> > > >        return true;
> > > > @@ -4648,6 +4649,22 @@ internal_gather_scatter_fn_p (internal_fn fn)
> > > >      }
> > > >  }
> > > >  
> > > > +/* Return true if IFN is some form of strided load or strided store.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_p (internal_fn fn)
> > > > +{
> > > > +  switch (fn)
> > > > +    {
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > > +      return true;
> > > > +
> > > > +    default:
> > > > +      return false;
> > > > +    }
> > > > +}
> > > > +
> > > >  /* If FN takes a vector len argument, return the index of that argument,
> > > >     otherwise return -1.  */
> > > >  
> > > > @@ -4662,6 +4679,8 @@ internal_fn_len_index (internal_fn fn)
> > > >  
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >      case IFN_COND_LEN_FMA:
> > > >      case IFN_COND_LEN_FMS:
> > > >      case IFN_COND_LEN_FNMA:
> > > > @@ -4719,6 +4738,8 @@ internal_fn_mask_index (internal_fn fn)
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 4;
> > > >  
> > > >      default:
> > > > @@ -4740,6 +4761,7 @@ internal_fn_stored_value_index (internal_fn fn)
> > > >      case IFN_SCATTER_STORE:
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 3;
> > > >  
> > > >      case IFN_LEN_STORE:
> > > > @@ -4784,6 +4806,28 @@ internal_gather_scatter_fn_supported_p (internal_fn ifn, tree vector_type,
> > > >    && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > >  }
> > > >  
> > > > +/* Return true if the target supports strided load or strided store function
> > > > +   IFN.  For loads, VECTOR_TYPE is the vector type of the load result,
> > > > +   while for stores it is the vector type of the stored data argument.
> > > > +   STRIDE_TYPE is the type that holds the stride, i.e. the distance in
> > > > +   memory between the addresses of consecutive loaded or stored elements.
> > > > +   SCALE is the amount by which the strides should be multiplied
> > > > +   *after* they have been extended to address width.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
> > > > + tree offset_type, int scale)
> > > > +{
> > > > +  optab optab = direct_internal_fn_optab (ifn);
> > > > +  insn_code icode = convert_optab_handler (optab, TYPE_MODE (vector_type),
> > > > +    TYPE_MODE (offset_type));
> > > > +  int output_ops = internal_load_fn_p (ifn) ? 1 : 0;
> > > > +  bool unsigned_p = TYPE_UNSIGNED (offset_type);
> > > > +  return (icode != CODE_FOR_nothing
> > > > +   && insn_operand_matches (icode, 2 + output_ops, GEN_INT (unsigned_p))
> > > > +   && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > > +}
> > > > +
> > > >  /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
> > > >     for pointers of type TYPE when the accesses have LENGTH bytes and their
> > > >     common byte alignment is ALIGN.  */
> > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > index a2023ab9c3d..0fa532e8f6b 100644
> > > > --- a/gcc/internal-fn.def
> > > > +++ b/gcc/internal-fn.def
> > > > @@ -199,6 +199,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> > > >         mask_gather_load, gather_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
> > > >         mask_len_gather_load, gather_load)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> > > > +        mask_len_strided_load, gather_load)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> > > > @@ -208,6 +210,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> > > >         mask_scatter_store, scatter_store)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
> > > >         mask_len_scatter_store, scatter_store)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> > > > +        mask_len_strided_store, scatter_store)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
> > > >  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> > > > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > > > index 99de13a0199..8379c61dff7 100644
> > > > --- a/gcc/internal-fn.h
> > > > +++ b/gcc/internal-fn.h
> > > > @@ -235,11 +235,13 @@ extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
> > > >  extern bool internal_load_fn_p (internal_fn);
> > > >  extern bool internal_store_fn_p (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_p (internal_fn);
> > > > +extern bool internal_strided_fn_p (internal_fn);
> > > >  extern int internal_fn_mask_index (internal_fn);
> > > >  extern int internal_fn_len_index (internal_fn);
> > > >  extern int internal_fn_stored_value_index (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
> > > >      tree, tree, int);
> > > > +extern bool internal_strided_fn_supported_p (internal_fn, tree, tree, int);
> > > >  extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
> > > >  poly_uint64, unsigned int);
> > > >  #define VECT_PARTIAL_BIAS_UNSUPPORTED 127
> > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > index 2ccbe4197b7..3d85ac5f678 100644
> > > > --- a/gcc/optabs.def
> > > > +++ b/gcc/optabs.def
> > > > @@ -98,9 +98,11 @@ OPTAB_CD(mask_len_store_optab, "mask_len_store$a$b")
> > > >  OPTAB_CD(gather_load_optab, "gather_load$a$b")
> > > >  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> > > >  OPTAB_CD(mask_len_gather_load_optab, "mask_len_gather_load$a$b")
> > > > +OPTAB_CD(mask_len_strided_load_optab, "mask_len_strided_load$a$b")
> > > >  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> > > >  OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> > > >  OPTAB_CD(mask_len_scatter_store_optab, "mask_len_scatter_store$a$b")
> > > > +OPTAB_CD(mask_len_strided_store_optab, "mask_len_strided_store$a$b")
> > > >  OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> > > >  OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > > >  
> > > > 
> > >  
> > > 
> >  
> > 
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread
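
(To make the sizetype point concrete, here is a hedged example -- all names invented -- of the scalar loop shape being targeted.  The byte step of the src data reference is what the vectorizer already has at hand as DR_STEP, in sizetype, which is also the type POINTER_PLUS_EXPR requires for its offset operand, so no extra signedness or scale operands are needed on the IFN.)

#include <stddef.h>

/* Read a "column" from a row-major matrix with leading dimension ld:
   a strided load with byte stride ld * sizeof (float).  */
void
gather_column (float *dst, const float *src, size_t n, size_t ld)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i * ld];
}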

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-03  7:40             ` Richard Biener
@ 2023-11-03  7:49               ` juzhe.zhong
  2023-11-03  8:27                 ` Richard Biener
  2023-11-03  8:18               ` juzhe.zhong
  1 sibling, 1 reply; 13+ messages in thread
From: juzhe.zhong @ 2023-11-03  7:49 UTC (permalink / raw)
  To: rguenther; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 21126 bytes --]

Hi, Richi.

>> Element @var{i}.  I think we don't want to document 'undefined' but
>> rather match mask_gather_load and say the result should be zero.

I am confused here: will the vectorizer depend on a zero inactive value?
Since RVV loads/stores don't touch the inactive (tail or unmasked) elements, I am wondering whether this will cause issues.
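
(For illustration, the two behaviors in question as scalar C models, with hypothetical names: "undisturbed" is roughly what RVV hardware does with inactive elements, while "zeroing" is what the suggested documentation wording would guarantee.  If the IFN documents zeros, a target whose instructions leave inactive elements undisturbed would presumably have to expand from a zeroed destination, or prove the inactive lanes are never used.)

#include <stdint.h>

/* Masked-off elements keep whatever dest held before.  */
static void
load_undisturbed (int32_t *dest, const char *base, int64_t step,
                  const unsigned char *mask, long n)
{
  for (long i = 0; i < n; i++)
    if (mask[i])
      dest[i] = *(const int32_t *) (base + i * step);
}

/* Masked-off elements are defined to be zero.  */
static void
load_zeroing (int32_t *dest, const char *base, int64_t step,
              const unsigned char *mask, long n)
{
  for (long i = 0; i < n; i++)
    dest[i] = mask[i] ? *(const int32_t *) (base + i * step) : 0;
}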


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-11-03 15:40
To: juzhe.zhong@rivai.ai
CC: gcc-patches; jeffreyalaw; richard.sandiford; Robin Dapp
Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:
 
> Hi, Richi.
> 
> The following is strided load/store doc:
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index fab2513105a..4f0821a291d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5094,6 +5094,22 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
>  be loaded from memory and clear if element @var{i} of the result should be undefined.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> +@item @samp{mask_len_strided_load@var{m}}
> +Load several separate memory locations into a destination vector of mode @var{m}.
> +Operand 0 is a destination vector of mode @var{m}.
> +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> +Operand 3 indicates whether the stride operand is a signed or unsigned number, and operand 4 is the mask operand.
 
we don't need operand 3
 
> +Operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> +For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
> +Operand 2 can be a negative stride if operand 3 is 0.
> +Similar to mask_len_load, the instruction loads at most (operand 5 + operand 6) elements from memory.
> +Bit @var{i} of the mask (operand 4) is set if element @var{i} of the result should
 
Element @var{i}.  I think we don't want to document 'undefined' but
rather match mask_gather_load and say the result should be zero.
 
Similar adjustments below.
 
> +be loaded from memory and clear if element @var{i} of the result should be undefined.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
>  @item @samp{scatter_store@var{m}@var{n}}
>  Store a vector of mode @var{m} into several distinct memory locations.
> @@ -5131,6 +5147,21 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
>  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> +@item @samp{mask_len_strided_store@var{m}}
> +Store a vector of mode @var{m} into several distinct memory locations.
> +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> +Operand 2 indicates whether the stride operand is a signed or unsigned number.
> +Operand 3 is the vector of values that should be stored, which is of mode @var{m}.
> +Operand 4 is the mask operand, operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> +For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
> +Operand 1 can be a negative stride if operand 2 is 0.
> +Similar to mask_len_store, the instruction stores at most (operand 5 + operand 6) elements of (operand 3) to memory.
> +Bit @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> 
> Does it look reasonable? Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 22:34
> To: juzhe.zhong@rivai.ai
> CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Ok. So drop 'scale' and keep signed/unsigned argument, is that right?
>  
> I don't think we need signed/unsigned.  RTL expansion has the signedness
> of the offset argument there and can just extend to the appropriate mode
> to offset a pointer.
>  
> > And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
> > Which is preferable?
>  
> 'sizetype' - that's the type we require to be used for 
> the POINTER_PLUS_EXPR offset operand.
>  
>  
> > Thanks.
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-11-02 22:27
> > To: juzhe.zhong@rivai.ai
> > CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> > Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richi.
> > > 
> > > >> Do we really need to have two modes for the optab though or could we
> > > >> simply require the target to support arbitrary offset modes (given it
> > > >> is implicitly constrained to ptr_mode for the base already)?  Or
> > > >> properly extend/truncate the offset at expansion time, say to ptr_mode
> > > >> or to the mode of sizetype.
> > > 
> > > For RVV, it's OK to set the stride type to ptr_mode/size_type by default.
> > > Is it OK if I define strided load/store as a single-mode optab with Pmode as the stride operand?
> > > What about the scale and signed/unsigned operands?
> > > It seems the scale operand can be removed, since we can pass DR_STEP directly as the stride argument.
> > > But I think we can't remove the signed/unsigned operand, since for stride mode = SImode the unsigned
> > > maximum stride is 2^31 whereas the signed one is 2^30.
> >  
> > On the GIMPLE side I think we want to have a sizetype operand and
> > indeed drop 'scale', the sizetype operand should be readily available.
> >  
> > > 
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > >  
> > > From: Richard Biener
> > > Date: 2023-11-02 21:52
> > > To: Juzhe-Zhong
> > > CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> > > Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > > On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
> > >  
> > > > As Richard previously suggested, we should support strided load/store in the
> > > > loop vectorizer instead of hacking the RISC-V backend.
> > > > 
> > > > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > > > 
> > > > The GIMPLE IR is the same as mask_len_gather_load/mask_len_scatter_store, but with
> > > > the vector offset changed into a scalar stride.
> > >  
> > > I see that it follows gather/scatter.  I'll note that when introducing
> > > those we failed to add a specifier for TBAA and alignment info for the
> > > data access.  That means we have to use alias-set zero for the accesses
> > > (I see existing targets use UNSPECs with a not further elaborated MEM anyway,
> > > but TBAA info might have been the "easy" and obvious property to 
> > > preserve).  For alignment we either have to assume unaligned or reject
> > > vectorization of accesses that do not have their original scalar accesses
> > > naturally aligned (aligned according to their mode).  We don't seem
> > > to check that though.
> > >  
> > > It might be fine to go forward with this since gather/scatter are broken
> > > in a similar way.
> > >  
> > > Do we really need to have two modes for the optab though or could we
> > > simply require the target to support arbitrary offset modes (given it
> > > is implicitly constrained to ptr_mode for the base already)?  Or
> > > properly extend/truncate the offset at expansion time, say to ptr_mode
> > > or to the mode of sizetype.
> > >  
> > > Thanks,
> > > Richard.
> > > > We don't have strided_load/strided_store and mask_strided_load/mask_strided_store since
> > > > it's unlikely RVV will have such optabs, and we can't add patterns that we can't test.
> > > >
> > > > 
> > > > gcc/ChangeLog:
> > > > 
> > > > * doc/md.texi: Add mask_len_strided_load/mask_len_strided_store.
> > > > * internal-fn.cc (internal_load_fn_p): Ditto.
> > > > (internal_strided_fn_p): Ditto.
> > > > (internal_fn_len_index): Ditto.
> > > > (internal_fn_mask_index): Ditto.
> > > > (internal_fn_stored_value_index): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * internal-fn.def (MASK_LEN_STRIDED_LOAD): Ditto.
> > > > (MASK_LEN_STRIDED_STORE): Ditto.
> > > > * internal-fn.h (internal_strided_fn_p): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * optabs.def (OPTAB_CD): Ditto.
> > > > 
> > > > ---
> > > >  gcc/doc/md.texi     | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.cc  | 44 ++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.def |  4 ++++
> > > >  gcc/internal-fn.h   |  2 ++
> > > >  gcc/optabs.def      |  2 ++
> > > >  5 files changed, 103 insertions(+)
> > > > 
> > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > index fab2513105a..5bac713a0dd 100644
> > > > --- a/gcc/doc/md.texi
> > > > +++ b/gcc/doc/md.texi
> > > > @@ -5094,6 +5094,32 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
> > > >  be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_load@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_load@var{m}@var{n}}
> > > > +Load several separate memory locations into a destination vector of mode @var{m}.
> > > > +Operand 0 is a destination vector of mode @var{m}.
> > > > +Operand 1 is a scalar base address and operand 2 is a scalar stride of mode @var{n}.
> > > > +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 3 is 1 and sign extension if operand 3 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 4;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +load the value at that address (operand 1 + @var{i} * multiplied and extended stride) into element @var{i} of operand 0.
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_load, the instruction loads at most (operand 6 + operand 7) elements from memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of the result should
> > > > +be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
> > > >  @item @samp{scatter_store@var{m}@var{n}}
> > > >  Store a vector of mode @var{m} into several distinct memory locations.
> > > > @@ -5131,6 +5157,31 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > >  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_store@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_store@var{m}@var{n}}
> > > > +Store a vector of mode @var{m} into several distinct memory locations.
> > > > +Operand 0 is a scalar base address and operand 1 is a scalar stride of mode @var{n}.
> > > > +Operand 4 is the vector of values that should be stored, which is of mode @var{m}.
> > > > +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 2 is 1 and sign extension if operand 2 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 3;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +store element @var{i} of operand 4 to that address (operand 0 + @var{i} * multiplied and extended stride).
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_store, the instruction stores at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{vec_set@var{m}} instruction pattern
> > > >  @item @samp{vec_set@var{m}}
> > > >  Set given field in the vector value.  Operand 0 is the vector to modify,
> > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > index e7451b96353..f7f85aa7dde 100644
> > > > --- a/gcc/internal-fn.cc
> > > > +++ b/gcc/internal-fn.cc
> > > > @@ -4596,6 +4596,7 @@ internal_load_fn_p (internal_fn fn)
> > > >      case IFN_GATHER_LOAD:
> > > >      case IFN_MASK_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > >      case IFN_LEN_LOAD:
> > > >      case IFN_MASK_LEN_LOAD:
> > > >        return true;
> > > > @@ -4648,6 +4649,22 @@ internal_gather_scatter_fn_p (internal_fn fn)
> > > >      }
> > > >  }
> > > >  
> > > > +/* Return true if IFN is some form of strided load or strided store.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_p (internal_fn fn)
> > > > +{
> > > > +  switch (fn)
> > > > +    {
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > > +      return true;
> > > > +
> > > > +    default:
> > > > +      return false;
> > > > +    }
> > > > +}
> > > > +
> > > >  /* If FN takes a vector len argument, return the index of that argument,
> > > >     otherwise return -1.  */
> > > >  
> > > > @@ -4662,6 +4679,8 @@ internal_fn_len_index (internal_fn fn)
> > > >  
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >      case IFN_COND_LEN_FMA:
> > > >      case IFN_COND_LEN_FMS:
> > > >      case IFN_COND_LEN_FNMA:
> > > > @@ -4719,6 +4738,8 @@ internal_fn_mask_index (internal_fn fn)
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 4;
> > > >  
> > > >      default:
> > > > @@ -4740,6 +4761,7 @@ internal_fn_stored_value_index (internal_fn fn)
> > > >      case IFN_SCATTER_STORE:
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 3;
> > > >  
> > > >      case IFN_LEN_STORE:
> > > > @@ -4784,6 +4806,28 @@ internal_gather_scatter_fn_supported_p (internal_fn ifn, tree vector_type,
> > > >    && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > >  }
> > > >  
> > > > +/* Return true if the target supports strided load or strided store function
> > > > +   IFN.  For loads, VECTOR_TYPE is the vector type of the load result,
> > > > +   while for stores it is the vector type of the stored data argument.
> > > > +   STRIDE_TYPE is the type that holds the stride, i.e. the distance in
> > > > +   memory between the addresses of consecutive loaded or stored elements.
> > > > +   SCALE is the amount by which the strides should be multiplied
> > > > +   *after* they have been extended to address width.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
> > > > + tree offset_type, int scale)
> > > > +{
> > > > +  optab optab = direct_internal_fn_optab (ifn);
> > > > +  insn_code icode = convert_optab_handler (optab, TYPE_MODE (vector_type),
> > > > +    TYPE_MODE (offset_type));
> > > > +  int output_ops = internal_load_fn_p (ifn) ? 1 : 0;
> > > > +  bool unsigned_p = TYPE_UNSIGNED (offset_type);
> > > > +  return (icode != CODE_FOR_nothing
> > > > +   && insn_operand_matches (icode, 2 + output_ops, GEN_INT (unsigned_p))
> > > > +   && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > > +}
> > > > +
> > > >  /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
> > > >     for pointers of type TYPE when the accesses have LENGTH bytes and their
> > > >     common byte alignment is ALIGN.  */
> > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > index a2023ab9c3d..0fa532e8f6b 100644
> > > > --- a/gcc/internal-fn.def
> > > > +++ b/gcc/internal-fn.def
> > > > @@ -199,6 +199,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> > > >         mask_gather_load, gather_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
> > > >         mask_len_gather_load, gather_load)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> > > > +        mask_len_strided_load, gather_load)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> > > > @@ -208,6 +210,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> > > >         mask_scatter_store, scatter_store)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
> > > >         mask_len_scatter_store, scatter_store)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> > > > +        mask_len_strided_store, scatter_store)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
> > > >  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> > > > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > > > index 99de13a0199..8379c61dff7 100644
> > > > --- a/gcc/internal-fn.h
> > > > +++ b/gcc/internal-fn.h
> > > > @@ -235,11 +235,13 @@ extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
> > > >  extern bool internal_load_fn_p (internal_fn);
> > > >  extern bool internal_store_fn_p (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_p (internal_fn);
> > > > +extern bool internal_strided_fn_p (internal_fn);
> > > >  extern int internal_fn_mask_index (internal_fn);
> > > >  extern int internal_fn_len_index (internal_fn);
> > > >  extern int internal_fn_stored_value_index (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
> > > >      tree, tree, int);
> > > > +extern bool internal_strided_fn_supported_p (internal_fn, tree, tree, int);
> > > >  extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
> > > >  poly_uint64, unsigned int);
> > > >  #define VECT_PARTIAL_BIAS_UNSUPPORTED 127
> > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > index 2ccbe4197b7..3d85ac5f678 100644
> > > > --- a/gcc/optabs.def
> > > > +++ b/gcc/optabs.def
> > > > @@ -98,9 +98,11 @@ OPTAB_CD(mask_len_store_optab, "mask_len_store$a$b")
> > > >  OPTAB_CD(gather_load_optab, "gather_load$a$b")
> > > >  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> > > >  OPTAB_CD(mask_len_gather_load_optab, "mask_len_gather_load$a$b")
> > > > +OPTAB_CD(mask_len_strided_load_optab, "mask_len_strided_load$a$b")
> > > >  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> > > >  OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> > > >  OPTAB_CD(mask_len_scatter_store_optab, "mask_len_scatter_store$a$b")
> > > > +OPTAB_CD(mask_len_strided_store_optab, "mask_len_strided_store$a$b")
> > > >  OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> > > >  OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > > >  
> > > > 
> > >  
> > > 
> >  
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-03  7:40             ` Richard Biener
  2023-11-03  7:49               ` juzhe.zhong
@ 2023-11-03  8:18               ` juzhe.zhong
  2023-11-03  8:29                 ` Richard Biener
  1 sibling, 1 reply; 13+ messages in thread
From: juzhe.zhong @ 2023-11-03  8:18 UTC (permalink / raw)
  To: rguenther; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

[-- Attachment #1: Type: text/plain, Size: 23844 bytes --]

Hi, Richi.

Is the following reasonable?

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index fab2513105a..8d1cdad1607 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5094,6 +5094,20 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
 be loaded from memory and clear if element @var{i} of the result should be undefined.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.

+@cindex @code{mask_len_strided_load@var{m}} instruction pattern
+@item @samp{mask_len_strided_load@var{m}}
+Load several separate memory locations into a destination vector of mode @var{m}.
+Operand 0 is a destination vector of mode @var{m}.
+Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
+For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
+Similar to mask_len_load, the instruction loads at most (operand 4 + operand 5) elements from memory.
+Bit @var{i} of the mask (operand 3) is set if element @var{i} of the result should
+be loaded from memory and clear if element @var{i} of the result should be zero.
+Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
+
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
 Store a vector of mode @var{m} into several distinct memory locations.
@@ -5131,6 +5145,19 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
 Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.

+@cindex @code{mask_len_strided_store@var{m}} instruction pattern
+@item @samp{mask_len_strided_store@var{m}}
+Store a vector of mode @var{m} into several distinct memory locations.
+Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
+Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
+For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
+Similar to mask_len_store, the instruction stores at most (operand 4 + operand 5) elements of (operand 2) to memory.
+Bit @var{i} of the mask is set if element @var{i} of (operand 2) should be stored.
+Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,

Thanks.
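
(As a cross-check of the operand numbering above, a scalar C model of both IFNs with this final layout; illustrative only, with a fixed int32_t element type and hypothetical names.)

#include <stdint.h>
#include <stddef.h>

/* mask_len_strided_load: operands 0-5 are dest, base, stride, mask,
   len and bias.  Inactive elements of the result are zeroed.  */
static void
strided_load_model (int32_t *dest, const char *base, ptrdiff_t stride,
                    const unsigned char *mask, long len, long bias)
{
  for (long i = 0; i < len + bias; i++)
    dest[i] = mask[i] ? *(const int32_t *) (base + i * stride) : 0;
}

/* mask_len_strided_store: operands 0-5 are base, stride, data, mask,
   len and bias.  Inactive elements are not written.  */
static void
strided_store_model (char *base, ptrdiff_t stride, const int32_t *data,
                     const unsigned char *mask, long len, long bias)
{
  for (long i = 0; i < len + bias; i++)
    if (mask[i])
      *(int32_t *) (base + i * stride) = data[i];
}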



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-11-03 15:40
To: juzhe.zhong@rivai.ai
CC: gcc-patches; jeffreyalaw; richard.sandiford; Robin Dapp
Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:
 
> Hi, Richi.
> 
> The following is strided load/store doc:
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index fab2513105a..4f0821a291d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5094,6 +5094,22 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
>  be loaded from memory and clear if element @var{i} of the result should be undefined.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> +@item @samp{mask_len_strided_load@var{m}}
> +Load several separate memory locations into a destination vector of mode @var{m}.
> +Operand 0 is a destination vector of mode @var{m}.
> +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> +Operand 3 indicates whether the stride operand is a signed or unsigned number, and operand 4 is the mask operand.
 
we don't need operand 3
 
> +Operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> +For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
> +Operand 2 can be a negative stride if operand 3 is 0.
> +Similar to mask_len_load, the instruction loads at most (operand 5 + operand 6) elements from memory.
> +Bit @var{i} of the mask (operand 4) is set if element @var{i} of the result should
 
Element @var{i}.  I think we don't want to document 'undefined' but
rather match mask_gather_load and say the result should be zero.
 
Similar adjustments below.
 
> +be loaded from memory and clear if element @var{i} of the result should be undefined.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
>  @item @samp{scatter_store@var{m}@var{n}}
>  Store a vector of mode @var{m} into several distinct memory locations.
> @@ -5131,6 +5147,21 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
>  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> +@item @samp{mask_len_strided_store@var{m}}
> +Store a vector of mode @var{m} into several distinct memory locations.
> +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> +Operand 2 indicates whether the stride operand is a signed or unsigned number.
> +Operand 3 is the vector of values that should be stored, which is of mode @var{m}.
> +Operand 4 is the mask operand, operand 5 is the length operand and operand 6 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> +For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
> +Operand 1 can be a negative stride if operand 2 is 0.
> +Similar to mask_len_store, the instruction stores at most (operand 5 + operand 6) elements of (operand 3) to memory.
> +Bit @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
> +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> 
> Does it look reasonable? Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-02 22:34
> To: juzhe.zhong@rivai.ai
> CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Ok. So drop 'scale' and keep signed/unsigned argument, is that right?
>  
> I don't think we need signed/unsigned.  RTL expansion has the signedness
> of the offset argument there and can just extend to the appropriate mode
> to offset a pointer.
>  
> > And I wonder whether I should create the stride_type using size_type_node or ptrdiff_type_node?
> > Which is preferable?
>  
> 'sizetype' - that's the type we require to be used for 
> the POINTER_PLUS_EXPR offset operand.
>  
>  
> > Thanks.
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-11-02 22:27
> > To: juzhe.zhong@rivai.ai
> > CC: gcc-patches; Jeff Law; richard.sandiford; rdapp.gcc
> > Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > On Thu, 2 Nov 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richi.
> > > 
> > > >> Do we really need to have two modes for the optab though or could we
> > > >> simply require the target to support arbitrary offset modes (given it
> > > >> is implicitly constrained to ptr_mode for the base already)?  Or
> > > >> properly extend/truncate the offset at expansion time, say to ptr_mode
> > > >> or to the mode of sizetype.
> > > 
> > > For RVV, it's OK to set the stride type to ptr_mode/size_type by default.
> > > Is it OK if I define strided load/store as a single-mode optab with Pmode as the stride operand?
> > > What about the scale and signed/unsigned operands?
> > > It seems the scale operand can be removed, since we can pass DR_STEP directly as the stride argument.
> > > But I think we can't remove the signed/unsigned operand, since for stride mode = SImode the unsigned
> > > maximum stride is 2^31 whereas the signed one is 2^30.
> >  
> > On the GIMPLE side I think we want to have a sizetype operand and
> > indeed drop 'scale', the sizetype operand should be readily available.
> >  
> > > 
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > >  
> > > From: Richard Biener
> > > Date: 2023-11-02 21:52
> > > To: Juzhe-Zhong
> > > CC: gcc-patches; jeffreyalaw; richard.sandiford; rdapp.gcc
> > > Subject: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> > > On Tue, 31 Oct 2023, Juzhe-Zhong wrote:
> > >  
> > > > As Richard previously suggested, we should support strided load/store in the
> > > > loop vectorizer instead of hacking the RISC-V backend.
> > > > 
> > > > This patch adds MASK_LEN_STRIDED LOAD/STORE OPTABS/IFN.
> > > > 
> > > > The GIMPLE IR is the same as mask_len_gather_load/mask_len_scatter_store, but with
> > > > the vector offset changed into a scalar stride.
> > >  
> > > I see that it follows gather/scatter.  I'll note that when introducing
> > > those we failed to add a specifier for TBAA and alignment info for the
> > > data access.  That means we have to use alias-set zero for the accesses
> > > (I see existing targets use UNSPECs with a not further elaborated MEM anyway,
> > > but TBAA info might have been the "easy" and obvious property to 
> > > preserve).  For alignment we either have to assume unaligned or reject
> > > vectorization of accesses that do not have their original scalar accesses
> > > naturally aligned (aligned according to their mode).  We don't seem
> > > to check that though.
> > >  
> > > It might be fine to go forward with this since gather/scatter are broken
> > > in a similar way.
> > >  
> > > Do we really need to have two modes for the optab though or could we
> > > simply require the target to support arbitrary offset modes (given it
> > > is implicitly constrained to ptr_mode for the base already)?  Or
> > > properly extend/truncate the offset at expansion time, say to ptr_mode
> > > or to the mode of sizetype.
> > >  
> > > Thanks,
> > > Richard.
> > > > We don't have strided_load/strided_store and mask_strided_load/mask_strided_store since
> > > > it's unlikely RVV will have such optabs, and we can't add patterns that we can't test.
> > > >
> > > > 
> > > > gcc/ChangeLog:
> > > > 
> > > > * doc/md.texi: Add mask_len_strided_load/mask_len_strided_store.
> > > > * internal-fn.cc (internal_load_fn_p): Ditto.
> > > > (internal_strided_fn_p): Ditto.
> > > > (internal_fn_len_index): Ditto.
> > > > (internal_fn_mask_index): Ditto.
> > > > (internal_fn_stored_value_index): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * internal-fn.def (MASK_LEN_STRIDED_LOAD): Ditto.
> > > > (MASK_LEN_STRIDED_STORE): Ditto.
> > > > * internal-fn.h (internal_strided_fn_p): Ditto.
> > > > (internal_strided_fn_supported_p): Ditto.
> > > > * optabs.def (OPTAB_CD): Ditto.
> > > > 
> > > > ---
> > > >  gcc/doc/md.texi     | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.cc  | 44 ++++++++++++++++++++++++++++++++++++++
> > > >  gcc/internal-fn.def |  4 ++++
> > > >  gcc/internal-fn.h   |  2 ++
> > > >  gcc/optabs.def      |  2 ++
> > > >  5 files changed, 103 insertions(+)
> > > > 
> > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > index fab2513105a..5bac713a0dd 100644
> > > > --- a/gcc/doc/md.texi
> > > > +++ b/gcc/doc/md.texi
> > > > @@ -5094,6 +5094,32 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
> > > >  be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_load@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_load@var{m}@var{n}}
> > > > +Load several separate memory locations into a destination vector of mode @var{m}.
> > > > +Operand 0 is a destination vector of mode @var{m}.
> > > > +Operand 1 is a scalar base address and operand 2 is a scalar stride of mode @var{n}.
> > > > +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 3 is 1 and sign extension if operand 3 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 4;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +load the value at that address (operand 1 + @var{i} * multiplied and extended stride) into element @var{i} of operand 0.
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_load, the instruction loads at most (operand 6 + operand 7) elements from memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of the result should
> > > > +be loaded from memory and clear if element @var{i} of the result should be undefined.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
> > > >  @item @samp{scatter_store@var{m}@var{n}}
> > > >  Store a vector of mode @var{m} into several distinct memory locations.
> > > > @@ -5131,6 +5157,31 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > >  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > >  
> > > > +@cindex @code{mask_len_strided_store@var{m}@var{n}} instruction pattern
> > > > +@item @samp{mask_len_strided_store@var{m}@var{n}}
> > > > +Store a vector of mode m into several distinct memory locations.
> > > > +Operand 0 is a scalar base address and operand 1 is scalar stride of mode @var{n}.
> > > > +Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
> > > > +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> > > > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > > > +For each element index i:
> > > > +
> > > > +@itemize @bullet
> > > > +@item
> > > > +extend the stride to address width, using zero
> > > > +extension if operand 2 is 1 and sign extension if operand 2 is zero;
> > > > +@item
> > > > +multiply the extended stride by operand 3;
> > > > +@item
> > > > +add the result to the base; and
> > > > +@item
> > > > +store element @var{i} of operand 4 to that address (operand 1 + @var{i} * multiplied and extended stride).
> > > > +@end itemize
> > > > +
> > > > +Similar to mask_len_store, the instruction stores at most (operand 6 + operand 7) elements of (operand 4) to memory.
> > > > +Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> > > > +Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > > > +
> > > >  @cindex @code{vec_set@var{m}} instruction pattern
> > > >  @item @samp{vec_set@var{m}}
> > > >  Set given field in the vector value.  Operand 0 is the vector to modify,
> > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > index e7451b96353..f7f85aa7dde 100644
> > > > --- a/gcc/internal-fn.cc
> > > > +++ b/gcc/internal-fn.cc
> > > > @@ -4596,6 +4596,7 @@ internal_load_fn_p (internal_fn fn)
> > > >      case IFN_GATHER_LOAD:
> > > >      case IFN_MASK_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > >      case IFN_LEN_LOAD:
> > > >      case IFN_MASK_LEN_LOAD:
> > > >        return true;
> > > > @@ -4648,6 +4649,22 @@ internal_gather_scatter_fn_p (internal_fn fn)
> > > >      }
> > > >  }
> > > >  
> > > > +/* Return true if IFN is some form of strided load or strided store.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_p (internal_fn fn)
> > > > +{
> > > > +  switch (fn)
> > > > +    {
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > > +      return true;
> > > > +
> > > > +    default:
> > > > +      return false;
> > > > +    }
> > > > +}
> > > > +
> > > >  /* If FN takes a vector len argument, return the index of that argument,
> > > >     otherwise return -1.  */
> > > >  
> > > > @@ -4662,6 +4679,8 @@ internal_fn_len_index (internal_fn fn)
> > > >  
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >      case IFN_COND_LEN_FMA:
> > > >      case IFN_COND_LEN_FMS:
> > > >      case IFN_COND_LEN_FNMA:
> > > > @@ -4719,6 +4738,8 @@ internal_fn_mask_index (internal_fn fn)
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_GATHER_LOAD:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_LOAD:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 4;
> > > >  
> > > >      default:
> > > > @@ -4740,6 +4761,7 @@ internal_fn_stored_value_index (internal_fn fn)
> > > >      case IFN_SCATTER_STORE:
> > > >      case IFN_MASK_SCATTER_STORE:
> > > >      case IFN_MASK_LEN_SCATTER_STORE:
> > > > +    case IFN_MASK_LEN_STRIDED_STORE:
> > > >        return 3;
> > > >  
> > > >      case IFN_LEN_STORE:
> > > > @@ -4784,6 +4806,28 @@ internal_gather_scatter_fn_supported_p (internal_fn ifn, tree vector_type,
> > > >    && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > >  }
> > > >  
> > > > +/* Return true if the target supports strided load or strided store function
> > > > +   IFN.  For loads, VECTOR_TYPE is the vector type of the load result,
> > > > +   while for stores it is the vector type of the stored data argument.
> > > > +   STRIDE_TYPE is the type that holds the stride from the previous element
> > > > +   memory address of each loaded or stored element.
> > > > +   SCALE is the amount by which these stride should be multiplied
> > > > +   *after* they have been extended to address width.  */
> > > > +
> > > > +bool
> > > > +internal_strided_fn_supported_p (internal_fn ifn, tree vector_type,
> > > > + tree offset_type, int scale)
> > > > +{
> > > > +  optab optab = direct_internal_fn_optab (ifn);
> > > > +  insn_code icode = convert_optab_handler (optab, TYPE_MODE (vector_type),
> > > > +    TYPE_MODE (offset_type));
> > > > +  int output_ops = internal_load_fn_p (ifn) ? 1 : 0;
> > > > +  bool unsigned_p = TYPE_UNSIGNED (offset_type);
> > > > +  return (icode != CODE_FOR_nothing
> > > > +   && insn_operand_matches (icode, 2 + output_ops, GEN_INT (unsigned_p))
> > > > +   && insn_operand_matches (icode, 3 + output_ops, GEN_INT (scale)));
> > > > +}
> > > > +
> > > >  /* Return true if the target supports IFN_CHECK_{RAW,WAR}_PTRS function IFN
> > > >     for pointers of type TYPE when the accesses have LENGTH bytes and their
> > > >     common byte alignment is ALIGN.  */
> > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > index a2023ab9c3d..0fa532e8f6b 100644
> > > > --- a/gcc/internal-fn.def
> > > > +++ b/gcc/internal-fn.def
> > > > @@ -199,6 +199,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
> > > >         mask_gather_load, gather_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
> > > >         mask_len_gather_load, gather_load)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> > > > +        mask_len_strided_load, gather_load)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> > > > @@ -208,6 +210,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> > > >         mask_scatter_store, scatter_store)
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
> > > >         mask_len_scatter_store, scatter_store)
> > > > +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> > > > +        mask_len_strided_store, scatter_store)
> > > >  
> > > >  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
> > > >  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> > > > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > > > index 99de13a0199..8379c61dff7 100644
> > > > --- a/gcc/internal-fn.h
> > > > +++ b/gcc/internal-fn.h
> > > > @@ -235,11 +235,13 @@ extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
> > > >  extern bool internal_load_fn_p (internal_fn);
> > > >  extern bool internal_store_fn_p (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_p (internal_fn);
> > > > +extern bool internal_strided_fn_p (internal_fn);
> > > >  extern int internal_fn_mask_index (internal_fn);
> > > >  extern int internal_fn_len_index (internal_fn);
> > > >  extern int internal_fn_stored_value_index (internal_fn);
> > > >  extern bool internal_gather_scatter_fn_supported_p (internal_fn, tree,
> > > >      tree, tree, int);
> > > > +extern bool internal_strided_fn_supported_p (internal_fn, tree, tree, int);
> > > >  extern bool internal_check_ptrs_fn_supported_p (internal_fn, tree,
> > > >  poly_uint64, unsigned int);
> > > >  #define VECT_PARTIAL_BIAS_UNSUPPORTED 127
> > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > index 2ccbe4197b7..3d85ac5f678 100644
> > > > --- a/gcc/optabs.def
> > > > +++ b/gcc/optabs.def
> > > > @@ -98,9 +98,11 @@ OPTAB_CD(mask_len_store_optab, "mask_len_store$a$b")
> > > >  OPTAB_CD(gather_load_optab, "gather_load$a$b")
> > > >  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
> > > >  OPTAB_CD(mask_len_gather_load_optab, "mask_len_gather_load$a$b")
> > > > +OPTAB_CD(mask_len_strided_load_optab, "mask_len_strided_load$a$b")
> > > >  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> > > >  OPTAB_CD(mask_scatter_store_optab, "mask_scatter_store$a$b")
> > > >  OPTAB_CD(mask_len_scatter_store_optab, "mask_len_scatter_store$a$b")
> > > > +OPTAB_CD(mask_len_strided_store_optab, "mask_len_strided_store$a$b")
> > > >  OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
> > > >  OPTAB_CD(vec_init_optab, "vec_init$a$b")
> > > >  
> > > > 
> > >  
> > > 
> >  
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-03  7:49               ` juzhe.zhong
@ 2023-11-03  8:27                 ` Richard Biener
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2023-11-03  8:27 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richi.
> 
> >> Element @var{i}.  I think we don't want to document 'undefined' but
> >> rather match mask_gather_load and say the result should be zero.
> 
> I am confused here: will the vectorizer depend on the inactive value being zero?

I don't think it does, but why be inconsistent with gather and at
the same time document that it behaves like gather?

> Since RVV load/store doesn't touch the inactive (tail or unmasked) elements, I am wondering whether that will cause issues.

For loads, not zeroing will cause a false dependence on the previous value
of the register, so I doubt this is optimal behavior.  For stores, the
inactive elements are of course not stored.
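
For concreteness, a scalar reference model of the zeroing semantics being
suggested here; this is only a sketch with illustrative names and an int
payload, not the vectorizer's actual expansion:

  /* Reference semantics sketch: active elements are loaded from
     base + i * stride, inactive elements are zeroed, so the result
     carries no dependence on the destination's previous contents.  */
  void
  ref_mask_len_strided_load (int *dest, const char *base, long stride,
                             const unsigned char *mask, int len, int bias,
                             int vl)
  {
    for (int i = 0; i < vl; i++)
      if (i < len + bias && mask[i])
        dest[i] = *(const int *) (base + i * stride);
      else
        dest[i] = 0;
  }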

> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-11-03 15:40
> To: juzhe.zhong@rivai.ai
> CC: gcc-patches; jeffreyalaw; richard.sandiford; Robin Dapp
> Subject: Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
> On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Hi, Richi.
> > 
> > The following is strided load/store doc:
> > 
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index fab2513105a..4f0821a291d 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5094,6 +5094,22 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
> >  be loaded from memory and clear if element @var{i} of the result should be undefined.
> >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > 
> > +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> > +@item @samp{mask_len_strided_load@var{m}}
> > +Load several separate memory locations into a destination vector of mode @var{m}.
> > +Operand 0 is a destination vector of mode @var{m}.
> > +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> > +Operand 3 indicates whether the stride operand is a signed or unsigned number and operand 4 is the mask operand.
>  
> we don't need operand 3
>  
> > +Operand 5 is the length operand and operand 6 is the bias operand.
> > +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> > +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> > +For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
> > +Operand 2 can be a negative stride if operand 3 is 0.
> > +Similar to mask_len_load, the instruction loads at most (operand 5 + operand 6) elements from memory.
> > +Bit @var{i} of the mask (operand 4) is set if element @var{i} of the result should
>  
> Element @var{i}.  I think we don't want to document 'undefined' but
> rather match mask_gather_load and say the result should be zero.
>  
> Similar adjustments below.
>  
> > +be loaded from memory and clear if element @var{i} of the result should be undefined.
> > +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> > +
> >  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
> >  @item @samp{scatter_store@var{m}@var{n}}
> >  Store a vector of mode @var{m} into several distinct memory locations.
> > @@ -5131,6 +5147,21 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
> >  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
> >  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> > 
> > +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> > +@item @samp{mask_len_strided_store@var{m}}
> > +Store a vector of mode @var{m} into several distinct memory locations.
> > +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> > +Operand 2 indicates whether the stride operand is a signed or unsigned number.
> > +Operand 3 is the vector of values that should be stored, which is of mode @var{m}.
> > +Operand 4 is the mask operand, operand 5 is the length operand and operand 6 is the bias operand.
> > +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> > +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> > +For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
> > +Operand 1 can be a negative stride if operand 2 is 0.
> > +Similar to mask_len_store, the instruction stores at most (operand 5 + operand 6) elements of (operand 3) to memory.
> > +Bit @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
> > +Mask elements @var{i} with @var{i} > (operand 5 + operand 6) are ignored.
> > +
> >  @cindex @code{vec_set@var{m}} instruction pattern
> >  @item @samp{vec_set@var{m}}
> >  Set given field in the vector value.  Operand 0 is the vector to modify,
> > 
> > Does it look reasonable?  Thanks.
> > 
> > 
> > 
> > juzhe.zhong@rivai.ai
> > [remainder of the earlier thread quoted in full; snipped]

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN
  2023-11-03  8:18               ` juzhe.zhong
@ 2023-11-03  8:29                 ` Richard Biener
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2023-11-03  8:29 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc-patches, jeffreyalaw, richard.sandiford, Robin Dapp

On Fri, 3 Nov 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richi.
> 
> Is the following reasonable?
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index fab2513105a..8d1cdad1607 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5094,6 +5094,20 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
>  be loaded from memory and clear if element @var{i} of the result should be undefined.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> +@item @samp{mask_len_strided_load@var{m}}
> +Load several separate memory locations into a destination vector of mode @var{m}.
> +Operand 0 is a destination vector of mode @var{m}.
> +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> +Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> +For each element index @var{i}, the load address is operand 1 + @var{i} * operand 2.
> +Similar to mask_len_load, the instruction loads at most (operand 4 + operand 5) elements from memory.
> +Bit @var{i} of the mask (operand 3) is set if element @var{i} of the result should

Element @var{i} of the mask

> +be loaded from memory and clear if element @var{i} of the result should be zero.
> +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> +
>  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
>  @item @samp{scatter_store@var{m}@var{n}}
>  Store a vector of mode @var{m} into several distinct memory locations.
> @@ -5131,6 +5145,19 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
>  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
> 
> +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> +@item @samp{mask_len_strided_store@var{m}}
> +Store a vector of mode @var{m} into several distinct memory locations.
> +Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
> +Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
> +Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
> +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
> +For each element index @var{i}, the store address is operand 0 + @var{i} * operand 1.
> +Similar to mask_len_store, the instruction stores at most (operand 4 + operand 5) elements of (operand 2) to memory.
> +Bit @var{i} of the mask is set if element @var{i} of (operand 2) should be stored.

Element @var{i}

Otherwise looks OK

> +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> 
> Thanks.
> 
> 
> 
> juzhe.zhong@rivai.ai
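
With the operand layout proposed above, the corresponding GIMPLE
internal-function calls would take roughly this shape (a sketch with
illustrative names, assuming the len/bias operands follow the mask as
in the md.texi text):

  dest_vec = .MASK_LEN_STRIDED_LOAD (base, stride, mask, len, bias);
  .MASK_LEN_STRIDED_STORE (base, stride, stored_vals, mask, len, bias);
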
> [remainder of the earlier thread quoted in full; snipped]

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-11-03  8:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-31  9:59 [PATCH V2] OPTABS/IFN: Add mask_len_strided_load/mask_len_strided_store OPTABS/IFN Juzhe-Zhong
2023-10-31 14:08 ` Li, Pan2
2023-11-02 13:52 ` Richard Biener
2023-11-02 14:23   ` 钟居哲
2023-11-02 14:27     ` Richard Biener
2023-11-02 14:29       ` 钟居哲
2023-11-02 14:34         ` Richard Biener
2023-11-03  7:10           ` juzhe.zhong
2023-11-03  7:40             ` Richard Biener
2023-11-03  7:49               ` juzhe.zhong
2023-11-03  8:27                 ` Richard Biener
2023-11-03  8:18               ` juzhe.zhong
2023-11-03  8:29                 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).