public inbox for gcc-patches@gcc.gnu.org
* [PATCH] move x86 to use gather/scatter internal functions
@ 2021-08-17 13:26 Richard Biener
  2021-08-17 14:42 ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2021-08-17 13:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford, hongtao.liu, ubizjak

This is an attempt to start moving the x86 backend to use
standard pattern names for [mask_]gather_load and [mask_]scatter_store
rather than using the builtin_{gather,scatter} target hooks.

I've started with AVX2 gathers, and given that x86 only supports masked
gathers I have only implemented mask_gather_load.  Note that while for
the builtin_gather case the vectorizer will provide an all-true
mask operand for non-masked gathers, this capability does not
exist for the IFN path yet, so only testcases with actual masked
gathers will work.

If this looks reasonable on the backend side I'll first see to
completing the vectorizer part, ripping out the target hook and
arranging for the missing pieces.  Another missing piece is support
for SImode indices with DFmode data, which requires unpacking
the index vector and actually recognizing the IFN.
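
To illustrate, a loop of the following shape contains an actual masked
gather with SImode indices and SFmode data (the snippet is only a
sketch for illustration, not one of the testcases used here):

void
f (float *restrict dst, float *restrict src, int *restrict idx, int n)
{
  for (int i = 0; i < n; ++i)
    /* The guarded indexed load can only be vectorized via a masked
       gather, so this exercises the mask_gather_load path.  */
    if (idx[i] >= 0)
      dst[i] = src[idx[i]];
}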

2021-08-17  Richard Biener  <rguenther@suse.de>

	* tree-vect-data-refs.c (vect_check_gather_scatter):
	Always use internal functions.
	* config/i386/sse.md
	(mask_gather_load<mode><vec_gather_idxsi>): New expander.
	(mask_gather_load<mode><vec_gather_idxdi>): Likewise.
---
 gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
 gcc/tree-vect-data-refs.c |  4 +--
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 3957c86c3df..40bec98d9f7 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -23232,12 +23232,22 @@
 		       (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
 		       (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
 		       (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
+(define_mode_attr vec_gather_idxsi
+		      [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
+		       (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
+		       (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
+		       (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
 
 (define_mode_attr VEC_GATHER_IDXDI
 		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
 		       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
 		       (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
 		       (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
+(define_mode_attr vec_gather_idxdi
+		      [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
+		       (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
+		       (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
+		       (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
 
 (define_mode_attr VEC_GATHER_SRCDI
 		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
@@ -23245,6 +23255,29 @@
 		       (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
 		       (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
 
+(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
+  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
+		   (unspec:VEC_GATHER_MODE
+		     [(pc)
+		      (mem:<ssescalarmode>
+			(match_par_dup 6
+			  [(match_operand 1 "vsib_address_operand")
+			   (match_operand:<VEC_GATHER_IDXSI>
+			      2 "register_operand")
+			   (match_operand:SI 4 "const1248_operand")
+			   (match_operand:SI 3 "const0_operand")]))
+		      (mem:BLK (scratch))
+		      (match_operand:<sseintvecmode> 5 "register_operand")]
+		     UNSPEC_GATHER))
+	      (clobber (match_scratch:VEC_GATHER_MODE 7))])]
+  "TARGET_AVX2 && TARGET_USE_GATHER"
+{
+  operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
+  operands[6]
+    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
+					operands[5]), UNSPEC_VSIBADDR);
+})
+
 (define_expand "avx2_gathersi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
 		   (unspec:VEC_GATHER_MODE
@@ -23306,6 +23339,29 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
+  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
+		   (unspec:VEC_GATHER_MODE
+		     [(pc)
+		      (mem:<ssescalarmode>
+			(match_par_dup 6
+			  [(match_operand 1 "vsib_address_operand")
+			   (match_operand:<VEC_GATHER_IDXDI>
+			      2 "register_operand")
+			   (match_operand:SI 4 "const1248_operand")
+			   (match_operand:SI 3 "const0_operand")]))
+		      (mem:BLK (scratch))
+		      (match_operand:<sseintvecmode> 5 "register_operand")]
+		     UNSPEC_GATHER))
+	      (clobber (match_scratch:VEC_GATHER_MODE 7))])]
+  "TARGET_AVX2 && TARGET_USE_GATHER"
+{
+  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
+  operands[6]
+    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[1], operands[2],
+					operands[4]), UNSPEC_VSIBADDR);
+})
+
 (define_expand "avx2_gatherdi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
 		   (unspec:VEC_GATHER_MODE
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index d594c0a1b1e..67564b45c32 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -3824,9 +3824,7 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
 
   /* True if we should aim to use internal functions rather than
      built-in functions.  */
-  bool use_ifn_p = (DR_IS_READ (dr)
-		    ? supports_vec_gather_load_p ()
-		    : supports_vec_scatter_store_p ());
+  bool use_ifn_p = true;
 
   base = DR_REF (dr);
   /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
-- 
2.31.1


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-17 13:26 [PATCH] move x86 to use gather/scatter internal functions Richard Biener
@ 2021-08-17 14:42 ` Richard Biener
  2021-08-18  3:12   ` Hongtao Liu
  2021-08-18  3:24   ` Hongtao Liu
  0 siblings, 2 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-17 14:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Richard Sandiford, Hongtao Liu

On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This is an attempt to start moving the x86 backend to use
> standard pattern names for [mask_]gather_load and [mask_]scatter_store
> rather than using the builtin_{gather,scatter} target hooks.
>
> I've started with AVX2 gathers, and given that x86 only supports masked
> gathers I have only implemented mask_gather_load.  Note that while for
> the builtin_gather case the vectorizer will provide an all-true
> mask operand for non-masked gathers, this capability does not
> exist for the IFN path yet, so only testcases with actual masked
> gathers will work.
>
> If this looks reasonable on the backend side I'll first see to
> completing the vectorizer part, ripping out the target hook and
> arranging for the missing pieces.  Another missing piece is support
> for SImode indices with DFmode data, which requires unpacking
> the index vector and actually recognizing the IFN.
>
> 2021-08-17  Richard Biener  <rguenther@suse.de>
>
>         * tree-vect-data-refs.c (vect_check_gather_scatter):
>         Always use internal functions.
>         * config/i386/sse.md
>         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
>         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> ---
>  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vect-data-refs.c |  4 +--
>  2 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 3957c86c3df..40bec98d9f7 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -23232,12 +23232,22 @@
>                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
>                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
>                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> +(define_mode_attr vec_gather_idxsi
> +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
>
>  (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> +(define_mode_attr vec_gather_idxdi
> +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
>
>  (define_mode_attr VEC_GATHER_SRCDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> @@ -23245,6 +23255,29 @@
>                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
>                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
>
> +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> +                  (unspec:VEC_GATHER_MODE
> +                    [(pc)
> +                     (mem:<ssescalarmode>
> +                       (match_par_dup 6
> +                         [(match_operand 1 "vsib_address_operand")
> +                          (match_operand:<VEC_GATHER_IDXSI>
> +                             2 "register_operand")
> +                          (match_operand:SI 4 "const1248_operand")
> +                          (match_operand:SI 3 "const0_operand")]))
> +                     (mem:BLK (scratch))
> +                     (match_operand:<sseintvecmode> 5 "register_operand")]

One problem with these is that when AVX512[VL] is enabled we get an AVX512
mask mode here, and while the internal function expansion check succeeds
(it never checks the mask operand!), RTL expansion fails unexpectedly
because of this mismatch.

I suppose a more complicated define_mode_attr might do the trick, or do I
need to add && !TARGET_AVX512F to these expanders?

I've meanwhile posted a patch to make the vectorizer fall back to
masked_ variants when non-masked_ variants are not available, and that
seems to work fine at least.

Richard.

> +                    UNSPEC_GATHER))
> +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> +  operands[6]
> +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
> +                                       operands[5]), UNSPEC_VSIBADDR);
> +})
> +
>  (define_expand "avx2_gathersi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
> @@ -23306,6 +23339,29 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "<sseinsnmode>")])
>
> +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> +                  (unspec:VEC_GATHER_MODE
> +                    [(pc)
> +                     (mem:<ssescalarmode>
> +                       (match_par_dup 6
> +                         [(match_operand 1 "vsib_address_operand")
> +                          (match_operand:<VEC_GATHER_IDXDI>
> +                             2 "register_operand")
> +                          (match_operand:SI 4 "const1248_operand")
> +                          (match_operand:SI 3 "const0_operand")]))
> +                     (mem:BLK (scratch))
> +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> +                    UNSPEC_GATHER))
> +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> +  operands[6]
> +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[1], operands[2],
> +                                       operands[4]), UNSPEC_VSIBADDR);
> +})
> +
>  (define_expand "avx2_gatherdi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index d594c0a1b1e..67564b45c32 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -3824,9 +3824,7 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>
>    /* True if we should aim to use internal functions rather than
>       built-in functions.  */
> -  bool use_ifn_p = (DR_IS_READ (dr)
> -                   ? supports_vec_gather_load_p ()
> -                   : supports_vec_scatter_store_p ());
> +  bool use_ifn_p = true;
>
>    base = DR_REF (dr);
>    /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
> --
> 2.31.1


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-17 14:42 ` Richard Biener
@ 2021-08-18  3:12   ` Hongtao Liu
  2021-08-18  3:24   ` Hongtao Liu
  1 sibling, 0 replies; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18  3:12 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > This is an attempt to start moving the x86 backend to use
> > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > rather than using the builtin_{gather,scatter} target hooks.
> >
> > I've started with AVX2 gathers, and given that x86 only supports masked
> > gathers I have only implemented mask_gather_load.  Note that while for
> > the builtin_gather case the vectorizer will provide an all-true
> > mask operand for non-masked gathers, this capability does not
> > exist for the IFN path yet, so only testcases with actual masked
> > gathers will work.
> >
> > If this looks reasonable on the backend side I'll first see to
> > completing the vectorizer part, ripping out the target hook and
> > arranging for the missing pieces.  Another missing piece is support
> > for SImode indices with DFmode data, which requires unpacking
> > the index vector and actually recognizing the IFN.
> >
> > 2021-08-17  Richard Biener  <rguenther@suse.de>
> >
> >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> >         Always use internal functions.
> >         * config/i386/sse.md
> >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > ---
> >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> >  gcc/tree-vect-data-refs.c |  4 +--
> >  2 files changed, 57 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 3957c86c3df..40bec98d9f7 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -23232,12 +23232,22 @@
> >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > +(define_mode_attr vec_gather_idxsi
> > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> >
> >  (define_mode_attr VEC_GATHER_IDXDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > +(define_mode_attr vec_gather_idxdi
> > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> >
> >  (define_mode_attr VEC_GATHER_SRCDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > @@ -23245,6 +23255,29 @@
> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > +                  (unspec:VEC_GATHER_MODE
> > +                    [(pc)
> > +                     (mem:<ssescalarmode>
> > +                       (match_par_dup 6
> > +                         [(match_operand 1 "vsib_address_operand")
> > +                          (match_operand:<VEC_GATHER_IDXSI>
> > +                             2 "register_operand")
> > +                          (match_operand:SI 4 "const1248_operand")
> > +                          (match_operand:SI 3 "const0_operand")]))
> > +                     (mem:BLK (scratch))
> > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
>
> One problem with these is that when AVX512[VL] is enabled we get an AVX512
> mask mode here, and while the internal function expansion check succeeds
> (it never checks the mask operand!), RTL expansion fails unexpectedly
> because of this mismatch.
>
‘mask_gather_loadmn’
Like ‘gather_loadmn’, but takes an extra mask operand as operand 5. Bit i of
the mask is set if element i of the result should be loaded from memory and
clear if element i of the result should be set to zero.

According to the documentation, the mask here needs to be an AVX512 mask;
would it be OK to use a vector mask?
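
In scalar pseudocode the documented semantics amount to roughly the
following (just a sketch; N and the operand names are illustrative):

  /* Rough scalar model of mask_gather_load as documented above:
     masked-off elements of the result are zeroed.  */
  for (int i = 0; i < N; ++i)
    result[i] = mask[i] ? *(base + offset[i] * scale) : 0;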

> I suppose a more complicated define_mode_attr might do the trick, or do I
> need to add && !TARGET_AVX512F to these expanders?
>
> I've meanwhile posted a patch to make the vectorizer fall back to
> masked_ variants when non-masked_ variants are not available, and that
> seems to work fine at least.

‘gather_loadmn’
Load several separate memory locations into a vector of mode m. Operand 1
is a scalar base address and operand 2 is a vector of mode n containing offsets
from that base. Operand 0 is a destination vector with the same number of
elements as n. For each element index i:

Also it seems operand 0 should have the same number of elements as
operand 2, but vpgatherdq has v2di as operand 0 and v4si as operand 2.
Considering we already have
"gather_load<mode><VEC_GATHER_IDXSI_lower>", either the middle end does
not check for matching element counts or the existing expander
is not used.

BTW, you can define mask_gather_load<mode><vec_gather_idxsi> like
below and emit gen_avx2_gathersi<mode> directly:

(define_expand "mask_gather_load<mode><VEC_GATHER_IDXSI_lower>"
  [(match_operand:VEC_GATHER_MODE 0 "register_operand")
   (match_operand 1 "vsib_address_operand")
   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
   (match_operand:SI 3 "const0_operand")
   (match_operand:SI 4 "const1248_operand")
   (match_operand:<sseintvecmode> 5 "register_operand")]
  "TARGET_AVX2 && TARGET_USE_GATHER"
{
  /* Reinterpret the integer vector mask in the gather's data mode.  */
  operands[5] = lowpart_subreg (<MODE>mode, operands[5],
				<sseintvecmode>mode);
  /* According to the documentation, src here needs to be const0_rtx.  */
  emit_insn (gen_avx2_gathersi<mode> (operands[0], CONST0_RTX (<MODE>mode),
				      operands[1], operands[2],
				      operands[5], operands[4]));
  DONE;
})

> Richard.
>
> > +                    UNSPEC_GATHER))
> > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > +  operands[6]
> > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
> > +                                       operands[5]), UNSPEC_VSIBADDR);
> > +})
> > +
> >  (define_expand "avx2_gathersi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                    (unspec:VEC_GATHER_MODE
> > @@ -23306,6 +23339,29 @@
> >     (set_attr "prefix" "vex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > +                  (unspec:VEC_GATHER_MODE
> > +                    [(pc)
> > +                     (mem:<ssescalarmode>
> > +                       (match_par_dup 6
> > +                         [(match_operand 1 "vsib_address_operand")
> > +                          (match_operand:<VEC_GATHER_IDXDI>
> > +                             2 "register_operand")
> > +                          (match_operand:SI 4 "const1248_operand")
> > +                          (match_operand:SI 3 "const0_operand")]))
> > +                     (mem:BLK (scratch))
> > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > +                    UNSPEC_GATHER))
> > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > +  operands[6]
> > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[1], operands[2],
> > +                                       operands[4]), UNSPEC_VSIBADDR);
> > +})
> > +
> >  (define_expand "avx2_gatherdi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                    (unspec:VEC_GATHER_MODE
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index d594c0a1b1e..67564b45c32 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -3824,9 +3824,7 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> >
> >    /* True if we should aim to use internal functions rather than
> >       built-in functions.  */
> > -  bool use_ifn_p = (DR_IS_READ (dr)
> > -                   ? supports_vec_gather_load_p ()
> > -                   : supports_vec_scatter_store_p ());
> > +  bool use_ifn_p = true;
> >
> >    base = DR_REF (dr);
> >    /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
> > --
> > 2.31.1



-- 
BR,
Hongtao


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-17 14:42 ` Richard Biener
  2021-08-18  3:12   ` Hongtao Liu
@ 2021-08-18  3:24   ` Hongtao Liu
  2021-08-18  3:29     ` Hongtao Liu
  1 sibling, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18  3:24 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > This is an attempt to start moving the x86 backend to use
> > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > rather than using the builtin_{gather,scatter} target hooks.
> >
> > I've started with AVX2 gathers, and given that x86 only supports masked
> > gathers I have only implemented mask_gather_load.  Note that while for
> > the builtin_gather case the vectorizer will provide an all-true
> > mask operand for non-masked gathers, this capability does not
> > exist for the IFN path yet, so only testcases with actual masked
> > gathers will work.
> >
> > If this looks reasonable on the backend side I'll first see to
> > completing the vectorizer part, ripping out the target hook and
> > arranging for the missing pieces.  Another missing piece is support
> > for SImode indices with DFmode data, which requires unpacking
> > the index vector and actually recognizing the IFN.
> >
> > 2021-08-17  Richard Biener  <rguenther@suse.de>
> >
> >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> >         Always use internal functions.
> >         * config/i386/sse.md
> >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > ---
> >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> >  gcc/tree-vect-data-refs.c |  4 +--
> >  2 files changed, 57 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 3957c86c3df..40bec98d9f7 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -23232,12 +23232,22 @@
> >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > +(define_mode_attr vec_gather_idxsi
> > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> >
> >  (define_mode_attr VEC_GATHER_IDXDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > +(define_mode_attr vec_gather_idxdi
> > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> >
> >  (define_mode_attr VEC_GATHER_SRCDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > @@ -23245,6 +23255,29 @@
> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > +                  (unspec:VEC_GATHER_MODE
> > +                    [(pc)
> > +                     (mem:<ssescalarmode>
> > +                       (match_par_dup 6
> > +                         [(match_operand 1 "vsib_address_operand")
> > +                          (match_operand:<VEC_GATHER_IDXSI>
> > +                             2 "register_operand")
> > +                          (match_operand:SI 4 "const1248_operand")
> > +                          (match_operand:SI 3 "const0_operand")]))
> > +                     (mem:BLK (scratch))
> > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
>
> One problem with these is that when AVX512[VL] is enabled we get an AVX512
> mask mode here, and while the internal function expansion check succeeds
> (it never checks the mask operand!), RTL expansion fails unexpectedly
> because of this mismatch.
>
> I suppose a more complicated define_mode_attr might do the trick, or do I
I don't think define_mode_attr supports conditional selection based on
target feature.
> need to add && !TARGET_AVX512F to these expanders?
There'll be a duplicated definition of mask_gather_loadv8sfv8si for the
avx512 and non-avx512 versions.
>
The best way would be, like maskloadmn, to accept different mask modes in
the pattern name, but that may create too much complexity in the middle-end
to choose between the avx2 mask_gather_load and the avx512 mask_gather_load.

> I've meanwhile posted a patch to make the vectorizer fall back to
> masked_ variants when non-masked_ variants are not available, and that
> seems to work fine at least.
>
> Richard.
>
> > +                    UNSPEC_GATHER))
> > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > +  operands[6]
> > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
> > +                                       operands[5]), UNSPEC_VSIBADDR);
> > +})
> > +
> >  (define_expand "avx2_gathersi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                    (unspec:VEC_GATHER_MODE
> > @@ -23306,6 +23339,29 @@
> >     (set_attr "prefix" "vex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > +                  (unspec:VEC_GATHER_MODE
> > +                    [(pc)
> > +                     (mem:<ssescalarmode>
> > +                       (match_par_dup 6
> > +                         [(match_operand 1 "vsib_address_operand")
> > +                          (match_operand:<VEC_GATHER_IDXDI>
> > +                             2 "register_operand")
> > +                          (match_operand:SI 4 "const1248_operand")
> > +                          (match_operand:SI 3 "const0_operand")]))
> > +                     (mem:BLK (scratch))
> > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > +                    UNSPEC_GATHER))
> > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > +  operands[6]
> > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[1], operands[2],
> > +                                       operands[4]), UNSPEC_VSIBADDR);
> > +})
> > +
> >  (define_expand "avx2_gatherdi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                    (unspec:VEC_GATHER_MODE
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index d594c0a1b1e..67564b45c32 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -3824,9 +3824,7 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> >
> >    /* True if we should aim to use internal functions rather than
> >       built-in functions.  */
> > -  bool use_ifn_p = (DR_IS_READ (dr)
> > -                   ? supports_vec_gather_load_p ()
> > -                   : supports_vec_scatter_store_p ());
> > +  bool use_ifn_p = true;
> >
> >    base = DR_REF (dr);
> >    /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
> > --
> > 2.31.1



-- 
BR,
Hongtao


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  3:24   ` Hongtao Liu
@ 2021-08-18  3:29     ` Hongtao Liu
  2021-08-18  8:32       ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18  3:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This is an attempt to start moving the x86 backend to use
> > > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > > rather than using the builtin_{gather,scatter} target hooks.
> > >
> > > I've started with AVX2 gathers, and given that x86 only supports masked
> > > gathers I have only implemented mask_gather_load.  Note that while for
> > > the builtin_gather case the vectorizer will provide an all-true
> > > mask operand for non-masked gathers, this capability does not
> > > exist for the IFN path yet, so only testcases with actual masked
> > > gathers will work.
> > >
> > > If this looks reasonable on the backend side I'll first see to
> > > completing the vectorizer part, ripping out the target hook and
> > > arranging for the missing pieces.  Another missing piece is support
> > > for SImode indices with DFmode data, which requires unpacking
> > > the index vector and actually recognizing the IFN.
> > >
> > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > >
> > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > >         Always use internal functions.
> > >         * config/i386/sse.md
> > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > ---
> > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > >  gcc/tree-vect-data-refs.c |  4 +--
> > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index 3957c86c3df..40bec98d9f7 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -23232,12 +23232,22 @@
> > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > +(define_mode_attr vec_gather_idxsi
> > > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > >
> > >  (define_mode_attr VEC_GATHER_IDXDI
> > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > +(define_mode_attr vec_gather_idxdi
> > > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > >
> > >  (define_mode_attr VEC_GATHER_SRCDI
> > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > @@ -23245,6 +23255,29 @@
> > >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > >
> > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > +                  (unspec:VEC_GATHER_MODE
> > > +                    [(pc)
> > > +                     (mem:<ssescalarmode>
> > > +                       (match_par_dup 6
> > > +                         [(match_operand 1 "vsib_address_operand")
> > > +                          (match_operand:<VEC_GATHER_IDXSI>
> > > +                             2 "register_operand")
> > > +                          (match_operand:SI 4 "const1248_operand")
> > > +                          (match_operand:SI 3 "const0_operand")]))
> > > +                     (mem:BLK (scratch))
> > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> >
> > One problem with these is that when AVX512[VL] is enabled we get an AVX512
> > mask mode here, and while the internal function expansion check succeeds
> > (it never checks the mask operand!), RTL expansion fails unexpectedly
> > because of this mismatch.
> >
> > I suppose a more complicated define_mode_attr might do the trick, or do I
> I don't think define_mode_attr supports conditional selection based on
> target feature.
> > need to add && !TARGET_AVX512F to these expanders?
> There'll be a duplicated definition of mask_gather_loadv8sfv8si for the
> avx512 and non-avx512 versions.
Note gather_loadmn can be shared by the avx512 and non-avx512 versions; we
can handle the mask mode in the preparation statements based on the target
feature.
> >
> The best way would be, like maskloadmn, to accept different mask modes in
> the pattern name, but that may create too much complexity in the middle-end
> to choose between the avx2 mask_gather_load and the avx512 mask_gather_load.
>
> > I've meanwhile posted a patch to make the vectorizer fall back to
> > masked_ variants when non-masked_ variants are not available, and that
> > seems to work fine at least.
> >
> > Richard.
> >
> > > +                    UNSPEC_GATHER))
> > > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > +{
> > > +  operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > > +  operands[6]
> > > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
> > > +                                       operands[5]), UNSPEC_VSIBADDR);
> > > +})
> > > +
> > >  (define_expand "avx2_gathersi<mode>"
> > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > >                    (unspec:VEC_GATHER_MODE
> > > @@ -23306,6 +23339,29 @@
> > >     (set_attr "prefix" "vex")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > +                  (unspec:VEC_GATHER_MODE
> > > +                    [(pc)
> > > +                     (mem:<ssescalarmode>
> > > +                       (match_par_dup 6
> > > +                         [(match_operand 1 "vsib_address_operand")
> > > +                          (match_operand:<VEC_GATHER_IDXDI>
> > > +                             2 "register_operand")
> > > +                          (match_operand:SI 4 "const1248_operand")
> > > +                          (match_operand:SI 3 "const0_operand")]))
> > > +                     (mem:BLK (scratch))
> > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > > +                    UNSPEC_GATHER))
> > > +             (clobber (match_scratch:VEC_GATHER_MODE 7))])]
> > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > +{
> > > +  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > > +  operands[6]
> > > +    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[1], operands[2],
> > > +                                       operands[4]), UNSPEC_VSIBADDR);
> > > +})
> > > +
> > >  (define_expand "avx2_gatherdi<mode>"
> > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > >                    (unspec:VEC_GATHER_MODE
> > > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > > index d594c0a1b1e..67564b45c32 100644
> > > --- a/gcc/tree-vect-data-refs.c
> > > +++ b/gcc/tree-vect-data-refs.c
> > > @@ -3824,9 +3824,7 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> > >
> > >    /* True if we should aim to use internal functions rather than
> > >       built-in functions.  */
> > > -  bool use_ifn_p = (DR_IS_READ (dr)
> > > -                   ? supports_vec_gather_load_p ()
> > > -                   : supports_vec_scatter_store_p ());
> > > +  bool use_ifn_p = true;
> > >
> > >    base = DR_REF (dr);
> > >    /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
> > > --
> > > 2.31.1
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  3:29     ` Hongtao Liu
@ 2021-08-18  8:32       ` Richard Biener
  2021-08-18  9:17         ` Hongtao Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2021-08-18  8:32 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > This is an attempt to start moving the x86 backend to use
> > > > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > > > rather than using the builtin_{gather,scatter} target hooks.
> > > >
> > > > I've started with AVX2 gathers, and given that x86 only supports masked
> > > > gathers I have only implemented mask_gather_load.  Note that while for
> > > > the builtin_gather case the vectorizer will provide an all-true
> > > > mask operand for non-masked gathers, this capability does not
> > > > exist for the IFN path yet, so only testcases with actual masked
> > > > gathers will work.
> > > >
> > > > If this looks reasonable on the backend side I'll first see to
> > > > completing the vectorizer part, ripping out the target hook and
> > > > arranging for the missing pieces.  Another missing piece is support
> > > > for SImode indices with DFmode data, which requires unpacking
> > > > the index vector and actually recognizing the IFN.
> > > >
> > > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > > >
> > > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > > >         Always use internal functions.
> > > >         * config/i386/sse.md
> > > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > ---
> > > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > > >  gcc/tree-vect-data-refs.c |  4 +--
> > > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > index 3957c86c3df..40bec98d9f7 100644
> > > > --- a/gcc/config/i386/sse.md
> > > > +++ b/gcc/config/i386/sse.md
> > > > @@ -23232,12 +23232,22 @@
> > > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > +(define_mode_attr vec_gather_idxsi
> > > > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > >
> > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > +(define_mode_attr vec_gather_idxdi
> > > > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > >
> > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > @@ -23245,6 +23255,29 @@
> > > >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > >
> > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > > +                  (unspec:VEC_GATHER_MODE
> > > > +                    [(pc)
> > > > +                     (mem:<ssescalarmode>
> > > > +                       (match_par_dup 6
> > > > +                         [(match_operand 1 "vsib_address_operand")
> > > > +                          (match_operand:<VEC_GATHER_IDXSI>
> > > > +                             2 "register_operand")
> > > > +                          (match_operand:SI 4 "const1248_operand")
> > > > +                          (match_operand:SI 3 "const0_operand")]))
> > > > +                     (mem:BLK (scratch))
> > > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > >
> > > One problem with these is that when AVX512[VL] is enabled we get an AVX512
> > > mask mode here, and while the internal function expansion check succeeds
> > > (it never checks the mask operand!), RTL expansion fails unexpectedly
> > > because of this mismatch.
> > >
> > > I suppose a more complicated define_mode_attr might do the trick, or do I
> > I don't think define_mode_attr supports conditional selection based on
> > target feature.
> > > need to add && !TARGET_AVX512F to these expanders?
> > There'll be a duplicated definition of mask_gather_loadv8sfv8si for the
> > avx512 and non-avx512 versions.
> Note gather_loadmn can be shared by the avx512 and non-avx512 versions; we
> can handle the mask mode in the preparation statements based on the target
> feature.
> > >
> > The best way would be, like maskloadmn, to accept different mask modes in
> > the pattern name, but that may create too much complexity in the middle-end
> > to choose between the avx2 mask_gather_load and the avx512 mask_gather_load.

So when we have -mavx512f only we can gather V8DF using a QImode
mask with the AVX512 gather, but we have to use AVX2 for a V4DF gather
using a V4DImode mask.  With -mavx512vl we use a QImode mask for that
case as well.

Note the

  operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);

is because the define_insns for the gathers somehow expect odd mask
modes (the vectorizer generates an integer vector mode matching the
layout of the result vector).  I suppose we could somehow fix that
but I wanted to change as little as possible for now.
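
For a V4DF gather, for example, that punning amounts to something like
this (a simplified sketch, not literal code from the patch):

  /* The vectorizer hands us a V4DImode mask; reinterpret its bits as
     the V4DFmode "mask" operand the AVX2 gather insns expect, a
     same-size lowpart subreg with no data movement.  */
  rtx mask = gen_lowpart_SUBREG (V4DFmode, operands[5]);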

The following seems to "work" in generating AVX512 variants for
-mavx512f -mavx512vl -mprefer-vector-width=512 and AVX2 variants
when I drop the -mavx512vl arg:

(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
  [(match_operand:VEC_GATHER_MODE 0 "register_operand")
   (match_operand 1 "vsib_address_operand")
   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
   (match_operand:SI 3 "const0_operand")
   (match_operand:SI 4 "const1248_operand")
   (match_operand 5 "register_operand")]
  "TARGET_AVX2 && TARGET_USE_GATHER"
{ 
  if (TARGET_AVX512VL)
    {
      rtx scratch = gen_reg_rtx (<MODE>mode);
      emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
      emit_insn (gen_avx512vl_gatherdi<mode>
        (operands[0], scratch,
         operands[1], operands[2], operands[5], operands[4]));
    }
  else
    { 
      operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
      rtx scratch = gen_reg_rtx (<MODE>mode);
      emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
      emit_insn (gen_avx2_gatherdi<mode>
        (operands[0], scratch,
         operands[1], operands[2], operands[5], operands[4]));
    }
  DONE;
})

But I see quite horrible code for an always-true mask.  From GIMPLE

  # ivtmp.17_34 = PHI <ivtmp.17_19(3), 0(2)>
  vect__4.5_24 = MEM <vector(8) int> [(int *)idx_12(D) + ivtmp.17_34 * 1];
  vect_patt_28.6_23 = [vec_unpack_lo_expr] vect__4.5_24;
  vect_patt_28.6_22 = [vec_unpack_hi_expr] vect__4.5_24;
  vect_patt_27.7_17 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_23, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
  vect_patt_27.8_16 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_22, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
  MEM <vector(4) double> [(double *)&a + ivtmp.17_34 * 2] = vect_patt_27.7_17;
  MEM <vector(4) double> [(double *)&a + 32B + ivtmp.17_34 * 2] = vect_patt_27.8_16;
  ivtmp.17_19 = ivtmp.17_34 + 32;
  if (ivtmp.17_19 != 4096)

we end up with

        movl    $-1, %edx
        xorl    %eax, %eax
        vxorpd  %xmm1, %xmm1, %xmm1
        kmovw   %edx, %k1
        .p2align 4,,10
        .p2align 3
.L2:
        vpmovsxdq       (%rsi,%rax), %ymm0
        vmovdqu (%rsi,%rax), %ymm4
        vmovapd %ymm1, %ymm3
        kmovw   %k1, %k2
        vmovapd %ymm1, %ymm2
        kmovw   %k1, %k3
        vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}
        vextracti128    $0x1, %ymm4, %xmm0
        vpmovsxdq       %xmm0, %ymm0
        vgatherqpd      (%rdi,%ymm0,8), %ymm2{%k3}
        vmovapd %ymm3, a(%rax,%rax)
        vmovapd %ymm2, a+32(%rax,%rax)
        addq    $32, %rax
        cmpq    $4096, %rax
        jne     .L2

I wonder why we have those extra kmovw here.  I see there are
extra (pc) patterns for always-true masks(?) like in the AVX2
case.  But in the end we have to use {%kN} instructions, no?
We can always provide non-masked gather_load expanders if that's
better here.
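
A non-masked expander could then be a thin wrapper that forwards to the
masked variant with an all-ones mask; a sketch (untested, with the
operand order mirroring the mask_gather_load expander above):

(define_expand "gather_load<mode><vec_gather_idxdi>"
  [(match_operand:VEC_GATHER_MODE 0 "register_operand")
   (match_operand 1 "vsib_address_operand")
   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
   (match_operand:SI 3 "const0_operand")
   (match_operand:SI 4 "const1248_operand")]
  "TARGET_AVX2 && TARGET_USE_GATHER"
{
  /* Expand via the masked pattern with an all-ones vector mask.  */
  rtx mask = gen_reg_rtx (<sseintvecmode>mode);
  emit_move_insn (mask, CONSTM1_RTX (<sseintvecmode>mode));
  emit_insn (gen_mask_gather_load<mode><vec_gather_idxdi>
	       (operands[0], operands[1], operands[2],
		operands[3], operands[4], mask));
  DONE;
})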

Richard.


* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  8:32       ` Richard Biener
@ 2021-08-18  9:17         ` Hongtao Liu
  2021-08-18  9:21           ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18  9:17 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 4:32 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 18 Aug 2021, Hongtao Liu wrote:
>
> > On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > This is an attempt to start moving the x86 backend to use
> > > > > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > > > > rather than using the builtin_{gather,scatter} target hooks.
> > > > >
> > > > > I've started with AVX2 gathers, and given that x86 only supports masked
> > > > > gathers I have only implemented mask_gather_load.  Note that while for
> > > > > the builtin_gather case the vectorizer will provide an all-true
> > > > > mask operand for non-masked gathers, this capability does not
> > > > > exist for the IFN path yet, so only testcases with actual masked
> > > > > gathers will work.
> > > > >
> > > > > If this looks reasonable on the backend side I'll first see to
> > > > > completing the vectorizer part, ripping out the target hook and
> > > > > arranging for the missing pieces.  Another missing piece is support
> > > > > for SImode indices with DFmode data, which requires unpacking
> > > > > the index vector and actually recognizing the IFN.
> > > > >
> > > > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > > > >
> > > > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > > > >         Always use internal functions.
> > > > >         * config/i386/sse.md
> > > > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > > ---
> > > > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > > > >  gcc/tree-vect-data-refs.c |  4 +--
> > > > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > > index 3957c86c3df..40bec98d9f7 100644
> > > > > --- a/gcc/config/i386/sse.md
> > > > > +++ b/gcc/config/i386/sse.md
> > > > > @@ -23232,12 +23232,22 @@
> > > > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > > +(define_mode_attr vec_gather_idxsi
> > > > > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > > >
> > > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > > +(define_mode_attr vec_gather_idxdi
> > > > > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > > >
> > > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > @@ -23245,6 +23255,29 @@
> > > > >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > > >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > > >
> > > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > > > +                  (unspec:VEC_GATHER_MODE
> > > > > +                    [(pc)
> > > > > +                     (mem:<ssescalarmode>
> > > > > +                       (match_par_dup 6
> > > > > +                         [(match_operand 1 "vsib_address_operand")
> > > > > +                          (match_operand:<VEC_GATHER_IDXSI>
> > > > > +                             2 "register_operand")
> > > > > +                          (match_operand:SI 4 "const1248_operand")
> > > > > +                          (match_operand:SI 3 "const0_operand")]))
> > > > > +                     (mem:BLK (scratch))
> > > > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > > >
> > > > One problem with these is that when AVX512[VL] is enabled we get an AVX512
> > > > mask mode here, and while the internal function expansion check succeeds
> > > > (it never checks the mask operand!), RTL expansion fails unexpectedly
> > > > because of this mismatch.
> > > >
> > > > I suppose a more complicated define_mode_attr might do the trick, or do I
> > > I don't think define_mode_attr supports conditional selection based on
> > > target feature.
> > > > need to add && !TARGET_AVX512F to these expanders?
> > > There'll be a duplicated definition of mask_gather_loadv8sfv8si for the
> > > avx512 and non-avx512 versions.
> > Note gather_loadmn can be shared by the avx512 and non-avx512 versions; we
> > can handle the mask mode in the preparation statements based on the target
> > feature.
> > > >
> > > The best way would be, like maskloadmn, to accept different mask modes in
> > > the pattern name, but that may create too much complexity in the middle-end
> > > to choose between the avx2 mask_gather_load and the avx512 mask_gather_load.
>
> So when we have -mavx512f only we can gather V8DF using a QImode
> mask with the AVX512 gather, but we have to use AVX2 for a V4DF gather
> using a V4DImode mask.  With -mavx512vl we use a QImode mask for that
> case as well.
>
> Note the
>
>   operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
>
> is because the define_insns for the gathers somehow expect odd mask
> modes (the vectorizer generates an integer vector mode matching the
> layout of the result vector).  I suppose we could somehow fix that
> but I wanted to change as little as possible for now.
>
> The following seems to "work" in generating AVX512 variants for
> -mavx512f -mavx512vl -mprefer-vector-width=512 and AVX2 variants
> when I drop the -mavx512vl arg:
>
Interesting.
> (define_expand "mask_gather_load<mode><vec_gather_idxdi>"
>   [(match_operand:VEC_GATHER_MODE 0 "register_operand")
>    (match_operand 1 "vsib_address_operand")
>    (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
>    (match_operand:SI 3 "const0_operand")
>    (match_operand:SI 4 "const1248_operand")
>    (match_operand 5 "register_operand")]
>   "TARGET_AVX2 && TARGET_USE_GATHER"
> {
>   if (TARGET_AVX512VL)
>     {
>       rtx scratch = gen_reg_rtx (<MODE>mode);
>       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
>       emit_insn (gen_avx512vl_gatherdi<mode>
>         (operands[0], scratch,
>          operands[1], operands[2], operands[5], operands[4]));
>     }
>   else
>     {
>       operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
>       rtx scratch = gen_reg_rtx (<MODE>mode);
>       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
>       emit_insn (gen_avx2_gatherdi<mode>
>         (operands[0], scratch,
>          operands[1], operands[2], operands[5], operands[4]));
>     }
>   DONE;
> })
>
> But I see quite horrible code for an always-true mask.  From GIMPLE
>
>   # ivtmp.17_34 = PHI <ivtmp.17_19(3), 0(2)>
>   vect__4.5_24 = MEM <vector(8) int> [(int *)idx_12(D) + ivtmp.17_34 * 1];
>   vect_patt_28.6_23 = [vec_unpack_lo_expr] vect__4.5_24;
>   vect_patt_28.6_22 = [vec_unpack_hi_expr] vect__4.5_24;
>   vect_patt_27.7_17 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_23, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
>   vect_patt_27.8_16 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_22, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
>   MEM <vector(4) double> [(double *)&a + ivtmp.17_34 * 2] = vect_patt_27.7_17;
>   MEM <vector(4) double> [(double *)&a + 32B + ivtmp.17_34 * 2] = vect_patt_27.8_16;
>   ivtmp.17_19 = ivtmp.17_34 + 32;
>   if (ivtmp.17_19 != 4096)
>
> we end up with
>
>         movl    $-1, %edx
>         xorl    %eax, %eax
>         vxorpd  %xmm1, %xmm1, %xmm1
>         kmovw   %edx, %k1
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpmovsxdq       (%rsi,%rax), %ymm0
>         vmovdqu (%rsi,%rax), %ymm4
>         vmovapd %ymm1, %ymm3
>         kmovw   %k1, %k2
>         vmovapd %ymm1, %ymm2
>         kmovw   %k1, %k3
>         vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}
>         vextracti128    $0x1, %ymm4, %xmm0
>         vpmovsxdq       %xmm0, %ymm0
>         vgatherqpd      (%rdi,%ymm0,8), %ymm2{%k3}
>         vmovapd %ymm3, a(%rax,%rax)
>         vmovapd %ymm2, a+32(%rax,%rax)
>         addq    $32, %rax
>         cmpq    $4096, %rax
>         jne     .L2
>
> I wonder why we have those extra kmovw here.  I see there's

(define_insn "*avx512f_gathersi<VI48F:mode>"
  [(set (match_operand:VI48F 0 "register_operand" "=&v")
        (unspec:VI48F
          [(match_operand:VI48F 1 "register_operand" "0")
           (match_operand:<avx512fmaskmode> 7 "register_operand" "2")
           (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
             [(unspec:P
                [(match_operand:P 4 "vsib_address_operand" "Tv")
                 (match_operand:<VEC_GATHER_IDXSI> 3 "register_operand" "v")
                 (match_operand:SI 5 "const1248_operand" "n")]
                UNSPEC_VSIBADDR)])]
          UNSPEC_GATHER))
   (clobber (match_scratch:<avx512fmaskmode> 2 "=&Yk"))]
I think it's because of this clobber, and the clobber is here because
the mask register will be set to zero by this instruction.
cut from the Intel SDM
----cut start-------
VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword
......The entire mask register will be set to zero by this instruction
unless it triggers an exception.
----cut end------
Similar for the avx2 version.

> extra (pc) patterns for always-true masks(?) like in the AVX2
> case.  But in the end we have to use {%kN} instructions, no?
> We can always provide non-masked gather_load expanders if that's
> better here.
>
> Richard.



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  9:17         ` Hongtao Liu
@ 2021-08-18  9:21           ` Richard Biener
  2021-08-18  9:54             ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2021-08-18  9:21 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Richard Sandiford, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 4:32 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Wed, 18 Aug 2021, Hongtao Liu wrote:
> >
> > > On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > This is an attempt to start moving the x86 backend to use
> > > > > > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > > > > > rather than using the builtin_{gather,scatter} target hooks.
> > > > > >
> > > > > > I've started with AVX2 gathers and given x86 only supports masked
> > > > > > gather I only implemented mask_gather_load.  Note while for
> > > > > > the builtin_gather case the vectorizer will provide an all-true
> > > > > > mask operand for non-masked gathers this capability does not
> > > > > > exist for the IFN path yet, so only testcases with actual masked
> > > > > > gathers will work.
> > > > > >
> > > > > > If this looks reasonable on the backend side I'll see to first
> > > > > > complete the vectorizer part, ripping out the target hook and
> > > > > > arranging for the missing pieces.  Another one is the support
> > > > > > for SImode indices with DFmode data which requires unpacking
> > > > > > the index vector and actually recognizing the IFN.
> > > > > >
> > > > > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > > > > >
> > > > > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > > > > >         Always use internal functions.
> > > > > >         * config/i386/sse.md
> > > > > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > > > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > > > ---
> > > > > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > > > > >  gcc/tree-vect-data-refs.c |  4 +--
> > > > > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > > > index 3957c86c3df..40bec98d9f7 100644
> > > > > > --- a/gcc/config/i386/sse.md
> > > > > > +++ b/gcc/config/i386/sse.md
> > > > > > @@ -23232,12 +23232,22 @@
> > > > > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > > > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > > > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > > > +(define_mode_attr vec_gather_idxsi
> > > > > > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > > > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > > > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > > > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > > > >
> > > > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > > > +(define_mode_attr vec_gather_idxdi
> > > > > > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > > > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > > > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > > > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > > > >
> > > > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > > @@ -23245,6 +23255,29 @@
> > > > > >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > > > >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > > > >
> > > > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > > > > +                  (unspec:VEC_GATHER_MODE
> > > > > > +                    [(pc)
> > > > > > +                     (mem:<ssescalarmode>
> > > > > > +                       (match_par_dup 6
> > > > > > +                         [(match_operand 1 "vsib_address_operand")
> > > > > > +                          (match_operand:<VEC_GATHER_IDXSI>
> > > > > > +                             2 "register_operand")
> > > > > > +                          (match_operand:SI 4 "const1248_operand")
> > > > > > +                          (match_operand:SI 3 "const0_operand")]))
> > > > > > +                     (mem:BLK (scratch))
> > > > > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > > > >
> > > > > One problem of these is that when AVX512[VL] is enabled we get an AVX512 mask
> > > > > mode here and while the internal function expansion check succeeds (it
> > > > > never checks
> > > > > the mask operand!), RTL expansion fails unexpectedly because of this mismatch.
> > > > >
> > > > > I suppose a more complicated define_mode_attr might do the trick or do I
> > > > I don't think define_mode_attr supports conditional selection based on
> > > > target feature.
> > > > > need to add && !TARGET_AVX512F to these expanders?
> > > > There'll be a duplicated definition of mask_gather_loadv8sfv8si for the avx512
> > > > and non-avx512 versions.
> > > Note gather_loadmn can be shared by the avx512 and non-avx512 versions; we
> > > can handle the mask mode in the preparation statements based on the target
> > > feature.
> > > > >
> > > > The best way would be, like maskloadmn, to accept different mask modes in the
> > > > pattern name, but that may create too much complexity in the middle-end to
> > > > choose between the avx2 and avx512 mask_gather_load.
> >
> > So when we have -mavx512f only we can gather V8DF using a QImode
> > mask with the AVX512 gather but have to use AVX2 for a V4DF gather
> > using a V4DImode mask.  With -mavx512vl we use a QImode mask for that
> > case as well.
> >
> > Note the
> >
> >   operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> >
> > is because the define_insns for the gathers somehow expect odd mask
> > modes (the vectorizer generates an integer vector mode matching the
> > layout of the result vector).  I suppose we could somehow fix that
> > but I wanted to change as little as possible for now.
> >
> > The following seems to "work" in generating AVX512 variants for
> > -mavx512f -mavx512vl -mprefer-vector-width=512 and AVX2 variants
> > when I drop the -mavx512vl arg:
> >
> Interesting.
> > (define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> >   [(match_operand:VEC_GATHER_MODE 0 "register_operand")
> >    (match_operand 1 "vsib_address_operand")
> >    (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> >    (match_operand:SI 3 "const0_operand")
> >    (match_operand:SI 4 "const1248_operand")
> >    (match_operand 5 "register_operand")]
> >   "TARGET_AVX2 && TARGET_USE_GATHER"
> > {
> >   if (TARGET_AVX512VL)
> >     {
> >       rtx scratch = gen_reg_rtx (<MODE>mode);
> >       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> >       emit_insn (gen_avx512vl_gatherdi<mode>
> >         (operands[0], scratch,
> >          operands[1], operands[2], operands[5], operands[4]));
> >     }
> >   else
> >     {
> >       operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> >       rtx scratch = gen_reg_rtx (<MODE>mode);
> >       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> >       emit_insn (gen_avx2_gatherdi<mode>
> >         (operands[0], scratch,
> >          operands[1], operands[2], operands[5], operands[4]));
> >     }
> >   DONE;
> > })
> >
> > But I see quite horrible code for an always-true mask.  From GIMPLE
> >
> >   # ivtmp.17_34 = PHI <ivtmp.17_19(3), 0(2)>
> >   vect__4.5_24 = MEM <vector(8) int> [(int *)idx_12(D) + ivtmp.17_34 * 1];
> >   vect_patt_28.6_23 = [vec_unpack_lo_expr] vect__4.5_24;
> >   vect_patt_28.6_22 = [vec_unpack_hi_expr] vect__4.5_24;
> >   vect_patt_27.7_17 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_23, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
> >   vect_patt_27.8_16 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_22, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
> >   MEM <vector(4) double> [(double *)&a + ivtmp.17_34 * 2] = vect_patt_27.7_17;
> >   MEM <vector(4) double> [(double *)&a + 32B + ivtmp.17_34 * 2] = vect_patt_27.8_16;
> >   ivtmp.17_19 = ivtmp.17_34 + 32;
> >   if (ivtmp.17_19 != 4096)
> >
> > we end up with
> >
> >         movl    $-1, %edx
> >         xorl    %eax, %eax
> >         vxorpd  %xmm1, %xmm1, %xmm1
> >         kmovw   %edx, %k1
> >         .p2align 4,,10
> >         .p2align 3
> > .L2:
> >         vpmovsxdq       (%rsi,%rax), %ymm0
> >         vmovdqu (%rsi,%rax), %ymm4
> >         vmovapd %ymm1, %ymm3
> >         kmovw   %k1, %k2
> >         vmovapd %ymm1, %ymm2
> >         kmovw   %k1, %k3
> >         vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}
> >         vextracti128    $0x1, %ymm4, %xmm0
> >         vpmovsxdq       %xmm0, %ymm0
> >         vgatherqpd      (%rdi,%ymm0,8), %ymm2{%k3}
> >         vmovapd %ymm3, a(%rax,%rax)
> >         vmovapd %ymm2, a+32(%rax,%rax)
> >         addq    $32, %rax
> >         cmpq    $4096, %rax
> >         jne     .L2
> >
> > I wonder why we have those extra kmovw here.  I see there's
> 
> (define_insn "*avx512f_gathersi<VI48F:mode>"
>   [(set (match_operand:VI48F 0 "register_operand" "=&v")
>         (unspec:VI48F
>           [(match_operand:VI48F 1 "register_operand" "0")
>            (match_operand:<avx512fmaskmode> 7 "register_operand" "2")
>            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
>              [(unspec:P
>                 [(match_operand:P 4 "vsib_address_operand" "Tv")
>                  (match_operand:<VEC_GATHER_IDXSI> 3 "register_operand" "v")
>                  (match_operand:SI 5 "const1248_operand" "n")]
>                 UNSPEC_VSIBADDR)])]
>           UNSPEC_GATHER))
>    (clobber (match_scratch:<avx512fmaskmode> 2 "=&Yk"))]
> I think it's because of this clobber, and the clobber is here because
> the mask register will be set to zero by this instruction.
> cut from the Intel SDM
> ----cut start-------
> VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword
> ......The entire mask register will be set to zero by this instruction
> unless it triggers an exception.
> ----cut end------
> Similar for the avx2 version.

Oh, how (not) useful ...  I verified the same code is generated
when using the old builtins path btw, so while it looks inefficient
it doesn't look like it will be a regression.  Maybe a
kxnorw %k2, %k2, %k2 would be more efficient than keeping a mask
register holding -1 around to copy from.  Not sure how we'd arrange
for that though.
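
A sketch of that idea (hypothetical instruction selection, not actual
compiler output): rematerializing the all-ones mask directly in the
register the gather clobbers would make both the movl/kmovw setup and
the per-gather kmovw copies unnecessary:

        kxnorw  %k2, %k2, %k2                       # set %k2 to all-ones
        vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}  # gather zeroes %k2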

Richard.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  9:21           ` Richard Biener
@ 2021-08-18  9:54             ` Richard Biener
  2021-08-18 10:24               ` Hongtao Liu
  2021-08-18 10:28               ` Richard Biener
  0 siblings, 2 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-18  9:54 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Hongtao Liu, GCC Patches


So in the end I seem to be able to combine AVX & AVX512 arriving
at the following which passes basic testing.  I will now see to
teach the vectorizer the required "promotion" to handle
mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
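
(For reference, a made-up testcase with exactly that shape, SImode
indices with DFmode data under a mask; the function and array names
are invented:

  void
  foo (double *restrict a, double *restrict b, int *restrict idx, int n)
  {
    for (int i = 0; i < n; ++i)
      if (idx[i] >= 0)       /* the condition supplies the gather mask */
        a[i] = b[idx[i]];    /* DFmode data, SImode indices */
  }

at V4DF this wants the mask_gather_loadv4dfv4si flavor.)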

Meanwhile, do you see any hole in the below?  If not I'll
do mask_scatter_store accordingly (that will be simpler since
AVX doesn't have scatter).

Thanks,
Richard.

From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Tue, 17 Aug 2021 15:14:38 +0200
Subject: [PATCH] move x86 to use gather/scatter internal functions
To: gcc-patches@gcc.gnu.org

This moves the x86 backend to use standard pattern names for
mask_gather_load and mask_scatter_store rather than using the
builtin_{gather,scatter} target hooks.

The patch starts with mask_gather_load covering AVX and AVX512
and omits gather_load which doesn't match an actual instruction.
The vectorizer will appropriately set up an all-ones mask when
necessary.

The vectorizer still lacks support for unpacking, and thus for
mixed-size gathers like mask_gather_loadv4dfv4si.

2021-08-17  Richard Biener  <rguenther@suse.de>

	* config/i386/sse.md
	(vec_gather_idxsi, vec_gather_idxdi): New lower-case
	mode attributes for VEC_GATHER_IDX{SI,DI}.
	(VEC_GATHER_MODEX): New mode iterator.
	(mask_gather_load<mode><vec_gather_idxsi>): New expander.
	(mask_gather_load<mode><vec_gather_idxdi>): Likewise.
---
 gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 13889687793..6ae826c268a 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -23640,12 +23640,22 @@
 		       (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
 		       (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
 		       (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
+(define_mode_attr vec_gather_idxsi
+		      [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
+		       (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
+		       (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
+		       (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
 
 (define_mode_attr VEC_GATHER_IDXDI
 		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
 		       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
 		       (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
 		       (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
+(define_mode_attr vec_gather_idxdi
+		      [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
+		       (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
+		       (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
+		       (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
 
 (define_mode_attr VEC_GATHER_SRCDI
 		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
@@ -23653,6 +23663,30 @@
 		       (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
 		       (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
 
+(define_mode_iterator VEC_GATHER_MODEX
+  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
+   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
+   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
+
+(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
+  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
+   (match_operand 1 "vsib_address_operand")
+   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
+   (match_operand:SI 3 "const0_operand")
+   (match_operand:SI 4 "const1248_operand")
+   (match_operand 5 "register_operand")]
+  "TARGET_AVX2 && TARGET_USE_GATHER"
+{
+  rtx scratch = gen_reg_rtx (<MODE>mode);
+  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
+  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
+    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
+  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
+					       operands[1], operands[2],
+					       operands[5], operands[4]));
+  DONE;
+})
+
 (define_expand "avx2_gathersi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
 		   (unspec:VEC_GATHER_MODE
@@ -23714,6 +23748,25 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
+  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
+   (match_operand 1 "vsib_address_operand")
+   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
+   (match_operand:SI 3 "const0_operand")
+   (match_operand:SI 4 "const1248_operand")
+   (match_operand 5 "register_operand")]
+  "TARGET_AVX2 && TARGET_USE_GATHER"
+{
+  rtx scratch = gen_reg_rtx (<MODE>mode);
+  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
+  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
+    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
+  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
+					       operands[1], operands[2],
+					       operands[5], operands[4]));
+  DONE;
+})
+
 (define_expand "avx2_gatherdi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
 		   (unspec:VEC_GATHER_MODE
-- 
2.31.1



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  9:54             ` Richard Biener
@ 2021-08-18 10:24               ` Hongtao Liu
  2021-08-18 10:36                 ` Richard Biener
  2021-08-18 10:28               ` Richard Biener
  1 sibling, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18 10:24 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 5:54 PM Richard Biener <rguenther@suse.de> wrote:
>
>
> So in the end I seem to be able to combine AVX & AVX512 arriving
> at the following which passes basic testing.  I will now see to
> teach the vectorizer the required "promotion" to handle
> mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
>
> Meanwhile, do you see any hole in the below?  If not I'll
> do mask_scatter_store accordingly (that will be simpler since
> AVX doesn't have scatter).
>
> Thanks,
> Richard.
>
> From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> From: Richard Biener <rguenther@suse.de>
> Date: Tue, 17 Aug 2021 15:14:38 +0200
> Subject: [PATCH] move x86 to use gather/scatter internal functions
> To: gcc-patches@gcc.gnu.org
>
> This moves the x86 backend to use standard pattern names for
> mask_gather_load and mask_scatter_store rather than using the
> builtin_{gather,scatter} target hooks.
>
> The patch starts with mask_gather_load covering AVX and AVX512
> and omits gather_load which doesn't match an actual instruction.
> The vectorizer will appropriately set up an all-ones mask when
> necessary.
>
> The vectorizer still lacks support for unpacking, and thus for
> mixed-size gathers like mask_gather_loadv4dfv4si.
>
> 2021-08-17  Richard Biener  <rguenther@suse.de>
>
>         * config/i386/sse.md
>         (vec_gather_idxsi, vec_gather_idxdi): New lower-case
>         mode attributes for VEC_GATHER_IDX{SI,DI}.
>         (VEC_GATHER_MODEX): New mode iterator.
>         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
>         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> ---
>  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 13889687793..6ae826c268a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -23640,12 +23640,22 @@
>                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
>                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
>                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> +(define_mode_attr vec_gather_idxsi
> +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
>
>  (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> +(define_mode_attr vec_gather_idxdi
> +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
>
>  (define_mode_attr VEC_GATHER_SRCDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> @@ -23653,6 +23663,30 @@
>                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
>                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
>
> +(define_mode_iterator VEC_GATHER_MODEX
> +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> +
> +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> +   (match_operand 1 "vsib_address_operand")
> +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> +   (match_operand:SI 3 "const0_operand")
> +   (match_operand:SI 4 "const1248_operand")
> +   (match_operand 5 "register_operand")]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  rtx scratch = gen_reg_rtx (<MODE>mode);
> +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
No need for the 64-byte modes since an avx512 mask will be used there.
And shouldn't we also handle the 32-byte vector float modes?
> +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
I guess you want to use <avx2_avx512> to automatically match
avx2_gathersi<mode> and avx512vl_gathersi<mode>, but <avx2_avx512>
will only match avx2 for 128/256-bit modes.

(define_mode_attr avx2_avx512
  [(V4SI "avx2") (V8SI "avx2") (V16SI "avx512f")
   (V2DI "avx2") (V4DI "avx2") (V8DI "avx512f")
   (V4SF "avx2") (V8SF "avx2") (V16SF "avx512f")
   (V2DF "avx2") (V4DF "avx2") (V8DF "avx512f")
   (V8HI "avx512vl") (V16HI "avx512vl") (V32HI "avx512bw")])
> +                                              operands[1], operands[2],
> +                                              operands[5], operands[4]));
> +  DONE;
> +})
> +
>  (define_expand "avx2_gathersi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
> @@ -23714,6 +23748,25 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "<sseinsnmode>")])
>
> +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> +   (match_operand 1 "vsib_address_operand")
> +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> +   (match_operand:SI 3 "const0_operand")
> +   (match_operand:SI 4 "const1248_operand")
> +   (match_operand 5 "register_operand")]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  rtx scratch = gen_reg_rtx (<MODE>mode);
> +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> +                                              operands[1], operands[2],
> +                                              operands[5], operands[4]));
> +  DONE;
> +})
> +
>  (define_expand "avx2_gatherdi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
> --
> 2.31.1
>
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18  9:54             ` Richard Biener
  2021-08-18 10:24               ` Hongtao Liu
@ 2021-08-18 10:28               ` Richard Biener
  2021-08-18 11:30                 ` Hongtao Liu
  2021-08-18 11:31                 ` Richard Biener
  1 sibling, 2 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-18 10:28 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Richard Biener wrote:

> 
> So in the end I seem to be able to combine AVX & AVX512 arriving
> at the following which passes basic testing.  I will now see to
> teach the vectorizer the required "promotion" to handle
> mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> 
> Meanwhile, do you see any hole in the below?  If not I'll
> do mask_scatter_store accordingly (that will be simpler since
> AVX doesn't have scatter).

There seems to be one more complication ... we have

(define_expand "avx2_gatherdi<mode>"
  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
                   (unspec:VEC_GATHER_MODE
                     [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand")
                      (mem:<ssescalarmode>
                        (match_par_dup 6
                          [(match_operand 2 "vsib_address_operand")
                           (match_operand:<VEC_GATHER_IDXDI>
                              3 "register_operand")

but VEC_GATHER_IDXDI is

(define_mode_attr VEC_GATHER_IDXDI
                      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
                       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
                       (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
                       (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])

I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
From the ISA docs I conclude that vgatherqps is not supported for
a %zmm destination, but is V8SF with V8DI a supported combination?

(define_mode_attr VEC_GATHER_IDXSI
                      [(V2DI "V4SI") (V4DI "V4SI") (V8DI "V8SI")
                       (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
                       (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
                       (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])

looks like what I expect.  So shouldn't we use a special
VEC_GATHERDI_MODE iterator instead that drops V16{SI,SF} and adjust
VEC_GATHER_IDXDI accordingly?  Even

(define_expand "<avx512>_gatherdi<mode>"
  [(parallel [(set (match_operand:VI48F 0 "register_operand")
                   (unspec:VI48F

"supports" V16SI and V16SF gathers, leading to broken patterns?

I can test if anything goes wrong fixing VEC_GATHER_IDXDI to

(define_mode_attr VEC_GATHER_IDXDI
                      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
                       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
                       (V4SI "V4DI") (V8SI "V8DI") (V16SI "V8DI")
                       (V4SF "V4DI") (V8SF "V8DI") (V16SF "V8DI")])

Thanks,
Richard.

> Thanks,
> Richard.
> 
> From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> From: Richard Biener <rguenther@suse.de>
> Date: Tue, 17 Aug 2021 15:14:38 +0200
> Subject: [PATCH] move x86 to use gather/scatter internal functions
> To: gcc-patches@gcc.gnu.org
> 
> This moves the x86 backend to use standard pattern names for
> mask_gather_load and mask_scatter_store rather than using the
> builtin_{gather,scatter} target hooks.
> 
> > The patch starts with mask_gather_load covering AVX and AVX512
> > and omits gather_load which doesn't match an actual instruction.
> > The vectorizer will appropriately set up an all-ones mask when
> > necessary.
> >
> > The vectorizer still lacks support for unpacking, and thus for
> > mixed-size gathers like mask_gather_loadv4dfv4si.
> 
> 2021-08-17  Richard Biener  <rguenther@suse.de>
> 
> 	* config/i386/sse.md
> 	(vec_gather_idxsi, vec_gather_idxdi): New lower-case
> 	mode attributes for VEC_GATHER_IDX{SI,DI}.
> 	(VEC_GATHER_MODEX): New mode iterator.
> 	(mask_gather_load<mode><vec_gather_idxsi>): New expander.
> 	(mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> ---
>  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 13889687793..6ae826c268a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -23640,12 +23640,22 @@
>  		       (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
>  		       (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
>  		       (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> +(define_mode_attr vec_gather_idxsi
> +		      [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> +		       (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> +		       (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> +		       (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
>  
>  (define_mode_attr VEC_GATHER_IDXDI
>  		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>  		       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>  		       (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>  		       (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> +(define_mode_attr vec_gather_idxdi
> +		      [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> +		       (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> +		       (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> +		       (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
>  
>  (define_mode_attr VEC_GATHER_SRCDI
>  		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> @@ -23653,6 +23663,30 @@
>  		       (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
>  		       (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
>  
> +(define_mode_iterator VEC_GATHER_MODEX
> +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> +
> +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> +   (match_operand 1 "vsib_address_operand")
> +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> +   (match_operand:SI 3 "const0_operand")
> +   (match_operand:SI 4 "const1248_operand")
> +   (match_operand 5 "register_operand")]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  rtx scratch = gen_reg_rtx (<MODE>mode);
> +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> +					       operands[1], operands[2],
> +					       operands[5], operands[4]));
> +  DONE;
> +})
> +
>  (define_expand "avx2_gathersi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>  		   (unspec:VEC_GATHER_MODE
> @@ -23714,6 +23748,25 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> +   (match_operand 1 "vsib_address_operand")
> +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> +   (match_operand:SI 3 "const0_operand")
> +   (match_operand:SI 4 "const1248_operand")
> +   (match_operand 5 "register_operand")]
> +  "TARGET_AVX2 && TARGET_USE_GATHER"
> +{
> +  rtx scratch = gen_reg_rtx (<MODE>mode);
> +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> +					       operands[1], operands[2],
> +					       operands[5], operands[4]));
> +  DONE;
> +})
> +
>  (define_expand "avx2_gatherdi<mode>"
>    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>  		   (unspec:VEC_GATHER_MODE
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 10:24               ` Hongtao Liu
@ 2021-08-18 10:36                 ` Richard Biener
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-18 10:36 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 5:54 PM Richard Biener <rguenther@suse.de> wrote:
> >
> >
> > So in the end I seem to be able to combine AVX & AVX512 arriving
> > at the following which passes basic testing.  I will now see to
> > teach the vectorizer the required "promotion" to handle
> > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> >
> > Meanwhile, do you see any hole in the below?  If not I'll
> > do mask_scatter_store accordingly (that will be simpler since
> > AVX doesn't have scatter).
> >
> > Thanks,
> > Richard.
> >
> > From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> > From: Richard Biener <rguenther@suse.de>
> > Date: Tue, 17 Aug 2021 15:14:38 +0200
> > Subject: [PATCH] move x86 to use gather/scatter internal functions
> > To: gcc-patches@gcc.gnu.org
> >
> > This moves the x86 backend to use standard pattern names for
> > mask_gather_load and mask_scatter_store rather than using the
> > builtin_{gather,scatter} target hooks.
> >
> > The patch starts with mask_gather_load covering AVX and AVX512
> > and omits gather_load which doesn't match an actual instruction.
> > The vectorizer will appropriately set up an all-ones mask when
> > necessary.
> >
> > The vectorizer still lacks support for unpacking, and thus for
> > mixed-size gathers like mask_gather_loadv4dfv4si.
> >
> > 2021-08-17  Richard Biener  <rguenther@suse.de>
> >
> >         * config/i386/sse.md
> >         (vec_gather_idxsi, vec_gather_idxdi): New lower-case
> >         mode attributes for VEC_GATHER_IDX{SI,DI}.
> >         (VEC_GATHER_MODEX): New mode iterator.
> >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > ---
> >  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 13889687793..6ae826c268a 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -23640,12 +23640,22 @@
> >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > +(define_mode_attr vec_gather_idxsi
> > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> >
> >  (define_mode_attr VEC_GATHER_IDXDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > +(define_mode_attr vec_gather_idxdi
> > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> >
> >  (define_mode_attr VEC_GATHER_SRCDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > @@ -23653,6 +23663,30 @@
> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > +(define_mode_iterator VEC_GATHER_MODEX
> > +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> > +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> > +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> > +
> > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > +   (match_operand 1 "vsib_address_operand")
> > +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> > +   (match_operand:SI 3 "const0_operand")
> > +   (match_operand:SI 4 "const1248_operand")
> > +   (match_operand 5 "register_operand")]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> No need for the 64-byte modes since an avx512 mask will be used there.
> And shouldn't we also handle the 32-byte vector float modes?

Whoops, yes - that should have been GET_MODE_SIZE (<MODE>mode) != 64.
The condition is supposed to fend off all AVX mask modes here.

> > +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> I guess you want to use <avx2_avx512> to automatically match
> avx2_gathersi<mode> and avx512vl_gathersi<mode>, but <avx2_avx512>
> will only match avx2 for 128/256-bit modes.
> 
> (define_mode_attr avx2_avx512
>   [(V4SI "avx2") (V8SI "avx2") (V16SI "avx512f")
>    (V2DI "avx2") (V4DI "avx2") (V8DI "avx512f")
>    (V4SF "avx2") (V8SF "avx2") (V16SF "avx512f")
>    (V2DF "avx2") (V4DF "avx2") (V8DF "avx512f")
>    (V8HI "avx512vl") (V16HI "avx512vl") (V32HI "avx512bw")])

Hmm yeah.  I merged it too much I guess.  I suppose

  if (TARGET_AVX512VL)
    gen_<avx512>_gathersi<mode> (...);
  else
    gen_<avx2_avx512>_gathersi<mode> (...);

should do the trick.  I'll do these adjustments and will re-test.
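
Spelled out, the preparation statement would then presumably become
something like the following untested sketch (with the GET_MODE_SIZE
check already corrected to 64 as discussed):

  rtx scratch = gen_reg_rtx (<MODE>mode);
  emit_move_insn (scratch, CONST0_RTX (<MODE>mode));
  if (TARGET_AVX512VL)
    /* AVX512VL provides gathers for all modes, using a mask register.  */
    emit_insn (gen_<avx512>_gathersi<mode> (operands[0], scratch,
                                            operands[1], operands[2],
                                            operands[5], operands[4]));
  else
    {
      /* For the 128/256-bit AVX2 gathers fix up the vector mask mode.  */
      if (GET_MODE_SIZE (<MODE>mode) != 64)
        operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
      emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
                                                   operands[1], operands[2],
                                                   operands[5], operands[4]));
    }
  DONE;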

Richard.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 10:28               ` Richard Biener
@ 2021-08-18 11:30                 ` Hongtao Liu
  2021-08-18 11:37                   ` Hongtao Liu
  2021-08-18 11:31                 ` Richard Biener
  1 sibling, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18 11:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 18 Aug 2021, Richard Biener wrote:
>
> >
> > So in the end I seem to be able to combine AVX & AVX512 arriving
> > at the following which passes basic testing.  I will now see to
> > teach the vectorizer the required "promotion" to handle
> > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> >
> > Meanwhile, do you see any hole in the below?  If not I'll
> > do mask_scatter_store accordingly (that will be simpler since
> > AVX doesn't have scatter).
>
> There seems to be one more complication ... we have
>
> (define_expand "avx2_gatherdi<mode>"
>   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
>                      [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand")
>                       (mem:<ssescalarmode>
>                         (match_par_dup 6
>                           [(match_operand 2 "vsib_address_operand")
>                            (match_operand:<VEC_GATHER_IDXDI>
>                               3 "register_operand")
>
> but VEC_GATHER_IDXDI is
>
> (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
>
> I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
VEX.128 version: For dword indices, the instruction will gather four
single-precision floating-point values. For
qword indices, the instruction will gather two values and zero the
upper 64 bits of the destination.
VEX.256 version: For dword indices, the instruction will gather eight
single-precision floating-point values. For
qword indices, the instruction will gather four values and zero the
upper 128 bits of the destination.

So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
under avx2, and v8sfv8di under avx512.

----cut pattern----
(define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
  [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
(unspec:VEC_GATHER_MODE
  [(pc)
   (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
     [(unspec:P
[(match_operand:P 2 "vsib_address_operand" "Tv")
(match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
(match_operand:SI 5 "const1248_operand" "n")]
UNSPEC_VSIBADDR)])
   (mem:BLK (scratch))
   (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
  UNSPEC_GATHER))
   (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
  "TARGET_AVX2"
{
  if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
    return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6,
%x0|%x0, %6, %4}";
----cut end---------------
We use the operand modifier %x0 as a trick to force printing the
xmm form of the destination register.



--------------cut from i386-expand.c-----------
case IX86_BUILTIN_GATHER3DIV16SI:
  if (target == NULL_RTX)
    target = gen_reg_rtx (V8SImode);
  emit_insn (gen_vec_extract_lo_v16si (target, subtarget));
  break;
case IX86_BUILTIN_GATHER3DIV8SF:
case IX86_BUILTIN_GATHERDIV8SF:
  if (target == NULL_RTX)
    target = gen_reg_rtx (V4SFmode);
  emit_insn (gen_vec_extract_lo_v8sf (target, subtarget));
-----------cut from i386-expand.c-----------

and we do ugly things like this when expanding the builtins.

> From the ISA docs I conclude that vgatherqps is not supported for
> a %zmm destination, but is V8SF with V8DI a supported combination?
>
> (define_mode_attr VEC_GATHER_IDXSI
>                       [(V2DI "V4SI") (V4DI "V4SI") (V8DI "V8SI")
>                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
>                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
>                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
>
> looks like what I expect.  So shouldn't we use a special
> VEC_GATHERDI_MODE iterator instead that drops V16{SI,SF} and adjust
> VEC_GATHER_IDXDI accordingly?  Even
>
> (define_expand "<avx512>_gatherdi<mode>"
>   [(parallel [(set (match_operand:VI48F 0 "register_operand")
>                    (unspec:VI48F
>
> "supports" V16SI and V16SF gathers, leading to broken patterns?
>
> I can test if anything goes wrong fixing VEC_GATHER_IDXDI to
>
> (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V4DI") (V8SI "V8DI") (V16SI "V8DI")
>                        (V4SF "V4DI") (V8SF "V8DI") (V16SF "V8DI")])
>
> Thanks,
> Richard.
>
> > Thanks,
> > Richard.
> >
> > From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> > From: Richard Biener <rguenther@suse.de>
> > Date: Tue, 17 Aug 2021 15:14:38 +0200
> > Subject: [PATCH] move x86 to use gather/scatter internal functions
> > To: gcc-patches@gcc.gnu.org
> >
> > This moves the x86 backend to use standard pattern names for
> > mask_gather_load and mask_scatter_store rather than using the
> > builtin_{gather,scatter} target hooks.
> >
> > The patch starts with mask_gather_load covering AVX and AVX512
> > and omits gather_load which doesn't match an actual instruction.
> > The vectorizer will appropriately set up an all-ones mask when
> > necessary.
> >
> > The vectorizer still lacks support for unpacking, and thus for
> > mixed-size gathers like mask_gather_loadv4dfv4si.
> >
> > 2021-08-17  Richard Biener  <rguenther@suse.de>
> >
> >       * config/i386/sse.md
> >       (vec_gather_idxsi, vec_gather_idxdi): New lower-case
> >       mode attributes for VEC_GATHER_IDX{SI,DI}.
> >       (VEC_GATHER_MODEX): New mode iterator.
> >       (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> >       (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > ---
> >  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 13889687793..6ae826c268a 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -23640,12 +23640,22 @@
> >                      (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >                      (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >                      (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > +(define_mode_attr vec_gather_idxsi
> > +                   [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > +                    (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > +                    (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > +                    (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> >
> >  (define_mode_attr VEC_GATHER_IDXDI
> >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                      (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                      (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >                      (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > +(define_mode_attr vec_gather_idxdi
> > +                   [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > +                    (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > +                    (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > +                    (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> >
> >  (define_mode_attr VEC_GATHER_SRCDI
> >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > @@ -23653,6 +23663,30 @@
> >                      (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >                      (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > +(define_mode_iterator VEC_GATHER_MODEX
> > +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> > +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> > +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> > +
> > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > +   (match_operand 1 "vsib_address_operand")
> > +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> > +   (match_operand:SI 3 "const0_operand")
> > +   (match_operand:SI 4 "const1248_operand")
> > +   (match_operand 5 "register_operand")]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> > +                                            operands[1], operands[2],
> > +                                            operands[5], operands[4]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "avx2_gathersi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                  (unspec:VEC_GATHER_MODE
> > @@ -23714,6 +23748,25 @@
> >     (set_attr "prefix" "vex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > +   (match_operand 1 "vsib_address_operand")
> > +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> > +   (match_operand:SI 3 "const0_operand")
> > +   (match_operand:SI 4 "const1248_operand")
> > +   (match_operand 5 "register_operand")]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> > +                                            operands[1], operands[2],
> > +                                            operands[5], operands[4]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "avx2_gatherdi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                  (unspec:VEC_GATHER_MODE
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 10:28               ` Richard Biener
  2021-08-18 11:30                 ` Hongtao Liu
@ 2021-08-18 11:31                 ` Richard Biener
  1 sibling, 0 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-18 11:31 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Richard Biener wrote:

> On Wed, 18 Aug 2021, Richard Biener wrote:
> 
> > 
> > So in the end I seem to be able to combine AVX & AVX512 arriving
> > at the following which passes basic testing.  I will now see to
> > teach the vectorizer the required "promotion" to handle
> > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> > 
> > Meanwhile, do you see any hole in the below?  If not I'll
> > do mask_scatter_store accordingly (that will be simpler since
> > AVX doesn't have scatter).
> 
> There seems to be one more complication ... we have
> 
> (define_expand "avx2_gatherdi<mode>"
>   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>                    (unspec:VEC_GATHER_MODE
>                      [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand")
>                       (mem:<ssescalarmode>
>                         (match_par_dup 6
>                           [(match_operand 2 "vsib_address_operand")
>                            (match_operand:<VEC_GATHER_IDXDI>
>                               3 "register_operand")
> 
> but VEC_GATHER_IDXDI is
> 
> (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> 
> I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> From the ISA docs I conclude that vgatherqps is not supported for
> a %zmm destination, but is V8SF with V8DI a supported combination?
> 
> (define_mode_attr VEC_GATHER_IDXSI
>                       [(V2DI "V4SI") (V4DI "V4SI") (V8DI "V8SI")
>                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
>                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
>                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> 
> looks like what I expect.  So shouldn't we use a special
> VEC_GATHERDI_MODE iterator instead that drops V16{SI,SF} and adjust
> VEC_GATHER_IDXDI accordingly?  Even
> 
> (define_expand "<avx512>_gatherdi<mode>"
>   [(parallel [(set (match_operand:VI48F 0 "register_operand")
>                    (unspec:VI48F
> 
> "supports" V16SI and V16SF gathers, leading to broken patterns?
> 
> I can test if anything goes wrong fixing VEC_GATHER_IDXDI to
> 
> (define_mode_attr VEC_GATHER_IDXDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>                        (V4SI "V4DI") (V8SI "V8DI") (V16SI "V8DI")
>                        (V4SF "V4DI") (V8SF "V8DI") (V16SF "V8DI")])

Hum, it seems to be more complicated (or simpler).  The gatherq
instructions will always fill only half of the SFmode destination
while the gatherd will only use half of the index vector for a DFmode
destination.

So we support mask_gather_loadv2sfv2di for example but the insn
patterns want v4sf/v2di operands in the end (and the upper half
of the register is zeroed).
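
For example (operands made up here), the AVX512VL form with a ymm index
register only writes an xmm result:

        vgatherqps      (%rdi,%ymm1,4), %xmm0{%k1}  # 4 qword indices -> 4 SF values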

I think I'm going to use a completely separate set of mode
iterators/attributes for the mask_gather_load patterns.
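
A first cut could look like the following untested sketch (the iterator
and attribute names are invented here), dropping the V16 modes and
halving the SI/SF destination modes for qword indices:

  (define_mode_iterator VEC_GATHERDI_MODE
    [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
     (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")])

  (define_mode_attr vec_gatherdi_idx
    [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
     (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
     (V4SI "v4di") (V8SI "v8di")
     (V4SF "v4di") (V8SF "v8di")])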

I have to study the builtin path in the vectorizer a bit to see how
it handles all of this, but I suspect it's closely tied to these
x86 specialities.

Richard.


> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Richard.
> > 
> > From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> > From: Richard Biener <rguenther@suse.de>
> > Date: Tue, 17 Aug 2021 15:14:38 +0200
> > Subject: [PATCH] move x86 to use gather/scatter internal functions
> > To: gcc-patches@gcc.gnu.org
> > 
> > This moves the x86 backend to use standard pattern names for
> > mask_gather_load and mask_scatter_store rather than using the
> > builtin_{gather,scatter} target hooks.
> > 
> > The patch starts with mask_gather_load covering AVX and AVX512
> > and omits gather_load which doesn't match an actual instruction.
> > The vectorizer will appropriately set up an all-ones mask when
> > necessary.
> >
> > The vectorizer still lacks support for unpacking, and thus for
> > mixed-size gathers like mask_gather_loadv4dfv4si.
> > 
> > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > 
> > 	* config/i386/sse.md
> > 	(vec_gather_idxsi, vec_gather_idxdi): New lower-case
> > 	mode attributes for VEC_GATHER_IDX{SI,DI}.
> > 	(VEC_GATHER_MODEX): New mode iterator.
> > 	(mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > 	(mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > ---
> >  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> > 
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 13889687793..6ae826c268a 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -23640,12 +23640,22 @@
> >  		       (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >  		       (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >  		       (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > +(define_mode_attr vec_gather_idxsi
> > +		      [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > +		       (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > +		       (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > +		       (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> >  
> >  (define_mode_attr VEC_GATHER_IDXDI
> >  		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >  		       (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >  		       (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >  		       (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > +(define_mode_attr vec_gather_idxdi
> > +		      [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > +		       (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > +		       (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > +		       (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> >  
> >  (define_mode_attr VEC_GATHER_SRCDI
> >  		      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > @@ -23653,6 +23663,30 @@
> >  		       (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >  		       (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >  
> > +(define_mode_iterator VEC_GATHER_MODEX
> > +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> > +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> > +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> > +
> > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > +   (match_operand 1 "vsib_address_operand")
> > +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> > +   (match_operand:SI 3 "const0_operand")
> > +   (match_operand:SI 4 "const1248_operand")
> > +   (match_operand 5 "register_operand")]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> > +					       operands[1], operands[2],
> > +					       operands[5], operands[4]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "avx2_gathersi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >  		   (unspec:VEC_GATHER_MODE
> > @@ -23714,6 +23748,25 @@
> >     (set_attr "prefix" "vex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >  
> > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > +   (match_operand 1 "vsib_address_operand")
> > +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> > +   (match_operand:SI 3 "const0_operand")
> > +   (match_operand:SI 4 "const1248_operand")
> > +   (match_operand 5 "register_operand")]
> > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > +{
> > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> > +					       operands[1], operands[2],
> > +					       operands[5], operands[4]));
> > +  DONE;
> > +})
> > +
> >  (define_expand "avx2_gatherdi<mode>"
> >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >  		   (unspec:VEC_GATHER_MODE
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 11:30                 ` Hongtao Liu
@ 2021-08-18 11:37                   ` Hongtao Liu
  2021-08-18 11:38                     ` Hongtao Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18 11:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Wed, 18 Aug 2021, Richard Biener wrote:
> >
> > >
> > > So in the end I seem to be able to combine AVX & AVX512 arriving
> > > at the following which passes basic testing.  I will now see to
> > > teach the vectorizer the required "promotion" to handle
> > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> > >
> > > Meanwhile, do you see any hole in the below?  If not I'll
> > > do mask_scatter_store accordingly (that will be simpler since
> > > AVX doesn't have scatter).
> >
> > There seems to be one more complication ... we have
> >
> > (define_expand "avx2_gatherdi<mode>"
> >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >                    (unspec:VEC_GATHER_MODE
> >                      [(match_operand:<VEC_GATHER_SRCDI> 1
> > "register_operand")
> >                       (mem:<ssescalarmode>
> >                         (match_par_dup 6
> >                           [(match_operand 2 "vsib_address_operand")
> >                            (match_operand:<VEC_GATHER_IDXDI>
> >                               3 "register_operand")
> >
> > but VEC_GATHER_IDXDI is
> >
> > (define_mode_attr VEC_GATHER_IDXDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> >
> > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> VEX.128 version: For dword indices, the instruction will gather four
> single-precision floating-point values. For
> qword indices, the instruction will gather two values and zero the
> upper 64 bits of the destination.
> VEX.256 version: For dword indices, the instruction will gather eight
> single-precision floating-point values. For
> qword indices, the instruction will gather four values and zero the
> upper 128 bits of the destination.
>
> So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
> under avx2, and v8sfv8di under avx512.
>
> ----cut pattern----
> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>         (unspec:VEC_GATHER_MODE
>           [(pc)
>            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
>              [(unspec:P
>                 [(match_operand:P 2 "vsib_address_operand" "Tv")
>                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
>                  (match_operand:SI 5 "const1248_operand" "n")]
>                 UNSPEC_VSIBADDR)])
>            (mem:BLK (scratch))
>            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
>           UNSPEC_GATHER))
>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>   "TARGET_AVX2"
> {
>   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
>     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
> ----cut end---------------
> We are using the trick of the operand modifier %x0 to force printing the xmm register name.
(define_mode_attr VEC_GATHER_SRCDI
                      [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
                       (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
                       (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
                       (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])

(define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
  [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
        (unspec:VEC_GATHER_MODE
          [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
           (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
             [(unspec:P
                [(match_operand:P 3 "vsib_address_operand" "Tv")
                 (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
                 (match_operand:SI 6 "const1248_operand" "n")]
                UNSPEC_VSIBADDR)])
           (mem:BLK (scratch))
           (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
          UNSPEC_GATHER))
   (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
  "TARGET_AVX2"
  "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"

Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.
>
>
>
> --------------cut from i386-expand.c-----------
> case IX86_BUILTIN_GATHER3DIV16SI:
>   if (target == NULL_RTX)
>     target = gen_reg_rtx (V8SImode);
>   emit_insn (gen_vec_extract_lo_v16si (target, subtarget));
>   break;
> case IX86_BUILTIN_GATHER3DIV8SF:
> case IX86_BUILTIN_GATHERDIV8SF:
>   if (target == NULL_RTX)
>     target = gen_reg_rtx (V4SFmode);
>   emit_insn (gen_vec_extract_lo_v8sf (target, subtarget));
> -----------cut from i386-expand.c-----------
>
> and do ugly things when expand builtin
>
> > From the ISA docs I conclude that vgatherqps is not supported for
> > a %zmm destination but V8SF and V8DI is a supported combination?
> >
> > (define_mode_attr VEC_GATHER_IDXSI
> >                       [(V2DI "V4SI") (V4DI "V4SI") (V8DI "V8SI")
> >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> >
> > looks like what I expect.  So shouldn't we use a special
> > VEC_GATHERDI_MODE iterator instead that drops V16{SI,SF} and adjust
> > VEC_GATHER_IDXDI accordingly?  Even
> >
> > (define_expand "<avx512>_gatherdi<mode>"
> >   [(parallel [(set (match_operand:VI48F 0 "register_operand")
> >                    (unspec:VI48F
> >
> > "supports" V16SI and V16SF gathers, leading to broken patterns?
> >
> > I can test if anything goes wrong fixing VEC_GATHER_IDXDI to
> >
> > (define_mode_attr VEC_GATHER_IDXDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >                        (V4SI "V4DI") (V8SI "V8DI") (V16SI "V8DI")
> >                        (V4SF "V4DI") (V8SF "V8DI") (V16SF "V8DI")])
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Richard.
> > >
> > > From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> > > From: Richard Biener <rguenther@suse.de>
> > > Date: Tue, 17 Aug 2021 15:14:38 +0200
> > > Subject: [PATCH] move x86 to use gather/scatter internal functions
> > > To: gcc-patches@gcc.gnu.org
> > >
> > > This moves the x86 backend to use standard pattern names for
> > > mask_gather_load and mask_scatter_store rather than using the
> > > builtin_{gather,scatter} target hooks.
> > >
> > > The patch starts with mask_gather_load covering AVX and AVX512
> > > and omits gather_load which doesn't match an actual instruction.
> > > The vectorizer will appropriately set up an all-ones mask when
> > > necessary.
> > >
> > > The vectorizer still lacks support for unpacking, thus
> > > mixed size gathers like mask_gather_loadv4dfv4si.
> > >
> > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > >
> > >       * config/i386/sse.md
> > >       (vec_gather_idxsi, vec_gather_idxdi): New lower-case
> > >       mode attributes for VEC_GATHER_IDX{SI,DI}.
> > >       (VEC_GATHER_MODEX): New mode iterator.
> > >       (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > >       (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > ---
> > >  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 53 insertions(+)
> > >
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index 13889687793..6ae826c268a 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -23640,12 +23640,22 @@
> > >                      (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > >                      (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > >                      (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > +(define_mode_attr vec_gather_idxsi
> > > +                   [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > +                    (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > +                    (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > +                    (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > >
> > >  (define_mode_attr VEC_GATHER_IDXDI
> > >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > >                      (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > >                      (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > >                      (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > +(define_mode_attr vec_gather_idxdi
> > > +                   [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > +                    (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > +                    (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > +                    (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > >
> > >  (define_mode_attr VEC_GATHER_SRCDI
> > >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > @@ -23653,6 +23663,30 @@
> > >                      (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > >                      (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > >
> > > +(define_mode_iterator VEC_GATHER_MODEX
> > > +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> > > +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> > > +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> > > +
> > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > > +   (match_operand 1 "vsib_address_operand")
> > > +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> > > +   (match_operand:SI 3 "const0_operand")
> > > +   (match_operand:SI 4 "const1248_operand")
> > > +   (match_operand 5 "register_operand")]
> > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > +{
> > > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > > +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > > +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> > > +                                            operands[1], operands[2],
> > > +                                            operands[5], operands[4]));
> > > +  DONE;
> > > +})
> > > +
> > >  (define_expand "avx2_gathersi<mode>"
> > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > >                  (unspec:VEC_GATHER_MODE
> > > @@ -23714,6 +23748,25 @@
> > >     (set_attr "prefix" "vex")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > > +   (match_operand 1 "vsib_address_operand")
> > > +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> > > +   (match_operand:SI 3 "const0_operand")
> > > +   (match_operand:SI 4 "const1248_operand")
> > > +   (match_operand 5 "register_operand")]
> > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > +{
> > > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > > +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > > +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> > > +                                            operands[1], operands[2],
> > > +                                            operands[5], operands[4]));
> > > +  DONE;
> > > +})
> > > +
> > >  (define_expand "avx2_gatherdi<mode>"
> > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > >                  (unspec:VEC_GATHER_MODE
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 11:37                   ` Hongtao Liu
@ 2021-08-18 11:38                     ` Hongtao Liu
  2021-08-18 13:43                       ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Hongtao Liu @ 2021-08-18 11:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, Hongtao Liu, GCC Patches

On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Wed, 18 Aug 2021, Richard Biener wrote:
> > >
> > > >
> > > > So in the end I seem to be able to combine AVX & AVX512 arriving
> > > > at the following which passes basic testing.  I will now see to
> > > > teach the vectorizer the required "promotion" to handle
> > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> > > >
> > > > Meanwhile, do you see any hole in the below?  If not I'll
> > > > do mask_scatter_store accordingly (that will be simpler since
> > > > AVX doesn't have scatter).
> > >
> > > There seems to be one more complication ... we have
> > >
> > > (define_expand "avx2_gatherdi<mode>"
> > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > >                    (unspec:VEC_GATHER_MODE
> > >                      [(match_operand:<VEC_GATHER_SRCDI> 1
> > > "register_operand")
> > >                       (mem:<ssescalarmode>
> > >                         (match_par_dup 6
> > >                           [(match_operand 2 "vsib_address_operand")
> > >                            (match_operand:<VEC_GATHER_IDXDI>
> > >                               3 "register_operand")
> > >
> > > but VEC_GATHER_IDXDI is
> > >
> > > (define_mode_attr VEC_GATHER_IDXDI
> > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > >
> > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> > VEX.128 version: For dword indices, the instruction will gather four
> > single-precision floating-point values. For
> > qword indices, the instruction will gather two values and zero the
> > upper 64 bits of the destination.
> > VEX.256 version: For dword indices, the instruction will gather eight
> > single-precision floating-point values. For
> > qword indices, the instruction will gather four values and zero the
> > upper 128 bits of the destination.
> >
> > So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
> > under avx2, and v8sfv8di under avx512.
> >
> > ----cut pattern----
> > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >         (unspec:VEC_GATHER_MODE
> >           [(pc)
> >            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
> >              [(unspec:P
> >                 [(match_operand:P 2 "vsib_address_operand" "Tv")
> >                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
> >                  (match_operand:SI 5 "const1248_operand" "n")]
> >                 UNSPEC_VSIBADDR)])
> >            (mem:BLK (scratch))
> >            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
> >           UNSPEC_GATHER))
> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >   "TARGET_AVX2"
> > {
> >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
> >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
> > ----cut end---------------
> > We are using the trick of the operand modifier %x0 to force printing the xmm register name.
> (define_mode_attr VEC_GATHER_SRCDI
>                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>                        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
>                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
>                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
>
> (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>         (unspec:VEC_GATHER_MODE
>           [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
>            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>              [(unspec:P
>                 [(match_operand:P 3 "vsib_address_operand" "Tv")
>                  (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
>                  (match_operand:SI 6 "const1248_operand" "n")]
>                 UNSPEC_VSIBADDR)])
>            (mem:BLK (scratch))
>            (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
>           UNSPEC_GATHER))
>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>   "TARGET_AVX2"
>   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
>
> Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.

Typo: it should be operands[2] in the pattern below:
(define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
  [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
(unspec:VEC_GATHER_MODE
  [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
   (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
     [(unspec:P
[(match_operand:P 3 "vsib_address_operand" "Tv")
(match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
(match_operand:SI 6 "const1248_operand" "n")]
UNSPEC_VSIBADDR)])
   (mem:BLK (scratch))
   (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
  UNSPEC_GATHER))
   (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
  "TARGET_AVX2"
  "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"

> >
> >
> >
> > --------------cut from i386-expand.c-----------
> > case IX86_BUILTIN_GATHER3DIV16SI:
> >   if (target == NULL_RTX)
> >     target = gen_reg_rtx (V8SImode);
> >   emit_insn (gen_vec_extract_lo_v16si (target, subtarget));
> >   break;
> > case IX86_BUILTIN_GATHER3DIV8SF:
> > case IX86_BUILTIN_GATHERDIV8SF:
> >   if (target == NULL_RTX)
> >     target = gen_reg_rtx (V4SFmode);
> >   emit_insn (gen_vec_extract_lo_v8sf (target, subtarget));
> > -----------cut from i386-expand.c-----------
> >
> > and do ugly things when expanding the builtin
> >
> > > From the ISA docs I conclude that vgatherqps is not supported for
> > > a %zmm destination but V8SF and V8DI is a supported combination?
> > >
> > > (define_mode_attr VEC_GATHER_IDXSI
> > >                       [(V2DI "V4SI") (V4DI "V4SI") (V8DI "V8SI")
> > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > >
> > > looks like what I expect.  So shouldn't we use a special
> > > VEC_GATHERDI_MODE iterator instead that drops V16{SI,SF} and adjust
> > > VEC_GATHER_IDXDI accordingly?  Even
> > >
> > > (define_expand "<avx512>_gatherdi<mode>"
> > >   [(parallel [(set (match_operand:VI48F 0 "register_operand")
> > >                    (unspec:VI48F
> > >
> > > "supports" V16SI and V16SF gathers, leading to broken patterns?
> > >
> > > I can test if anything goes wrong fixing VEC_GATHER_IDXDI to
> > >
> > > (define_mode_attr VEC_GATHER_IDXDI
> > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > >                        (V4SI "V4DI") (V8SI "V8DI") (V16SI "V8DI")
> > >                        (V4SF "V4DI") (V8SF "V8DI") (V16SF "V8DI")])
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > From 706d3ac96b384eaad249cc83ec542ec643e21a4c Mon Sep 17 00:00:00 2001
> > > > From: Richard Biener <rguenther@suse.de>
> > > > Date: Tue, 17 Aug 2021 15:14:38 +0200
> > > > Subject: [PATCH] move x86 to use gather/scatter internal functions
> > > > To: gcc-patches@gcc.gnu.org
> > > >
> > > > This moves the x86 backend to use standard pattern names for
> > > > mask_gather_load and mask_scatter_store rather than using the
> > > > builtin_{gather,scatter} target hooks.
> > > >
> > > > The patch starts with mask_gather_load covering AVX and AVX512
> > > > and omits gather_load which doesn't match an actual instruction.
> > > > The vectorizer will appropriately set up an all-ones mask when
> > > > necessary.
> > > >
> > > > The vectorizer still lacks support for unpacking, thus
> > > > mixed size gathers like mask_gather_loadv4dfv4si.
> > > >
> > > > 2021-08-17  Richard Biener  <rguenther@suse.de>
> > > >
> > > >       * config/i386/sse.md
> > > >       (vec_gather_idxsi, vec_gather_idxdi): New lower-case
> > > >       mode attributes for VEC_GATHER_IDX{SI,DI}.
> > > >       (VEC_GATHER_MODEX): New mode iterator.
> > > >       (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > >       (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > ---
> > > >  gcc/config/i386/sse.md | 53 ++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 53 insertions(+)
> > > >
> > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > index 13889687793..6ae826c268a 100644
> > > > --- a/gcc/config/i386/sse.md
> > > > +++ b/gcc/config/i386/sse.md
> > > > @@ -23640,12 +23640,22 @@
> > > >                      (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > >                      (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > >                      (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > +(define_mode_attr vec_gather_idxsi
> > > > +                   [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > +                    (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > +                    (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > +                    (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > >
> > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > >                      (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > >                      (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > >                      (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > +(define_mode_attr vec_gather_idxdi
> > > > +                   [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > +                    (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > +                    (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > +                    (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > >
> > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > >                     [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > @@ -23653,6 +23663,30 @@
> > > >                      (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > >                      (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > >
> > > > +(define_mode_iterator VEC_GATHER_MODEX
> > > > +  [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF
> > > > +   (V8DI "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
> > > > +   (V16SI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> > > > +
> > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > > > +   (match_operand 1 "vsib_address_operand")
> > > > +   (match_operand:<VEC_GATHER_IDXSI> 2 "register_operand")
> > > > +   (match_operand:SI 3 "const0_operand")
> > > > +   (match_operand:SI 4 "const1248_operand")
> > > > +   (match_operand 5 "register_operand")]
> > > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > > +{
> > > > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > > > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > > > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > > > +    operands[5] = gen_lowpart_SUBREG (<MODE>mode, operands[5]);
> > > > +  emit_insn (gen_<avx2_avx512>_gathersi<mode> (operands[0], scratch,
> > > > +                                            operands[1], operands[2],
> > > > +                                            operands[5], operands[4]));
> > > > +  DONE;
> > > > +})
> > > > +
> > > >  (define_expand "avx2_gathersi<mode>"
> > > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > >                  (unspec:VEC_GATHER_MODE
> > > > @@ -23714,6 +23748,25 @@
> > > >     (set_attr "prefix" "vex")
> > > >     (set_attr "mode" "<sseinsnmode>")])
> > > >
> > > > +(define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> > > > +  [(match_operand:VEC_GATHER_MODEX 0 "register_operand")
> > > > +   (match_operand 1 "vsib_address_operand")
> > > > +   (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> > > > +   (match_operand:SI 3 "const0_operand")
> > > > +   (match_operand:SI 4 "const1248_operand")
> > > > +   (match_operand 5 "register_operand")]
> > > > +  "TARGET_AVX2 && TARGET_USE_GATHER"
> > > > +{
> > > > +  rtx scratch = gen_reg_rtx (<MODE>mode);
> > > > +  emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> > > > +  if (GET_MODE_SIZE (<MODE>mode) != 32 && !TARGET_AVX512VL)
> > > > +    operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> > > > +  emit_insn (gen_<avx2_avx512>_gatherdi<mode> (operands[0], scratch,
> > > > +                                            operands[1], operands[2],
> > > > +                                            operands[5], operands[4]));
> > > > +  DONE;
> > > > +})
> > > > +
> > > >  (define_expand "avx2_gatherdi<mode>"
> > > >    [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > >                  (unspec:VEC_GATHER_MODE
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > > Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 11:38                     ` Hongtao Liu
@ 2021-08-18 13:43                       ` Richard Biener
  2021-08-18 14:43                         ` Richard Sandiford
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2021-08-18 13:43 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: Richard Biener, Hongtao Liu, GCC Patches, richard.sandiford

On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Wed, 18 Aug 2021, Richard Biener wrote:
> > > >
> > > > >
> > > > > So in the end I seem to be able to combine AVX & AVX512 arriving
> > > > > at the following which passes basic testing.  I will now see to
> > > > > teach the vectorizer the required "promotion" to handle
> > > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> > > > >
> > > > > Meanwhile, do you see any hole in the below?  If not I'll
> > > > > do mask_scatter_store accordingly (that will be simpler since
> > > > > AVX doesn't have scatter).
> > > >
> > > > There seems to be one more complication ... we have
> > > >
> > > > (define_expand "avx2_gatherdi<mode>"
> > > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > >                    (unspec:VEC_GATHER_MODE
> > > >                      [(match_operand:<VEC_GATHER_SRCDI> 1
> > > > "register_operand")
> > > >                       (mem:<ssescalarmode>
> > > >                         (match_par_dup 6
> > > >                           [(match_operand 2 "vsib_address_operand")
> > > >                            (match_operand:<VEC_GATHER_IDXDI>
> > > >                               3 "register_operand")
> > > >
> > > > but VEC_GATHER_IDXDI is
> > > >
> > > > (define_mode_attr VEC_GATHER_IDXDI
> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > >
> > > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> > > VEX.128 version: For dword indices, the instruction will gather four
> > > single-precision floating-point values. For
> > > qword indices, the instruction will gather two values and zero the
> > > upper 64 bits of the destination.
> > > VEX.256 version: For dword indices, the instruction will gather eight
> > > single-precision floating-point values. For
> > > qword indices, the instruction will gather four values and zero the
> > > upper 128 bits of the destination.
> > >
> > > So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
> > > under avx2, and v8sfv8di under avx512.
> > >
> > > ----cut pattern----
> > > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
> > >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> > >         (unspec:VEC_GATHER_MODE
> > >           [(pc)
> > >            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
> > >              [(unspec:P
> > >                 [(match_operand:P 2 "vsib_address_operand" "Tv")
> > >                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
> > >                  (match_operand:SI 5 "const1248_operand" "n")]
> > >                 UNSPEC_VSIBADDR)])
> > >            (mem:BLK (scratch))
> > >            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
> > >           UNSPEC_GATHER))
> > >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> > >   "TARGET_AVX2"
> > > {
> > >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
> > >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
> > > ----cut end---------------
> > > We are using the trick of the operand modifier %x0 to force printing the xmm register name.
> > (define_mode_attr VEC_GATHER_SRCDI
> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >                        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >         (unspec:VEC_GATHER_MODE
> >           [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
> >            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >              [(unspec:P
> >                 [(match_operand:P 3 "vsib_address_operand" "Tv")
> >                  (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
> >                  (match_operand:SI 6 "const1248_operand" "n")]
> >                 UNSPEC_VSIBADDR)])
> >            (mem:BLK (scratch))
> >            (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
> >           UNSPEC_GATHER))
> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >   "TARGET_AVX2"
> >   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> >
> > Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.
> 
> Typo: it should be operands[2] in the pattern below:
> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>         (unspec:VEC_GATHER_MODE
>           [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
>            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>              [(unspec:P
>                 [(match_operand:P 3 "vsib_address_operand" "Tv")
>                  (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
>                  (match_operand:SI 6 "const1248_operand" "n")]
>                 UNSPEC_VSIBADDR)])
>            (mem:BLK (scratch))
>            (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
>           UNSPEC_GATHER))
>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>   "TARGET_AVX2"
>   "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"

It's somewhat of a mess.  When we expose

  mask_gather_loadv8sfv8di

for example, then with -mavx512f the vectorizer will generate a
vector mask for v8sf, which will be an AVX2-style mask.  But in
the end this will use gen_avx512f_gatherdiv16sf and require an
AVX512-style mask.
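
Concretely (my sketch, not a testcase from this thread), the mismatch
would show up for a loop like:

  /* With -mavx512f -mno-avx512vl the v8sf mask computed for the
     condition is an AVX2-style vector mask, but the V8DI-indexed
     gather insn is EVEX-encoded and wants a kmask.  */
  void
  f (float *restrict dst, const float *restrict src,
     const long long *restrict idx, const float *restrict c, int n)
  {
    for (int i = 0; i < n; ++i)
      if (c[i] > 0.0f)
        dst[i] = src[idx[i]];
  }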

So I'm not sure whether mask_gather_load is a good fit for x86
for the instruction variants where the index and data vectors
have different sizes.  For mask_gather_loadv8sfv8di we could
instead have mask_gather_load_lowv16sfv8di and for
mask_gather_loadv8dfv8si we'd have mask_gather_load_idxlov8dfv16si?

Or simply do not constrain the index/data vectors and document
mask_gather_load to load as many elements as min(data,index)
vector elements ... (curiously they are currently only loosely
constrained - seemingly allowing a larger index vector at least).

For the builtins we have eye candy like

  def_builtin_pure (OPTION_MASK_ISA_AVX512F, 0, "__builtin_ia32_gatherdiv16sf",
                    V8SF_FTYPE_V8SF_PCVOID_V8DI_QI_INT,
                    IX86_BUILTIN_GATHER3DIV16SF);

note the DIV16SF name but the V8SF/V8DI types.  But there's also

  def_builtin_pure (OPTION_MASK_ISA_AVX512F, 0, "__builtin_ia32_gather3altdiv16sf ",
                    V16SF_FTYPE_V16SF_PCFLOAT_V8DI_HI_INT,
                    IX86_BUILTIN_GATHER3ALTDIV16SF);

which is what the vectorizer seems to use and which magically
knows that only the lower part of the data is populated.

The vectorizer seems to key the cases on

  if (known_eq (nunits, gather_off_nunits))
    modifier = NONE;
  else if (known_eq (nunits * 2, gather_off_nunits))
    {
      modifier = WIDEN;
...
    }
  else if (known_eq (nunits, gather_off_nunits * 2))
    {
      modifier = NARROW;

in vect_build_gather_load_calls; e.g. the gather3altdiv16sf case
above (V16SF data, V8DI offsets) takes the NARROW path.  Note that
for the IFN case we
rely on pattern recognition to produce a "scalar" gather IFN call
and we do not recover the offset vector type from the machine
description but from the actual vector type of the vector def.

Richard, do you have any advice here?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 13:43                       ` Richard Biener
@ 2021-08-18 14:43                         ` Richard Sandiford
  2021-08-19  8:45                           ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Sandiford @ 2021-08-18 14:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hongtao Liu, Richard Biener, Hongtao Liu, GCC Patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 18 Aug 2021, Hongtao Liu wrote:
>
>> On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazylht@gmail.com> wrote:
>> >
>> > On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazylht@gmail.com> wrote:
>> > >
>> > > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
>> > > >
>> > > > On Wed, 18 Aug 2021, Richard Biener wrote:
>> > > >
>> > > > >
>> > > > > So in the end I seem to be able to combine AVX & AVX512 arriving
>> > > > > at the following which passes basic testing.  I will now see to
>> > > > > teach the vectorizer the required "promotion" to handle
>> > > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
>> > > > >
>> > > > > Meanwhile, do you see any hole in the below?  If not I'll
>> > > > > do mask_scatter_store accordingly (that will be simpler since
>> > > > > AVX doesn't have scatter).
>> > > >
>> > > > There seems to be one more complication ... we have
>> > > >
>> > > > (define_expand "avx2_gatherdi<mode>"
>> > > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
>> > > >                    (unspec:VEC_GATHER_MODE
>> > > >                      [(match_operand:<VEC_GATHER_SRCDI> 1
>> > > > "register_operand")
>> > > >                       (mem:<ssescalarmode>
>> > > >                         (match_par_dup 6
>> > > >                           [(match_operand 2 "vsib_address_operand")
>> > > >                            (match_operand:<VEC_GATHER_IDXDI>
>> > > >                               3 "register_operand")
>> > > >
>> > > > but VEC_GATHER_IDXDI is
>> > > >
>> > > > (define_mode_attr VEC_GATHER_IDXDI
>> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
>> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
>> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
>> > > >
>> > > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
>> > > VEX.128 version: For dword indices, the instruction will gather four
>> > > single-precision floating-point values. For
>> > > qword indices, the instruction will gather two values and zero the
>> > > upper 64 bits of the destination.
>> > > VEX.256 version: For dword indices, the instruction will gather eight
>> > > single-precision floating-point values. For
>> > > qword indices, the instruction will gather four values and zero the
>> > > upper 128 bits of the destination.
>> > >
>> > > So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
>> > > under avx2, and v8sfv8di under avx512.
>> > >
>> > > ----cut pattern----
>> > > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
>> > >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>> > >         (unspec:VEC_GATHER_MODE
>> > >           [(pc)
>> > >            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
>> > >              [(unspec:P
>> > >                 [(match_operand:P 2 "vsib_address_operand" "Tv")
>> > >                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
>> > >                  (match_operand:SI 5 "const1248_operand" "n")]
>> > >                 UNSPEC_VSIBADDR)])
>> > >            (mem:BLK (scratch))
>> > >            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
>> > >           UNSPEC_GATHER))
>> > >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>> > >   "TARGET_AVX2"
>> > > {
>> > >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
>> > >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
>> > > ----cut end---------------
>> > > We are using the trick of the operand modifier %x0 to force printing the xmm register name.
>> > (define_mode_attr VEC_GATHER_SRCDI
>> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
>> >                        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
>> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
>> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
>> >
>> > (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
>> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>> >         (unspec:VEC_GATHER_MODE
>> >           [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
>> >            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>> >              [(unspec:P
>> >                 [(match_operand:P 3 "vsib_address_operand" "Tv")
>> >                  (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
>> >                  (match_operand:SI 6 "const1248_operand" "n")]
>> >                 UNSPEC_VSIBADDR)])
>> >            (mem:BLK (scratch))
>> >            (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
>> >           UNSPEC_GATHER))
>> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>> >   "TARGET_AVX2"
>> >   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
>> >
>> > Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.
>> 
>> Typo: it should be operands[2] in the pattern below:
>> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
>>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
>>         (unspec:VEC_GATHER_MODE
>>           [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
>>            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>>              [(unspec:P
>>                 [(match_operand:P 3 "vsib_address_operand" "Tv")
>>                  (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
>>                  (match_operand:SI 6 "const1248_operand" "n")]
>>                 UNSPEC_VSIBADDR)])
>>            (mem:BLK (scratch))
>>            (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
>>           UNSPEC_GATHER))
>>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>>   "TARGET_AVX2"
>>   "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"
>
> It's somewhat of a mess.  When we expose
>
>   mask_gather_loadv8sfv8di
>
> for example, then with -mavx512f the vectorizer will generate a
> vector mask for v8sf, which will be an AVX2-style mask.  But in
> the end this will use gen_avx512f_gatherdiv16sf and require an
> AVX512-style mask.

I think it would be OK/sensible to use the larger of the index or
result vectors to determine the mask, if that helps.  There just
wasn't any need to make a distinction for SVE, since there the
mask type is determined entirely by the number of elements.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-18 14:43                         ` Richard Sandiford
@ 2021-08-19  8:45                           ` Richard Biener
  2021-08-19  9:12                             ` Richard Sandiford
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Biener @ 2021-08-19  8:45 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Hongtao Liu, Richard Biener, Hongtao Liu, GCC Patches

On Wed, 18 Aug 2021, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 18 Aug 2021, Hongtao Liu wrote:
> >
> >> On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >> >
> >> > On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >> > >
> >> > > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguenther@suse.de> wrote:
> >> > > >
> >> > > > On Wed, 18 Aug 2021, Richard Biener wrote:
> >> > > >
> >> > > > >
> >> > > > > So in the end I seem to be able to combine AVX & AVX512 arriving
> >> > > > > at the following which passes basic testing.  I will now see to
> >> > > > > teach the vectorizer the required "promotion" to handle
> >> > > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> >> > > > >
> >> > > > > Meanwhile, do you see any hole in the below?  If not I'll
> >> > > > > do mask_scatter_store accordingly (that will be simpler since
> >> > > > > AVX doesn't have scatter).
> >> > > >
> >> > > > There seems to be one more complication ... we have
> >> > > >
> >> > > > (define_expand "avx2_gatherdi<mode>"
> >> > > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >> > > >                    (unspec:VEC_GATHER_MODE
> >> > > >                      [(match_operand:<VEC_GATHER_SRCDI> 1
> >> > > > "register_operand")
> >> > > >                       (mem:<ssescalarmode>
> >> > > >                         (match_par_dup 6
> >> > > >                           [(match_operand 2 "vsib_address_operand")
> >> > > >                            (match_operand:<VEC_GATHER_IDXDI>
> >> > > >                               3 "register_operand")
> >> > > >
> >> > > > but VEC_GATHER_IDXDI is
> >> > > >
> >> > > > (define_mode_attr VEC_GATHER_IDXDI
> >> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> >> > > >
> >> > > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> >> > > VEX.128 version: For dword indices, the instruction will gather four
> >> > > single-precision floating-point values. For
> >> > > qword indices, the instruction will gather two values and zero the
> >> > > upper 64 bits of the destination.
> >> > > VEX.256 version: For dword indices, the instruction will gather eight
> >> > > single-precision floating-point values. For
> >> > > qword indices, the instruction will gather four values and zero the
> >> > > upper 128 bits of the destination.
> >> > >
> >> > > So, for the expander names, it should be v2sfv2di and v4sfv4di for IDXDI
> >> > > under avx2, and v8sfv8di under avx512.
> >> > >
> >> > > ----cut pattern----
> >> > > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
> >> > >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >> > >         (unspec:VEC_GATHER_MODE
> >> > >           [(pc)
> >> > >            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
> >> > >              [(unspec:P
> >> > >                 [(match_operand:P 2 "vsib_address_operand" "Tv")
> >> > >                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
> >> > >                  (match_operand:SI 5 "const1248_operand" "n")]
> >> > >                 UNSPEC_VSIBADDR)])
> >> > >            (mem:BLK (scratch))
> >> > >            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
> >> > >           UNSPEC_GATHER))
> >> > >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >> > >   "TARGET_AVX2"
> >> > > {
> >> > >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
> >> > >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
> >> > > ----cut end---------------
> >> > > We are using the trick of the operand modifier %x0 to force printing the xmm register name.
> >> > (define_mode_attr VEC_GATHER_SRCDI
> >> >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >> >                        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
> >> >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >> >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >> >
> >> > (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
> >> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >> >         (unspec:VEC_GATHER_MODE
> >> >           [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
> >> >            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >> >              [(unspec:P
> >> >                 [(match_operand:P 3 "vsib_address_operand" "Tv")
> >> >                  (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
> >> >                  (match_operand:SI 6 "const1248_operand" "n")]
> >> >                 UNSPEC_VSIBADDR)])
> >> >            (mem:BLK (scratch))
> >> >            (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
> >> >           UNSPEC_GATHER))
> >> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >> >   "TARGET_AVX2"
> >> >   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> >> >
> >> > Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.
> >> 
> >> Typo: it should be operands[2] in the pattern below:
> >> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
> >>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >>         (unspec:VEC_GATHER_MODE
> >>           [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
> >>            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >>              [(unspec:P
> >>                 [(match_operand:P 3 "vsib_address_operand" "Tv")
> >>                  (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
> >>                  (match_operand:SI 6 "const1248_operand" "n")]
> >>                 UNSPEC_VSIBADDR)])
> >>            (mem:BLK (scratch))
> >>            (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
> >>           UNSPEC_GATHER))
> >>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >>   "TARGET_AVX2"
> >>   "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"
> >
> > It's somewhat of a mess.  When we expose
> >
> >   mask_gather_loadv8sfv8di
> >
> > for example, then with -mavx512f the vectorizer will generate a
> > vector mask for v8sf, which will be an AVX2-style mask.  But in
> > the end this will use gen_avx512f_gatherdiv16sf and require an
> > AVX512-style mask.
> 
> I think it would be OK/sensible to use the larger of the index or
> result vectors to determine the mask, if that helps.  There just
> wasn't any need to make a distinction for SVE, since there the
> mask type is determined entirely by the number of elements.

I don't think that would help - for a V4SFmode gather with V4DImode
indices the x86 ISA expects the AVX2 mask to be V4SImode, that is,
the mask corresponds to the data mode, not to the index mode.

The real issue is that we're asking for a mask mode just using
the data mode but not the instruction.  So with V8SFmode data
mode and -mavx512f -mno-avx512vl it appears we want an AVX2
mask mode but in reality the gather instruction we can use
uses EVEX.512 encoding when the index mode is V8DImode
and thus is available w/ -mavx512f.

So it appears that we'd need to pass another mode to the
get_mask_mode target hook that determines the instruction
encoding ...

I'm also not sure if this wouldn't open up a can of worms
with the mask computation stmts, which would not know which
context (maybe there are two) they are used in.  We're
fundamentally using two different vector sizes and ISA subsets
here :/  Maybe vinfo->vector_mode should fully determine whether
we're using EVEX or VEX encoding in the loop and thus we should
reject attempts to do VEX encoding vectorization in EVEX mode?
I suppose this is how you don't get a mix of SVE and ADVSIMD
vectorization in one loop?

But since VEX/EVEX use the same modes (but SVE/ADVSIMD do not)
this will get interesting.

Oh, and -mavx512bw would also be prone to use (different size again)
AVX2 vectors at the same time.  Maybe the "solution" is to simply
not expose the EVEX modes to the vectorizer when there's such
a possibility, thus only when all F, VL and BW are available?
At least when thinking about relaxing which vector sizes we
can deal with.

I've opened this can of worms with

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 97745a830a2..db938f79c9c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -3749,13 +3749,15 @@ vect_gather_scatter_fn_p (vec_info *vinfo, bool read_p, bool masked_p,
 
   for (;;)
     {
-      tree offset_vectype = get_vectype_for_scalar_type (vinfo, offset_type);
-      if (!offset_vectype)
-       return false;
-
+      tree offset_vectype
+       = get_related_vectype_for_scalar_type (vinfo->vector_mode, offset_type,
+                                              TYPE_VECTOR_SUBPARTS (vectype));

to even get the mixed size optabs to be considered.  If we were to
expose the real instructions instead we'd have
mask_gather_loadv16sfv8di but as said the semantics of such
optab entry would need to be defined.  That's the way the current
builtin_gather target hook code works, so it seems that would be
a working approach, side-stepping the mixed size problems.
Here we'd define that excess elements on either the data or the
index operand are ignored (on the output operand they are zeroed).
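
In scalar terms the semantics would be something like the following
sketch (the zero for masked-off lanes matches the zeroed merge operand
the expander feeds in; the function name is made up):

  void
  mask_gather_load_v16sf_v8di (float dst[16], const float *base,
                               const long long idx[8], const _Bool mask[8])
  {
    /* min (16, 8) = 8 lanes are active ...  */
    for (int i = 0; i < 8; ++i)
      dst[i] = mask[i] ? base[idx[i]] : 0.0f;
    /* ... and the excess output lanes are zeroed.  */
    for (int i = 8; i < 16; ++i)
      dst[i] = 0.0f;
  }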

Richard.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-19  8:45                           ` Richard Biener
@ 2021-08-19  9:12                             ` Richard Sandiford
  2021-08-19  9:22                               ` Richard Biener
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Sandiford @ 2021-08-19  9:12 UTC (permalink / raw)
  To: Richard Biener; +Cc: Hongtao Liu, Richard Biener, Hongtao Liu, GCC Patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 18 Aug 2021, Richard Sandiford wrote:
>> I think it would be OK/sensible to use the larger of the index or
>> result vectors to determine the mask, if that helps.  There just
>> wasn't any need to make a distinction for SVE, since there the
>> mask type is determined entirely by the number of elements.
>
> I don't think that would help - for a V4SFmode gather with V4DImode
> indices the x86 ISA expects the AVX2 mask to be V4SImode, that is,
> the mask corresponds to the data mode, not to the index mode.

Ah, OK.  So the widest type determines the ISA and then the data
type determines the mask type within that ISA?

> The real issue is that we're asking for a mask mode just using
> the data mode but not the instruction.  So with V8SFmode data
> mode and -mavx512f -mno-avx512vl it appears we want an AVX2
> mask mode but in reality the gather instruction we can use
> uses EVEX.512 encoding when the index mode is V8DImode
> and thus is available w/ -mavx512f.
>
> So it appears that we'd need to pass another mode to the
> get_mask_mode target hook that determines the instruction
> encoding ...
>
> I'm also not sure if this wouldn't open up a can of worms
> with the mask computation stmts, which would not know which
> context (maybe there are two) they are used in.  We're
> fundamentally using two different vector sizes and ISA subsets
> here :/

Yeah.

> Maybe vinfo->vector_mode should fully determine whether
> we're using EVEX or VEX encoding in the loop and thus we should
> reject attempts to do VEX encoding vectorization in EVEX mode?
> I suppose this is how you don't get a mix of SVE and ADVSIMD
> vectorization in one loop?

Yeah, that's right.  It means that for -msve-vector-bits=128
we effectively have two V16QIs: the “real” AdvSIMD V16QI and
the SVE VNx16QI.  I can't imagine that being popular for x86 :-)

> But since VEX/EVEX use the same modes (but SVE/ADVSIMD do not)
> this will get interesting.
>
> Oh, and -mavx512bw would also be prone to use (different size again)
> AVX2 vectors at the same time.  Maybe the "solution" is to simply
> not expose the EVEX modes to the vectorizer when there's such
> a possibility, thus only when all F, VL and BW are available?
> At least when thinking about relaxing the vector sizes we can
> deal with.
>
> I've opened this can of worms with
>
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index 97745a830a2..db938f79c9c 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -3749,13 +3749,15 @@ vect_gather_scatter_fn_p (vec_info *vinfo, bool read_p, bool masked_p,
>  
>    for (;;)
>      {
> -      tree offset_vectype = get_vectype_for_scalar_type (vinfo, offset_type);
> -      if (!offset_vectype)
> -       return false;
> -
> +      tree offset_vectype
> +       = get_related_vectype_for_scalar_type (vinfo->vector_mode, offset_type,
> +                                              TYPE_VECTOR_SUBPARTS (vectype));
>
> to even get the mixed size optabs to be considered.  If we were to
> expose the real instructions instead we'd have
> mask_gather_loadv16sfv8di but as said the semantics of such an
> optab entry would need to be defined.  That's the way the current
> builtin_gather target hook code works, so it seems that would be
> a working approach, side-stepping the mixed size problems.
> Here we'd define that excess elements on either the data or the
> index operand are ignored (on the output operand they are zeroed).

It seems a bit ugly to expose such a low-level target-specific detail
at the gimple level though.  Also, if we expose the v16sf at the gimple
level then we would need to construct a v16sf vector for scatter stores
and initialise all the elements (at least until we have a don't care
VEC_PERM_EXPR encoding).
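
(To make the initialisation problem concrete - a hypothetical
intrinsics sketch, assuming AVX512DQ for _mm512_insertf32x8, not
proposed code:)

  #include <immintrin.h>

  /* Widen the real v8sf payload to the v16sf a 512-bit scatter form
     would demand; the upper eight lanes are dead but must still be
     given some value (here zero) absent a don't-care encoding.  */
  static __m512
  pad_v8sf_to_v16sf (__m256 val)
  {
    return _mm512_insertf32x8 (_mm512_setzero_ps (), val, 0);
  }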

I'm not sure this would solve the problem with the mixture of mask
computation stmts that you mentioned above.  E.g. what if the input
to the v16sf scatter store above is a COND_EXPR that shares some mask
subexpressions with the scatter store mask?  This COND_EXPR would be a
v8sf VEC_COND_EXPR and so would presumably use the get_mask_mode
corresponding to v8sf rather than v16sf.  The mask computation stmts
would still need to cope with a mixture of AVX512 and AVX2 masks.
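
(A C loop that could end up in exactly that shape - a hypothetical
example, not from the thread's testcases:)

  /* The comparison feeds both a conditional select (a v8sf
     VEC_COND_EXPR, wanting an AVX2-style v8si vector mask) and the
     scatter-store mask (which a v16sf form would want as an AVX512
     kmask).  */
  float
  f (float *restrict out, const long long *restrict idx,
     const float *a, const float *b, const float *x, const float *y,
     int n)
  {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
      {
        float v = a[i] > b[i] ? x[i] : y[i]; /* VEC_COND_EXPR */
        sum += v;                            /* keeps the select live */
        if (a[i] > b[i])                     /* shared mask subexpression */
          out[idx[i]] = v;                   /* masked scatter store */
      }
    return sum;
  }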

Even if x86 doesn't support that combination yet (because it instead
requires all vector sizes to be equal), it seems like something that
is likely to be supported in future.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] move x86 to use gather/scatter internal functions
  2021-08-19  9:12                             ` Richard Sandiford
@ 2021-08-19  9:22                               ` Richard Biener
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Biener @ 2021-08-19  9:22 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Hongtao Liu, Richard Biener, Hongtao Liu, GCC Patches

On Thu, 19 Aug 2021, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 18 Aug 2021, Richard Sandiford wrote:
> >> I think it would be OK/sensible to use the larger of the index or
> >> result vectors to determine the mask, if that helps.  There just
> >> wasn't any need to make a distinction for SVE, since there the
> >> mask type is determined entirely by the number of elements.
> >
> > I don't think that would help - for a V4SFmode gather with V4DImode
> > indices the x86 ISA expects the AVX2 mask to be V4SImode, that is,
> > the mask corresponds to the data mode, not to the index mode.
> 
> Ah, OK.  So the widest type determines the ISA and then the data
> type determines the mask type within that ISA?

Yes.
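
For concreteness, the AVX2 intrinsics show this directly (an
illustrative sketch, nothing from the patch): gathering V4SF data
with V4DI indices is vgatherqps, whose mask operand matches the
data vector, not the index vector.

  #include <immintrin.h>

  /* The widest operand (the __m256i index) selects the VEX.256
     encoded instruction, but the mask is an __m128, i.e. it matches
     the V4SF data mode.  */
  static __m128
  gather_v4sf_v4di (const float *base, __m256i idx, __m128 mask)
  {
    return _mm256_mask_i64gather_ps (_mm_setzero_ps (), base, idx,
                                     mask, 4);
  }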

> > The real issue is that we're asking for a mask mode just using
> > the data mode but not the instruction.  So with V8SFmode data
> > mode and -mavx512f -mno-avx512vl it appears we want an AVX2
> > mask mode but in reality the gather instruction we can use
> > uses EVEX.512 encoding when the index mode is V8DImode
> > and thus is available w/ -mavx512f.
> >
> > So it appears that we'd need to pass another mode to the
> > get_mask_mode target hook that determines the instruction
> > encoding ...
> >
> > I'm also not sure if this wouldn't open up a can of worms
> > with the mask computation stmts, which would not know which
> > context (maybe there are two) they are used in.  We're
> > fundamentally using two different vector sizes and ISA subsets
> > here :/
> 
> Yeah.
> 
> > Maybe vinfo->vector_mode should fully determine whether
> > we're using EVEX or VEX encoding in the loop and thus we should
> > reject attempts to do VEX encoding vectorization in EVEX mode?
> > I suppose this is how you don't get a mix of SVE and ADVSIMD
> > vectorization in one loop?
> 
> Yeah, that's right.  It means that for -msve-vector-bits=128
> we effectively have two V16QIs: the “real” AdvSIMD V16QI and
> the SVE VNx16QI.  I can't imagine that being popular for x86 :-)
> 
> > But since VEX/EVEX use the same modes (but SVE/ADVSIMD do not)
> > this will get interesting.
> >
> > Oh, and -mavx512bw would also be prone to use (different size again)
> > AVX2 vectors at the same time.  Maybe the "solution" is to simply
> > not expose the EVEX modes to the vectorizer when there's such
> > a possibility, thus only when all F, VL and BW are available?
> > At least when thinking about relaxing the vector sizes we can
> > deal with.
> >
> > I've opened this can of worms with
> >
> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > index 97745a830a2..db938f79c9c 100644
> > --- a/gcc/tree-vect-data-refs.c
> > +++ b/gcc/tree-vect-data-refs.c
> > @@ -3749,13 +3749,15 @@ vect_gather_scatter_fn_p (vec_info *vinfo, bool read_p, bool masked_p,
> >  
> >    for (;;)
> >      {
> > -      tree offset_vectype = get_vectype_for_scalar_type (vinfo, offset_type);
> > -      if (!offset_vectype)
> > -       return false;
> > -
> > +      tree offset_vectype
> > +       = get_related_vectype_for_scalar_type (vinfo->vector_mode, offset_type,
> > +                                              TYPE_VECTOR_SUBPARTS (vectype));
> >
> > to even get the mixed size optabs to be considered.  If we were to
> > expose the real instructions instead we'd have
> > mask_gather_loadv16sfv8di but as said the semantics of such an
> > optab entry would need to be defined.  That's the way the current
> > builtin_gather target hook code works, so it seems that would be
> > a working approach, side-stepping the mixed size problems.
> > Here we'd define that excess elements on either the data or the
> > index operand are ignored (on the output operand they are zeroed).
> 
> It seems a bit ugly to expose such a low-level target-specific detail
> at the gimple level though.  Also, if we expose the v16sf at the gimple
> level then we would need to construct a v16sf vector for scatter stores
> and initialise all the elements (at least until we have a don't care
> VEC_PERM_EXPR encoding).

True - which is why I tried not to do this in the first place ...

> I'm not sure this would solve the problem with the mixture of mask
> computation stmts that you mentioned above.  E.g. what if the input
> to the v16sf scatter store above is a COND_EXPR that shares some mask
> subexpressions with the scatter store mask?  This COND_EXPR would be a
> v8sf VEC_COND_EXPR and so would presumably use the get_mask_mode
> corresponding to v8sf rather than v16sf.  The mask computation stmts
> would still need to cope with a mixture of AVX512 and AVX2 masks.

But the smaller data mode is only exposed because of the gather
optab - the index mode is large only because the rest of the loop
is vectorized with the larger size.  This is, I think, why the current
scheme using the builtin_gather target hook works - we expose the
larger data vector here.

> Even if x86 doesn't support that combination yet (because it instead
> requires all vector sizes to be equal), it seems like something that
> is likely to be supported in future.

Maybe, yes.

That said, I'm somewhat out of options here but I'll see if working
around the mask mode thing is possible in the gather-specific code.

At least when we ever try to relax the vector size constraint for
loop vectorization we'll have to find a more generic solution to
this problem... (passing an "ISA mode" to the get_mask_mode hook
does sound appealing, but whether that's enough remains to be seen).
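
(One possible shape for such a relaxed hook - a sketch only, not an
actual GCC interface; the extra parameter and its name are invented
here:)

  /* A variant of TARGET_VECTORIZE_GET_MASK_MODE that also receives
     the mode selecting the instruction encoding, e.g. V8SFmode data
     in a V8DImode-indexed EVEX.512 gather.  */
  opt_machine_mode
  ix86_get_mask_mode (machine_mode data_mode, machine_mode isa_mode);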

Richard.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-08-19  9:22 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-17 13:26 [PATCH] move x86 to use gather/scatter internal functions Richard Biener
2021-08-17 14:42 ` Richard Biener
2021-08-18  3:12   ` Hongtao Liu
2021-08-18  3:24   ` Hongtao Liu
2021-08-18  3:29     ` Hongtao Liu
2021-08-18  8:32       ` Richard Biener
2021-08-18  9:17         ` Hongtao Liu
2021-08-18  9:21           ` Richard Biener
2021-08-18  9:54             ` Richard Biener
2021-08-18 10:24               ` Hongtao Liu
2021-08-18 10:36                 ` Richard Biener
2021-08-18 10:28               ` Richard Biener
2021-08-18 11:30                 ` Hongtao Liu
2021-08-18 11:37                   ` Hongtao Liu
2021-08-18 11:38                     ` Hongtao Liu
2021-08-18 13:43                       ` Richard Biener
2021-08-18 14:43                         ` Richard Sandiford
2021-08-19  8:45                           ` Richard Biener
2021-08-19  9:12                             ` Richard Sandiford
2021-08-19  9:22                               ` Richard Biener
2021-08-18 11:31                 ` Richard Biener
