public inbox for gcc-patches@gcc.gnu.org
From: Richard Biener <richard.guenther@gmail.com>
To: Richard Biener <richard.guenther@gmail.com>,
	Tamar Christina <Tamar.Christina@arm.com>,
	 Richard Earnshaw <Richard.Earnshaw@arm.com>, nd <nd@arm.com>,
	 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	 Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
Date: Tue, 5 Jul 2022 09:41:16 +0200	[thread overview]
Message-ID: <CAFiYyc3akmsSQdoY9yaMu_hOLMQ54mK5Ft9pCGQAf-VbGAShDQ@mail.gmail.com> (raw)
In-Reply-To: <mptsfng3xdj.fsf@arm.com>

On Tue, Jul 5, 2022 at 8:08 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Jun 29, 2022 at 4:35 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Tue, Jun 28, 2022 at 5:54 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: Richard Biener <richard.guenther@gmail.com>
> >> >> > Sent: Monday, June 27, 2022 7:10 AM
> >> >> > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
> >> >> > <Richard.Earnshaw@arm.com>; nd <nd@arm.com>; gcc-
> >> >> > patches@gcc.gnu.org; Marcus Shawcroft <Marcus.Shawcroft@arm.com>
> >> >> > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for usdot
> >> >> >
> >> >> > On Mon, Jun 27, 2022 at 7:25 AM Tamar Christina via Gcc-patches <gcc-
> >> >> > patches@gcc.gnu.org> wrote:
> >> >> > >
> >> >> > > > -----Original Message-----
> >> >> > > > From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> > > > Sent: Thursday, June 16, 2022 7:54 PM
> >> >> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> >> > > > <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> > > > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> >> > <Kyrylo.Tkachov@arm.com>
> >> >> > > > Subject: Re: [PATCH 1/2]AArch64 Add fallback case using sdot for
> >> >> > > > usdot
> >> >> > > >
> >> >> > > > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > > > > Tamar Christina <tamar.christina@arm.com> writes:
> >> >> > > > >> Hi All,
> >> >> > > > >>
> >> >> > > > >> The usdot operation is common in video encoders and decoders,
> >> >> > > > >> including some of the most widely used ones.
> >> >> > > > >>
> >> >> > > > >> This patch adds a +dotprod version of the optab as a fallback for
> >> >> > > > >> when you do have sdot but not usdot available.
> >> >> > > > >>
> >> >> > > > >> The fallback works by adding a bias to the unsigned argument to
> >> >> > > > >> convert it to a signed value and then correcting for the bias later on.
> >> >> > > > >>
> >> >> > > > >> Essentially it relies on (x - 128)y + 128y == xy where x is
> >> >> > > > >> unsigned and y is signed (assuming both are 8-bit values).
> >> >> > > > >> Because the range of a signed byte only extends to 127, we split
> >> >> > > > >> the bias correction into:
> >> >> > > > >>
> >> >> > > > >>    (x - 128)y + 127y + y
> >> >> > > > >
> >> >> > > > > I bet you knew this question was coming, but: this technique isn't
> >> >> > > > > target-specific, so wouldn't it be better to handle it in
> >> >> > > > > tree-vect-patterns.cc instead?
> >> >> > >
> >> >> > > Ok, so after many hours of trying I don't know how to make this work.
> >> >> > > DOT_PROD_EXPR is a reduction, but emitting them as additional pattern
> >> >> > > statements doesn't work because they'll be marked as internal_def
> >> >> > > rather than reduction_def.  I tried marking the new vec_stmt_info that
> >> >> > > I create explicitly as reduction_def but this gets overwritten during analysis.
> >> >> > >
> >> >> > > I then looked into getting it as a vectorizable_operation, but this
> >> >> > > has the obvious problem that it no longer treats it as a reduction
> >> >> > > and so tries to decompose into hi/lo.
> >> >> > >
> >> >> > > I then looked into treating additional patterns from a reduction as
> >> >> > > reductions themselves, but this is obviously wrong as non-reduction
> >> >> > > statements also get marked as reductions.
> >> >> > >
> >> >> > > The conclusion is that I don't think the vectorizer allows additional
> >> >> > > reductions to be emitted from patterns.
> >> >> >
> >> >> > Indeed.  DOT_PROD is a weird beast: it doesn't define which lanes are
> >> >> > reduced to which, so it's only usable when the result is reduced to a
> >> >> > single lane.
> >> >> >
> >> >> > An SLP pattern might work if you use reduc-plus for the reduced lanes and
> >> >> > keep the multiply separate?
> >> >>
> >> >> Unfortunately I can't seem to get it to handle the reduction in SLP.  It seems to always
> >> >> use the non-SLP aware loop vectorizer here.  The suggested unroll factor is always 1 and
> >> >> even trying to force it gets it to bail out later, presumably because it's reducing into a
> >> >> scalar that's used outside the loop?
> >> >
> >> > Yes, it possibly needs 1-lane SLP support.
> >>
> >> As I mentioned to Tamar off-list, I feel like I've been wasting
> >> people's time recently by spewing out ideas that might or might not work
> >> (usually "not work"), so I wanted to get some confidence that the next
> >> suggestion made sense.  In the end I needed most of an implementation
> >> to do that, so it seemed easiest just to finish it off rather than post
> >> it in a half-complete state.  Sorry for the duplication. :-(
> >>
> >> The patch certainly isn't pretty, but I think it's the best we can
> >> do under the current infrastructure, and it should at least make
> >> the costs reasonably accurate.  (Actually, that said, we probably
> >> need to patch the reduction latency calculation in the aarch64
> >> vector code -- didn't think of that until now.)
> >>
> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  WDYT?
>
> Turned out I needed another change for this to fire on x86.  Previously
> the input type (half_type) had an arbitrary sign for mixed-sign dotprods,
> which was OK for the existing code, but meant that we could sometimes
> query for unsigned dotprod instead of signed dotprod when considering
> the fallback.  Fixed in the version below (which canonicalises on
> using the signed type).
>
> > Looks reasonable - does this end up in OKish code generation as well?
>
> Seems OK for aarch64.  The Advanced SIMD version of vect-reduc-dot-11.c is:
>
> .L7:
>         ldr     q2, [x1, x3]
>         ldr     q1, [x2, x3]
>         sdot    v0.4s, v1.16b, v3.16b
>         add     x3, x3, 16
>         sdot    v0.4s, v1.16b, v3.16b
>         add     v2.16b, v2.16b, v4.16b
>         sdot    v0.4s, v1.16b, v2.16b
>         cmp     x3, 48
>         bne     .L7
>
> and the SVE version is:
>
> .L7:
>         ld1b    z1.b, p0/z, [x2, x3]
>         ld1b    z2.b, p0/z, [x1, x3]
>         sel     z1.b, p0, z1.b, z4.b
>         add     x3, x3, x5
>         add     z2.b, z2.b, #128
>         sdot    z0.s, z1.b, z3.b
>         whilelo p0.b, w3, w4
>         sdot    z0.s, z1.b, z3.b
>         sdot    z0.s, z1.b, z2.b
>         b.any   .L7
>
> (with the extra SEL handling a final partial vector).
>
> On x86, for -mavx:
>
> int
> f (int res, unsigned short *restrict a, short *restrict b)
> {
>   for (int i = 0; i < 256; ++i)
>     res += a[i] * b[i];
>   return res;
> }
>
> previously generated:
>
> .L2:
>         vmovdqu (%rsi,%rax), %xmm1
>         vmovdqu (%rdx,%rax), %xmm0
>         addq    $16, %rax
>         vpmovsxwd       %xmm0, %xmm3
>         vpsrldq $8, %xmm0, %xmm0
>         vpmovzxwd       %xmm1, %xmm4
>         vpsrldq $8, %xmm1, %xmm1
>         vpmulld %xmm4, %xmm3, %xmm3
>         vpmovsxwd       %xmm0, %xmm0
>         vpmovzxwd       %xmm1, %xmm1
>         vpmulld %xmm1, %xmm0, %xmm0
>         vpaddd  %xmm2, %xmm3, %xmm2
>         vpaddd  %xmm2, %xmm0, %xmm2
>         cmpq    $512, %rax
>         jne     .L2
>
> whereas now it generates:
>
> .L2:
>         vpmaddwd        (%rdx,%rax), %xmm3, %xmm2
>         vpaddw  (%rsi,%rax), %xmm4, %xmm1
>         vpmaddwd        (%rdx,%rax), %xmm1, %xmm1
>         addq    $16, %rax
>         vpaddd  %xmm2, %xmm0, %xmm0
>         vpaddd  %xmm2, %xmm0, %xmm0
>         vpaddd  %xmm1, %xmm0, %xmm0
>         cmpq    $512, %rax
>         jne     .L2
>
> I don't know x86 well enough to be sure that's an improvement though.
> The length of the loop carry dependency has increased from 2 to 3
> VPADDDs.

I think that should be OK.

>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.

OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
>         * tree-vect-patterns.cc (vect_convert_input): Expect the input
>         type to be signed for optab_vector_mixed_sign.  Update the vectype
>         at the same time as type.
>         (vect_recog_dot_prod_pattern): Update accordingly.  If usdot isn't
>         available, try sdot instead.
>         * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): New function.
>         (vect_model_reduction_cost): Model the cost of implementing usdot
>         using sdot.
>         (vectorizable_reduction): Likewise.  Skip target support test
>         for lane reductions.
>         (vect_emulate_mixed_dot_prod): New function.
>         (vect_transform_reduction): Use it to emulate usdot via sdot.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-reduc-dot-9.c: Reduce target requirements
>         from i8mm to dotprod.
>         * gcc.dg/vect/vect-reduc-dot-10.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-11.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-12.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-13.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-14.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-15.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-16.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-17.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-18.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-19.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-20.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-21.c: Likewise.
>         * gcc.dg/vect/vect-reduc-dot-22.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c |   6 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c |   4 +-
>  gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c  |   6 +-
>  gcc/tree-vect-loop.cc                         | 160 ++++++++++++++++--
>  gcc/tree-vect-patterns.cc                     |  38 ++++-
>  16 files changed, 213 insertions(+), 61 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> index 7ce86965ea9..34e25ab7fb0 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-10.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> index 0f7cbbb87ef..3af8df54cf9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-11.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> index 08412614fc6..77ceef3643b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-12.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 unsigned
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> index 7ee0f45f642..d3c0c86f529 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-13.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> index 2de1434528b..86a5c85753c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-14.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 unsigned
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> index dc48f95a32b..25de0940a65 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-15.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> index aec62878936..4a1dec0677e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-16.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #define SIGNEDNESS_1 signed
>  #define SIGNEDNESS_2 signed
> @@ -10,4 +10,4 @@
>  #include "vect-reduc-dot-9.c"
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> index 38f86fe458a..90d21188b76 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-17.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> index 2e86ebe3c6c..81ecb158d29 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-18.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> index d00f24aae4c..cbcd4f120a5 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-19.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> index 17adbca83a0..e81ed1da5a4 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-20.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> index 6cc6a4f2e92..81ce5cdaffb 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-21.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> index e13d3d5c4da..b8c9d3ca53b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-22.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> index d1049c96bf1..e0b132f6b35 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-9.c
> @@ -1,6 +1,6 @@
>  /* { dg-require-effective-target vect_int } */
> -/* { dg-require-effective-target arm_v8_2a_i8mm_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> -/* { dg-add-options arm_v8_2a_i8mm }  */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
>
>  #include "tree-vect.h"
>
> @@ -50,4 +50,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-not "vect_recog_dot_prod_pattern: detected" "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_usdot_qi } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 78dfe8519aa..3a70c15b593 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4566,6 +4566,31 @@ have_whole_vector_shift (machine_mode mode)
>    return true;
>  }
>
> +/* Return true if (a) STMT_INFO is a DOT_PROD_EXPR reduction whose
> +   multiplication operands have differing signs and (b) we intend
> +   to emulate the operation using a series of signed DOT_PROD_EXPRs.
> +   See vect_emulate_mixed_dot_prod for the actual sequence used.  */
> +
> +static bool
> +vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
> +                                stmt_vec_info stmt_info)
> +{
> +  gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
> +  if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR)
> +    return false;
> +
> +  tree rhs1 = gimple_assign_rhs1 (assign);
> +  tree rhs2 = gimple_assign_rhs2 (assign);
> +  if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2)))
> +    return false;
> +
> +  stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
> +  gcc_assert (reduc_info->is_reduc_info);
> +  return !directly_supported_p (DOT_PROD_EXPR,
> +                               STMT_VINFO_REDUC_VECTYPE_IN (reduc_info),
> +                               optab_vector_mixed_sign);
> +}
> +
>  /* TODO: Close dependency between vect_model_*_cost and vectorizable_*
>     functions. Design better to avoid maintenance issues.  */
>
> @@ -4601,6 +4626,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -4628,11 +4655,20 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>      }
>    else
>      {
> -      /* Add in cost for initial definition.
> -        For cond reduction we have four vectors: initial index, step,
> -        initial result of the data reduction, initial value of the index
> -        reduction.  */
> -      int prologue_stmts = reduction_type == COND_REDUCTION ? 4 : 1;
> +      /* Add in the cost of the initial definitions.  */
> +      int prologue_stmts;
> +      if (reduction_type == COND_REDUCTION)
> +       /* For cond reductions we have four vectors: initial index, step,
> +          initial result of the data reduction, initial value of the index
> +          reduction.  */
> +       prologue_stmts = 4;
> +      else if (emulated_mixed_dot_prod)
> +       /* We need the initial reduction value and two invariants:
> +          one that contains the minimum signed value and one that
> +          contains half of its negative.  */
> +       prologue_stmts = 3;
> +      else
> +       prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>                                          scalar_to_vec, stmt_info, 0,
>                                          vect_prologue);
> @@ -6797,11 +6833,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR
>                             || op.code == WIDEN_SUM_EXPR
>                             || op.code == SAD_EXPR);
> -  enum optab_subtype optab_query_kind = optab_vector;
> -  if (op.code == DOT_PROD_EXPR
> -      && (TYPE_SIGN (TREE_TYPE (op.ops[0]))
> -         != TYPE_SIGN (TREE_TYPE (op.ops[1]))))
> -    optab_query_kind = optab_vector_mixed_sign;
>
>    if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type)
>        && !SCALAR_FLOAT_TYPE_P (op.type))
> @@ -7328,9 +7359,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>        /* 4. Supportable by target?  */
>        bool ok = true;
>
> -      /* 4.1. check support for the operation in the loop  */
> +      /* 4.1. check support for the operation in the loop
> +
> +        This isn't necessary for the lane reduction codes, since they
> +        can only be produced by pattern matching, and it's up to the
> +        pattern matcher to test for support.  The main reason for
> +        specifically skipping this step is to avoid rechecking whether
> +        mixed-sign dot-products can be implemented using signed
> +        dot-products.  */
>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> -      if (!directly_supported_p (op.code, vectype_in, optab_query_kind))
> +      if (!lane_reduc_code_p
> +         && !directly_supported_p (op.code, vectype_in))
>          {
>            if (dump_enabled_p ())
>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> @@ -7398,7 +7437,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>       vect_transform_reduction.  Otherwise this is costed by the
>       separate vectorizable_* routines.  */
>    if (single_defuse_cycle || lane_reduc_code_p)
> -    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> +    {
> +      int factor = 1;
> +      if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
> +       /* Three dot-products and a subtraction.  */
> +       factor = 4;
> +      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> +                       stmt_info, 0, vect_body);
> +    }
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
> @@ -7457,6 +7503,81 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    return true;
>  }
>
> +/* STMT_INFO is a dot-product reduction whose multiplication operands
> +   have different signs.  Emit a sequence to emulate the operation
> +   using a series of signed DOT_PROD_EXPRs and return the last
> +   statement generated.  VEC_DEST is the result of the vector operation
> +   and VOP lists its inputs.  */
> +
> +static gassign *
> +vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +                            gimple_stmt_iterator *gsi, tree vec_dest,
> +                            tree vop[3])
> +{
> +  tree wide_vectype = signed_type_for (TREE_TYPE (vec_dest));
> +  tree narrow_vectype = signed_type_for (TREE_TYPE (vop[0]));
> +  tree narrow_elttype = TREE_TYPE (narrow_vectype);
> +  gimple *new_stmt;
> +
> +  /* Make VOP[0] the unsigned operand and VOP[1] the signed operand.  */
> +  if (!TYPE_UNSIGNED (TREE_TYPE (vop[0])))
> +    std::swap (vop[0], vop[1]);
> +
> +  /* Convert all inputs to signed types.  */
> +  for (int i = 0; i < 3; ++i)
> +    if (TYPE_UNSIGNED (TREE_TYPE (vop[i])))
> +      {
> +       tree tmp = make_ssa_name (signed_type_for (TREE_TYPE (vop[i])));
> +       new_stmt = gimple_build_assign (tmp, NOP_EXPR, vop[i]);
> +       vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +       vop[i] = tmp;
> +      }
> +
> +  /* In the comments below we assume 8-bit inputs for simplicity,
> +     but the approach works for any full integer type.  */
> +
> +  /* Create a vector of -128.  */
> +  tree min_narrow_elttype = TYPE_MIN_VALUE (narrow_elttype);
> +  tree min_narrow = build_vector_from_val (narrow_vectype,
> +                                          min_narrow_elttype);
> +
> +  /* Create a vector of 64.  */
> +  auto half_wi = wi::lrshift (wi::to_wide (min_narrow_elttype), 1);
> +  tree half_narrow = wide_int_to_tree (narrow_elttype, half_wi);
> +  half_narrow = build_vector_from_val (narrow_vectype, half_narrow);
> +
> +  /* Emit: SUB_RES = VOP[0] - 128.  */
> +  tree sub_res = make_ssa_name (narrow_vectype);
> +  new_stmt = gimple_build_assign (sub_res, PLUS_EXPR, vop[0], min_narrow);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Emit:
> +
> +       STAGE1 = DOT_PROD_EXPR <VOP[1], 64, VOP[2]>;
> +       STAGE2 = DOT_PROD_EXPR <VOP[1], 64, STAGE1>;
> +       STAGE3 = DOT_PROD_EXPR <SUB_RES, VOP[1], STAGE2>;
> +
> +     on the basis that x * y == (x - 128) * y + 64 * y + 64 * y.
> +     Doing the two 64 * y steps first allows more time to compute x.  */
> +  tree stage1 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage1, DOT_PROD_EXPR,
> +                                 vop[1], half_narrow, vop[2]);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage2 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage2, DOT_PROD_EXPR,
> +                                 vop[1], half_narrow, stage1);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  tree stage3 = make_ssa_name (wide_vectype);
> +  new_stmt = gimple_build_assign (stage3, DOT_PROD_EXPR,
> +                                 sub_res, vop[1], stage2);
> +  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> +
> +  /* Convert STAGE3 to the reduction type.  */
> +  return gimple_build_assign (vec_dest, CONVERT_EXPR, stage3);
> +}
> +
>  /* Transform the definition stmt STMT_INFO of a reduction PHI backedge
>     value.  */
>
> @@ -7563,12 +7684,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                                         : &vec_oprnds2));
>      }
>
> +  bool emulated_mixed_dot_prod
> +    = vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
>    FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
>      {
>        gimple *new_stmt;
>        tree vop[3] = { def0, vec_oprnds1[i], NULL_TREE };
>        if (masked_loop_p && !mask_by_cond_expr)
>         {
> +         /* No conditional ifns have been defined for dot-product yet.  */
> +         gcc_assert (code != DOT_PROD_EXPR);
> +
>           /* Make sure that the reduction accumulator is vop[0].  */
>           if (reduc_index == 1)
>             {
> @@ -7597,8 +7723,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>               build_vect_cond_expr (code, vop, mask, gsi);
>             }
>
> -         new_stmt = gimple_build_assign (vec_dest, code,
> -                                         vop[0], vop[1], vop[2]);
> +         if (emulated_mixed_dot_prod)
> +           new_stmt = vect_emulate_mixed_dot_prod (loop_vinfo, stmt_info, gsi,
> +                                                   vec_dest, vop);
> +         else
> +           new_stmt = gimple_build_assign (vec_dest, code,
> +                                           vop[0], vop[1], vop[2]);
>           new_temp = make_ssa_name (vec_dest, new_stmt);
>           gimple_assign_set_lhs (new_stmt, new_temp);
>           vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 8f624863971..dfbfb71b3c6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -760,12 +760,16 @@ vect_convert_input (vec_info *vinfo, stmt_vec_info stmt_info, tree type,
>                     vect_unpromoted_value *unprom, tree vectype,
>                     enum optab_subtype subtype = optab_default)
>  {
> -
>    /* Update the type if the signs differ.  */
> -  if (subtype == optab_vector_mixed_sign
> -      && TYPE_SIGN (type) != TYPE_SIGN (TREE_TYPE (unprom->op)))
> -    type = build_nonstandard_integer_type (TYPE_PRECISION (type),
> -                                          TYPE_SIGN (unprom->type));
> +  if (subtype == optab_vector_mixed_sign)
> +    {
> +      gcc_assert (!TYPE_UNSIGNED (type));
> +      if (TYPE_UNSIGNED (TREE_TYPE (unprom->op)))
> +       {
> +         type = unsigned_type_for (type);
> +         vectype = unsigned_type_for (vectype);
> +       }
> +    }
>
>    /* Check for a no-op conversion.  */
>    if (types_compatible_p (type, TREE_TYPE (unprom->op)))
> @@ -1139,16 +1143,34 @@ vect_recog_dot_prod_pattern (vec_info *vinfo,
>       is signed; otherwise, the result has the same sign as the operands.  */
>    if (TYPE_PRECISION (unprom_mult.type) != TYPE_PRECISION (type)
>        && (subtype == optab_vector_mixed_sign
> -       ? TYPE_UNSIGNED (unprom_mult.type)
> -       : TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type)))
> +         ? TYPE_UNSIGNED (unprom_mult.type)
> +         : TYPE_SIGN (unprom_mult.type) != TYPE_SIGN (half_type)))
>      return NULL;
>
>    vect_pattern_detected ("vect_recog_dot_prod_pattern", last_stmt);
>
> +  /* If the inputs have mixed signs, canonicalize on using the signed
> +     input type for analysis.  This also helps when emulating mixed-sign
> +     operations using signed operations.  */
> +  if (subtype == optab_vector_mixed_sign)
> +    half_type = signed_type_for (half_type);
> +
>    tree half_vectype;
>    if (!vect_supportable_direct_optab_p (vinfo, type, DOT_PROD_EXPR, half_type,
>                                         type_out, &half_vectype, subtype))
> -    return NULL;
> +    {
> +      /* We can emulate a mixed-sign dot-product using a sequence of
> +        signed dot-products; see vect_emulate_mixed_dot_prod for details.  */
> +      if (subtype != optab_vector_mixed_sign
> +         || !vect_supportable_direct_optab_p (vinfo, signed_type_for (type),
> +                                              DOT_PROD_EXPR, half_type,
> +                                              type_out, &half_vectype,
> +                                              optab_vector))
> +       return NULL;
> +
> +      *type_out = signed_or_unsigned_type_for (TYPE_UNSIGNED (type),
> +                                              *type_out);
> +    }
>
>    /* Get the inputs in the appropriate types.  */
>    tree mult_oprnd[2];
> --
> 2.25.1
>


Thread overview: 12+ messages
2022-06-16 10:48 Tamar Christina
2022-06-16 10:49 ` [PATCH 2/2] Add SVE " Tamar Christina
2022-06-16 16:09 ` [PATCH 1/2]AArch64 Add " Richard Sandiford
2022-06-16 18:53   ` Richard Sandiford
2022-06-27  5:24     ` Tamar Christina
2022-06-27  6:09       ` Richard Biener
2022-06-28 15:54         ` Tamar Christina
2022-06-29  9:33           ` Richard Biener
2022-06-29 14:35             ` Richard Sandiford
2022-06-30  6:45               ` Richard Biener
2022-07-05  6:08                 ` Richard Sandiford
2022-07-05  7:41                   ` Richard Biener [this message]