* [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
@ 2021-05-31 13:55 Christophe Lyon
2021-06-02 18:19 ` Richard Sandiford
0 siblings, 1 reply; 7+ messages in thread
From: Christophe Lyon @ 2021-05-31 13:55 UTC (permalink / raw)
To: gcc-patches, richard.sandiford
This patch adds support for auto-vectorization of average value
computation using vhadd or vrhadd, for both MVE and Neon.
The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
vec-common.md, I'm not sure how to factorize them without introducing
an unspec iterator?
It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
Vectorization works with 8-bit and 16 bit input/output vectors, but
not with 32-bit ones because the vectorizer expects wider types
availability for the intermediate values, but int32_t + int32_t does
not involve wider types in the IR.
As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
vectorization with 64-bit vectors: we default to
-mvectorize-with-neon-quad, and attempts to use
-mvectorize-with-neon-double resulted in much worse code, which this
patch does not aim at improving.
2021-05-31 Christophe Lyon <christophe.lyon@linaro.org>
gcc/
* gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
(@mve_vrhaddq_<supf><mode): Likewise.
* gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
* config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
(avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
gcc/testsuite/
* gcc.target/arm/simd/mve-vhadd-1.c: New test.
* gcc.target/arm/simd/mve-vhadd-2.c: New test.
* gcc.target/arm/simd/neon-vhadd-1.c: New test.
* gcc.target/arm/simd/neon-vhadd-2.c: New test.
---
gcc/config/arm/mve.md | 4 +-
gcc/config/arm/neon.md | 2 +-
gcc/config/arm/vec-common.md | 60 ++++++++++++
.../gcc.target/arm/simd/mve-vhadd-1.c | 31 +++++++
.../gcc.target/arm/simd/mve-vhadd-2.c | 31 +++++++
.../gcc.target/arm/simd/neon-vhadd-1.c | 93 +++++++++++++++++++
.../gcc.target/arm/simd/neon-vhadd-2.c | 33 +++++++
7 files changed, 251 insertions(+), 3 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0bfa6a91d55..04aa612331a 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
;;
;; [vhaddq_s, vhaddq_u])
;;
-(define_insn "mve_vhaddq_<supf><mode>"
+(define_insn "@mve_vhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
@@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
;;
;; [vrhaddq_s, vrhaddq_u])
;;
-(define_insn "mve_vrhaddq_<supf><mode>"
+(define_insn "@mve_vrhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 077c62ffd20..18571d819eb 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
; vhadd and vrhadd.
-(define_insn "neon_v<r>hadd<sup><mode>"
+(define_insn "@neon_v<r>hadd<sup><mode>"
[(set (match_operand:VDQIW 0 "s_register_operand" "=w")
(unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
(match_operand:VDQIW 2 "s_register_operand" "w")]
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 80b273229f5..2779c1a8aaa 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
DONE;
})
+
+(define_expand "avg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "avg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
new file mode 100644
index 00000000000..40489ecc67d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = (a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* We cannot make use of the 32 bits version because the vectorizer needs an
+ extended precision for the intermediate computations. */
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
new file mode 100644
index 00000000000..9bf145f9457
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = (a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* We cannot make use of the 32 bits version because the vectorizer needs an
+ extended precision for the intermediate computations. */
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
new file mode 100644
index 00000000000..e141c2895a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
@@ -0,0 +1,93 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
+ we can vectorize using 128-bit vectors. */
+/* For instance with int16_t:
+ 8 iterations and -mvectorize-with-neon-quad, we generate:
+test_vhadd_s16:
+ vld1.16 {q8}, [r1]
+ vld1.16 {q9}, [r2]
+ vhadd.s16 q8, q8, q9
+ vst1.16 {q8}, [r0]
+ bx lr
+
+ 8 iterations and -mvectorize-with-neon-double, we generate:
+test_vhadd_s16:
+ vld1.16 {d24}, [r1]!
+ vld1.16 {d26}, [r2]!
+ vld1.16 {d22}, [r1]
+ vmovl.s16 q12, d24
+ vld1.16 {d20}, [r2]
+ vmovl.s16 q13, d26
+ vmovl.s16 q11, d22
+ vmovl.s16 q10, d20
+ vadd.i32 d28, d26, d24
+ vadd.i32 d24, d27, d25
+ vadd.i32 d25, d22, d20
+ vadd.i32 d20, d23, d21
+ vshr.s32 d18, d28, #1
+ vshr.s32 d19, d24, #1
+ vshr.s32 d16, d25, #1
+ vshr.s32 d17, d20, #1
+ vmovn.i32 d18, q9
+ vmovn.i32 d16, q8
+ vst1.16 {d18}, [r0]!
+ vst1.16 {d16}, [r0]
+ bx lr
+
+ Adding a cast to avoid integer promotion:
+ dest[i] = (int16_t)(a[i] + b[i]) >> 1
+
+ 8 iterations and -mvectorize-with-neon-quad, we generate:
+test_vhadd_s16:
+ vld1.16 {q8}, [r2]
+ vld1.16 {q9}, [r1]
+ vadd.i16 q8, q8, q9
+ vshr.s16 q8, q8, #1
+ vst1.16 {q8}, [r0]
+ bx lr
+
+ 8 iterations and -mvectorize-with-neon-double, we generate:
+test_vhadd_s16:
+ vld1.16 {d17}, [r1]!
+ vld1.16 {d19}, [r2]!
+ vld1.16 {d18}, [r1]
+ vld1.16 {d16}, [r2]
+ vadd.i16 d17, d17, d19
+ vadd.i16 d16, d16, d18
+ vshr.s16 d17, d17, #1
+ vshr.s16 d16, d16, #1
+ vst1.16 {d17}, [r0]!
+ vst1.16 {d16}, [r0]
+ bx lr
+
+ */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = (a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* We cannot make use of the 32 bits version because the vectorizer needs an
+ extended precision for the intermediate computations. */
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
new file mode 100644
index 00000000000..7938d89a77c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
+ we can vectorize using 128-bit vectors. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = (a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* We cannot make use of the 32 bits version because the vectorizer needs an
+ extended precision for the intermediate computations. */
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
--
2.25.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-05-31 13:55 [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd Christophe Lyon
@ 2021-06-02 18:19 ` Richard Sandiford
2021-06-04 12:57 ` Christophe Lyon
0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2021-06-02 18:19 UTC (permalink / raw)
To: Christophe Lyon; +Cc: gcc-patches
Christophe Lyon <christophe.lyon@linaro.org> writes:
> This patch adds support for auto-vectorization of average value
> computation using vhadd or vrhadd, for both MVE and Neon.
>
> The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> vec-common.md, I'm not sure how to factorize them without introducing
> an unspec iterator?
Yeah, an int iterator would be one way, but I'm not sure it would
make things better given the differences in how Neon and MVE handle
their unspecs.
> It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>
> Vectorization works with 8-bit and 16 bit input/output vectors, but
> not with 32-bit ones because the vectorizer expects wider types
> availability for the intermediate values, but int32_t + int32_t does
> not involve wider types in the IR.
Right. Like you say, it's only valid to use V(R)HADD if, in the source
code, the addition and shift have a wider precision than the operands.
That happens naturally for 8-bit and 16-bit operands, since C arithmetic
promotes them to "int" first. But for 32-bit operands, the C code needs
to do the addition and shift in 64 bits. Doing them in 64 bits should
be fine for narrower operands too.
So:
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> new file mode 100644
> index 00000000000..40489ecc67d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-add-options arm_v8_1m_mve } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> + int i; \
> + for (i=0; i < (128 / BITS); i++) { \
> + dest[i] = (a[i] OP b[i]) >> 1; \
> + } \
> +}
> +
…it should work if you make this "((int64_t) a[i] OP b[i]) >> 1".
> As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> vectorization with 64-bit vectors: we default to
> -mvectorize-with-neon-quad, and attempts to use
> -mvectorize-with-neon-double resulted in much worse code, which this
> patch does not aim at improving.
I guess this is because the MVE_2 mode iterators only include 128-bit types.
Leaving Neon double as future work sounds good though.
And yeah, the code for V(R)HADD-equivalent operations is much worse when
V(R)HADD isn't available, since the compiler really does need to double
the precision of the operands, do double-precision addition,
do double-precision shifts, and then truncate back. So this looks
like the expected behaviour.
Thanks,
Richard
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-06-02 18:19 ` Richard Sandiford
@ 2021-06-04 12:57 ` Christophe Lyon
2021-06-08 11:50 ` Richard Sandiford
0 siblings, 1 reply; 7+ messages in thread
From: Christophe Lyon @ 2021-06-04 12:57 UTC (permalink / raw)
To: Christophe Lyon, gcc Patches, Richard Sandiford
[-- Attachment #1: Type: text/plain, Size: 3580 bytes --]
On Wed, 2 Jun 2021 at 20:19, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > This patch adds support for auto-vectorization of average value
> > computation using vhadd or vrhadd, for both MVE and Neon.
> >
> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> > vec-common.md, I'm not sure how to factorize them without introducing
> > an unspec iterator?
>
> Yeah, an int iterator would be one way, but I'm not sure it would
> make things better given the differences in how Neon and MVE handle
> their unspecs.
>
> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
> >
> > Vectorization works with 8-bit and 16 bit input/output vectors, but
> > not with 32-bit ones because the vectorizer expects wider types
> > availability for the intermediate values, but int32_t + int32_t does
> > not involve wider types in the IR.
>
> Right. Like you say, it's only valid to use V(R)HADD if, in the source
> code, the addition and shift have a wider precision than the operands.
> That happens naturally for 8-bit and 16-bit operands, since C arithmetic
> promotes them to "int" first. But for 32-bit operands, the C code needs
> to do the addition and shift in 64 bits. Doing them in 64 bits should
> be fine for narrower operands too.
>
> So:
>
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> > new file mode 100644
> > index 00000000000..40489ecc67d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> > @@ -0,0 +1,31 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> > +/* { dg-add-options arm_v8_1m_mve } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +#include <stdint.h>
> > +
> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> > + int i; \
> > + for (i=0; i < (128 / BITS); i++) { \
> > + dest[i] = (a[i] OP b[i]) >> 1; \
> > + } \
> > +}
> > +
>
> …it should work if you make this "((int64_t) a[i] OP b[i]) >> 1".
Indeed. However, this may not be obvious for end-users :-(
I've updated my patch as attached: added the (int64_t) cast and
removed the xfail clauses.
OK for trunk?
Thanks,
Christophe
>
> > As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> > vectorization with 64-bit vectors: we default to
> > -mvectorize-with-neon-quad, and attempts to use
> > -mvectorize-with-neon-double resulted in much worse code, which this
> > patch does not aim at improving.
>
> I guess this is because the MVE_2 mode iterators only include 128-bit types.
> Leaving Neon double as future work sounds good though.
Note that I am focusing on MVE enablement at the moment.
> And yeah, the code for V(R)HADD-equivalent operations is much worse when
> V(R)HADD isn't available, since the compiler really does need to double
> the precision of the operands, do double-precision addition,
> do double-precision shifts, and then truncate back. So this looks
> like the expected behaviour.
>
> Thanks,
> Richard
[-- Attachment #2: v2-0001-arm-Auto-vectorization-for-MVE-and-Neon-vhadd-vrh.patch --]
[-- Type: text/x-patch, Size: 14323 bytes --]
From 493693b5c2f4e5fee7408062785930f723f2bd85 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Thu, 27 May 2021 20:11:28 +0000
Subject: [PATCH v2] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
This patch adds support for auto-vectorization of average value
computation using vhadd or vrhadd, for both MVE and Neon.
The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
vec-common.md, I'm not sure how to factorize them without introducing
an unspec iterator?
It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
Vectorization works with 8-bit and 16 bit input/output vectors, but
not with 32-bit ones because the vectorizer expects wider types
availability for the intermediate values, but int32_t + int32_t does
not involve wider types in the IR.
As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
vectorization with 64-bit vectors: we default to
-mvectorize-with-neon-quad, and attempts to use
-mvectorize-with-neon-double resulted in much worse code, which this
patch does not aim at improving.
2021-05-31 Christophe Lyon <christophe.lyon@linaro.org>
gcc/
* gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
(@mve_vrhaddq_<supf><mode): Likewise.
* gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
* config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
(avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
gcc/testsuite/
* gcc.target/arm/simd/mve-vhadd-1.c: New test.
* gcc.target/arm/simd/mve-vhadd-2.c: New test.
* gcc.target/arm/simd/neon-vhadd-1.c: New test.
* gcc.target/arm/simd/neon-vhadd-2.c: New test.
---
gcc/config/arm/mve.md | 4 +-
gcc/config/arm/neon.md | 2 +-
gcc/config/arm/vec-common.md | 60 ++++++++++++
.../gcc.target/arm/simd/mve-vhadd-1.c | 31 +++++++
.../gcc.target/arm/simd/mve-vhadd-2.c | 33 +++++++
.../gcc.target/arm/simd/neon-vhadd-1.c | 93 +++++++++++++++++++
.../gcc.target/arm/simd/neon-vhadd-2.c | 33 +++++++
7 files changed, 253 insertions(+), 3 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0bfa6a91d55..04aa612331a 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
;;
;; [vhaddq_s, vhaddq_u])
;;
-(define_insn "mve_vhaddq_<supf><mode>"
+(define_insn "@mve_vhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
@@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
;;
;; [vrhaddq_s, vrhaddq_u])
;;
-(define_insn "mve_vrhaddq_<supf><mode>"
+(define_insn "@mve_vrhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 077c62ffd20..18571d819eb 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
; vhadd and vrhadd.
-(define_insn "neon_v<r>hadd<sup><mode>"
+(define_insn "@neon_v<r>hadd<sup><mode>"
[(set (match_operand:VDQIW 0 "s_register_operand" "=w")
(unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
(match_operand:VDQIW 2 "s_register_operand" "w")]
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 80b273229f5..2779c1a8aaa 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
DONE;
})
+
+(define_expand "avg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "avg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
new file mode 100644
index 00000000000..19d5f5aa44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
new file mode 100644
index 00000000000..b19d4253b54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* We cannot make use of the 32 bits version because the vectorizer needs an
+ extended precision for the intermediate computations. */
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
new file mode 100644
index 00000000000..cd00ca81c6d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
@@ -0,0 +1,93 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
+ we can vectorize using 128-bit vectors. */
+/* For instance with int16_t:
+ 8 iterations and -mvectorize-with-neon-quad, we generate:
+test_vhadd_s16:
+ vld1.16 {q8}, [r1]
+ vld1.16 {q9}, [r2]
+ vhadd.s16 q8, q8, q9
+ vst1.16 {q8}, [r0]
+ bx lr
+
+ 8 iterations and -mvectorize-with-neon-double, we generate:
+test_vhadd_s16:
+ vld1.16 {d24}, [r1]!
+ vld1.16 {d26}, [r2]!
+ vld1.16 {d22}, [r1]
+ vmovl.s16 q12, d24
+ vld1.16 {d20}, [r2]
+ vmovl.s16 q13, d26
+ vmovl.s16 q11, d22
+ vmovl.s16 q10, d20
+ vadd.i32 d28, d26, d24
+ vadd.i32 d24, d27, d25
+ vadd.i32 d25, d22, d20
+ vadd.i32 d20, d23, d21
+ vshr.s32 d18, d28, #1
+ vshr.s32 d19, d24, #1
+ vshr.s32 d16, d25, #1
+ vshr.s32 d17, d20, #1
+ vmovn.i32 d18, q9
+ vmovn.i32 d16, q8
+ vst1.16 {d18}, [r0]!
+ vst1.16 {d16}, [r0]
+ bx lr
+
+ Adding a cast to avoid integer promotion:
+ dest[i] = (int16_t)(a[i] + b[i]) >> 1
+
+ 8 iterations and -mvectorize-with-neon-quad, we generate:
+test_vhadd_s16:
+ vld1.16 {q8}, [r2]
+ vld1.16 {q9}, [r1]
+ vadd.i16 q8, q8, q9
+ vshr.s16 q8, q8, #1
+ vst1.16 {q8}, [r0]
+ bx lr
+
+ 8 iterations and -mvectorize-with-neon-double, we generate:
+test_vhadd_s16:
+ vld1.16 {d17}, [r1]!
+ vld1.16 {d19}, [r2]!
+ vld1.16 {d18}, [r1]
+ vld1.16 {d16}, [r2]
+ vadd.i16 d17, d17, d19
+ vadd.i16 d16, d16, d18
+ vshr.s16 d17, d17, #1
+ vshr.s16 d16, d16, #1
+ vst1.16 {d17}, [r0]!
+ vst1.16 {d16}, [r0]
+ bx lr
+
+ */
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
new file mode 100644
index 00000000000..f2692542f9b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
+ we can vectorize using 128-bit vectors. */
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
--
2.25.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-06-04 12:57 ` Christophe Lyon
@ 2021-06-08 11:50 ` Richard Sandiford
2021-06-09 15:07 ` Christophe Lyon
0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2021-06-08 11:50 UTC (permalink / raw)
To: Christophe Lyon; +Cc: gcc Patches
Christophe Lyon <christophe.lyon@linaro.org> writes:
> On Wed, 2 Jun 2021 at 20:19, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Christophe Lyon <christophe.lyon@linaro.org> writes:
>> > This patch adds support for auto-vectorization of average value
>> > computation using vhadd or vrhadd, for both MVE and Neon.
>> >
>> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
>> > vec-common.md, I'm not sure how to factorize them without introducing
>> > an unspec iterator?
>>
>> Yeah, an int iterator would be one way, but I'm not sure it would
>> make things better given the differences in how Neon and MVE handle
>> their unspecs.
>>
>> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>> >
>> > Vectorization works with 8-bit and 16 bit input/output vectors, but
>> > not with 32-bit ones because the vectorizer expects wider types
>> > availability for the intermediate values, but int32_t + int32_t does
>> > not involve wider types in the IR.
>>
>> Right. Like you say, it's only valid to use V(R)HADD if, in the source
>> code, the addition and shift have a wider precision than the operands.
>> That happens naturally for 8-bit and 16-bit operands, since C arithmetic
>> promotes them to "int" first. But for 32-bit operands, the C code needs
>> to do the addition and shift in 64 bits. Doing them in 64 bits should
>> be fine for narrower operands too.
>>
>> So:
>>
>> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
>> > new file mode 100644
>> > index 00000000000..40489ecc67d
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
>> > @@ -0,0 +1,31 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
>> > +/* { dg-add-options arm_v8_1m_mve } */
>> > +/* { dg-additional-options "-O3" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
>> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
>> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
>> > + int i; \
>> > + for (i=0; i < (128 / BITS); i++) { \
>> > + dest[i] = (a[i] OP b[i]) >> 1; \
>> > + } \
>> > +}
>> > +
>>
>> …it should work if you make this "((int64_t) a[i] OP b[i]) >> 1".
>
> Indeed. However, this may not be obvious for end-users :-(
>
> I've updated my patch as attached: added the (int64_t) cast and
> removed the xfail clauses.
>
> OK for trunk?
>
> Thanks,
>
> Christophe
>
>>
>> > As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
>> > vectorization with 64-bit vectors: we default to
>> > -mvectorize-with-neon-quad, and attempts to use
>> > -mvectorize-with-neon-double resulted in much worse code, which this
>> > patch does not aim at improving.
>>
>> I guess this is because the MVE_2 mode iterators only include 128-bit types.
>> Leaving Neon double as future work sounds good though.
> Note that I am focusing on MVE enablement at the moment.
Right. I meant “possible future work by someone somewhere”. :-)
>> And yeah, the code for V(R)HADD-equivalent operations is much worse when
>> V(R)HADD isn't available, since the compiler really does need to double
>> the precision of the operands, do double-precision addition,
>> do double-precision shifts, and then truncate back. So this looks
>> like the expected behaviour.
>>
>> Thanks,
>> Richard
>
> From 493693b5c2f4e5fee7408062785930f723f2bd85 Mon Sep 17 00:00:00 2001
> From: Christophe Lyon <christophe.lyon@linaro.org>
> Date: Thu, 27 May 2021 20:11:28 +0000
> Subject: [PATCH v2] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
>
> This patch adds support for auto-vectorization of average value
> computation using vhadd or vrhadd, for both MVE and Neon.
>
> The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> vec-common.md, I'm not sure how to factorize them without introducing
> an unspec iterator?
>
> It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>
> Vectorization works with 8-bit and 16 bit input/output vectors, but
> not with 32-bit ones because the vectorizer expects wider types
> availability for the intermediate values, but int32_t + int32_t does
> not involve wider types in the IR.
>
> As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> vectorization with 64-bit vectors: we default to
> -mvectorize-with-neon-quad, and attempts to use
> -mvectorize-with-neon-double resulted in much worse code, which this
> patch does not aim at improving.
The above needs updating.
> 2021-05-31 Christophe Lyon <christophe.lyon@linaro.org>
>
> gcc/
> * gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
> (@mve_vrhaddq_<supf><mode): Likewise.
> * gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
> * config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
> (avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
>
> gcc/testsuite/
> * gcc.target/arm/simd/mve-vhadd-1.c: New test.
> * gcc.target/arm/simd/mve-vhadd-2.c: New test.
> * gcc.target/arm/simd/neon-vhadd-1.c: New test.
> * gcc.target/arm/simd/neon-vhadd-2.c: New test.
> ---
> gcc/config/arm/mve.md | 4 +-
> gcc/config/arm/neon.md | 2 +-
> gcc/config/arm/vec-common.md | 60 ++++++++++++
> .../gcc.target/arm/simd/mve-vhadd-1.c | 31 +++++++
> .../gcc.target/arm/simd/mve-vhadd-2.c | 33 +++++++
> .../gcc.target/arm/simd/neon-vhadd-1.c | 93 +++++++++++++++++++
> .../gcc.target/arm/simd/neon-vhadd-2.c | 33 +++++++
> 7 files changed, 253 insertions(+), 3 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 0bfa6a91d55..04aa612331a 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
> ;;
> ;; [vhaddq_s, vhaddq_u])
> ;;
> -(define_insn "mve_vhaddq_<supf><mode>"
> +(define_insn "@mve_vhaddq_<supf><mode>"
> [
> (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> @@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
> ;;
> ;; [vrhaddq_s, vrhaddq_u])
> ;;
> -(define_insn "mve_vrhaddq_<supf><mode>"
> +(define_insn "@mve_vrhaddq_<supf><mode>"
> [
> (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 077c62ffd20..18571d819eb 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
>
> ; vhadd and vrhadd.
>
> -(define_insn "neon_v<r>hadd<sup><mode>"
> +(define_insn "@neon_v<r>hadd<sup><mode>"
> [(set (match_operand:VDQIW 0 "s_register_operand" "=w")
> (unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
> (match_operand:VDQIW 2 "s_register_operand" "w")]
> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 80b273229f5..2779c1a8aaa 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
>
> DONE;
> })
> +
> +(define_expand "avg<mode>3_floor"
> + [(match_operand:MVE_2 0 "s_register_operand")
> + (match_operand:MVE_2 1 "s_register_operand")
> + (match_operand:MVE_2 2 "s_register_operand")]
> + "ARM_HAVE_<MODE>_ARITH"
> +{
> + if (TARGET_HAVE_MVE)
> + emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + else
> + emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + DONE;
> +})
> +
> +(define_expand "uavg<mode>3_floor"
> + [(match_operand:MVE_2 0 "s_register_operand")
> + (match_operand:MVE_2 1 "s_register_operand")
> + (match_operand:MVE_2 2 "s_register_operand")]
> + "ARM_HAVE_<MODE>_ARITH"
> +{
> + if (TARGET_HAVE_MVE)
> + emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + else
> + emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + DONE;
> +})
> +
> +(define_expand "avg<mode>3_ceil"
> + [(match_operand:MVE_2 0 "s_register_operand")
> + (match_operand:MVE_2 1 "s_register_operand")
> + (match_operand:MVE_2 2 "s_register_operand")]
> + "ARM_HAVE_<MODE>_ARITH"
> +{
> + if (TARGET_HAVE_MVE)
> + emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + else
> + emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + DONE;
> +})
> +
> +(define_expand "uavg<mode>3_ceil"
> + [(match_operand:MVE_2 0 "s_register_operand")
> + (match_operand:MVE_2 1 "s_register_operand")
> + (match_operand:MVE_2 2 "s_register_operand")]
> + "ARM_HAVE_<MODE>_ARITH"
> +{
> + if (TARGET_HAVE_MVE)
> + emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + else
> + emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
> + operands[0], operands[1], operands[2]));
> + DONE;
> +})
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> new file mode 100644
> index 00000000000..19d5f5aa44f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-add-options arm_v8_1m_mve } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> + inputs. */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> + int i; \
> + for (i=0; i < (128 / BITS); i++) { \
> + dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
> + } \
> +}
> +
> +FUNC(s, int, 32, +, vhadd)
> +FUNC(u, uint, 32, +, vhadd)
> +FUNC(s, int, 16, +, vhadd)
> +FUNC(u, uint, 16, +, vhadd)
> +FUNC(s, int, 8, +, vhadd)
> +FUNC(u, uint, 8, +, vhadd)
> +
> +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> new file mode 100644
> index 00000000000..b19d4253b54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-add-options arm_v8_1m_mve } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> + inputs. */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> + int i; \
> + for (i=0; i < (128 / BITS); i++) { \
> + dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
> + } \
> +}
> +
> +FUNC(s, int, 32, +, vrhadd)
> +FUNC(u, uint, 32, +, vrhadd)
> +FUNC(s, int, 16, +, vrhadd)
> +FUNC(u, uint, 16, +, vrhadd)
> +FUNC(s, int, 8, +, vrhadd)
> +FUNC(u, uint, 8, +, vrhadd)
> +
> +/* We cannot make use of the 32 bits version because the vectorizer needs an
> + extended precision for the intermediate computations. */
This comment no longer applies.
> +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> new file mode 100644
> index 00000000000..cd00ca81c6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> @@ -0,0 +1,93 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-add-options arm_neon } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
> + we can vectorize using 128-bit vectors. */
> +/* For instance with int16_t:
> + 8 iterations and -mvectorize-with-neon-quad, we generate:
> +test_vhadd_s16:
> + vld1.16 {q8}, [r1]
> + vld1.16 {q9}, [r2]
> + vhadd.s16 q8, q8, q9
> + vst1.16 {q8}, [r0]
> + bx lr
> +
> + 8 iterations and -mvectorize-with-neon-double, we generate:
> +test_vhadd_s16:
> + vld1.16 {d24}, [r1]!
> + vld1.16 {d26}, [r2]!
> + vld1.16 {d22}, [r1]
> + vmovl.s16 q12, d24
> + vld1.16 {d20}, [r2]
> + vmovl.s16 q13, d26
> + vmovl.s16 q11, d22
> + vmovl.s16 q10, d20
> + vadd.i32 d28, d26, d24
> + vadd.i32 d24, d27, d25
> + vadd.i32 d25, d22, d20
> + vadd.i32 d20, d23, d21
> + vshr.s32 d18, d28, #1
> + vshr.s32 d19, d24, #1
> + vshr.s32 d16, d25, #1
> + vshr.s32 d17, d20, #1
> + vmovn.i32 d18, q9
> + vmovn.i32 d16, q8
> + vst1.16 {d18}, [r0]!
> + vst1.16 {d16}, [r0]
> + bx lr
> +
> + Adding a cast to avoid integer promotion:
> + dest[i] = (int16_t)(a[i] + b[i]) >> 1
> +
> + 8 iterations and -mvectorize-with-neon-quad, we generate:
> +test_vhadd_s16:
> + vld1.16 {q8}, [r2]
> + vld1.16 {q9}, [r1]
> + vadd.i16 q8, q8, q9
> + vshr.s16 q8, q8, #1
> + vst1.16 {q8}, [r0]
> + bx lr
> +
> + 8 iterations and -mvectorize-with-neon-double, we generate:
> +test_vhadd_s16:
> + vld1.16 {d17}, [r1]!
> + vld1.16 {d19}, [r2]!
> + vld1.16 {d18}, [r1]
> + vld1.16 {d16}, [r2]
> + vadd.i16 d17, d17, d19
> + vadd.i16 d16, d16, d18
> + vshr.s16 d17, d17, #1
> + vshr.s16 d16, d16, #1
> + vst1.16 {d17}, [r0]!
> + vst1.16 {d16}, [r0]
> + bx lr
> +
> + */
I think we should remove this and say that, at the moment, we've only
implemented the optabs for 128-bit vectors, so that's all that we
test here.
OK with those changes, thanks.
Richard
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> + inputs. */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> + int i; \
> + for (i=0; i < (128 / BITS); i++) { \
> + dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
> + } \
> +}
> +
> +FUNC(s, int, 32, +, vhadd)
> +FUNC(u, uint, 32, +, vhadd)
> +FUNC(s, int, 16, +, vhadd)
> +FUNC(u, uint, 16, +, vhadd)
> +FUNC(s, int, 8, +, vhadd)
> +FUNC(u, uint, 8, +, vhadd)
> +
> +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> new file mode 100644
> index 00000000000..f2692542f9b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-add-options arm_neon } */
> +/* { dg-additional-options "-O3" } */
> +
> +#include <stdint.h>
> +
> +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
> + we can vectorize using 128-bit vectors. */
> +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> + inputs. */
> +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> + int i; \
> + for (i=0; i < (128 / BITS); i++) { \
> + dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
> + } \
> +}
> +
> +FUNC(s, int, 32, +, vrhadd)
> +FUNC(u, uint, 32, +, vrhadd)
> +FUNC(s, int, 16, +, vrhadd)
> +FUNC(u, uint, 16, +, vrhadd)
> +FUNC(s, int, 8, +, vrhadd)
> +FUNC(u, uint, 8, +, vrhadd)
> +
> +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-06-08 11:50 ` Richard Sandiford
@ 2021-06-09 15:07 ` Christophe Lyon
2021-06-09 15:39 ` Richard Sandiford
0 siblings, 1 reply; 7+ messages in thread
From: Christophe Lyon @ 2021-06-09 15:07 UTC (permalink / raw)
To: Christophe Lyon, gcc Patches, Richard Sandiford
[-- Attachment #1: Type: text/plain, Size: 21745 bytes --]
On Tue, 8 Jun 2021 at 13:50, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > On Wed, 2 Jun 2021 at 20:19, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Christophe Lyon <christophe.lyon@linaro.org> writes:
> >> > This patch adds support for auto-vectorization of average value
> >> > computation using vhadd or vrhadd, for both MVE and Neon.
> >> >
> >> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> >> > vec-common.md, I'm not sure how to factorize them without introducing
> >> > an unspec iterator?
> >>
> >> Yeah, an int iterator would be one way, but I'm not sure it would
> >> make things better given the differences in how Neon and MVE handle
> >> their unspecs.
> >>
> >> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
> >> >
> >> > Vectorization works with 8-bit and 16 bit input/output vectors, but
> >> > not with 32-bit ones because the vectorizer expects wider types
> >> > availability for the intermediate values, but int32_t + int32_t does
> >> > not involve wider types in the IR.
> >>
> >> Right. Like you say, it's only valid to use V(R)HADD if, in the source
> >> code, the addition and shift have a wider precision than the operands.
> >> That happens naturally for 8-bit and 16-bit operands, since C arithmetic
> >> promotes them to "int" first. But for 32-bit operands, the C code needs
> >> to do the addition and shift in 64 bits. Doing them in 64 bits should
> >> be fine for narrower operands too.
> >>
> >> So:
> >>
> >> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> >> > new file mode 100644
> >> > index 00000000000..40489ecc67d
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> >> > @@ -0,0 +1,31 @@
> >> > +/* { dg-do compile } */
> >> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> >> > +/* { dg-add-options arm_v8_1m_mve } */
> >> > +/* { dg-additional-options "-O3" } */
> >> > +
> >> > +#include <stdint.h>
> >> > +
> >> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> >> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> >> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> >> > + int i; \
> >> > + for (i=0; i < (128 / BITS); i++) { \
> >> > + dest[i] = (a[i] OP b[i]) >> 1; \
> >> > + } \
> >> > +}
> >> > +
> >>
> >> …it should work if you make this "((int64_t) a[i] OP b[i]) >> 1".
> >
> > Indeed. However, this may not be obvious for end-users :-(
> >
> > I've updated my patch as attached: added the (int64_t) cast and
> > removed the xfail clauses.
> >
> > OK for trunk?
> >
> > Thanks,
> >
> > Christophe
> >
> >>
> >> > As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> >> > vectorization with 64-bit vectors: we default to
> >> > -mvectorize-with-neon-quad, and attempts to use
> >> > -mvectorize-with-neon-double resulted in much worse code, which this
> >> > patch does not aim at improving.
> >>
> >> I guess this is because the MVE_2 mode iterators only include 128-bit types.
> >> Leaving Neon double as future work sounds good though.
> > Note that I am focusing on MVE enablement at the moment.
>
> Right. I meant “possible future work by someone somewhere”. :-)
>
> >> And yeah, the code for V(R)HADD-equivalent operations is much worse when
> >> V(R)HADD isn't available, since the compiler really does need to double
> >> the precision of the operands, do double-precision addition,
> >> do double-precision shifts, and then truncate back. So this looks
> >> like the expected behaviour.
> >>
> >> Thanks,
> >> Richard
> >
> > From 493693b5c2f4e5fee7408062785930f723f2bd85 Mon Sep 17 00:00:00 2001
> > From: Christophe Lyon <christophe.lyon@linaro.org>
> > Date: Thu, 27 May 2021 20:11:28 +0000
> > Subject: [PATCH v2] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
> >
> > This patch adds support for auto-vectorization of average value
> > computation using vhadd or vrhadd, for both MVE and Neon.
> >
> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> > vec-common.md, I'm not sure how to factorize them without introducing
> > an unspec iterator?
> >
> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
> >
> > Vectorization works with 8-bit and 16 bit input/output vectors, but
> > not with 32-bit ones because the vectorizer expects wider types
> > availability for the intermediate values, but int32_t + int32_t does
> > not involve wider types in the IR.
> >
> > As noted in neon-vhadd-1.c, I couldn't write a test able to use Neon
> > vectorization with 64-bit vectors: we default to
> > -mvectorize-with-neon-quad, and attempts to use
> > -mvectorize-with-neon-double resulted in much worse code, which this
> > patch does not aim at improving.
>
> The above needs updating.
>
> > 2021-05-31 Christophe Lyon <christophe.lyon@linaro.org>
> >
> > gcc/
> > * gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
> > (@mve_vrhaddq_<supf><mode): Likewise.
> > * gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
> > * config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
> > (avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
> >
> > gcc/testsuite/
> > * gcc.target/arm/simd/mve-vhadd-1.c: New test.
> > * gcc.target/arm/simd/mve-vhadd-2.c: New test.
> > * gcc.target/arm/simd/neon-vhadd-1.c: New test.
> > * gcc.target/arm/simd/neon-vhadd-2.c: New test.
> > ---
> > gcc/config/arm/mve.md | 4 +-
> > gcc/config/arm/neon.md | 2 +-
> > gcc/config/arm/vec-common.md | 60 ++++++++++++
> > .../gcc.target/arm/simd/mve-vhadd-1.c | 31 +++++++
> > .../gcc.target/arm/simd/mve-vhadd-2.c | 33 +++++++
> > .../gcc.target/arm/simd/neon-vhadd-1.c | 93 +++++++++++++++++++
> > .../gcc.target/arm/simd/neon-vhadd-2.c | 33 +++++++
> > 7 files changed, 253 insertions(+), 3 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> > create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> > create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> > create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> >
> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > index 0bfa6a91d55..04aa612331a 100644
> > --- a/gcc/config/arm/mve.md
> > +++ b/gcc/config/arm/mve.md
> > @@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
> > ;;
> > ;; [vhaddq_s, vhaddq_u])
> > ;;
> > -(define_insn "mve_vhaddq_<supf><mode>"
> > +(define_insn "@mve_vhaddq_<supf><mode>"
> > [
> > (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> > @@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
> > ;;
> > ;; [vrhaddq_s, vrhaddq_u])
> > ;;
> > -(define_insn "mve_vrhaddq_<supf><mode>"
> > +(define_insn "@mve_vrhaddq_<supf><mode>"
> > [
> > (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index 077c62ffd20..18571d819eb 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
> >
> > ; vhadd and vrhadd.
> >
> > -(define_insn "neon_v<r>hadd<sup><mode>"
> > +(define_insn "@neon_v<r>hadd<sup><mode>"
> > [(set (match_operand:VDQIW 0 "s_register_operand" "=w")
> > (unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
> > (match_operand:VDQIW 2 "s_register_operand" "w")]
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 80b273229f5..2779c1a8aaa 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
> >
> > DONE;
> > })
> > +
> > +(define_expand "avg<mode>3_floor"
> > + [(match_operand:MVE_2 0 "s_register_operand")
> > + (match_operand:MVE_2 1 "s_register_operand")
> > + (match_operand:MVE_2 2 "s_register_operand")]
> > + "ARM_HAVE_<MODE>_ARITH"
> > +{
> > + if (TARGET_HAVE_MVE)
> > + emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + else
> > + emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + DONE;
> > +})
> > +
> > +(define_expand "uavg<mode>3_floor"
> > + [(match_operand:MVE_2 0 "s_register_operand")
> > + (match_operand:MVE_2 1 "s_register_operand")
> > + (match_operand:MVE_2 2 "s_register_operand")]
> > + "ARM_HAVE_<MODE>_ARITH"
> > +{
> > + if (TARGET_HAVE_MVE)
> > + emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + else
> > + emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + DONE;
> > +})
> > +
> > +(define_expand "avg<mode>3_ceil"
> > + [(match_operand:MVE_2 0 "s_register_operand")
> > + (match_operand:MVE_2 1 "s_register_operand")
> > + (match_operand:MVE_2 2 "s_register_operand")]
> > + "ARM_HAVE_<MODE>_ARITH"
> > +{
> > + if (TARGET_HAVE_MVE)
> > + emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + else
> > + emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + DONE;
> > +})
> > +
> > +(define_expand "uavg<mode>3_ceil"
> > + [(match_operand:MVE_2 0 "s_register_operand")
> > + (match_operand:MVE_2 1 "s_register_operand")
> > + (match_operand:MVE_2 2 "s_register_operand")]
> > + "ARM_HAVE_<MODE>_ARITH"
> > +{
> > + if (TARGET_HAVE_MVE)
> > + emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + else
> > + emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
> > + operands[0], operands[1], operands[2]));
> > + DONE;
> > +})
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> > new file mode 100644
> > index 00000000000..19d5f5aa44f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
> > @@ -0,0 +1,31 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> > +/* { dg-add-options arm_v8_1m_mve } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +#include <stdint.h>
> > +
> > +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> > + inputs. */
> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> > + int i; \
> > + for (i=0; i < (128 / BITS); i++) { \
> > + dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
> > + } \
> > +}
> > +
> > +FUNC(s, int, 32, +, vhadd)
> > +FUNC(u, uint, 32, +, vhadd)
> > +FUNC(s, int, 16, +, vhadd)
> > +FUNC(u, uint, 16, +, vhadd)
> > +FUNC(s, int, 8, +, vhadd)
> > +FUNC(u, uint, 8, +, vhadd)
> > +
> > +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> > new file mode 100644
> > index 00000000000..b19d4253b54
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_v8_1m_mve_ok } */
> > +/* { dg-add-options arm_v8_1m_mve } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +#include <stdint.h>
> > +
> > +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> > + inputs. */
> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> > + int i; \
> > + for (i=0; i < (128 / BITS); i++) { \
> > + dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
> > + } \
> > +}
> > +
> > +FUNC(s, int, 32, +, vrhadd)
> > +FUNC(u, uint, 32, +, vrhadd)
> > +FUNC(s, int, 16, +, vrhadd)
> > +FUNC(u, uint, 16, +, vrhadd)
> > +FUNC(s, int, 8, +, vrhadd)
> > +FUNC(u, uint, 8, +, vrhadd)
> > +
> > +/* We cannot make use of the 32 bits version because the vectorizer needs an
> > + extended precision for the intermediate computations. */
>
> This comment no longer applies.
>
> > +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> > new file mode 100644
> > index 00000000000..cd00ca81c6d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
> > @@ -0,0 +1,93 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-add-options arm_neon } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +#include <stdint.h>
> > +
> > +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
> > + we can vectorize using 128-bit vectors. */
> > +/* For instance with int16_t:
> > + 8 iterations and -mvectorize-with-neon-quad, we generate:
> > +test_vhadd_s16:
> > + vld1.16 {q8}, [r1]
> > + vld1.16 {q9}, [r2]
> > + vhadd.s16 q8, q8, q9
> > + vst1.16 {q8}, [r0]
> > + bx lr
> > +
> > + 8 iterations and -mvectorize-with-neon-double, we generate:
> > +test_vhadd_s16:
> > + vld1.16 {d24}, [r1]!
> > + vld1.16 {d26}, [r2]!
> > + vld1.16 {d22}, [r1]
> > + vmovl.s16 q12, d24
> > + vld1.16 {d20}, [r2]
> > + vmovl.s16 q13, d26
> > + vmovl.s16 q11, d22
> > + vmovl.s16 q10, d20
> > + vadd.i32 d28, d26, d24
> > + vadd.i32 d24, d27, d25
> > + vadd.i32 d25, d22, d20
> > + vadd.i32 d20, d23, d21
> > + vshr.s32 d18, d28, #1
> > + vshr.s32 d19, d24, #1
> > + vshr.s32 d16, d25, #1
> > + vshr.s32 d17, d20, #1
> > + vmovn.i32 d18, q9
> > + vmovn.i32 d16, q8
> > + vst1.16 {d18}, [r0]!
> > + vst1.16 {d16}, [r0]
> > + bx lr
> > +
> > + Adding a cast to avoid integer promotion:
> > + dest[i] = (int16_t)(a[i] + b[i]) >> 1
> > +
> > + 8 iterations and -mvectorize-with-neon-quad, we generate:
> > +test_vhadd_s16:
> > + vld1.16 {q8}, [r2]
> > + vld1.16 {q9}, [r1]
> > + vadd.i16 q8, q8, q9
> > + vshr.s16 q8, q8, #1
> > + vst1.16 {q8}, [r0]
> > + bx lr
> > +
> > + 8 iterations and -mvectorize-with-neon-double, we generate:
> > +test_vhadd_s16:
> > + vld1.16 {d17}, [r1]!
> > + vld1.16 {d19}, [r2]!
> > + vld1.16 {d18}, [r1]
> > + vld1.16 {d16}, [r2]
> > + vadd.i16 d17, d17, d19
> > + vadd.i16 d16, d16, d18
> > + vshr.s16 d17, d17, #1
> > + vshr.s16 d16, d16, #1
> > + vst1.16 {d17}, [r0]!
> > + vst1.16 {d16}, [r0]
> > + bx lr
> > +
> > + */
>
> I think we should remove this and say that, at the moment, we've only
> implemented the optabs for 128-bit vectors, so that's all that we
> test here.
>
I would have sworn I had updated all the comments :-(
> OK with those changes, thanks.
Is the wording OK in the attached v3?
Thanks
> Richard
>
> > +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> > + inputs. */
> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> > + int i; \
> > + for (i=0; i < (128 / BITS); i++) { \
> > + dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
> > + } \
> > +}
> > +
> > +FUNC(s, int, 32, +, vhadd)
> > +FUNC(u, uint, 32, +, vhadd)
> > +FUNC(s, int, 16, +, vhadd)
> > +FUNC(u, uint, 16, +, vhadd)
> > +FUNC(s, int, 8, +, vhadd)
> > +FUNC(u, uint, 8, +, vhadd)
> > +
> > +/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> > new file mode 100644
> > index 00000000000..f2692542f9b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-add-options arm_neon } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +#include <stdint.h>
> > +
> > +/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
> > + we can vectorize using 128-bit vectors. */
> > +/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
> > + inputs. */
> > +#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
> > + void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
> > + TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> > + int i; \
> > + for (i=0; i < (128 / BITS); i++) { \
> > + dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
> > + } \
> > +}
> > +
> > +FUNC(s, int, 32, +, vrhadd)
> > +FUNC(u, uint, 32, +, vrhadd)
> > +FUNC(s, int, 16, +, vrhadd)
> > +FUNC(u, uint, 16, +, vrhadd)
> > +FUNC(s, int, 8, +, vrhadd)
> > +FUNC(u, uint, 8, +, vrhadd)
> > +
> > +/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
> > +/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
[-- Attachment #2: v3-0001-arm-Auto-vectorization-for-MVE-and-Neon-vhadd-vrh.patch --]
[-- Type: text/x-patch, Size: 12257 bytes --]
From 5327deeccda2661d29182c8a93e67874f32d7976 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Thu, 27 May 2021 20:11:28 +0000
Subject: [PATCH v3] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
This patch adds support for auto-vectorization of average value
computation using vhadd or vrhadd, for both MVE and Neon.
The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
vec-common.md, I'm not sure how to factorize them without introducing
an unspec iterator?
It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
Vectorization works with 8-bit and 16 bit input/output vectors, but
not with 32-bit ones because the vectorizer expects wider types
availability for the intermediate values, but int32_t + int32_t does
not involve wider types in the IR.
The testcases use a cast to int64_t to workaround this and enable
vectorization with vectors of 32-bit elements.
2021-06-09 Christophe Lyon <christophe.lyon@linaro.org>
gcc/
* gcc/config/arm/mve.md (mve_vhaddq_<supf><mode>): Prefix with '@'.
(@mve_vrhaddq_<supf><mode): Likewise.
* gcc/config/arm/neon.md (neon_v<r>hadd<sup><mode>): Likewise.
* config/arm/vec-common.md (avg<mode>3_floor, uavg<mode>3_floor)
(avg<mode>3_ceil", uavg<mode>3_ceil): New patterns.
gcc/testsuite/
* gcc.target/arm/simd/mve-vhadd-1.c: New test.
* gcc.target/arm/simd/mve-vhadd-2.c: New test.
* gcc.target/arm/simd/neon-vhadd-1.c: New test.
* gcc.target/arm/simd/neon-vhadd-2.c: New test.
---
gcc/config/arm/mve.md | 4 +-
gcc/config/arm/neon.md | 2 +-
gcc/config/arm/vec-common.md | 60 +++++++++++++++++++
.../gcc.target/arm/simd/mve-vhadd-1.c | 31 ++++++++++
.../gcc.target/arm/simd/mve-vhadd-2.c | 31 ++++++++++
.../gcc.target/arm/simd/neon-vhadd-1.c | 34 +++++++++++
.../gcc.target/arm/simd/neon-vhadd-2.c | 33 ++++++++++
7 files changed, 192 insertions(+), 3 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
create mode 100644 gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0bfa6a91d55..04aa612331a 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -1030,7 +1030,7 @@ (define_insn "mve_vhaddq_n_<supf><mode>"
;;
;; [vhaddq_s, vhaddq_u])
;;
-(define_insn "mve_vhaddq_<supf><mode>"
+(define_insn "@mve_vhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
@@ -1652,7 +1652,7 @@ (define_insn "mve_vqsubq_<supf><mode>"
;;
;; [vrhaddq_s, vrhaddq_u])
;;
-(define_insn "mve_vrhaddq_<supf><mode>"
+(define_insn "@mve_vrhaddq_<supf><mode>"
[
(set (match_operand:MVE_2 0 "s_register_operand" "=w")
(unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 077c62ffd20..18571d819eb 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1488,7 +1488,7 @@ (define_insn "neon_vaddw<sup><mode>"
; vhadd and vrhadd.
-(define_insn "neon_v<r>hadd<sup><mode>"
+(define_insn "@neon_v<r>hadd<sup><mode>"
[(set (match_operand:VDQIW 0 "s_register_operand" "=w")
(unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
(match_operand:VDQIW 2 "s_register_operand" "w")]
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 80b273229f5..2779c1a8aaa 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -565,3 +565,63 @@ (define_expand "reduc_plus_scal_<mode>"
DONE;
})
+
+(define_expand "avg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_S, UNSPEC_VHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_floor"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vhaddq (VHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VHADD_U, UNSPEC_VHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "avg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_S, UNSPEC_VRHADD_S, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_expand "uavg<mode>3_ceil"
+ [(match_operand:MVE_2 0 "s_register_operand")
+ (match_operand:MVE_2 1 "s_register_operand")
+ (match_operand:MVE_2 2 "s_register_operand")]
+ "ARM_HAVE_<MODE>_ARITH"
+{
+ if (TARGET_HAVE_MVE)
+ emit_insn (gen_mve_vrhaddq (VRHADDQ_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ else
+ emit_insn (gen_neon_vhadd (UNSPEC_VRHADD_U, UNSPEC_VRHADD_U, <MODE>mode,
+ operands[0], operands[1], operands[2]));
+ DONE;
+})
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
new file mode 100644
index 00000000000..19d5f5aa44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
new file mode 100644
index 00000000000..30029fc86b3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vhadd-2.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
new file mode 100644
index 00000000000..ce577849f03
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-1.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we have implemented the avg* optabs for 128-bit vectors only, use
+ enough iterations to check that vectorization works as expected. */
+
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i]) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vhadd)
+FUNC(u, uint, 32, +, vhadd)
+FUNC(s, int, 16, +, vhadd)
+FUNC(u, uint, 16, +, vhadd)
+FUNC(s, int, 8, +, vhadd)
+FUNC(u, uint, 8, +, vhadd)
+
+/* { dg-final { scan-assembler-times {vhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
new file mode 100644
index 00000000000..f2692542f9b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/neon-vhadd-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-add-options arm_neon } */
+/* { dg-additional-options "-O3" } */
+
+#include <stdint.h>
+
+/* Since we default to -mvectorize-with-neon-quad, use enough iterations so that
+ we can vectorize using 128-bit vectors. */
+/* We force a cast to int64_t to enable the vectorizer when dealing with 32-bit
+ inputs. */
+#define FUNC(SIGN, TYPE, BITS, OP, NAME) \
+ void test_ ## NAME ##_ ## SIGN ## BITS (TYPE##BITS##_t * __restrict__ dest, \
+ TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
+ int i; \
+ for (i=0; i < (128 / BITS); i++) { \
+ dest[i] = ((int64_t)a[i] OP b[i] + 1) >> 1; \
+ } \
+}
+
+FUNC(s, int, 32, +, vrhadd)
+FUNC(u, uint, 32, +, vrhadd)
+FUNC(s, int, 16, +, vrhadd)
+FUNC(u, uint, 16, +, vrhadd)
+FUNC(s, int, 8, +, vrhadd)
+FUNC(u, uint, 8, +, vrhadd)
+
+/* { dg-final { scan-assembler-times {vrhadd\.s32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u32\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u16\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.s8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {vrhadd\.u8\tq[0-9]+, q[0-9]+, q[0-9]+} 1 } } */
--
2.25.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-06-09 15:07 ` Christophe Lyon
@ 2021-06-09 15:39 ` Richard Sandiford
2021-06-09 15:49 ` Christophe Lyon
0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2021-06-09 15:39 UTC (permalink / raw)
To: Christophe Lyon; +Cc: gcc Patches
Christophe Lyon <christophe.lyon@linaro.org> writes:
> This patch adds support for auto-vectorization of average value
> computation using vhadd or vrhadd, for both MVE and Neon.
>
> The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> vec-common.md, I'm not sure how to factorize them without introducing
> an unspec iterator?
>
> It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
>
> Vectorization works with 8-bit and 16 bit input/output vectors, but
> not with 32-bit ones because the vectorizer expects wider types
> availability for the intermediate values, but int32_t + int32_t does
> not involve wider types in the IR.
>
> The testcases use a cast to int64_t to workaround this and enable
> vectorization with vectors of 32-bit elements.
We're probably in violent agreement here and just phrasing it
differently :-), but IMO this isn't a workaround. It would be actively
wrong to use V(R)HADD for the (u)int32_t version without the cast:
V(R)HADD effectively computes a 33-bit addition result, instead of
the 32-bit addition result in the original (cast-free) source code.
I guess one could argue that overflow is undefined for int32_t + int32_t
and so we could compute the full 33-bit result instead of using modulo
arithmetic. That might be too surprising though. It also wouldn't
help with uint32_t, since we do need to use modulo arithmetic for
uint32_t + uint32_t.
So personally I think it would be better to drop the last two paragraphs.
The patch LGTM though, thanks.
Richard
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd
2021-06-09 15:39 ` Richard Sandiford
@ 2021-06-09 15:49 ` Christophe Lyon
0 siblings, 0 replies; 7+ messages in thread
From: Christophe Lyon @ 2021-06-09 15:49 UTC (permalink / raw)
To: Christophe Lyon, gcc Patches, Richard Sandiford
On Wed, 9 Jun 2021 at 17:39, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > This patch adds support for auto-vectorization of average value
> > computation using vhadd or vrhadd, for both MVE and Neon.
> >
> > The patch adds the needed [u]avg<mode>3_[floor|ceil] patterns to
> > vec-common.md, I'm not sure how to factorize them without introducing
> > an unspec iterator?
> >
> > It also adds tests for 'floor' and for 'ceil', each for MVE and Neon.
> >
> > Vectorization works with 8-bit and 16 bit input/output vectors, but
> > not with 32-bit ones because the vectorizer expects wider types
> > availability for the intermediate values, but int32_t + int32_t does
> > not involve wider types in the IR.
> >
> > The testcases use a cast to int64_t to workaround this and enable
> > vectorization with vectors of 32-bit elements.
>
> We're probably in violent agreement here and just phrasing it
> differently :-), but IMO this isn't a workaround. It would be actively
> wrong to use V(R)HADD for the (u)int32_t version without the cast:
> V(R)HADD effectively computes a 33-bit addition result, instead of
> the 32-bit addition result in the original (cast-free) source code.
>
> I guess one could argue that overflow is undefined for int32_t + int32_t
> and so we could compute the full 33-bit result instead of using modulo
> arithmetic. That might be too surprising though. It also wouldn't
> help with uint32_t, since we do need to use modulo arithmetic for
> uint32_t + uint32_t.
Right. It would have been clearer if I had written different functions
for the different types, only needing the cast for the int32_t.
I used "workaround" because the int64_t cast looks strange with int8_t
and int16_t.
> So personally I think it would be better to drop the last two paragraphs.
>
> The patch LGTM though, thanks.
>
OK, thanks.
> Richard
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-06-09 15:49 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-31 13:55 [PATCH] arm: Auto-vectorization for MVE and Neon: vhadd/vrhadd Christophe Lyon
2021-06-02 18:19 ` Richard Sandiford
2021-06-04 12:57 ` Christophe Lyon
2021-06-08 11:50 ` Richard Sandiford
2021-06-09 15:07 ` Christophe Lyon
2021-06-09 15:39 ` Richard Sandiford
2021-06-09 15:49 ` Christophe Lyon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).