[PATCH][AArch64] Leveraging the use of STP instruction for vec

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate
@ 2021-02-11 17:46 Victor Do Nascimento
  2021-02-15 15:53 ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Victor Do Nascimento @ 2021-02-11 17:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Earnshaw, nd

Dear GCC community,

The backend pattern for storing a pair of identical values in 32 and 64-bit modes with the machine instruction STP was missing, and multiple instructions were needed to reproduce this behavior as a result of failed RTL pattern match in combine pass. 

For the test case :

typedef long long v2di __attribute__((vector_size (16))); typedef int v2si __attribute__((vector_size (8)));

void
foo (v2di *x, long long a)
{
  v2di tmp = {a, a};
  *x = tmp;
}

void
foo2 (v2si *x, int a)
{
  v2si tmp = {a, a};
  *x = tmp;
}

at -O2 on aarch64 gives:

foo:
    stp x1, x1, [x0]
    ret
foo2:
    stp w1, w1, [x0]
    ret

instead of:

foo:
        dup     v0.2d, x1
        str     q0, [x0]
        ret
foo2:
        dup     v0.2s, w1
        str     d0, [x0]
        ret
        
Added new RTL template, unittest and checked for regressions on bootstrapped aarch64-none-linux-gnu.

gcc/ChangeLog

2021-02-04 victor Do Nascimento <mailto:victor.donascimento@arm.com>

	* config/aarch64/aarch64-simd.md: Implement RTX pattern for
	mapping 'vec_duplicate' RTX onto 'STP' ASM insn.
	* config/aarch64/iterators.md: Implement ldpstp_vel_sz iterator
	to map STP/LDP vector element mode to correct suffix in 
	attribute type definition of aarch64_simd_stp<mode> pattern.
    
gcc/testsuite/ChangeLog
    
2021-02-04 Victor Do Nascimento <mailto:victor.donascimento@arm.com>

	* gcc.target/stp_vec-dup_32_64-1.c: New.

Regards,
Victor

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 68baf416045..4623cbb95f4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -205,6 +205,16 @@
   [(set_attr "type" "neon_stp")]
 )
 
+(define_insn "aarch64_simd_stp<mode>"
+    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_operand" "=Ump,Ump")
+      (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]
+    "TARGET_SIMD"
+    "@
+     stp\\t%<Vetype>1, %<Vetype>1, %z0
+     stp\\t%<vw>1, %<vw>1, %z0"
+    [(set_attr "type" "neon_stp<q>, store_<ldpstp_vel_sz>")]
+)
+
 (define_insn "load_pair<VQ:mode><VQ2:mode>"
   [(set (match_operand:VQ 0 "register_operand" "=w")
 	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index fb1426b7752..aac6e0b5bd9 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -880,6 +880,9 @@
 ;; Likewise for load/store pair.
 (define_mode_attr ldpstp_sz [(SI "8") (DI "16")])
 
+;; Size of element access for STP/LDP-generated vectors.
+(define_mode_attr ldpstp_vel_sz [(V2SI "8") (V2SF "8") (V2DI "16") (V2DF "16")])
+
 ;; For inequal width int to float conversion
 (define_mode_attr w1 [(HF "w") (SF "w") (DF "x")])
 (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
new file mode 100644
index 00000000000..a37c903dfd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (8)));
+
+void
+foo (v2di *x, long long a)
+{
+  v2di tmp = {a, a};
+  *x = tmp;
+}
+
+void
+foo2 (v2si *x, int a)
+{
+  v2si tmp = {a, a};
+  *x = tmp;
+}
+
+/* { dg-final { scan-assembler-times "stp\t" 2 } } */
+/* { dg-final { scan-assembler-not "dup\t" } } */

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate
  2021-02-11 17:46 [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate Victor Do Nascimento
@ 2021-02-15 15:53 ` Richard Sandiford
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2021-02-15 15:53 UTC (permalink / raw)
  To: Victor Do Nascimento via Gcc-patches
  Cc: Victor Do Nascimento, nd, Richard Earnshaw

Hi Victor,

Thanks for the patch.  I have a couple of very minor comments below,
but otherwise it looks good to go.  However, it will need to wait for
stage 1 to open, unless it fixes a regression.

Victor Do Nascimento via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 68baf416045..4623cbb95f4 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -205,6 +205,16 @@
>    [(set_attr "type" "neon_stp")]
>  )
>  
> +(define_insn "aarch64_simd_stp<mode>"
> +    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_operand" "=Ump,Ump")
> +      (vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]
> +    "TARGET_SIMD"
> +    "@
> +     stp\\t%<Vetype>1, %<Vetype>1, %z0
> +     stp\\t%<vw>1, %<vw>1, %z0"
> +    [(set_attr "type" "neon_stp<q>, store_<ldpstp_vel_sz>")]

Minor formatting nit, but: these patterns are generally indented
by 2 spaces rather than 4 (at least in config/aarch64).  Also,
it would be good if the (vec_duplicate:…) lined up with the
(match_operand:…)

I think the type for the first alternative should be neon_stp
rather than neon_stp<q>.  The instruction only ever stores S or D
registers, whereas neon_stp_q is for storing Q registers.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate
@ 2021-03-18 15:36 Victor Do Nascimento
  2021-04-21 12:40 ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Victor Do Nascimento @ 2021-03-18 15:36 UTC (permalink / raw)
  To: gcc-patches; +Cc: kyrylo.tkachov, richard.earnshaw, nd, victor.donascimento

The backend pattern for storing a pair of identical values in 32 and 64-bit modes with the machine instruction STP was missing, and multiple instructions were needed to reproduce this behavior as a result of failed RTL pattern match in combine pass.

For the test case :

typedef long long v2di __attribute__((vector_size (16)));
typedef int v2si __attribute__((vector_size (8)));

void
foo (v2di *x, long long a)
{
	v2di tmp = {a, a};
	*x = tmp;
}

void
foo2 (v2si *x, int a)
{
	v2si tmp = {a, a};
	*x = tmp;
}

at -O2 on aarch64 gives:

foo:
	stp x1, x1, [x0]
	ret
foo2:
	stp w1, w1, [x0]
	ret

instead of:

foo:
	dup     v0.2d, x1
	str     q0, [x0]
	ret
foo2:
	dup     v0.2s, w1
	str     d0, [x0]
	ret

In preparation for the next stage 1  phase of development, added new RTL template, unittest and checked for regressions on bootstrapped aarch64-none-linux-gnu.

gcc/ChangeLog

2021-02-04 victor Do Nascimento <victor.donascimento@arm.com>

	* config/aarch64/aarch64-simd.md: Implement RTX pattern for
	mapping 'vec_duplicate' RTX onto 'STP' ASM insn.
	* config/aarch64/iterators.md: Implement ldpstp_vel_sz iterator
	to map STP/LDP vector element mode to correct suffix in
	attribute type definition of aarch64_simd_stp<mode> pattern.

gcc/testsuite/ChangeLog

2021-02-04 Victor Do Nascimento <victor.donascimento@arm.com>

	* gcc.target/stp_vec-dup_32_64-1.c: Added test.

Regards,
Victor

---
 gcc/config/aarch64/aarch64-simd.md            | 10 +++++++++
 gcc/config/aarch64/iterators.md               |  3 +++
 .../gcc.target/aarch64/stp_vec_dup_32_64-1.c  | 22 +++++++++++++++++++
 3 files changed, 35 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 71aa77dd010..3d53bab0018 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -205,6 +205,16 @@
   [(set_attr "type" "neon_stp")]
 )
 
+(define_insn "aarch64_simd_stp<mode>"
+  [(set (match_operand:VP_2E 0 "aarch64_mem_pair_operand" "=Ump,Ump")
+		(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]
+  "TARGET_SIMD"
+  "@
+   stp\\t%<Vetype>1, %<Vetype>1, %z0
+   stp\\t%<vw>1, %<vw>1, %z0"
+  [(set_attr "type" "neon_stp, store_<ldpstp_vel_sz>")]
+)
+
 (define_insn "load_pair<VQ:mode><VQ2:mode>"
   [(set (match_operand:VQ 0 "register_operand" "=w")
 	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index fb6e228651e..196055d31e5 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -898,6 +898,9 @@
 ;; Likewise for load/store pair.
 (define_mode_attr ldpstp_sz [(SI "8") (DI "16")])
 
+;; Size of element access for STP/LDP-generated vectors.
+(define_mode_attr ldpstp_vel_sz [(V2SI "8") (V2SF "8") (V2DI "16") (V2DF "16")])
+
 ;; For inequal width int to float conversion
 (define_mode_attr w1 [(HF "w") (SF "w") (DF "x")])
 (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
new file mode 100644
index 00000000000..a37c903dfd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
@@ -0,1 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (8)));
+
+void
+foo (v2di *x, long long a)
+{
+  v2di tmp = {a, a};
+  *x = tmp;
+}
+
+void
+foo2 (v2si *x, int a)
+{
+  v2si tmp = {a, a};
+  *x = tmp;
+}
+
+/* { dg-final { scan-assembler-times "stp\t" 2 } } */
+/* { dg-final { scan-assembler-not "dup\t" } } */
-- 
2.17.1

@@ -1,0 +23,0 @@

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate
  2021-03-18 15:36 Victor Do Nascimento
@ 2021-04-21 12:40 ` Richard Sandiford
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2021-04-21 12:40 UTC (permalink / raw)
  To: Victor Do Nascimento via Gcc-patches
  Cc: Victor Do Nascimento, nd, richard.earnshaw

Victor Do Nascimento via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> The backend pattern for storing a pair of identical values in 32 and 64-bit modes with the machine instruction STP was missing, and multiple instructions were needed to reproduce this behavior as a result of failed RTL pattern match in combine pass.
>
> For the test case :
>
> typedef long long v2di __attribute__((vector_size (16)));
> typedef int v2si __attribute__((vector_size (8)));
>
> void
> foo (v2di *x, long long a)
> {
> 	v2di tmp = {a, a};
> 	*x = tmp;
> }
>
> void
> foo2 (v2si *x, int a)
> {
> 	v2si tmp = {a, a};
> 	*x = tmp;
> }
>
> at -O2 on aarch64 gives:
>
> foo:
> 	stp x1, x1, [x0]
> 	ret
> foo2:
> 	stp w1, w1, [x0]
> 	ret
>
> instead of:
>
> foo:
> 	dup     v0.2d, x1
> 	str     q0, [x0]
> 	ret
> foo2:
> 	dup     v0.2s, w1
> 	str     d0, [x0]
> 	ret
>
> In preparation for the next stage 1  phase of development, added new RTL template, unittest and checked for regressions on bootstrapped aarch64-none-linux-gnu.
>
> gcc/ChangeLog
>
> 2021-02-04 victor Do Nascimento <victor.donascimento@arm.com>
>
> 	* config/aarch64/aarch64-simd.md: Implement RTX pattern for
> 	mapping 'vec_duplicate' RTX onto 'STP' ASM insn.
> 	* config/aarch64/iterators.md: Implement ldpstp_vel_sz iterator
> 	to map STP/LDP vector element mode to correct suffix in
> 	attribute type definition of aarch64_simd_stp<mode> pattern.

A more typical changelog entry would be:

	* config/aarch64/iterators.md (ldpstp_vel_sz): New mode attribute.
	* config/aarch64/aarch64-simd.md (aarch64_simd_stp<mode>): New pattern.

> gcc/testsuite/ChangeLog
>
> 2021-02-04 Victor Do Nascimento <victor.donascimento@arm.com>
>
> 	* gcc.target/stp_vec-dup_32_64-1.c: Added test.
>
> Regards,
> Victor
>
> ---
>  gcc/config/aarch64/aarch64-simd.md            | 10 +++++++++
>  gcc/config/aarch64/iterators.md               |  3 +++
>  .../gcc.target/aarch64/stp_vec_dup_32_64-1.c  | 22 +++++++++++++++++++
>  3 files changed, 35 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 71aa77dd010..3d53bab0018 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -205,6 +205,16 @@
>    [(set_attr "type" "neon_stp")]
>  )
>  
> +(define_insn "aarch64_simd_stp<mode>"
> +  [(set (match_operand:VP_2E 0 "aarch64_mem_pair_operand" "=Ump,Ump")
> +		(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]

Formatting nit: should just be one tab here.

I would have just made that change locally and committed, but I think
there's a problem: aarch64_mem_pair_operand and Ump are geared for pairs
of full-vector stores, rather than for pairs of elements.  This means that
(for example) the V2SI range will be [-256,255] * 8 rather than the expected
[-256,255] * 4.

I think we need to use aarch64_mem_pair_lanes_operand and Umn instead,
as for store_pair_lanes<mode>.  In addition:

  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming mode
     corresponds to the actual size of the memory being loaded/stored and the
     mode of the corresponding addressing mode is half of that.  */
  if (type == ADDR_QUERY_LDP_STP_N
      && known_eq (GET_MODE_SIZE (mode), 16))
    mode = DFmode;

only handles 128-bit vectors, whereas here we need it to handle 64-bit
vectors too.

It would be good to test the limits, e.g.:

> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> new file mode 100644
> index 00000000000..a37c903dfd4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> @@ -0,1 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (8)));
> +
> +void
> +foo (v2di *x, long long a)
> +{
> +  v2di tmp = {a, a};
> +  *x = tmp;
> +}
> +
> +void
> +foo2 (v2si *x, int a)
> +{
> +  v2si tmp = {a, a};
> +  *x = tmp;
> +}

We could have additional tests for:

x[-129] = tmp; // out of range
x[-128] = tmp; // in range
x[127] = tmp; // in range
x[128] = tmp; // out of range

Thanks,
Richard

> +
> +/* { dg-final { scan-assembler-times "stp\t" 2 } } */
> +/* { dg-final { scan-assembler-not "dup\t" } } */

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-04-21 12:40 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-11 17:46 [PATCH][AArch64] Leveraging the use of STP instruction for vec_duplicate Victor Do Nascimento
2021-02-15 15:53 ` Richard Sandiford
2021-03-18 15:36 Victor Do Nascimento
2021-04-21 12:40 ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).