[RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.
@ 2021-01-04 12:11 Tamar Christina
  2021-01-04 12:28 ` Richard Sandiford
  0 siblings, 1 reply; 3+ messages in thread
From: Tamar Christina @ 2021-01-04 12:11 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3466 bytes --]

Hi All,

I have been looking into a class of problems where GCC is not recognizing that
a subreg of lane 0 (using little-endian as example) of a vector register and
passing that to an instruction.

As an example consider

poly64_t
testcase (uint8x16_t input, poly64x2_t mask)
{
    poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
			       vgetq_lane_p64(mask, 0));
    poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
    return prodL + prodH;
}

Where we generate

testcase:
	dup     d2, v0.d[0]
	dup     d3, v1.d[0]
	pmull2  v0.1q, v0.2d, v1.2d
	pmull   v2.1q, v2.1d, v3.1d
	add     d0, d2, d0
	fmov    x0, d0
	ret

whereas it should have been, which clang generates:

testcase:
	pmull   v2.1q, v0.1d, v1.1d
	pmull2  v0.1q, v0.2d, v1.2d
	add     v0.2d, v0.2d, v2.2d
	fmov    x0, d0
	ret

Now this can be naively solved by just adding the RTL patterns for the
vec_selects as the example in the patch, but this doesn't solve the overall
problem and I am wondering how to best do this.

One approach would be to extend combine's noop detection in noop_move_p to
recognize these cases.

The downside here is that the conversion becomes implicit in the rtl. i.e.
you'll see a SET of a V2DI but a use of DI for that same register.  I'm not sure
the semantics of RTL allow such implicit uses?

The second approach I can think of is to extend reload to recognize these no-ops
and give the same register and mark the extract as unused such that DSE cleans
it up.

But there's probably a better approach I didn't think of :)

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_crypto_pmullv2di): Example RTL.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pmull_2.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 05d18f8bd3ac09c56c82dc73cff855315eb302b7..7bdb93869dbbedc786575b5f89f39c4c6d0d76d0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7231,6 +7231,20 @@ (define_insn "aarch64_crypto_pmulldi"
   [(set_attr "type" "crypto_pmull")]
 )
 
+(define_insn "*aarch64_crypto_pmullv2di"
+  [(set (match_operand:TI 0 "register_operand" "=w")
+        (unspec:TI  [(vec_select:DI
+			(match_operand:V2DI 1 "register_operand" "w")
+			(parallel [
+			  (match_operand:SI 2 "const_int_operand" "Z")]))
+		     (match_operand:DI 3 "register_operand" "w")]
+		    UNSPEC_PMULL))]
+ "TARGET_SIMD && TARGET_AES"
+ "pmull\\t%0.1q, %1.1d, %3.1d"
+  [(set_attr "type" "crypto_pmull")]
+)
+
+
 (define_insn "aarch64_crypto_pmullv2di"
  [(set (match_operand:TI 0 "register_operand" "=w")
        (unspec:TI [(match_operand:V2DI 1 "register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/aarch64/pmull_2.c b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..d9d47518fab2b582329b6332e3a9c7d97c148192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=armv8-a+crypto -O3" } */
+
+#include "arm_neon.h"
+
+poly64_t
+testcase (uint8x16_t input, poly64x2_t mask)
+{
+    poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
+			       vgetq_lane_p64(mask, 0));
+    poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
+    return prodL + prodH;
+}
+
+/* { dg-final { scan-assembler-times "pmull\\tv" 1 } } */


-- 

[-- Attachment #2: rb13979.patch --]
[-- Type: text/x-diff, Size: 1698 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 05d18f8bd3ac09c56c82dc73cff855315eb302b7..7bdb93869dbbedc786575b5f89f39c4c6d0d76d0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7231,6 +7231,20 @@ (define_insn "aarch64_crypto_pmulldi"
   [(set_attr "type" "crypto_pmull")]
 )
 
+(define_insn "*aarch64_crypto_pmullv2di"
+  [(set (match_operand:TI 0 "register_operand" "=w")
+        (unspec:TI  [(vec_select:DI
+			(match_operand:V2DI 1 "register_operand" "w")
+			(parallel [
+			  (match_operand:SI 2 "const_int_operand" "Z")]))
+		     (match_operand:DI 3 "register_operand" "w")]
+		    UNSPEC_PMULL))]
+ "TARGET_SIMD && TARGET_AES"
+ "pmull\\t%0.1q, %1.1d, %3.1d"
+  [(set_attr "type" "crypto_pmull")]
+)
+
+
 (define_insn "aarch64_crypto_pmullv2di"
  [(set (match_operand:TI 0 "register_operand" "=w")
        (unspec:TI [(match_operand:V2DI 1 "register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/aarch64/pmull_2.c b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..d9d47518fab2b582329b6332e3a9c7d97c148192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=armv8-a+crypto -O3" } */
+
+#include "arm_neon.h"
+
+poly64_t
+testcase (uint8x16_t input, poly64x2_t mask)
+{
+    poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
+			       vgetq_lane_p64(mask, 0));
+    poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
+    return prodL + prodH;
+}
+
+/* { dg-final { scan-assembler-times "pmull\\tv" 1 } } */


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.
  2021-01-04 12:11 [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op Tamar Christina
@ 2021-01-04 12:28 ` Richard Sandiford
  2021-01-04 12:52   ` Tamar Christina
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Sandiford @ 2021-01-04 12:28 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> I have been looking into a class of problems where GCC is not recognizing that
> a subreg of lane 0 (using little-endian as example) of a vector register and
> passing that to an instruction.
>
> As an example consider
>
> poly64_t
> testcase (uint8x16_t input, poly64x2_t mask)
> {
>     poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> 			       vgetq_lane_p64(mask, 0));
>     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
>     return prodL + prodH;
> }
>
> Where we generate
>
> testcase:
> 	dup     d2, v0.d[0]
> 	dup     d3, v1.d[0]
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	pmull   v2.1q, v2.1d, v3.1d
> 	add     d0, d2, d0
> 	fmov    x0, d0
> 	ret
>
> whereas it should have been, which clang generates:
>
> testcase:
> 	pmull   v2.1q, v0.1d, v1.1d
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	add     v0.2d, v0.2d, v2.2d
> 	fmov    x0, d0
> 	ret
>
> Now this can be naively solved by just adding the RTL patterns for the
> vec_selects as the example in the patch, but this doesn't solve the overall
> problem and I am wondering how to best do this.
>
> One approach would be to extend combine's noop detection in noop_move_p to
> recognize these cases.
>
> The downside here is that the conversion becomes implicit in the rtl. i.e.
> you'll see a SET of a V2DI but a use of DI for that same register.  I'm not sure
> the semantics of RTL allow such implicit uses?

It's OK to set a hard register in one mode and use it in a different mode
(without subregs), but it's not possible to do the same using pseudos.

> The second approach I can think of is to extend reload to recognize these no-ops
> and give the same register and mark the extract as unused such that DSE cleans
> it up.
>
> But there's probably a better approach I didn't think of :)

FWIW, for MIPS we tended to handle this kind of thing using matching
constraints.  E.g. for:

(define_insn_and_split "aarch64_simd_mov_from_<mode>low"
  [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
        (vec_select:<VHALF>
          (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
          (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
  "TARGET_SIMD"
  "@
   #
   umov\t%0, %1.d[0]"
  "&& reload_completed && aarch64_simd_register (operands[0], <VHALF>mode)"
  [(set (match_dup 0) (match_dup 1))]
  {
    operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
  }
  [(set_attr "type" "mov_reg,neon_to_gp<q>")
   (set_attr "length" "4")]
)

use something like "0,w" for operand 1, so that the first alternative
can be split to nothing:

;; When TARGET_64BIT, all SImode integer and accumulator registers
;; should already be in sign-extended form (see TARGET_TRULY_NOOP_TRUNCATION
;; and truncdisi2).  We can therefore get rid of register->register
;; instructions if we constrain the source to be in the same register as
;; the destination.
;;
;; Only the pre-reload scheduler sees the type of the register alternatives;
;; we split them into nothing before the post-reload scheduler runs.
;; These alternatives therefore have type "move" in order to reflect
;; what happens if the two pre-reload operands cannot be tied, and are
;; instead allocated two separate GPRs.  We don't distinguish between
;; the GPR and LO cases because we don't usually know during pre-reload
;; scheduling whether an operand will be LO or not.
(define_insn_and_split "extendsidi2"
  [(set (match_operand:DI 0 "register_operand" "=d,l,d")
        (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "0,0,m")))]
  "TARGET_64BIT"
  "@
   #
   #
   lw\t%0,%1"
  "&& reload_completed && register_operand (operands[1], VOIDmode)"
  [(const_int 0)]
{
  emit_note (NOTE_INSN_DELETED);
  DONE;
}
  [(set_attr "move_type" "move,move,load")
   (set_attr "mode" "DI")])

It'll need some experimentation though.  E.g. is it worth providing
a w<-w alternative as well, with ? or ^ to disparage it?

Independently of that, it might be worth trying to add a memory
alternative, so that we can load spilled values directly from
memory instead of first reloading the vector.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.
  2021-01-04 12:28 ` Richard Sandiford
@ 2021-01-04 12:52   ` Tamar Christina
  0 siblings, 0 replies; 3+ messages in thread
From: Tamar Christina @ 2021-01-04 12:52 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Hi Richard,

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, January 4, 2021 12:29 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [RFC] AArch64: Have RTL patterns recognize DI extracts from
> vectors at offset 0 as no-op.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > I have been looking into a class of problems where GCC is not
> > recognizing that a subreg of lane 0 (using little-endian as example)
> > of a vector register and passing that to an instruction.
> >
> > As an example consider
> >
> > poly64_t
> > testcase (uint8x16_t input, poly64x2_t mask) {
> >     poly64_t prodL =
> vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> > 			       vgetq_lane_p64(mask, 0));
> >     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
> >     return prodL + prodH;
> > }
> >
> > Where we generate
> >
> > testcase:
> > 	dup     d2, v0.d[0]
> > 	dup     d3, v1.d[0]
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	pmull   v2.1q, v2.1d, v3.1d
> > 	add     d0, d2, d0
> > 	fmov    x0, d0
> > 	ret
> >
> > whereas it should have been, which clang generates:
> >
> > testcase:
> > 	pmull   v2.1q, v0.1d, v1.1d
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	add     v0.2d, v0.2d, v2.2d
> > 	fmov    x0, d0
> > 	ret
> >
> > Now this can be naively solved by just adding the RTL patterns for the
> > vec_selects as the example in the patch, but this doesn't solve the
> > overall problem and I am wondering how to best do this.
> >
> > One approach would be to extend combine's noop detection in
> > noop_move_p to recognize these cases.
> >
> > The downside here is that the conversion becomes implicit in the rtl. i.e.
> > you'll see a SET of a V2DI but a use of DI for that same register.
> > I'm not sure the semantics of RTL allow such implicit uses?
> 
> It's OK to set a hard register in one mode and use it in a different mode
> (without subregs), but it's not possible to do the same using pseudos.
> 
> > The second approach I can think of is to extend reload to recognize
> > these no-ops and give the same register and mark the extract as unused
> > such that DSE cleans it up.
> >
> > But there's probably a better approach I didn't think of :)
> 
> FWIW, for MIPS we tended to handle this kind of thing using matching
> constraints.  E.g. for:
> 
> (define_insn_and_split "aarch64_simd_mov_from_<mode>low"
>   [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
>         (vec_select:<VHALF>
>           (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
>           (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
>   "TARGET_SIMD"
>   "@
>    #
>    umov\t%0, %1.d[0]"
>   "&& reload_completed && aarch64_simd_register (operands[0],
> <VHALF>mode)"
>   [(set (match_dup 0) (match_dup 1))]
>   {
>     operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
>   }
>   [(set_attr "type" "mov_reg,neon_to_gp<q>")
>    (set_attr "length" "4")]
> )
> 
> use something like "0,w" for operand 1, so that the first alternative can be
> split to nothing:

Ah, interesting, I indeed didn't think of this approach.  I'll go experiment.

Thanks!

> 
> ;; When TARGET_64BIT, all SImode integer and accumulator registers ;;
> should already be in sign-extended form (see
> TARGET_TRULY_NOOP_TRUNCATION ;; and truncdisi2).  We can therefore
> get rid of register->register ;; instructions if we constrain the source to be in
> the same register as ;; the destination.
> ;;
> ;; Only the pre-reload scheduler sees the type of the register alternatives; ;;
> we split them into nothing before the post-reload scheduler runs.
> ;; These alternatives therefore have type "move" in order to reflect ;; what
> happens if the two pre-reload operands cannot be tied, and are ;; instead
> allocated two separate GPRs.  We don't distinguish between ;; the GPR and
> LO cases because we don't usually know during pre-reload ;; scheduling
> whether an operand will be LO or not.
> (define_insn_and_split "extendsidi2"
>   [(set (match_operand:DI 0 "register_operand" "=d,l,d")
>         (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand"
> "0,0,m")))]
>   "TARGET_64BIT"
>   "@
>    #
>    #
>    lw\t%0,%1"
>   "&& reload_completed && register_operand (operands[1], VOIDmode)"
>   [(const_int 0)]
> {
>   emit_note (NOTE_INSN_DELETED);
>   DONE;
> }
>   [(set_attr "move_type" "move,move,load")
>    (set_attr "mode" "DI")])
> 
> It'll need some experimentation though.  E.g. is it worth providing a w<-w
> alternative as well, with ? or ^ to disparage it?
> 
> Independently of that, it might be worth trying to add a memory alternative,
> so that we can load spilled values directly from memory instead of first
> reloading the vector.
> 
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-04 12:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-04 12:11 [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op Tamar Christina
2021-01-04 12:28 ` Richard Sandiford
2021-01-04 12:52   ` Tamar Christina

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).