From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 1816)
	id 788A53858D20; Fri, 21 Apr 2023 17:57:11 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 788A53858D20
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1682099831;
	bh=CZsGHgibCy7ccA0BXhUb8mGerTyWGUg8Kv+xY59+jok=;
	h=From:To:Subject:Date:From;
	b=Dm6ZzOFKg348SvLDLt4n9uiZeynWaGviHbe+rm8fI5xvsw0UMCznf5/FA70poUNbQ
	 eDEskzeEYxVykngazcIAVedKU8gbURlpUsULsRutVccwwJV6qvlR0cX9VKFfrduEaa
	 1y6KagE47Nkzo+fx4UVRKHViXWgCZGrv1T0wFLqU=
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="utf-8"
From: Kyrylo Tkachov
To: gcc-cvs@gcc.gnu.org
Subject: [gcc r14-153] aarch64: PR target/99195 Add scheme to optimise away vec_concat with zeroes on 64-bit Advanced SIMD
X-Act-Checkin: gcc
X-Git-Author: Kyrylo Tkachov
X-Git-Refname: refs/heads/master
X-Git-Oldrev: 857c8e3b3bba14bfb078a194993f5f31e0d90391
X-Git-Newrev: f824216cdb078ea9de0980ae066a0e1e83494fd2
Message-Id: <20230421175711.788A53858D20@sourceware.org>
Date: Fri, 21 Apr 2023 17:57:11 +0000 (GMT)
List-Id:

https://gcc.gnu.org/g:f824216cdb078ea9de0980ae066a0e1e83494fd2

commit r14-153-gf824216cdb078ea9de0980ae066a0e1e83494fd2
Author: Kyrylo Tkachov
Date:   Fri Apr 21 18:56:21 2023 +0100

    aarch64: PR target/99195 Add scheme to optimise away vec_concat with zeroes on 64-bit Advanced SIMD ops

    I finally got around to trying out the define_subst approach for
    PR target/99195.  The problem we have is that many Advanced SIMD
    instructions have 64-bit vector variants that clear the top half of the
    128-bit Q register.  Exploiting this would allow the compiler to avoid
    generating explicit zeroing instructions to concat the 64-bit result
    with zeroes for code like:

        vcombine_u16 (vadd_u16 (a, b), vdup_n_u16 (0))

    We've been getting user reports of GCC missing this optimisation in
    real-world code, so it's worth doing something about it.
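[Editor's note: the redundancy being optimised can be modelled in plain C. This is a scalar sketch of the intrinsics' semantics with hypothetical helper names, not the actual GCC or arm_neon.h code: both functions below produce bit-identical Q-register contents, which is why the explicit zeroing of the top half is removable when the 64-bit instruction already clears it.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 128-bit Q register viewed as eight uint16_t lanes.  */
typedef struct { uint16_t lane[8]; } q_reg;

/* Model of vcombine_u16 (vadd_u16 (a, b), vdup_n_u16 (0)): the low four
   lanes hold the 64-bit addition result, the high four lanes are filled
   with an explicitly constructed zero vector.  */
q_reg
combine_add_with_zeros (const uint16_t a[4], const uint16_t b[4])
{
  q_reg q;
  memset (&q, 0, sizeof q);        /* explicit zeroing of the top half */
  for (int i = 0; i < 4; i++)
    q.lane[i] = (uint16_t) (a[i] + b[i]);
  return q;
}

/* Model of a 64-bit "ADD Vd.4H, Vn.4H, Vm.4H" instruction: it writes the
   low 64 bits of the Q register and implicitly clears lanes 4..7, so no
   separate zeroing step is needed.  */
q_reg
add_64bit_implicit_zero (const uint16_t a[4], const uint16_t b[4])
{
  q_reg q = { { 0 } };             /* hardware clears the top half */
  for (int i = 0; i < 4; i++)
    q.lane[i] = (uint16_t) (a[i] + b[i]);
  return q;
}
```

Since the two routines are observably equivalent, a compiler that models the implicit zeroing can delete the `memset`-style step entirely.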
    The straightforward approach that we've been taking so far is adding
    extra patterns in aarch64-simd.md that match the 64-bit result in a
    vec_concat with zeroes.  Unfortunately, for big-endian targets the
    vec_concat operands to match have to be the other way around, so we
    would end up adding two extra define_insns per operation.  This would
    lead to too much bloat in aarch64-simd.md.

    This patch defines a pair of define_subst constructs that allow us to
    annotate patterns in aarch64-simd.md with the <vczle> and <vczbe>
    subst_attrs, and the compiler will automatically produce the
    vec_concat widening patterns, properly gated for BYTES_BIG_ENDIAN
    when needed.  This seems like the least intrusive way to describe the
    extra zeroing semantics.

    I've had a look at the generated insn-*.cc files in the build
    directory and it seems that define_subst does what we want it to do
    when applied multiple times on a pattern, in terms of insn conditions
    and modes.

    This patch adds the define_subst machinery and adds the annotations
    to some of the straightforward binary and unary integer operations.
    Many more such annotations are possible, and I aim to add them in
    future patches if this approach is acceptable.

    Bootstrapped and tested on aarch64-none-linux-gnu and on
    aarch64_be-none-elf.

    gcc/ChangeLog:

            PR target/99195
            * config/aarch64/aarch64-simd.md (add_vec_concat_subst_le):
            Define.
            (add_vec_concat_subst_be): Likewise.
            (vczle): Likewise.
            (vczbe): Likewise.
            (add<mode>3): Rename to...
            (add<mode>3<vczle><vczbe>): ... This.
            (sub<mode>3): Rename to...
            (sub<mode>3<vczle><vczbe>): ... This.
            (mul<mode>3): Rename to...
            (mul<mode>3<vczle><vczbe>): ... This.
            (and<mode>3): Rename to...
            (and<mode>3<vczle><vczbe>): ... This.
            (ior<mode>3): Rename to...
            (ior<mode>3<vczle><vczbe>): ... This.
            (xor<mode>3): Rename to...
            (xor<mode>3<vczle><vczbe>): ... This.
            * config/aarch64/iterators.md (VDZ): Define.

    gcc/testsuite/ChangeLog:

            PR target/99195
            * gcc.target/aarch64/simd/pr99195_1.c: New test.
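[Editor's note: to make the effect of the annotations concrete, here is a hand-written sketch of the little-endian pattern that the define_subst machinery would generate from an annotated add pattern. The exact operand numbering, condition string, and output template of the generated insn are illustrative assumptions, not the literal insn-*.cc output.]

```lisp
;; Sketch of the pattern implied by applying the <vczle> substitution to a
;; 64-bit add: the result of the narrow operation is matched inside a
;; vec_concat with a zero vector, widening it to the double-width mode.
(define_insn "add<mode>3_vec_concatz_le"
  [(set (match_operand:<VDBL> 0 "register_operand" "=w")
	(vec_concat:<VDBL>
	  (plus:VDZ (match_operand:VDZ 1 "register_operand" "w")
		    (match_operand:VDZ 2 "register_operand" "w"))
	  (match_operand:VDZ 3 "aarch64_simd_or_scalar_imm_zero")))]
  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
  "add\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  [(set_attr "type" "neon_add")]
)
```

The big-endian counterpart would be identical except that the zero operand comes first in the vec_concat and the condition is gated on BYTES_BIG_ENDIAN.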
Diff:
---
 gcc/config/aarch64/aarch64-simd.md                | 40 +++++++++++++++---
 gcc/config/aarch64/iterators.md                   |  3 ++
 gcc/testsuite/gcc.target/aarch64/simd/pr99195_1.c | 50 +++++++++++++++++++++++
 3 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 1bed24477fb..adcad56cf55 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -18,6 +18,34 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
+;; The following define_subst rules are used to produce patterns representing
+;; the implicit zeroing effect of 64-bit Advanced SIMD operations, in effect
+;; a vec_concat with zeroes.  The order of the vec_concat operands differs
+;; for big-endian so we have a separate define_subst rule for each endianness.
+(define_subst "add_vec_concat_subst_le"
+  [(set (match_operand:VDZ 0)
+	(match_operand:VDZ 1))]
+  "!BYTES_BIG_ENDIAN"
+  [(set (match_operand:<VDBL> 0)
+	(vec_concat:<VDBL>
+	 (match_dup 1)
+	 (match_operand:VDZ 2 "aarch64_simd_or_scalar_imm_zero")))])
+
+(define_subst "add_vec_concat_subst_be"
+  [(set (match_operand:VDZ 0)
+	(match_operand:VDZ 1))]
+  "BYTES_BIG_ENDIAN"
+  [(set (match_operand:<VDBL> 0)
+	(vec_concat:<VDBL>
+	 (match_operand:VDZ 2 "aarch64_simd_or_scalar_imm_zero")
+	 (match_dup 1)))])
+
+;; The subst_attr definitions used to annotate patterns further in the file.
+;; Patterns that need to have the above substitutions added to them should
+;; have <vczle><vczbe> added to their name.
+(define_subst_attr "vczle" "add_vec_concat_subst_le" "" "_vec_concatz_le")
+(define_subst_attr "vczbe" "add_vec_concat_subst_be" "" "_vec_concatz_be")
+
 (define_expand "mov<mode>"
   [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
 	(match_operand:VALL_F16 1 "general_operand"))]
@@ -403,7 +431,7 @@
   [(set_attr "type" "neon_logic")]
 )
 
-(define_insn "add<mode>3"
+(define_insn "add<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_I 0 "register_operand" "=w")
 	(plus:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
 		    (match_operand:VDQ_I 2 "register_operand" "w")))]
@@ -412,7 +440,7 @@
   [(set_attr "type" "neon_add")]
 )
 
-(define_insn "sub<mode>3"
+(define_insn "sub<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_I 0 "register_operand" "=w")
 	(minus:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
 		     (match_operand:VDQ_I 2 "register_operand" "w")))]
@@ -421,7 +449,7 @@
   [(set_attr "type" "neon_sub")]
 )
 
-(define_insn "mul<mode>3"
+(define_insn "mul<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
 	(mult:VDQ_BHSI (match_operand:VDQ_BHSI 1 "register_operand" "w")
 		       (match_operand:VDQ_BHSI 2 "register_operand" "w")))]
@@ -999,7 +1027,7 @@
 )
 
 ;; For AND (vector, register) and BIC (vector, immediate)
-(define_insn "and<mode>3"
+(define_insn "and<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
 	(and:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,0")
 		   (match_operand:VDQ_I 2 "aarch64_reg_or_bic_imm" "w,Db")))]
@@ -1020,7 +1048,7 @@
 )
 
 ;; For ORR (vector, register) and ORR (vector, immediate)
-(define_insn "ior<mode>3"
+(define_insn "ior<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
 	(ior:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,0")
 		   (match_operand:VDQ_I 2 "aarch64_reg_or_orr_imm" "w,Do")))]
@@ -1040,7 +1068,7 @@
   [(set_attr "type" "neon_logic")]
 )
 
-(define_insn "xor<mode>3"
+(define_insn "xor<mode>3<vczle><vczbe>"
   [(set (match_operand:VDQ_I 0 "register_operand" "=w")
 	(xor:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
 		   (match_operand:VDQ_I 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 6cbc97cc82c..d3c43a212a1 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -99,6 +99,9 @@
 ;; Double vector modes suitable for moving.  Includes BFmode.
 (define_mode_iterator VDMOV [V8QI V4HI V4HF V4BF V2SI V2SF])
 
+;; 64-bit modes for operations that implicitly clear the top bits of a Q reg.
+(define_mode_iterator VDZ [V8QI V4HI V4HF V4BF V2SI V2SF DI DF])
+
 ;; All modes stored in registers d0-d31.
 (define_mode_iterator DREG [V8QI V4HI V4HF V2SI V2SF DF])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr99195_1.c b/gcc/testsuite/gcc.target/aarch64/simd/pr99195_1.c
new file mode 100644
index 00000000000..3ddd5a37af0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/pr99195_1.c
@@ -0,0 +1,50 @@
+/* PR target/99195.  */
+/* Check that we take advantage of 64-bit Advanced SIMD operations clearing
+   the top half of the vector register and no explicit zeroing instructions
+   are emitted.  */
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+#define ONE(OT,IT,OP,S)				\
+OT						\
+foo_##OP##_##S (IT a, IT b)			\
+{						\
+  IT zeros = vcreate_##S (0);			\
+  return vcombine_##S (v##OP##_##S (a, b), zeros); \
+}
+
+#define FUNC(T,IS,OS,OP,S) ONE (T##x##OS##_t, T##x##IS##_t, OP, S)
+
+#define OPTWO(T,IS,OS,S,OP1,OP2)	\
+FUNC (T, IS, OS, OP1, S)		\
+FUNC (T, IS, OS, OP2, S)
+
+#define OPTHREE(T, IS, OS, S, OP1, OP2, OP3)	\
+FUNC (T, IS, OS, OP1, S)			\
+OPTWO (T, IS, OS, S, OP2, OP3)
+
+#define OPFOUR(T,IS,OS,S,OP1,OP2,OP3,OP4)	\
+FUNC (T, IS, OS, OP1, S)			\
+OPTHREE (T, IS, OS, S, OP2, OP3, OP4)
+
+#define OPFIVE(T,IS,OS,S,OP1,OP2,OP3,OP4, OP5)	\
+FUNC (T, IS, OS, OP1, S)			\
+OPFOUR (T, IS, OS, S, OP2, OP3, OP4, OP5)
+
+#define OPSIX(T,IS,OS,S,OP1,OP2,OP3,OP4,OP5,OP6)	\
+FUNC (T, IS, OS, OP1, S)				\
+OPFIVE (T, IS, OS, S, OP2, OP3, OP4, OP5, OP6)
+
+OPSIX (int8, 8, 16, s8, add, sub, mul, and, orr, eor)
+OPSIX (int16, 4, 8, s16, add, sub, mul, and, orr, eor)
+OPSIX (int32, 2, 4, s32, add, sub, mul, and, orr, eor)
+
+OPSIX (uint8, 8, 16, u8, add, sub, mul, and, orr, eor)
+OPSIX (uint16, 4, 8, u16, add, sub, mul, and, orr, eor)
+OPSIX (uint32, 2, 4, u32, add, sub, mul, and, orr, eor)
+
+/* { dg-final { scan-assembler-not {\tfmov\t} } } */
+/* { dg-final { scan-assembler-not {\tmov\t} } } */