From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Sandiford <richard.sandiford@arm.com>
To: Alex Coplan <alex.coplan@arm.com>
Mail-Followup-To: Alex Coplan <alex.coplan@arm.com>,gcc-patches@gcc.gnu.org,  Kyrylo Tkachov <kyrylo.tkachov@arm.com>,  Richard Earnshaw <richard.earnshaw@arm.com>, richard.sandiford@arm.com
Cc: gcc-patches@gcc.gnu.org,  Kyrylo Tkachov <kyrylo.tkachov@arm.com>,  Richard Earnshaw <richard.earnshaw@arm.com>
Subject: Re: [PATCH][GCC 12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]
References: <ZcpRRWbmWLngMD3T@arm.com>
Date: Wed, 14 Feb 2024 11:18:00 +0000
In-Reply-To: <ZcpRRWbmWLngMD3T@arm.com> (Alex Coplan's message of "Mon, 12 Feb
	2024 17:11:33 +0000")
Message-ID: <mptzfw3uwev.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
List-Id: <gcc-patches.gcc.gnu.org>

Alex Coplan <alex.coplan@arm.com> writes:
> This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch.
> It is a straight cherry-pick except for one adjustment: the TX iterator
> lacks TDmode on GCC 12, so this version defines TX_V16QI without it.
>
> Bootstrapped/regtested on aarch64-linux-gnu.  The only changes in the
> testsuite I saw were in
> gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c, where the dg-output
> "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch
> since libhwasan gained the short granule tag feature.  I've requested a
> backport of the following patch (committed as
> r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f), which should fix that
> (independent) issue for GCC 12:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html
>
> OK for the GCC 12 branch?

OK, thanks.

Richard

> Thanks,
> Alex
>
> -- >8 --
>
> The PR shows us ICEing due to an unrecognizable TFmode save emitted by
> aarch64_process_components.  The problem is that for T{I,F,D}mode we
> conservatively require mems to be in range for x-register ldp/stp.  That
> is because (at least for TImode) the value can be allocated to either
> GPRs or FPRs: in the GPR case the access is an x-register ldp/stp, and
> in the FPR case it is a q-register load/store.
>
> As Richard pointed out in the PR, aarch64_get_separate_components
> already checks that the offsets are suitable for a single load, so we
> just need to choose a mode in aarch64_reg_save_mode that gives the full
> q-register range.  In this patch, we choose V16QImode as an alternative
> 16-byte "bag-of-bits" mode that doesn't have the artificial range
> restrictions imposed on T{I,F,D}mode.
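
For anyone following along, here is a rough sketch of the range difference
described above.  It is only an illustration: the helper names are made up,
and the ranges are the architectural immediate ranges as I understand them
(signed 7-bit scaled by 8 for x-register ldp/stp; unsigned 12-bit scaled by
16, or signed 9-bit unscaled, for a single q-register ldr/str), not a copy
of the checks in aarch64.cc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; these are not GCC functions.  */

/* x-register ldp/stp: signed 7-bit immediate scaled by 8,
   i.e. multiples of 8 in [-512, 504].  */
static bool
xreg_pair_offset_ok (int64_t off)
{
  return off % 8 == 0 && off >= -512 && off <= 504;
}

/* Single q-register ldr/str: unsigned 12-bit immediate scaled by 16
   (multiples of 16 in [0, 65520]), or the unscaled ldur/stur form
   with a signed 9-bit immediate in [-256, 255].  */
static bool
qreg_single_offset_ok (int64_t off)
{
  bool scaled = off % 16 == 0 && off >= 0 && off <= 65520;
  bool unscaled = off >= -256 && off <= 255;
  return scaled || unscaled;
}

/* A few sample offsets showing where the two ranges disagree.  */
static void
check_offset_examples (void)
{
  assert (xreg_pair_offset_ok (504));
  assert (!xreg_pair_offset_ok (512));	 /* Too far for x-reg ldp/stp...  */
  assert (qreg_single_offset_ok (512));	 /* ...but fine for a q-reg str.  */
  assert (qreg_single_offset_ok (65520));
  assert (!qreg_single_offset_ok (65536));
}
```

An offset such as 512 is fine for a single q-register save but out of range
once the mode forces the x-register ldp/stp check, which is the kind of
mismatch the paragraph above describes.
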
>
> Unlike for GCC 14, we need additional handling in the load/store pair
> code, as various cases do not expect to see V16QImode (particularly
> the writeback patterns, but also aarch64_gen_load_pair).
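
As a sanity check on the writeback handling, the sketch below models how the
offsets passed by aarch64_gen_storewb_pair and aarch64_gen_loadwb_pair in
the patch line up with the insn conditions in aarch64.md.  The struct and
function names are hypothetical, and Q_REG_BYTES = 16 is an assumption that
stands in for both UNITS_PER_VREG and the size of the 16-byte modes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stands in for UNITS_PER_VREG and for
   GET_MODE_SIZE of TImode/TFmode/V16QImode.  */
enum { Q_REG_BYTES = 16 };

/* Hypothetical model of operands 4 and 5 of the writeback patterns.  */
struct wb_offsets
{
  int64_t op4;	/* Base register adjustment.  */
  int64_t op5;	/* Offset of the second register's slot.  */
};

/* Mirrors the GEN_INT arguments in aarch64_gen_storewb_pair.  */
static struct wb_offsets
storewb_offsets (int64_t adjustment)
{
  struct wb_offsets o = { -adjustment, Q_REG_BYTES - adjustment };
  return o;
}

/* Mirrors the storewb_pair condition: op5 == op4 + mode size.  */
static bool
storewb_condition (struct wb_offsets o)
{
  return o.op5 == o.op4 + Q_REG_BYTES;
}

/* Mirrors the GEN_INT arguments in aarch64_gen_loadwb_pair.  */
static struct wb_offsets
loadwb_offsets (int64_t adjustment)
{
  struct wb_offsets o = { adjustment, Q_REG_BYTES };
  return o;
}

/* Mirrors the loadwb_pair condition: op5 == mode size.  */
static bool
loadwb_condition (struct wb_offsets o)
{
  return o.op5 == Q_REG_BYTES;
}
```

Under those assumptions both conditions hold for any adjustment, so adding
the V16QImode cases should not need any change to the offsets themselves,
only to which pattern gets generated.
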
>
> gcc/ChangeLog:
>
> 	PR target/111677
> 	* config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
> 	V16QImode for the full 16-byte FPR saves in the vector PCS case.
> 	(aarch64_gen_storewb_pair): Handle V16QImode.
> 	(aarch64_gen_loadwb_pair): Likewise.
> 	(aarch64_gen_load_pair): Likewise.
> 	* config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>):
> 	Rename to ...
> 	(loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
> 	V16QImode.
> 	(storewb_pair<TX:mode>_<P:mode>): Rename to ...
> 	(storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
> 	V16QImode.
> 	* config/aarch64/iterators.md (TX_V16QI): New.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/111677
> 	* gcc.target/aarch64/torture/pr111677.c: New test.
>
> (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3)
> ---
>  gcc/config/aarch64/aarch64.cc                 | 13 ++++++-
>  gcc/config/aarch64/aarch64.md                 | 35 ++++++++++---------
>  gcc/config/aarch64/iterators.md               |  3 ++
>  .../gcc.target/aarch64/torture/pr111677.c     | 28 +++++++++++++++
>  4 files changed, 61 insertions(+), 18 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 3bccd96a23d..2bbba323770 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno)
>        case ARM_PCS_SIMD:
>  	/* The vector PCS saves the low 128 bits (which is the full
>  	   register on non-SVE targets).  */
> -	return TFmode;
> +	return V16QImode;
>  
>        case ARM_PCS_SVE:
>  	/* Use vectors of DImode for registers that need frame
> @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
>        return gen_storewb_pairtf_di (base, base, reg, reg2,
>  				    GEN_INT (-adjustment),
>  				    GEN_INT (UNITS_PER_VREG - adjustment));
> +    case E_V16QImode:
> +      return gen_storewb_pairv16qi_di (base, base, reg, reg2,
> +				       GEN_INT (-adjustment),
> +				       GEN_INT (UNITS_PER_VREG - adjustment));
>      default:
>        gcc_unreachable ();
>      }
> @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, rtx reg, rtx reg2,
>      case E_TFmode:
>        return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT (adjustment),
>  				   GEN_INT (UNITS_PER_VREG));
> +    case E_V16QImode:
> +      return gen_loadwb_pairv16qi_di (base, base, reg, reg2,
> +				      GEN_INT (adjustment),
> +				      GEN_INT (UNITS_PER_VREG));
>      default:
>        gcc_unreachable ();
>      }
> @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
>      case E_V4SImode:
>        return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
>  
> +    case E_V16QImode:
> +      return gen_load_pairv16qiv16qi (reg1, mem1, reg2, mem2);
> +
>      default:
>        gcc_unreachable ();
>      }
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index fb100bdf6b3..99f185718c9 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1874,17 +1874,18 @@ (define_insn "loadwb_pair<GPF:mode>_<P:mode>"
>    [(set_attr "type" "neon_load1_2reg")]
>  )
>  
> -(define_insn "loadwb_pair<TX:mode>_<P:mode>"
> +(define_insn "loadwb_pair<TX_V16QI:mode>_<P:mode>"
>    [(parallel
>      [(set (match_operand:P 0 "register_operand" "=k")
> -          (plus:P (match_operand:P 1 "register_operand" "0")
> -                  (match_operand:P 4 "aarch64_mem_pair_offset" "n")))
> -     (set (match_operand:TX 2 "register_operand" "=w")
> -          (mem:TX (match_dup 1)))
> -     (set (match_operand:TX 3 "register_operand" "=w")
> -          (mem:TX (plus:P (match_dup 1)
> +	  (plus:P (match_operand:P 1 "register_operand" "0")
> +		  (match_operand:P 4 "aarch64_mem_pair_offset" "n")))
> +     (set (match_operand:TX_V16QI 2 "register_operand" "=w")
> +	  (mem:TX_V16QI (match_dup 1)))
> +     (set (match_operand:TX_V16QI 3 "register_operand" "=w")
> +	  (mem:TX_V16QI (plus:P (match_dup 1)
>  			  (match_operand:P 5 "const_int_operand" "n"))))])]
> -  "TARGET_SIMD && INTVAL (operands[5]) == GET_MODE_SIZE (<TX:MODE>mode)"
> +  "TARGET_SIMD
> +   && known_eq (INTVAL (operands[5]), GET_MODE_SIZE (<TX_V16QI:MODE>mode))"
>    "ldp\\t%q2, %q3, [%1], %4"
>    [(set_attr "type" "neon_ldp_q")]
>  )
> @@ -1923,20 +1924,20 @@ (define_insn "storewb_pair<GPF:mode>_<P:mode>"
>    [(set_attr "type" "neon_store1_2reg<q>")]
>  )
>  
> -(define_insn "storewb_pair<TX:mode>_<P:mode>"
> +(define_insn "storewb_pair<TX_V16QI:mode>_<P:mode>"
>    [(parallel
>      [(set (match_operand:P 0 "register_operand" "=&k")
> -          (plus:P (match_operand:P 1 "register_operand" "0")
> -                  (match_operand:P 4 "aarch64_mem_pair_offset" "n")))
> -     (set (mem:TX (plus:P (match_dup 0)
> +	  (plus:P (match_operand:P 1 "register_operand" "0")
> +		  (match_operand:P 4 "aarch64_mem_pair_offset" "n")))
> +     (set (mem:TX_V16QI (plus:P (match_dup 0)
>  			  (match_dup 4)))
> -          (match_operand:TX 2 "register_operand" "w"))
> -     (set (mem:TX (plus:P (match_dup 0)
> +	  (match_operand:TX_V16QI 2 "register_operand" "w"))
> +     (set (mem:TX_V16QI (plus:P (match_dup 0)
>  			  (match_operand:P 5 "const_int_operand" "n")))
> -          (match_operand:TX 3 "register_operand" "w"))])]
> +	  (match_operand:TX_V16QI 3 "register_operand" "w"))])]
>    "TARGET_SIMD
> -   && INTVAL (operands[5])
> -      == INTVAL (operands[4]) + GET_MODE_SIZE (<TX:MODE>mode)"
> +   && known_eq (INTVAL (operands[5]),
> +		INTVAL (operands[4]) + GET_MODE_SIZE (<TX_V16QI:MODE>mode))"
>    "stp\\t%q2, %q3, [%0, %4]!"
>    [(set_attr "type" "neon_stp_q")]
>  )
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 26a840d7fe9..d49e37893df 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -303,6 +303,9 @@ (define_mode_iterator VS [V2SI V4SI])
>  
>  (define_mode_iterator TX [TI TF])
>  
> +;; TX plus V16QImode.
> +(define_mode_iterator TX_V16QI [TI TF V16QI])
> +
>  ;; Advanced SIMD opaque structure modes.
>  (define_mode_iterator VSTRUCT [OI CI XI])
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
> new file mode 100644
> index 00000000000..6bb640c42c0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target fopenmp } */
> +/* { dg-options "-ffast-math -fstack-protector-strong -fopenmp" } */
> +typedef struct {
> +  long size_z;
> +  int width;
> +} dt_bilateral_t;
> +typedef float dt_aligned_pixel_t[4];
> +#pragma omp declare simd
> +void dt_bilateral_splat(dt_bilateral_t *b) {
> +  float *buf;
> +  long offsets[8];
> +  for (; b;) {
> +    int firstrow;
> +    for (int j = firstrow; j; j++)
> +      for (int i; i < b->width; i++) {
> +        dt_aligned_pixel_t contrib;
> +        for (int k = 0; k < 4; k++)
> +          buf[offsets[k]] += contrib[k];
> +      }
> +    float *dest;
> +    for (int j = (long)b; j; j++) {
> +      float *src = (float *)b->size_z;
> +      for (int i = 0; i < (long)b; i++)
> +        dest[i] += src[i];
> +    }
> +  }
> +}