From: Richard Sandiford
To: apinski--- via Gcc-patches
Cc: apinski@marvell.com
Subject: Re: [PATCH v3] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Tue, 01 Feb 2022 10:44:05 +0000
References: <1643164088-18048-1-git-send-email-apinski@marvell.com>
In-Reply-To: <1643164088-18048-1-git-send-email-apinski@marvell.com> (apinski's message of "Tue, 25 Jan 2022 18:28:08 -0800")

apinski--- via Gcc-patches writes:
> From: Andrew Pinski
>
> The problem here is that aarch64_expand_setmem does not change the
> alignment for the strict alignment case.  This is version 3 of this
> patch; it is based on version 2 and moves the check for the number of
> instructions from the optimizing-for-size case so that it is done
> unconditionally, and changes the cost of libcalls for the !size case
> to max_size/16 + 1 (or 17), which is the same as before when handling
> just the max_size.

In this case, if we want the number to be 17 for strict alignment
targets, I think we should just pick 17.  It feels odd to be setting
the libcall cost based on AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.

I notice that you didn't go for the suggestion in the previous review
round:

| I think in practice the code has only been tuned on targets that
| support LDP/STP Q, so how about moving the copy_limit calculation
| further up and doing:
|
|   unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;
|
| ?

Does that not work?  One advantage of it is that the:

  if (len > max_set_size && !TARGET_MOPS)
    return false;

bail-out will kick in sooner, stopping us from generating (say)
255 garbage insns for a byte-aligned 255-byte copy.
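For reference, the kind of rearrangement I mean is sketched below.  It is
untested and only meant to illustrate the idea; the exact placement and
the types would need checking against the surrounding code in
aarch64_expand_setmem:

  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on
     the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
  unsigned copy_limit = (aarch64_tune_params.extra_tuning_flags
                         & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
                         ? GET_MODE_BITSIZE (TImode) : 256);

  /* With -mstrict-align, never use accesses wider than the known
     alignment of the destination.  */
  if (STRICT_ALIGNMENT)
    copy_limit = MIN (copy_limit, MEM_ALIGN (dst));

  /* Allow at most 8 stores of copy_limit bits each before falling back
     to a libcall or MOPS, so that a reduced copy_limit also reduces
     max_set_size and the len > max_set_size check quoted above kicks
     in sooner.  */
  unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;

With that, a byte-aligned destination would give copy_limit == 8 bits
and max_set_size == 8 bytes, so larger strict-alignment memsets would go
straight to the libcall/MOPS path instead of being expanded inline.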
Thanks,
Richard

> The main change is dealing with the strict alignment case, where we
> only inline a max of 17 instructions, as at that point the call to
> memset will be faster and can handle the dynamic alignment instead of
> just the static alignment.
>
> Note the reason why it is +1 is to account for the setting of the simd
> duplicate.
>
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>
>	PR target/103100
> gcc/ChangeLog:
>
>	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
>	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is
>	true.  Also constrain the number of instructions for the !size
>	case to max_size/16 + 1.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
>	Reduce the size down to 207 and make s1 global and aligned
>	to 16 bytes.
>	* gcc.target/aarch64/memset-strict-align-2.c: New test.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
>  .../aarch64/memset-strict-align-1.c           | 20 +++----
>  .../aarch64/memset-strict-align-2.c           | 14 +++++
>  3 files changed, 53 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 296145e6008..02ecb2154ea 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
>       (zero constants can use XZR directly).  */
>    unsigned mops_cost = 3 + 1 + cst_val;
>    /* A libcall to memset in the worst case takes 3 instructions to prepare
> -     the arguments + 1 for the call.  */
> -  unsigned libcall_cost = 4;
> +     the arguments + 1 for the call.
> +     In the case of not optimizing for size the cost of doing a libcall
> +     is the max_set_size / 16 + 1 or 17 instructions.  The one instruction
> +     is for the vector dup which may or may not be used.  */
> +  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
>
>    /* Upper bound check. For large constant-sized setmem use the MOPS sequence
>       when available.  */
> @@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
>
>    /* Attempt a sequence with a vector broadcast followed by stores.
>       Count the number of operations involved to see if it's worth it
> -     against the alternatives.  A simple counter simd_ops on the
> +     against the alternatives.  A simple counter inlined_ops on the
>       algorithmically-relevant operations is used rather than an rtx_insn count
>       as all the pointer adjusmtents and mode reinterprets will be optimized
>       away later.  */
>    start_sequence ();
> -  unsigned simd_ops = 0;
> +  unsigned inlined_ops = 0;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
>    /* Prepare the val using a DUP/MOVI v0.16B, val.  */
>    src = expand_vector_broadcast (V16QImode, val);
>    src = force_reg (V16QImode, src);
> -  simd_ops++;
> +  inlined_ops++;
>    /* Convert len to bits to make the rest of the code simpler.  */
>    n = len * BITS_PER_UNIT;
>
>    /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
>       AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> -  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> -			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -			  ? GET_MODE_BITSIZE (TImode) : 256;
> +  int copy_limit;
> +
> +  if (aarch64_tune_params.extra_tuning_flags
> +      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> +    copy_limit = GET_MODE_BITSIZE (TImode);
> +  else
> +    copy_limit = 256;
> +
> +  if (STRICT_ALIGNMENT)
> +    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));
>
>    while (n > 0)
>      {
> @@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
>
>        mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
>        aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> -      simd_ops++;
> +      inlined_ops++;
>        n -= mode_bits;
>
>        /* Do certain trailing copies as overlapping if it's going to be
> @@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
>    rtx_insn *seq = get_insns ();
>    end_sequence ();
>
> -  if (size_p)
> -    {
> -      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> -	 call to memset or the MOPS expansion.  */
> -      if (TARGET_MOPS
> -	  && mops_cost <= libcall_cost
> -	  && mops_cost <= simd_ops)
> -	return aarch64_expand_setmem_mops (operands);
> -      /* If MOPS is not available or not shorter pick a libcall if the SIMD
> -	 sequence is too long.  */
> -      else if (libcall_cost < simd_ops)
> -	return false;
> -      emit_insn (seq);
> -      return true;
> -    }
> +  /* When optimizing for size we have 3 options: the inlined sequence,
> +     call to memset or the MOPS expansion.  */
> +  if (size_p
> +      && TARGET_MOPS
> +      && mops_cost <= libcall_cost
> +      && mops_cost <= inlined_ops)
> +    return aarch64_expand_setmem_mops (operands);
> +  /* Pick a libcall if the inlined sequence is too long.  */
> +  else if (libcall_cost < inlined_ops)
> +    return false;
>
> -  /* At this point the SIMD broadcast sequence is the best choice when
> -     optimizing for speed.  */
>    emit_insn (seq);
>    return true;
>  }
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> index 664d43aee13..5a7acbf2fa9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> @@ -1,28 +1,28 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mstrict-align" } */
>
> -struct s { char x[255]; };
> +struct s { char x[207]; };
> +struct s s1 __attribute__((aligned(16)));
>  void foo (struct s *);
> -void bar (void) { struct s s1 = {}; foo (&s1); }
> +void bar (void) { s1 = (struct s){}; foo (&s1); }
>
> -/* memset (s1 = {}, sizeof = 255) should be expanded out
> +/* memset (s1 = {}, sizeof = 207) should be expanded out
>     such that there are no overlap stores when -mstrict-align
>     is in use.
> -   so 7 pairs of 16 bytes stores (224 bytes).
> -   1 16 byte stores
> +   so 5 pairs of 16 byte stores (160 bytes).
> +   2 16 byte stores (FIXME: should be 0)
>     1 8 byte store
>     1 4 byte store
>     1 2 byte store
>     1 1 byte store
>   */
>
> -/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
> -/* { dg-final { scan-assembler-times "str\tq" 1 } } */
> +/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
> +/* { dg-final { scan-assembler-times "str\tq" 2 } } */
>  /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
>  /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
>
> -/* Also one store pair for the frame-pointer and the LR.  */
> -/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
> -
> +/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
> +/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> new file mode 100644
> index 00000000000..67d994fe39e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +struct s { char x[15]; };
> +void foo (struct s *);
> +void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
> +
> +/* memset (s1 = {}, sizeof = 15) should be expanded out
> +   such that there are no overlap stores when -mstrict-align
> +   is in use.  As the alignment of s1 is unknown, byte stores are needed.
> +   so 15 byte stores
> + */
> +
> +/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
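
P.S. For anyone following along: with the current max_set_size of 256,
the !size_p libcall cost above works out as 256 / 16 + 1 == 17 (16
stores of 16 bytes plus the vector dup), which is where the 17 in the
description comes from.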