From: Richard Sandiford <richard.sandiford@arm.com>
To: apinski--- via Gcc-patches <gcc-patches@gcc.gnu.org>
Cc: apinski@marvell.com
Subject: Re: [PATCH v3] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Tue, 01 Feb 2022 10:44:05 +0000 [thread overview]
Message-ID: <mpt4k5ihnnu.fsf@arm.com> (raw)
In-Reply-To: <1643164088-18048-1-git-send-email-apinski@marvell.com> (apinski's message of "Tue, 25 Jan 2022 18:28:08 -0800")
apinski--- via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> From: Andrew Pinski <apinski@marvell.com>
>
> The problem here is that aarch64_expand_setmem does not change the alignment
> for strict alignment case. This is version 3 of this patch, is is based on
> version 2 and moves the check for the number of instructions from the
> optimizing for size case to be always and change the cost of libcalls for
> the !size case to be max_size/16 + 1 (or 17) which was the same as before
> when handling just the max_size.
In this case, if we want the number to be 17 for strict alignment targets,
I think we should just pick 17. It feels odd to be setting the libcall
cost based on AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.
I notice that you didn't go for the suggestion in the previous review
round:
| I think in practice the code has only been tuned on targets that
| support LDP/STP Q, so how about moving the copy_limit calculation
| further up and doing:
|
| unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;
|
| ?
Does that not work? One advantage of it is that the:
if (len > max_set_size && !TARGET_MOPS)
return false;
bail-out will kick in sooner, stopping us from generating (say)
255 garbage insns for a byte-aligned 255-byte copy.
Thanks,
Richard
> The main change is dealing with strict
> alignment case where we only inline a max of 17 instructions as at that
> point the call to the memset will be faster and could handle the dynamic
> alignment instead of just the static alignment.
>
> Note the reason why it is +1 is to count for the setting of the simd
> duplicate.
>
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>
> PR target/103100
> gcc/ChangeLog:
>
> * config/aarch64/aarch64.cc (aarch64_expand_setmem): Constraint
> copy_limit to the alignment of the mem if STRICT_ALIGNMENT is
> true. Also constraint the number of instructions for the !size
> case to max_size/16 + 1.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/memset-strict-align-1.c: Update test.
> Reduce the size down to 207 and make s1 global and aligned
> to 16 bytes.
> * gcc.target/aarch64/memset-strict-align-2.c: New test.
> ---
> gcc/config/aarch64/aarch64.cc | 55 ++++++++++---------
> .../aarch64/memset-strict-align-1.c | 20 +++----
> .../aarch64/memset-strict-align-2.c | 14 +++++
> 3 files changed, 53 insertions(+), 36 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 296145e6008..02ecb2154ea 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
> (zero constants can use XZR directly). */
> unsigned mops_cost = 3 + 1 + cst_val;
> /* A libcall to memset in the worst case takes 3 instructions to prepare
> - the arguments + 1 for the call. */
> - unsigned libcall_cost = 4;
> + the arguments + 1 for the call.
> + In the case of not optimizing for size the cost of doing a libcall
> + is the max_set_size / 16 + 1 or 17 instructions. The one instruction
> + is for the vector dup which may or may not be used. */
> + unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
>
> /* Upper bound check. For large constant-sized setmem use the MOPS sequence
> when available. */
> @@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
>
> /* Attempt a sequence with a vector broadcast followed by stores.
> Count the number of operations involved to see if it's worth it
> - against the alternatives. A simple counter simd_ops on the
> + against the alternatives. A simple counter inlined_ops on the
> algorithmically-relevant operations is used rather than an rtx_insn count
> as all the pointer adjusmtents and mode reinterprets will be optimized
> away later. */
> start_sequence ();
> - unsigned simd_ops = 0;
> + unsigned inlined_ops = 0;
>
> base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
> /* Prepare the val using a DUP/MOVI v0.16B, val. */
> src = expand_vector_broadcast (V16QImode, val);
> src = force_reg (V16QImode, src);
> - simd_ops++;
> + inlined_ops++;
> /* Convert len to bits to make the rest of the code simpler. */
> n = len * BITS_PER_UNIT;
>
> /* Maximum amount to copy in one go. We allow 256-bit chunks based on the
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter. */
> - const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> - & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> - ? GET_MODE_BITSIZE (TImode) : 256;
> + int copy_limit;
> +
> + if (aarch64_tune_params.extra_tuning_flags
> + & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> + copy_limit = GET_MODE_BITSIZE (TImode);
> + else
> + copy_limit = 256;
> +
> + if (STRICT_ALIGNMENT)
> + copy_limit = MIN (copy_limit, (int)MEM_ALIGN (dst));
>
> while (n > 0)
> {
> @@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
>
> mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> - simd_ops++;
> + inlined_ops++;
> n -= mode_bits;
>
> /* Do certain trailing copies as overlapping if it's going to be
> @@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
> rtx_insn *seq = get_insns ();
> end_sequence ();
>
> - if (size_p)
> - {
> - /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> - call to memset or the MOPS expansion. */
> - if (TARGET_MOPS
> - && mops_cost <= libcall_cost
> - && mops_cost <= simd_ops)
> - return aarch64_expand_setmem_mops (operands);
> - /* If MOPS is not available or not shorter pick a libcall if the SIMD
> - sequence is too long. */
> - else if (libcall_cost < simd_ops)
> - return false;
> - emit_insn (seq);
> - return true;
> - }
> + /* When optimizing for size we have 3 options: the inlined sequence,
> + call to memset or the MOPS expansion. */
> + if (size_p
> + && TARGET_MOPS
> + && mops_cost <= libcall_cost
> + && mops_cost <= inlined_ops)
> + return aarch64_expand_setmem_mops (operands);
> + /* Pick a libcall if the inlined sequence is too long. */
> + else if (libcall_cost < inlined_ops)
> + return false;
>
> - /* At this point the SIMD broadcast sequence is the best choice when
> - optimizing for speed. */
> emit_insn (seq);
> return true;
> }
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> index 664d43aee13..5a7acbf2fa9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> @@ -1,28 +1,28 @@
> /* { dg-do compile } */
> /* { dg-options "-O2 -mstrict-align" } */
>
> -struct s { char x[255]; };
> +struct s { char x[207]; };
> +struct s s1 __attribute__((aligned(16)));
> void foo (struct s *);
> -void bar (void) { struct s s1 = {}; foo (&s1); }
> +void bar (void) { s1 = (struct s){}; foo (&s1); }
>
> -/* memset (s1 = {}, sizeof = 255) should be expanded out
> +/* memset (s1 = {}, sizeof = 207) should be expanded out
> such that there are no overlap stores when -mstrict-align
> is in use.
> - so 7 pairs of 16 bytes stores (224 bytes).
> - 1 16 byte stores
> + so 5 pairs of 16 bytes stores (80 bytes).
> + 2 16 byte stores (FIXME: should be 0)
> 1 8 byte store
> 1 4 byte store
> 1 2 byte store
> 1 1 byte store
> */
>
> -/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
> -/* { dg-final { scan-assembler-times "str\tq" 1 } } */
> +/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
> +/* { dg-final { scan-assembler-times "str\tq" 2 } } */
> /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
> /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
> /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
> /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
>
> -/* Also one store pair for the frame-pointer and the LR. */
> -/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
> -
> +/* No store pair with 8 byte words is needed as foo is called with a sibcall. */
> +/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> new file mode 100644
> index 00000000000..67d994fe39e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +struct s { char x[15]; };
> +void foo (struct s *);
> +void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
> +
> +/* memset (s1 = {}, sizeof = 15) should be expanded out
> + such that there are no overlap stores when -mstrict-align
> + is in use. As the alignment of s1 is unknown, byte stores are needed.
> + so 15 byte stores
> + */
> +
> +/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
prev parent reply other threads:[~2022-02-01 10:44 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-26 2:28 apinski
2022-02-01 10:44 ` Richard Sandiford [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=mpt4k5ihnnu.fsf@arm.com \
--to=richard.sandiford@arm.com \
--cc=apinski@marvell.com \
--cc=gcc-patches@gcc.gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).