From: Richard Sandiford
To: apinski--- via Gcc-patches
Cc: apinski@marvell.com
Subject: Re: [PATCH v3] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Tue, 01 Feb 2022 10:44:05 +0000
References: <1643164088-18048-1-git-send-email-apinski@marvell.com>
In-Reply-To: <1643164088-18048-1-git-send-email-apinski@marvell.com> (apinski's message of "Tue, 25 Jan 2022 18:28:08 -0800")

apinski--- via Gcc-patches writes:
> From: Andrew Pinski
>
> The problem here is that aarch64_expand_setmem does not change the
> alignment for the strict alignment case.  This is version 3 of this
> patch; it is based on version 2 and moves the check for the number of
> instructions from the optimizing-for-size case so that it is done
> unconditionally, and changes the cost of libcalls for the !size case
> to max_size/16 + 1 (or 17), which is the same as before when handling
> just the max_size.

In this case, if we want the number to be 17 for strict alignment
targets, I think we should just pick 17.  It feels odd to be setting
the libcall cost based on AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.

I notice that you didn't go for the suggestion in the previous review
round:

| I think in practice the code has only been tuned on targets that
| support LDP/STP Q, so how about moving the copy_limit calculation
| further up and doing:
|
|   unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;
|
| ?

Does that not work?  One advantage of it is that the:

  if (len > max_set_size && !TARGET_MOPS)
    return false;

bail-out will kick in sooner, stopping us from generating (say)
255 garbage insns for a byte-aligned 255-byte copy.
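For reference, the kind of rearrangement I mean is sketched below.  It is
untested and only meant to illustrate the idea; the exact placement and
the types would need checking against the surrounding code in
aarch64_expand_setmem:

  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on
     the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
  unsigned copy_limit = (aarch64_tune_params.extra_tuning_flags
                         & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
                         ? GET_MODE_BITSIZE (TImode) : 256);

  /* With -mstrict-align, never use accesses wider than the known
     alignment of the destination.  */
  if (STRICT_ALIGNMENT)
    copy_limit = MIN (copy_limit, MEM_ALIGN (dst));

  /* Allow at most 8 stores of copy_limit bits each before falling back
     to a libcall or MOPS, so that a reduced copy_limit also reduces
     max_set_size and the len > max_set_size check quoted above kicks
     in sooner.  */
  unsigned max_set_size = (copy_limit * 8) / BITS_PER_UNIT;

With that, a byte-aligned destination would give copy_limit == 8 bits
and max_set_size == 8 bytes, so larger strict-alignment memsets would go
straight to the libcall/MOPS path instead of being expanded inline.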
Thanks,
Richard

> The main change is dealing with the strict alignment case, where we
> only inline a max of 17 instructions, as at that point the call to
> memset will be faster and can handle the dynamic alignment instead of
> just the static alignment.
>
> Note the reason why it is +1 is to account for the setting of the simd
> duplicate.
>
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
>
>	PR target/103100
> gcc/ChangeLog:
>
>	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
>	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is
>	true.  Also constrain the number of instructions for the !size
>	case to max_size/16 + 1.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
>	Reduce the size down to 207 and make s1 global and aligned
>	to 16 bytes.
>	* gcc.target/aarch64/memset-strict-align-2.c: New test.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
>  .../aarch64/memset-strict-align-1.c           | 20 +++----
>  .../aarch64/memset-strict-align-2.c           | 14 +++++
>  3 files changed, 53 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 296145e6008..02ecb2154ea 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
>       (zero constants can use XZR directly).  */
>    unsigned mops_cost = 3 + 1 + cst_val;
>    /* A libcall to memset in the worst case takes 3 instructions to prepare
> -     the arguments + 1 for the call.  */
> -  unsigned libcall_cost = 4;
> +     the arguments + 1 for the call.
> +     In the case of not optimizing for size the cost of doing a libcall
> +     is the max_set_size / 16 + 1 or 17 instructions.  The one instruction
> +     is for the vector dup which may or may not be used.  */
> +  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
>
>    /* Upper bound check. For large constant-sized setmem use the MOPS sequence
>       when available.  */
> @@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
>
>    /* Attempt a sequence with a vector broadcast followed by stores.
>       Count the number of operations involved to see if it's worth it
> -     against the alternatives.  A simple counter simd_ops on the
> +     against the alternatives.  A simple counter inlined_ops on the
>       algorithmically-relevant operations is used rather than an rtx_insn count
>       as all the pointer adjusmtents and mode reinterprets will be optimized
>       away later.  */
>    start_sequence ();
> -  unsigned simd_ops = 0;
> +  unsigned inlined_ops = 0;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
>    /* Prepare the val using a DUP/MOVI v0.16B, val.  */
>    src = expand_vector_broadcast (V16QImode, val);
>    src = force_reg (V16QImode, src);
> -  simd_ops++;
> +  inlined_ops++;
>    /* Convert len to bits to make the rest of the code simpler.  */
>    n = len * BITS_PER_UNIT;
>
>    /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
>       AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> -  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> -			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -			  ? GET_MODE_BITSIZE (TImode) : 256;
> +  int copy_limit;
> +
> +  if (aarch64_tune_params.extra_tuning_flags
> +      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> +    copy_limit = GET_MODE_BITSIZE (TImode);
> +  else
> +    copy_limit = 256;
> +
> +  if (STRICT_ALIGNMENT)
> +    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));
>
>    while (n > 0)
>      {
> @@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
>
>        mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
>        aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> -      simd_ops++;
> +      inlined_ops++;
>        n -= mode_bits;
>
>        /* Do certain trailing copies as overlapping if it's going to be
> @@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
>    rtx_insn *seq = get_insns ();
>    end_sequence ();
>
> -  if (size_p)
> -    {
> -      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> -	 call to memset or the MOPS expansion.  */
> -      if (TARGET_MOPS
> -	  && mops_cost <= libcall_cost
> -	  && mops_cost <= simd_ops)
> -	return aarch64_expand_setmem_mops (operands);
> -      /* If MOPS is not available or not shorter pick a libcall if the SIMD
> -	 sequence is too long.  */
> -      else if (libcall_cost < simd_ops)
> -	return false;
> -      emit_insn (seq);
> -      return true;
> -    }
> +  /* When optimizing for size we have 3 options: the inlined sequence,
> +     call to memset or the MOPS expansion.  */
> +  if (size_p
> +      && TARGET_MOPS
> +      && mops_cost <= libcall_cost
> +      && mops_cost <= inlined_ops)
> +    return aarch64_expand_setmem_mops (operands);
> +  /* Pick a libcall if the inlined sequence is too long.  */
> +  else if (libcall_cost < inlined_ops)
> +    return false;
>
> -  /* At this point the SIMD broadcast sequence is the best choice when
> -     optimizing for speed.  */
>    emit_insn (seq);
>    return true;
>  }
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> index 664d43aee13..5a7acbf2fa9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> @@ -1,28 +1,28 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mstrict-align" } */
>
> -struct s { char x[255]; };
> +struct s { char x[207]; };
> +struct s s1 __attribute__((aligned(16)));
>  void foo (struct s *);
> -void bar (void) { struct s s1 = {}; foo (&s1); }
> +void bar (void) { s1 = (struct s){}; foo (&s1); }
>
> -/* memset (s1 = {}, sizeof = 255) should be expanded out
> +/* memset (s1 = {}, sizeof = 207) should be expanded out
>     such that there are no overlap stores when -mstrict-align
>     is in use.
> -   so 7 pairs of 16 bytes stores (224 bytes).
> -   1 16 byte stores
> +   so 5 pairs of 16 byte stores (160 bytes).
> +   2 16 byte stores (FIXME: should be 0)
>     1 8 byte store
>     1 4 byte store
>     1 2 byte store
>     1 1 byte store
>   */
>
> -/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
> -/* { dg-final { scan-assembler-times "str\tq" 1 } } */
> +/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
> +/* { dg-final { scan-assembler-times "str\tq" 2 } } */
>  /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
>  /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
>
> -/* Also one store pair for the frame-pointer and the LR.  */
> -/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
> -
> +/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
> +/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> new file mode 100644
> index 00000000000..67d994fe39e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +struct s { char x[15]; };
> +void foo (struct s *);
> +void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
> +
> +/* memset (s1 = {}, sizeof = 15) should be expanded out
> +   such that there are no overlap stores when -mstrict-align
> +   is in use.  As the alignment of s1 is unknown, byte stores are needed.
> +   so 15 byte stores
> + */
> +
> +/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
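
P.S. For anyone following along: with the current max_set_size of 256,
the !size_p libcall cost above works out as 256 / 16 + 1 == 17 (16
stores of 16 bytes plus the vector dup), which is where the 17 in the
description comes from.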