From: Richard Sandiford
To: Wilco Dijkstra
Cc: GCC Patches, Richard Earnshaw
Subject: Re: [PATCH v3] AArch64: Add inline memmove expansion
Date: Thu, 14 Dec 2023 16:39:20 +0000
In-Reply-To: (Wilco Dijkstra's message of "Fri, 1 Dec 2023 16:56:55 +0000")

Sorry, only just realised that I've never replied to this :(

Wilco Dijkstra writes:
> Hi Richard,
>
>> +  rtx load[max_ops], store[max_ops];
>>
>> Please either add a comment explaining why 40 is guaranteed to be
>> enough, or (my preference) use:
>>
>>   auto_vec<std::pair<rtx, rtx>, ...> ops;
>
> I've changed to using auto_vec since that should help reduce conflicts
> with Alex' LDP changes.  I double-checked the maximum number of
> instructions; with a minor tweak to handle
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS it can now be limited to 12 if you
> also select -mstrict-align.
>
> v3: update after review, use auto_vec, tweak max_copy_size, add another test.
>
> Add support for inline memmove expansions.  The generated code is
> identical to that for memcpy, except that all loads are emitted before
> stores rather than being interleaved.  The maximum size is 256 bytes,
> which requires at most 16 registers.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog:
>         * config/aarch64/aarch64.opt (aarch64_mops_memmove_size_threshold):
>         Change default.
>         * config/aarch64/aarch64.md (cpymemdi): Add a parameter.
>         (movmemdi): Call aarch64_expand_cpymem.
>         * config/aarch64/aarch64.cc (aarch64_copy_one_block): Rename function,
>         simplify, support storing generated loads/stores.
>         (aarch64_expand_cpymem): Support expansion of memmove.
>         * config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Add bool arg.
>
> gcc/testsuite/ChangeLog:
>         * gcc.target/aarch64/memmove.c: Add new test.
>         * gcc.target/aarch64/memmove2.c: Likewise.

OK, thanks.  I have since added a comment:

   ??? Although it would be possible to use LDP/STP Qn in streaming mode
   (so using TARGET_BASE_SIMD instead of TARGET_SIMD), it isn't
   clear whether that would improve performance.  */

which now belongs...

>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index d2718cc87b306e9673b166cc40e0af2ba72aa17b..d958b181d79440ab1b4f274cc188559edc04c628 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -769,7 +769,7 @@ bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>  tree aarch64_vector_load_decl (tree);
>  void aarch64_expand_call (rtx, rtx, rtx, bool);
>  bool aarch64_expand_cpymem_mops (rtx *, bool);
> -bool aarch64_expand_cpymem (rtx *);
> +bool aarch64_expand_cpymem (rtx *, bool);
>  bool aarch64_expand_setmem (rtx *);
>  bool aarch64_float_const_zero_rtx_p (rtx);
>  bool aarch64_float_const_rtx_p (rtx);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 748b313092c5af452e9526a0c6747c51e598e4ca..26d1485ff6b977caeeb780dfaee739069742e638 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23058,51 +23058,41 @@ aarch64_progress_pointer (rtx pointer)
>    return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
>  }
>
> +typedef auto_vec<std::pair<rtx, rtx>, 12> copy_ops;
> +
>  /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
>
>  static void
> -aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> -                                              machine_mode mode)
> +aarch64_copy_one_block (copy_ops &ops, rtx src, rtx dst,
> +                        int offset, machine_mode mode)
>  {
> -  /* Handle 256-bit memcpy separately.  We do this by making 2 adjacent memory
> -     address copies using V4SImode so that we can use Q registers.  */
> -  if (known_eq (GET_MODE_BITSIZE (mode), 256))
> +  /* Emit explicit load/store pair instructions for 32-byte copies.  */
> +  if (known_eq (GET_MODE_SIZE (mode), 32))
>      {
>        mode = V4SImode;
> +      rtx src1 = adjust_address (src, mode, offset);
> +      rtx src2 = adjust_address (src, mode, offset + 16);
> +      rtx dst1 = adjust_address (dst, mode, offset);
> +      rtx dst2 = adjust_address (dst, mode, offset + 16);
>        rtx reg1 = gen_reg_rtx (mode);
>        rtx reg2 = gen_reg_rtx (mode);
> -      /* "Cast" the pointers to the correct mode.  */
> -      *src = adjust_address (*src, mode, 0);
> -      *dst = adjust_address (*dst, mode, 0);
> -      /* Emit the memcpy.  */
> -      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> -                                        aarch64_progress_pointer (*src)));
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> -                                         aarch64_progress_pointer (*dst), reg2));
> -      /* Move the pointers forward.  */
> -      *src = aarch64_move_pointer (*src, 32);
> -      *dst = aarch64_move_pointer (*dst, 32);
> +      rtx load = aarch64_gen_load_pair (mode, reg1, src1, reg2, src2);
> +      rtx store = aarch64_gen_store_pair (mode, dst1, reg1, dst2, reg2);
> +      ops.safe_push ({ load, store });
>        return;
>      }
>
>    rtx reg = gen_reg_rtx (mode);
> -
> -  /* "Cast" the pointers to the correct mode.  */
> -  *src = adjust_address (*src, mode, 0);
> -  *dst = adjust_address (*dst, mode, 0);
> -  /* Emit the memcpy.  */
> -  emit_move_insn (reg, *src);
> -  emit_move_insn (*dst, reg);
> -  /* Move the pointers forward.  */
> -  *src = aarch64_progress_pointer (*src);
> -  *dst = aarch64_progress_pointer (*dst);
> +  rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
> +  rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
> +  ops.safe_push ({ load, store });
>  }
>
>  /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
>     from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
>     rather than memcpy.  Return true iff we succeeded.  */
>  bool
> -aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
> +aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
>  {
>    if (!TARGET_MOPS)
>      return false;
> @@ -23121,51 +23111,47 @@ aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
>    return true;
>  }
>
> -/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
> -   we succeed, otherwise return false, indicating that a libcall to
> -   memcpy should be emitted.  */
> -
> +/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
> +   OPERANDS are taken from the cpymem/movmem pattern.  IS_MEMMOVE is true
> +   if this is a memmove rather than memcpy.  Return true if we succeed,
> +   otherwise return false, indicating that a libcall should be emitted.  */
>  bool
> -aarch64_expand_cpymem (rtx *operands)
> +aarch64_expand_cpymem (rtx *operands, bool is_memmove)
>  {
> -  int mode_bits;
> +  int mode_bytes;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    unsigned align = UINTVAL (operands[3]);
>    rtx base;
> -  machine_mode cur_mode = BLKmode;
> -  bool size_p = optimize_function_for_size_p (cfun);
> +  machine_mode cur_mode = BLKmode, next_mode;
>
>    /* Variable-sized or strict-align copies may use the MOPS expansion.  */
>    if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
> -    return aarch64_expand_cpymem_mops (operands);
> +    return aarch64_expand_cpymem_mops (operands, is_memmove);
>
>    unsigned HOST_WIDE_INT size = UINTVAL (operands[2]);
> +  bool use_ldpq = TARGET_SIMD && !(aarch64_tune_params.extra_tuning_flags
> +                  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);

...here.

Apologies again for the delay.

Richard

> +
> +  /* Set inline limits for memmove/memcpy.  MOPS has a separate threshold.  */
> +  unsigned max_copy_size = use_ldpq ? 256 : 128;
> +  unsigned mops_threshold = is_memmove ? aarch64_mops_memmove_size_threshold
> +                                       : aarch64_mops_memcpy_size_threshold;
>
> -  /* Try to inline up to 256 bytes.  */
> -  unsigned max_copy_size = 256;
> -  unsigned mops_threshold = aarch64_mops_memcpy_size_threshold;
> +  /* Reduce the maximum size with -Os.  */
> +  if (optimize_function_for_size_p (cfun))
> +    max_copy_size /= 4;
>
>    /* Large copies use MOPS when available or a library call.  */
>    if (size > max_copy_size || (TARGET_MOPS && size > mops_threshold))
> -    return aarch64_expand_cpymem_mops (operands);
> -
> -  int copy_bits = 256;
> -
> -  /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
> -     support or slow 256-bit LDP/STP fall back to 128-bit chunks.  */
> -  if (size <= 24
> -      || !TARGET_SIMD
> -      || (aarch64_tune_params.extra_tuning_flags
> -          & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> -    copy_bits = 128;
> -
> -  /* Emit an inline load+store sequence and count the number of operations
> -     involved.  We use a simple count of just the loads and stores emitted
> -     rather than rtx_insn count as all the pointer adjustments and reg copying
> -     in this function will get optimized away later in the pipeline.  */
> -  start_sequence ();
> -  unsigned nops = 0;
> +    return aarch64_expand_cpymem_mops (operands, is_memmove);
> +
> +  unsigned copy_max = 32;
> +
> +  /* Default to 32-byte LDP/STP on large copies, however small copies, no SIMD
> +     support or slow LDP/STP fall back to 16-byte chunks.  */
> +  if (size <= 24 || !use_ldpq)
> +    copy_max = 16;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23173,69 +23159,57 @@ aarch64_expand_cpymem (rtx *operands)
>    base = copy_to_mode_reg (Pmode, XEXP (src, 0));
>    src = adjust_automodify_address (src, VOIDmode, base, 0);
>
> -  /* Convert size to bits to make the rest of the code simpler.  */
> -  int n = size * BITS_PER_UNIT;
> +  copy_ops ops;
>
> -  while (n > 0)
> +  int offset = 0;
> +
> +  while (size > 0)
>      {
>        /* Find the largest mode in which to do the copy in without over reading
>           or writing.  */
>        opt_scalar_int_mode mode_iter;
>        FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
> -        if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_bits))
> +        if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, copy_max))
>            cur_mode = mode_iter.require ();
>
>        gcc_assert (cur_mode != BLKmode);
>
> -      mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> +      mode_bytes = GET_MODE_SIZE (cur_mode).to_constant ();
>
>        /* Prefer Q-register accesses for the last bytes.  */
> -      if (mode_bits == 128 && copy_bits == 256)
> +      if (mode_bytes == 16 && copy_max == 32)
>          cur_mode = V4SImode;
>
> -      aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
> -      /* A single block copy is 1 load + 1 store.  */
> -      nops += 2;
> -      n -= mode_bits;
> +      aarch64_copy_one_block (ops, src, dst, offset, cur_mode);
> +      size -= mode_bytes;
> +      offset += mode_bytes;
>
>        /* Emit trailing copies using overlapping unaligned accesses
> -         (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
> -      if (n > 0 && n < copy_bits / 2 && !STRICT_ALIGNMENT)
> +         (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
> +      if (size > 0 && size < copy_max / 2 && !STRICT_ALIGNMENT)
>          {
> -          machine_mode next_mode = smallest_mode_for_size (n, MODE_INT);
> -          int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> -          gcc_assert (n_bits <= mode_bits);
> -          src = aarch64_move_pointer (src, (n - n_bits) / BITS_PER_UNIT);
> -          dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
> -          n = n_bits;
> +          next_mode = smallest_mode_for_size (size * BITS_PER_UNIT, MODE_INT);
> +          int n_bytes = GET_MODE_SIZE (next_mode).to_constant ();
> +          gcc_assert (n_bytes <= mode_bytes);
> +          offset -= n_bytes - size;
> +          size = n_bytes;
>          }
>      }
> -  rtx_insn *seq = get_insns ();
> -  end_sequence ();
> -  /* MOPS sequence requires 3 instructions for the memory copying + 1 to move
> -     the constant size into a register.  */
> -  unsigned mops_cost = 3 + 1;
> -
> -  /* If MOPS is available at this point we don't consider the libcall as it's
> -     not a win even on code size.  At this point only consider MOPS if
> -     optimizing for size.  For speed optimizations we will have chosen between
> -     the two based on copy size already.  */
> -  if (TARGET_MOPS)
> -    {
> -      if (size_p && mops_cost < nops)
> -        return aarch64_expand_cpymem_mops (operands);
> -      emit_insn (seq);
> -      return true;
> -    }
>
> -  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
> -     arguments + 1 for the call.  When MOPS is not available and we're
> -     optimizing for size a libcall may be preferable.  */
> -  unsigned libcall_cost = 4;
> -  if (size_p && libcall_cost < nops)
> -    return false;
> +  /* Memcpy interleaves loads with stores, memmove emits all loads first.  */
> +  int nops = ops.length();
> +  int inc = is_memmove ? nops : nops == 4 ? 2 : 3;
>
> -  emit_insn (seq);
> +  for (int i = 0; i < nops; i += inc)
> +    {
> +      int m = MIN (nops, i + inc);
> +      /* Emit loads.  */
> +      for (int j = i; j < m; j++)
> +        emit_insn (ops[j].first);
> +      /* Emit stores.  */
> +      for (int j = i; j < m; j++)
> +        emit_insn (ops[j].second);
> +    }
>    return true;
>  }
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 4a3af6df7e7caf1fab9483239ce41845a92e23b7..267955871591245100a3c1703845a336394241ec 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1632,7 +1632,7 @@ (define_expand "cpymemdi"
>     (match_operand:DI 3 "immediate_operand")]
>     ""
>  {
> -  if (aarch64_expand_cpymem (operands))
> +  if (aarch64_expand_cpymem (operands, false))
>      DONE;
>    FAIL;
>  }
> @@ -1676,17 +1676,9 @@ (define_expand "movmemdi"
>     (match_operand:BLK 1 "memory_operand")
>     (match_operand:DI 2 "general_operand")
>     (match_operand:DI 3 "immediate_operand")]
> -   "TARGET_MOPS"
> +   ""
>  {
> -  rtx sz_reg = operands[2];
> -  /* For constant-sized memmoves check the threshold.
> -     FIXME: We should add a non-MOPS memmove expansion for smaller,
> -     constant-sized memmove to avoid going to a libcall.  */
> -  if (CONST_INT_P (sz_reg)
> -      && INTVAL (sz_reg) < aarch64_mops_memmove_size_threshold)
> -    FAIL;
> -
> -  if (aarch64_expand_cpymem_mops (operands, true))
> +  if (aarch64_expand_cpymem (operands, true))
>      DONE;
>    FAIL;
>  }
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a157b5b5bc2b2aa14ac1177fded7d66..0ac9d8c578d706e7bf0f0ae399d84544f0c619dc 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -327,7 +327,7 @@ Target Joined UInteger Var(aarch64_mops_memcpy_size_threshold) Init(256) Param
>  Constant memcpy size in bytes above which to start using MOPS sequence.
>
>  -param=aarch64-mops-memmove-size-threshold=
> -Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(0) Param
> +Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(256) Param
>  Constant memmove size in bytes above which to start using MOPS sequence.
>
>  -param=aarch64-mops-memset-size-threshold=
> diff --git a/gcc/testsuite/gcc.target/aarch64/memmove.c b/gcc/testsuite/gcc.target/aarch64/memmove.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d2dd65b5190d6f3569f3272b8bfdbd7621f43c7d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memmove.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#pragma GCC target "+nomops"
> +
> +void
> +copy1 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 12);
> +}
> +
> +void
> +copy2 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 128);
> +}
> +
> +void
> +copy3 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 255);
> +}
> +
> +/* { dg-final { scan-assembler-not {\tb\tmemmove} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memmove2.c b/gcc/testsuite/gcc.target/aarch64/memmove2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..4c590a32214219cb0413617f2a4fc1e552643cfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memmove2.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +#pragma GCC target "+nomops"
> +
> +void
> +copy1 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 12);
> +}
> +
> +void
> +copy2 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 128);
> +}
> +
> +void
> +copy3 (int *x, int *y)
> +{
> +  __builtin_memmove (x, y, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tb\tmemmove} 3 } } */
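
As a rough illustration of the ordering difference the patch description
refers to: the sketch below is hand-written, not compiler output, and the
instruction grouping, register numbers and offsets in the comments are
assumptions based on the description (32-byte LDP/STP Q blocks, loads
emitted before stores for memmove).

/* Illustrative only: approximate shape of the inline expansion for a
   64-byte constant-size copy when Q-register LDP/STP is enabled.  */

void
sketch_memmove (char *dst, const char *src)
{
  /* memmove: all loads are emitted before any store, so the copy stays
     correct even if dst and src overlap.  Roughly:
         ldp  q0, q1, [src]
         ldp  q2, q3, [src, 32]
         stp  q0, q1, [dst]
         stp  q2, q3, [dst, 32]  */
  __builtin_memmove (dst, src, 64);
}

void
sketch_memcpy (char *dst, const char *src)
{
  /* memcpy: loads and stores are interleaved in small groups.  Roughly:
         ldp  q0, q1, [src]
         stp  q0, q1, [dst]
         ldp  q2, q3, [src, 32]
         stp  q2, q3, [dst, 32]  */
  __builtin_memcpy (dst, src, 64);
}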
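
A note on the aarch64.opt change: with the memmove threshold now defaulting
to 256, constant-size memmoves up to that size (and within the inline limits
discussed above) use the new inline expansion, while larger or variable-size
moves use a MOPS sequence when MOPS is available and a library call
otherwise.  Lowering the parameter makes MOPS take over at smaller constant
sizes.  The parameter name is taken from the hunk above; the command line
itself is only an assumed example, not part of the patch:

  gcc -O2 -march=armv8.8-a \
      --param=aarch64-mops-memmove-size-threshold=64 -S test.c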