From: Richard Sandiford <richard.sandiford@arm.com>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
Richard Earnshaw <Richard.Earnshaw@arm.com>
Subject: Re: [PATCH v3] AArch64: Add inline memmove expansion
Date: Thu, 14 Dec 2023 16:39:20 +0000
Message-ID: <mpt8r5wspqf.fsf@arm.com>
In-Reply-To: <PAWPR08MB8982A4B8A35642A442A727228381A@PAWPR08MB8982.eurprd08.prod.outlook.com> (Wilco Dijkstra's message of "Fri, 1 Dec 2023 16:56:55 +0000")

Sorry, only just realised that I've never replied to this :(

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> + rtx load[max_ops], store[max_ops];
>>
>> Please either add a comment explaining why 40 is guaranteed to be
>> enough, or (my preference) use:
>>
>> auto_vec<std::pair<rtx, rtx>, ...> ops;
>
> I've changed to using auto_vec since that should help reduce conflicts
> with Alex's LDP changes. I double-checked the maximum number of
> instructions: with a minor tweak to handle
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS, it can now be limited to 12 if you
> also select -mstrict-align.
>
> v3: update after review, use auto_vec, tweak max_copy_size, add another test.
>
> Add support for inline memmove expansion. The generated code is
> identical to that of memcpy, except that all loads are emitted before
> stores rather than being interleaved. The maximum size is 256 bytes,
> which requires at most 16 registers.
>
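To illustrate why the load/store ordering matters (a minimal C sketch, not
part of the patch): suppose a 16-byte copy is done as two 8-byte chunks and
dst == src + 8. Interleaving would store chunk 1 over bytes that chunk 2 has
yet to read; issuing all loads first is safe for any overlap.

  /* Illustrative only: the emission order the memmove expansion relies on.  */
  #include <string.h>
  #include <stdint.h>

  void
  copy16_loads_first (char *dst, const char *src)
  {
    uint64_t a, b;
    memcpy (&a, src, 8);      /* load chunk 1 */
    memcpy (&b, src + 8, 8);  /* load chunk 2: still reads the original bytes */
    memcpy (dst, &a, 8);      /* store chunk 1 */
    memcpy (dst + 8, &b, 8);  /* store chunk 2 */
  }
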
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog:
> * config/aarch64/aarch64.opt (aarch64_mops_memmove_size_threshold):
> Change default.
> * config/aarch64/aarch64.md (cpymemdi): Add a parameter.
> (movmemdi): Call aarch64_expand_cpymem.
> * config/aarch64/aarch64.cc (aarch64_copy_one_block): Rename function,
> simplify, support storing generated loads/stores.
> (aarch64_expand_cpymem): Support expansion of memmove.
> * config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Add bool arg.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/aarch64/memmove.c: Add new test.
> * gcc.target/aarch64/memmove2.c: Likewise.
OK, thanks. I have since added a comment:
??? Although it would be possible to use LDP/STP Qn in streaming mode
(so using TARGET_BASE_SIMD instead of TARGET_SIMD), it isn't clear
whether that would improve performance. */
which now belongs...
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index d2718cc87b306e9673b166cc40e0af2ba72aa17b..d958b181d79440ab1b4f274cc188559edc04c628 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -769,7 +769,7 @@ bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
> tree aarch64_vector_load_decl (tree);
> void aarch64_expand_call (rtx, rtx, rtx, bool);
> bool aarch64_expand_cpymem_mops (rtx *, bool);
> -bool aarch64_expand_cpymem (rtx *);
> +bool aarch64_expand_cpymem (rtx *, bool);
> bool aarch64_expand_setmem (rtx *);
> bool aarch64_float_const_zero_rtx_p (rtx);
> bool aarch64_float_const_rtx_p (rtx);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 748b313092c5af452e9526a0c6747c51e598e4ca..26d1485ff6b977caeeb780dfaee739069742e638 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23058,51 +23058,41 @@ aarch64_progress_pointer (rtx pointer)
> return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
> }
>
> +typedef auto_vec<std::pair<rtx, rtx>, 12> copy_ops;
> +
> /* Copy one block of size MODE from SRC to DST at offset OFFSET. */
>
> static void
> -aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> - machine_mode mode)
> +aarch64_copy_one_block (copy_ops &ops, rtx src, rtx dst,
> + int offset, machine_mode mode)
> {
> - /* Handle 256-bit memcpy separately. We do this by making 2 adjacent memory
> - address copies using V4SImode so that we can use Q registers. */
> - if (known_eq (GET_MODE_BITSIZE (mode), 256))
> + /* Emit explicit load/store pair instructions for 32-byte copies. */
> + if (known_eq (GET_MODE_SIZE (mode), 32))
> {
> mode = V4SImode;
> + rtx src1 = adjust_address (src, mode, offset);
> + rtx src2 = adjust_address (src, mode, offset + 16);
> + rtx dst1 = adjust_address (dst, mode, offset);
> + rtx dst2 = adjust_address (dst, mode, offset + 16);
> rtx reg1 = gen_reg_rtx (mode);
> rtx reg2 = gen_reg_rtx (mode);
> - /* "Cast" the pointers to the correct mode. */
> - *src = adjust_address (*src, mode, 0);
> - *dst = adjust_address (*dst, mode, 0);
> - /* Emit the memcpy. */
> - emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> - aarch64_progress_pointer (*src)));
> - emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> - aarch64_progress_pointer (*dst), reg2));
> - /* Move the pointers forward. */
> - *src = aarch64_move_pointer (*src, 32);
> - *dst = aarch64_move_pointer (*dst, 32);
> + rtx load = aarch64_gen_load_pair (mode, reg1, src1, reg2, src2);
> + rtx store = aarch64_gen_store_pair (mode, dst1, reg1, dst2, reg2);
> + ops.safe_push ({ load, store });
> return;
> }
>
> rtx reg = gen_reg_rtx (mode);
> -
> - /* "Cast" the pointers to the correct mode. */
> - *src = adjust_address (*src, mode, 0);
> - *dst = adjust_address (*dst, mode, 0);
> - /* Emit the memcpy. */
> - emit_move_insn (reg, *src);
> - emit_move_insn (*dst, reg);
> - /* Move the pointers forward. */
> - *src = aarch64_progress_pointer (*src);
> - *dst = aarch64_progress_pointer (*dst);
> + rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
> + rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
> + ops.safe_push ({ load, store });
> }
>
> /* Expand a cpymem/movmem using the MOPS extension. OPERANDS are taken
> from the cpymem/movmem pattern. IS_MEMMOVE is true if this is a memmove
> rather than memcpy. Return true iff we succeeded. */
> bool
> -aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
> +aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
> {
> if (!TARGET_MOPS)
> return false;
> @@ -23121,51 +23111,47 @@ aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
> return true;
> }
>
> -/* Expand cpymem, as if from a __builtin_memcpy. Return true if
> - we succeed, otherwise return false, indicating that a libcall to
> - memcpy should be emitted. */
> -
> +/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
> + OPERANDS are taken from the cpymem/movmem pattern. IS_MEMMOVE is true
> + if this is a memmove rather than memcpy. Return true if we succeed,
> + otherwise return false, indicating that a libcall should be emitted. */
> bool
> -aarch64_expand_cpymem (rtx *operands)
> +aarch64_expand_cpymem (rtx *operands, bool is_memmove)
> {
> - int mode_bits;
> + int mode_bytes;
> rtx dst = operands[0];
> rtx src = operands[1];
> unsigned align = UINTVAL (operands[3]);
> rtx base;
> - machine_mode cur_mode = BLKmode;
> - bool size_p = optimize_function_for_size_p (cfun);
> + machine_mode cur_mode = BLKmode, next_mode;
>
> /* Variable-sized or strict-align copies may use the MOPS expansion. */
> if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
> - return aarch64_expand_cpymem_mops (operands);
> + return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> unsigned HOST_WIDE_INT size = UINTVAL (operands[2]);
> + bool use_ldpq = TARGET_SIMD && !(aarch64_tune_params.extra_tuning_flags
> + & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);
...here.
Apologies again for the delay.
Richard
> +
> + /* Set inline limits for memmove/memcpy. MOPS has a separate threshold. */
> + unsigned max_copy_size = use_ldpq ? 256 : 128;
> + unsigned mops_threshold = is_memmove ? aarch64_mops_memmove_size_threshold
> + : aarch64_mops_memcpy_size_threshold;
>
> - /* Try to inline up to 256 bytes. */
> - unsigned max_copy_size = 256;
> - unsigned mops_threshold = aarch64_mops_memcpy_size_threshold;
> + /* Reduce the maximum size with -Os. */
> + if (optimize_function_for_size_p (cfun))
> + max_copy_size /= 4;
>
> /* Large copies use MOPS when available or a library call. */
> if (size > max_copy_size || (TARGET_MOPS && size > mops_threshold))
> - return aarch64_expand_cpymem_mops (operands);
> -
> - int copy_bits = 256;
> -
> - /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
> - support or slow 256-bit LDP/STP fall back to 128-bit chunks. */
> - if (size <= 24
> - || !TARGET_SIMD
> - || (aarch64_tune_params.extra_tuning_flags
> - & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> - copy_bits = 128;
> -
> - /* Emit an inline load+store sequence and count the number of operations
> - involved. We use a simple count of just the loads and stores emitted
> - rather than rtx_insn count as all the pointer adjustments and reg copying
> - in this function will get optimized away later in the pipeline. */
> - start_sequence ();
> - unsigned nops = 0;
> + return aarch64_expand_cpymem_mops (operands, is_memmove);
> +
> + unsigned copy_max = 32;
> +
> + /* Default to 32-byte LDP/STP on large copies; however, small copies,
> + lack of SIMD support or slow LDP/STP fall back to 16-byte chunks. */
> + if (size <= 24 || !use_ldpq)
> + copy_max = 16;
>
> base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23173,69 +23159,57 @@ aarch64_expand_cpymem (rtx *operands)
> base = copy_to_mode_reg (Pmode, XEXP (src, 0));
> src = adjust_automodify_address (src, VOIDmode, base, 0);
>
> - /* Convert size to bits to make the rest of the code simpler. */
> - int n = size * BITS_PER_UNIT;
> + copy_ops ops;
>
> - while (n > 0)
> + int offset = 0;
> +
> + while (size > 0)
> {
> /* Find the largest mode in which to do the copy in without over reading
> or writing. */
> opt_scalar_int_mode mode_iter;
> FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
> - if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_bits))
> + if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, copy_max))
> cur_mode = mode_iter.require ();
>
> gcc_assert (cur_mode != BLKmode);
>
> - mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> + mode_bytes = GET_MODE_SIZE (cur_mode).to_constant ();
>
> /* Prefer Q-register accesses for the last bytes. */
> - if (mode_bits == 128 && copy_bits == 256)
> + if (mode_bytes == 16 && copy_max == 32)
> cur_mode = V4SImode;
>
> - aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
> - /* A single block copy is 1 load + 1 store. */
> - nops += 2;
> - n -= mode_bits;
> + aarch64_copy_one_block (ops, src, dst, offset, cur_mode);
> + size -= mode_bytes;
> + offset += mode_bytes;
>
> /* Emit trailing copies using overlapping unaligned accesses
> - (when !STRICT_ALIGNMENT) - this is smaller and faster. */
> - if (n > 0 && n < copy_bits / 2 && !STRICT_ALIGNMENT)
> + (when !STRICT_ALIGNMENT) - this is smaller and faster. */
> + if (size > 0 && size < copy_max / 2 && !STRICT_ALIGNMENT)
> {
> - machine_mode next_mode = smallest_mode_for_size (n, MODE_INT);
> - int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> - gcc_assert (n_bits <= mode_bits);
> - src = aarch64_move_pointer (src, (n - n_bits) / BITS_PER_UNIT);
> - dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
> - n = n_bits;
> + next_mode = smallest_mode_for_size (size * BITS_PER_UNIT, MODE_INT);
> + int n_bytes = GET_MODE_SIZE (next_mode).to_constant ();
> + gcc_assert (n_bytes <= mode_bytes);
> + offset -= n_bytes - size;
> + size = n_bytes;
> }
> }
> - rtx_insn *seq = get_insns ();
> - end_sequence ();
> - /* MOPS sequence requires 3 instructions for the memory copying + 1 to move
> - the constant size into a register. */
> - unsigned mops_cost = 3 + 1;
> -
> - /* If MOPS is available at this point we don't consider the libcall as it's
> - not a win even on code size. At this point only consider MOPS if
> - optimizing for size. For speed optimizations we will have chosen between
> - the two based on copy size already. */
> - if (TARGET_MOPS)
> - {
> - if (size_p && mops_cost < nops)
> - return aarch64_expand_cpymem_mops (operands);
> - emit_insn (seq);
> - return true;
> - }
>
> - /* A memcpy libcall in the worst case takes 3 instructions to prepare the
> - arguments + 1 for the call. When MOPS is not available and we're
> - optimizing for size a libcall may be preferable. */
> - unsigned libcall_cost = 4;
> - if (size_p && libcall_cost < nops)
> - return false;
> + /* Memcpy interleaves loads with stores; memmove emits all loads first. */
> + int nops = ops.length ();
> + int inc = is_memmove ? nops : nops == 4 ? 2 : 3;
>
> - emit_insn (seq);
> + for (int i = 0; i < nops; i += inc)
> + {
> + int m = MIN (nops, i + inc);
> + /* Emit loads. */
> + for (int j = i; j < m; j++)
> + emit_insn (ops[j].first);
> + /* Emit stores. */
> + for (int j = i; j < m; j++)
> + emit_insn (ops[j].second);
> + }
> return true;
> }
>
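A note on two details of the loop above (illustrative, not part of the
patch): the trailing-copy branch (the size < copy_max / 2 case) slides the
final access back so it ends exactly at the end of the buffer, re-copying a
few bytes instead of emitting several small accesses. A 15-byte copy, for
example, becomes one 8-byte access at offset 0 and one at offset 7, rather
than 8+4+2+1:

  /* Minimal C sketch of the overlapping-tail trick, assuming 8-byte chunks.  */
  #include <string.h>

  void
  copy15 (char *dst, const char *src)
  {
    char a[8], b[8];
    memcpy (a, src, 8);      /* bytes 0-7 */
    memcpy (b, src + 7, 8);  /* bytes 7-14: overlaps the first chunk by one */
    memcpy (dst, a, 8);
    memcpy (dst + 7, b, 8);
  }

The emission loop at the end then schedules these operations: memmove emits
every load before any store, while memcpy interleaves in groups of up to
three loads followed by the matching stores (two when there are exactly four
operations in total).
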
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 4a3af6df7e7caf1fab9483239ce41845a92e23b7..267955871591245100a3c1703845a336394241ec 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1632,7 +1632,7 @@ (define_expand "cpymemdi"
> (match_operand:DI 3 "immediate_operand")]
> ""
> {
> - if (aarch64_expand_cpymem (operands))
> + if (aarch64_expand_cpymem (operands, false))
> DONE;
> FAIL;
> }
> @@ -1676,17 +1676,9 @@ (define_expand "movmemdi"
> (match_operand:BLK 1 "memory_operand")
> (match_operand:DI 2 "general_operand")
> (match_operand:DI 3 "immediate_operand")]
> - "TARGET_MOPS"
> + ""
> {
> - rtx sz_reg = operands[2];
> - /* For constant-sized memmoves check the threshold.
> - FIXME: We should add a non-MOPS memmove expansion for smaller,
> - constant-sized memmove to avoid going to a libcall. */
> - if (CONST_INT_P (sz_reg)
> - && INTVAL (sz_reg) < aarch64_mops_memmove_size_threshold)
> - FAIL;
> -
> - if (aarch64_expand_cpymem_mops (operands, true))
> + if (aarch64_expand_cpymem (operands, true))
> DONE;
> FAIL;
> }
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a157b5b5bc2b2aa14ac1177fded7d66..0ac9d8c578d706e7bf0f0ae399d84544f0c619dc 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -327,7 +327,7 @@ Target Joined UInteger Var(aarch64_mops_memcpy_size_threshold) Init(256) Param
> Constant memcpy size in bytes above which to start using MOPS sequence.
>
> -param=aarch64-mops-memmove-size-threshold=
> -Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(0) Param
> +Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(256) Param
> Constant memmove size in bytes above which to start using MOPS sequence.
>
> -param=aarch64-mops-memset-size-threshold=
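
In other words, with the new default a constant-size memmove only uses the
MOPS sequence above 256 bytes, matching the memcpy threshold just above;
below that the inline expansion applies. The cutoff remains tunable, e.g.
(hypothetical invocation):

  gcc -O2 --param=aarch64-mops-memmove-size-threshold=512 file.c
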
> diff --git a/gcc/testsuite/gcc.target/aarch64/memmove.c b/gcc/testsuite/gcc.target/aarch64/memmove.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d2dd65b5190d6f3569f3272b8bfdbd7621f43c7d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memmove.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#pragma GCC target "+nomops"
> +
> +void
> +copy1 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 12);
> +}
> +
> +void
> +copy2 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 128);
> +}
> +
> +void
> +copy3 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 255);
> +}
> +
> +/* { dg-final { scan-assembler-not {\tb\tmemmove} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memmove2.c b/gcc/testsuite/gcc.target/aarch64/memmove2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..4c590a32214219cb0413617f2a4fc1e552643cfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memmove2.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +#pragma GCC target "+nomops"
> +
> +void
> +copy1 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 12);
> +}
> +
> +void
> +copy2 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 128);
> +}
> +
> +void
> +copy3 (int *x, int *y)
> +{
> + __builtin_memmove (x, y, 255);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tb\tmemmove} 3 } } */
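
For reference (an expectation based on the code above, not verified output):
without MOPS, the first test should expand all three memmoves inline --
copy1, for instance, as an 8-byte and a 4-byte load followed by the matching
stores -- so no branch to memmove appears. The second test uses
-mstrict-align with unknown alignment, so the inline expansion is rejected
and all three calls become libcalls, which is what scan-assembler-times
checks.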