From: Richard Sandiford
To: Wilco Dijkstra
Cc: Richard Earnshaw, Kyrylo Tkachov, GCC Patches
Subject: Re: [PATCH v4] AArch64: Cleanup memset expansion
Date: Wed, 10 Jan 2024 18:13:39 +0000
In-Reply-To: Wilco Dijkstra's message of "Tue, 9 Jan 2024 20:51:06 +0000"

Wilco Dijkstra writes:
> Hi Richard,
>
>>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>>
>> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
>> any need to put it in the header file.  It could just go at the head
>> of aarch64.cc instead.
>
> Sure, I've moved it in v4.
>
>>> +  if (len <= 24 || (aarch64_tune_params.extra_tuning_flags
>>> +		     & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
>>> +    set_max = 16;
>>
>> I think we should take the tuning parameter into account when applying
>> the MAX_SET_SIZE limit for -Os.  Shouldn't it be 48 rather than 96 in
>> that case?  (Alternatively, I suppose it would make sense to ignore
>> the param for -Os, although we don't seem to do that elsewhere.)
>
> That tune is only used by an obsolete core.  I ran the memcpy and memset
> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
> There is no measurable penalty for using LDP/STP.  I'm not sure why it
> was ever added given it does not do anything useful.  I'll post a
> separate patch to remove it to reduce the maintenance overhead.

Is that enough to justify removing it though?  It sounds from:

  https://gcc.gnu.org/pipermail/gcc-patches/2018-June/500017.html

like the problem was in more balanced code, rather than memory-limited
things like memset/memcpy.

But yeah, I'm not sure if the intuition was supported by numbers in
the end.  If SPEC also shows no change then we can probably drop it
(unless someone objects).

Let's leave this patch until that's resolved though, since I think as
it stands the patch does leave -Os -mtune=xgene1 worse off (bigger
code).  Handling the tune in the meantime would also be OK.
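(To spell out the arithmetic behind my 48: MAX_SET_SIZE's -Os budget is
1 MOVI/DUP plus 3 stores, i.e. the same instruction count as a call.
With Q-register STP each store covers 32 bytes, giving 3 * 32 = 96;
without Q-register STP the largest single store is 16 bytes, giving
3 * 16 = 48.)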
BTW, just noticed, but...

>
> Cheers,
> Wilco
>
>
> Here is v4 (move MAX_SET_SIZE definition to aarch64.cc):
>
> Cleanup memset implementation.  Similar to memcpy/memmove, use an offset
> and bytes throughout.  Simplify the complex calculations when optimizing
> for size by using a fixed limit.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog:
>	* config/aarch64/aarch64.cc (MAX_SET_SIZE): New define.
>	(aarch64_progress_pointer): Remove function.
>	(aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
>	(aarch64_expand_setmem): Clean up implementation, use byte offsets,
>	simplify size calculation.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index a5a6b52730d6c5013346d128e89915883f1707ae..62f4eee429c1c5195d54604f1d341a8a5a499d89 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -101,6 +101,10 @@
>  /* Defined for convenience.  */
>  #define POINTER_BYTES (POINTER_SIZE / BITS_PER_UNIT)
>  
> +/* Maximum bytes set for an inline memset expansion.  With -Os use 3 STP
> +   and 1 MOVI/DUP (same size as a call).  */
> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
> +
>  /* Flags that describe how a function shares certain architectural state
>     with its callers.
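(To make the "same size as a call" comment above concrete: at -Os a
96-byte memset of a nonzero constant should expand to roughly the four
instructions below, the same count as the three argument moves plus bl
of a libcall.  This is a sketch of the expected codegen, not guaranteed
output:

	movi	v0.16b, #42
	stp	q0, q0, [x0]
	stp	q0, q0, [x0, 32]
	stp	q0, q0, [x0, 64]

)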
> @@ -26321,15 +26325,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
> 			      next, amount);
>  }
>  
> -/* Return a new RTX holding the result of moving POINTER forward by the
> -   size of the mode it points to.  */
> -
> -static rtx
> -aarch64_progress_pointer (rtx pointer)
> -{
> -  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
> -}
> -
>  typedef auto_vec<std::pair<rtx, rtx>, 12> copy_ops;
>  
>  /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
> @@ -26484,45 +26479,21 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
>    return true;
>  }
>  
> -/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
> -   SRC is a register we have created with the duplicated value to be set.  */
> +/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
>  static void
> -aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> -					    machine_mode mode)
> +aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
>  {
> -  /* If we are copying 128bits or 256bits, we can do that straight from
> -     the SIMD register we prepared.  */
> -  if (known_eq (GET_MODE_BITSIZE (mode), 256))
> -    {
> -      mode = GET_MODE (src);
> -      /* "Cast" the *dst to the correct mode.  */
> -      *dst = adjust_address (*dst, mode, 0);
> -      /* Emit the memset.  */
> -      emit_insn (aarch64_gen_store_pair (*dst, src, src));
> -
> -      /* Move the pointers forward.  */
> -      *dst = aarch64_move_pointer (*dst, 32);
> -      return;
> -    }
> -  if (known_eq (GET_MODE_BITSIZE (mode), 128))
> +  /* Emit explicit store pair instructions for 32-byte writes.  */
> +  if (known_eq (GET_MODE_SIZE (mode), 32))
>      {
> -      /* "Cast" the *dst to the correct mode.  */
> -      *dst = adjust_address (*dst, GET_MODE (src), 0);
> -      /* Emit the memset.  */
> -      emit_move_insn (*dst, src);
> -      /* Move the pointers forward.  */
> -      *dst = aarch64_move_pointer (*dst, 16);
> +      mode = V16QImode;
> +      rtx dst1 = adjust_address (dst, mode, offset);
> +      emit_insn (aarch64_gen_store_pair (dst1, src, src));
>        return;
>      }
> -  /* For copying less, we have to extract the right amount from src.  */
> -  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
> -
> -  /* "Cast" the *dst to the correct mode.  */
> -  *dst = adjust_address (*dst, mode, 0);
> -  /* Emit the memset.  */
> -  emit_move_insn (*dst, reg);
> -  /* Move the pointer forward.  */
> -  *dst = aarch64_progress_pointer (*dst);
> +  if (known_lt (GET_MODE_SIZE (mode), 16))
> +    src = lowpart_subreg (mode, src, GET_MODE (src));
> +  emit_move_insn (adjust_address (dst, mode, offset), src);
>  }
>  
>  /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
> @@ -26551,7 +26522,7 @@ aarch64_expand_setmem_mops (rtx *operands)
>  bool
>  aarch64_expand_setmem (rtx *operands)
>  {
> -  int n, mode_bits;
> +  int mode_bytes;
>    unsigned HOST_WIDE_INT len;
>    rtx dst = operands[0];
>    rtx val = operands[2], src;
> @@ -26564,11 +26535,9 @@
>        || (STRICT_ALIGNMENT && align < 16))
>      return aarch64_expand_setmem_mops (operands);
>  
> -  bool size_p = optimize_function_for_size_p (cfun);
> -
>    /* Default the maximum to 256-bytes when considering only libcall vs
>       SIMD broadcast sequence.  */

...this comment should be deleted along with the code it's describing.
Don't respin just for that though :)

Thanks,
Richard

> -  unsigned max_set_size = 256;
> +  unsigned max_set_size = MAX_SET_SIZE (optimize_function_for_speed_p (cfun));
>    unsigned mops_threshold = aarch64_mops_memset_size_threshold;
>  
>    len = UINTVAL (operands[1]);
>  
> @@ -26577,91 +26546,55 @@
>    if (len > max_set_size || (TARGET_MOPS && len > mops_threshold))
>      return aarch64_expand_setmem_mops (operands);
>  
> -  int cst_val = !!(CONST_INT_P (val) && (INTVAL (val) != 0));
> -  /* The MOPS sequence takes:
> -     3 instructions for the memory storing
> -     + 1 to move the constant size into a reg
> -     + 1 if VAL is a non-zero constant to move into a reg
> -     (zero constants can use XZR directly).  */
> -  unsigned mops_cost = 3 + 1 + cst_val;
> -  /* A libcall to memset in the worst case takes 3 instructions to prepare
> -     the arguments + 1 for the call.  */
> -  unsigned libcall_cost = 4;
> -
> -  /* Attempt a sequence with a vector broadcast followed by stores.
> -     Count the number of operations involved to see if it's worth it
> -     against the alternatives.  A simple counter simd_ops on the
> -     algorithmically-relevant operations is used rather than an rtx_insn count
> -     as all the pointer adjusmtents and mode reinterprets will be optimized
> -     away later.  */
> -  start_sequence ();
> -  unsigned simd_ops = 0;
> -
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
>  
>    /* Prepare the val using a DUP/MOVI v0.16B, val.  */
>    src = expand_vector_broadcast (V16QImode, val);
>    src = force_reg (V16QImode, src);
> -  simd_ops++;
> -  /* Convert len to bits to make the rest of the code simpler.  */
> -  n = len * BITS_PER_UNIT;
>  
> -  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> -     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> -  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> -			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -			  ? GET_MODE_BITSIZE (TImode) : 256;
> +  /* Set maximum amount to write in one go.  We allow 32-byte chunks based
> +     on the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> +  unsigned set_max = 32;
>  
> -  while (n > 0)
> +  if (len <= 24 || (aarch64_tune_params.extra_tuning_flags
> +		    & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> +    set_max = 16;
> +
> +  int offset = 0;
> +  while (len > 0)
>      {
>        /* Find the largest mode in which to do the copy without
>  	 over writing.  */
>        opt_scalar_int_mode mode_iter;
>        FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
> -	if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
> +	if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (len, set_max))
>  	  cur_mode = mode_iter.require ();
>  
>        gcc_assert (cur_mode != BLKmode);
>  
> -      mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> -      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> -      simd_ops++;
> -      n -= mode_bits;
> +      mode_bytes = GET_MODE_SIZE (cur_mode).to_constant ();
> +
> +      /* Prefer Q-register accesses for the last bytes.  */
> +      if (mode_bytes == 16)
> +	cur_mode = V16QImode;
> +
> +      aarch64_set_one_block (src, dst, offset, cur_mode);
> +      len -= mode_bytes;
> +      offset += mode_bytes;
>  
>        /* Emit trailing writes using overlapping unaligned accesses
> -	 (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
> -      if (n > 0 && n < copy_limit / 2 && !STRICT_ALIGNMENT)
> +	 (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
> +      if (len > 0 && len < set_max / 2 && !STRICT_ALIGNMENT)
>  	{
> -	  next_mode = smallest_mode_for_size (n, MODE_INT);
> -	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> -	  gcc_assert (n_bits <= mode_bits);
> -	  dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
> -	  n = n_bits;
> +	  next_mode = smallest_mode_for_size (len * BITS_PER_UNIT, MODE_INT);
> +	  int n_bytes = GET_MODE_SIZE (next_mode).to_constant ();
> +	  gcc_assert (n_bytes <= mode_bytes);
> +	  offset -= n_bytes - len;
> +	  len = n_bytes;
>  	}
>      }
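(If I'm reading the new loop right, a 41-byte set with the default
set_max of 32 illustrates the overlap trick nicely, assuming a nonzero
constant broadcast into v0:

	stp	q0, q0, [x0]		// bytes 0-31; 9 bytes left
	str	q0, [x0, 25]		// bytes 25-40, overlapping 25-31

i.e. the 9 trailing bytes are rounded up to a single 16-byte store whose
offset is pulled back by n_bytes - len = 7, rather than being split into
separate 8-byte and 1-byte stores.)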
> -  rtx_insn *seq = get_insns ();
> -  end_sequence ();
> -
> -  if (size_p)
> -    {
> -      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> -	 call to memset or the MOPS expansion.  */
> -      if (TARGET_MOPS
> -	  && mops_cost <= libcall_cost
> -	  && mops_cost <= simd_ops)
> -	return aarch64_expand_setmem_mops (operands);
> -      /* If MOPS is not available or not shorter pick a libcall if the SIMD
> -	 sequence is too long.  */
> -      else if (libcall_cost < simd_ops)
> -	return false;
> -      emit_insn (seq);
> -      return true;
> -    }
> -
> -  /* At this point the SIMD broadcast sequence is the best choice when
> -     optimizing for speed.  */
> -  emit_insn (seq);
>    return true;
>  }
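As a quick way to eyeball the expansions (a hypothetical smoke test,
not part of the patch), something like

	void f96 (char *p) { __builtin_memset (p, 42, 96); }
	void f41 (char *p) { __builtin_memset (p, 42, 41); }

compiled at -O2 and at -Os should show the MOVI/DUP plus STP/STR
sequences discussed above.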