In-Reply-To: <20230210035312.1630020-1-apinski@marvell.com>
From: Andrew Pinski
Date: Fri, 3 Mar 2023 10:28:15 -0800
Subject: Re: [PATCHv4] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
To: Andrew Pinski, Richard Sandiford
Cc: gcc-patches@gcc.gnu.org

On Thu, Feb 9, 2023 at 7:54 PM Andrew Pinski via Gcc-patches wrote:
>
> The problem here is that aarch64_expand_setmem does not change the
> alignment for the strict-alignment case.
> This is version 4 of the fix; the major change from the last version is
> fixing the way store pairs are handled, which allows storing 2 SImode
> values at a time.
> This also adds a testcase to show a case where, with -mstrict-align,
> we can still do store word pair stores.
>
> OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

Ping?

>
>         PR target/103100
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64.cc (aarch64_gen_store_pair):
>         Add support for SImode.
>         (aarch64_set_one_block_and_progress_pointer):
>         Add use_pairs argument and rewrite to simplify the code.
>         (aarch64_can_use_pair_load_stores): New function.
>         (aarch64_expand_setmem): Rewrite mode selection to
>         better handle strict alignment and the non-ldp/stp case.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/memset-strict-align-1.c: Update test.
>         Reduce the size down to 207 and make s1 global and aligned
>         to 16 bytes.
>         * gcc.target/aarch64/memset-strict-align-2.c: New test.
>         * gcc.target/aarch64/memset-strict-align-3.c: New test.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 136 ++++++++++--------
>  .../aarch64/memset-strict-align-1.c           |  19 ++-
>  .../aarch64/memset-strict-align-2.c           |  14 ++
>  .../aarch64/memset-strict-align-3.c           |  15 ++
>  4 files changed, 113 insertions(+), 71 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5c40b6ed22a..3eaf9bd608a 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8850,6 +8850,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
>  {
>    switch (mode)
>      {
> +    case E_SImode:
> +      return gen_store_pair_sw_sisi (mem1, reg1, mem2, reg2);
> +
>      case E_DImode:
>        return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
>
> @@ -24896,42 +24899,49 @@ aarch64_expand_cpymem (rtx *operands)
>     SRC is a register we have created with the duplicated value to be set.  */
>  static void
>  aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> -                                            machine_mode mode)
> +                                            machine_mode mode, bool use_pairs)
>  {
> +  rtx reg = src;
>    /* If we are copying 128bits or 256bits, we can do that straight from
>       the SIMD register we prepared.  */
> -  if (known_eq (GET_MODE_BITSIZE (mode), 256))
> -    {
> -      mode = GET_MODE (src);
> -      /* "Cast" the *dst to the correct mode.  */
> -      *dst = adjust_address (*dst, mode, 0);
> -      /* Emit the memset.  */
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> -                                         aarch64_progress_pointer (*dst), src));
> -
> -      /* Move the pointers forward.  */
> -      *dst = aarch64_move_pointer (*dst, 32);
> -      return;
> -    }
>    if (known_eq (GET_MODE_BITSIZE (mode), 128))
> -    {
> -      /* "Cast" the *dst to the correct mode.  */
> -      *dst = adjust_address (*dst, GET_MODE (src), 0);
> -      /* Emit the memset.  */
> -      emit_move_insn (*dst, src);
> -      /* Move the pointers forward.  */
> -      *dst = aarch64_move_pointer (*dst, 16);
> -      return;
> -    }
> -  /* For copying less, we have to extract the right amount from src.  */
> -  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
> +    mode = GET_MODE (src);
> +  else
> +    /* For copying less, we have to extract the right amount from src.  */
> +    reg = lowpart_subreg (mode, src, GET_MODE (src));
>
>    /* "Cast" the *dst to the correct mode.  */
>    *dst = adjust_address (*dst, mode, 0);
>    /* Emit the memset.  */
> -  emit_move_insn (*dst, reg);
> +  if (use_pairs)
> +    emit_insn (aarch64_gen_store_pair (mode, *dst, reg,
> +                                       aarch64_progress_pointer (*dst),
> +                                       reg));
> +  else
> +    emit_move_insn (*dst, reg);
> +
>    /* Move the pointer forward.  */
>    *dst = aarch64_progress_pointer (*dst);
> +  if (use_pairs)
> +    *dst = aarch64_progress_pointer (*dst);
> +}
> +
> +/* Returns true if stores of SIZE bits can use a store/load pair.
> +   This is a helper function for aarch64_expand_setmem and others.  */
> +static bool
> +aarch64_can_use_pair_load_stores (unsigned HOST_WIDE_INT size)
> +{
> +  /* For DI and SI modes, we can use store pairs.  */
> +  if (size == GET_MODE_BITSIZE (DImode)
> +      || size == GET_MODE_BITSIZE (SImode))
> +    return true;
> +  /* For TI mode, we will use store pairs only if
> +     the target wants to.  */
> +  else if (size == GET_MODE_BITSIZE (TImode)
> +           && !(aarch64_tune_params.extra_tuning_flags
> +                & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> +    return true;
> +  return false;
>  }
>
>  /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
> @@ -24974,9 +24984,21 @@ aarch64_expand_setmem (rtx *operands)
>
>    bool size_p = optimize_function_for_size_p (cfun);
>
> -  /* Default the maximum to 256-bytes when considering only libcall vs
> -     SIMD broadcast sequence.  */
> -  unsigned max_set_size = 256;
> +  /* Maximum amount to copy in one go (without taking store pairs into account).  */
> +  int copy_limit = GET_MODE_BITSIZE (TImode);
> +
> +  /* For the strict-alignment case, restrict the copy limit to the
> +     compile-time known alignment of the memory.  */
> +  if (STRICT_ALIGNMENT)
> +    copy_limit = MIN (copy_limit, (int) MEM_ALIGN (dst));
> +
> +  bool use_pairs = aarch64_can_use_pair_load_stores (copy_limit);
> +
> +  /* The max set size is 8 instructions of copy_limit-sized stores
> +     (also taking into account whether pair stores are used);
> +     for the non-strict-alignment case this is 256 bytes.  */
> +  unsigned max_set_size;
> +  max_set_size = (copy_limit * (use_pairs ? 2 : 1) * 8) / BITS_PER_UNIT;
>
>    len = INTVAL (operands[1]);
>    if (len > max_set_size && !TARGET_MOPS)
> @@ -24990,8 +25012,10 @@ aarch64_expand_setmem (rtx *operands)
>       (zero constants can use XZR directly).  */
>    unsigned mops_cost = 3 + 1 + cst_val;
>    /* A libcall to memset in the worst case takes 3 instructions to prepare
> -     the arguments + 1 for the call.  */
> -  unsigned libcall_cost = 4;
> +     the arguments + 1 for the call.
> +     When not optimizing for size, the cost of doing a libcall
> +     is 17 instructions.  */
> +  unsigned libcall_cost = size_p ? 4 : 17;
>
>    /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
>       when available.  */
> @@ -25001,12 +25025,12 @@ aarch64_expand_setmem (rtx *operands)
>
>    /* Attempt a sequence with a vector broadcast followed by stores.
>       Count the number of operations involved to see if it's worth it
> -     against the alternatives.  A simple counter simd_ops on the
> +     against the alternatives.  A simple counter inlined_ops on the
>       algorithmically-relevant operations is used rather than an rtx_insn count
>       as all the pointer adjustments and mode reinterprets will be optimized
>       away later.  */
>    start_sequence ();
> -  unsigned simd_ops = 0;
> +  unsigned inlined_ops = 0;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
>
>    /* Prepare the val using a DUP/MOVI v0.16B, val.  */
>    src = expand_vector_broadcast (V16QImode, val);
>    src = force_reg (V16QImode, src);
> -  simd_ops++;
> +  inlined_ops++;
>    /* Convert len to bits to make the rest of the code simpler.  */
>    n = len * BITS_PER_UNIT;
>
> -  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> -     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
> -  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
> -                          & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -                          ? GET_MODE_BITSIZE (TImode) : 256;
> -
>    while (n > 0)
>      {
>        /* Find the largest mode in which to do the copy without
> @@ -25036,15 +25054,18 @@ aarch64_expand_setmem (rtx *operands)
>        gcc_assert (cur_mode != BLKmode);
>
>        mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> -      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> -      simd_ops++;
> -      n -= mode_bits;
> +      bool pairs = aarch64_can_use_pair_load_stores (mode_bits);
> +      if (n < (mode_bits * 2))
> +        pairs = false;
> +      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode, pairs);
> +      inlined_ops++;
> +      n -= mode_bits * (pairs ? 2 : 1);
>
>        /* Do certain trailing copies as overlapping if it's going to be
>           cheaper, i.e. fewer instructions to do so.  For instance doing a 15
>           byte copy it's more efficient to do two overlapping 8 byte copies than
>           8 + 4 + 2 + 1.  Only do this when -mstrict-align is not supplied.  */
> -      if (n > 0 && n < copy_limit / 2 && !STRICT_ALIGNMENT)
> +      if (n > 0 && n < copy_limit && !STRICT_ALIGNMENT)
>          {
>            next_mode = smallest_mode_for_size (n, MODE_INT);
>            int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> @@ -25056,24 +25077,17 @@ aarch64_expand_setmem (rtx *operands)
>    rtx_insn *seq = get_insns ();
>    end_sequence ();
>
> -  if (size_p)
> -    {
> -      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
> -         call to memset or the MOPS expansion.  */
> -      if (TARGET_MOPS
> -          && mops_cost <= libcall_cost
> -          && mops_cost <= simd_ops)
> -        return aarch64_expand_setmem_mops (operands);
> -      /* If MOPS is not available or not shorter pick a libcall if the SIMD
> -         sequence is too long.  */
> -      else if (libcall_cost < simd_ops)
> -        return false;
> -      emit_insn (seq);
> -      return true;
> -    }
> +  /* When optimizing for size we have 3 options: the inlined sequence,
> +     call to memset or the MOPS expansion.  */
> +  if (size_p
> +      && TARGET_MOPS
> +      && mops_cost <= libcall_cost
> +      && mops_cost <= inlined_ops)
> +    return aarch64_expand_setmem_mops (operands);
> +  /* Pick a libcall if the inlined sequence is too long.  */
> +  else if (libcall_cost < inlined_ops)
> +    return false;
>
> -  /* At this point the SIMD broadcast sequence is the best choice when
> -     optimizing for speed.  */
>    emit_insn (seq);
>    return true;
>  }
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> index 664d43aee13..63c864b25b0 100644
> --- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
> @@ -1,28 +1,27 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mstrict-align" } */
>
> -struct s { char x[255]; };
> +struct s { char x[207]; };
> +struct s s1 __attribute__((aligned(16)));
>  void foo (struct s *);
> -void bar (void) { struct s s1 = {}; foo (&s1); }
> +void bar (void) { s1 = (struct s){}; foo (&s1); }
>
> -/* memset (s1 = {}, sizeof = 255) should be expanded out
> +/* memset (s1 = {}, sizeof = 207) should be expanded out
>     such that there are no overlap stores when -mstrict-align
>     is in use.
> -   so 7 pairs of 16 bytes stores (224 bytes).
> -   1 16 byte stores
> +   so 6 pairs of 16 byte stores (192 bytes),
>     1 8 byte store
>     1 4 byte store
>     1 2 byte store
>     1 1 byte store
>  */
>
> -/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
> -/* { dg-final { scan-assembler-times "str\tq" 1 } } */
> +/* { dg-final { scan-assembler-times "stp\tq" 6 } } */
> +/* { dg-final { scan-assembler-times "str\tq" 0 } } */
>  /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
>  /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
>  /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
>
> -/* Also one store pair for the frame-pointer and the LR.  */
> -/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
> -
> +/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
> +/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> new file mode 100644
> index 00000000000..650be86604b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstrict-align" } */
> +
> +struct s { char x[7]; };
> +void foo (struct s *);
> +void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
> +
> +/* memset (s1 = {}, sizeof = 7) should be expanded out
> +   such that there are no overlap stores when -mstrict-align
> +   is in use.  As the alignment of s1 is unknown, byte stores are needed,
> +   so 7 byte stores.
> + */
> +
> +/* { dg-final { scan-assembler-times "strb\twzr" 7 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
> new file mode 100644
> index 00000000000..09cb9a654dc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-3.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os -mstrict-align" } */
> +
> +int a[2*3];
> +int f(int t)
> +{
> +  __builtin_memset(a, t, 2*3*sizeof(int));
> +}
> +
> +/* memset (a, (val), sizeof = 2*3*4) should be expanded out
> +   such that there are no overlap stores when -mstrict-align
> +   is in use.  As the alignment of a is 4 bytes (due to -Os),
> +   store word pairs are needed, so 3 stp are in use.  */
> +
> +/* { dg-final { scan-assembler-times "stp\ts" 3 } } */
> --
> 2.31.1
>
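Not part of the patch, but as a sanity check on the store counts the three testcases expect, here is a rough Python model of the chunk-selection loop after this change. The names and simplifications are mine, and it only models the strict-alignment path (no overlapping trailing stores, and TImode pairs assumed allowed by the tuning):

```python
def setmem_chunks(len_bytes, align_bytes, allow_qreg_pairs=True):
    """Model the chunk selection of the patched aarch64_expand_setmem
    under -mstrict-align.  Returns a list of (insn, mode_bits) tuples."""
    modes = [128, 64, 32, 16, 8]            # TI, DI, SI, HI, QI in bits
    copy_limit = min(128, align_bytes * 8)  # limited by known alignment
    n = len_bytes * 8                       # remaining length in bits
    out = []
    while n > 0:
        # Largest mode that fits both the remaining length and the limit.
        mode = next(m for m in modes if m <= min(n, copy_limit))
        # aarch64_can_use_pair_load_stores: SI/DI always pair,
        # TI only if the tuning allows Q-register pairs.
        pairs = mode in (32, 64) or (mode == 128 and allow_qreg_pairs)
        if n < 2 * mode:                    # not enough left for a pair
            pairs = False
        out.append(("stp" if pairs else "str", mode))
        n -= mode * (2 if pairs else 1)
    return out

# memset-strict-align-1.c: 207 bytes, 16-byte aligned
# -> 6 x stp q, then str x / str w / strh / strb
print(setmem_chunks(207, 16))
# memset-strict-align-2.c: 7 bytes, unknown (1-byte) alignment -> 7 x strb
print(setmem_chunks(7, 1))
# memset-strict-align-3.c: 24 bytes, 4-byte aligned (-Os) -> 3 x stp s
print(setmem_chunks(24, 4))
```

For 207 bytes at 16-byte alignment this reproduces the breakdown scanned for in memset-strict-align-1.c (6 `stp q` plus one each of 8/4/2/1 byte stores), and likewise the 7 `strb` and 3 `stp s` counts of the other two tests.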