From: Richard Biener
Date: Thu, 12 Jan 2023 14:18:22 +0100
Subject: Re: [RFC] Introduce -finline-memset-loops
To: Alexandre Oliva
Cc: gcc-patches@gcc.gnu.org

On Tue, Dec 27, 2022 at 5:12 AM Alexandre Oliva via Gcc-patches wrote:
>
> try_store_by_multiple_pieces was added not long ago, enabling
> variable-sized memset to be expanded inline when the worst-case
> in-range constant length would be, using conditional blocks with
> powers of two to cover all possibilities of length and alignment.
>
> This patch extends the memset expansion to start with a loop, so as to
> still take advantage of known alignment even with long lengths, but
> without necessarily adding store blocks for every power of two.
>
> This makes it possible for any memset call to be expanded, even if
> storing a single byte per iteration.  Surely efficient implementations
> of memset can do better, with a pre-loop to increase alignment, but
> that would likely be excessive for inline expansions of memset.
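
If I follow the approach correctly, the code this expansion aims to
produce has roughly the shape of the hand-written C sketch below.  It is
for illustration only, not the patch's actual output, and the function
name, the block sizes, and the assumption that the length is a known
multiple of 8 are all invented for the example:

  /* Illustrative sketch of a loop-based inline memset expansion: a loop
     over the largest block size replaces the conditional store blocks
     for all the higher powers of two, while the tail is still handled
     by conditional power-of-two blocks.  */
  void *
  inline_memset_loop_sketch (void *p, int c, unsigned long len)
  {
    unsigned char *dst = p;
    /* Replicate the byte value into an 8-byte word.  */
    unsigned long long word = 0x0101010101010101ULL * (unsigned char) c;

    /* Loop on the largest block instead of emitting one conditional
       block per power of two up to the maximum possible length.  */
    while (len >= 32)
      {
        __builtin_memcpy (dst, &word, 8);
        __builtin_memcpy (dst + 8, &word, 8);
        __builtin_memcpy (dst + 16, &word, 8);
        __builtin_memcpy (dst + 24, &word, 8);
        dst += 32;
        len -= 32;
      }

    /* Conditional blocks cover the remaining powers of two, as in the
       pre-existing expansion.  */
    if (len & 16)
      {
        __builtin_memcpy (dst, &word, 8);
        __builtin_memcpy (dst + 8, &word, 8);
        dst += 16;
      }
    if (len & 8)
      __builtin_memcpy (dst, &word, 8);

    return p;
  }
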
> Still, in some cases, users prefer to inline memset, even if it's not
> as performant, or when it's known to be performant in ways the
> compiler can't tell, to avoid depending on a C runtime library.
>
> With this flag, global or per-function optimizations may selectively
> request inline expansion of memset, while the infrastructure for that
> may enable us to introduce per-target tuning that enables such looping
> whenever it is advantageous, even if not explicitly requested.
>
>
> I realize this is late for new features in this cycle; I'd be happy to
> submit it again later, but I wonder whether there's any interest in
> this feature, or any objections to it.  FWIW, I've regstrapped this on
> x86_64-linux-gnu, and also tested earlier versions of this patch in
> earlier GCC branches with RISC-V crosses.  Is this ok for GCC 14?
> Maybe even simple enough for GCC 13, considering it's disabled by
> default?

I wonder if that isn't better handled by targets via the setmem pattern,
like x86 has with its stringop inline strategy.  What is considered
acceptable in terms of size or performance will vary, and I don't think
there's much room for improvement in this generic code support?

Richard.
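
For comparison (and assuming I recall the x86 options correctly), the
existing setmem strategy control can already force an inline loop on
x86: compiling the example below, which is not taken from the patch,
with -O2 -minline-all-stringops -mstringop-strategy=loop should expand
the variable-length memset as an inline loop instead of a libcall:

  void *
  clear_buffer (void *p, unsigned long n)
  {
    /* With the target options above, this variable-length memset is
       expected to expand to an inline loop rather than a call to the
       library memset.  */
    return __builtin_memset (p, 0, n);
  }
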
> TIA,
>
>
> for gcc/ChangeLog
>
>         * builtins.cc (try_store_by_multiple_pieces): Support starting
>         with a loop.
>         * common.opt (finline-memset-loops): New.
>         * doc/invoke.texi (-finline-memset-loops): Add.
>
> for gcc/testsuite/ChangeLog
>
>         * gcc.dg/torture/inline-mem-set-1.c: New.
> ---
>  gcc/builtins.cc                                 | 50 ++++++++++++++++++++++-
>  gcc/common.opt                                  |  4 ++
>  gcc/doc/invoke.texi                             | 13 ++++++
>  gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c | 14 ++++++
>  4 files changed, 77 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 02c4fefa86f48..388bae58ce49e 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -4361,9 +4361,37 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>    if (max_bits >= 0)
>      xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
>                 - (HOST_WIDE_INT_1U << ctz_len));
> +  bool max_loop = false;
>    if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
>                              &valc, align, true))
> -    return false;
> +    {
> +      if (!flag_inline_memset_loops)
> +       return false;
> +      while (--max_bits >= sctz_len)
> +       {
> +         xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
> +                    - (HOST_WIDE_INT_1U << ctz_len));
> +         if (can_store_by_pieces (xlenest + blksize,
> +                                  builtin_memset_read_str,
> +                                  &valc, align, true))
> +           {
> +             max_loop = true;
> +             break;
> +           }
> +         if (!blksize)
> +           continue;
> +         if (can_store_by_pieces (xlenest,
> +                                  builtin_memset_read_str,
> +                                  &valc, align, true))
> +           {
> +             blksize = 0;
> +             max_loop = true;
> +             break;
> +           }
> +       }
> +      if (!max_loop)
> +       return false;
> +    }
>
>    by_pieces_constfn constfun;
>    void *constfundata;
> @@ -4405,6 +4433,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>       the least significant bit possibly set in the length.  */
>    for (int i = max_bits; i >= sctz_len; i--)
>      {
> +      rtx_code_label *loop_label = NULL;
>        rtx_code_label *label = NULL;
>        blksize = HOST_WIDE_INT_1U << i;
>
> @@ -4423,14 +4452,24 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>        else if ((max_len & blksize) == 0)
>         continue;
>
> +      if (max_loop && i == max_bits)
> +       {
> +         loop_label = gen_label_rtx ();
> +         emit_label (loop_label);
> +         /* Since we may run this multiple times, don't assume we
> +            know anything about the offset.  */
> +         clear_mem_offset (to);
> +       }
> +
>        /* Issue a store of BLKSIZE bytes.  */
> +      bool update_needed = i != sctz_len || loop_label;
>        to = store_by_pieces (to, blksize,
>                              constfun, constfundata,
>                              align, true,
> -                            i != sctz_len ? RETURN_END : RETURN_BEGIN);
> +                            update_needed ? RETURN_END : RETURN_BEGIN);
>
>        /* Adjust REM and PTR, unless this is the last iteration.  */
> -      if (i != sctz_len)
> +      if (update_needed)
>         {
>           emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX));
>           to = replace_equiv_address (to, ptr);
> @@ -4438,6 +4477,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>           emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX));
>         }
>
> +      if (loop_label)
> +       emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL,
> +                                ptr_mode, 1, loop_label,
> +                                profile_probability::likely ());
> +
>        if (label)
>         {
>           emit_label (label);
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 562d73d7f552a..c28af170be896 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1874,6 +1874,10 @@ finline-atomics
>  Common Var(flag_inline_atomics) Init(1) Optimization
>  Inline __atomic operations when a lock free instruction sequence is available.
>
> +finline-memset-loops
> +Common Var(flag_inline_memset_loops) Init(0) Optimization
> +Inline memset even if it requires loops.
> +
>  fcf-protection
>  Common RejectNegative Alias(fcf-protection=,full)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index da9ad1068fbf6..19f436ad46385 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol
>  -fif-conversion2 -findirect-inlining @gol
>  -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol
> --finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone @gol
> +-finline-memset-loops -finline-small-functions @gol
> +-fipa-modref -fipa-cp -fipa-cp-clone @gol
>  -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const @gol
>  -fipa-reference -fipa-reference-addressable @gol
>  -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} @gol
> @@ -11960,6 +11961,16 @@ in its own right.
>  Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os},
>  but not @option{-Og}.
>
> +@item -finline-memset-loops
> +@opindex finline-memset-loops
> +Expand @code{memset} calls inline, even when the length is variable or
> +big enough as to require looping.  This may enable the compiler to take
> +advantage of known alignment and length multipliers, but it will often
> +generate code that is less efficient than performant implementations of
> +@code{memset}, and grow code size so much that even a less performant
> +@code{memset} may run faster due to better use of the code cache.  This
> +option is disabled by default.
> +
>  @item -fearly-inlining
>  @opindex fearly-inlining
>  Inline functions marked by @code{always_inline} and functions whose body seems
> diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> new file mode 100644
> index 0000000000000..73bd1025f191f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */
> +
> +void *zero (unsigned long long (*p)[32], int n)
> +{
> +  return __builtin_memset (p, 0, n * sizeof (*p));
> +}
> +
> +void *ones (char (*p)[128], int n)
> +{
> +  return __builtin_memset (p, -1, n * sizeof (*p));
> +}
> +
> +/* { dg-final { scan-assembler-not "memset" } } */
>
> --
> Alexandre Oliva, happy hacker                   https://FSFLA.org/blogs/lxo/
>    Free Software Activist                          GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about