From: Richard Biener
Date: Thu, 19 Jan 2023 12:58:24 +0100
Subject: Re: [RFC] Introduce -finline-memset-loops
To: Alexandre Oliva
Cc: gcc-patches@gcc.gnu.org

On Thu, Jan 19, 2023 at 12:25 PM Alexandre Oliva wrote:
>
> On Jan 16, 2023, Richard Biener wrote:
>
> > On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva wrote:
> >> Target-specific code is great for tight optimizations, but the main
> >> purpose of this feature is not an optimization.  AFAICT it actually
> >> slows things down in general (due to code growth, and to conservative
> >> assumptions about alignment), except perhaps for some microbenchmarks.
> >> It's rather a means to avoid depending on the C runtime, particularly
> >> due to compiler-introduced memset calls.
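
For illustration, one common source of such compiler-introduced calls is
the zero-initialization of a large aggregate; a minimal sketch (the type
and function names are made up):

    struct big { char buf[4096]; };

    void init_big (struct big *p)
    {
      /* No memset appears in the source, yet GCC may lower this
         zero-initialization to a call to memset, which must then be
         provided by a C runtime even under -ffreestanding.  */
      *p = (struct big) { 0 };
    }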
>
> > OK, that's what I guessed but you didn't spell out.  So does it make sense
> > to mention -ffreestanding in the docs at least?  My fear is that we'd get
> > complaints that -O3 -finline-memset-loops turns nicely optimized memset
> > loops into dumb ones (via loop distribution and then stupid re-expansion).
> > So does it also make sense to turn off -floop-distribute-patterns[-memset]
> > with -finline-memset-loops?
>
> I don't think they should be tied together.  Verbose as it is, the
> expansion of memset is a sort of local optimum given what the compiler
> knows about length and alignment: minimizing what can be minimized,
> namely the compare count, by grouping stores in large straight-line
> blocks.
>
> Though an optimized memset could in theory perform better, whether
> through ifuncs or by bumping alignment, if you're tuning generated code
> for a specific target and you know dest is aligned, the generated code
> can likely beat a general-purpose optimized memset, even if by a thin
> margin: it skips the code that the general-purpose memset would have to
> run to detect the alignment and realize it doesn't need to be bumped,
> and to extend the byte value to be stored to wider modes.
>
> So I can envision cases in which -floop-distribute-patterns could turn
> highly inefficient stores into a memset with known-strict alignment and
> length multiplier, that could then be profitably expanded inline so as
> to take advantage of both for performance reasons.
>
> Indeed, when I started working on this, I thought the issue was
> performance, and this led me to pursue the store-by-multiple-pieces
> logic.  It can indeed bring about performance improvements, both over
> generic-target and highly-optimized memset implementations.  But it can
> also be put to use to avoid C runtime calls.  So while I wouldn't
> suggest enabling it by default at any optimization level, I wouldn't tie
> it to the single purpose of freestanding environments either.
>
>
> >> My initial goal was to be able to show that inline expansion would NOT
> >> bring about performance improvements, but performance was not the
> >> concern that led to the request.
> >>
> >> If the approach seems generally acceptable, I may even end up extending
> >> it to other such builtins.  I have a vague recollection that memcmp is
> >> also an issue for us.
>
> > The C/C++ runtime produce at least memmove, memcpy and memcmp as well.
>
> *nod*.  The others are far more challenging to expand inline in a way
> that could potentially be more performant:
>
> - memcmp can only do by_pieces when testing for equality, presumably
> because grouping multiple bytes into larger words to speed things up
> won't always get you the expected result if you just subtract the larger
> words; endianness reversal prior to subtracting might be required, which
> would harm performance.  I don't see that using similar
> power-of-two-sizes grouping strategies to minimize looping overheads
> would be so advantageous, if at all, given the need for testing and
> branching at every word.
>
> - memcpy seems doable, but all of the block moves other than cpymem
> assume non-overlapping memcpy.  Even if we were to output a test for
> overlap that a naïve expansion would break, and an alternate block to go
> backwards, all of the block copying logic would have to be "oriented" to
> proceed explicitly forward, backward, or don't-care, where currently we
> only have don't-care.
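
For illustration, the forward/backward "orientation" in question, sketched
at the C level (a hand-written sketch, not GCC's actual RTL expansion; the
function name is made up):

    void *move_bytes (char *d, const char *s, unsigned long n)
    {
      if (d <= s || d >= s + n)
        /* No harmful overlap: copying forward is safe.  */
        for (unsigned long i = 0; i < n; i++)
          d[i] = s[i];
      else
        /* The destination overlaps the tail of the source: go backward.  */
        for (unsigned long i = n; i-- > 0; )
          d[i] = s[i];
      return d;
    }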
>
> Though my initial plan, when posting this patch, was to see how well the
> general approach was received, before thinking much about how to apply
> it to the other builtins, now I am concerned that extending it to them
> is more than I can tackle.
>
> Would it make more sense to extend it, even constrained by the
> limitations mentioned above, or handle memset only?  In the latter case,
> would it still make sense to adopt a command-line option that suggests a
> broader effect than it already has, even if it's only a hopeful future
> extension?  -finline-all-stringops[={memset,memcpy,...}], that you
> suggested, seems to be a reasonable and extensible one to adopt.

Well, if the main intent is to avoid relying on a C runtime for calls
generated by the compiler then yes!  Otherwise it would be incomplete.
In that light ...

> >> Is (optionally) tending to this (uncommon, I suppose) need (or
> >> preference?) not something GCC would like to do?
>
> > Sure, I think for the specific intended purpose that would be fine.
>
> Cool!
>
> > It should also only apply to __builtin_memset calls, not to memset
> > calls from user code?
>
> I suppose it could be argued both ways.  The situations that I had in
> mind either already are or could be made __builtin_memset calls, but I
> can't think of reasons to prevent explicit memset calls from such
> expansion.  Do you have any specific purpose in mind?

... when the user writes memcpy () then he's supposed to provide a
definition, even with -ffreestanding.  But one could argue that with
-ffreestanding the compiler is responsible to provide memcpy () if it
itself is introducing calls to it.

> In favor of allowing non-__builtin_ memset to be so expanded, I'll
> mention that I caught a number of corner cases while developing the
> following improved version of the earlier patch (now handling even
> large-constant lengths) by bootstrapping the compiler with the option
> enabled by default.  Without that, testcases would have to be a *lot*
> more thorough.

I can see that; the question is really what the option is targeted at.
For *-linux at least I can't see a reason to use it - performance cannot
be it because glibc is decent.  So the question really boils down to
the intended audience.  If it's possibly extended to cover strcpy, then
for -ffreestanding handling user calls to the functions might make sense,
but where do we stop there?

Richard.

>
> Introduce -finline-memset-loops
>
> From: Alexandre Oliva
>
> try_store_by_multiple_pieces was added not long ago, enabling
> variable-sized memset to be expanded inline when the worst-case
> in-range constant length would, using conditional blocks with powers
> of two to cover all possibilities of length and alignment.
>
> This patch extends the memset expansion to start with a loop, so as to
> still take advantage of known alignment even with long lengths, but
> without necessarily adding store blocks for every power of two.
>
> This makes it possible for any memset call to be expanded, even if
> storing a single byte per iteration.  Surely efficient implementations
> of memset can do better, with a pre-loop to increase alignment, but
> that would likely be excessive for inline expansions of memset.
>
> Still, in some cases, users prefer to inline memset, even if it's not
> as performant, or when it's known to be performant in ways the
> compiler can't tell, to avoid depending on a C runtime library.
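
A rough C-level picture of the shape of code such an expansion produces
(illustrative only; the real expansion is emitted as RTL, with block sizes
chosen from the known alignment and length multiple, here assumed to be 16):

    void *set_bytes (char *p, int c, unsigned long n)  /* n % 16 == 0 assumed */
    {
      char *q = p;
      unsigned long rem = n;
      /* Loop storing blocks of the largest chosen power-of-two size; each
         fixed-size block is emitted as straight-line stores rather than a
         library call.  */
      while (rem >= 64)
        {
          __builtin_memset (q, c, 64);
          q += 64;
          rem -= 64;
        }
      /* Then one conditional block per remaining power of two.  */
      if (rem & 32) { __builtin_memset (q, c, 32); q += 32; }
      if (rem & 16) { __builtin_memset (q, c, 16); q += 16; }
      return p;
    }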
>
> With this flag, global or per-function optimizations may enable inline
> expansion of memset to be selectively requested, while the
> infrastructure for that may enable us to introduce per-target tuning
> to enable such looping whenever advantageous, even if not explicitly
> requested.
>
>
> for gcc/ChangeLog
>
>         * builtins.cc (try_store_by_multiple_pieces): Support starting
>         with a loop.
>         * common.opt (finline-memset-loops): New.
>         * doc/invoke.texi (-finline-memset-loops): Add.
>
> for gcc/testsuite/ChangeLog
>
>         * gcc.dg/torture/inline-mem-set-1.c: New.
> ---
>  gcc/builtins.cc                                 |   99 +++++++++++++++++++++--
>  gcc/common.opt                                  |    4 +
>  gcc/doc/invoke.texi                             |   13 +++
>  gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c |   84 ++++++++++++++++++++
>  4 files changed, 192 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index af45829875e6c..733fe17ede622 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -4322,6 +4322,10 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>    int tst_bits = (max_bits != min_bits ? max_bits
>                    : floor_log2 (max_len ^ min_len));
>
> +  /* Save the pre-blksize values.  */
> +  int orig_max_bits = max_bits;
> +  int orig_tst_bits = tst_bits;
> +
>    /* Check whether it's profitable to start by storing a fixed BLKSIZE
>       bytes, to lower max_bits.  In the unlikely case of a constant LEN
>       (implied by identical MAX_LEN and MIN_LEN), we want to issue a
> @@ -4361,9 +4365,70 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>    if (max_bits >= 0)
>      xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
>                  - (HOST_WIDE_INT_1U << ctz_len));
> -  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
> -                            &valc, align, true))
> -    return false;
> +  bool max_loop = false;
> +  /* Skip the test in case of overflow in xlenest.  It shouldn't
> +     happen because of the way max_bits and blksize are related, but
> +     it doesn't hurt to test.  */
> +  if (blksize > xlenest
> +      || !can_store_by_pieces (xlenest, builtin_memset_read_str,
> +                               &valc, align, true))
> +    {
> +      if (!flag_inline_memset_loops)
> +        return false;
> +
> +      for (max_bits = orig_max_bits;
> +           max_bits >= sctz_len;
> +           --max_bits)
> +        {
> +          xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
> +                     - (HOST_WIDE_INT_1U << ctz_len));
> +          /* Check that blksize plus the bits to be stored as blocks
> +             sized at powers of two can be stored by pieces.  This is
> +             like the test above, but with smaller max_bits.  Skip
> +             orig_max_bits (it would be redundant).  Also skip in case
> +             of overflow.  */
> +          if (max_bits < orig_max_bits
> +              && xlenest + blksize >= xlenest
> +              && can_store_by_pieces (xlenest + blksize,
> +                                      builtin_memset_read_str,
> +                                      &valc, align, true))
> +            {
> +              max_loop = true;
> +              break;
> +            }
> +          if (blksize
> +              && can_store_by_pieces (xlenest,
> +                                      builtin_memset_read_str,
> +                                      &valc, align, true))
> +            {
> +              max_len += blksize;
> +              min_len += blksize;
> +              tst_bits = orig_tst_bits;
> +              blksize = 0;
> +              max_loop = true;
> +              break;
> +            }
> +          if (max_bits == sctz_len)
> +            {
> +              --sctz_len;
> +              --ctz_len;
> +            }
> +        }
> +      if (!max_loop)
> +        return false;
> +      /* If the boundaries are such that min and max may run a
> +         different number of trips in the initial loop, the remainder
> +         needs not be between the moduli, so set tst_bits to cover all
> +         bits.  Otherwise, if the trip counts are the same, max_len
> +         has the common prefix, and the previously-computed tst_bits
> +         is usable.  */
> +      if (max_len >> max_bits > min_len >> max_bits)
> +        tst_bits = max_bits;
> +    }
> +  /* ??? Do we have to check that all powers of two lengths from
> +     max_bits down to ctz_len pass can_store_by_pieces?  As in, could
> +     it possibly be that xlenest passes while smaller power-of-two
> +     sizes don't?  */
>
>    by_pieces_constfn constfun;
>    void *constfundata;
> @@ -4405,7 +4470,9 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>       the least significant bit possibly set in the length.  */
>    for (int i = max_bits; i >= sctz_len; i--)
>      {
> +      rtx_code_label *loop_label = NULL;
>        rtx_code_label *label = NULL;
> +
>        blksize = HOST_WIDE_INT_1U << i;
>
>        /* If we're past the bits shared between min_ and max_len, expand
> @@ -4419,18 +4486,31 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>                                   profile_probability::even ());
>         }
>        /* If we are at a bit that is in the prefix shared by min_ and
> -         max_len, skip this BLKSIZE if the bit is clear.  */
> -      else if ((max_len & blksize) == 0)
> +         max_len, skip the current BLKSIZE if the bit is clear, but do
> +         not skip the loop, even if it doesn't require
> +         prechecking.  */
> +      else if ((max_len & blksize) == 0
> +               && !(max_loop && i == max_bits))
>          continue;
>
> +      if (max_loop && i == max_bits)
> +        {
> +          loop_label = gen_label_rtx ();
> +          emit_label (loop_label);
> +          /* Since we may run this multiple times, don't assume we
> +             know anything about the offset.  */
> +          clear_mem_offset (to);
> +        }
> +
>        /* Issue a store of BLKSIZE bytes.  */
> +      bool update_needed = i != sctz_len || loop_label;
>        to = store_by_pieces (to, blksize,
>                              constfun, constfundata,
>                              align, true,
> -                            i != sctz_len ? RETURN_END : RETURN_BEGIN);
> +                            update_needed ? RETURN_END : RETURN_BEGIN);
>
>        /* Adjust REM and PTR, unless this is the last iteration.  */
> -      if (i != sctz_len)
> +      if (update_needed)
>          {
>            emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX));
>            to = replace_equiv_address (to, ptr);
> @@ -4438,6 +4518,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>            emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX));
>          }
>
> +      if (loop_label)
> +        emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL,
> +                                 ptr_mode, 1, loop_label,
> +                                 profile_probability::likely ());
> +
>        if (label)
>          {
>            emit_label (label);
> diff --git a/gcc/common.opt b/gcc/common.opt
> index d0371aec8db5f..5d28f054241ad 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1874,6 +1874,10 @@ finline-atomics
>  Common Var(flag_inline_atomics) Init(1) Optimization
>  Inline __atomic operations when a lock free instruction sequence is available.
>
> +finline-memset-loops
> +Common Var(flag_inline_memset_loops) Init(0) Optimization
> +Inline memset even if it requires loops.
> +
>  fcf-protection
>  Common RejectNegative Alias(fcf-protection=,full)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 631c00582bf85..8f8d52bbeef68 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol
>  -fif-conversion2 -findirect-inlining @gol
>  -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol
> --finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone @gol
> +-finline-memset-loops -finline-small-functions @gol
> +-fipa-modref -fipa-cp -fipa-cp-clone @gol
>  -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const @gol
>  -fipa-reference -fipa-reference-addressable @gol
>  -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} @gol
> @@ -11961,6 +11962,16 @@ in its own right.
>  Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os},
>  but not @option{-Og}.
>
> +@item -finline-memset-loops
> +@opindex finline-memset-loops
> +Expand @code{memset} calls inline, even when the length is variable or
> +big enough as to require looping.  This may enable the compiler to take
> +advantage of known alignment and length multipliers, but it will often
> +generate code that is less efficient than performant implementations of
> +@code{memset}, and grow code size so much that even a less performant
> +@code{memset} may run faster due to better use of the code cache.  This
> +option is disabled by default.
> +
>  @item -fearly-inlining
>  @opindex fearly-inlining
>  Inline functions marked by @code{always_inline} and functions whose body seems
> diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> new file mode 100644
> index 0000000000000..4de51df006ede
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> @@ -0,0 +1,84 @@
> +/* { dg-do compile } */
> +/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */
> +
> +void *zero (unsigned long long (*p)[32], int n)
> +{
> +  return __builtin_memset (p, 0, n * sizeof (*p));
> +}
> +
> +void *ones (char (*p)[128], int n)
> +{
> +  return __builtin_memset (p, -1, n * sizeof (*p));
> +}
> +
> +void *opt2 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 2) * sizeof (*p));
> +}
> +
> +void *opt8 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 8) * sizeof (*p));
> +}
> +
> +void *opt32 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 32) * sizeof (*p));
> +}
> +
> +void *opt128 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 128) * sizeof (*p));
> +}
> +
> +void *opt512 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 512) * sizeof (*p));
> +}
> +
> +void *opt_primes (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 509 : 7) * sizeof (*p));
> +}
> +
> +void *opt_primes_blk (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 521 : 9) * sizeof (*p));
> +}
> +
> +void *huge (long (*p)[16384])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep1 (long (*p)[16384+1])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep4 (long (*p)[16384+4])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep16 (long (*p)[16384+16])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep64 (long (*p)[16384+64])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep256 (long (*p)[16384+256])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep1024p256p64p16p4p1 (long (*p)[16384+1024+64+16+4+1])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +/* { dg-final { scan-assembler-not "memset" } } */
>
>
> --
> Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
>    Free Software Activist                       GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about