public inbox for gcc-patches@gcc.gnu.org
From: Alexandre Oliva <oliva@adacore.com>
To: Paul Koning <paulkoning@comcast.net>
Cc: Richard Biener <richard.guenther@gmail.com>, gcc-patches@gcc.gnu.org
Subject: Re: [RFC] Introduce -finline-memset-loops
Date: Sat, 14 Jan 2023 00:21:53 -0300	[thread overview]
Message-ID: <ory1q5yf5a.fsf@lxoliva.fsfla.org> (raw)
In-Reply-To: <5268E9F0-32BC-4BE3-8ADE-F7A430DBBCCA@comcast.net> (Paul Koning's message of "Fri, 13 Jan 2023 21:12:59 -0500")

Hello, Paul,

On Jan 13, 2023, Paul Koning <paulkoning@comcast.net> wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:

>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment), 

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Though out-of-line memcpy implementations normally do that, dynamically
increasing dest alignment has such an impact on code size that *inline*
memcpy doesn't normally do it.  try_store_by_multiple_pieces,
specifically, is potentially branch-heavy to begin with, and bumping
the alignment up could double the size of the inline expansion.  So what
it does instead is take the conservative dest alignment estimate from
the compiler and use that.

By adding leading loops to try_store_by_multiple_pieces (as the
proposed patch does, when its option is enabled), we may expand an
unknown-length, unknown-alignment memset to something conceptually like
the following (cims is short for constant-sized inlined memset):

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }


Now, by using more loops instead of going through every power of two, we
could shorten (for -Os) the former to e.g.:

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

if (len >= 8) {
  while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 64)
    while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
}
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }


Now, given that improving performance was never a goal of this change, and
the expansion it optionally offers is desirable even when it slows
things down, just making it a simple loop at the known alignment would
do.  The remainder sort of flowed out of the way
try_store_by_multiple_pieces was structured, and I found it sort of made
sense to start with the largest-reasonable block loop, and then end with
whatever try_store_by_multiple_pieces would have expanded a
known-shorter but variable length memset to.  And this is how I got to
it.  I'm not sure it makes any sense to try to change things further to
satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
