From: Alexandre Oliva
To: Paul Koning
Cc: Richard Biener, gcc-patches@gcc.gnu.org
Subject: Re: [RFC] Introduce -finline-memset-loops
Date: Sat, 14 Jan 2023 00:21:53 -0300
Organization: Free thinker, does not speak for AdaCore
In-Reply-To: <5268E9F0-32BC-4BE3-8ADE-F7A430DBBCCA@comcast.net> (Paul Koning's message of "Fri, 13 Jan 2023 21:12:59 -0500")
Hello, Paul,

On Jan 13, 2023, Paul Koning wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches wrote:
>>
>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment),

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Though memcpy implementations normally do that indeed, dynamically
increasing dest alignment has such an impact on code size that *inline*
memcpy doesn't normally do that.  try_store_by_multiple_pieces,
specifically, is potentially branch-heavy to begin with, and bumping
alignment up could double the inline expansion size.  So what it does
instead is take the conservative dest alignment estimate from the
compiler and use it.
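To make the shape of such an expansion concrete, here is a minimal C
sketch of the staged store-by-pieces idea: a loop handles the largest
block size, then one test per smaller power of two mops up the
remainder, all at the compiler's statically known alignment (no runtime
alignment bumps).  This is an illustration only, not GCC's actual RTL
expansion; the helper name `cims` (constant-sized inline memset) is
taken from the discussion below, and here it simply wraps memset.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a constant-sized inline memset: in the real expansion,
   each call site becomes a fixed sequence of stores for a known N.  */
static void cims(unsigned char *dest, int c, size_t n)
{
    memset(dest, c, n);
}

/* Sketch of the staged expansion: a loop for 64-byte blocks, then a
   halving chain of at-most-once tests for the remainder, so any
   residual length 0..63 is covered by at most six constant stores.  */
static void inline_memset_loops(unsigned char *dest, int c, size_t len)
{
    while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
    if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
    if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
    if (len >= 8)  { len -= 8;  cims(dest, c, 8);  dest += 8; }
    if (len >= 4)  { len -= 4;  cims(dest, c, 4);  dest += 4; }
    if (len >= 2)  { len -= 2;  cims(dest, c, 2);  dest += 2; }
    if (len >= 1)  { len -= 1;  cims(dest, c, 1);  dest += 1; }
}
```

Note that nothing here inspects the runtime value of dest: the whole
point of the conservative approach is that the expansion is valid for
whatever alignment the compiler could prove, without the extra branches
a dynamic alignment probe would cost.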
By adding leading loops to try_store_by_multiple_pieces (as the proposed
patch does, with its option enabled) we may expand an unknown-length,
unknown-alignment memset to something conceptually like (cims is short
for constant-sized inlined memset):

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

  if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, by using more loops instead of going through every power of two,
we could shorten (for -Os) the former to e.g.:

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

  if (len >= 8) {
    while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
    if (len >= 64)
      while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
    while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
    while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, given that improving performance was never a goal of this change,
and the expansion it optionally offers is desirable even when it slows
things down, just making it a simple loop at the known alignment would
do.  The remainder sort of flowed out of the way
try_store_by_multiple_pieces was structured, and I found it sort of
made sense to start with the largest-reasonable block loop, and then
end with whatever try_store_by_multiple_pieces would have expanded a
known-shorter but variable-length memset to.  And this is how I got to
it.  I'm not sure it makes any sense to try to change things further to
satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
   Free Software Activist                 GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about
injustice but very few check the facts.  Ask me about