From: Alexandre Oliva
To: Paul Koning
Cc: Richard Biener, gcc-patches@gcc.gnu.org
Subject: Re: [RFC] Introduce -finline-memset-loops
Date: Sat, 14 Jan 2023 00:21:53 -0300
Organization: Free thinker, does not speak for AdaCore
In-Reply-To: <5268E9F0-32BC-4BE3-8ADE-F7A430DBBCCA@comcast.net> (Paul Koning's message of "Fri, 13 Jan 2023 21:12:59 -0500")
Hello, Paul,

On Jan 13, 2023, Paul Koning wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches wrote:
>>
>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment),

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Though memcpy implementations normally do that indeed, dynamically
increasing dest alignment has such an impact on code size that *inline*
memcpy doesn't normally do that.  try_store_by_multiple_pieces,
specifically, is potentially branch-heavy to begin with, and bumping
alignment up could double the inline expansion size.  So what it does
instead is take the conservative dest alignment estimate from the
compiler and use it.
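To make the shape of such an expansion concrete, here is a minimal C
sketch of the staged store-by-pieces idea: a loop handles the largest
block size, then one test per smaller power of two mops up the
remainder, all at the compiler's statically known alignment (no runtime
alignment bumps).  This is an illustration only, not GCC's actual RTL
expansion; the helper name `cims` (constant-sized inline memset) is
taken from the discussion below, and here it simply wraps memset.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for a constant-sized inline memset: in the real expansion,
   each call site becomes a fixed sequence of stores for a known N.  */
static void cims(unsigned char *dest, int c, size_t n)
{
    memset(dest, c, n);
}

/* Sketch of the staged expansion: a loop for 64-byte blocks, then a
   halving chain of at-most-once tests for the remainder, so any
   residual length 0..63 is covered by at most six constant stores.  */
static void inline_memset_loops(unsigned char *dest, int c, size_t len)
{
    while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
    if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
    if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
    if (len >= 8)  { len -= 8;  cims(dest, c, 8);  dest += 8; }
    if (len >= 4)  { len -= 4;  cims(dest, c, 4);  dest += 4; }
    if (len >= 2)  { len -= 2;  cims(dest, c, 2);  dest += 2; }
    if (len >= 1)  { len -= 1;  cims(dest, c, 1);  dest += 1; }
}
```

Note that nothing here inspects the runtime value of dest: the whole
point of the conservative approach is that the expansion is valid for
whatever alignment the compiler could prove, without the extra branches
a dynamic alignment probe would cost.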
By adding leading loops to try_store_by_multiple_pieces (as the proposed
patch does, with its option enabled) we may expand an unknown-length,
unknown-alignment memset to something conceptually like (cims is short
for constant-sized inlined memset):

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

  if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, by using more loops instead of going through every power of two,
we could shorten (for -Os) the former to e.g.:

  while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

  if (len >= 8) {
    while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
    if (len >= 64)
      while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
    while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
    while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
  }
  while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

Now, given that improving performance was never a goal of this change,
and the expansion it optionally offers is desirable even when it slows
things down, just making it a simple loop at the known alignment would
do.  The remainder sort of flowed out of the way
try_store_by_multiple_pieces was structured, and I found it sort of
made sense to start with the largest-reasonable block loop, and then
end with whatever try_store_by_multiple_pieces would have expanded a
known-shorter but variable-length memset to.  And this is how I got to
it.  I'm not sure it makes any sense to try to change things further to
satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
   Free Software Activist                 GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about
injustice but very few check the facts.  Ask me about