From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by sourceware.org (Postfix) with ESMTPS id 5E1B13882079 for ; Wed, 12 Jun 2024 13:03:06 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5E1B13882079 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=vrull.eu Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=vrull.eu ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 5E1B13882079 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::435 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718197389; cv=none; b=XdEME3rcVGW/tEghTAtYbA44Ig3kPYV7QHTLGs6z0YMhYphtEXfGOKpJAv76xSVWdexuUUBkQvycsLxtTj+39nCu3wwely4VtxkDJxUr+2weRN8mPdqfehJZCW7oPIeU+VUOkAvAJdzj8TagPZFTDq7fUZFXx88pZoaOEDNpNss= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718197389; c=relaxed/simple; bh=F6LGU250NwZNSQ06Ci9/R30FXuWcNpAdPl+lOlx03Q0=; h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=nIVabG1tFrAjfZSFJdPLgmN5xCxak5hxly8civsoXbi36bKc+DBjPPIHnYsyD5cmIDPDo1bj4g70kB8/M/a0rLsoYMKuva9P+UFkz5ua7CS+io0eIugA5HN4SMmmnf5cMQFs7hnH+E50SL0K5A6oy0gjliPyLB2MnGdcg391A+I= ARC-Authentication-Results: i=1; server2.sourceware.org Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-704313fa830so3098521b3a.3 for ; Wed, 12 Jun 2024 06:03:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vrull.eu; s=google; t=1718197385; x=1718802185; darn=gcc.gnu.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=0PwjUnQ9TmbJ1F64xF4SH1QNTIHGWB7rtGHeao9ODaU=; b=YQYyF7barcOFf2q6dQfTt8AHhrf+lXS5bkl9WATkPzP6tUJlTXvZm2SwBUYRyG5Dhp MzUBB18C5f6fXwmzzBXBi4+R+ManbPUhN7RXIY1G+NOrLNPaJIZv2uCxAsEBC1OpMcNb DoBgTvGZqdvVSdTrdXQeuGwX8sP7CkkxblEP4g4vqFUehPZW5ZqDCzM7jN+NyfMSA9RG pyjhGeQNa2mv901aRuN8XgClLpvvRlz5ymnvFYufFNJPnN235EgfMBZqjzNICHXcOgBM evo2t6C+SqSA69E5DlSmorSuT+F6fyrk9fjfgaDk3WLC3M9+7VRzVWb5ahIq/+po1C3j Dvmw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718197385; x=1718802185; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0PwjUnQ9TmbJ1F64xF4SH1QNTIHGWB7rtGHeao9ODaU=; b=aSfoBmPCyejtVU4z1v/DUUTU9YNWKKsP+HLkpaKwPBtKdGVetAhkdWO0w2HbXYj4nr bMkC/DhyDPMcyMf++qcd7my2wTtkm4t9GobHQxNsv1LDd9UD0NVEU2BDaEiKvB9C4ci0 kkKO1S47gLnZeLycP0s8TzR3RroK22iOeMQp1FfWeyHrqFQJ3Ru+mRbnNDPcSr9KNhz3 GDoi57f9uVrIrupIHPfHR5aD+s4x+yYKcD5br7nlkwITCMOIN9QOddInRtgKb4Q+qJar FgZae5LjnujINhE9e5r4LDdeD5CEtEDWgRq4fgX+PcJ/7XzQkhVjuQXaInjX8Y6uJj62 5+yA== X-Forwarded-Encrypted: i=1; AJvYcCWSEkeHSzrGNGl7KzM/nOd5cq4g37DvWdYXKoT53Tr23eAEvFpa6c2kf5/7moD154ZAhqriLjdwD9s8FYIz6RY7hIB2pOxznQ== X-Gm-Message-State: AOJu0YyYdU/RjPWSDUpKBMV7ts9GC4Gio72dEhV6c8zmtA2nkPsSDbKS jCNP4oYH0FdiFFA01lm6njMzVgelGnbGKNjUbxUT3GlB2VludGEO1ETvGTq2KVYV+ttNOCkM0Gc jf9zw5fgUWm0eWZwjghCBsUvA8xO72ODQibIcJw== X-Google-Smtp-Source: AGHT+IEXeIQs/9NPySCFIAc52pN2LZmmxl6482nByHzVMcklfIqRbzME7sd3jnN3+6XcAYnaNKz1ZlniT+JcK04vkk8= X-Received: by 2002:a17:90b:4ad2:b0:2c2:e435:774f with SMTP id 98e67ed59e1d1-2c4a760725fmr1744464a91.10.1718197385106; Wed, 12 Jun 2024 06:03:05 -0700 (PDT) MIME-Version: 1.0 References: <20240606101043.3682477-1-manolis.tsamis@vrull.eu> <264e248d-cb54-4d3d-860d-193fd7be1049@gmail.com> In-Reply-To: From: Manolis Tsamis Date: Wed, 12 Jun 2024 16:02:29 +0300 Message-ID: Subject: Re: [PATCH v2] Target-independent store forwarding avoidance. To: Philipp Tomsich Cc: Jeff Law , gcc-patches@gcc.gnu.org, Richard Biener , =?UTF-8?Q?Christoph_M=C3=BCllner?= , Jiangning Liu , Jakub Jelinek , Andrew Pinski Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-9.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, Jun 10, 2024 at 9:27=E2=80=AFPM Philipp Tomsich wrote: > > On Mon, 10 Jun 2024 at 20:03, Jeff Law wrote: > > > > > > > > On 6/10/24 1:55 AM, Manolis Tsamis wrote: > > > > >> > > > There was an older submission of a load-pair specific pass but this i= s > > > a complete reimplementation and indeed significantly more general. > > > Apart from being target independant, it addresses a number of > > > important restrictions and can handle multiple store forwardings per > > > load. > > > It should be noted that it cannot handle the load-pair cases as these > > > need special handling, but that's something we're planning to do in > > > the future by reusing this infrastructure. > > ACK. Thanks for the additional background. > > > > > > > > > >> > > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > >>> index 4e8967fd8ab..c769744d178 100644 > > >>> --- a/gcc/doc/invoke.texi > > >>> +++ b/gcc/doc/invoke.texi > > >>> @@ -12657,6 +12657,15 @@ loop unrolling. > > >>> This option is enabled by default at optimization levels @option= {-O1}, > > >>> @option{-O2}, @option{-O3}, @option{-Os}. > > >>> > > >>> +@opindex favoid-store-forwarding > > >>> +@item -favoid-store-forwarding > > >>> +@itemx -fno-avoid-store-forwarding > > >>> +Many CPUs will stall for many cycles when a load partially depends= on previous > > >>> +smaller stores. This pass tries to detect such cases and avoid th= e penalty by > > >>> +changing the order of the load and store and then fixing up the lo= aded value. > > >>> + > > >>> +Disabled by default. > > >> Is there any particular reason why this would be off by default at -= O1 > > >> or higher? It would seem to me that on modern cores that this > > >> transformation should easily be a win. Even on an old in-order core= , > > >> avoiding the load with the bit insert is likely profitable, just not= as > > >> much so. > > >> > > > I don't have a strong opinion for that but I believe Richard's > > > suggestion to decide this on a per-target basis also makes a lot of > > > sense. > > > Deciding whether the transformation is profitable is tightly tied to > > > the architecture in question (i.e. how large the stall is and what > > > sort of bit-insert instructions are available). > > > In order to make this more widely applicable, I think we'll need a > > > target hook that decides in which case the forwarded stores incur a > > > penalty and thus the transformation makes sense. > > You and Richi are probably right. I'm not a big fan of passes being > > enabled/disabled on particular targets, but it may make sense here. > > > > > > > > > Afaik, for each CPU there may be cases that store forwarding is > > > handled efficiently. > > Absolutely. But forwarding from a smaller store to a wider load is > > painful from a hardware standpoint and if we can avoid it from a codege= n > > standpoint, we should. > > This change is what I briefly hinted as "the complete solution" that > we had on the drawing board when we briefly talked last November in > Santa Clara. > > > Did y'all look at spec2017 at all for this patch? I've got our hardwar= e > > guys to expose a signal for this case so that we can (in a month or so) > > get some hard data on how often it's happening in spec2017 and evaluate > > how this patch helps the most affected workloads. But if y'all already > > have some data we can use it as a starting point. > > We have looked at all of SPEC2017, especially for coverage (i.e., > making sure we see a significant number of uses of the transformation) > and correctness. The gcc_r and parest_r components triggered in a > number of "interesting" ways (e.g., motivating the case of > load-elimination). If it helps, we could share the statistics for how > often the pass triggers on compiling each of the SPEC2017 components? > Below is a table with 1) the number of successful store-forwarding-avoidance transformations and 2) the number of these where the load was also eliminated. This is SPEC2017 intrate and fprate; the omitted benchmarks in each case have zero transformations. Following Richard's comment I did two runs: One with the pass placed just after cse1 (this patch) and one with it placed after sched1 (Richard's suggestion). I see an increased number of transformations after sched1 and also in some testcases we get improved code generation so it looks promising. The original motivation was that placing this early will enable subsequent passes to optimize the transformed code, but it looks like this is not a major issue and the load/store placement it more important. I plan to follow up and provide some rough performance metrics and example assembly code of the transformation for these benchmarks. I also tried increasing the maximum distance from 10 to 15 and so only a small increase in transformation count. From what I've seen most cases that we care about are usually close. The data (benchmark, # transformed, # load elimination): After cse1: 500: 32 1 502: 89 9 510: 517 409 511: 12 1 520: 1 0 521: 4 1 525: 51 1 526: 46 11 557: 1 1 After sched1: 500: 45 13 502: 172 33 507: 23 23 510: 544 427 511: 41 29 520: 119 26 521: 19 11 523: 4 4 525: 129 44 526: 120 70 531: 1 1 538: 30 18 541: 2 2 557: 4 4 Note: ran on AArch64 with -O3 -mcpu=3Dneoverse-n1 -flto=3D32. Thanks, Manolis > Philipp.