From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=//y2=NO=vrull.eu=manolis.tsamis@sourceware.org>
Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435])
	by sourceware.org (Postfix) with ESMTPS id 5E1B13882079
	for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 13:03:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5E1B13882079
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=vrull.eu
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=vrull.eu
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 5E1B13882079
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::435
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718197389; cv=none;
	b=XdEME3rcVGW/tEghTAtYbA44Ig3kPYV7QHTLGs6z0YMhYphtEXfGOKpJAv76xSVWdexuUUBkQvycsLxtTj+39nCu3wwely4VtxkDJxUr+2weRN8mPdqfehJZCW7oPIeU+VUOkAvAJdzj8TagPZFTDq7fUZFXx88pZoaOEDNpNss=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1718197389; c=relaxed/simple;
	bh=F6LGU250NwZNSQ06Ci9/R30FXuWcNpAdPl+lOlx03Q0=;
	h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=nIVabG1tFrAjfZSFJdPLgmN5xCxak5hxly8civsoXbi36bKc+DBjPPIHnYsyD5cmIDPDo1bj4g70kB8/M/a0rLsoYMKuva9P+UFkz5ua7CS+io0eIugA5HN4SMmmnf5cMQFs7hnH+E50SL0K5A6oy0gjliPyLB2MnGdcg391A+I=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-704313fa830so3098521b3a.3
        for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 06:03:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=vrull.eu; s=google; t=1718197385; x=1718802185; darn=gcc.gnu.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=0PwjUnQ9TmbJ1F64xF4SH1QNTIHGWB7rtGHeao9ODaU=;
        b=YQYyF7barcOFf2q6dQfTt8AHhrf+lXS5bkl9WATkPzP6tUJlTXvZm2SwBUYRyG5Dhp
         MzUBB18C5f6fXwmzzBXBi4+R+ManbPUhN7RXIY1G+NOrLNPaJIZv2uCxAsEBC1OpMcNb
         DoBgTvGZqdvVSdTrdXQeuGwX8sP7CkkxblEP4g4vqFUehPZW5ZqDCzM7jN+NyfMSA9RG
         pyjhGeQNa2mv901aRuN8XgClLpvvRlz5ymnvFYufFNJPnN235EgfMBZqjzNICHXcOgBM
         evo2t6C+SqSA69E5DlSmorSuT+F6fyrk9fjfgaDk3WLC3M9+7VRzVWb5ahIq/+po1C3j
         Dvmw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1718197385; x=1718802185;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=0PwjUnQ9TmbJ1F64xF4SH1QNTIHGWB7rtGHeao9ODaU=;
        b=aSfoBmPCyejtVU4z1v/DUUTU9YNWKKsP+HLkpaKwPBtKdGVetAhkdWO0w2HbXYj4nr
         bMkC/DhyDPMcyMf++qcd7my2wTtkm4t9GobHQxNsv1LDd9UD0NVEU2BDaEiKvB9C4ci0
         kkKO1S47gLnZeLycP0s8TzR3RroK22iOeMQp1FfWeyHrqFQJ3Ru+mRbnNDPcSr9KNhz3
         GDoi57f9uVrIrupIHPfHR5aD+s4x+yYKcD5br7nlkwITCMOIN9QOddInRtgKb4Q+qJar
         FgZae5LjnujINhE9e5r4LDdeD5CEtEDWgRq4fgX+PcJ/7XzQkhVjuQXaInjX8Y6uJj62
         5+yA==
X-Forwarded-Encrypted: i=1; AJvYcCWSEkeHSzrGNGl7KzM/nOd5cq4g37DvWdYXKoT53Tr23eAEvFpa6c2kf5/7moD154ZAhqriLjdwD9s8FYIz6RY7hIB2pOxznQ==
X-Gm-Message-State: AOJu0YyYdU/RjPWSDUpKBMV7ts9GC4Gio72dEhV6c8zmtA2nkPsSDbKS
	jCNP4oYH0FdiFFA01lm6njMzVgelGnbGKNjUbxUT3GlB2VludGEO1ETvGTq2KVYV+ttNOCkM0Gc
	jf9zw5fgUWm0eWZwjghCBsUvA8xO72ODQibIcJw==
X-Google-Smtp-Source: AGHT+IEXeIQs/9NPySCFIAc52pN2LZmmxl6482nByHzVMcklfIqRbzME7sd3jnN3+6XcAYnaNKz1ZlniT+JcK04vkk8=
X-Received: by 2002:a17:90b:4ad2:b0:2c2:e435:774f with SMTP id
 98e67ed59e1d1-2c4a760725fmr1744464a91.10.1718197385106; Wed, 12 Jun 2024
 06:03:05 -0700 (PDT)
MIME-Version: 1.0
References: <20240606101043.3682477-1-manolis.tsamis@vrull.eu>
 <e7a8db67-2a90-41f3-972e-d77b97d8dba5@gmail.com> <CAM3yNXp9qnJTzgL6+81Sc+2g=EmKnCW1XReqaGgy-5D43BmKDA@mail.gmail.com>
 <264e248d-cb54-4d3d-860d-193fd7be1049@gmail.com> <CAAeLtUCThU562S-toXOyz4BOSaMM39=fGuuGQ48G60viHxu1og@mail.gmail.com>
In-Reply-To: <CAAeLtUCThU562S-toXOyz4BOSaMM39=fGuuGQ48G60viHxu1og@mail.gmail.com>
From: Manolis Tsamis <manolis.tsamis@vrull.eu>
Date: Wed, 12 Jun 2024 16:02:29 +0300
Message-ID: <CAM3yNXpw-2z5vkzzwLRch=LOo=2cyL1O7qQpEZnYXp1GOv9Vrw@mail.gmail.com>
Subject: Re: [PATCH v2] Target-independent store forwarding avoidance.
To: Philipp Tomsich <philipp.tomsich@vrull.eu>
Cc: Jeff Law <jeffreyalaw@gmail.com>, gcc-patches@gcc.gnu.org, 
	Richard Biener <rguenther@suse.de>, =?UTF-8?Q?Christoph_M=C3=BCllner?= <christoph.muellner@vrull.eu>, 
	Jiangning Liu <jiangning.liu@amperecomputing.com>, Jakub Jelinek <jakub@redhat.com>, 
	Andrew Pinski <quic_apinski@quicinc.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-9.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Mon, Jun 10, 2024 at 9:27=E2=80=AFPM Philipp Tomsich
<philipp.tomsich@vrull.eu> wrote:
>
> On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreyalaw@gmail.com> wrote:
> >
> >
> >
> > On 6/10/24 1:55 AM, Manolis Tsamis wrote:
> >
> > >>
> > > There was an older submission of a load-pair specific pass but this i=
s
> > > a complete reimplementation and indeed significantly more general.
> > > Apart from being target independant, it addresses a number of
> > > important restrictions and can handle multiple store forwardings per
> > > load.
> > > It should be noted that it cannot handle the load-pair cases as these
> > > need special handling, but that's something we're planning to do in
> > > the future by reusing this infrastructure.
> > ACK.  Thanks for the additional background.
> >
> >
> > >
> > >>
> > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > >>> index 4e8967fd8ab..c769744d178 100644
> > >>> --- a/gcc/doc/invoke.texi
> > >>> +++ b/gcc/doc/invoke.texi
> > >>> @@ -12657,6 +12657,15 @@ loop unrolling.
> > >>>    This option is enabled by default at optimization levels @option=
{-O1},
> > >>>    @option{-O2}, @option{-O3}, @option{-Os}.
> > >>>
> > >>> +@opindex favoid-store-forwarding
> > >>> +@item -favoid-store-forwarding
> > >>> +@itemx -fno-avoid-store-forwarding
> > >>> +Many CPUs will stall for many cycles when a load partially depends=
 on previous
> > >>> +smaller stores.  This pass tries to detect such cases and avoid th=
e penalty by
> > >>> +changing the order of the load and store and then fixing up the lo=
aded value.
> > >>> +
> > >>> +Disabled by default.
> > >> Is there any particular reason why this would be off by default at -=
O1
> > >> or higher?  It would seem to me that on modern cores that this
> > >> transformation should easily be a win.  Even on an old in-order core=
,
> > >> avoiding the load with the bit insert is likely profitable, just not=
 as
> > >> much so.
> > >>
> > > I don't have a strong opinion for that but I believe Richard's
> > > suggestion to decide this on a per-target basis also makes a lot of
> > > sense.
> > > Deciding whether the transformation is profitable is tightly tied to
> > > the architecture in question (i.e. how large the stall is and what
> > > sort of bit-insert instructions are available).
> > > In order to make this more widely applicable, I think we'll need a
> > > target hook that decides in which case the forwarded stores incur a
> > > penalty and thus the transformation makes sense.
> > You and Richi are probably right.   I'm not a big fan of passes being
> > enabled/disabled on particular targets, but it may make sense here.
> >
> >
> >
> > > Afaik, for each CPU there may be cases that store forwarding is
> > > handled efficiently.
> > Absolutely.   But forwarding from a smaller store to a wider load is
> > painful from a hardware standpoint and if we can avoid it from a codege=
n
> > standpoint, we should.
>
> This change is what I briefly hinted as "the complete solution" that
> we had on the drawing board when we briefly talked last November in
> Santa Clara.
>
> > Did y'all look at spec2017 at all for this patch?  I've got our hardwar=
e
> > guys to expose a signal for this case so that we can (in a month or so)
> > get some hard data on how often it's happening in spec2017 and evaluate
> > how this patch helps the most affected workloads.  But if y'all already
> > have some data we can use it as a starting point.
>
>  We have looked at all of SPEC2017, especially for coverage (i.e.,
> making sure we see a significant number of uses of the transformation)
> and correctness.  The gcc_r and parest_r components triggered in a
> number of "interesting" ways (e.g., motivating the case of
> load-elimination).  If it helps, we could share the statistics for how
> often the pass triggers on compiling each of the SPEC2017 components?
>
Below is a table with 1) the number of successful
store-forwarding-avoidance transformations and 2) the number of these
where the load was also eliminated.
This is SPEC2017 intrate and fprate; the omitted benchmarks in each
case have zero transformations.

Following Richard's comment I did two runs: One with the pass placed
just after cse1 (this patch) and one with it placed after sched1
(Richard's suggestion). I see an increased number of transformations
after sched1 and also in some testcases we get improved code
generation so it looks promising. The original motivation was that
placing this early will enable subsequent passes to optimize the
transformed code, but it looks like this is not a major issue and the
load/store placement it more important.

I plan to follow up and provide some rough performance metrics and
example assembly code of the transformation for these benchmarks.
I also tried increasing the maximum distance from 10 to 15 and so only
a small increase in transformation count. From what I've seen most
cases that we care about are usually close.

The data (benchmark, # transformed, # load elimination):

After cse1:
  500: 32 1
  502: 89 9
  510: 517 409
  511: 12 1
  520: 1 0
  521: 4 1
  525: 51 1
  526: 46 11
  557: 1 1

After sched1:
  500: 45 13
  502: 172 33
  507: 23 23
  510: 544 427
  511: 41 29
  520: 119 26
  521: 19 11
  523: 4 4
  525: 129 44
  526: 120 70
  531: 1 1
  538: 30 18
  541: 2 2
  557: 4 4

Note: ran on AArch64 with -O3 -mcpu=3Dneoverse-n1 -flto=3D32.

Thanks,
Manolis

> Philipp.