From: Richard Biener
Date: Tue, 29 Aug 2023 13:11:10 +0200
Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
To: Di Zhao OS
Cc: Jeff Law, Martin Jambor, gcc-patches@gcc.gnu.org

On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener
> > Sent: Tuesday, August 29, 2023 4:09 PM
> > To: Di Zhao OS
> > Cc: Jeff Law; Martin Jambor; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener
> > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > To: Jeff Law; Martin Jambor
> > > > Cc: Di Zhao OS; gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > > > reduce cross backedge FMA
> > > >
> > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> > > > >
> > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > >
> > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > swap the operands so that more FMAs can be generated.
> > > > > > After widening_mul the result looks like:
> > > > > >
> > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > >
> > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > widening_mul's result was like:
> > > > > >
> > > > > >     _1 = a * b;
> > > > > >     _2 = .FMA (c, d, _1);
> > > > > >     acc_1 = acc_0 + _2;
> > > >
> > > > How can we execute the multiply and the FMA in parallel?  They
> > > > depend on each other.  Or is it that the uarch can handle a
> > > > dependence on the add operand in some better way, but only when it
> > > > comes from a multiplication and not an FMA?  (I'd doubt so much
> > > > complexity)
> > > >
> > > > Can you explain in more detail how the uarch executes one vs. the
> > > > other case?
>
> Here's my understanding, after consulting our hardware team. For the
> second case, the uarch of some out-of-order processors can calculate
> "_2" of several iterations at the same time, since there's no dependency
> among different iterations. For the first case, however, the next
> iteration has to wait for the current iteration to finish, so that
> "acc_0"'s value is known. I assume this is also the case for some i386
> processors, since I saw the patch "Deferring FMA transformations in
> tight loops" also changed the corresponding files.

That should be true for all kinds of operations, no?  Thus it means
reassoc should in general associate cross-iteration accumulation
last?
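To make the dependence shapes concrete, here is a minimal C sketch of the accumulation pattern under discussion (hypothetical code, not taken from the patch or its testcase):

```c
#include <stddef.h>

/* The op list "acc = a*b + c*d + acc" from the patch description.
   If both additions become FMAs, iteration i+1's first FMA consumes
   acc from iteration i, serializing the whole chain.  With
   mul + FMA + plain add, the multiply and FMA of the next iteration
   are independent of the previous iteration's final add, so an
   out-of-order core can overlap iterations.  */
double dot4(const double *a, const double *b,
            const double *c, const double *d, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc = a[i] * b[i] + c[i] * d[i] + acc;
  return acc;
}
```

Compiled with -Ofast on an FMA-capable target, this is the shape where the operand order chosen by reassoc decides whether the loop-carried add turns into a second chained FMA.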
Historically we associated those first because that's how the
vectorizer liked to see them, but I think that's no longer necessary.
It should be achievable by properly biasing the operand during rank
computation (don't we already do that?).

> > > > > > If the code fragment is in a loop, some architectures can
> > > > > > execute the latter in parallel, so the performance can be much
> > > > > > faster than the former. For the small testcase, the performance
> > > > > > gap is over 10% on both ampere1 and neoverse-n1. So the point
> > > > > > here is to avoid turning the last statement into an FMA, and to
> > > > > > keep it a PLUS_EXPR as much as possible. (If we are rewriting
> > > > > > the op list into parallel form, no special treatment is needed,
> > > > > > since the last statement after rewrite_expr_tree_parallel will
> > > > > > be a PLUS_EXPR anyway.)
> > > > > >
> > > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > state of FMA candidates for an op list.
> > > > > >
> > > > > > With this patch, there's a 3% improvement in the 510.parest_r
> > > > > > 1-copy run on ampere1. The compile options are:
> > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > >
> > > > > > Best regards,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > >         describe the state of FMA candidates for an op list.
> > > > > >         (rewrite_expr_tree_parallel): Changed boolean
> > > > > >         parameter to enum type.
> > > > > >         (result_feeds_back_from_phi_p): New function to check
> > > > > >         for cross backedge dependency.
> > > > > >         (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > >         parameter.
> > > > > >         (reassociate_bb): If there's a backedge dependency in
> > > > > >         an op list, swap the operands before rewrite_expr_tree.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279.c: New test.
> > > > >
> > > > > Not a review, but more of a question -- isn't this
> > > > > transformation's profitability uarch-sensitive?  I.e., just
> > > > > because it's bad for a set of aarch64 uarches doesn't mean it's
> > > > > bad everywhere.
> > > > >
> > > > > And in general we shy away from trying to adjust gimple code
> > > > > based on uarch preferences.
> > > > >
> > > > > It seems the right place to do this is gimple->rtl expansion.
> > > >
> > > > Another comment is that FMA forming has this deferring code which I
> > > > think deals exactly with this kind of thing?  CCing Martin who did
> > > > this work based on AMD uarchs also not wanting cross-loop
> > > > dependences on FMAs (or so).  In particular I see
> > > >
> > > >   if (fma_state.m_deferring_p
> > > >       && fma_state.m_initial_phi)
> > > >     {
> > > >       gcc_checking_assert (fma_state.m_last_result);
> > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > >                                                  &m_last_result_set))
> > > >         cancel_fma_deferring (&fma_state);
> > > >
> > > > and I think code to avoid FMAs in other/related cases should be
> > > > here as well, like avoiding forming back-to-back FMAs.
> > >
> > > The changes in this patch are controlled by
> > > "param_avoid_fma_max_bits", so I think it should only affect
> > > architectures with similar behavior. (The parameter was added in a
> > > previous patch, "Deferring FMA transformations in tight loops",
> > > which seems to be dealing with the same issue.)
> >
> > That's what I said.  Is the pipeline behavior properly modeled by the
> > scheduler description?  Your patch seems to not only affect loops,
> > but the FMA forming case already present does - is the case you are
> > after not in a tight loop?
> > In principle, with a loop with enough scheduling freedom this should
> > be a non-issue - unless the pipeline model isn't accurate and we put
> > the FMAs too close together?
>
> The current deferring FMA scheme discards all the FMA candidates in the
> FMA chain that has a loop dependency. So for this case there will be
> too few (zero) FMAs generated; while before the "Handle FMA friendly..."
> patch, or after this patch, we can still have 1 FMA. I also tried
> changing the deferring FMA scheme so that only the last FMA is
> discarded, but it still can't solve the regression. I think that's
> because this deferring scheme is in the later pass widening_mul, which
> only decides whether to form FMAs but doesn't manipulate the tree for
> better data parallelism. So it seems better to optimize for data
> parallelism (swapping the operands) in reassoc, if we know there can be
> FMAs with a loop dependency, and to leave the current deferring FMA
> scheme as the last resort?
>
> > Richard.
> >
> > > Richard.
> > >
> > > > > Jeff
>
> Thanks,
> Di Zhao
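The rank-biasing idea Richard raises can be illustrated with a toy model. The struct, names, and ranks below are invented for illustration; GCC's real rank machinery lives in tree-ssa-reassoc.cc and is considerably more involved:

```c
#include <stdlib.h>

/* Toy model of biasing ranks so that a loop-carried (cross-backedge)
   operand is associated last.  Sorting by rank models reassoc's
   ordering of an op list; giving the PHI-fed operand the highest rank
   pushes it to the end of the chain, so the final statement stays a
   plain add instead of becoming a second chained FMA.  */
struct op { const char *name; int rank; };

static int cmp_rank(const void *a, const void *b)
{
  return ((const struct op *)a)->rank - ((const struct op *)b)->rank;
}

/* Bias the operand at index phi_fed above every other rank, then
   sort the list so it lands last.  */
static void rank_ops(struct op *ops, size_t n, size_t phi_fed)
{
  int max = 0;
  for (size_t i = 0; i < n; i++)
    if (ops[i].rank > max)
      max = ops[i].rank;
  ops[phi_fed].rank = max + 1;   /* cross-backedge operand wins */
  qsort(ops, n, sizeof *ops, cmp_rank);
}
```

For an op list {a*b, c*d, acc} where acc is fed by the loop PHI, sorting places acc last, mirroring the patch's goal of keeping the accumulation as the final PLUS_EXPR.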