From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=JkvT=BE=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lf1-x132.google.com (mail-lf1-x132.google.com [IPv6:2a00:1450:4864:20::132])
	by sourceware.org (Postfix) with ESMTPS id 50B853856970
	for <gcc-patches@gcc.gnu.org>; Mon, 15 May 2023 13:13:11 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 50B853856970
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lf1-x132.google.com with SMTP id 2adb3069b0e04-4f13dafd5dcso14486548e87.3
        for <gcc-patches@gcc.gnu.org>; Mon, 15 May 2023 06:13:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1684156390; x=1686748390;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=xfGR37GXV+za/LkTv/vfnspWwMURClmReURq65hjI0Q=;
        b=DWxwqQZ77d9BvE6EqfgTaGxKKffyl+CgbUlY2Rn1m1ZFju/8dN6JZkotbtsDZYLrqB
         /Tm2y/m/5hzaE5mW9CTVfUiOiEwPany2BxqR9rArKfhTwUrUcau/TvL9vJrHuskpUQG7
         mk0TzhVIqc29XPhJuhUg1w86ErVEt85vzPx+U9+qKraxIqt/VwQDJI5dtpctk4qnc1Yz
         77ZtY3HDUBykGdH7Jpii/oSd3l9VnUjTsjGUq1TvNQGRODS/HO+jiJJt0/hSoRuxZTtA
         wTqjzxCiOkCkGcZCzXuTLxWfKGDvlKGX5LFB+D82ycLuYnK6keBKOXANIO6BxdVzOcPV
         GNrg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1684156390; x=1686748390;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=xfGR37GXV+za/LkTv/vfnspWwMURClmReURq65hjI0Q=;
        b=NUJu4CVSBToGCfhdfKdBUoZk8EyECNMqcShZxWMbfFB5FUjL4ziV4QX8navOylUuqk
         I+QbcpZnHVQ9TEXqOxPnmcI00rzf+JvA+12a21Rb8FSF0VesnytYEmn3LnVdbAeoLqOy
         ftMY4sUUbVdE5r4MGK6Zg9NveGWJaiUe/v44w/aiB31OD4wLewMGRB4r/W6PBS6Ny6sL
         +jRtNSvfBwpko8gZkay7iw44JQ9OsXZ20xfCiw2lkmyor3vgutDY3yDABRDRQUgChUBI
         kleqmQB3Ax0ZsGNnCM5B59xFPHWFNjSKzZlrRjMQll3rSr5eGlypHFOIpdpwW/B+E4Yt
         6GYA==
X-Gm-Message-State: AC+VfDzUCcWJ/lco2e0Pk/xYhByJutrJFoSSCGzxcwyz8BK9l2nTxMij
	2Tj/r7irxkCPjfdkEHgpxdSyxiDGjCLxpLRofWM=
X-Google-Smtp-Source: ACHHUZ6muOU1Xj8k5YffqPAplqXuG/EvFE0VSKArwBRD22Z7f0pF1FIvGJw2Satoo/8mGHia5JyNBVLogNxJyD3tQSU=
X-Received: by 2002:ac2:454b:0:b0:4f1:45a7:114f with SMTP id
 j11-20020ac2454b000000b004f145a7114fmr6628162lfm.37.1684156389370; Mon, 15
 May 2023 06:13:09 -0700 (PDT)
MIME-Version: 1.0
References: <20230511101201.2052667-1-lili.cui@intel.com> <CAFiYyc1qAkMXUunetGbL5zMiUYONRAD4u_J-E=PTxA7Se5OaQQ@mail.gmail.com>
 <SJ0PR11MB56006AC172F9B0537DC460199E749@SJ0PR11MB5600.namprd11.prod.outlook.com>
 <CAFiYyc2SJ5s_8B5jB9JVLHY-zxTDx3=_AXy2AcZuWb-Cs=WFsg@mail.gmail.com> <SJ0PR11MB5600FACCACA4371FD957871F9E759@SJ0PR11MB5600.namprd11.prod.outlook.com>
In-Reply-To: <SJ0PR11MB5600FACCACA4371FD957871F9E759@SJ0PR11MB5600.namprd11.prod.outlook.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 15 May 2023 15:12:57 +0200
Message-ID: <CAFiYyc3GhaVARNhxWCVT+QobZS9pnCgy5xRBSN=Jf8QFkUM9Jg@mail.gmail.com>
Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the
 chain with FMA in reassoc pass
To: "Cui, Lili" <lili.cui@intel.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-1.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Fri, May 12, 2023 at 11:05=E2=80=AFAM Cui, Lili <lili.cui@intel.com> wro=
te:
>
> > ISTR there were no sufficient comments in the code explaining why
> > rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...
> >
> > >
> > > >
> > > > >                   if (!reassoc_insert_powi_p
> > > > > -                     && ops.length () > 3
> > > > > +                     && len > 3
> > > > > +                     && (!keep_fma_chain
> > > > > +                         || (keep_fma_chain
> > > > > +                             && len >
> > > > > + param_reassoc_max_chain_length_with_fma))
> > > >
> > > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > > the chain re-sorted but fall through to non-parallel rewrite.  I
> > > > wonder if we do not want to instead adjust the reassociation width?
> > > > I'd say it depends on the number of mult cases in the chain (sth th=
e re-
> > sorting could have computed).
> > > > Why do we have two completely independent --params here?  Can you
> > > > give an example --param value combination that makes "sense" and
> > > > show how it is beneficial?
> > >
> > > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > > * f  + j
> > >
> > > GCC trunk: ops_num =3D 4, targetm.sched.reassociation_width is 4 (sca=
lar fp
> > cost is 4). Calculated: Width =3D 2. we can get 2 FMAs.
> > > ----------------------------------
> > >   _1 =3D a_6(D) * b_7(D);
> > >   _2 =3D c_8(D) * d_9(D);
> > >   _5 =3D _1 + _2;
> > >   _4 =3D e_10(D) * f_11(D);
> > >   _3 =3D _4 + j_12(D);
> > >   _13 =3D _3 + _5;
> > > --------------------------------------------------------
> > >   _2 =3D c_8(D) * d_9(D);
> > >   _5 =3D .FMA (a_6(D), b_7(D), _2);
> > >   _3 =3D .FMA (e_10(D), f_11(D), j_12(D));
> > >   _13 =3D _3 + _5;
> > > --------------------------------------------------------
> > > New patch: If just rearrange ops and fall through to parallel rewrite=
 to
> > break the chain with width =3D 2.
> > >
> > > ---------------------------------------------------------
> > >   _1 =3D a_6(D) * b_7(D);
> > >   _2 =3D j + _1;          -----> put j at the first.
> > >   _3 =3D c_8(D) * d_9(D);
> > >   _4 =3D e_10(D) * f_11(D);
> > >   _5 =3D _3 + _4;       -----> break chain with width =3D 2. we lost =
a FMA here.
> > >   _13 =3D _2 + 5;
> > >
> > > -------------------------------------------------------
> > >   _3 =3D c_8(D) * d_9(D);
> > >   _2 =3D .FMA (a_6(D), b_7(D), j);
> > >   _5 =3D .FMA (e_10(D), f_11(D), _3);
> > >   _13 =3D _2 + _5;
> > > --------------------------------------------------------
> > > Sometimes break chain will lose FMA( break chain needs put two
> > > mult-ops together, which will lose one FMA ), we can only get 2 FMAs
> > > here, if we want to get 3 FMAs, we need to keep the chain and not
> > > break it. So I added a param to control chain length
> > > "param_reassoc_max_chain_length_with_fma =3D 4" (For the small case i=
n
> > > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> > > -------------------------------------------------------
> > >   _1 =3D a_6(D) * b_7(D);
> > >   _2 =3D c_8(D) * d_9(D);
> > >   _4 =3D e_10(D) * f_11(D);
> > >   _15 =3D _4 + j_12(D);
> > >   _16 =3D _15 + _2;
> > >   _13 =3D _16 + _1;
> > > -------------------------------------------------------
> > >   _15 =3D .FMA (e_10(D), f_11(D), j_12(D));
> > >   _16 =3D .FMA (c_8(D), d_9(D), _15);
> > >   _13 =3D .FMA (a_6(D), b_7(D), _16);
> > > -------------------------------------------------------
> > > In some case we want to break the chain with width, we can set
> > "param_reassoc_max_chain_length_with_fma =3D 2", it will rearrange ops =
and
> > break the chain with width.
> >
> > ... it sounds like the problem could be fully addressed by sorting the =
chain
> > with reassoc-width in mind?
> > Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vec=
tor of
> > mul and a vector of non-mul ops so it can pick from the optimal candida=
te?
> >
> > That said, I think rewrite_expr_tree_parallel_for_fma at least needs mo=
re
> > comments.
> >
> Sorry for not writing note clearly enough, I'll add more.
> I have two places that need to be clarified.
>
> 1. For some case we need to keep chain to generate more FMAs, because bre=
ak chain will lose FMA.
>    for example  g + a * b + c * d + e * f,
>    Keep chain can get 3 FMAs, break chain can get 2 FMAs. It's hard to sa=
y which one is better, so we provide a param for users to customize.
>
> 2. when the chain has FMAs and need to break the chain with width,
> for example l + a * b + c * d + e * f + g * h + j * k;(we already put non=
-mul first)
> rewrite_expr_tree_parallel :
> when width =3D 2, it will break the chain like this. actually it break th=
e chain in to 3. It ignores the width and adds all ops two by two. it will =
lose FMA.
>
> ssa1 =3D l + a * b;
> ssa2 =3D c * d + e * f;
> ssa3 =3D g * h + j * k;
> ssa4 =3D ssa1 + ssa2;
> ssa5 =3D ssa4 + ssa3;
>
> rewrite_expr_tree_parallel_for_fma
> when width =3D 2, we break the chain into two like this.
>
> ssa1 =3D l + a * b;
> ssa2 =3D c * d + e * f;
> ssa3 =3D ssa1 + g * h;
> ssa4 =3D ssa2 + j * k;
> ssa5 =3D ssa3 +ssa4;
>
> I think it's okay to remove or keep rewrite_expr_tree_parallel_for_fma. M=
ore FMAs are generated only for some special cases.
> I'm not sure whether the new method is better than the old one. I created=
 a small case the execution time of the two sequences is almost the same on=
 SPR and ICX.

I think to make a difference you need to hit the number of parallel
fadd/fmul the pipeline can perform.  I don't
think issue width is ever a problem for chains w/o fma and throughput
of fma vs fadd + fmul should be similar.

That said, I think iff then we should try to improve
rewrite_expr_tree_parallel rather than adding a new
function.  For example for the case with equal rank operands we can
try to sort adds first.  I can't convince
myself that rewrite_expr_tree_parallel honors ranks properly quickly.

Richard.

> Thanks,
> Lili.
>