References: <20230511101201.2052667-1-lili.cui@intel.com> <CAFiYyc1qAkMXUunetGbL5zMiUYONRAD4u_J-E=PTxA7Se5OaQQ@mail.gmail.com>
 <SJ0PR11MB56006AC172F9B0537DC460199E749@SJ0PR11MB5600.namprd11.prod.outlook.com>
 <CAFiYyc2SJ5s_8B5jB9JVLHY-zxTDx3=_AXy2AcZuWb-Cs=WFsg@mail.gmail.com>
 <SJ0PR11MB5600FACCACA4371FD957871F9E759@SJ0PR11MB5600.namprd11.prod.outlook.com>
 <CAFiYyc3GhaVARNhxWCVT+QobZS9pnCgy5xRBSN=Jf8QFkUM9Jg@mail.gmail.com> <SJ0PR11MB56006591A169D43098C628699E7E9@SJ0PR11MB5600.namprd11.prod.outlook.com>
In-Reply-To: <SJ0PR11MB56006591A169D43098C628699E7E9@SJ0PR11MB5600.namprd11.prod.outlook.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 22 May 2023 15:15:04 +0200
Message-ID: <CAFiYyc2yosbG-heVjye_jcPB40SHPHbQJiFsUN3FdF+AqGjfcQ@mail.gmail.com>
Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the
 chain with FMA in reassoc pass
To: "Cui, Lili" <lili.cui@intel.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, May 17, 2023 at 3:05 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > I think to make a difference you need to hit the number of parallel fadd/fmul
> > the pipeline can perform.  I don't think issue width is ever a problem for
> > chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> >
>
> Yes, for the x86 backend, fadd, fmul and fma have the same throughput (TP),
> meaning they should have the same width.
> The current implementation is reasonable: /* reassoc int, fp, vec_int, vec_fp.  */.
>
> > That said, I think iff then we should try to improve
> > rewrite_expr_tree_parallel rather than adding a new function.  For example
> > for the case with equal rank operands we can try to sort adds first.  I can't
> > convince myself that rewrite_expr_tree_parallel honors ranks properly
> > quickly.
> >
>
> I rewrote this patch; there are two main changes:
> 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it
>    instead of rewrite_expr_tree_parallel. The following example shows that
>    the sequence generated by this patch is better.
> 2. Put no-mult ops and mult ops alternately at the end of the queue, which
>    helps generate more FMAs and reduces the loss of FMAs when breaking the
>    chain.
>
> With these two changes, GCC can break the chain with width = 2 and generate
> 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any
> params.
>
> ----------------------------------------------------------------------
> Source code: g + h + j + s + m + n + a + b + e  (https://godbolt.org/z/G8sb86n84)
> Compile options: -Ofast -mfpmath=sse -mfma
> Width = 3 was chosen for reassociation
> ----------------------------------------------------------------------
> Old rewrite_expr_tree_parallel generates:
>   _6 = g_8(D) + h_9(D);     ------> parallel 0
>   _3 = s_11(D) + m_12(D);   ------> parallel 1
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 2
>   _1 = b_15(D) + e_16(D);   ------> parallel 3; this is not necessary, and
>                                     it is not friendly to FMA.
>   _4 = _1 + _2;
>   _7 = _4 + _5;
>   _17 = _6 + _7;
>   return _17;
>
> With width = 3, we need 5 cycles here.
> ------------------------------ first end ------------------------------
> The rewritten rewrite_expr_tree_parallel (3 sets in parallel) generates:
>
>   _3 = s_11(D) + m_12(D);   ------> parallel 0
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 1
>   _1 = b_15(D) + e_16(D);   ------> parallel 2
>   _4 = _1 + _2;
>   _6 = _4 + _5;
>   _7 = _6 + h_9(D);
>   _17 = _7 + g_8(D);
>   return _17;
>
> With width = 3, we need 5 cycles here.
> ------------------------------ second end -----------------------------
> Using rewrite_expr_tree_parallel_for_fma instead of
> rewrite_expr_tree_parallel generates:
>
>   _3 = s_11(D) + m_12(D);
>   _6 = _3 + g_8(D);
>   _2 = n_13(D) + a_14(D);
>   _5 = _2 + h_9(D);
>   _1 = b_15(D) + e_16(D);
>   _4 = _1 + j_10(D);
>   _7 = _4 + _5;
>   _17 = _7 + _6;
>   return _17;
>
> With width = 3, we need 4 cycles here.
> ------------------------------ third end ------------------------------

Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel is
optimal - you show that for the specific example
rewrite_expr_tree_parallel_for_fma is better.  I was arguing we want a
single function, whether we single out leaves with multiplications or not.

And we want documentation that shows the strategy will result in optimal
latency (I think we should not sacrifice latency just for the sake of
forming more FMAs).
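
As a rough illustration of how such latency claims could be checked, here
is a minimal sketch in Python.  It is not GCC code.  It assumes a toy cost
model: every add has unit latency, and each cycle issues up to `width`
statements, in program order, whose operands are leaves or results of
earlier cycles.  The names cycles(), lower_bound() and the three statement
lists are made up for this example; the statements are transcribed from
the quoted GIMPLE with SSA suffixes dropped.

```python
from math import ceil, log2

def cycles(stmts, width):
    """Cycle count under the assumed toy model (unit latency, greedy
    issue of up to `width` ready statements per cycle, program order)."""
    defined = {dest for dest, _, _ in stmts}
    done = set()              # results available from earlier cycles
    pending = list(stmts)
    count = 0
    while pending:
        # An operand is available if it is a leaf or was computed earlier.
        ready = [s for s in pending
                 if all(op in done or op not in defined for op in s[1:])]
        issued = ready[:width]
        if not issued:
            raise ValueError("cyclic dependency or width < 1")
        done.update(dest for dest, _, _ in issued)
        pending = [s for s in pending if s not in issued]
        count += 1
    return count

def lower_bound(n_operands, width):
    """No schedule in this model can beat the balanced-tree depth or the
    (n - 1) adds divided by the issue width."""
    return max(ceil(log2(n_operands)), ceil((n_operands - 1) / width))

# (dest, lhs, rhs) triples for the three sequences quoted above.
old = [("_6", "g", "h"), ("_3", "s", "m"), ("_5", "_3", "j"),
       ("_2", "n", "a"), ("_1", "b", "e"), ("_4", "_1", "_2"),
       ("_7", "_4", "_5"), ("_17", "_6", "_7")]
rewritten = [("_3", "s", "m"), ("_5", "_3", "j"), ("_2", "n", "a"),
             ("_1", "b", "e"), ("_4", "_1", "_2"), ("_6", "_4", "_5"),
             ("_7", "_6", "h"), ("_17", "_7", "g")]
for_fma = [("_3", "s", "m"), ("_6", "_3", "g"), ("_2", "n", "a"),
           ("_5", "_2", "h"), ("_1", "b", "e"), ("_4", "_1", "j"),
           ("_7", "_4", "_5"), ("_17", "_7", "_6")]

for name, seq in (("old", old), ("rewritten", rewritten),
                  ("for_fma", for_fma)):
    print(name, cycles(seq, 3), "bound", lower_bound(9, 3))
```

Under these assumptions the sketch gives 5, 5 and 4 cycles for the three
sequences, against a lower bound of 4 for nine operands at width 3.  A
real target has non-unit latencies and a different scheduler, so this is
only a sanity check on the shape of the tree, not a performance model.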

Richard.

>
> Thanks,
> Lili.
>