From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=udnH=BB=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lf1-x12f.google.com (mail-lf1-x12f.google.com [IPv6:2a00:1450:4864:20::12f])
	by sourceware.org (Postfix) with ESMTPS id 4840B3856242
	for <gcc-patches@gcc.gnu.org>; Fri, 12 May 2023 06:05:11 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4840B3856242
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lf1-x12f.google.com with SMTP id 2adb3069b0e04-4f2676d62a2so2609542e87.0
        for <gcc-patches@gcc.gnu.org>; Thu, 11 May 2023 23:05:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1683871510; x=1686463510;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=Wsx+SjG/asWtXx6zXoYUCjFFOJhULHeRp2zd/XeGmaE=;
        b=pEXHlU4DQYZqyW7osQf+9Lux2zyJuZQeA+szpuDJuP3Ws0jvSMvb940kjdUuY1tjW5
         IRczOgQ0GfshH3/cBbICw81ASrhgP/cuF44nBQ+YH4UxI6dtGQq2Ocq6ei7UfVlgZQdI
         JHjLWY1YnITkg2b93JmcolCOi6OKqNM3jwTaECr/ZpK/kqqejHk8e/HhgdF4FLqeSqq3
         E7dj83IX5jYSjbrjRaoG9FcSSfQ+PjoJlOndY3Fh+H7iLebGBQmBXy5eNhUi02pp/w6B
         HrUQJhNORvGb5zeKPhD4BZpAywx4LRBk/YQkawkA3oHnuK9J1Krhw3simYurFgKhtDuC
         ffbg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1683871510; x=1686463510;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=Wsx+SjG/asWtXx6zXoYUCjFFOJhULHeRp2zd/XeGmaE=;
        b=BOqpg15POMtddv58ry/0CDVKEjjqiaan3PUnwekBII5VZ+1kCCzh22lpzO0mkUEUX+
         bS0T69Xp6epiliDm/0QSy9sRUsHPduGqJ4YOiDHq1u3SxmVEaAILt/htEOnYlBcTLsAY
         bgiD8EllwLm1qYr0C/K5vxOdmMgmRuWaOJQx9OZ/t1u+rxOGwZ5WMDZjQykiDrHW7uD1
         7K23bcG1zWhMkuMeydBrsTT+ZI9mj85S/ftB+PXtaVxGM+soeM/wo2KY4pWasYaVedEH
         LR3jEkbSqmzWXg4FUB+Ltt0Y7xFGrJcvBdSLCGNiS1pY+1/BhXmayG7z3qmmJGPdAE0f
         6vGw==
X-Gm-Message-State: AC+VfDyb+UNJboI50zFKzYGYYoOR8SnrouGe+g6xnazG0WOikssYp61p
	gHxbDfRswo4a2GcLlgwTwhQmPZkAuTs3nniOXrDMpLfkLBI=
X-Google-Smtp-Source: ACHHUZ7gQDGi3j9RXJsDD1Kc9bKxvk6DwoX+osYb3296lTo8nwJsCEnkKLKW4RTWUB2sdZqh+ULUjt8k1pCf8YUyTMs=
X-Received: by 2002:ac2:4889:0:b0:4f2:7953:6d16 with SMTP id
 x9-20020ac24889000000b004f279536d16mr170570lfc.18.1683871509418; Thu, 11 May
 2023 23:05:09 -0700 (PDT)
MIME-Version: 1.0
References: <20230511101201.2052667-1-lili.cui@intel.com> <CAFiYyc1qAkMXUunetGbL5zMiUYONRAD4u_J-E=PTxA7Se5OaQQ@mail.gmail.com>
 <SJ0PR11MB56006AC172F9B0537DC460199E749@SJ0PR11MB5600.namprd11.prod.outlook.com>
In-Reply-To: <SJ0PR11MB56006AC172F9B0537DC460199E749@SJ0PR11MB5600.namprd11.prod.outlook.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Fri, 12 May 2023 08:04:57 +0200
Message-ID: <CAFiYyc2SJ5s_8B5jB9JVLHY-zxTDx3=_AXy2AcZuWb-Cs=WFsg@mail.gmail.com>
Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the
 chain with FMA in reassoc pass
To: "Cui, Lili" <lili.cui@intel.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-0.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, May 11, 2023 at 5:20=E2=80=AFPM Cui, Lili <lili.cui@intel.com> wrot=
e:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Thursday, May 11, 2023 6:53 PM
> > To: Cui, Lili <lili.cui@intel.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length=
 of
> > the chain with FMA in reassoc pass
>
> Hi Richard,
> Thanks for helping to review the patch.
>
> >
> > As you are not changing the number of ops you should be able to use
> > quick_push here and below.  You should be able to do
> >
> >  ops->splice (ops_mult);
> >  ops->splice (ops_others);
> >
> > as well.
> >
> Done.
>
> > > +                 /* When enabling param_reassoc_max_chain_length_wit=
h_fma
> > to
> > > +                    keep the chain with fma, rank_ops_for_fma will d=
etect if
> > > +                    the chain has fmas and if so it will rearrange t=
he ops.  */
> > > +                 if (param_reassoc_max_chain_length_with_fma > 1
> > > +                     && direct_internal_fn_supported_p (IFN_FMA,
> > > +                                                        TREE_TYPE (l=
hs),
> > > +                                                        opt_type)
> > > +                     && (rhs_code =3D=3D PLUS_EXPR || rhs_code =3D=
=3D MINUS_EXPR))
> > > +                   {
> > > +                     keep_fma_chain =3D rank_ops_for_fma(&ops);
> > > +                   }
> > > +
> > > +                 int len =3D ops.length ();
> > >                   /* Only rewrite the expression tree to parallel in =
the
> > >                      last reassoc pass to avoid useless work back-and=
-forth
> > >                      with initial linearization.  */
> >
> > we are doing the parallel rewrite only in the last reassoc pass, i thin=
k it makes
> > sense to do the same for reassoc-for-fma.
>
> I rearranged the order of ops in reassoc1 without break the chain, it gen=
erated more vectorize during vector pass( seen in benchmark 503). So I rewr=
ite the ssa tree and keep the chain with function "rewrite_expr_tree" in re=
assoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reasso=
c2.
>
> >
> > Why do the existing expr rewrites not work after re-sorting the ops?
>
> For case https://godbolt.org/z/3x9PWE9Kb:  we put  "j" at first.
>
> j + l * m + a * b + c * d + e * f + g * h;
>
> GCC trunk: width =3D 2, ops_num =3D 6, old function " rewrite_expr_tree_p=
arallel " generates 3 FMAs.
> -------------------------------------------------------------------------=
--------------
>   _1 =3D l_10(D) * m_11(D);
>   _3 =3D a_13(D) * b_14(D);
>   _4 =3D j_12(D) + _3;    --------> Here is one FMA.
>   _5 =3D c_15(D) * d_16(D);
>   _8 =3D _1 + _5;            --------> Here is one FMA and lost one.
>   _7 =3D e_17(D) * f_18(D);
>   _9 =3D g_19(D) * h_20(D);
>   _2 =3D _7 + _9;           --------> Here is one FMA and lost one.
>   _6 =3D _2 + _4;
>   _21 =3D _6 + _8;
>   # VUSE <.MEM_22(D)>
>   return _21;
> -------------------------------------------------------------------------=
-------------
> width =3D 2, ops_num =3D 6, new function " rewrite_expr_tree_parallel_for=
_fma " generates 4 FMAs.
> -------------------------------------------------------------------------=
-----
> _1 =3D a_10(D) * b_11(D);
>   _3 =3D c_13(D) * d_14(D);
>   _5 =3D e_15(D) * f_16(D);
>   _7 =3D g_17(D) * h_18(D);
>   _4 =3D _5 + _7;           --------> Here is one FMA and lost one.
>   _8 =3D _4 + _1;           --------> Here is one FMA.
>   _9 =3D l_19(D) * m_20(D);
>   _2 =3D _9 + j_12(D);    --------> Here is one FMA.
>   _6 =3D _2 + _3;            --------> Here is one FMA.
>   _21 =3D _8 + _6;
>   return _21;
> -------------------------------------------------------------------------=
-----------

ISTR there were no sufficient comments in the code explaining why
rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...

>
> >
> > >                   if (!reassoc_insert_powi_p
> > > -                     && ops.length () > 3
> > > +                     && len > 3
> > > +                     && (!keep_fma_chain
> > > +                         || (keep_fma_chain
> > > +                             && len >
> > > + param_reassoc_max_chain_length_with_fma))
> >
> > in the case len < param_reassoc_max_chain_length_with_fma we have the
> > chain re-sorted but fall through to non-parallel rewrite.  I wonder if =
we do
> > not want to instead adjust the reassociation width?  I'd say it depends=
 on the
> > number of mult cases in the chain (sth the re-sorting could have comput=
ed).
> > Why do we have two completely independent --params here?  Can you give
> > an example --param value combination that makes "sense" and show how it
> > is beneficial?
>
> For this small case https://godbolt.org/z/Pxczrre8P
> a * b + c * d + e * f  + j
>
> GCC trunk: ops_num =3D 4, targetm.sched.reassociation_width is 4 (scalar =
fp cost is 4). Calculated: Width =3D 2. we can get 2 FMAs.
> ----------------------------------
>   _1 =3D a_6(D) * b_7(D);
>   _2 =3D c_8(D) * d_9(D);
>   _5 =3D _1 + _2;
>   _4 =3D e_10(D) * f_11(D);
>   _3 =3D _4 + j_12(D);
>   _13 =3D _3 + _5;
> --------------------------------------------------------
>   _2 =3D c_8(D) * d_9(D);
>   _5 =3D .FMA (a_6(D), b_7(D), _2);
>   _3 =3D .FMA (e_10(D), f_11(D), j_12(D));
>   _13 =3D _3 + _5;
> --------------------------------------------------------
> New patch: If just rearrange ops and fall through to parallel rewrite to =
break the chain with width =3D 2.
>
> ---------------------------------------------------------
>   _1 =3D a_6(D) * b_7(D);
>   _2 =3D j + _1;          -----> put j at the first.
>   _3 =3D c_8(D) * d_9(D);
>   _4 =3D e_10(D) * f_11(D);
>   _5 =3D _3 + _4;       -----> break chain with width =3D 2. we lost a FM=
A here.
>   _13 =3D _2 + 5;
>
> -------------------------------------------------------
>   _3 =3D c_8(D) * d_9(D);
>   _2 =3D .FMA (a_6(D), b_7(D), j);
>   _5 =3D .FMA (e_10(D), f_11(D), _3);
>   _13 =3D _2 + _5;
> --------------------------------------------------------
> Sometimes break chain will lose FMA( break chain needs put two mult-ops t=
ogether, which will lose one FMA ), we can only get 2 FMAs here, if we want=
 to get 3 FMAs, we need to keep the chain and not break it. So I added a pa=
ram to control chain length "param_reassoc_max_chain_length_with_fma =3D 4"=
 (For the small case in Bugzilla 98350, we need to keep the chain to genera=
te 6 FMAs.)
> -------------------------------------------------------
>   _1 =3D a_6(D) * b_7(D);
>   _2 =3D c_8(D) * d_9(D);
>   _4 =3D e_10(D) * f_11(D);
>   _15 =3D _4 + j_12(D);
>   _16 =3D _15 + _2;
>   _13 =3D _16 + _1;
> -------------------------------------------------------
>   _15 =3D .FMA (e_10(D), f_11(D), j_12(D));
>   _16 =3D .FMA (c_8(D), d_9(D), _15);
>   _13 =3D .FMA (a_6(D), b_7(D), _16);
> -------------------------------------------------------
> In some case we want to break the chain with width, we can set "param_rea=
ssoc_max_chain_length_with_fma =3D 2", it will rearrange ops and break the =
chain with width.

... it sounds like the problem could be fully addressed by sorting the
chain with reassoc-width in mind?
Wouldn't it be preferable if rewrite_expr_tree_parallel would get a
vector of mul and a vector of
non-mul ops so it can pick from the optimal candidate?

That said, I think rewrite_expr_tree_parallel_for_fma at least needs
more comments.

> >
> > >                       && (width =3D get_reassociation_width (ops_num,=
 rhs_code,
> > > -                                                          mode)) > 1=
)
> > > +                                                           mode)) >
> > > + 1)
> > >                     {
> > > -                     if (dump_file && (dump_flags & TDF_DETAILS))
> > > -                       fprintf (dump_file,
> > > -                                "Width =3D %d was chosen for reassoc=
iation\n",
> > > -                                width);
> > > -                     rewrite_expr_tree_parallel (as_a <gassign *> (s=
tmt),
> > > -                                                 width, ops);
> > > +                     if (keep_fma_chain)
> > > +                       {
> > > +                         if (dump_file && (dump_flags & TDF_DETAILS)=
)
> > > +                           fprintf (dump_file,
> > > +                                    "Break chain len =3D %d into wid=
th for FMA\n",
> > > +                                    len);
> > > +                         rewrite_expr_tree_parallel_for_fma
> > > +                           (as_a <gassign *> (stmt), width, ops);
> > > +                       }
> > > +                     else
> > > +                       {
> > > +                         if (dump_file && (dump_flags & TDF_DETAILS)=
)
> > > +                           fprintf (dump_file,
> > > +                                    "Width =3D %d was chosen for rea=
ssociation\n",
> > > +                                    width);
> > > +                         rewrite_expr_tree_parallel (as_a <gassign *=
> (stmt),
> > > +                                                     width, ops);
> > > +                       }
> > >                     }
> > >                   else
> > > -                    {
> > > -                      /* When there are three operands left, we want
> > > -                         to make sure the ones that get the double
> > > -                         binary op are chosen wisely.  */
> > > -                      int len =3D ops.length ();
> > > -                      if (len >=3D 3)
> > > +                   {
> > > +                     /* When there are three operands left, we want
> > > +                        to make sure the ones that get the double
> > > +                        binary op are chosen wisely.  */
> > > +                     if (len >=3D 3 && !keep_fma_chain)
> > >                         swap_ops_for_binary_stmt (ops, len - 3);
> > >
> > >                       new_lhs =3D rewrite_expr_tree (stmt, rhs_code, =
0, ops,
> > >                                                    powi_result !=3D N=
ULL
> > >                                                    || negate_result,
> > >                                                    len !=3D orig_len)=
;
> > > -                    }
> > > -
> > > +                   }
> > >                   /* If we combined some repeated factors into a
> > >                      __builtin_powi call, multiply that result by the
> > >                      reassociated operands.  */
> > > --
> > > 2.25.1
> > >