From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=wKJs=BL=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x236.google.com (mail-lj1-x236.google.com [IPv6:2a00:1450:4864:20::236])
	by sourceware.org (Postfix) with ESMTPS id 0400E3858D35
	for <gcc-patches@gcc.gnu.org>; Mon, 22 May 2023 13:30:00 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0400E3858D35
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x236.google.com with SMTP id 38308e7fff4ca-2af2958db45so34834591fa.1
        for <gcc-patches@gcc.gnu.org>; Mon, 22 May 2023 06:29:59 -0700 (PDT)
MIME-Version: 1.0
References: <20230517130222.2534562-1-lili.cui@intel.com>
In-Reply-To: <20230517130222.2534562-1-lili.cui@intel.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 22 May 2023 15:29:45 +0200
Message-ID: <CAFiYyc2tkFdzY5e-YU+S0qWVa+1vmKCbNH1b2_qs5wi0CBuxhw@mail.gmail.com>
Subject: Re: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass
To: "Cui, Lili" <lili.cui@intel.com>
Cc: gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, May 17, 2023 at 3:02 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> From: Lili Cui <lili.cui@intel.com>
>
> Make some changes in reassoc pass to make it more friendly to fma pass later.
> Using FMA instead of mult + add reduces register pressure and instructions
> retired.
>
> There are mainly two changes
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
> conducive to generating more fma and reducing the loss of FMA when breaking
> the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
> chains according to the given correlation width, keeping the FMA chance as
> much as possible.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
>    return  *e  + a * b + c * d ;
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmulss  %xmm3, %xmm2, %xmm2
>         vfmadd132ss     %xmm1, %xmm2, %xmm0
>         vaddss  (%rdi), %xmm0, %xmm0
>         ret
>
> With this patch GCC generates:
>         vfmadd213ss   (%rdi), %xmm1, %xmm0
>         vfmadd231ss   %xmm2, %xmm3, %xmm0
>         ret
>
> TEST2:
>
> for (int i = 0; i < N; i++)
> {
>   a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> }
>
> For "-Ofast -mfpmath=sse -mfma"  GCC generates:
>         vmovapd e(%rax), %ymm4
>         vmulpd  d(%rax), %ymm4, %ymm3
>         addq    $32, %rax
>         vmovapd c-32(%rax), %ymm5
>         vmovapd j-32(%rax), %ymm6
>         vmulpd  h-32(%rax), %ymm6, %ymm2
>         vmovapd a-32(%rax), %ymm6
>         vaddpd  p-32(%rax), %ymm6, %ymm0
>         vmovapd g-32(%rax), %ymm7
>         vfmadd231pd     b-32(%rax), %ymm5, %ymm3
>         vmovapd o-32(%rax), %ymm4
>         vmulpd  m-32(%rax), %ymm4, %ymm1
>         vmovapd l-32(%rax), %ymm5
>         vfmadd231pd     f-32(%rax), %ymm7, %ymm2
>         vfmadd231pd     k-32(%rax), %ymm5, %ymm1
>         vaddpd  %ymm3, %ymm0, %ymm0
>         vaddpd  %ymm2, %ymm0, %ymm0
>         vaddpd  %ymm1, %ymm0, %ymm0
>         vmovapd %ymm0, a-32(%rax)
>         cmpq    $8192, %rax
>         jne     .L4
>         vzeroupper
>         ret
>
> with this patch applied GCC breaks the chain with width = 2 and generates 6 fma:
>
>         vmovapd a(%rax), %ymm2
>         vmovapd c(%rax), %ymm0
>         addq    $32, %rax
>         vmovapd e-32(%rax), %ymm1
>         vmovapd p-32(%rax), %ymm5
>         vmovapd g-32(%rax), %ymm3
>         vmovapd j-32(%rax), %ymm6
>         vmovapd l-32(%rax), %ymm4
>         vmovapd o-32(%rax), %ymm7
>         vfmadd132pd     b-32(%rax), %ymm2, %ymm0
>         vfmadd132pd     d-32(%rax), %ymm5, %ymm1
>         vfmadd231pd     f-32(%rax), %ymm3, %ymm0
>         vfmadd231pd     h-32(%rax), %ymm6, %ymm1
>         vfmadd231pd     k-32(%rax), %ymm4, %ymm0
>         vfmadd231pd     m-32(%rax), %ymm7, %ymm1
>         vaddpd  %ymm1, %ymm0, %ymm0
>         vmovapd %ymm0, a-32(%rax)
>         cmpq    $8192, %rax
>         jne     .L2
>         vzeroupper
>         ret
>
> gcc/ChangeLog:
>
>         PR gcc/98350
>         * tree-ssa-reassoc.cc
>         (rewrite_expr_tree_parallel): Rewrite this function.
>         (rank_ops_for_fma): New.
>         (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
>         PR gcc/98350
>         * gcc.dg/pr98350-1.c: New test.
>         * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 ++++
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc          | 256 +++++++++++++++++++++----------
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..185511c5e0a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {
> +    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> +  }
> +}
> +/* { dg-final { scan-assembler-times "vfm" 6  } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..b35d88aead9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
> +
> +/* Test that the compiler rearrange the ops to generate more FMA.  */
> +
> +float
> +foo1 (float a, float b, float c, float d, float *e)
> +{
> +   return   *e + a * b + c * d ;
> +}
> +/* { dg-final { scan-assembler-times "vfm" 2  } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..52c8aab6033 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-reassoc.h"
>  #include "tree-ssa-math-opts.h"
>  #include "gimple-range.h"
> +#include "internal-fn.h"
>
>  /*  This is a simple global reassociation pass.  It is, in part, based
>      on the LLVM pass of the same name (They do some things more/less
> @@ -5468,14 +5469,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
>    return width;
>  }
>
> -/* Recursively rewrite our linearized statements so that the operators
> -   match those in OPS[OPINDEX], putting the computation in rank
> -   order and trying to allow operations to be executed in
> -   parallel.  */
> +/* Rewrite statements with dependency chain with regard to the chance to
> +   generate FMA.
> +   For the chain with FMA: Try to keep fma opportunity as much as possible.
> +   For the chain without FMA: Putting the computation in rank order and trying
> +   to allow operations to be executed in parallel.
> +   E.g.
> +   e + f + g + a * b + c * d;
>
> +   ssa1 = e + f;
> +   ssa2 = g + a * b;
> +   ssa3 = ssa1 + c * d;
> +   ssa4 = ssa2 + ssa3;
> +
> +   This reassociation approach preserves the chance of fma generation as much
> +   as possible.  */
>  static void
> -rewrite_expr_tree_parallel (gassign *stmt, int width,
> -                           const vec<operand_entry *> &ops)
> +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> +                                        const vec<operand_entry *> &ops)
>  {
>    enum tree_code opcode = gimple_assign_rhs_code (stmt);
>    int op_num = ops.length ();
> @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    int stmt_num = op_num - 1;
>    gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
>    int op_index = op_num - 1;
> -  int stmt_index = 0;
> -  int ready_stmts_end = 0;
> -  int i = 0;
> -  gimple *stmt1 = NULL, *stmt2 = NULL;
> +  int width_count = width;
> +  int i = 0, j = 0;
> +  tree tmp_op[2], op1;
> +  operand_entry *oe;
> +  gimple *stmt1 = NULL;
>    tree last_rhs1 = gimple_assign_rhs1 (stmt);
>
>    /* We start expression rewriting from the top statements.
> @@ -5496,91 +5508,84 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    for (i = stmt_num - 2; i >= 0; i--)
>      stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
>
> -  for (i = 0; i < stmt_num; i++)
> +  /* Build parallel dependency chain according to width.  */
> +  for (i = 0; i < width; i++)
>      {
> -      tree op1, op2;
> -
> -      /* Determine whether we should use results of
> -        already handled statements or not.  */
> -      if (ready_stmts_end == 0
> -         && (i - stmt_index >= width || op_index < 1))
> -       ready_stmts_end = i;
> -
> -      /* Now we choose operands for the next statement.  Non zero
> -        value in ready_stmts_end means here that we should use
> -        the result of already generated statements as new operand.  */
> -      if (ready_stmts_end > 0)
> -       {
> -         op1 = gimple_assign_lhs (stmts[stmt_index++]);
> -         if (ready_stmts_end > stmt_index)
> -           op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -         else if (op_index >= 0)
> -           {
> -             operand_entry *oe = ops[op_index--];
> -             stmt2 = oe->stmt_to_insert;
> -             op2 = oe->op;
> -           }
> -         else
> -           {
> -             gcc_assert (stmt_index < i);
> -             op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -           }
> +      /*   */

empty comment?

> +      if (op_index > 1 && !has_fma)
> +       swap_ops_for_binary_stmt (ops, op_index - 2);
>
> -         if (stmt_index >= ready_stmts_end)
> -           ready_stmts_end = 0;
> -       }
> -      else
> +      for (j = 0; j < 2; j++)
>         {
> -         if (op_index > 1)
> -           swap_ops_for_binary_stmt (ops, op_index - 2);
> -         operand_entry *oe2 = ops[op_index--];
> -         operand_entry *oe1 = ops[op_index--];
> -         op2 = oe2->op;
> -         stmt2 = oe2->stmt_to_insert;
> -         op1 = oe1->op;
> -         stmt1 = oe1->stmt_to_insert;
> +         gcc_assert (op_index >= 0);
> +         oe = ops[op_index--];
> +         tmp_op[j] = oe->op;
> +         /* If the stmt that defines operand has to be inserted, insert it
> +            before the use.  */
> +         stmt1 = oe->stmt_to_insert;
> +         if (stmt1)
> +           insert_stmt_before_use (stmts[i], stmt1);
> +         stmt1 = NULL;
>         }
> -
> -      /* If we emit the last statement then we should put
> -        operands into the last statement.  It will also
> -        break the loop.  */
> -      if (op_index < 0 && stmt_index == i)
> -       i = stmt_num - 1;
> +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> +      gimple_set_visited (stmts[i], true);
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> -         fprintf (dump_file, "Transforming ");
> +         fprintf (dump_file, " into ");
>           print_gimple_stmt (dump_file, stmts[i], 0);
>         }
> +    }
>
> -      /* If the stmt that defines operand has to be inserted, insert it
> -        before the use.  */
> -      if (stmt1)
> -       insert_stmt_before_use (stmts[i], stmt1);
> -      if (stmt2)
> -       insert_stmt_before_use (stmts[i], stmt2);
> -      stmt1 = stmt2 = NULL;
> -
> -      /* We keep original statement only for the last one.  All
> -        others are recreated.  */
> -      if (i == stmt_num - 1)
> +  for (i = width; i < stmt_num; i++)
> +    {
> +      /* We keep original statement only for the last one.  All others are
> +        recreated.  */
> +      if ( op_index < 0)
>         {
> -         gimple_assign_set_rhs1 (stmts[i], op1);
> -         gimple_assign_set_rhs2 (stmts[i], op2);
> -         update_stmt (stmts[i]);
> +         if (width_count == 2)
> +           {
> +
> +             /* We keep original statement only for the last one.  All
> +                others are recreated.  */
> +             gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> +             gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> +             update_stmt (stmts[i]);
> +           }
> +         else
> +           {
> +
> +             stmts[i] =
> +               build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                  gimple_assign_lhs (stmts[i-width_count]),
> +                                  gimple_assign_lhs (stmts[i-width_count+1]),
> +                                  opcode);
> +             gimple_set_visited (stmts[i], true);
> +             width_count--;
> +           }
>         }
>        else
>         {
> -         stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), op1, op2, opcode);
> +         /* Attach the rest of the ops to the parallel dependency chain.  */
> +         oe = ops[op_index--];
> +         op1 = oe->op;
> +         stmt1 = oe->stmt_to_insert;
> +         if (stmt1)
> +           insert_stmt_before_use (stmts[i], stmt1);
> +         stmt1 = NULL;
> +         stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                       gimple_assign_lhs (stmts[i-width]),
> +                                       op1,
> +                                       opcode);
>           gimple_set_visited (stmts[i], true);
>         }
> +
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, " into ");
>           print_gimple_stmt (dump_file, stmts[i], 0);
>         }
>      }
> -

I've looked three times but didn't find a use of 'has_fma'?

>    remove_visited_stmt_chain (last_rhs1);
>  }
>
> @@ -6649,6 +6654,76 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
>      }
>  }
>
> +/* Rearrange ops to generate more FMA when the chain may has more than 2 fmas.

may have

> +   Put no-mult ops and mult ops alternately at the end of the queue, which is
> +   conducive to generating more fma and reducing the loss of FMA when breaking
> +   the chain.
> +   E.g.
> +   a * b + c * d + e generates:
> +
> +   _4  = c_9(D) * d_10(D);
> +   _12 = .FMA (a_7(D), b_8(D), _4);
> +   _11 = e_6(D) + _12;
> +
> +   Rtearrange ops to -> e + a * b + c * d generates:

Rearrange

> +
> +   _4  = .FMA (c_7(D), d_8(D), _3);
> +   _11 = .FMA (a_5(D), b_6(D), _4);
> + */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> +  operand_entry *oe;
> +  unsigned int i;
> +  unsigned int ops_length = ops->length ();
> +  auto_vec<operand_entry *> ops_mult;
> +  auto_vec<operand_entry *> ops_others;
> +
> +  FOR_EACH_VEC_ELT (*ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) == SSA_NAME)
> +       {
> +         gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> +         if (is_gimple_assign (def_stmt)
> +             && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> +           ops_mult.safe_push (oe);
> +         else
> +           ops_others.safe_push (oe);
> +       }
> +      else
> +       ops_others.safe_push (oe);
> +    }
> +  /* When ops_mult.length == 2, like the following case,
> +
> +     a * b + c * d + e.
> +
> +     we need to rearrange the ops.
> +
> +     Putting ops that not def from mult in front can generate more fmas.  */
> +  if (ops_mult.length () >= 2)
> +    {
> +      /* If all ops are defined with mult, we don't need to rearrange them.  */
> +      if (ops_mult.length () != ops_length)

use && with the previous condition.

> +       {
> +         /* Put no-mult ops and mult ops alternately at the end of the
> +            queue, which is conducive to generating more fma and reducing the
> +            loss of FMA when breaking the chain.  */
> +         ops->truncate (0);
> +         ops->splice (ops_mult);
> +         int j, opindex = ops->length ();
> +         int others_length = ops_others.length();
> +         for (j = 0; j < others_length; j++)
> +           {
> +             oe = ops_others.pop ();
> +             ops->safe_insert (opindex, oe);

That's quadratic, since each insert in the middle has to move the ops
after the insertion point.  As said previously, we know that 'ops' has
enough room, so you can use the quick_ (or non-safe_) variants of the
APIs on it.
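
To illustrate the cost, here is a standalone sketch, not GCC code:
std::vector stands in for GCC's vec, strings stand in for operand_entry
pointers, and all function names are made up for the example.  The first
function mirrors the insertion loop from the patch; the second produces
the same order by filling preallocated storage from the back, so no
element is moved more than once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for vec<operand_entry *>.
using Ops = std::vector<std::string>;

// Mirrors the loop in the patch: start from the mult ops and insert each
// remaining op in the middle.  Every insert() shifts the tail of the
// vector, so the worst case is quadratic in the number of ops.
Ops interleave_with_insert (const Ops &mult, Ops others)
{
  Ops ops (mult);
  std::size_t opindex = ops.size ();
  while (!others.empty ())
    {
      ops.insert (ops.begin () + static_cast<std::ptrdiff_t> (opindex),
                  others.back ());
      others.pop_back ();
      if (opindex > 0)
        opindex--;
    }
  return ops;
}

// Same resulting order, built back to front into storage that is sized
// up front.  Each op is written exactly once, so this is linear.
Ops interleave_linear (const Ops &mult, const Ops &others)
{
  Ops ops (mult.size () + others.size ());
  std::size_t pos = ops.size (), i = mult.size (), j = others.size ();
  while (j > 0)
    {
      ops[--pos] = others[--j];
      if (i > 0)
        ops[--pos] = mult[--i];
    }
  /* Any mult ops left over keep their original order at the front.  */
  while (i > 0)
    ops[--pos] = mult[--i];
  return ops;
}

// Sanity check that both strategies agree on a given input.
void check_same_order (const Ops &mult, const Ops &others)
{
  assert (interleave_with_insert (mult, others)
          == interleave_linear (mult, others));
}
```

For mult = {m0, m1, m2} and others = {o0, o1}, both functions return
m0 m1 o0 m2 o1.  In the pass itself the equivalent would presumably be
written with the quick_ variants on 'ops', since its storage is already
large enough.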

Otherwise looks good to me.

Thanks,
Richard.

> +             if (opindex > 0)
> +               opindex--;
> +           }
> +       }
> +      return true;
> +    }
> +  return false;
> +}
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.
>
> @@ -6813,6 +6888,7 @@ reassociate_bb (basic_block bb)
>                   machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>                   int ops_num = ops.length ();
>                   int width;
> +                 bool has_fma = false;
>
>                   /* For binary bit operations, if there are at least 3
>                      operands and the last operand in OPS is a constant,
> @@ -6821,11 +6897,23 @@ reassociate_bb (basic_block bb)
>                      often match a canonical bit test when we get to RTL.  */
>                   if (ops.length () > 2
>                       && (rhs_code == BIT_AND_EXPR
> -                         || rhs_code == BIT_IOR_EXPR
> -                         || rhs_code == BIT_XOR_EXPR)
> +                         || rhs_code == BIT_IOR_EXPR
> +                         || rhs_code == BIT_XOR_EXPR)
>                       && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>                     std::swap (*ops[0], *ops[ops_num - 1]);
>
> +                 optimization_type opt_type = bb_optimization_type (bb);
> +
> +                 /* If the target support FMA, rank_ops_for_fma will detect if
> +                    the chain has fmas and rearrange the ops if so.  */
> +                 if (direct_internal_fn_supported_p (IFN_FMA,
> +                                                     TREE_TYPE (lhs),
> +                                                     opt_type)
> +                     && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> +                   {
> +                     has_fma = rank_ops_for_fma(&ops);
> +                   }
> +
>                   /* Only rewrite the expression tree to parallel in the
>                      last reassoc pass to avoid useless work back-and-forth
>                      with initial linearization.  */
> @@ -6839,22 +6927,24 @@ reassociate_bb (basic_block bb)
>                                  "Width = %d was chosen for reassociation\n",
>                                  width);
>                       rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> -                                                 width, ops);
> +                                                 width,
> +                                                 has_fma,
> +                                                 ops);
>                     }
>                   else
> -                    {
> -                      /* When there are three operands left, we want
> -                         to make sure the ones that get the double
> -                         binary op are chosen wisely.  */
> -                      int len = ops.length ();
> -                      if (len >= 3)
> +                   {
> +                     /* When there are three operands left, we want
> +                        to make sure the ones that get the double
> +                        binary op are chosen wisely.  */
> +                     int len = ops.length ();
> +                     if (len >= 3 && !has_fma)
>                         swap_ops_for_binary_stmt (ops, len - 3);
>
>                       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>                                                    powi_result != NULL
>                                                    || negate_result,
>                                                    len != orig_len);
> -                    }
> +                   }
>
>                   /* If we combined some repeated factors into a
>                      __builtin_powi call, multiply that result by the
> --
> 2.25.1
>