From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=LeRa=CR=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x236.google.com (mail-lj1-x236.google.com [IPv6:2a00:1450:4864:20::236])
	by sourceware.org (Postfix) with ESMTPS id 53D4B3858D3C
	for <gcc-patches@gcc.gnu.org>; Thu, 29 Jun 2023 06:45:16 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 53D4B3858D3C
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x236.google.com with SMTP id 38308e7fff4ca-2b6994a8ce3so4616251fa.1
        for <gcc-patches@gcc.gnu.org>; Wed, 28 Jun 2023 23:45:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1688021115; x=1690613115;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=mNL3CJ+NF8fyDyBNDtj+yes0f9AtjBNs8V3d0icbJqk=;
        b=WtmbdxeOuHM8XZVhbFvhdoD2OhV556PXFXJagPgXrYMfj2QKnZT5KCBk6X0LqUCok7
         poH/DpfS+hKFdWJrZYVZsWxLDP/pj3PZfsOsvTMRZUhtCUXfYPmriu+xyFjYFuPiN/Jk
         72fdVNEqJaHhNNmtLuNSvxt5w+hFSqzJ9VyhpTbA6TzXtIegV5B8SSm2jmeWd+W/BamH
         J1FQIIRImT4kJ8t5VCVwVBtKTMi7tgbTDdAdDKq7p/Oe4l0n5LmpRBTT7BnybvYRHwVM
         FisxuNT8e6oEo0J+thbkX5NHGtE4NSQ0yKf8peMBgiAu5kwQXzv50qZXb/ZXEql0h4xZ
         Herg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1688021115; x=1690613115;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=mNL3CJ+NF8fyDyBNDtj+yes0f9AtjBNs8V3d0icbJqk=;
        b=C+4cRUKoV3zY4O0tCYL21q/kIvV+9OJ2BeJvomk/FeC22386D5o4uXvT2Eua87/IzI
         iJ2KBBns6KCCQxmfEWaX8IUEx14+BUFqgI6rDoqr6Ma29+ZG9ZABVSjmeQICD/9nCGMo
         rzcV3km0hhQalARMCQ8GOAPH9FmTyyCz7X7YRL9VUjahy9+XNdDTwHLudv4rUkhMHiJz
         0izKsd5Kd2bnaIAJJKafbKtXFBLRq2GW0YheXfTsbyMRJ4roIi/kpoLI9bitBxrXQGuH
         F/qE1z8XCVWEg635z+5qr1Ht2K5HY9Y71VEAy9s2xa3nSqkeenNUKvXyjyLcmFmTzcdG
         /i7Q==
X-Gm-Message-State: AC+VfDyLeL21FF5oy11lDv/TVwsLUIQQQ5ekjPXh2hgI4KB/SlPScUnd
	AIeXuQULVxCq9AA/Gp5PX+olPXY4oh373soh2ItnmwuB
X-Google-Smtp-Source: ACHHUZ5AdGHOouuHhgKSebwTit83DwljD1TBidojPGXPnQ5jTrhfL8s8Kg15S6Ltf46QoYpg0YRCZYWmMbskHG8N3QU=
X-Received: by 2002:a05:651c:115:b0:2b5:7fd2:ec36 with SMTP id
 a21-20020a05651c011500b002b57fd2ec36mr18062174ljb.21.1688021114595; Wed, 28
 Jun 2023 23:45:14 -0700 (PDT)
MIME-Version: 1.0
References: <20230629014949.1995621-1-lili.cui@intel.com>
In-Reply-To: <20230629014949.1995621-1-lili.cui@intel.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Thu, 29 Jun 2023 08:42:06 +0200
Message-ID: <CAFiYyc3gGXaLsA0jVm6g8kaP8HozrNVEKm7xbrw65tVKfqb2xw@mail.gmail.com>
Subject: Re: [PATCH] PR gcc/110148:Avoid adding loop-carried ops to long chains
To: "Cui, Lili" <lili.cui@intel.com>
Cc: gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, Jun 29, 2023 at 3:49=E2=80=AFAM Cui, Lili <lili.cui@intel.com> wrot=
e:
>
> From: Lili Cui <lili.cui@intel.com>
>
> Hi Maintainer
>
> This patch is to fix TSVC242 regression related to loop-carried ops.
>
> Bootstrapped and regtested. Ok for trunk?

OK.

Thanks,
Richard.

> Regards
> Lili.
>
> Avoid adding loop-carried ops to long chains, otherwise the whole chain w=
ill
> have dependencies across the loop iteration. Just keep loop-carried ops i=
n a
> separate chain.
>    E.g.
>    x_1 =3D phi(x_0, x_2)
>    y_1 =3D phi(y_0, y_2)
>
>    a + b + c + d + e + x1 + y1
>
>    SSA1 =3D a + b;
>    SSA2 =3D c + d;
>    SSA3 =3D SSA1 + e;
>    SSA4 =3D SSA3 + SSA2;
>    SSA5 =3D x1 + y1;
>    SSA6 =3D SSA4 + SSA5;
>
> With the patch applied, these test cases improved by 32%~100%.
>
> S242:
> for (int i =3D 1; i < LEN_1D; ++i) {
>     a[i] =3D a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];}
>
> Case 1:
> for (int i =3D 1; i < LEN_1D; ++i) {
>     a[i] =3D a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}
>
> Case 2:
> for (int i =3D 1; i < LEN_1D; ++i) {
>     a[i] =3D a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}
>
> The value is the execution time
> A: original version
> B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409(base on A)
> C: with current patch(base on B)
>
>           A       B       C     B/A             C/A
> s242    2.859   5.152   2.859   1.802028681     1
> case 1  5.489   5.488   3.511   0.999818        0.64
> case 2  7.216   7.499   4.885   1.039218        0.68
>
> gcc/ChangeLog:
>         PR tree-optimization/110148
>         * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle loop-c=
arried
>         ops in this function.
> ---
>  gcc/tree-ssa-reassoc.cc | 236 ++++++++++++++++++++++++++++------------
>  1 file changed, 167 insertions(+), 69 deletions(-)
>
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 96c88ec003e..f5da385e0b2 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -5471,37 +5471,62 @@ get_reassociation_width (int ops_num, enum tree_c=
ode opc,
>    return width;
>  }
>
> +#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
> +#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.=
  */
> +#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
> +
>  /* Rewrite statements with dependency chain with regard the chance to ge=
nerate
>     FMA.
>     For the chain with FMA: Try to keep fma opportunity as much as possib=
le.
>     For the chain without FMA: Putting the computation in rank order and =
trying
>     to allow operations to be executed in parallel.
>     E.g.
> -   e + f + g + a * b + c * d;
> +   e + f + a * b + c * d;
>
> -   ssa1 =3D e + f;
> -   ssa2 =3D g + a * b;
> -   ssa3 =3D ssa1 + c * d;
> -   ssa4 =3D ssa2 + ssa3;
> +   ssa1 =3D e + a * b;
> +   ssa2 =3D f + c * d;
> +   ssa3 =3D ssa1 + ssa2;
>
>     This reassociation approach preserves the chance of fma generation as=
 much
> -   as possible.  */
> +   as possible.
> +
> +   Another thing is to avoid adding loop-carried ops to long chains, oth=
erwise
> +   the whole chain will have dependencies across the loop iteration. Jus=
t keep
> +   loop-carried ops in a separate chain.
> +   E.g.
> +   x_1 =3D phi(x_0, x_2)
> +   y_1 =3D phi(y_0, y_2)
> +
> +   a + b + c + d + e + x1 + y1
> +
> +   SSA1 =3D a + b;
> +   SSA2 =3D c + d;
> +   SSA3 =3D SSA1 + e;
> +   SSA4 =3D SSA3 + SSA2;
> +   SSA5 =3D x1 + y1;
> +   SSA6 =3D SSA4 + SSA5;
> + */
>  static void
>  rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> -                                        const vec<operand_entry *> &ops)
> +                           const vec<operand_entry *> &ops)
>  {
>    enum tree_code opcode =3D gimple_assign_rhs_code (stmt);
>    int op_num =3D ops.length ();
> +  int op_normal_num =3D op_num;
>    gcc_assert (op_num > 0);
>    int stmt_num =3D op_num - 1;
>    gimple **stmts =3D XALLOCAVEC (gimple *, stmt_num);
> -  int op_index =3D op_num - 1;
> -  int width_count =3D width;
>    int i =3D 0, j =3D 0;
>    tree tmp_op[2], op1;
>    operand_entry *oe;
>    gimple *stmt1 =3D NULL;
>    tree last_rhs1 =3D gimple_assign_rhs1 (stmt);
> +  int last_rhs1_stmt_index =3D 0, last_rhs2_stmt_index =3D 0;
> +  int width_active =3D 0, width_count =3D 0;
> +  bool has_biased =3D false, ops_changed =3D false;
> +  auto_vec<operand_entry *> ops_normal;
> +  auto_vec<operand_entry *> ops_biased;
> +  vec<operand_entry *> *ops1;
>
>    /* We start expression rewriting from the top statements.
>       So, in this loop we create a full list of statements
> @@ -5510,83 +5535,155 @@ rewrite_expr_tree_parallel (gassign *stmt, int w=
idth, bool has_fma,
>    for (i =3D stmt_num - 2; i >=3D 0; i--)
>      stmts[i] =3D SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
>
> -  /* Width should not be larger than op_num / 2, since we can not create
> +  /* Avoid adding loop-carried ops to long chains, first filter out the
> +     loop-carried.  But we need to make sure that the length of the rema=
inder
> +     is not less than 4, which is the smallest ops length we can break t=
he
> +     dependency.  */
> +  FOR_EACH_VEC_ELT (ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) =3D=3D SSA_NAME
> +         && bitmap_bit_p (biased_names, SSA_NAME_VERSION (oe->op))
> +         && op_normal_num > 4)
> +       {
> +         ops_biased.safe_push (oe);
> +         has_biased =3D true;
> +         op_normal_num --;
> +       }
> +      else
> +       ops_normal.safe_push (oe);
> +    }
> +
> +  /* Width should not be larger than ops length / 2, since we can not cr=
eate
>       more parallel dependency chains that exceeds such value.  */
> -  width =3D width <=3D op_num / 2 ? width : op_num / 2;
> +  int width_normal =3D op_normal_num / 2;
> +  int width_biased =3D (op_num - op_normal_num) / 2;
> +  width_normal =3D width <=3D width_normal ? width : width_normal;
> +  width_biased =3D width <=3D width_biased ? width : width_biased;
> +
> +  ops1 =3D &ops_normal;
> +  width_count =3D width_active =3D width_normal;
>
>    /* Build parallel dependency chain according to width.  */
> -  for (i =3D 0; i < width; i++)
> +  for (i =3D 0; i < stmt_num; i++)
>      {
> -      /* If the chain has FAM, we do not swap two operands.  */
> -      if (op_index > 1 && !has_fma)
> -       swap_ops_for_binary_stmt (ops, op_index - 2);
> -
> -      for (j =3D 0; j < 2; j++)
> +      if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> -         gcc_assert (op_index >=3D 0);
> -         oe =3D ops[op_index--];
> -         tmp_op[j] =3D oe->op;
> -         /* If the stmt that defines operand has to be inserted, insert =
it
> -            before the use.  */
> -         stmt1 =3D oe->stmt_to_insert;
> -         if (stmt1)
> -           insert_stmt_before_use (stmts[i], stmt1);
> -         stmt1 =3D NULL;
> +         fprintf (dump_file, "Transforming ");
> +         print_gimple_stmt (dump_file, stmts[i], 0);
>         }
> -      stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1),
> -                                   tmp_op[1],
> -                                   tmp_op[0],
> -                                   opcode);
> -      gimple_set_visited (stmts[i], true);
>
> -      if (dump_file && (dump_flags & TDF_DETAILS))
> +      /* When the work of normal ops is over, but the loop is not over,
> +        continue to do biased ops.  */
> +      if (width_count =3D=3D 0 && ops1 =3D=3D &ops_normal)
>         {
> -         fprintf (dump_file, " into ");
> -         print_gimple_stmt (dump_file, stmts[i], 0);
> +         ops1 =3D &ops_biased;
> +         width_count =3D width_active =3D width_biased;
> +         ops_changed =3D true;
>         }
> -    }
>
> -  for (i =3D width; i < stmt_num; i++)
> -    {
> -      /* We keep original statement only for the last one.  All others a=
re
> -        recreated.  */
> -      if ( op_index < 0)
> +      /* Swap the operands if no FMA in the chain.  */
> +      if (ops1->length () > 2 && !has_fma)
> +       swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
> +
> +      if (i < width_active
> +         || (ops_changed && i <=3D (last_rhs1_stmt_index + width_active)=
))
>         {
> -         if (width_count =3D=3D 2)
> +         for (j =3D 0; j < 2; j++)
>             {
> -
> -             /* We keep original statement only for the last one.  All
> -                others are recreated.  */
> -             gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[=
i-1]));
> -             gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[=
i-2]));
> -             update_stmt (stmts[i]);
> +             oe =3D ops1->pop ();
> +             tmp_op[j] =3D oe->op;
> +             /* If the stmt that defines operand has to be inserted, ins=
ert it
> +                before the use.  */
> +             stmt1 =3D oe->stmt_to_insert;
> +             if (stmt1)
> +               insert_stmt_before_use (stmts[i], stmt1);
> +             stmt1 =3D NULL;
>             }
> -         else
> -           {
> +         stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                       tmp_op[1],
> +                                       tmp_op[0],
> +                                       opcode);
> +         gimple_set_visited (stmts[i], true);
>
> -             stmts[i] =3D
> -               build_and_add_sum (TREE_TYPE (last_rhs1),
> -                                  gimple_assign_lhs (stmts[i-width_count=
]),
> -                                  gimple_assign_lhs (stmts[i-width_count=
+1]),
> -                                  opcode);
> -             gimple_set_visited (stmts[i], true);
> -             width_count--;
> -           }
>         }
>        else
>         {
> -         /* Attach the rest of the ops to the parallel dependency chain.=
  */
> -         oe =3D ops[op_index--];
> -         op1 =3D oe->op;
> -         stmt1 =3D oe->stmt_to_insert;
> -         if (stmt1)
> -           insert_stmt_before_use (stmts[i], stmt1);
> -         stmt1 =3D NULL;
> -         stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1),
> -                                       gimple_assign_lhs (stmts[i-width]=
),
> -                                       op1,
> -                                       opcode);
> -         gimple_set_visited (stmts[i], true);
> +         /* We keep original statement only for the last one.  All other=
s are
> +            recreated.  */
> +         if (!ops1->length ())
> +           {
> +             /* For biased length equal to 2.  */
> +             if (width_count =3D=3D BIASED_END_STMT && !last_rhs2_stmt_i=
ndex)
> +               last_rhs2_stmt_index =3D i - 1;
> +
> +             /* When width_count =3D=3D 2 and there is no biased, just f=
inish.  */
> +             if (width_count =3D=3D NORMAL_END_STMT && !has_biased)
> +               {
> +                 last_rhs1_stmt_index =3D i - 1;
> +                 last_rhs2_stmt_index =3D i - 2;
> +               }
> +             if (last_rhs1_stmt_index && (last_rhs2_stmt_index || !has_b=
iased))
> +               {
> +                 /* We keep original statement only for the last one.  A=
ll
> +                    others are recreated.  */
> +                 gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (st=
mts[last_rhs1_stmt_index]));
> +                 gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (st=
mts[last_rhs2_stmt_index]));
> +                 update_stmt (stmts[i]);
> +               }
> +             else
> +               {
> +                 stmts[i] =3D
> +                   build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                      gimple_assign_lhs (stmts[i-width_c=
ount]),
> +                                      gimple_assign_lhs (stmts[i-width_c=
ount+1]),
> +                                      opcode);
> +                 gimple_set_visited (stmts[i], true);
> +                 width_count--;
> +
> +                 /* It is the end of normal or biased ops.
> +                    last_rhs1_stmt_index used to record the last stmt in=
dex
> +                    for normal ops.  last_rhs2_stmt_index used to record=
 the
> +                    last stmt index for biased ops.  */
> +                 if (width_count =3D=3D BIASED_END_STMT)
> +                   {
> +                     gcc_assert (has_biased);
> +                     if (ops_biased.length ())
> +                       last_rhs1_stmt_index =3D i;
> +                     else
> +                       last_rhs2_stmt_index =3D i;
> +                     width_count--;
> +                   }
> +               }
> +           }
> +         else
> +           {
> +             /* Attach the rest of the ops to the parallel dependency ch=
ain.  */
> +             oe =3D ops1->pop ();
> +             op1 =3D oe->op;
> +             stmt1 =3D oe->stmt_to_insert;
> +             if (stmt1)
> +               insert_stmt_before_use (stmts[i], stmt1);
> +             stmt1 =3D NULL;
> +
> +             /* For only one biased ops.  */
> +             if (width_count =3D=3D SPECIAL_BIASED_END_STMT)
> +               {
> +                 /* We keep original statement only for the last one.  A=
ll
> +                    others are recreated.  */
> +                 gcc_assert (has_biased);
> +                 gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (st=
mts[last_rhs1_stmt_index]));
> +                 gimple_assign_set_rhs2 (stmts[i], op1);
> +                 update_stmt (stmts[i]);
> +               }
> +             else
> +               {
> +                 stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                               gimple_assign_lhs (stmts[=
i-width_active]),
> +                                               op1,
> +                                               opcode);
> +                 gimple_set_visited (stmts[i], true);
> +               }
> +           }
>         }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -5595,6 +5692,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int widt=
h, bool has_fma,
>           print_gimple_stmt (dump_file, stmts[i], 0);
>         }
>      }
> +
>    remove_visited_stmt_chain (last_rhs1);
>  }
>
> --
> 2.25.1
>