From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-x236.google.com (mail-lj1-x236.google.com [IPv6:2a00:1450:4864:20::236]) by sourceware.org (Postfix) with ESMTPS id 53D4B3858D3C for ; Thu, 29 Jun 2023 06:45:16 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 53D4B3858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lj1-x236.google.com with SMTP id 38308e7fff4ca-2b6994a8ce3so4616251fa.1 for ; Wed, 28 Jun 2023 23:45:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1688021115; x=1690613115; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=mNL3CJ+NF8fyDyBNDtj+yes0f9AtjBNs8V3d0icbJqk=; b=WtmbdxeOuHM8XZVhbFvhdoD2OhV556PXFXJagPgXrYMfj2QKnZT5KCBk6X0LqUCok7 poH/DpfS+hKFdWJrZYVZsWxLDP/pj3PZfsOsvTMRZUhtCUXfYPmriu+xyFjYFuPiN/Jk 72fdVNEqJaHhNNmtLuNSvxt5w+hFSqzJ9VyhpTbA6TzXtIegV5B8SSm2jmeWd+W/BamH J1FQIIRImT4kJ8t5VCVwVBtKTMi7tgbTDdAdDKq7p/Oe4l0n5LmpRBTT7BnybvYRHwVM FisxuNT8e6oEo0J+thbkX5NHGtE4NSQ0yKf8peMBgiAu5kwQXzv50qZXb/ZXEql0h4xZ Herg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688021115; x=1690613115; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mNL3CJ+NF8fyDyBNDtj+yes0f9AtjBNs8V3d0icbJqk=; b=C+4cRUKoV3zY4O0tCYL21q/kIvV+9OJ2BeJvomk/FeC22386D5o4uXvT2Eua87/IzI iJ2KBBns6KCCQxmfEWaX8IUEx14+BUFqgI6rDoqr6Ma29+ZG9ZABVSjmeQICD/9nCGMo rzcV3km0hhQalARMCQ8GOAPH9FmTyyCz7X7YRL9VUjahy9+XNdDTwHLudv4rUkhMHiJz 0izKsd5Kd2bnaIAJJKafbKtXFBLRq2GW0YheXfTsbyMRJ4roIi/kpoLI9bitBxrXQGuH F/qE1z8XCVWEg635z+5qr1Ht2K5HY9Y71VEAy9s2xa3nSqkeenNUKvXyjyLcmFmTzcdG /i7Q== X-Gm-Message-State: AC+VfDyLeL21FF5oy11lDv/TVwsLUIQQQ5ekjPXh2hgI4KB/SlPScUnd AIeXuQULVxCq9AA/Gp5PX+olPXY4oh373soh2ItnmwuB X-Google-Smtp-Source: ACHHUZ5AdGHOouuHhgKSebwTit83DwljD1TBidojPGXPnQ5jTrhfL8s8Kg15S6Ltf46QoYpg0YRCZYWmMbskHG8N3QU= X-Received: by 2002:a05:651c:115:b0:2b5:7fd2:ec36 with SMTP id a21-20020a05651c011500b002b57fd2ec36mr18062174ljb.21.1688021114595; Wed, 28 Jun 2023 23:45:14 -0700 (PDT) MIME-Version: 1.0 References: <20230629014949.1995621-1-lili.cui@intel.com> In-Reply-To: <20230629014949.1995621-1-lili.cui@intel.com> From: Richard Biener Date: Thu, 29 Jun 2023 08:42:06 +0200 Message-ID: Subject: Re: [PATCH] PR gcc/110148:Avoid adding loop-carried ops to long chains To: "Cui, Lili" Cc: gcc-patches@gcc.gnu.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Thu, Jun 29, 2023 at 3:49=E2=80=AFAM Cui, Lili wrot= e: > > From: Lili Cui > > Hi Maintainer > > This patch is to fix TSVC242 regression related to loop-carried ops. > > Bootstrapped and regtested. Ok for trunk? OK. Thanks, Richard. > Regards > Lili. > > Avoid adding loop-carried ops to long chains, otherwise the whole chain w= ill > have dependencies across the loop iteration. Just keep loop-carried ops i= n a > separate chain. > E.g. > x_1 =3D phi(x_0, x_2) > y_1 =3D phi(y_0, y_2) > > a + b + c + d + e + x1 + y1 > > SSA1 =3D a + b; > SSA2 =3D c + d; > SSA3 =3D SSA1 + e; > SSA4 =3D SSA3 + SSA2; > SSA5 =3D x1 + y1; > SSA6 =3D SSA4 + SSA5; > > With the patch applied, these test cases improved by 32%~100%. > > S242: > for (int i =3D 1; i < LEN_1D; ++i) { > a[i] =3D a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];} > > Case 1: > for (int i =3D 1; i < LEN_1D; ++i) { > a[i] =3D a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];} > > Case 2: > for (int i =3D 1; i < LEN_1D; ++i) { > a[i] =3D a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];} > > The value is the execution time > A: original version > B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409(base on A) > C: with current patch(base on B) > > A B C B/A C/A > s242 2.859 5.152 2.859 1.802028681 1 > case 1 5.489 5.488 3.511 0.999818 0.64 > case 2 7.216 7.499 4.885 1.039218 0.68 > > gcc/ChangeLog: > PR tree-optimization/110148 > * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle loop-c= arried > ops in this function. > --- > gcc/tree-ssa-reassoc.cc | 236 ++++++++++++++++++++++++++++------------ > 1 file changed, 167 insertions(+), 69 deletions(-) > > diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc > index 96c88ec003e..f5da385e0b2 100644 > --- a/gcc/tree-ssa-reassoc.cc > +++ b/gcc/tree-ssa-reassoc.cc > @@ -5471,37 +5471,62 @@ get_reassociation_width (int ops_num, enum tree_c= ode opc, > return width; > } > > +#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops. */ > +#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.= */ > +#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops. */ > + > /* Rewrite statements with dependency chain with regard the chance to ge= nerate > FMA. > For the chain with FMA: Try to keep fma opportunity as much as possib= le. > For the chain without FMA: Putting the computation in rank order and = trying > to allow operations to be executed in parallel. > E.g. > - e + f + g + a * b + c * d; > + e + f + a * b + c * d; > > - ssa1 =3D e + f; > - ssa2 =3D g + a * b; > - ssa3 =3D ssa1 + c * d; > - ssa4 =3D ssa2 + ssa3; > + ssa1 =3D e + a * b; > + ssa2 =3D f + c * d; > + ssa3 =3D ssa1 + ssa2; > > This reassociation approach preserves the chance of fma generation as= much > - as possible. */ > + as possible. > + > + Another thing is to avoid adding loop-carried ops to long chains, oth= erwise > + the whole chain will have dependencies across the loop iteration. Jus= t keep > + loop-carried ops in a separate chain. > + E.g. > + x_1 =3D phi(x_0, x_2) > + y_1 =3D phi(y_0, y_2) > + > + a + b + c + d + e + x1 + y1 > + > + SSA1 =3D a + b; > + SSA2 =3D c + d; > + SSA3 =3D SSA1 + e; > + SSA4 =3D SSA3 + SSA2; > + SSA5 =3D x1 + y1; > + SSA6 =3D SSA4 + SSA5; > + */ > static void > rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma, > - const vec &ops) > + const vec &ops) > { > enum tree_code opcode =3D gimple_assign_rhs_code (stmt); > int op_num =3D ops.length (); > + int op_normal_num =3D op_num; > gcc_assert (op_num > 0); > int stmt_num =3D op_num - 1; > gimple **stmts =3D XALLOCAVEC (gimple *, stmt_num); > - int op_index =3D op_num - 1; > - int width_count =3D width; > int i =3D 0, j =3D 0; > tree tmp_op[2], op1; > operand_entry *oe; > gimple *stmt1 =3D NULL; > tree last_rhs1 =3D gimple_assign_rhs1 (stmt); > + int last_rhs1_stmt_index =3D 0, last_rhs2_stmt_index =3D 0; > + int width_active =3D 0, width_count =3D 0; > + bool has_biased =3D false, ops_changed =3D false; > + auto_vec ops_normal; > + auto_vec ops_biased; > + vec *ops1; > > /* We start expression rewriting from the top statements. > So, in this loop we create a full list of statements > @@ -5510,83 +5535,155 @@ rewrite_expr_tree_parallel (gassign *stmt, int w= idth, bool has_fma, > for (i =3D stmt_num - 2; i >=3D 0; i--) > stmts[i] =3D SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1])); > > - /* Width should not be larger than op_num / 2, since we can not create > + /* Avoid adding loop-carried ops to long chains, first filter out the > + loop-carried. But we need to make sure that the length of the rema= inder > + is not less than 4, which is the smallest ops length we can break t= he > + dependency. */ > + FOR_EACH_VEC_ELT (ops, i, oe) > + { > + if (TREE_CODE (oe->op) =3D=3D SSA_NAME > + && bitmap_bit_p (biased_names, SSA_NAME_VERSION (oe->op)) > + && op_normal_num > 4) > + { > + ops_biased.safe_push (oe); > + has_biased =3D true; > + op_normal_num --; > + } > + else > + ops_normal.safe_push (oe); > + } > + > + /* Width should not be larger than ops length / 2, since we can not cr= eate > more parallel dependency chains that exceeds such value. */ > - width =3D width <=3D op_num / 2 ? width : op_num / 2; > + int width_normal =3D op_normal_num / 2; > + int width_biased =3D (op_num - op_normal_num) / 2; > + width_normal =3D width <=3D width_normal ? width : width_normal; > + width_biased =3D width <=3D width_biased ? width : width_biased; > + > + ops1 =3D &ops_normal; > + width_count =3D width_active =3D width_normal; > > /* Build parallel dependency chain according to width. */ > - for (i =3D 0; i < width; i++) > + for (i =3D 0; i < stmt_num; i++) > { > - /* If the chain has FAM, we do not swap two operands. */ > - if (op_index > 1 && !has_fma) > - swap_ops_for_binary_stmt (ops, op_index - 2); > - > - for (j =3D 0; j < 2; j++) > + if (dump_file && (dump_flags & TDF_DETAILS)) > { > - gcc_assert (op_index >=3D 0); > - oe =3D ops[op_index--]; > - tmp_op[j] =3D oe->op; > - /* If the stmt that defines operand has to be inserted, insert = it > - before the use. */ > - stmt1 =3D oe->stmt_to_insert; > - if (stmt1) > - insert_stmt_before_use (stmts[i], stmt1); > - stmt1 =3D NULL; > + fprintf (dump_file, "Transforming "); > + print_gimple_stmt (dump_file, stmts[i], 0); > } > - stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), > - tmp_op[1], > - tmp_op[0], > - opcode); > - gimple_set_visited (stmts[i], true); > > - if (dump_file && (dump_flags & TDF_DETAILS)) > + /* When the work of normal ops is over, but the loop is not over, > + continue to do biased ops. */ > + if (width_count =3D=3D 0 && ops1 =3D=3D &ops_normal) > { > - fprintf (dump_file, " into "); > - print_gimple_stmt (dump_file, stmts[i], 0); > + ops1 =3D &ops_biased; > + width_count =3D width_active =3D width_biased; > + ops_changed =3D true; > } > - } > > - for (i =3D width; i < stmt_num; i++) > - { > - /* We keep original statement only for the last one. All others a= re > - recreated. */ > - if ( op_index < 0) > + /* Swap the operands if no FMA in the chain. */ > + if (ops1->length () > 2 && !has_fma) > + swap_ops_for_binary_stmt (*ops1, ops1->length () - 3); > + > + if (i < width_active > + || (ops_changed && i <=3D (last_rhs1_stmt_index + width_active)= )) > { > - if (width_count =3D=3D 2) > + for (j =3D 0; j < 2; j++) > { > - > - /* We keep original statement only for the last one. All > - others are recreated. */ > - gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[= i-1])); > - gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[= i-2])); > - update_stmt (stmts[i]); > + oe =3D ops1->pop (); > + tmp_op[j] =3D oe->op; > + /* If the stmt that defines operand has to be inserted, ins= ert it > + before the use. */ > + stmt1 =3D oe->stmt_to_insert; > + if (stmt1) > + insert_stmt_before_use (stmts[i], stmt1); > + stmt1 =3D NULL; > } > - else > - { > + stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), > + tmp_op[1], > + tmp_op[0], > + opcode); > + gimple_set_visited (stmts[i], true); > > - stmts[i] =3D > - build_and_add_sum (TREE_TYPE (last_rhs1), > - gimple_assign_lhs (stmts[i-width_count= ]), > - gimple_assign_lhs (stmts[i-width_count= +1]), > - opcode); > - gimple_set_visited (stmts[i], true); > - width_count--; > - } > } > else > { > - /* Attach the rest of the ops to the parallel dependency chain.= */ > - oe =3D ops[op_index--]; > - op1 =3D oe->op; > - stmt1 =3D oe->stmt_to_insert; > - if (stmt1) > - insert_stmt_before_use (stmts[i], stmt1); > - stmt1 =3D NULL; > - stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), > - gimple_assign_lhs (stmts[i-width]= ), > - op1, > - opcode); > - gimple_set_visited (stmts[i], true); > + /* We keep original statement only for the last one. All other= s are > + recreated. */ > + if (!ops1->length ()) > + { > + /* For biased length equal to 2. */ > + if (width_count =3D=3D BIASED_END_STMT && !last_rhs2_stmt_i= ndex) > + last_rhs2_stmt_index =3D i - 1; > + > + /* When width_count =3D=3D 2 and there is no biased, just f= inish. */ > + if (width_count =3D=3D NORMAL_END_STMT && !has_biased) > + { > + last_rhs1_stmt_index =3D i - 1; > + last_rhs2_stmt_index =3D i - 2; > + } > + if (last_rhs1_stmt_index && (last_rhs2_stmt_index || !has_b= iased)) > + { > + /* We keep original statement only for the last one. A= ll > + others are recreated. */ > + gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (st= mts[last_rhs1_stmt_index])); > + gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (st= mts[last_rhs2_stmt_index])); > + update_stmt (stmts[i]); > + } > + else > + { > + stmts[i] =3D > + build_and_add_sum (TREE_TYPE (last_rhs1), > + gimple_assign_lhs (stmts[i-width_c= ount]), > + gimple_assign_lhs (stmts[i-width_c= ount+1]), > + opcode); > + gimple_set_visited (stmts[i], true); > + width_count--; > + > + /* It is the end of normal or biased ops. > + last_rhs1_stmt_index used to record the last stmt in= dex > + for normal ops. last_rhs2_stmt_index used to record= the > + last stmt index for biased ops. */ > + if (width_count =3D=3D BIASED_END_STMT) > + { > + gcc_assert (has_biased); > + if (ops_biased.length ()) > + last_rhs1_stmt_index =3D i; > + else > + last_rhs2_stmt_index =3D i; > + width_count--; > + } > + } > + } > + else > + { > + /* Attach the rest of the ops to the parallel dependency ch= ain. */ > + oe =3D ops1->pop (); > + op1 =3D oe->op; > + stmt1 =3D oe->stmt_to_insert; > + if (stmt1) > + insert_stmt_before_use (stmts[i], stmt1); > + stmt1 =3D NULL; > + > + /* For only one biased ops. */ > + if (width_count =3D=3D SPECIAL_BIASED_END_STMT) > + { > + /* We keep original statement only for the last one. A= ll > + others are recreated. */ > + gcc_assert (has_biased); > + gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (st= mts[last_rhs1_stmt_index])); > + gimple_assign_set_rhs2 (stmts[i], op1); > + update_stmt (stmts[i]); > + } > + else > + { > + stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), > + gimple_assign_lhs (stmts[= i-width_active]), > + op1, > + opcode); > + gimple_set_visited (stmts[i], true); > + } > + } > } > > if (dump_file && (dump_flags & TDF_DETAILS)) > @@ -5595,6 +5692,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int widt= h, bool has_fma, > print_gimple_stmt (dump_file, stmts[i], 0); > } > } > + > remove_visited_stmt_chain (last_rhs1); > } > > -- > 2.25.1 >