From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-x235.google.com (mail-lj1-x235.google.com [IPv6:2a00:1450:4864:20::235]) by sourceware.org (Postfix) with ESMTPS id 25BE53858D33 for ; Thu, 11 May 2023 10:54:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 25BE53858D33 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lj1-x235.google.com with SMTP id 38308e7fff4ca-2ac8c0fbb16so77699081fa.2 for ; Thu, 11 May 2023 03:54:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683802475; x=1686394475; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Yg2/oSlE6j4o80WCxVozFrjGiq/7FHNC1yJ8ilGjwNo=; b=ep7Qk1jY9ZEziNj3UwfNev/kKPq3hzDz8QdbObJ+8TeyeC1BJT/OelXEQDqZLc0b6K WFcBn8GWM8aPOCxpMxL4XGcNZWrW7QzM+69ZXOdDBQ5uDyHjI8YrAUV/f9EwNySI/Czm JB+QzPnXExP5oMsGRJp+SsqVNw2qgggWguYn3C+VIHA9OOYZ9dRfAudfFExfmN2ljlkF 4HB3KVjZ0Xie0g/Mz+q/CQlIGVJZque0l4PQGUMUSeAgGEjM5jK7tSL/YfHV9IYJy4Am OYC3AE3/bDYv0eTxXMc7Y5eoMPdoTMnc0ymmHcJIo1dM2vIZ6SWlHDoI6tTeRtBR6pXR JznQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683802475; x=1686394475; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Yg2/oSlE6j4o80WCxVozFrjGiq/7FHNC1yJ8ilGjwNo=; b=d3sVv9PoPioXf7Q4aVzim2hConmx32J9QTfxV2tw75EYW5qxexcN/gsl4s8Vo/fRpa FY9iwddKSQZqSDX27fQtCcZcG+WflQlIB0WtD7dWi37KB256dfWeELQsQ5asthlG3sQx zjq3tTnpjhyBa5ZaLC54kGAB0ZptUHCRdpCJxYbusMgGjDkFZv9Zot1/0PDlZZ/LiEcS kBhQaZkEzKQgP7hSL3IwlEYvWce9Cq7TCaqMF3XDNKWwhjYzckQkvFLvmQ3Ttt4axrLp qvSf1er6qVP+o3KflHbyUUtfqvUOIoHc5JvgX5xCBL0bLr5IZIDOVW6fm7uIxDelhrIe cY/A== X-Gm-Message-State: AC+VfDzMOYzTAj+Q4If9TmlxrAUGpW1dSsLOnZp95ctIxiNSNbySTwnf EUkFdNB/lGlgKZpjKLtjlLE6NDKYgfHNqJRs9+AH/nEj X-Google-Smtp-Source: ACHHUZ5aLMyDlgBS5bhpF415+K1VrwZ92844iHBzR0xfevl6ZKc0e9qGp1FR8y5bU9H3C1fM1HB45g70dEHKraiFd/Q= X-Received: by 2002:a2e:8689:0:b0:299:bdc7:8ff2 with SMTP id l9-20020a2e8689000000b00299bdc78ff2mr3087103lji.39.1683802474926; Thu, 11 May 2023 03:54:34 -0700 (PDT) MIME-Version: 1.0 References: <20230511101201.2052667-1-lili.cui@intel.com> In-Reply-To: <20230511101201.2052667-1-lili.cui@intel.com> From: Richard Biener Date: Thu, 11 May 2023 12:52:41 +0200 Message-ID: Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass To: "Cui, Lili" Cc: gcc-patches@gcc.gnu.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Thu, May 11, 2023 at 12:13=E2=80=AFPM Cui, Lili via Gcc-patches wrote: > > From: Lili Cui > > Hi, > > Those two patches each add a param to control the length of the chain wit= h > FMA in reassoc pass and a tuning option in the backend. > > Bootstrapped and regtested. Ok for trunk? > > Regards > Lili. > > Add a param for the chain with FMA in reassoc pass to make it more friend= ly to > the fma pass later. First to detect if this chain has ability to > generate more than 2 FMAs,if yes and param_reassoc_max_chain_length_with_= fma > is enabled, We will rearrange the ops so that they can be combined into m= ore > FMAs. When the chain length exceeds param_reassoc_max_chain_length_with_f= ma, > build parallel chains according to given association width and try to kee= p FMA > opportunity as much as possible. > > TEST1: > > float > foo (float a, float b, float c, float d, float *e) > { > return *e + a * b + c * d ; > } > > For -Ofast -march=3Dicelake-server GCC generates: > vmulss %xmm3, %xmm2, %xmm2 > vfmadd132ss %xmm1, %xmm2, %xmm0 > vaddss (%rdi), %xmm0, %xmm0 > ret > > with "--param=3Dreassoc-max-chain-length-with-fma=3D3" GCC generates: > vfmadd213ss (%rdi), %xmm1, %xmm0 > vfmadd231ss %xmm2, %xmm3, %xmm0 > ret > > gcc/ChangeLog: > > PR gcc/98350 > * params.opt (reassoc-max-fma-chain-length): New param. > * tree-ssa-reassoc.cc > (rewrite_expr_tree_parallel_for_fma): New. > (rank_ops_for_fma): Ditto. > (reassociate_bb): Handle new function. > > gcc/testsuite/ChangeLog: > > PR gcc/98350 > * gcc.dg/pr98350-1.c: New test. > * gcc.dg/pr98350-2.c: Ditto. > --- > gcc/params.opt | 4 + > gcc/testsuite/gcc.dg/pr98350-1.c | 31 +++++ > gcc/testsuite/gcc.dg/pr98350-2.c | 17 +++ > gcc/tree-ssa-reassoc.cc | 228 ++++++++++++++++++++++++++++--- > 4 files changed, 264 insertions(+), 16 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c > create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c > > diff --git a/gcc/params.opt b/gcc/params.opt > index 823cdb2ff85..f7c719afe64 100644 > --- a/gcc/params.opt > +++ b/gcc/params.opt > @@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applie= s to the cost of statements i > Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRa= nge(0, 1) Param Optimization > Enable loop vectorization of floating point inductions. > > +-param=3Dreassoc-max-chain-length-with-fma=3D > +Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init= (1) IntegerRange(1, 65536) Param Optimization > +The maximum chain length with fma considered in reassociation pass. > + > ; This comment is to ensure we retain the blank line above. > diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98= 350-1.c > new file mode 100644 > index 00000000000..32ecce13a2d > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/pr98350-1.c > @@ -0,0 +1,31 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Ofast -mfpmath=3Dsse -mfma --param=3Dreassoc-max-chain= -length-with-fma=3D7 -Wno-attributes " } */ > + > +/* Test that the compiler properly optimizes multiply and add > + to generate more FMA instructions. */ > +#define N 1024 > +double a[N]; > +double b[N]; > +double c[N]; > +double d[N]; > +double e[N]; > +double f[N]; > +double g[N]; > +double h[N]; > +double j[N]; > +double k[N]; > +double l[N]; > +double m[N]; > +double o[N]; > +double p[N]; > + > + > +void > +foo (void) > +{ > + for (int i =3D 0; i < N; i++) > + { > + a[i] +=3D b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[= i] * l[i] + m[i]* o[i] + p[i]; > + } > +} > +/* { dg-final { scan-assembler-times "vfm" 6 } } */ > diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98= 350-2.c > new file mode 100644 > index 00000000000..246025d43b8 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/pr98350-2.c > @@ -0,0 +1,17 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Ofast -mfpmath=3Dsse -mfma --param=3Dreassoc-max-chain= -length-with-fma=3D6 -Wno-attributes " } */ > + > +/* Test that the compiler properly build parallel chains according to gi= ven > + association width and try to keep FMA opportunity as much as possible= . */ > +#define N 33 > +double a[N]; > + > +void > +foo (void) > +{ > + a[32] =3D a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] = * a[9] > + + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17] > + + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25] > + + a[26] + a[27] + a[28] + a[29] + a[30] + a[31]; > +} > +/* { dg-final { scan-assembler-times "vfm" 12 } } */ > diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc > index 067a3f07f7e..6d2e158c4f5 100644 > --- a/gcc/tree-ssa-reassoc.cc > +++ b/gcc/tree-ssa-reassoc.cc > @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see > #include "tree-ssa-reassoc.h" > #include "tree-ssa-math-opts.h" > #include "gimple-range.h" > +#include "internal-fn.h" > > /* This is a simple global reassociation pass. It is, in part, based > on the LLVM pass of the same name (They do some things more/less > @@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_c= ode opc, > return width; > } > > +/* Rewrite statements with dependency chain with regard to the chance to > + generate FMA. When the dependency chain length exceeds > + param_max_reassoc_chain_length_with_fma, build parallel chains accord= ing to > + given association width and try to keep fma opportunity as much as po= ssible. > + E.g. > + e + f + g + a * b + c * d; > + > + ssa1 =3D e + f; > + ssa2 =3D g + a * b; > + ssa3 =3D ssa1 + c * d; > + ssa4 =3D ssa2 + ssa3; > + > + This reassociation approach preserves the chance of fma generation as= much > + as possible. */ > +static void > +rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width, > + const vec &ops) > +{ > + enum tree_code opcode =3D gimple_assign_rhs_code (stmt); > + int op_num =3D ops.length (); > + gcc_assert (op_num > 0); > + int stmt_num =3D op_num - 1; > + gimple **stmts =3D XALLOCAVEC (gimple *, stmt_num); > + int op_index =3D op_num - 1; > + int width_count =3D width; > + int i =3D 0, j =3D 0; > + tree tmp_op[2], op1; > + operand_entry *oe; > + gimple *stmt1 =3D NULL; > + tree last_rhs1 =3D gimple_assign_rhs1 (stmt); > + > + /* We start expression rewriting from the top statements. > + So, in this loop we create a full list of statements > + we will work with. */ > + stmts[stmt_num - 1] =3D stmt; > + for (i =3D stmt_num - 2; i >=3D 0; i--) > + stmts[i] =3D SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1])); > + > + /* Build parallel FMA dependency chain according to width. */ > + for (i =3D 0; i < width; i++) > + { > + for (j =3D 0; j < 2; j++) > + { > + oe =3D ops[op_index--]; > + tmp_op[j] =3D oe->op; > + stmt1 =3D oe->stmt_to_insert; > + if (stmt1) > + insert_stmt_before_use (stmts[i], stmt1); > + stmt1 =3D NULL; > + } > + stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], = tmp_op[0], opcode); > + gimple_set_visited (stmts[i], true); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, " into "); > + print_gimple_stmt (dump_file, stmts[i], 0); > + } > + } > + > + for (i =3D width; i < stmt_num; i++) > + { > + /* We keep original statement only for the last one. All others a= re > + recreated. */ > + if ( op_index < 0) > + { > + if (width_count =3D=3D 2) > + { > + > + gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[= i-1])); > + gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[= i-2])); > + } > + else > + { > + > + stmts[i] =3D > + build_and_add_sum (TREE_TYPE (last_rhs1), > + gimple_assign_lhs (stmts[i-width_count= ]), > + gimple_assign_lhs (stmts[i-width_count= +1]), > + opcode); > + width_count--; > + } > + update_stmt (stmts[i]); > + } > + else > + { > + oe =3D ops[op_index--]; > + op1 =3D oe->op; > + stmt1 =3D oe->stmt_to_insert; > + if (stmt1) > + insert_stmt_before_use (stmts[i], stmt1); > + stmt1 =3D NULL; > + stmts[i] =3D build_and_add_sum (TREE_TYPE (last_rhs1), > + gimple_assign_lhs (stmts[i-width]), > + op1, > + opcode); > + gimple_set_visited (stmts[i], true); > + } > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, " into "); > + print_gimple_stmt (dump_file, stmts[i], 0); > + } > + } > + remove_visited_stmt_chain (last_rhs1); > +} > + > /* Recursively rewrite our linearized statements so that the operators > match those in OPS[OPINDEX], putting the computation in rank > order and trying to allow operations to be executed in > @@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *= gsi, gimple *stmt, > } > } > > +/* Rearrange ops to generate more FMA when the chain may has more than 2= fmas. > + Putting ops that not def from mult in front can generate more fma. > + E.g. > + a * b + c * d + e generates: > + > + _4 =3D c_9(D) * d_10(D); > + _12 =3D .FMA (a_7(D), b_8(D), _4); > + _11 =3D e_6(D) + _12; > + > + Rtearrange ops to -> e + a * b + c * d generates: > + > + _4 =3D .FMA (c_7(D), d_8(D), _3); > + _11 =3D .FMA (a_5(D), b_6(D), _4); > + */ > +static bool > +rank_ops_for_fma (vec *ops) > +{ > + operand_entry *oe; > + unsigned int i; > + auto_vec ops_mult; > + auto_vec ops_others; > + > + FOR_EACH_VEC_ELT (*ops, i, oe) > + { > + if (TREE_CODE (oe->op) =3D=3D SSA_NAME) > + { > + gimple *def_stmt =3D SSA_NAME_DEF_STMT (oe->op); > + if (is_gimple_assign (def_stmt) > + && gimple_assign_rhs_code (def_stmt) =3D=3D MULT_EXPR) > + ops_mult.safe_push (oe); > + else > + ops_others.safe_push (oe); > + } > + else > + ops_others.safe_push (oe); > + } > + /* When ops_mult.length =3D=3D 2, like the following case, > + > + a * b + c * d + e. > + > + we need to rearrange the ops. > + > + Putting ops that not def from mult in front can generate more fmas.= */ > + if (ops_mult.length () >=3D 2) > + { > + /* If all ops are defined with mult, we don't need to rearrange th= em. */ > + if (ops_mult.length () !=3D ops->length ()) > + { > + ops->truncate (0); > + FOR_EACH_VEC_ELT (ops_mult, i, oe) > + ops->safe_push (oe); As you are not changing the number of ops you should be able to use quick_push here and below. You should be able to do ops->splice (ops_mult); ops->splice (ops_others); as well. > + FOR_EACH_VEC_ELT (ops_others, i, oe) > + ops->safe_push (oe); > + } > + return true; > + } > + return false; > +} > /* Reassociate expressions in basic block BB and its post-dominator as > children. > > @@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb) > machine_mode mode =3D TYPE_MODE (TREE_TYPE (lhs)); > int ops_num =3D ops.length (); > int width; > + bool keep_fma_chain =3D false; > > /* For binary bit operations, if there are at least 3 > operands and the last operand in OPS is a constant, > @@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb) > && TREE_CODE (ops.last ()->op) =3D=3D INTEGER_CST) > std::swap (*ops[0], *ops[ops_num - 1]); > > + optimization_type opt_type =3D bb_optimization_type (bb= ); > + > + /* When enabling param_reassoc_max_chain_length_with_fm= a to > + keep the chain with fma, rank_ops_for_fma will detec= t if > + the chain has fmas and if so it will rearrange the o= ps. */ > + if (param_reassoc_max_chain_length_with_fma > 1 > + && direct_internal_fn_supported_p (IFN_FMA, > + TREE_TYPE (lhs), > + opt_type) > + && (rhs_code =3D=3D PLUS_EXPR || rhs_code =3D=3D MI= NUS_EXPR)) > + { > + keep_fma_chain =3D rank_ops_for_fma(&ops); > + } > + > + int len =3D ops.length (); > /* Only rewrite the expression tree to parallel in the > last reassoc pass to avoid useless work back-and-for= th > with initial linearization. */ we are doing the parallel rewrite only in the last reassoc pass, i think it makes sense to do the same for reassoc-for-fma. Why do the existing expr rewrites not work after re-sorting the ops? > if (!reassoc_insert_powi_p > - && ops.length () > 3 > + && len > 3 > + && (!keep_fma_chain > + || (keep_fma_chain > + && len > param_reassoc_max_chain_length_wit= h_fma)) in the case len < param_reassoc_max_chain_length_with_fma we have the chain re-sorted but fall through to non-parallel rewrite. I wonder if we d= o not want to instead adjust the reassociation width? I'd say it depends on the number of mult cases in the chain (sth the re-sorting could have computed). Why do we have two completely independent --params here? Can you give an example --param value combination that makes "sense" and show how it is beneficial? > && (width =3D get_reassociation_width (ops_num, rhs= _code, > - mode)) > 1) > + mode)) > 1) > { > - if (dump_file && (dump_flags & TDF_DETAILS)) > - fprintf (dump_file, > - "Width =3D %d was chosen for reassociati= on\n", > - width); > - rewrite_expr_tree_parallel (as_a (stmt)= , > - width, ops); > + if (keep_fma_chain) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "Break chain len =3D %d into width f= or FMA\n", > + len); > + rewrite_expr_tree_parallel_for_fma > + (as_a (stmt), width, ops); > + } > + else > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "Width =3D %d was chosen for reassoc= iation\n", > + width); > + rewrite_expr_tree_parallel (as_a (s= tmt), > + width, ops); > + } > } > else > - { > - /* When there are three operands left, we want > - to make sure the ones that get the double > - binary op are chosen wisely. */ > - int len =3D ops.length (); > - if (len >=3D 3) > + { > + /* When there are three operands left, we want > + to make sure the ones that get the double > + binary op are chosen wisely. */ > + if (len >=3D 3 && !keep_fma_chain) > swap_ops_for_binary_stmt (ops, len - 3); > > new_lhs =3D rewrite_expr_tree (stmt, rhs_code, 0, o= ps, > powi_result !=3D NULL > || negate_result, > len !=3D orig_len); > - } > - > + } > /* If we combined some repeated factors into a > __builtin_powi call, multiply that result by the > reassociated operands. */ > -- > 2.25.1 >