From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-x12f.google.com (mail-lf1-x12f.google.com [IPv6:2a00:1450:4864:20::12f]) by sourceware.org (Postfix) with ESMTPS id 4840B3856242 for ; Fri, 12 May 2023 06:05:11 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4840B3856242 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lf1-x12f.google.com with SMTP id 2adb3069b0e04-4f2676d62a2so2609542e87.0 for ; Thu, 11 May 2023 23:05:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683871510; x=1686463510; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Wsx+SjG/asWtXx6zXoYUCjFFOJhULHeRp2zd/XeGmaE=; b=pEXHlU4DQYZqyW7osQf+9Lux2zyJuZQeA+szpuDJuP3Ws0jvSMvb940kjdUuY1tjW5 IRczOgQ0GfshH3/cBbICw81ASrhgP/cuF44nBQ+YH4UxI6dtGQq2Ocq6ei7UfVlgZQdI JHjLWY1YnITkg2b93JmcolCOi6OKqNM3jwTaECr/ZpK/kqqejHk8e/HhgdF4FLqeSqq3 E7dj83IX5jYSjbrjRaoG9FcSSfQ+PjoJlOndY3Fh+H7iLebGBQmBXy5eNhUi02pp/w6B HrUQJhNORvGb5zeKPhD4BZpAywx4LRBk/YQkawkA3oHnuK9J1Krhw3simYurFgKhtDuC ffbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683871510; x=1686463510; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Wsx+SjG/asWtXx6zXoYUCjFFOJhULHeRp2zd/XeGmaE=; b=BOqpg15POMtddv58ry/0CDVKEjjqiaan3PUnwekBII5VZ+1kCCzh22lpzO0mkUEUX+ bS0T69Xp6epiliDm/0QSy9sRUsHPduGqJ4YOiDHq1u3SxmVEaAILt/htEOnYlBcTLsAY bgiD8EllwLm1qYr0C/K5vxOdmMgmRuWaOJQx9OZ/t1u+rxOGwZ5WMDZjQykiDrHW7uD1 7K23bcG1zWhMkuMeydBrsTT+ZI9mj85S/ftB+PXtaVxGM+soeM/wo2KY4pWasYaVedEH LR3jEkbSqmzWXg4FUB+Ltt0Y7xFGrJcvBdSLCGNiS1pY+1/BhXmayG7z3qmmJGPdAE0f 6vGw== X-Gm-Message-State: AC+VfDyb+UNJboI50zFKzYGYYoOR8SnrouGe+g6xnazG0WOikssYp61p gHxbDfRswo4a2GcLlgwTwhQmPZkAuTs3nniOXrDMpLfkLBI= X-Google-Smtp-Source: ACHHUZ7gQDGi3j9RXJsDD1Kc9bKxvk6DwoX+osYb3296lTo8nwJsCEnkKLKW4RTWUB2sdZqh+ULUjt8k1pCf8YUyTMs= X-Received: by 2002:ac2:4889:0:b0:4f2:7953:6d16 with SMTP id x9-20020ac24889000000b004f279536d16mr170570lfc.18.1683871509418; Thu, 11 May 2023 23:05:09 -0700 (PDT) MIME-Version: 1.0 References: <20230511101201.2052667-1-lili.cui@intel.com> In-Reply-To: From: Richard Biener Date: Fri, 12 May 2023 08:04:57 +0200 Message-ID: Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass To: "Cui, Lili" Cc: "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-0.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Thu, May 11, 2023 at 5:20=E2=80=AFPM Cui, Lili wrot= e: > > > -----Original Message----- > > From: Richard Biener > > Sent: Thursday, May 11, 2023 6:53 PM > > To: Cui, Lili > > Cc: gcc-patches@gcc.gnu.org > > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length= of > > the chain with FMA in reassoc pass > > Hi Richard, > Thanks for helping to review the patch. > > > > > As you are not changing the number of ops you should be able to use > > quick_push here and below. You should be able to do > > > > ops->splice (ops_mult); > > ops->splice (ops_others); > > > > as well. > > > Done. > > > > + /* When enabling param_reassoc_max_chain_length_wit= h_fma > > to > > > + keep the chain with fma, rank_ops_for_fma will d= etect if > > > + the chain has fmas and if so it will rearrange t= he ops. */ > > > + if (param_reassoc_max_chain_length_with_fma > 1 > > > + && direct_internal_fn_supported_p (IFN_FMA, > > > + TREE_TYPE (l= hs), > > > + opt_type) > > > + && (rhs_code =3D=3D PLUS_EXPR || rhs_code =3D= =3D MINUS_EXPR)) > > > + { > > > + keep_fma_chain =3D rank_ops_for_fma(&ops); > > > + } > > > + > > > + int len =3D ops.length (); > > > /* Only rewrite the expression tree to parallel in = the > > > last reassoc pass to avoid useless work back-and= -forth > > > with initial linearization. */ > > > > we are doing the parallel rewrite only in the last reassoc pass, i thin= k it makes > > sense to do the same for reassoc-for-fma. > > I rearranged the order of ops in reassoc1 without break the chain, it gen= erated more vectorize during vector pass( seen in benchmark 503). So I rewr= ite the ssa tree and keep the chain with function "rewrite_expr_tree" in re= assoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reasso= c2. > > > > > Why do the existing expr rewrites not work after re-sorting the ops? > > For case https://godbolt.org/z/3x9PWE9Kb: we put "j" at first. > > j + l * m + a * b + c * d + e * f + g * h; > > GCC trunk: width =3D 2, ops_num =3D 6, old function " rewrite_expr_tree_p= arallel " generates 3 FMAs. > -------------------------------------------------------------------------= -------------- > _1 =3D l_10(D) * m_11(D); > _3 =3D a_13(D) * b_14(D); > _4 =3D j_12(D) + _3; --------> Here is one FMA. > _5 =3D c_15(D) * d_16(D); > _8 =3D _1 + _5; --------> Here is one FMA and lost one. > _7 =3D e_17(D) * f_18(D); > _9 =3D g_19(D) * h_20(D); > _2 =3D _7 + _9; --------> Here is one FMA and lost one. > _6 =3D _2 + _4; > _21 =3D _6 + _8; > # VUSE <.MEM_22(D)> > return _21; > -------------------------------------------------------------------------= ------------- > width =3D 2, ops_num =3D 6, new function " rewrite_expr_tree_parallel_for= _fma " generates 4 FMAs. > -------------------------------------------------------------------------= ----- > _1 =3D a_10(D) * b_11(D); > _3 =3D c_13(D) * d_14(D); > _5 =3D e_15(D) * f_16(D); > _7 =3D g_17(D) * h_18(D); > _4 =3D _5 + _7; --------> Here is one FMA and lost one. > _8 =3D _4 + _1; --------> Here is one FMA. > _9 =3D l_19(D) * m_20(D); > _2 =3D _9 + j_12(D); --------> Here is one FMA. > _6 =3D _2 + _3; --------> Here is one FMA. > _21 =3D _8 + _6; > return _21; > -------------------------------------------------------------------------= ----------- ISTR there were no sufficient comments in the code explaining why rewrite_expr_tree_parallel_for_fma is better by design. In fact ... > > > > > > if (!reassoc_insert_powi_p > > > - && ops.length () > 3 > > > + && len > 3 > > > + && (!keep_fma_chain > > > + || (keep_fma_chain > > > + && len > > > > + param_reassoc_max_chain_length_with_fma)) > > > > in the case len < param_reassoc_max_chain_length_with_fma we have the > > chain re-sorted but fall through to non-parallel rewrite. I wonder if = we do > > not want to instead adjust the reassociation width? I'd say it depends= on the > > number of mult cases in the chain (sth the re-sorting could have comput= ed). > > Why do we have two completely independent --params here? Can you give > > an example --param value combination that makes "sense" and show how it > > is beneficial? > > For this small case https://godbolt.org/z/Pxczrre8P > a * b + c * d + e * f + j > > GCC trunk: ops_num =3D 4, targetm.sched.reassociation_width is 4 (scalar = fp cost is 4). Calculated: Width =3D 2. we can get 2 FMAs. > ---------------------------------- > _1 =3D a_6(D) * b_7(D); > _2 =3D c_8(D) * d_9(D); > _5 =3D _1 + _2; > _4 =3D e_10(D) * f_11(D); > _3 =3D _4 + j_12(D); > _13 =3D _3 + _5; > -------------------------------------------------------- > _2 =3D c_8(D) * d_9(D); > _5 =3D .FMA (a_6(D), b_7(D), _2); > _3 =3D .FMA (e_10(D), f_11(D), j_12(D)); > _13 =3D _3 + _5; > -------------------------------------------------------- > New patch: If just rearrange ops and fall through to parallel rewrite to = break the chain with width =3D 2. > > --------------------------------------------------------- > _1 =3D a_6(D) * b_7(D); > _2 =3D j + _1; -----> put j at the first. > _3 =3D c_8(D) * d_9(D); > _4 =3D e_10(D) * f_11(D); > _5 =3D _3 + _4; -----> break chain with width =3D 2. we lost a FM= A here. > _13 =3D _2 + 5; > > ------------------------------------------------------- > _3 =3D c_8(D) * d_9(D); > _2 =3D .FMA (a_6(D), b_7(D), j); > _5 =3D .FMA (e_10(D), f_11(D), _3); > _13 =3D _2 + _5; > -------------------------------------------------------- > Sometimes break chain will lose FMA( break chain needs put two mult-ops t= ogether, which will lose one FMA ), we can only get 2 FMAs here, if we want= to get 3 FMAs, we need to keep the chain and not break it. So I added a pa= ram to control chain length "param_reassoc_max_chain_length_with_fma =3D 4"= (For the small case in Bugzilla 98350, we need to keep the chain to genera= te 6 FMAs.) > ------------------------------------------------------- > _1 =3D a_6(D) * b_7(D); > _2 =3D c_8(D) * d_9(D); > _4 =3D e_10(D) * f_11(D); > _15 =3D _4 + j_12(D); > _16 =3D _15 + _2; > _13 =3D _16 + _1; > ------------------------------------------------------- > _15 =3D .FMA (e_10(D), f_11(D), j_12(D)); > _16 =3D .FMA (c_8(D), d_9(D), _15); > _13 =3D .FMA (a_6(D), b_7(D), _16); > ------------------------------------------------------- > In some case we want to break the chain with width, we can set "param_rea= ssoc_max_chain_length_with_fma =3D 2", it will rearrange ops and break the = chain with width. ... it sounds like the problem could be fully addressed by sorting the chain with reassoc-width in mind? Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of mul and a vector of non-mul ops so it can pick from the optimal candidate? That said, I think rewrite_expr_tree_parallel_for_fma at least needs more comments. > > > > > && (width =3D get_reassociation_width (ops_num,= rhs_code, > > > - mode)) > 1= ) > > > + mode)) > > > > + 1) > > > { > > > - if (dump_file && (dump_flags & TDF_DETAILS)) > > > - fprintf (dump_file, > > > - "Width =3D %d was chosen for reassoc= iation\n", > > > - width); > > > - rewrite_expr_tree_parallel (as_a (s= tmt), > > > - width, ops); > > > + if (keep_fma_chain) > > > + { > > > + if (dump_file && (dump_flags & TDF_DETAILS)= ) > > > + fprintf (dump_file, > > > + "Break chain len =3D %d into wid= th for FMA\n", > > > + len); > > > + rewrite_expr_tree_parallel_for_fma > > > + (as_a (stmt), width, ops); > > > + } > > > + else > > > + { > > > + if (dump_file && (dump_flags & TDF_DETAILS)= ) > > > + fprintf (dump_file, > > > + "Width =3D %d was chosen for rea= ssociation\n", > > > + width); > > > + rewrite_expr_tree_parallel (as_a (stmt), > > > + width, ops); > > > + } > > > } > > > else > > > - { > > > - /* When there are three operands left, we want > > > - to make sure the ones that get the double > > > - binary op are chosen wisely. */ > > > - int len =3D ops.length (); > > > - if (len >=3D 3) > > > + { > > > + /* When there are three operands left, we want > > > + to make sure the ones that get the double > > > + binary op are chosen wisely. */ > > > + if (len >=3D 3 && !keep_fma_chain) > > > swap_ops_for_binary_stmt (ops, len - 3); > > > > > > new_lhs =3D rewrite_expr_tree (stmt, rhs_code, = 0, ops, > > > powi_result !=3D N= ULL > > > || negate_result, > > > len !=3D orig_len)= ; > > > - } > > > - > > > + } > > > /* If we combined some repeated factors into a > > > __builtin_powi call, multiply that result by the > > > reassociated operands. */ > > > -- > > > 2.25.1 > > >