From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-x132.google.com (mail-lf1-x132.google.com [IPv6:2a00:1450:4864:20::132]) by sourceware.org (Postfix) with ESMTPS id 50B853856970 for ; Mon, 15 May 2023 13:13:11 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 50B853856970 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lf1-x132.google.com with SMTP id 2adb3069b0e04-4f13dafd5dcso14486548e87.3 for ; Mon, 15 May 2023 06:13:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1684156390; x=1686748390; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=xfGR37GXV+za/LkTv/vfnspWwMURClmReURq65hjI0Q=; b=DWxwqQZ77d9BvE6EqfgTaGxKKffyl+CgbUlY2Rn1m1ZFju/8dN6JZkotbtsDZYLrqB /Tm2y/m/5hzaE5mW9CTVfUiOiEwPany2BxqR9rArKfhTwUrUcau/TvL9vJrHuskpUQG7 mk0TzhVIqc29XPhJuhUg1w86ErVEt85vzPx+U9+qKraxIqt/VwQDJI5dtpctk4qnc1Yz 77ZtY3HDUBykGdH7Jpii/oSd3l9VnUjTsjGUq1TvNQGRODS/HO+jiJJt0/hSoRuxZTtA wTqjzxCiOkCkGcZCzXuTLxWfKGDvlKGX5LFB+D82ycLuYnK6keBKOXANIO6BxdVzOcPV GNrg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684156390; x=1686748390; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xfGR37GXV+za/LkTv/vfnspWwMURClmReURq65hjI0Q=; b=NUJu4CVSBToGCfhdfKdBUoZk8EyECNMqcShZxWMbfFB5FUjL4ziV4QX8navOylUuqk I+QbcpZnHVQ9TEXqOxPnmcI00rzf+JvA+12a21Rb8FSF0VesnytYEmn3LnVdbAeoLqOy ftMY4sUUbVdE5r4MGK6Zg9NveGWJaiUe/v44w/aiB31OD4wLewMGRB4r/W6PBS6Ny6sL +jRtNSvfBwpko8gZkay7iw44JQ9OsXZ20xfCiw2lkmyor3vgutDY3yDABRDRQUgChUBI kleqmQB3Ax0ZsGNnCM5B59xFPHWFNjSKzZlrRjMQll3rSr5eGlypHFOIpdpwW/B+E4Yt 6GYA== X-Gm-Message-State: AC+VfDzUCcWJ/lco2e0Pk/xYhByJutrJFoSSCGzxcwyz8BK9l2nTxMij 2Tj/r7irxkCPjfdkEHgpxdSyxiDGjCLxpLRofWM= X-Google-Smtp-Source: ACHHUZ6muOU1Xj8k5YffqPAplqXuG/EvFE0VSKArwBRD22Z7f0pF1FIvGJw2Satoo/8mGHia5JyNBVLogNxJyD3tQSU= X-Received: by 2002:ac2:454b:0:b0:4f1:45a7:114f with SMTP id j11-20020ac2454b000000b004f145a7114fmr6628162lfm.37.1684156389370; Mon, 15 May 2023 06:13:09 -0700 (PDT) MIME-Version: 1.0 References: <20230511101201.2052667-1-lili.cui@intel.com> In-Reply-To: From: Richard Biener Date: Mon, 15 May 2023 15:12:57 +0200 Message-ID: Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass To: "Cui, Lili" Cc: "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, May 12, 2023 at 11:05=E2=80=AFAM Cui, Lili wro= te: > > > ISTR there were no sufficient comments in the code explaining why > > rewrite_expr_tree_parallel_for_fma is better by design. In fact ... > > > > > > > > > > > > > > if (!reassoc_insert_powi_p > > > > > - && ops.length () > 3 > > > > > + && len > 3 > > > > > + && (!keep_fma_chain > > > > > + || (keep_fma_chain > > > > > + && len > > > > > > + param_reassoc_max_chain_length_with_fma)) > > > > > > > > in the case len < param_reassoc_max_chain_length_with_fma we have > > > > the chain re-sorted but fall through to non-parallel rewrite. I > > > > wonder if we do not want to instead adjust the reassociation width? > > > > I'd say it depends on the number of mult cases in the chain (sth th= e re- > > sorting could have computed). > > > > Why do we have two completely independent --params here? Can you > > > > give an example --param value combination that makes "sense" and > > > > show how it is beneficial? > > > > > > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e > > > * f + j > > > > > > GCC trunk: ops_num =3D 4, targetm.sched.reassociation_width is 4 (sca= lar fp > > cost is 4). Calculated: Width =3D 2. we can get 2 FMAs. > > > ---------------------------------- > > > _1 =3D a_6(D) * b_7(D); > > > _2 =3D c_8(D) * d_9(D); > > > _5 =3D _1 + _2; > > > _4 =3D e_10(D) * f_11(D); > > > _3 =3D _4 + j_12(D); > > > _13 =3D _3 + _5; > > > -------------------------------------------------------- > > > _2 =3D c_8(D) * d_9(D); > > > _5 =3D .FMA (a_6(D), b_7(D), _2); > > > _3 =3D .FMA (e_10(D), f_11(D), j_12(D)); > > > _13 =3D _3 + _5; > > > -------------------------------------------------------- > > > New patch: If just rearrange ops and fall through to parallel rewrite= to > > break the chain with width =3D 2. > > > > > > --------------------------------------------------------- > > > _1 =3D a_6(D) * b_7(D); > > > _2 =3D j + _1; -----> put j at the first. > > > _3 =3D c_8(D) * d_9(D); > > > _4 =3D e_10(D) * f_11(D); > > > _5 =3D _3 + _4; -----> break chain with width =3D 2. we lost = a FMA here. > > > _13 =3D _2 + 5; > > > > > > ------------------------------------------------------- > > > _3 =3D c_8(D) * d_9(D); > > > _2 =3D .FMA (a_6(D), b_7(D), j); > > > _5 =3D .FMA (e_10(D), f_11(D), _3); > > > _13 =3D _2 + _5; > > > -------------------------------------------------------- > > > Sometimes break chain will lose FMA( break chain needs put two > > > mult-ops together, which will lose one FMA ), we can only get 2 FMAs > > > here, if we want to get 3 FMAs, we need to keep the chain and not > > > break it. So I added a param to control chain length > > > "param_reassoc_max_chain_length_with_fma =3D 4" (For the small case i= n > > > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.) > > > ------------------------------------------------------- > > > _1 =3D a_6(D) * b_7(D); > > > _2 =3D c_8(D) * d_9(D); > > > _4 =3D e_10(D) * f_11(D); > > > _15 =3D _4 + j_12(D); > > > _16 =3D _15 + _2; > > > _13 =3D _16 + _1; > > > ------------------------------------------------------- > > > _15 =3D .FMA (e_10(D), f_11(D), j_12(D)); > > > _16 =3D .FMA (c_8(D), d_9(D), _15); > > > _13 =3D .FMA (a_6(D), b_7(D), _16); > > > ------------------------------------------------------- > > > In some case we want to break the chain with width, we can set > > "param_reassoc_max_chain_length_with_fma =3D 2", it will rearrange ops = and > > break the chain with width. > > > > ... it sounds like the problem could be fully addressed by sorting the = chain > > with reassoc-width in mind? > > Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vec= tor of > > mul and a vector of non-mul ops so it can pick from the optimal candida= te? > > > > That said, I think rewrite_expr_tree_parallel_for_fma at least needs mo= re > > comments. > > > Sorry for not writing note clearly enough, I'll add more. > I have two places that need to be clarified. > > 1. For some case we need to keep chain to generate more FMAs, because bre= ak chain will lose FMA. > for example g + a * b + c * d + e * f, > Keep chain can get 3 FMAs, break chain can get 2 FMAs. It's hard to sa= y which one is better, so we provide a param for users to customize. > > 2. when the chain has FMAs and need to break the chain with width, > for example l + a * b + c * d + e * f + g * h + j * k;(we already put non= -mul first) > rewrite_expr_tree_parallel : > when width =3D 2, it will break the chain like this. actually it break th= e chain in to 3. It ignores the width and adds all ops two by two. it will = lose FMA. > > ssa1 =3D l + a * b; > ssa2 =3D c * d + e * f; > ssa3 =3D g * h + j * k; > ssa4 =3D ssa1 + ssa2; > ssa5 =3D ssa4 + ssa3; > > rewrite_expr_tree_parallel_for_fma > when width =3D 2, we break the chain into two like this. > > ssa1 =3D l + a * b; > ssa2 =3D c * d + e * f; > ssa3 =3D ssa1 + g * h; > ssa4 =3D ssa2 + j * k; > ssa5 =3D ssa3 +ssa4; > > I think it's okay to remove or keep rewrite_expr_tree_parallel_for_fma. M= ore FMAs are generated only for some special cases. > I'm not sure whether the new method is better than the old one. I created= a small case the execution time of the two sequences is almost the same on= SPR and ICX. I think to make a difference you need to hit the number of parallel fadd/fmul the pipeline can perform. I don't think issue width is ever a problem for chains w/o fma and throughput of fma vs fadd + fmul should be similar. That said, I think iff then we should try to improve rewrite_expr_tree_parallel rather than adding a new function. For example for the case with equal rank operands we can try to sort adds first. I can't convince myself that rewrite_expr_tree_parallel honors ranks properly quickly. Richard. > Thanks, > Lili. >