From: Richard Biener
Date: Mon, 22 May 2023 15:15:04 +0200
Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
To: "Cui, Lili"
Cc: "gcc-patches@gcc.gnu.org"
References: <20230511101201.2052667-1-lili.cui@intel.com>

On Wed, May 17, 2023 at 3:05 PM Cui, Lili wrote:
>
> > I think to make a difference you need to hit the number of parallel fadd/fmul
> > the pipeline can perform.  I don't think issue width is ever a problem for
> > chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> >
>
> Yes, for x86 backend, fadd, fmul and fma have the same TP meaning they should have the same width.
> The current implementation is reasonable /* reassoc int, fp, vec_int, vec_fp.  */.
>
> > That said, I think iff then we should try to improve
> > rewrite_expr_tree_parallel rather than adding a new function.  For example
> > for the case with equal rank operands we can try to sort adds first.  I can't
> > convince myself that rewrite_expr_tree_parallel honors ranks properly
> > quickly.
> >
>
> I rewrote this patch; there are two main changes:
> 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by this patch is better.
> 2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more FMAs and reduces the loss of FMAs when breaking the chain.
>
> With these two changes, GCC can break the chain with width = 2 and generate 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.
>
> ----------------------------------------------------------------------
> Source code: g + h + j + s + m + n + a + b + e (https://godbolt.org/z/G8sb86n84)
> Compile options: -Ofast -mfpmath=sse -mfma
> Width = 3 was chosen for reassociation
> ----------------------------------------------------------------------
> Old rewrite_expr_tree_parallel generates:
>   _6 = g_8(D) + h_9(D);     ------> parallel 0
>   _3 = s_11(D) + m_12(D);   ------> parallel 1
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 2
>   _1 = b_15(D) + e_16(D);   ------> parallel 3; this is not necessary, and it is not friendly to FMA.
>   _4 = _1 + _2;
>   _7 = _4 + _5;
>   _17 = _6 + _7;
>   return _17;
>
> When the width = 3, we need 5 cycles here.
> ------------------------------first end-------------------------------
> Rewriting the old rewrite_expr_tree_parallel (3 sets in parallel) generates:
>
>   _3 = s_11(D) + m_12(D);   ------> parallel 0
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 1
>   _1 = b_15(D) + e_16(D);   ------> parallel 2
>   _4 = _1 + _2;
>   _6 = _4 + _5;
>   _7 = _6 + h_9(D);
>   _17 = _7 + g_8(D);
>   return _17;
>
> When the width = 3, we need 5 cycles here.
> ------------------------------second end------------------------------
> Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:
>
>   _3 = s_11(D) + m_12(D);
>   _6 = _3 + g_8(D);
>   _2 = n_13(D) + a_14(D);
>   _5 = _2 + h_9(D);
>   _1 = b_15(D) + e_16(D);
>   _4 = _1 + j_10(D);
>   _7 = _4 + _5;
>   _17 = _7 + _6;
>   return _17;
>
> When the width = 3, we need 4 cycles here.
> ------------------------------third end-------------------------------

Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel
is optimal - you show that for the specific example
rewrite_expr_tree_parallel_for_fma is better.  I was arguing we want
a single function, whether we single out leaves with multiplications
or not.

And we want documentation that shows the strategy will result
in optimal latency (I think we should not sacrifice latency just for the
sake of forming more FMAs).

Richard.

>
> Thanks,
> Lili.
>
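
The cycle counts in the quoted example can be checked with a small scheduling model. The sketch below is an illustration only, not GCC code. It assumes that every addition takes one cycle, that at most `width` additions can start per cycle, and that ready statements issue in program order. The function names `parse`, `cycles` and `depth` are hypothetical, and the SSA names are shortened (g_8(D) becomes g).

```python
# Toy model (assumptions, not GCC's cost model): unit latency per addition,
# at most `width` additions start per cycle, ready statements issue in
# program order.

def parse(seq):
    """Turn 'dst = a + b' lines into (dst, (a, b)) pairs."""
    stmts = []
    for line in seq.strip().splitlines():
        dst, rhs = line.split("=")
        stmts.append((dst.strip(), tuple(op.strip() for op in rhs.split("+"))))
    return stmts

def cycles(seq, width):
    """Number of cycles needed to issue every statement under the toy model."""
    stmts = parse(seq)
    defined = {dst for dst, _ in stmts}
    done = {}   # temporary -> cycle in which it was computed
    cycle = 0
    while len(done) < len(stmts):
        cycle += 1
        issued = 0
        for dst, ops in stmts:
            if dst in done or issued == width:
                continue
            # An operand is ready if it is a leaf or was computed in an earlier cycle.
            if all(op not in defined or done.get(op, cycle) < cycle for op in ops):
                done[dst] = cycle
                issued += 1
    return cycle

def depth(seq):
    """Length of the longest dependency chain, ignoring any issue limit."""
    d = {}
    for dst, ops in parse(seq):
        d[dst] = 1 + max(d.get(op, 0) for op in ops)
    return max(d.values())

OLD = """
_6 = g + h
_3 = s + m
_5 = _3 + j
_2 = n + a
_1 = b + e
_4 = _1 + _2
_7 = _4 + _5
_17 = _6 + _7
"""

REWRITTEN = """
_3 = s + m
_5 = _3 + j
_2 = n + a
_1 = b + e
_4 = _1 + _2
_6 = _4 + _5
_7 = _6 + h
_17 = _7 + g
"""

FOR_FMA = """
_3 = s + m
_6 = _3 + g
_2 = n + a
_5 = _2 + h
_1 = b + e
_4 = _1 + j
_7 = _4 + _5
_17 = _7 + _6
"""

for name, seq in [("old", OLD), ("rewritten", REWRITTEN), ("for_fma", FOR_FMA)]:
    print(name, "cycles:", cycles(seq, 3), "depth:", depth(seq))
# old cycles: 5 depth: 4
# rewritten cycles: 5 depth: 5
# for_fma cycles: 4 depth: 4
```

Under these assumptions the model gives 5, 5 and 4 cycles for width 3, matching the figures in the quoted example. The dependency depth of the first sequence is 4, so its fifth cycle comes from the issue limit rather than from the critical path, and a different issue order could change that count.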