From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft,
 Kyrylo Tkachov
Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
Date: Mon, 14 Jun 2021 16:54:35 +0100
In-Reply-To: Tamar Christina's message of Mon, 14 Jun 2021 15:05:38 +0000

Tamar Christina writes:
> Hi Richard,
>
>> Tamar Christina writes:
>> > Hi All,
>> >
>> > This RFC is trying to address the following inefficiency when
>> > vectorizing conditional statements with SVE.
>> >
>> > Consider the case
>> >
>> > void f10(double * restrict z, double * restrict w, double * restrict x,
>> >          double * restrict y, int n)
>> > {
>> >     for (int i = 0; i < n; i++) {
>> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>> >     }
>> > }
>> >
>> > for which we currently generate at -O3:
>> >
>> > f10:
>> >         cmp     w4, 0
>> >         ble     .L1
>> >         mov     x5, 0
>> >         whilelo p1.d, wzr, w4
>> >         ptrue   p3.b, all
>> > .L3:
>> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>> >         fcmgt   p2.d, p1/z, z1.d, #0.0
>> >         fcmgt   p0.d, p3/z, z1.d, #0.0
>> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>> >         bic     p0.b, p3/z, p1.b, p0.b
>> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>> >         fsub    z0.d, p0/m, z0.d, z1.d
>> >         movprfx z0.d, p2/m, z1.d
>> >         fadd    z0.d, p2/m, z0.d, z2.d
>> >         st1d    z0.d, p1, [x0, x5, lsl 3]
>> >         incd    x5
>> >         whilelo p1.d, w5, w4
>> >         b.any   .L3
>> > .L1:
>> >         ret
>> >
>> > Notice that the condition for the else branch duplicates the same
>> > predicate as the then branch and then uses BIC to negate the result.
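
(Why the BIC is redundant: under the loop mask, negating the already-masked
predicate activates exactly the same lanes as masking the negated comparison,
since per lane loop_mask & ~cmp == loop_mask & ~(loop_mask & cmp).  A minimal
standalone C check of that identity, separate from the patch under
discussion and included only to make the reasoning concrete:)

  /* Standalone check (not part of the patch) of the per-lane identity
     the rewrite discussed below relies on:

         loop_mask & ~cmp  ==  loop_mask & ~(loop_mask & cmp)

     i.e. negating the already-masked predicate selects the same active
     lanes as masking the negated comparison.  */
  #include <assert.h>

  int main (void)
  {
    for (int cmp = 0; cmp <= 1; cmp++)
      for (int mask = 0; mask <= 1; mask++)
        {
          int negate_cmp = mask & ~cmp;             /* mask the negated compare */
          int negate_masked = mask & ~(mask & cmp); /* negate the masked compare */
          assert (negate_cmp == negate_masked);
        }
    return 0;
  }
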
>> > The reason for this is that during instruction generation in the
>> > vectorizer we emit
>> >
>> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
>> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
>> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
>> >   mask__43.16_73 = ~mask__41.11_66;
>> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
>> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
>> >
>> > which ultimately gets optimized to
>> >
>> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
>> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
>> >   mask__43.16_73 = ~mask__41.11_66;
>> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
>> >
>> > Notice how the negation is on the operation and not on the predicate
>> > resulting from the operation.  When this is expanded it turns into
>> > RTL where the negation is on the compare directly.  This means the
>> > RTL is different from the one without the negation, and so CSE is
>> > unable to recognize that they are essentially the same operation.
>> >
>> > To fix this, my patch changes it so that you negate the mask rather
>> > than the operation:
>> >
>> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
>> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
>> >   vec_mask_op_67 = ~vec_mask_and_58;
>> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
>>
>> But to me this looks like a pessimisation in gimple terms.  We've
>> increased the length of the critical path: vec_mask_and_65 now needs a
>> chain of 4 operations instead of 3.
>
> True, but it should reduce the number of RTL patterns.  I would have
> thought RTL is more expensive to handle than gimple.

I think this is only a fair gimple optimisation if gimple does the isel
itself (to a predicated compare and a predicated NOT).

>> We also need to be careful not to pessimise the case in which the
>> comparison is an integer one.  At the moment we'll generate opposed
>> conditions, which is the intended behaviour:
>
> This still happens with this patch at `-Ofast` because that flips the
> conditions, so the different representation doesn't harm it.

OK, that's good.

>> .L3:
>>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>>         cmpgt   p2.d, p0/z, z1.d, #0
>>         movprfx z2, z1
>>         scvtf   z2.d, p3/m, z1.d
>>         cmple   p1.d, p0/z, z1.d, #0
>>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
>>         fadd    z0.d, p2/m, z0.d, z2.d
>>         movprfx z0.d, p1/m, z1.d
>>         fsub    z0.d, p1/m, z0.d, z2.d
>>         st1d    z0.d, p0, [x0, x5, lsl 3]
>>         add     x5, x5, x6
>>         whilelo p0.d, w5, w4
>>         b.any   .L3
>>
>> Could we handle the fcmp case using a 3->2 define_split instead:
>> convert
>>
>>   (set res (and (not (fcmp X Y)) Z)) ->
>>   (set res (fcmp X Y))
>>   (set res (and (not res) Z))
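
(To make the suggested split concrete, a schematic of the shape it could
take in machine-description form, using the same shorthand as the message
above.  This is a sketch only: the real aarch64-sve.md comparison patterns
wrap the compare in a governing-predicate unspec, and the operand
predicates and modes here are placeholders, so the actual pattern would
look different:)

  ;; Schematic sketch only: `fcmp' stands in for the real predicated SVE
  ;; comparison rtl, and the operand predicates/modes are placeholders.
  (define_split
    [(set (match_operand 0 "register_operand")
          (and (not (fcmp (match_operand 1 "register_operand")
                          (match_operand 2 "register_operand")))
               (match_operand 3 "register_operand")))]
    ""
    [;; First recompute the comparison into the destination...
     (set (match_dup 0) (fcmp (match_dup 1) (match_dup 2)))
     ;; ...then invert it under the mask, so the negation becomes a cheap
     ;; predicate operation on a register (a BIC) rather than a second
     ;; compare.
     (set (match_dup 0) (and (not (match_dup 0)) (match_dup 3)))])
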
> This was the other approach I mentioned.  It works, and gives you the
> negation, but only in the case where the compare is single use.

But in the original example we duplicate the comparison through a 2->2
combine, which leaves the original comparison as a single use.  Isn't
that enough?

> e.g. in
>
> void f11(double * restrict z, double * restrict w, double * restrict x,
>          double * restrict y, int n)
> {
>     for (int i = 0; i < n; i++) {
>         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
>     }
> }
>
> you have some of the same problem.  It generates
>
>         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
>         fcmgt   p2.d, p3/z, z0.d, #0.0
>         bic     p1.b, p3/z, p0.b, p2.b
>         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
>         fsub    z1.d, p1/m, z1.d, z0.d
>         sel     z0.d, p2, z0.d, z1.d
>         st1d    z0.d, p0, [x0, x2, lsl 3]
>         incd    x2
>         whilelo p0.d, w2, w4
>
> which has two problems.  First, the fcmgt doesn't need to be predicated
> on p3 (which is ptrue all); it can and should be predicated on p0.
>
> Second, with that fixed, the splitter won't match, because p2 is still
> needed by the sel: the compare is not single use, so combine won't try
> to build the RTL that the split would act on.

I think having the vectoriser avoid the dual use between the
IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is
the only pass that has the context to prove that including the loop
mask in the VEC_COND_EXPR condition is correct.

We already try to do that to some extent:

  /* See whether another part of the vectorized code applies a loop
     mask to the condition, or to its inverse.  */

but it would need extending to handle this case.

Thanks,
Richard
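
(To make that suggested extension concrete, a rough standalone C sketch of
the bookkeeping it would need.  Everything here, the types, names and
comparison codes, is invented for illustration and is not GCC's actual
implementation:)

  /* Invented illustration, not GCC code.  The idea described above: the
     vectoriser records each scalar condition for which it has already
     built a loop-masked predicate.  When a later statement (e.g. a
     VEC_COND_EXPR) needs the same condition or its inverse, it reuses
     the recorded masked predicate (negating it under the loop mask in
     the inverse case) instead of emitting a fresh unmasked comparison,
     which is what leaves the duplicate fcmgt/BIC above for CSE.  */

  #include <stddef.h>

  enum cmp_code { CMP_GT, CMP_LE };     /* just enough codes for f11 */

  struct masked_cond
  {
    enum cmp_code code;                 /* scalar comparison code */
    const void *op0, *op1;              /* scalar operands, compared by identity */
    void *masked_pred;                  /* the `loop_mask & cmp' predicate */
  };

  static enum cmp_code
  invert_code (enum cmp_code code)
  {
    /* Placeholder inversion; the real thing must honour NaNs.  */
    return code == CMP_GT ? CMP_LE : CMP_GT;
  }

  /* Return an existing masked predicate for (CODE, OP0, OP1), setting
     *INVERTED when the match was against the inverse condition, in
     which case the caller negates the result under the loop mask.
     Returns NULL if neither form has been vectorized yet.  */
  static void *
  find_masked_cond (const struct masked_cond *set, size_t n,
                    enum cmp_code code, const void *op0, const void *op1,
                    int *inverted)
  {
    for (size_t i = 0; i < n; i++)
      if (set[i].op0 == op0 && set[i].op1 == op1)
        {
          if (set[i].code == code)
            { *inverted = 0; return set[i].masked_pred; }
          if (set[i].code == invert_code (code))
            { *inverted = 1; return set[i].masked_pred; }
        }
    return NULL;
  }
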