From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov
Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
Date: Wed, 30 Jun 2021 18:55:12 +0100
In-Reply-To: (Tamar Christina's message of "Wed, 30 Jun 2021 16:08:44 +0000")

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Monday, June 14, 2021 4:55 PM
>> To: Tamar Christina
>> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw; Marcus Shawcroft;
>> Kyrylo Tkachov
>> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on
>> inverted operands
>>
>> Tamar Christina writes:
>> > Hi Richard,
>> >> -----Original Message-----
>> >> From: Richard Sandiford
>> >> Sent: Monday, June 14, 2021 3:50 PM
>> >> To: Tamar Christina
>> >> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw; Marcus Shawcroft;
>> >> Kyrylo Tkachov
>> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks
>> >> on inverted operands
>> >>
>> >> Tamar Christina writes:
>> >> > Hi All,
>> >> >
>> >> > This RFC is trying to address the following inefficiency when
>> >> > vectorizing conditional statements with SVE.
>> >> >
>> >> > Consider the case
>> >> >
>> >> > void f10(double * restrict z, double * restrict w, double * restrict x,
>> >> >          double * restrict y, int n)
>> >> > {
>> >> >     for (int i = 0; i < n; i++) {
>> >> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>> >> >     }
>> >> > }
>> >> >
>> >> >
>> >> > For which we currently generate at -O3:
>> >> >
>> >> > f10:
>> >> >         cmp     w4, 0
>> >> >         ble     .L1
>> >> >         mov     x5, 0
>> >> >         whilelo p1.d, wzr, w4
>> >> >         ptrue   p3.b, all
>> >> > .L3:
>> >> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>> >> >         fcmgt   p2.d, p1/z, z1.d, #0.0
>> >> >         fcmgt   p0.d, p3/z, z1.d, #0.0
>> >> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>> >> >         bic     p0.b, p3/z, p1.b, p0.b
>> >> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>> >> >         fsub    z0.d, p0/m, z0.d, z1.d
>> >> >         movprfx z0.d, p2/m, z1.d
>> >> >         fadd    z0.d, p2/m, z0.d, z2.d
>> >> >         st1d    z0.d, p1, [x0, x5, lsl 3]
>> >> >         incd    x5
>> >> >         whilelo p1.d, w5, w4
>> >> >         b.any   .L3
>> >> > .L1:
>> >> >         ret
>> >> >
>> >> > Notice that the condition for the else branch duplicates the same
>> >> > predicate as the then branch and then uses BIC to negate the results.
>> >> >
>> >> > The reason for this is that during instruction generation in the
>> >> > vectorizer we emit
>> >> >
>> >> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
>> >> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
>> >> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
>> >> >   mask__43.16_73 = ~mask__41.11_66;
>> >> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
>> >> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
>> >> >
>> >> > which ultimately gets optimized to
>> >> >
>> >> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
>> >> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
>> >> >   mask__43.16_73 = ~mask__41.11_66;
>> >> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
>> >> >
>> >> > Notice how the negate is on the operation and not the predicate
>> >> > resulting from the operation.  When this is expanded this turns
>> >> > into RTL where the negate is on the compare directly.  This means
>> >> > the RTL is different from the one without the negate and so CSE is
>> >> > unable to
>> >> recognize that they are essentially same operation.
>> >> >
>> >> > To fix this my patch changes it so you negate the mask rather than
>> >> > the operation
>> >> >
>> >> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
>> >> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
>> >> >   vec_mask_op_67 = ~vec_mask_and_58;
>> >> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
>> >>
>> >> But to me this looks like a pessimisation in gimple terms.  We've
>> >> increased the length of the critical path: vec_mask_and_65 now needs
>> >> a chain of 4 operations instead of 3.
>> >
>> > True, but it should reduce the number of RTL patterns.  I would have
>> > thought RTL is more expensive to handle than gimple.
>>
>> I think this is only a fair gimple optimisation if gimple does the isel
>> itself (to a predicated compare and a predicated NOT).
>>
>> >> We also need to be careful not to pessimise the case in which the
>> >> comparison is an integer one.  At the moment we'll generate opposed
>> >> conditions, which is the intended behaviour:
>> >
>> > This still happens with this patch at `-Ofast` because that flips the
>> > conditions, So the different representation doesn't harm it.
>>
>> OK, that's good.
>>
>>
>> >> .L3:
>> >>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>> >>         cmpgt   p2.d, p0/z, z1.d, #0
>> >>         movprfx z2, z1
>> >>         scvtf   z2.d, p3/m, z1.d
>> >>         cmple   p1.d, p0/z, z1.d, #0
>> >>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>> >>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
>> >>         fadd    z0.d, p2/m, z0.d, z2.d
>> >>         movprfx z0.d, p1/m, z1.d
>> >>         fsub    z0.d, p1/m, z0.d, z2.d
>> >>         st1d    z0.d, p0, [x0, x5, lsl 3]
>> >>         add     x5, x5, x6
>> >>         whilelo p0.d, w5, w4
>> >>         b.any   .L3
>> >>
>> >> Could we handle the fcmp case using a 3->2 define_split instead:
>> >> convert
>> >>
>> >>   (set res (and (not (fcmp X Y)) Z)) ->
>> >>   (set res (fcmp X Y))
>> >>   (set res (and (not res) Z))
>> >>
>> >
>> > This was the other approach I mentioned.  It works, and gives you the neg,
>> but only in the case where the compare is single use.
>>
>> But in the original example we duplicate the comparison through a
>> 2->2 combine, which leaves the original comparison as a single use.
>> Isn't that enough?
>>
>> > e.g. in
>> >
>> > void f11(double * restrict z, double * restrict w, double * restrict
>> > x, double * restrict y, int n) {
>> >     for (int i = 0; i < n; i++) {
>> >         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
>> >     }
>> > }
>> >
>> > You have some of the same problem.  It generates
>> >
>> >         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
>> >         fcmgt   p2.d, p3/z, z0.d, #0.0
>> >         bic     p1.b, p3/z, p0.b, p2.b
>> >         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
>> >         fsub    z1.d, p1/m, z1.d, z0.d
>> >         sel     z0.d, p2, z0.d, z1.d
>> >         st1d    z0.d, p0, [x0, x2, lsl 3]
>> >         incd    x2
>> >         whilelo p0.d, w2, w4
>> >
>> > which has two problems.  fcmgt doesn't need to be predicated on p3
>> > which is ptrue all, it can/should be p0.
>> >
>> > With that fixed the splitter won't match because p2 is needed in the
>> > sel, so it's not single use and so combine won't try to build the RTL so it can
>> be split.
>>
>> I think having the vectoriser avoid the dual use between the
>> IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is
>> the only pass that has the context to prove that including the loop mask in
>> the VEC_COND_EXPR condition is correct.  We already try to do that to some
>> extent:
>>
>
> Sorry I have been looking at this these past couple of days and I just don't know how
> this is supposed to work.
>
> In the above example the problem is not just the use of p2 in the VEC_COND_EXPR.  If
> the VEC_COND_EXPR is changed to use p1 then p1 now has 3 uses which makes
> combine still not try the combination.
>
> But the general case
>
> void f10(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n)
> {
>     for (int i = 0; i < n; i++) {
>         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>     }
> }
>
> Produces
>
> f10:
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         whilelo p1.d, wzr, w4
>         ptrue   p3.b, all
>         .p2align 5,,15
> .L3:
>         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>         fcmgt   p2.d, p1/z, z1.d, #0.0
>         fcmgt   p0.d, p3/z, z1.d, #0.0
>         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>         bic     p0.b, p3/z, p1.b, p0.b
>         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>         fsub    z0.d, p0/m, z0.d, z1.d
>         movprfx z0.d, p2/m, z1.d
>         fadd    z0.d, p2/m, z0.d, z2.d
>         st1d    z0.d, p1, [x0, x5, lsl 3]
>         incd    x5
>         whilelo p1.d, w5, w4
>         b.any   .L3
>
> where the VEC_COND_EXPR has been elided.
>
> The problem is that the comparison for the inverse case is unpredicated.

Yeah, for the original f10 example I was suggesting that we use combine
instead of changing the vectoriser.  It turns out that the 3->2 split
thing I mentioned above won't work though, because we match the BIC first.
But I think something like:

(define_insn_and_split "*inverted_fcmgt"
  [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
        (and:<VPRED>
          (and:<VPRED>
            (not:<VPRED>
              (unspec:<VPRED>
                [(match_operand:<VPRED> 1 "register_operand" "Upl")
                 (const_int SVE_KNOWN_PTRUE)
                 (match_operand:SVE_FULL_F 2 "register_operand" "w")
                 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
                SVE_COND_FP_CMP_I0))
            (match_operand:<VPRED> 4 "register_operand" "w"))
          (match_dup 1)))]
  "TARGET_SVE && !reg_overlap_mentioned_p (operands[4], operands[0])"
  "#"
  "&& 1"
  [(set (match_dup 0)
        (unspec:<VPRED>
          [(match_dup 4)
           (const_int SVE_MAYBE_NOT_PTRUE)
           (match_dup 2)
           (match_dup 3)]
          SVE_COND_FP_CMP_I0))
   (set (match_dup 0)
        (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
)

is a legitimate optimisation in its own right because it gets rid of
the PTRUE operand (at split time).  This would be a good thing to do
wherever the FCMxx and BIC come from.

The snag is that we don't then CSE the comparisons between split1 and RA.
The post-RA optimisers might then leave a move.  E.g. for plain -O3 the
above pattern gives the expected:

        cmp     w4, 0
        ble     .L1
        mov     x5, 0
        cntd    x6
        whilelo p0.d, wzr, w4
        .p2align 3,,7
.L3:
        ld1d    z1.d, p0/z, [x1, x5, lsl 3]
        fcmgt   p2.d, p0/z, z1.d, #0.0
        not     p1.b, p0/z, p2.b
        ld1d    z2.d, p2/z, [x2, x5, lsl 3]
        ld1d    z0.d, p1/z, [x3, x5, lsl 3]
        fsub    z0.d, p1/m, z0.d, z1.d
        movprfx z0.d, p2/m, z1.d
        fadd    z0.d, p2/m, z0.d, z2.d
        st1d    z0.d, p0, [x0, x5, lsl 3]
        add     x5, x5, x6
        whilelo p0.d, w5, w4
        b.any   .L3
.L1:
        ret

but -O3 -fno-trapping-math has a predicate move:

        cmp     w4, 0
        ble     .L1
        mov     x5, 0
        cntd    x6
        whilelo p0.d, wzr, w4
        .p2align 3,,7
.L3:
        ld1d    z1.d, p0/z, [x1, x5, lsl 3]
        fcmgt   p1.d, p0/z, z1.d, #0.0
        mov     p2.b, p1.b
        not     p1.b, p0/z, p1.b
        ld1d    z0.d, p2/z, [x2, x5, lsl 3]
        ld1d    z2.d, p1/z, [x3, x5, lsl 3]
        fadd    z0.d, z1.d, z0.d
        movprfx z0.d, p1/m, z2.d
        fsub    z0.d, p1/m, z0.d, z1.d
        st1d    z0.d, p0, [x0, x5, lsl 3]
        add     x5, x5, x6
        whilelo p0.d, w5, w4
        b.any   .L3
.L1:
        ret

It would be good to fix this in RTL “somehow”.

But for f11 we still get:

f11:
.LFB1:
        .cfi_startproc
        cmp     w4, 0
        ble     .L6
        mov     x2, 0
        cntd    x5
        whilelo p1.d, wzr, w4
        ptrue   p3.b, all
        .p2align 3,,7
.L8:
        ld1d    z0.d, p1/z, [x1, x2, lsl 3]
        fcmgt   p0.d, p1/z, z0.d, #0.0
        fcmgt   p2.d, p3/z, z0.d, #0.0
        not     p0.b, p1/z, p0.b
        ld1d    z1.d, p0/z, [x3, x2, lsl 3]
        fsub    z1.d, p0/m, z1.d, z0.d
        sel     z0.d, p2, z0.d, z1.d
        st1d    z0.d, p1, [x0, x2, lsl 3]
        add     x2, x2, x5
        whilelo p1.d, w2, w4
        b.any   .L8
.L6:
        ret

This is where the VEC_COND_EXPR code I mentioned should come in.
At the moment, before generating:

  VEC_COND_EXPR<C, E1, E2>

we check whether there are any other operations predicated on
C & LOOP_MASK.  If so, we use:

  VEC_COND_EXPR<C & LOOP_MASK, E1, E2>

instead.  This avoids both C and C & LOOP_MASK being live at the same time.

The change I was suggesting was that we should also check whether there
are any operations predicated on ~C & LOOP_MASK.  If so, we should generate:

  VEC_COND_EXPR<~C & LOOP_MASK, E2, E1>

Again, the justification is that we don't have C and ~C & LOOP_MASK live
at the same time.  If C itself has the form ~C' then swapping the operands
also saves a negation.

> This code is fine as far as I can tell.  But there's nothing you can do here.  The mask it needs is ~original
> So it does not find an inverse mask to use because it has to honor floating point exceptions.
>
> And indeed `-fno-trapping-math` or `-Ofast` generates the most optimal sequence, but when honoring
> traps it can't re-use invert existing mask, which leaves the operation unpredicated.
>
> So is what you're requesting that it looks inside unary operators and tries to CSE the thing they're pointed to?
[Hopefully answered this above]

> In which case isn't it about the same as what I had before just that the vectorizer did the CSE itself?

IMO the VEC_COND_EXPR optimisation is more selective.  It only changes
the gimple if we know that it will have an effect (in terms of reducing
the number of SSA_NAMEs that have multiple uses).  Also, unlike the
original patch, there's no double-masking involved: ~C & LOOP_MASK will
only apply LOOP_MASK once.

> If that's the case maybe it's better to do lookups into loop_vinfo->scalar_cond_masked_set in prepare_load_store_mask?
> So that it just applies to everything?

scalar_cond_masked_set only exists to convert a mask that ignores
LOOP_MASK into one that is ANDed with LOOP_MASK.  For loads and stores
we always AND with LOOP_MASK, so a lookup isn't necessary.

Thanks,
Richard
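
As a side note on the predicate algebra the thread relies on (a standalone
sketch, not taken from the patch or from GCC): the BIC sequence computes
L & ~C from the unpredicated comparison, while the predicated-NOT sequence
computes L & ~(L & C) from the loop-masked comparison, and the two are the
same value for any loop mask L and comparison result C.  A minimal C check
of that identity, modelling predicates as 8-bit masks:

#include <assert.h>
#include <stdio.h>

int
main (void)
{
  /* l models the loop mask (governing predicate), c the raw comparison
     result.  bic_form mirrors "fcmgt under a ptrue + bic with the loop
     mask"; not_form mirrors "fcmgt under the loop mask + not under the
     loop mask".  */
  for (unsigned int l = 0; l < 256; ++l)
    for (unsigned int c = 0; c < 256; ++c)
      {
        unsigned int bic_form = l & ~c & 0xff;        /* bic p.b, ptrue/z, l.b, c.b */
        unsigned int not_form = l & ~(l & c) & 0xff;  /* not p.b, l/z, (l & c).b */
        assert (bic_form == not_form);
      }
  puts ("loop-masked negation forms agree");
  return 0;
}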