From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina
Cc: Tamar Christina via Gcc-patches, nd, rguenther@suse.de, jlaw@ventanamicro.com
Subject: Re: [PATCH 1/2] middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
Date: Tue, 28 Feb 2023 11:08:44 +0000
In-Reply-To: (Tamar Christina's message of "Mon, 27 Feb 2023 22:10:38 +0000")

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Monday, February 27, 2023 9:33 PM
>> To: Tamar Christina via Gcc-patches
>> Cc: Tamar Christina; nd; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2] middle-end: Fix wrong overmatching of
>> div-bitmask by using new optabs [PR108583]
>>
>> Tamar Christina via Gcc-patches writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford
>> >> Sent: Monday, February 27, 2023 12:12 PM
>> >> To: Tamar Christina
>> >> Cc: Tamar Christina via Gcc-patches; nd; rguenther@suse.de;
>> >> jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2] middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina writes:
>> >> > Hi,
>> >> >
>> >> >> > I avoided open coding it with add and shift because it creates a
>> >> >> > 4-instruction dependency chain (with shifts, which are typically
>> >> >> > slow) instead of a load and multiply.  This change, unless the
>> >> >> > target is known to optimize it further, is unlikely to be
>> >> >> > beneficial.  And by the time we get to costing, the only
>> >> >> > alternative is to undo the existing pattern, and so you lose the
>> >> >> > general shift optimization.
>> >> >> >
>> >> >> > So it seemed unwise to open code as shifts, given the codegen
>> >> >> > out of the vectorizer would be degenerate for most targets, or
>> >> >> > one needs the more complicated route of costing during pattern
>> >> >> > matching already.
>> >> >>
>> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
>> >> >> something that should be exposed through optabs.  And I imagine
>> >> >> the open-coded version would still be better than nothing on
>> >> >> targets without highpart multiply.
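To make the trade-off above concrete, here is a minimal scalar sketch of
the two styles of lowering for unsigned division by 0xff.  It is purely
illustrative: the function names are made up, and 0x8081 is just the usual
scalar reciprocal constant, not necessarily what the vectorizer would emit.

  #include <stdint.h>

  /* Highpart-multiply style: one multiply plus one shift.  */
  static inline uint8_t
  div255_mulh (uint16_t x)
  {
    return (uint8_t) (((uint32_t) x * 0x8081u) >> 23);
  }

  /* Open-coded add/shift style: a longer dependency chain of adds and
     shifts.  The identity holds for any x that is the product of two
     8-bit values.  */
  static inline uint8_t
  div255_addshift (uint16_t x)
  {
    return (uint8_t) ((x + 1u + (x >> 8)) >> 8);
  }
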
>> >> >> So how about replacing the hook with one that simply asks whether
>> >> >> division through highpart multiplication is preferred over the
>> >> >> add/shift sequence?  (Unfortunately it's not going to be possible
>> >> >> to work that out from existing information.)
>> >> >
>> >> > So this doesn't work for SVE.  For SVE the multiplication widening
>> >> > pass introduces FMAs at gimple level.  So in the cases where the
>> >> > operation is fed from a widening multiplication we end up
>> >> > generating FMA.  If that was it I could have matched FMA.
>> >> >
>> >> > But it also pushes the multiplication in the second operand because
>> >> > it no longer has a mul to share the results with.
>> >> >
>> >> > In any case, the gimple code is transformed into
>> >> >
>> >> >   vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> >> >   vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
>> >> >   vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
>> >> >   vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> >> >   vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
>> >> >   vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> >> >   vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
>> >> >
>> >> > This transformation is much worse than the original code; it
>> >> > extended the dependency chain with another expensive instruction.
>> >> > I can try to correct this in RTL by matching FMA and shift and
>> >> > splitting into MUL + ADDHNB and hope CSE takes care of the extra mul.
>> >> >
>> >> > But this seems like a hack, and it's basically undoing the earlier
>> >> > transformation.  It seems to me that the open coding is a bad idea.
>> >>
>> >> Could you post the patch that gives this result?  I'll have a poke
>> >> around.
>> >
>> > Sure, I'll post the new series, it needs all of them.
>>
>> Thanks.  Which testcase did you use to get the above?
>>
>
> #include <stdint.h>
>
> #define N 16
> #define TYPE uint8_t
>
> void fun3(TYPE* restrict pixel, TYPE level, int n)
> {
>   for (int i = 0; i < (n & -16); i+=1)
>     pixel[i] = (pixel[i] * level) / 0xff;
> }

Thanks.  In that testcase, isn't the FMA handling an anti-optimisation
in its own right though?  It's duplicating a multiplication into two
points on a dependency chain.  E.g. for:

unsigned int
f1 (unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int d = a * b;
  return d + ((c + d) >> 1);
}

unsigned int
g1 (unsigned int a, unsigned int b, unsigned int c)
{
  return a * b + c;
}

__Uint32x4_t
f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  __Uint32x4_t d = a * b;
  return d + ((c + d) >> 1);
}

__Uint32x4_t
g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  return a * b + c;
}

typedef unsigned int vec __attribute__((vector_size(32)));

vec
f3 (vec a, vec b, vec c)
{
  vec d = a * b;
  return d + ((c + d) >> 1);
}

vec
g3 (vec a, vec b, vec c)
{
  return a * b + c;
}

compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the
g functions use multiply-add (as expected), but the f functions are:

f1:
        mul     w1, w0, w1
        add     w0, w1, w2
        add     w0, w1, w0, lsr 1
        ret

f2:
        mul     v0.4s, v0.4s, v1.4s
        add     v2.4s, v0.4s, v2.4s
        usra    v0.4s, v2.4s, 1
        ret

f3:
        ...
        mla     z0.s, p0/m, z1.s, z2.s
        lsr     z0.s, z0.s, #1
        mad     z1.s, p0/m, z2.s, z0.s
        ...

What we do for f3 doesn't seem like a good idea.  I can see that
duplicating an integer multiplication might make sense if the integer
FMAs are done in parallel.  But if one is a dependency of the other,
then at least for integer FMA, I think we should punt, especially since
we don't know what the target's late-forwarding restrictions are.
I guess fp-contract comes into play for the FP FMAs though.
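The shape of the rewrite being questioned can be sketched in scalar C that
mirrors f1 above.  This is only an illustration of the transform, with
made-up function names, not code taken from the compiler.

  #include <stdint.h>

  /* Before FMA formation: one multiply whose result feeds both uses.  */
  uint32_t
  before_fma (uint32_t a, uint32_t b, uint32_t c)
  {
    uint32_t d = a * b;
    uint32_t t = (c + d) >> 1;
    return d + t;
  }

  /* After FMA formation: the multiply is duplicated into two
     multiply-adds, and the second has to wait for the first.  */
  uint32_t
  after_fma (uint32_t a, uint32_t b, uint32_t c)
  {
    uint32_t t1 = a * b + c;   /* FMA #1 */
    uint32_t t2 = t1 >> 1;
    return a * b + t2;         /* FMA #2, depends on FMA #1 */
  }

Both functions compute the same value, but the second performs the
multiplication twice and serializes the two multiply-adds, which is the
longer dependency chain being objected to.
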
>> But since SVE does have highpart multiply, and since the assumption for
>> SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just
>> be one of the targets for which the hook that "asks whether division
>> through highpart multiplication is preferred over the add/shift
>> sequence" returns true?
>>
>
> Yes (it's also two adds, not 3), but it's not correct for SVE2, which
> has ADDHNB, in which case 2x ADDHNB is much faster than MULH+shift.
> And the problem is that widening_mul will not allow add+shift to reach
> the backend, because the add+shift were open coded.
>
> They are now subjected to further optimization.
>
> To summarize:
>
> Other targets: false
> SVE:           false
> SVE2:          true
> NEON:          true

Yeah, looks good.

> SVE2 is borked because MUL+ADD+SHIFT -> FMA+SHIFT.
>
> If you're saying you don't want the optimization for SVE2, then sure,
> happy to turn it off.
>
> But UMULH+LSR is 6 cycles on Neoverse-N2 with a throughput of 1.
> 2x ADDHNB is 4 cycles with a throughput of 2.

No, I meant the same as what you said in the summary above.

Richard