On Tue, 28 Feb 2023, Richard Sandiford wrote:

> Tamar Christina writes:
> >> -----Original Message-----
> >> From: Richard Sandiford
> >> Sent: Tuesday, February 28, 2023 11:09 AM
> >> To: Tamar Christina
> >> Cc: Tamar Christina via Gcc-patches ; nd ; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford
> >> >> Sent: Monday, February 27, 2023 9:33 PM
> >> >> To: Tamar Christina via Gcc-patches
> >> >> Cc: Tamar Christina ; nd ; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina via Gcc-patches writes:
> >> >> >> -----Original Message-----
> >> >> >> From: Richard Sandiford
> >> >> >> Sent: Monday, February 27, 2023 12:12 PM
> >> >> >> To: Tamar Christina
> >> >> >> Cc: Tamar Christina via Gcc-patches ; nd ; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
> >> >> >>
> >> >> >> Tamar Christina writes:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> >> > I avoided open coding it with add and shift because it creates
> >> >> >> >> > a 4-instruction dependency chain (with shifts, which are
> >> >> >> >> > typically slow) instead of a load and multiply.  This change,
> >> >> >> >> > unless the target is known to optimize it further, is unlikely
> >> >> >> >> > to be beneficial.  And by the time we get to costing, the only
> >> >> >> >> > alternative is to undo the existing pattern and so you lose
> >> >> >> >> > the general shift optimization.
> >> >> >> >> >
> >> >> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> >> >> > out of the vectorizer would be degenerate for most targets, or
> >> >> >> >> > one needs the more complicated route of costing during pattern
> >> >> >> >> > matching already.
> >> >> >> >>
> >> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> >> >> >> something that should be exposed through optabs.  And I imagine
> >> >> >> >> the open-coded version would still be better than nothing on
> >> >> >> >> targets without highpart multiply.
> >> >> >> >>
> >> >> >> >> So how about replacing the hook with one that simply asks whether
> >> >> >> >> division through highpart multiplication is preferred over the
> >> >> >> >> add/shift sequence?  (Unfortunately it's not going to be possible
> >> >> >> >> to work that out from existing information.)
> >> >> >> >
> >> >> >> > So this doesn't work for SVE.  For SVE the multiplication widening
> >> >> >> > pass introduces FMAs at gimple level.  So in the cases where the
> >> >> >> > operation is fed from a widening multiplication we end up
> >> >> >> > generating FMA.  If that was it I could have matched FMA.
> >> >> >> >
> >> >> >> > But it also pushes the multiplication into the second operand
> >> >> >> > because it no longer has a mul to share the results with.
> >> >> >> >
> >> >> >> > In any case, the gimple code is transformed into
> >> >> >> >
> >> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> >> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
> >> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
> >> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
> >> >> >> >
> >> >> >> > This transformation is much worse than the original code: it
> >> >> >> > extends the dependency chain with another expensive instruction.
> >> >> >> > I can try to correct this in RTL by matching FMA and shift and
> >> >> >> > splitting into MUL + ADDHNB, and hope CSE takes care of the
> >> >> >> > extra mul.
> >> >> >> >
> >> >> >> > But this seems like a hack, and it's basically undoing the earlier
> >> >> >> > transformation.  It seems to me that the open coding is a bad idea.
> >> >> >>
> >> >> >> Could you post the patch that gives this result?  I'll have a poke
> >> >> >> around.
> >> >> >
> >> >> > Sure, I'll post the new series, it needs all of them.
> >> >>
> >> >> Thanks.  Which testcase did you use to get the above?
> >> >>
> >> >
> >> > #include <stdint.h>
> >> >
> >> > #define N 16
> >> > #define TYPE uint8_t
> >> >
> >> > void fun3(TYPE* restrict pixel, TYPE level, int n)
> >> > {
> >> >   for (int i = 0; i < (n & -16); i+=1)
> >> >     pixel[i] = (pixel[i] * level) / 0xff;
> >> > }
> >>
> >> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in
> >> its own right though?  It's duplicating a multiplication into two points
> >> on a dependency chain.
> >
> > Most definitely, that's what I meant above.  The "optimization" doesn't
> > take into account the effect on the rest of the chain.
> >
> >> E.g. for:
> >>
> >> unsigned int
> >> f1 (unsigned int a, unsigned int b, unsigned int c)
> >> {
> >>   unsigned int d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> unsigned int
> >> g1 (unsigned int a, unsigned int b, unsigned int c)
> >> {
> >>   return a * b + c;
> >> }
> >>
> >> __Uint32x4_t
> >> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
> >> {
> >>   __Uint32x4_t d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> __Uint32x4_t
> >> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
> >> {
> >>   return a * b + c;
> >> }
> >>
> >> typedef unsigned int vec __attribute__((vector_size(32)));
> >> vec
> >> f3 (vec a, vec b, vec c)
> >> {
> >>   vec d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> vec
> >> g3 (vec a, vec b, vec c)
> >> {
> >>   return a * b + c;
> >> }
> >>
> >> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
> >> functions use multiply-add (as expected), but the f functions are:
> >>
> >> f1:
> >>         mul     w1, w0, w1
> >>         add     w0, w1, w2
> >>         add     w0, w1, w0, lsr 1
> >>         ret
> >>
> >> f2:
> >>         mul     v0.4s, v0.4s, v1.4s
> >>         add     v2.4s, v0.4s, v2.4s
> >>         usra    v0.4s, v2.4s, 1
> >>         ret
> >>
> >> f3:
> >>         ...
> >>         mla     z0.s, p0/m, z1.s, z2.s
> >>         lsr     z0.s, z0.s, #1
> >>         mad     z1.s, p0/m, z2.s, z0.s
> >>         ...
> >>
> >> What we do for f3 doesn't seem like a good idea.
> >
> > Agreed, I guess this means I have to fix that as well? ☹
> >
> > I'll go take a look then..
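
(An aside, not part of the exchange above: the two scalar lowerings of the
division by 0xff that the thread keeps contrasting, the four-instruction
add/shift chain versus a multiply-highpart form, can be sanity-checked in
isolation.  The sketch below is illustrative only; the helper names are
invented here, and 0x8081 is just one valid reciprocal constant for a
16-bit dividend, not necessarily what GCC emits.)

  /* Illustrative comparison of two lowerings of x / 255 for the 16-bit
     products that fun3 above can produce (pixel * level, both uint8_t).  */
  #include <stdint.h>
  #include <stdio.h>

  /* Add/shift form: x / 255 == (x + 1 + (x >> 8)) >> 8 for x <= 0xfffe.  */
  static uint8_t
  div255_addshift (uint16_t x)
  {
    return (uint8_t) ((x + 1 + (x >> 8)) >> 8);
  }

  /* Multiply-highpart form: scale by a precomputed reciprocal constant and
     keep the high bits, as a division-by-invariant transform would.  */
  static uint8_t
  div255_mulhi (uint16_t x)
  {
    return (uint8_t) (((uint32_t) x * 0x8081) >> 23);
  }

  int
  main (void)
  {
    for (uint32_t a = 0; a < 256; a++)
      for (uint32_t b = 0; b < 256; b++)
        {
          uint16_t x = (uint16_t) (a * b);
          uint8_t ref = (uint8_t) (x / 255);
          if (div255_addshift (x) != ref || div255_mulhi (x) != ref)
            {
              printf ("mismatch at %u\n", (unsigned) x);
              return 1;
            }
        }
    printf ("both lowerings agree with x / 255 for all 8-bit products\n");
    return 0;
  }

Correctness of either form is not in question; the point of contention is
latency shape: the add/shift version is the dependent chain of adds and
shifts Tamar describes at the top, while the multiply-based version (or,
on SVE2, the MUL + ADDHNB split she mentions) is what the patch aims to
keep available.
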
>
> How about something like this, before the main loop in
> convert_mult_to_fma:
>
>   /* There is no numerical difference between fused and unfused integer FMAs,
>      and the assumption below that FMA is as cheap as addition is unlikely
>      to be true, especially if the multiplication occurs multiple times on
>      the same chain.  E.g., for something like:
>
>        (((a * b) + c) >> 1) + (a * b)
>
>      we do not want to duplicate the a * b into two additions, not least
>      because the result is not a natural FMA chain.  */
>   if (ANY_INTEGRAL_TYPE_P (type)
>       && !has_single_use (mul_result))
>     return false;
>
> ?  Richi, would that be OK with you?

Yes, I think that would be OK.  I would assume that integer FMA is as
cheap as multiplication (also for FP), but then the question is how CPUs
implement integer FMA, if they split into two uops or not.

Richard.
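
(Also as an aside: a quick way to see what the proposed check changes, in
the spirit of Richard's f3/g3 comparison above, is to compile something
like the following with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve
before and after the change.  The function names are invented here and
this is only a sketch, not a testcase from the patch.)

  /* Same shape as the example in the proposed comment, written with GCC
     generic vectors so that the widening-mul pass sees integer FMA
     candidates.  */
  typedef unsigned int vec __attribute__ ((vector_size (32)));

  vec
  multi_use (vec a, vec b, vec c)
  {
    vec d = a * b;               /* two uses of d: the new check should bail out  */
    return d + ((c + d) >> 1);   /* (((a * b) + c) >> 1) + (a * b)  */
  }

  vec
  single_use (vec a, vec b, vec c)
  {
    return a * b + c;            /* single use: FMA formation is still wanted  */
  }

With the check in place, multi_use should keep a single multiply feeding
both additions instead of the duplicated mla/mad pair shown for f3 above,
while single_use should still become a fused multiply-add.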