On Wed, 12 Apr 2023 at 14:29, Richard Sandiford wrote: > > Prathamesh Kulkarni writes: > > On Thu, 6 Apr 2023 at 16:05, Richard Sandiford > > wrote: > >> > >> Prathamesh Kulkarni writes: > >> > On Tue, 4 Apr 2023 at 23:35, Richard Sandiford > >> > wrote: > >> >> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > index cd9cace3c9b..3de79060619 100644 > >> >> > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc > >> >> > @@ -817,6 +817,62 @@ public: > >> >> > > >> >> > class svdupq_impl : public quiet > >> >> > { > >> >> > +private: > >> >> > + gimple * > >> >> > + fold_nonconst_dupq (gimple_folder &f, unsigned factor) const > >> >> > + { > >> >> > + /* Lower lhs = svdupq (arg0, arg1, ..., argN} into: > >> >> > + tmp = {arg0, arg1, ..., arg} > >> >> > + lhs = VEC_PERM_EXPR (tmp, tmp, {0, 1, 2, N-1, ...}) */ > >> >> > + > >> >> > + /* TODO: Revisit to handle factor by padding zeros. */ > >> >> > + if (factor > 1) > >> >> > + return NULL; > >> >> > >> >> Isn't the key thing here predicate vs. vector rather than factor == 1 vs. > >> >> factor != 1? Do we generate good code for b8, where factor should be 1? > >> > Hi, > >> > It generates the following code for svdup_n_b8: > >> > https://pastebin.com/ypYt590c > >> > >> Hmm, yeah, not pretty :-) But it's not pretty without either. > >> > >> > I suppose lowering to ctor+vec_perm_expr is not really useful > >> > for this case because it won't simplify ctor, unlike the above case of > >> > svdupq_s32 (x[0], x[1], x[2], x[3]); > >> > However I wonder if it's still a good idea to lower svdupq for predicates, for > >> > representing svdupq (or other intrinsics) using GIMPLE constructs as > >> > far as possible ? > >> > >> It's possible, but I think we'd need an example in which its a clear > >> benefit. > > Sorry I posted for wrong test case above. > > For the following test: > > svbool_t f(uint8x16_t x) > > { > > return svdupq_n_b8 (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], > > x[8], x[9], x[10], x[11], x[12], > > x[13], x[14], x[15]); > > } > > > > Code-gen: > > https://pastebin.com/maexgeJn > > > > I suppose it's equivalent to following ? > > > > svbool_t f2(uint8x16_t x) > > { > > svuint8_t tmp = svdupq_n_u8 ((bool) x[0], (bool) x[1], (bool) x[2], > > (bool) x[3], > > (bool) x[4], (bool) x[5], (bool) x[6], > > (bool) x[7], > > (bool) x[8], (bool) x[9], (bool) x[10], > > (bool) x[11], > > (bool) x[12], (bool) x[13], (bool) > > x[14], (bool) x[15]); > > return svcmpne_n_u8 (svptrue_b8 (), tmp, 0); > > } > > Yeah, this is essentially the transformation that the svdupq rtl > expander uses. It would probably be a good idea to do that in > gimple too. Hi, I tested the interleave+zip1 for vector init patch and it segfaulted during bootstrap while trying to build libgfortran/generated/matmul_i2.c. Rebuilding with --enable-checking=rtl showed out of bounds access in aarch64_unzip_vector_init in following hunk: + rtvec vec = rtvec_alloc (n / 2); + for (int i = 0; i < n; i++) + RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i) + : XVECEXP (vals, 0, 2 * i + 1); which is incorrect since it allocates n/2 but iterates and stores upto n. The attached patch fixes the issue, which passed bootstrap, however resulted in following fallout during testsuite run: 1] sve/acle/general/dupq_[1-4].c tests fail. For the following test: int32x4_t f(int32_t x) { return (int32x4_t) { x, 1, 2, 3 }; } Code-gen without patch: f: adrp x1, .LC0 ldr q0, [x1, #:lo12:.LC0] ins v0.s[0], w0 ret Code-gen with patch: f: movi v0.2s, 0x2 adrp x1, .LC0 ldr d1, [x1, #:lo12:.LC0] ins v0.s[0], w0 zip1 v0.4s, v0.4s, v1.4s ret It shows, fallback_seq_cost = 20, seq_total_cost = 16 where seq_total_cost determines the cost for interleave+zip1 sequence and fallback_seq_cost is the cost for fallback sequence. Altho it shows lesser cost, I am not sure if the interleave+zip1 sequence is better in this case ? 2] sve/acle/general/dupq_[5-6].c tests fail: int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3) { return (int32x4_t) { x0, x1, x2, x3 }; } code-gen without patch: f: fmov s0, w0 ins v0.s[1], w1 ins v0.s[2], w2 ins v0.s[3], w3 ret code-gen with patch: f: fmov s0, w0 fmov s1, w1 ins v0.s[1], w2 ins v1.s[1], w3 zip1 v0.4s, v0.4s, v1.4s ret It shows fallback_seq_cost = 28, seq_total_cost = 16 3] aarch64/ldp_stp_16.c's cons2_8_float test fails. Test case: void cons2_8_float(float *x, float val0, float val1) { #pragma GCC unroll(8) for (int i = 0; i < 8 * 2; i += 2) { x[i + 0] = val0; x[i + 1] = val1; } } which is lowered to: void cons2_8_float (float * x, float val0, float val1) { vector(4) float _86; [local count: 119292720]: _86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)}; MEM [(float *)x_10(D)] = _86; MEM [(float *)x_10(D) + 16B] = _86; MEM [(float *)x_10(D) + 32B] = _86; MEM [(float *)x_10(D) + 48B] = _86; return; } code-gen without patch: cons2_8_float: dup v0.4s, v0.s[0] ins v0.s[1], v1.s[0] ins v0.s[3], v1.s[0] stp q0, q0, [x0] stp q0, q0, [x0, 32] ret code-gen with patch: cons2_8_float: dup v1.2s, v1.s[0] dup v0.2s, v0.s[0] zip1 v0.4s, v0.4s, v1.4s stp q0, q0, [x0] stp q0, q0, [x0, 32] ret It shows fallback_seq_cost = 28, seq_total_cost = 16 I think the test fails because it doesn't match: ** dup v([0-9]+)\.4s, .* Shall it be OK to amend the test assuming code-gen with patch is better ? 4] aarch64/pr109072_1.c s32x4_3 test fails: For the following test: int32x4_t s32x4_3 (int32_t x, int32_t y) { int32_t arr[] = { x, y, y, y }; return vld1q_s32 (arr); } code-gen without patch: s32x4_3: dup v0.4s, w1 ins v0.s[0], w0 ret code-gen with patch: s32x4_3: fmov s1, w1 fmov s0, w0 ins v0.s[1], v1.s[0] dup v1.2s, v1.s[0] zip1 v0.4s, v0.4s, v1.4s ret It shows fallback_seq_cost = 20, seq_total_cost = 16 I am not sure how interleave+zip1 cost is lesser than fallback seq cost for this case. I assume that the fallback sequence is better here ? PS: The patch for folding svdupq to ctor+vec_perm_expr passes bootstrap+test without any issues. Thanks, Prathamesh > > Thanks, > Richard > > > > > which generates: > > f2: > > .LFB3901: > > .cfi_startproc > > movi v1.16b, 0x1 > > ptrue p0.b, all > > cmeq v0.16b, v0.16b, #0 > > bic v0.16b, v1.16b, v0.16b > > dup z0.q, z0.q[0] > > cmpne p0.b, p0/z, z0.b, #0 > > ret > > > > Thanks, > > Prathamesh