From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd@arm.com, Richard.Earnshaw@arm.com,
    Marcus.Shawcroft@arm.com, Kyrylo.Tkachov@arm.com
Subject: Re: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
Date: Fri, 10 Feb 2023 14:10:16 +0000
In-Reply-To: (Tamar Christina's message of "Thu, 9 Feb 2023 17:22:09 +0000")

I was asking in the 1/2 review whether we need the optab, but that
decision doesn't affect the other patterns, so:

Tamar Christina writes:
> Hi All,
>
> This replaces the custom division hook with just an implementation through
> add_highpart.  For NEON we implement the add highpart (addition + extraction
> of the upper highpart of the register in the same precision) as ADD + LSR.
>
> This representation allows us to easily optimize the sequence using existing
> sequences.
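(As a rough scalar sketch of the operation being described -- the helper
name and the choice of 16-bit elements are illustrative, not taken from
the patch:)

  #include <stdint.h>

  /* Scalar model of "add highpart" on a 16-bit element: add the two
     inputs modulo 2^16, then keep the upper half of the sum while
     staying in 16-bit precision -- i.e. an ADD followed by a logical
     shift right by half the element size.  */
  static inline uint16_t
  add_highpart_u16 (uint16_t a, uint16_t b)
  {
    return (uint16_t) (a + b) >> 8;
  }
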
> This gets us a pretty decent sequence using SRA:
>
>         umull   v1.8h, v0.8b, v3.8b
>         umull2  v0.8h, v0.16b, v3.16b
>         add     v5.8h, v1.8h, v2.8h
>         add     v4.8h, v0.8h, v2.8h
>         usra    v1.8h, v5.8h, 8
>         usra    v0.8h, v4.8h, 8
>         uzp2    v1.16b, v1.16b, v0.16b
>
> To get the most optimal sequence, however, we match (a + ((b + c) >> n)) where n
> is half the precision of the mode of the operation into addhn + uaddw, which is
> a generally good optimization on its own and gets us back to:
>
> .L4:
>         ldr     q0, [x3]
>         umull   v1.8h, v0.8b, v5.8b
>         umull2  v0.8h, v0.16b, v5.16b
>         addhn   v3.8b, v1.8h, v4.8h
>         addhn   v2.8b, v0.8h, v4.8h
>         uaddw   v1.8h, v1.8h, v3.8b
>         uaddw   v0.8h, v0.8h, v2.8b
>         uzp2    v1.16b, v1.16b, v0.16b
>         str     q1, [x3], 16
>         cmp     x3, x4
>         bne     .L4
>
> For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets us:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         add     z1.h, z0.h, z3.h
>         usra    z0.h, z1.h, #8
>         lsr     z0.h, z0.h, #8
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> .L1:
>         ret
>
> and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
> to addhnb, which gets us to:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         addhnb  z1.b, z0.h, z3.h
>         addhnb  z0.b, z0.h, z1.h
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
>
> There are multiple RTL representations possible for these optimizations.  I did
> not represent them using a zero_extend because we seem very inconsistent in this
> in the backend.  Since they are unspecs we won't match them from vector ops
> anyway.  I figured maintainers would prefer this, but my maintainer ouija board
> is still out for repairs :)

I agree this is the best approach as things stand.  Personally, I'd like
to have some way for the target to define simplification rules based on
unspecs, so that unspecs act more like target-specific rtl codes.  But I
know others disagree, and it wouldn't really apply to this case anyway.

> There are no new tests as new correctness tests were added to the mid-end and
> the existing codegen tests for this already exist.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/108583
> 	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
> 	(add3_highpart, *bitmask_shift_plus): New.
> 	* config/aarch64/aarch64-sve2.md (add3_highpart,
> 	*bitmask_shift_plus): New.
> 	(@aarch64_bitmask_udiv3): Remove.
> 	* config/aarch64/aarch64.cc
> 	(aarch64_vectorize_can_special_div_by_constant): Removed.
> 	* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
> 	UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
>
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand "aarch64_hn2"
>  }
> )
>
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimizations for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing.  The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "add3_highpart"
> +  [(set (match_operand:VQN 0 "register_operand")
> +	(unspec:VQN [(match_operand:VQN 1 "register_operand")
> +		     (match_operand:VQN 2 "register_operand")]
> +		    ADD_HIGHPART))]
> +  "TARGET_SIMD"
> +{
> +  rtx result = gen_reg_rtx (mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode,
> +							 shift_amount);
> +  emit_insn (gen_add3 (result, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_simd_lshr (operands[0], result, shift_vector));
> +  DONE;
> +})
> +
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +	(plus:VQN
> +	  (lshiftrt:VQN
> +	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +		      (match_operand:VQN 2 "register_operand" "w"))
> +	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +	  (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& !reload_completed"

This is an ICE trap, since "#" forces a split while "!reload_completed"
prevents one after reload.
I think the theoretically correct way would be to use operand 0 as a
temporary when reload_completed, which in turn means making it an
earlyclobber.

However, IIUC, this pattern would only be formed from combining three
distinct patterns.  Is that right?  If so, we should be able to handle
it as a plain define_split, with no define_insn.  That should make
things simpler, so would be worth trying before the changes I mentioned
above.

> +  [(const_int 0)]
> {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
> -  rtx tmp1 = gen_reg_rtx (mode);
> -  rtx tmp2 = gen_reg_rtx (mode);
> -  emit_insn (gen_aarch64_addhn (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr (operands[0], tmp2, shift_vector));
> +  rtx tmp = gen_reg_rtx (mode);
> +  emit_insn (gen_aarch64_addhn (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw (operands[0], operands[4], tmp));
>    DONE;
> -})
> +}
> +  [(set_attr "type" "neon_add_halve")]

I think we should leave this out, since it's a multi-instruction pattern.

> +)
>
> ;; pmul.
>
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_"
> ;; ---- [INT] Misc optab implementations
> ;; -------------------------------------------------------------------------
> ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - add_highpart
> ;; -------------------------------------------------------------------------
>
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimizations for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing.  The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "add3_highpart"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
> +	(unspec:SVE_FULL_HSDI
> +	  [(match_operand:SVE_FULL_HSDI 1 "register_operand")
> +	   (match_operand:SVE_FULL_HSDI 2 "register_operand")]
> +	  ADD_HIGHPART))]
>    "TARGET_SVE2"
> {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> +  rtx result = gen_reg_rtx (mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode,
> +							 shift_amount);
> +  emit_insn (gen_add3 (result, operands[1], operands[2]));
> +  emit_insn (gen_vlshr3 (operands[0], result, shift_vector));
> +  DONE;
> +})
>
> -  rtx addend = gen_reg_rtx (mode);
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +	(unspec:SVE_FULL_HSDI [
> +	  (match_operand: 1 "register_operand" "Upl")

Looks like this can be:

  (match_operand: 1)

since the predicate isn't used.

> +	  (lshiftrt:SVE_FULL_HSDI
> +	    (plus:SVE_FULL_HSDI
> +	      (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +	      (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +	    (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +	] UNSPEC_PRED_X))]

Very minor nit, but the formatting used in the file follows the style in
the earlier pattern above, with [ immediately before ( and ] immediately
after ).  Not that that's inherently better or anything, it's just a
consistency thing.

> +  "TARGET_SVE2"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
> +{
>    rtx tmp1 = gen_reg_rtx (mode);
> -  rtx tmp2 = gen_reg_rtx (mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, mode, tmp1, operands[1],
> -			      addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, mode, tmp2, operands[1],
> -			      lowpart_subreg (mode, tmp1,
> -					      mode)));
> +  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, mode, tmp1, operands[2], operands[3]));
>    emit_move_insn (operands[0],
> -		  lowpart_subreg (mode, tmp2, mode));
> +		  lowpart_subreg (mode, tmp1, mode));
>    DONE;
> })

Since this is a single instruction, I'm not sure it's worth splitting it.
Perhaps there would be CSE opportunities from having a single form, but
it seems unlikely.  And doing the unsplit form is nice and safe.

But yeah, generating the patterns this way seems like a good approach.
It might even help optimise open-coded versions of the same trick.

Thanks,
Richard

>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
>
>    return ret;
> }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -					       tree vectype, wide_int cst,
> -					       rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
> /* Generate a byte permute mask for a register of mode MODE,
>    which has NUNITS units.  */
>
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -750,6 +750,8 @@ (define_c_enum "unspec"
>     UNSPEC_REVH		; Used in aarch64-sve.md.
>     UNSPEC_REVW		; Used in aarch64-sve.md.
>     UNSPEC_REVBHW	; Used in aarch64-sve.md.
> +    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
> +    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
>     UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
>     UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
>     UNSPEC_FMLA		; Used in aarch64-sve.md.
> @@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
>
> (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
>
> +(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
> (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
>
> (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
> @@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
> 		     (UNSPEC_UNPACKUHI "u")
> 		     (UNSPEC_UNPACKSLO "s")
> 		     (UNSPEC_UNPACKULO "u")
> +		     (UNSPEC_SADD_HIGHPART "s")
> +		     (UNSPEC_UADD_HIGHPART "u")
> 		     (UNSPEC_SMUL_HIGHPART "s")
> 		     (UNSPEC_UMUL_HIGHPART "u")
> 		     (UNSPEC_COND_FCVTZS "s")
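
(For reference, the identity behind the comments removed above -- that for
a 16-bit x, computing (x + ((x + 257) >> 8)) >> 8 in double the precision
gives x / 255 -- can be checked exhaustively with a small standalone
program; a sketch, not part of the patch:)

  #include <stdint.h>
  #include <stdio.h>

  /* Brute-force check: for every 16-bit value x, the add/shift sequence
     evaluated in 32-bit arithmetic must agree with the exact division
     x / 255.  */
  int
  main (void)
  {
    for (uint32_t x = 0; x <= UINT16_MAX; x++)
      {
        uint32_t fast = (x + ((x + 257) >> 8)) >> 8;
        if (fast != x / 255)
          {
            printf ("mismatch at %u: %u != %u\n", x, fast, x / 255);
            return 1;
          }
      }
    printf ("identity holds for all 16-bit inputs\n");
    return 0;
  }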