From: Richard Sandiford
To: Tamar Christina via Gcc-patches
Cc: Tamar Christina, nd@arm.com, rguenther@suse.de, jlaw@ventanamicro.com
Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
Date: Fri, 10 Feb 2023 13:36:28 +0000
In-Reply-To: (Tamar Christina via Gcc-patches's message of "Thu, 9 Feb 2023 17:16:40 +0000")

I think I'm misunderstanding, but: it seems like we're treating the add highpart optabs as companions to the mul highpart optabs.
But AIUI, the add highpart optab is used such that, for an N-bit mode, we do an N-bit addition followed by a shift by N/2. Is that right? The mul highpart optabs instead do a 2N-bit multiplication followed by a shift by N.

Apart from consistency, the reason this matters is: I'm not sure what we gain by adding the optab rather than simply open-coding the addition and the shift directly into the vector pattern. It seems like the AArch64 expander in 2/2 does just do an ordinary N-bit addition followed by an ordinary shift by N/2.

Some comments in addition to Richard's:

Tamar Christina via Gcc-patches writes:
> Hi All,
>
> As discussed in the ticket, this replaces the approach for optimizing the
> div by bitmask operation from a hook into optabs implemented through
> add_highpart.
>
> In order to be able to use this we need to check whether the current precision
> has enough bits to do the operation without any of the additions overflowing.
>
> We use range information to determine this and only do the operation if we're
> sure an overflow won't occur.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR target/108583
> * doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> * doc/tm.texi.in: Likewise.
> * explow.cc (round_push, align_dynamic_address): Revert previous patch.
> * expmed.cc (expand_divmod): Likewise.
> * expmed.h (expand_divmod): Likewise.
> * expr.cc (force_operand, expand_expr_divmod): Likewise.
> * optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> * internal-fn.def (ADDH): New.
> * optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> * doc/md.texi: Document them.
> * doc/rtl.texi: Likewise.
> * target.def (can_special_div_by_const): Remove.
> * target.h: Remove tree-core.h include
> * targhooks.cc (default_can_special_div_by_const): Remove.
> * targhooks.h (default_can_special_div_by_const): Remove.
> * tree-vect-generic.cc (expand_vector_operation): Remove hook.
> * tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
> implement new optab recognition based on range.
> * tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>
> gcc/testsuite/ChangeLog:
>
> PR target/108583
> * gcc.dg/vect/vect-div-bitmask-4.c: New test.
> * gcc.dg/vect/vect-div-bitmask-5.c: New test.
>
> --- inline copy of patch --
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
> Similar, but the multiplication is unsigned. This may be represented
> in RTL using an @code{umul_highpart} RTX expression.
>
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> +@item @samp{smul@var{m}3_highpart} sadd
> +Perform a signed addition of operands 1 and 2, which have mode
> +@var{m}, and store the most significant half of the product in operand 0.
> +The least significant half of the product is discarded. This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.
> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> +@item @samp{uadd@var{m}3_highpart}
> +Similar, but the addition is unsigned. This may be represented
> +in RTL using an @code{uadd_highpart} RTX expression.
> +
> @cindex @code{madd@var{m}@var{n}4} instruction pattern
> @item @samp{madd@var{m}@var{n}4}
> Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2535,6 +2535,17 @@ out in machine mode @var{m}. @code{smul_highpart} returns the high part
> of a signed multiplication, @code{umul_highpart} returns the high part
> of an unsigned multiplication.
>
> +@findex sadd_highpart
> +@findex uadd_highpart
> +@cindex high-part addition
> +@cindex addition high part
> +@item (sadd_highpart:@var{m} @var{x} @var{y})
> +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> +Represents the high-part addition of @var{x} and @var{y} carried
> +out in machine mode @var{m}. @code{sadd_highpart} returns the high part
> +of a signed addition, @code{uadd_highpart} returns the high part
> +of an unsigned addition.

The patch doesn't add these RTL codes though.

> +
> @findex fma
> @cindex fused multiply-add
> @item (fma:@var{m} @var{x} @var{y} @var{z})
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,22 +6137,6 @@ instruction pattern. There is no need for the hook to handle these two
> implementation approaches itself.
> @end deftypefn
>
> -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> -This hook is used to test whether the target has a special method of
> -division of vectors of type @var{vectype} using the value @var{constant},
> -and producing a vector of type @var{vectype}. The division
> -will then not be decomposed by the vectorizer and kept as a div.
> -
> -When the hook is being used to test whether the target supports a special
> -divide, @var{in0}, @var{in1}, and @var{output} are all null. When the hook
> -is being used to emit a division, @var{in0} and @var{in1} are the source
> -vectors of type @var{vecttype} and @var{output} is the destination vector of
> -type @var{vectype}.
> -
> -Return true if the operation is possible, emitting instructions for it
> -if rtxes are provided and updating @var{output}.
> -@end deftypefn
> -
> @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> This hook should return the decl of a function that implements the
> vectorized variant of the function with the @code{combined_fn} code
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,8 +4173,6 @@ address; but often a machine-dependent strategy can generate better code.
>
> @hook TARGET_VECTORIZE_VEC_PERM_CONST
>
> -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> -
> @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>
> @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> diff --git a/gcc/explow.cc b/gcc/explow.cc
> index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> --- a/gcc/explow.cc
> +++ b/gcc/explow.cc
> @@ -1037,7 +1037,7 @@ round_push (rtx size)
> TRUNC_DIV_EXPR. */
> size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> NULL_RTX, 1, OPTAB_LIB_WIDEN);
> - size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> + size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> NULL_RTX, 1);
> size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>
> @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
> gen_int_mode (required_align / BITS_PER_UNIT - 1,
> Pmode),
> NULL_RTX, 1, OPTAB_LIB_WIDEN);
> - target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> + target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> gen_int_mode (required_align / BITS_PER_UNIT,
> Pmode),
> NULL_RTX, 1);
> diff --git a/gcc/expmed.h b/gcc/expmed.h
> index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
> extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
> int);
> #ifdef GCC_OPTABS_H
> -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> - rtx, rtx, rtx, int,
> - enum optab_methods = OPTAB_LIB_WIDEN);
> +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> + rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
> #endif
> #endif
>
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
>
> rtx
> expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> - tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> - int unsignedp, enum optab_methods methods)
> + rtx op0, rtx op1, rtx target, int unsignedp,
> + enum optab_methods methods)
> {
> machine_mode compute_mode;
> rtx tquotient;
> @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>
> last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>
> - /* Check if the target has specific expansions for the division. */
> - tree cst;
> - if (treeop0
> - && treeop1
> - && (cst = uniform_integer_cst_p (treeop1))
> - && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> - wi::to_wide (cst),
> - &target, op0, op1))
> - return target;
> -
> -
> /* Now convert to the best mode to use. */
> if (compute_mode != mode)
> {
> @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> || (optab_handler (sdivmod_optab, int_mode)
> != CODE_FOR_nothing)))
> quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> - int_mode, treeop0, treeop1,
> - op0, gen_int_mode (abs_d,
> + int_mode, op0,
> + gen_int_mode (abs_d,
> int_mode),
> NULL_RTX, 0);
> else
> @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> size - 1, NULL_RTX, 0);
> t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> NULL_RTX);
> - t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> - treeop1, t3, op1, NULL_RTX, 0);
> + t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> + NULL_RTX, 0);
> if (t4)
> {
> rtx t5;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> return expand_divmod (0,
> FLOAT_MODE_P (GET_MODE (value))
> ? RDIV_EXPR : TRUNC_DIV_EXPR,
> - GET_MODE (value), NULL, NULL, op1, op2,
> - target, 0);
> + GET_MODE (value), op1, op2, target, 0);
> case MOD:
> - return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> - op1, op2, target, 0);
> + return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> + target, 0);
> case UDIV:
> - return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> - op1, op2, target, 1);
> + return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> + target, 1);
> case UMOD:
> - return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> - op1, op2, target, 1);
> + return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> + target, 1);
> case ASHIFTRT:
> return expand_simple_binop (GET_MODE (value), code, op1, op2,
> target, 0, OPTAB_LIB_WIDEN);
> @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> bool speed_p = optimize_insn_for_speed_p ();
> do_pending_stack_adjust ();
> start_sequence ();
> - rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> - op0, op1, target, 1);
> + rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
> rtx_insn *uns_insns = get_insns ();
> end_sequence ();
> start_sequence ();
> - rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> - op0, op1, target, 0);
> + rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
> rtx_insn *sgn_insns = get_insns ();
> end_sequence ();
> unsigned uns_cost = seq_cost (uns_insns, speed_p);
> @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> emit_insn (sgn_insns);
> return sgn_ret;
> }
> - return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> - op0, op1, target, unsignedp);
> + return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
> }
>
> rtx
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
>
> DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
> smul_highpart, umul_highpart, binary)
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
> + sadd_highpart, uadd_highpart, binary)
> DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
> smulhs, umulhs, binary)
> DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
> return NULL_RTX;
> }
> }
> - rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> - sum, gen_int_mode (INTVAL (op1),
> - word_mode),
> + rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> + gen_int_mode (INTVAL (op1), word_mode),
> NULL_RTX, 1, OPTAB_DIRECT);
> if (remainder == NULL_RTX)
> return NULL_RTX;
> @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>
> if (op11 != const1_rtx)
> {
> - rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> - op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> + rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> + NULL_RTX, unsignedp, OPTAB_DIRECT);
> if (rem2 == NULL_RTX)
> return NULL_RTX;
>
> @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
> if (rem2 == NULL_RTX)
> return NULL_RTX;
>
> - rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> - op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> + rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> + NULL_RTX, unsignedp, OPTAB_DIRECT);
> if (quot2 == NULL_RTX)
> return NULL_RTX;
>
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>
> OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
> OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
> +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> +OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
>
> OPTAB_D (cmpmem_optab, "cmpmem$a")
> OPTAB_D (cmpstr_optab, "cmpstr$a")
> diff --git a/gcc/target.def b/gcc/target.def
> index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> const vec_perm_indices &sel),
> NULL)
>
> -DEFHOOK
> -(can_special_div_by_const,
> - "This hook is used to test whether the target has a special method of\n\
> -division of vectors of type @var{vectype} using the value @var{constant},\n\
> -and producing a vector of type @var{vectype}. The division\n\
> -will then not be decomposed by the vectorizer and kept as a div.\n\
> -\n\
> -When the hook is being used to test whether the target supports a special\n\
> -divide, @var{in0}, @var{in1}, and @var{output} are all null. When the hook\n\
> -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> -type @var{vectype}.\n\
> -\n\
> -Return true if the operation is possible, emitting instructions for it\n\
> -if rtxes are provided and updating @var{output}.",
> - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> - rtx in0, rtx in1),
> - default_can_special_div_by_const)
> -
> /* Return true if the target supports misaligned store/load of a
> specific factor denoted in the third parameter. The last parameter
> is true if the access is defined in a packed struct. */
> diff --git a/gcc/target.h b/gcc/target.h
> index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -51,7 +51,6 @@
> #include "insn-codes.h"
> #include "tm.h"
> #include "hard-reg-set.h"
> -#include "tree-core.h"
>
> #if CHECKING_P
>
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
> extern rtx default_addr_space_convert (rtx, tree, tree);
> extern unsigned int default_case_values_threshold (void);
> extern bool default_have_conditional_execution (void);
> -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> - rtx *, rtx, rtx);
>
> extern bool default_libc_has_function (enum function_class, tree);
> extern bool default_libc_has_fast_function (int fcode);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> return HAVE_conditional_execution;
> }
>
> -/* Default that no division by constant operations are special. */
> -bool
> -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> - rtx)
> -{
> - return false;
> -}
> -
> /* By default we assume that c99 functions are present at the runtime,
> but sincos is not. */
> bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> + v /= i;
> + return v;
> +}
> +
> +int
> +main (void)
> +{
> + V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> + for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> + if (v[i] != 0x00010001)
> + __builtin_abort ();
> + return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include
> +#include
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> + for (int i = 0; i < n; i+=1)
> + pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> + for (int i = 0; i < n; i+=1)
> + pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> + TYPE a[N];
> + TYPE b[N];
> +
> + for (int i = 0; i < N; ++i)
> + {
> + a[i] = BASE + i * 13;
> + b[i] = BASE + i * 13;
> + if (DEBUG)
> + printf ("%d: 0x%x\n", i, a[i]);
> + }
> +
> + fun1 (a, N / 2, N);
> + fun2 (b, N / 2, N);
> +
> + for (int i = 0; i < N; ++i)
> + {
> + if (DEBUG)
> + printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> + if (a[i] != b[i])
> + __builtin_abort ();
> + }
> + return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> --- a/gcc/tree-vect-generic.cc
> +++ b/gcc/tree-vect-generic.cc
> @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
> tree rhs2 = gimple_assign_rhs2 (assign);
> tree ret;
>
> - /* Check if the target was going to handle it through the special
> - division callback hook. */
> - tree cst = uniform_integer_cst_p (rhs2);
> - if (cst &&
> - targetm.vectorize.can_special_div_by_const (code, type,
> - wi::to_wide (cst),
> - NULL,
> - NULL_RTX, NULL_RTX))
> - return NULL_TREE;
> -
> -
> if (!optimize
> || !VECTOR_INTEGER_TYPE_P (type)
> || TREE_CODE (rhs2) != VECTOR_CST
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> return pattern_stmt;
> }
> else if ((cst = uniform_integer_cst_p (oprnd1))
> - && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> - wi::to_wide (cst),
> - NULL, NULL_RTX,
> - NULL_RTX))
> + && TYPE_UNSIGNED (itype)
> + && rhs_code == TRUNC_DIV_EXPR
> + && vectype
> + && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> + OPTIMIZE_FOR_SPEED))
> {
> - return NULL;
> + /* div optimizations using narrowings
> + we can do the division e.g. shorts by 255 faster by calculating it as
> + (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> + double the precision of x.
> +
> + If we imagine a short as being composed of two blocks of bytes then
> + adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> + adding 1 to each sub component:
> +
> +          short value of 16-bits
> +     ┌──────────────┬────────────────┐
> +     │              │                │
> +     └──────────────┴────────────────┘
> +       8-bit part1 ▲  8-bit part2 ▲
> +                   │              │
> +                   │              │
> +                  +1             +1
> +
> + after the first addition, we have to shift right by 8, and narrow the
> + results back to a byte. Remember that the addition must be done in
> + double the precision of the input. However if we know that the addition
> + `x + 257` does not overflow then we can do the operation in the current
> + precision. In which case we don't need the pack and unpacks. */
> + auto wcst = wi::to_wide (cst);
> + int pow = wi::exact_log2 (wcst + 1);
> + if (pow == (int) (element_precision (vectype) / 2))
> + {
> + wide_int min,max;
> + /* If we're in a pattern we need to find the orginal definition. */
> + tree op0 = oprnd0;
> + gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> + stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> + if (is_pattern_stmt_p (stmt_info))
> + {
> + auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> + if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> + op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> + }

If this is generally safe (I'm skipping thinking about it in the interests of a quick review :-)), then I think it should be done in vect_get_range_info instead. Using gimple_get_lhs would be more general than handling just assignments.

> +
> + /* Check that no overflow will occur. If we don't have range
> + information we can't perform the optimization. */
> + if (vect_get_range_info (op0, &min, &max))
> + {
> + wide_int one = wi::to_wide (build_one_cst (itype));
> + wide_int adder = wi::add (one, wi::lshift (one, pow));
> + wi::overflow_type ovf;
> + /* We need adder and max in the same precision. */
> + wide_int zadder
> + = wide_int_storage::from (adder, wi::get_precision (max),
> + UNSIGNED);
> + wi::add (max, zadder, UNSIGNED, &ovf);

Could you explain this a bit more? When do we have mismatched precisions?
Thanks,
Richard

> + if (ovf == wi::OVF_NONE)
> + {
> + *type_out = vectype;
> + tree tadder = wide_int_to_tree (itype, adder);
> + gcall *patt1
> + = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> + tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> + gimple_call_set_lhs (patt1, lhs);
> + append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> + pattern_stmt
> + = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> + lhs = vect_recog_temp_ssa_var (itype, NULL);
> + gimple_call_set_lhs (pattern_stmt, lhs);
> +
> + return pattern_stmt;
> + }
> + }
> + }
> }
>
> if (prec > HOST_BITS_PER_WIDE_INT
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> }
> target_support_p = (optab_handler (optab, vec_mode)
> != CODE_FOR_nothing);
> - tree cst;
> - if (!target_support_p
> - && op1
> - && (cst = uniform_integer_cst_p (op1)))
> - target_support_p
> - = targetm.vectorize.can_special_div_by_const (code, vectype,
> - wi::to_wide (cst),
> - NULL, NULL_RTX,
> - NULL_RTX);
> }
>
> bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
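
For reference, here is a minimal scalar C sketch of the arithmetic being discussed. It is not taken from the patch: the helper names (addh16, mulh16, div255_wide, div255_narrow, narrow_div255_matches_up_to) are made up for illustration, N = 16 is an assumption, and addh16 models add highpart only as I understand its use in the pattern above (an N-bit addition followed by a shift by N/2). It contrasts that with a mul highpart and shows why the range check on `x + 257` is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of add highpart as the pattern seems to use it:
   a 16-bit (N-bit) addition that wraps, followed by a shift by N/2 = 8.  */
static uint16_t addh16 (uint16_t a, uint16_t b)
{
  uint16_t sum = (uint16_t) (a + b);   /* truncate back to N bits */
  return (uint16_t) (sum >> 8);
}

/* mul highpart for comparison: a 2N-bit multiplication, shift by N.  */
static uint16_t mulh16 (uint16_t a, uint16_t b)
{
  return (uint16_t) (((uint32_t) a * b) >> 16);
}

/* x / 255 computed in double the precision of x, as in the patch
   comment: (x + ((x + 257) >> 8)) >> 8.  */
static uint16_t div255_wide (uint16_t x)
{
  uint32_t w = x;
  return (uint16_t) ((w + ((w + 257) >> 8)) >> 8);
}

/* The same sequence built from two N-bit add highparts, mirroring the
   two IFN_ADDH calls the pattern emits.  Only valid when x + 257 does
   not overflow 16 bits.  */
static uint16_t div255_narrow (uint16_t x)
{
  return addh16 (x, addh16 (x, 257));
}

/* Return 1 if the narrow sequence equals x / 255 for every x in
   [0, max], 0 otherwise.  */
static int narrow_div255_matches_up_to (uint16_t max)
{
  for (uint32_t x = 0; x <= max; x++)
    if (div255_narrow ((uint16_t) x) != x / 255)
      return 0;
  return 1;
}
```

Under these assumptions the narrow form agrees with x / 255 exactly for inputs up to 65535 - 257 and goes wrong once the first addition wraps, which is the case the range check is meant to exclude.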