public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
@ 2023-02-09 17:16 Tamar Christina
  2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-09 17:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jlaw

[-- Attachment #1: Type: text/plain, Size: 25251 bytes --]

Hi All,

As discussed in the ticket, this replaces the hook-based approach for
optimizing the div-by-bitmask operation with new optabs implemented through
add_highpart.

In order to be able to use this we need to check whether the current precision
has enough bits to do the operation without any of the additions overflowing.

We use range information to determine this and only do the operation if we're
sure an overflow won't occur.
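
To make the transform concrete, the following is a minimal scalar sketch
(illustrative only, not part of the patch) of the identity the new expansion
relies on, using 16-bit elements and the divisor 0xff as an example, plus the
narrower variant that is only safe once the range of x rules out overflow:

#include <stdint.h>
#include <assert.h>

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    {
      /* Reference: x / 0xff computed as (x + ((x + 257) >> 8)) >> 8 in a
         wider (32-bit) precision; this holds for every 16-bit x.  */
      assert (((x + ((x + 257) >> 8)) >> 8) == x / 0xff);

      /* If range information shows that x + 257 cannot wrap in 16 bits,
         the same sequence is safe in the original precision, mirroring
         the check added to vect_recog_divmod_pattern.  */
      if (x <= 0xffffu - 257)
        {
          uint16_t t = (uint16_t) (x + 257) >> 8;
          uint16_t q = (uint16_t) (x + t) >> 8;
          assert (q == x / 0xff);
        }
    }
  return 0;
}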

Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
	* doc/tm.texi.in: Likewise.
	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
	* expmed.cc (expand_divmod): Likewise.
	* expmed.h (expand_divmod): Likewise.
	* expr.cc (force_operand, expand_expr_divmod): Likewise.
	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
	* internal-fn.def (ADDH): New.
	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
	* doc/md.texi: Document them.
	* doc/rtl.texi: Likewise.
	* target.def (can_special_div_by_const): Remove.
	* target.h: Remove tree-core.h include.
	* targhooks.cc (default_can_special_div_by_const): Remove.
	* targhooks.h (default_can_special_div_by_const): Remove.
	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
	implement new optab recognition based on range.
	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.

gcc/testsuite/ChangeLog:

	PR target/108583
	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
	* gcc.dg/vect/vect-div-bitmask-5.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
 Similar, but the multiplication is unsigned.  This may be represented
 in RTL using an @code{umul_highpart} RTX expression.
 
+@cindex @code{sadd@var{m}3_highpart} instruction pattern
+@item @samp{sadd@var{m}3_highpart}
+Perform a signed addition of operands 1 and 2, which have mode
+@var{m}, and store the most significant half of the sum in operand 0.
+The least significant half of the sum is discarded.  This may be
+represented in RTL using a @code{sadd_highpart} RTX expression.
+
+@cindex @code{uadd@var{m}3_highpart} instruction pattern
+@item @samp{uadd@var{m}3_highpart}
+Similar, but the addition is unsigned.  This may be represented
+in RTL using an @code{uadd_highpart} RTX expression.
+
 @cindex @code{madd@var{m}@var{n}4} instruction pattern
 @item @samp{madd@var{m}@var{n}4}
 Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
 of a signed multiplication, @code{umul_highpart} returns the high part
 of an unsigned multiplication.
 
+@findex sadd_highpart
+@findex uadd_highpart
+@cindex high-part addition
+@cindex addition high part
+@item (sadd_highpart:@var{m} @var{x} @var{y})
+@itemx (uadd_highpart:@var{m} @var{x} @var{y})
+Represents the high-part addition of @var{x} and @var{y} carried
+out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
+of a signed addition, @code{uadd_highpart} returns the high part
+of an unsigned addition.
+
 @findex fma
 @cindex fused multiply-add
 @item (fma:@var{m} @var{x} @var{y} @var{z})
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
-This hook is used to test whether the target has a special method of
-division of vectors of type @var{vectype} using the value @var{constant},
-and producing a vector of type @var{vectype}.  The division
-will then not be decomposed by the vectorizer and kept as a div.
-
-When the hook is being used to test whether the target supports a special
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
-is being used to emit a division, @var{in0} and @var{in1} are the source
-vectors of type @var{vecttype} and @var{output} is the destination vector of
-type @var{vectype}.
-
-Return true if the operation is possible, emitting instructions for it
-if rtxes are provided and updating @var{output}.
-@end deftypefn
-
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
 This hook should return the decl of a function that implements the
 vectorized variant of the function with the @code{combined_fn} code
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
-@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
 @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -1037,7 +1037,7 @@ round_push (rtx size)
      TRUNC_DIV_EXPR.  */
   size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
 		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
+  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
 			NULL_RTX, 1);
   size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
 
@@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
 			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
 				       Pmode),
 			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
+  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
 			  gen_int_mode (required_align / BITS_PER_UNIT,
 					Pmode),
 			  NULL_RTX, 1);
diff --git a/gcc/expmed.h b/gcc/expmed.h
index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
 extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
 			       int);
 #ifdef GCC_OPTABS_H
-extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
-			  rtx, rtx, rtx, int,
-			  enum optab_methods = OPTAB_LIB_WIDEN);
+extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
+			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
 #endif
 #endif
 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
 
 rtx
 expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
-	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
-	       int unsignedp, enum optab_methods methods)
+	       rtx op0, rtx op1, rtx target, int unsignedp,
+	       enum optab_methods methods)
 {
   machine_mode compute_mode;
   rtx tquotient;
@@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 
   last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
 
-  /* Check if the target has specific expansions for the division.  */
-  tree cst;
-  if (treeop0
-      && treeop1
-      && (cst = uniform_integer_cst_p (treeop1))
-      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
-						     wi::to_wide (cst),
-						     &target, op0, op1))
-    return target;
-
-
   /* Now convert to the best mode to use.  */
   if (compute_mode != mode)
     {
@@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 			    || (optab_handler (sdivmod_optab, int_mode)
 				!= CODE_FOR_nothing)))
 		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
-						int_mode, treeop0, treeop1,
-						op0, gen_int_mode (abs_d,
+						int_mode, op0,
+						gen_int_mode (abs_d,
 							      int_mode),
 						NULL_RTX, 0);
 		    else
@@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 				      size - 1, NULL_RTX, 0);
 		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
 				    NULL_RTX);
-		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
-				    treeop1, t3, op1, NULL_RTX, 0);
+		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
+				    NULL_RTX, 0);
 		if (t4)
 		  {
 		    rtx t5;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
 	    return expand_divmod (0,
 				  FLOAT_MODE_P (GET_MODE (value))
 				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
-				  GET_MODE (value), NULL, NULL, op1, op2,
-				  target, 0);
+				  GET_MODE (value), op1, op2, target, 0);
 	case MOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 0);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 0);
 	case UDIV:
-	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case UMOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case ASHIFTRT:
 	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
 				      target, 0, OPTAB_LIB_WIDEN);
@@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       bool speed_p = optimize_insn_for_speed_p ();
       do_pending_stack_adjust ();
       start_sequence ();
-      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 1);
+      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
       rtx_insn *uns_insns = get_insns ();
       end_sequence ();
       start_sequence ();
-      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 0);
+      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
       rtx_insn *sgn_insns = get_insns ();
       end_sequence ();
       unsigned uns_cost = seq_cost (uns_insns, speed_p);
@@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       emit_insn (sgn_insns);
       return sgn_ret;
     }
-  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
-			op0, op1, target, unsignedp);
+  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
 }
 
 rtx
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
 			      smul_highpart, umul_highpart, binary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
+			      sadd_highpart, uadd_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
 		return NULL_RTX;
 	    }
 	}
-      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
-				     sum, gen_int_mode (INTVAL (op1),
-							word_mode),
+      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
+				     gen_int_mode (INTVAL (op1), word_mode),
 				     NULL_RTX, 1, OPTAB_DIRECT);
       if (remainder == NULL_RTX)
 	return NULL_RTX;
@@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
 
   if (op11 != const1_rtx)
     {
-      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
-				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
+				NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
@@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
-      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
-				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
+				 NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (quot2 == NULL_RTX)
 	return NULL_RTX;
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
 
 OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
 OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
+OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
+OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
 
 OPTAB_D (cmpmem_optab, "cmpmem$a")
 OPTAB_D (cmpstr_optab, "cmpstr$a")
diff --git a/gcc/target.def b/gcc/target.def
index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1905,25 +1905,6 @@ implementation approaches itself.",
 	const vec_perm_indices &sel),
  NULL)
 
-DEFHOOK
-(can_special_div_by_const,
- "This hook is used to test whether the target has a special method of\n\
-division of vectors of type @var{vectype} using the value @var{constant},\n\
-and producing a vector of type @var{vectype}.  The division\n\
-will then not be decomposed by the vectorizer and kept as a div.\n\
-\n\
-When the hook is being used to test whether the target supports a special\n\
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
-is being used to emit a division, @var{in0} and @var{in1} are the source\n\
-vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
-type @var{vectype}.\n\
-\n\
-Return true if the operation is possible, emitting instructions for it\n\
-if rtxes are provided and updating @var{output}.",
- bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
-	rtx in0, rtx in1),
- default_can_special_div_by_const)
-
 /* Return true if the target supports misaligned store/load of a
    specific factor denoted in the third parameter.  The last parameter
    is true if the access is defined in a packed struct.  */
diff --git a/gcc/target.h b/gcc/target.h
index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -51,7 +51,6 @@
 #include "insn-codes.h"
 #include "tm.h"
 #include "hard-reg-set.h"
-#include "tree-core.h"
 
 #if CHECKING_P
 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
 extern rtx default_addr_space_convert (rtx, tree, tree);
 extern unsigned int default_case_values_threshold (void);
 extern bool default_have_conditional_execution (void);
-extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
-					      rtx *, rtx, rtx);
 
 extern bool default_libc_has_function (enum function_class, tree);
 extern bool default_libc_has_fast_function (int fcode);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
   return HAVE_conditional_execution;
 }
 
-/* Default that no division by constant operations are special.  */
-bool
-default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
-				  rtx)
-{
-  return false;
-}
-
 /* By default we assume that c99 functions are present at the runtime,
    but sincos is not.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
 	  tree rhs2 = gimple_assign_rhs2 (assign);
 	  tree ret;
 
-	  /* Check if the target was going to handle it through the special
-	     division callback hook.  */
-	  tree cst = uniform_integer_cst_p (rhs2);
-	  if (cst &&
-	      targetm.vectorize.can_special_div_by_const (code, type,
-							  wi::to_wide (cst),
-							  NULL,
-							  NULL_RTX, NULL_RTX))
-	    return NULL_TREE;
-
-
 	  if (!optimize
 	      || !VECTOR_INTEGER_TYPE_P (type)
 	      || TREE_CODE (rhs2) != VECTOR_CST
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
       return pattern_stmt;
     }
   else if ((cst = uniform_integer_cst_p (oprnd1))
-	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
-							  wi::to_wide (cst),
-							  NULL, NULL_RTX,
-							  NULL_RTX))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
+					      OPTIMIZE_FOR_SPEED))
     {
-      return NULL;
+      /* Div optimizations using narrowings:
+       we can divide e.g. shorts by 255 faster by calculating the division as
+       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
+       double the precision of x.
+
+       If we imagine a short as being composed of two blocks of bytes then
+       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+       adding 1 to each sub component:
+
+	    short value of 16-bits
+       ┌──────────────┬────────────────┐
+       │              │                │
+       └──────────────┴────────────────┘
+	 8-bit part1 ▲  8-bit part2   ▲
+		     │                │
+		     │                │
+		    +1               +1
+
+       after the first addition, we have to shift right by 8, and narrow the
+       results back to a byte.  Remember that the addition must be done in
+       double the precision of the input.  However if we know that the addition
+       `x + 257` does not overflow then we can do the operation in the current
+       precision.  In which case we don't need the pack and unpacks.  */
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == (int) (element_precision (vectype) / 2))
+	{
+	  wide_int min, max;
+	  /* If we're in a pattern we need to find the original definition.  */
+	  tree op0 = oprnd0;
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
+	  if (is_pattern_stmt_p (stmt_info))
+	    {
+	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
+		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
+	    }
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+	  if (vect_get_range_info (op0, &min, &max))
+	    {
+	      wide_int one = wi::to_wide (build_one_cst (itype));
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      /* We need adder and max in the same precision.  */
+	      wide_int zadder
+		= wide_int_storage::from (adder, wi::get_precision (max),
+					  UNSIGNED);
+	      wi::add (max, zadder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  gcall *patt1
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
+		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (patt1, lhs);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  pattern_stmt
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
+		  lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (pattern_stmt, lhs);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
     }
 
   if (prec > HOST_BITS_PER_WIDE_INT
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
 	}
       target_support_p = (optab_handler (optab, vec_mode)
 			  != CODE_FOR_nothing);
-      tree cst;
-      if (!target_support_p
-	  && op1
-	  && (cst = uniform_integer_cst_p (op1)))
-	target_support_p
-	  = targetm.vectorize.can_special_div_by_const (code, vectype,
-							wi::to_wide (cst),
-							NULL, NULL_RTX,
-							NULL_RTX);
     }
 
   bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);




-- 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
@ 2023-02-09 17:22 ` Tamar Christina
  2023-02-10 10:35   ` Tamar Christina
  2023-02-10 14:10   ` Richard Sandiford
  2023-02-10 10:34 ` [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-09 17:22 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 14446 bytes --]

Hi All,

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement add highpart (addition plus extraction of
the upper half of each element, in the same precision) as ADD + LSR.
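
As a scalar sketch (names illustrative, not from the patch), the per-lane
behaviour of the unsigned expander is just an addition in the element
precision followed by a logical shift by half the element size:

#include <stdint.h>

/* Per 16-bit lane: what the uadd<mode>3_highpart expander emits,
   i.e. a modular ADD followed by an LSR by half the element size.  */
static inline uint16_t
uadd_highpart_lane (uint16_t a, uint16_t b)
{
  uint16_t sum = a + b;   /* wraps modulo 2^16, like the vector ADD */
  return sum >> 8;        /* keep the upper half, still 16 bits wide */
}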

This representation allows the sequence to be optimized easily using existing
patterns, and already gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
is half the precision of the mode of the operation, into addhn + uaddw.  This
is a generally good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4
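
A scalar sketch of the rewrite just described (names illustrative, not from
the patch); both forms compute the same 16-bit result, the second mirroring
the addhn + uaddw split:

#include <stdint.h>

static inline uint16_t
shift_plus_form (uint16_t a, uint16_t b, uint16_t c)
{
  /* The form matched by the new combine pattern.  */
  return a + (uint16_t) ((uint16_t) (b + c) >> 8);
}

static inline uint16_t
addhn_uaddw_form (uint16_t a, uint16_t b, uint16_t c)
{
  uint8_t high = (uint16_t) (b + c) >> 8;  /* addhn: add, keep the high half */
  return a + high;                         /* uaddw: widen and accumulate */
}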

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb, which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple possible RTL representations for these optimizations; I did
not represent them using a zero_extend because we seem very inconsistent about
this in the backend.  Since they are unspecs we won't match them from vector ops
anyway.  I figured maintainers would prefer this, but my maintainer ouija board
is still out for repairs :)

There are no new tests, as correctness tests were added to the mid-end and
codegen tests for this already exist.

Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
	*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant): Removed.
	* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
	UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_operand")
-   (match_operand:VQN 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations
+;; for narrowings represented as shifts, so this representation will let us
+;; optimize further should the result require narrowing.  The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to further
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+  [(set (match_operand:VQN 0 "register_operand")
+	(unspec:VQN [(match_operand:VQN 1 "register_operand")
+		     (match_operand:VQN 2 "register_operand")]
+		    ADD_HIGHPART))]
+  "TARGET_SIMD"
+{
+  rtx result = gen_reg_rtx (<MODE>mode);
+  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+							shift_amount);
+  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
+  DONE;
+})
+
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+  [(set (match_operand:VQN 0 "register_operand" "=w")
+	(plus:VQN
+	  (lshiftrt:VQN
+	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+		      (match_operand:VQN 2 "register_operand" "w"))
+	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+	  (match_operand:VQN 4 "register_operand" "w")))]
   "TARGET_SIMD"
+  "#"
+  "&& !reload_completed"
+  [(const_int 0)]
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
-  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
-  rtx tmp2 = gen_reg_rtx (<MODE>mode);
-  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
-  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
-  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
-  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
-  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
+  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
   DONE;
-})
+}
+  [(set_attr "type" "neon_add_halve<q>")]
+)
 
 ;; pmul.
 
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
 ;; ---- [INT] Misc optab implementations
 ;; -------------------------------------------------------------------------
 ;; Includes:
-;; - aarch64_bitmask_udiv
+;; - add_highpart
 ;; -------------------------------------------------------------------------
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
-   (match_operand:SVE_FULL_HSDI 1 "register_operand")
-   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations for
+;; narrowing represented as shifts, so this representation will allow us to
+;; further optimize this should the result require narrowing.  The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to further
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
+	(unspec:SVE_FULL_HSDI
+	  [(match_operand:SVE_FULL_HSDI 1 "register_operand")
+	   (match_operand:SVE_FULL_HSDI 2 "register_operand")]
+	  ADD_HIGHPART))]
   "TARGET_SVE2"
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
+  rtx result = gen_reg_rtx (<MODE>mode);
+  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+							shift_amount);
+  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+  emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
+  DONE;
+})
 
-  rtx addend = gen_reg_rtx (<MODE>mode);
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector element
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+	(unspec:SVE_FULL_HSDI [
+	    (match_operand:<VPRED> 1 "register_operand" "Upl")
+	    (lshiftrt:SVE_FULL_HSDI
+	      (plus:SVE_FULL_HSDI
+		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+	      (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+        ] UNSPEC_PRED_X))]
+  "TARGET_SVE2"
+  "#"
+  "&& !reload_completed"
+  [(const_int 0)]
+{
   rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
-  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
-			      addend));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
-			      lowpart_subreg (<MODE>mode, tmp1,
-					      <VNARROW>mode)));
+  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2], operands[3]));
   emit_move_insn (operands[0],
-		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
+		  lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
 
   return ret;
 }
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
-					       tree vectype, wide_int cst,
-					       rtx *output, rtx in0, rtx in1)
-{
-  if (code != TRUNC_DIV_EXPR
-      || !TYPE_UNSIGNED (vectype))
-    return false;
-
-  machine_mode mode = TYPE_MODE (vectype);
-  unsigned int flags = aarch64_classify_vector_mode (mode);
-  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
-    return false;
-
-  int pow = wi::exact_log2 (cst + 1);
-  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
-  /* SVE actually has a div operator, we may have gotten here through
-     that route.  */
-  if (pow != (int) (element_precision (vectype) / 2)
-      || insn_code == CODE_FOR_nothing)
-    return false;
-
-  /* We can use the optimized pattern.  */
-  if (in0 == NULL_RTX && in1 == NULL_RTX)
-    return true;
-
-  gcc_assert (output);
-
-  expand_operand ops[3];
-  create_output_operand (&ops[0], *output, mode);
-  create_input_operand (&ops[1], in0, mode);
-  create_fixed_operand (&ops[2], in1);
-  expand_insn (insn_code, 3, ops);
-  *output = ops[0].value;
-  return true;
-}
-
 /* Generate a byte permute mask for a register of mode MODE,
    which has NUNITS units.  */
 
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -750,6 +750,8 @@ (define_c_enum "unspec"
     UNSPEC_REVH		; Used in aarch64-sve.md.
     UNSPEC_REVW		; Used in aarch64-sve.md.
     UNSPEC_REVBHW	; Used in aarch64-sve.md.
+    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
+    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
     UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
     UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
     UNSPEC_FMLA		; Used in aarch64-sve.md.
@@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
 
 (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
 
+(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
 (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
 
 (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
@@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
 		     (UNSPEC_UNPACKUHI "u")
 		     (UNSPEC_UNPACKSLO "s")
 		     (UNSPEC_UNPACKULO "u")
+		     (UNSPEC_SADD_HIGHPART "s")
+		     (UNSPEC_UADD_HIGHPART "u")
 		     (UNSPEC_SMUL_HIGHPART "s")
 		     (UNSPEC_UMUL_HIGHPART "u")
 		     (UNSPEC_COND_FCVTZS "s")




-- 
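
As a small, self-contained C sketch of the scalar identity the add-highpart
patterns above are built on (illustrative only; the function name and the
choice of 16-bit elements with a 0xff divisor are assumptions, not part of
the patch):

/* x / 0xff computed as (x + ((x + 0x101) >> 8)) >> 8 entirely in 16-bit
   precision.  This is only exact while x + 0x101 cannot wrap, which is the
   range condition the middle-end patch checks before it uses the
   same-precision ADDH form.  */
#include <assert.h>
#include <stdint.h>

static uint16_t
div255 (uint16_t x)
{
  uint16_t t = (uint16_t) (x + 0x101) >> 8;   /* first "add highpart"  */
  return (uint16_t) (x + t) >> 8;             /* second "add highpart" */
}

int
main (void)
{
  /* 0xfefe is the largest value for which x + 0x101 still fits in 16 bits.  */
  for (uint32_t x = 0; x <= 0xfefe; x++)
    assert (div255 ((uint16_t) x) == x / 0xff);
  return 0;
}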

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
  2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
@ 2023-02-10 10:34 ` Tamar Christina
  2023-02-10 13:13 ` Richard Biener
  2023-02-10 13:36 ` Richard Sandiford
  3 siblings, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 10:34 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches; +Cc: nd, rguenther, jlaw

Oops, I realized I forgot to fill in the test results; there were no issues 😊

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Tamar
> Christina via Gcc-patches
> Sent: Thursday, February 9, 2023 5:17 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by
> using new optabs [PR108583]
> 
> Hi All,
> 
> As discussed in the ticket, this replaces the approach for optimizing the div by
> bitmask operation from a hook into optabs implemented through
> add_highpart.
> 
> In order to be able to use this we need to check whether the current
> precision has enough bits to do the operation without any of the additions
> overflowing.
> 
> We use range information to determine this and only do the operation if
> we're sure am overflow won't occur.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR target/108583
> 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> Remove.
> 	* doc/tm.texi.in: Likewise.
> 	* explow.cc (round_push, align_dynamic_address): Revert previous
> patch.
> 	* expmed.cc (expand_divmod): Likewise.
> 	* expmed.h (expand_divmod): Likewise.
> 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> 	* optabs.cc (expand_doubleword_mod,
> expand_doubleword_divmod): Likewise.
> 	* internal-fn.def (ADDH): New.
> 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> 	* doc/md.texi: Document them.
> 	* doc/rtl.texi: Likewise.
> 	* target.def (can_special_div_by_const): Remove.
> 	* target.h: Remove tree-core.h include
> 	* targhooks.cc (default_can_special_div_by_const): Remove.
> 	* targhooks.h (default_can_special_div_by_const): Remove.
> 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> and
> 	implement new obtab recognition based on range.
> 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/108583
> 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
> 38595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart}
> RTX expression.
>  Similar, but the multiplication is unsigned.  This may be represented  in RTL
> using an @code{umul_highpart} RTX expression.
> 
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> +@samp{smul@var{m}3_highpart} Perform a signed addition of operands 1
> +and 2, which have mode @var{m}, and store the most significant half of
> +the product in operand 0.
> +The least significant half of the product is discarded.  This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.
> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> +This may be represented in RTL using an @code{uadd_highpart} RTX
> +expression.
> +
>  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-extend
> them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
> d171940ec4222f3 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> @code{smul_highpart} returns the high part  of a signed multiplication,
> @code{umul_highpart} returns the high part  of an unsigned multiplication.
> 
> +@findex sadd_highpart
> +@findex uadd_highpart
> +@cindex high-part addition
> +@cindex addition high part
> +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
> +@code{sadd_highpart} returns the high part of a signed addition,
> +@code{uadd_highpart} returns the high part of an unsigned addition.
> +
>  @findex fma
>  @cindex fused multiply-add
>  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git a/gcc/doc/tm.texi
> b/gcc/doc/tm.texi index
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
> 17e6b0d62ab077e 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook
> to handle these two  implementation approaches itself.
>  @end deftypefn
> 
> -@deftypefn {Target Hook} bool
> TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx
> *@var{output}, rtx @var{in0}, rtx @var{in1}) -This hook is used to test
> whether the target has a special method of -division of vectors of type
> @var{vectype} using the value @var{constant}, -and producing a vector of
> type @var{vectype}.  The division -will then not be decomposed by the
> vectorizer and kept as a div.
> -
> -When the hook is being used to test whether the target supports a special -
> divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook -
> is being used to emit a division, @var{in0} and @var{in1} are the source -
> vectors of type @var{vecttype} and @var{output} is the destination vector of
> -type @var{vectype}.
> -
> -Return true if the operation is possible, emitting instructions for it -if rtxes
> are provided and updating @var{output}.
> -@end deftypefn
> -
>  @deftypefn {Target Hook} tree
> TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})  This hook
> should return the decl of a function that implements the  vectorized variant
> of the function with the @code{combined_fn} code diff --git
> a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
> a3abccd1c293c7b 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy
> can generate better code.
> 
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> 
> -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> -
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> 
>  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> diff --git a/gcc/explow.cc b/gcc/explow.cc index
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
> befa016eea4573c 100644
> --- a/gcc/explow.cc
> +++ b/gcc/explow.cc
> @@ -1037,7 +1037,7 @@ round_push (rtx size)
>       TRUNC_DIV_EXPR.  */
>    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size,
> align_rtx,
> +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>  			NULL_RTX, 1);
>    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> 
> @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> required_align)
>  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>  				       Pmode),
>  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> target,
> +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>  			  gen_int_mode (required_align / BITS_PER_UNIT,
>  					Pmode),
>  			  NULL_RTX, 1);
> diff --git a/gcc/expmed.h b/gcc/expmed.h index
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
> 0941628068f3901 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift (enum
> tree_code, machine_mode, rtx, int, rtx,
>  			       int);
>  #ifdef GCC_OPTABS_H
> -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
> tree,
> -			  rtx, rtx, rtx, int,
> -			  enum optab_methods = OPTAB_LIB_WIDEN);
> +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> +			  rtx, int, enum optab_methods =
> OPTAB_LIB_WIDEN);
>  #endif
>  #endif
> 
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
> a59c169d3b7692f 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx
> op0, HOST_WIDE_INT d)
> 
>  rtx
>  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> mode,
> -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> -	       int unsignedp, enum optab_methods methods)
> +	       rtx op0, rtx op1, rtx target, int unsignedp,
> +	       enum optab_methods methods)
>  {
>    machine_mode compute_mode;
>    rtx tquotient;
> @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code
> code, machine_mode mode,
> 
>    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
> 
> -  /* Check if the target has specific expansions for the division.  */
> -  tree cst;
> -  if (treeop0
> -      && treeop1
> -      && (cst = uniform_integer_cst_p (treeop1))
> -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> (treeop0),
> -						     wi::to_wide (cst),
> -						     &target, op0, op1))
> -    return target;
> -
> -
>    /* Now convert to the best mode to use.  */
>    if (compute_mode != mode)
>      {
> @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code
> code, machine_mode mode,
>  			    || (optab_handler (sdivmod_optab, int_mode)
>  				!= CODE_FOR_nothing)))
>  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> -						int_mode, treeop0, treeop1,
> -						op0, gen_int_mode (abs_d,
> +						int_mode, op0,
> +						gen_int_mode (abs_d,
>  							      int_mode),
>  						NULL_RTX, 0);
>  		    else
> @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code
> code, machine_mode mode,
>  				      size - 1, NULL_RTX, 0);
>  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>  				    NULL_RTX);
> -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> treeop0,
> -				    treeop1, t3, op1, NULL_RTX, 0);
> +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> op1,
> +				    NULL_RTX, 0);
>  		if (t4)
>  		  {
>  		    rtx t5;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
> 2280c6e277f26d72 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>  	    return expand_divmod (0,
>  				  FLOAT_MODE_P (GET_MODE (value))
>  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> -				  GET_MODE (value), NULL, NULL, op1, op2,
> -				  target, 0);
> +				  GET_MODE (value), op1, op2, target, 0);
>  	case MOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> NULL, NULL,
> -				op1, op2, target, 0);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> op1, op2,
> +				target, 0);
>  	case UDIV:
> -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> op1, op2,
> +				target, 1);
>  	case UMOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> op1, op2,
> +				target, 1);
>  	case ASHIFTRT:
>  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>  				      target, 0, OPTAB_LIB_WIDEN);
> @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code,
> machine_mode mode, tree treeop0,
>        bool speed_p = optimize_insn_for_speed_p ();
>        do_pending_stack_adjust ();
>        start_sequence ();
> -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 1);
> +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target,
> + 1);
>        rtx_insn *uns_insns = get_insns ();
>        end_sequence ();
>        start_sequence ();
> -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 0);
> +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target,
> + 0);
>        rtx_insn *sgn_insns = get_insns ();
>        end_sequence ();
>        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@ -9198,8 +9195,7
> @@ expand_expr_divmod (tree_code code, machine_mode mode, tree
> treeop0,
>        emit_insn (sgn_insns);
>        return sgn_ret;
>      }
> -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -			op0, op1, target, unsignedp);
> +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> + unsignedp);
>  }
> 
>  rtx
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
> 3b8a734baa800f 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> ECF_CONST | ECF_NOTHROW, first,
> 
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW,
> first,
>  			      smul_highpart, umul_highpart, binary)
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW,
> first,
> +			      sadd_highpart, uadd_highpart, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> ECF_NOTHROW, first,
>  			      smulhs, umulhs, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> ECF_NOTHROW, first, diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
> e77082c1e617b 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode,
> rtx op0, rtx op1, bool unsignedp)
>  		return NULL_RTX;
>  	    }
>  	}
> -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> NULL, NULL,
> -				     sum, gen_int_mode (INTVAL (op1),
> -							word_mode),
> +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> sum,
> +				     gen_int_mode (INTVAL (op1),
> word_mode),
>  				     NULL_RTX, 1, OPTAB_DIRECT);
>        if (remainder == NULL_RTX)
>  	return NULL_RTX;
> @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
> mode, rtx op0, rtx op1, rtx *rem,
> 
>    if (op11 != const1_rtx)
>      {
> -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL,
> quot1,
> -				op11, NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
> 
> @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
> mode, rtx op0, rtx op1, rtx *rem,
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
> 
> -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL,
> quot1,
> -				 op11, NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (quot2 == NULL_RTX)
>  	return NULL_RTX;
> 
> diff --git a/gcc/optabs.def b/gcc/optabs.def index
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
> ccbf6147947351a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> 
>  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> (umul_highpart_optab, "umul$a3_highpart")
> +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> +(uadd_highpart_optab, "uadd$a3_highpart")
> 
>  OPTAB_D (cmpmem_optab, "cmpmem$a")
>  OPTAB_D (cmpstr_optab, "cmpstr$a")
> diff --git a/gcc/target.def b/gcc/target.def index
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
> 81afa2c2baa64a5 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>  	const vec_perm_indices &sel),
>   NULL)
> 
> -DEFHOOK
> -(can_special_div_by_const,
> - "This hook is used to test whether the target has a special method of\n\ -
> division of vectors of type @var{vectype} using the value @var{constant},\n\
> -and producing a vector of type @var{vectype}.  The division\n\ -will then
> not be decomposed by the vectorizer and kept as a div.\n\ -\n\ -When the
> hook is being used to test whether the target supports a special\n\ -divide,
> @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\ -is
> being used to emit a division, @var{in0} and @var{in1} are the source\n\ -
> vectors of type @var{vecttype} and @var{output} is the destination vector
> of\n\ -type @var{vectype}.\n\ -\n\ -Return true if the operation is possible,
> emitting instructions for it\n\ -if rtxes are provided and updating
> @var{output}.",
> - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> -	rtx in0, rtx in1),
> - default_can_special_div_by_const)
> -
>  /* Return true if the target supports misaligned store/load of a
>     specific factor denoted in the third parameter.  The last parameter
>     is true if the access is defined in a packed struct.  */ diff --git a/gcc/target.h
> b/gcc/target.h index
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
> 9f913158c2d47b1 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -51,7 +51,6 @@
>  #include "insn-codes.h"
>  #include "tm.h"
>  #include "hard-reg-set.h"
> -#include "tree-core.h"
> 
>  #if CHECKING_P
> 
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
> 9317a31390f0c2 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
> (addr_space_t, location_t);  extern rtx default_addr_space_convert (rtx,
> tree, tree);  extern unsigned int default_case_values_threshold (void);
> extern bool default_have_conditional_execution (void); -extern bool
> default_can_special_div_by_const (enum tree_code, tree, wide_int,
> -					      rtx *, rtx, rtx);
> 
>  extern bool default_libc_has_function (enum function_class, tree);  extern
> bool default_libc_has_fast_function (int fcode); diff --git a/gcc/targhooks.cc
> b/gcc/targhooks.cc index
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
> 03877337a931e7 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>    return HAVE_conditional_execution;
>  }
> 
> -/* Default that no division by constant operations are special.  */ -bool -
> default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *,
> rtx,
> -				  rtx)
> -{
> -  return false;
> -}
> -
>  /* By default we assume that c99 functions are present at the runtime,
>     but sincos is not.  */
>  bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
> a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V foo
> +(V v, unsigned short i) {
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff },
> +0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> +detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
> 4d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> +restrict pixel, TYPE level, int n) {
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff; }
> +
> +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> +restrict pixel, TYPE level, int n) {
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff; }
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" {
> +target aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077d
> c3e970bed75ef6 100644
> --- a/gcc/tree-vect-generic.cc
> +++ b/gcc/tree-vect-generic.cc
> @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator
> *gsi, tree type, tree compute_type
>  	  tree rhs2 = gimple_assign_rhs2 (assign);
>  	  tree ret;
> 
> -	  /* Check if the target was going to handle it through the special
> -	     division callback hook.  */
> -	  tree cst = uniform_integer_cst_p (rhs2);
> -	  if (cst &&
> -	      targetm.vectorize.can_special_div_by_const (code, type,
> -							  wi::to_wide (cst),
> -							  NULL,
> -							  NULL_RTX,
> NULL_RTX))
> -	    return NULL_TREE;
> -
> -
>  	  if (!optimize
>  	      || !VECTOR_INTEGER_TYPE_P (type)
>  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git a/gcc/tree-vect-
> patterns.cc b/gcc/tree-vect-patterns.cc index
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> 69de2afea139d6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>        return pattern_stmt;
>      }
>    else if ((cst = uniform_integer_cst_p (oprnd1))
> -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> vectype,
> -							  wi::to_wide (cst),
> -							  NULL, NULL_RTX,
> -							  NULL_RTX))
> +	   && TYPE_UNSIGNED (itype)
> +	   && rhs_code == TRUNC_DIV_EXPR
> +	   && vectype
> +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> +					      OPTIMIZE_FOR_SPEED))
>      {
> -      return NULL;
> +      /* div optimizations using narrowings
> +       we can do the division e.g. shorts by 255 faster by calculating it as
> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +       double the precision of x.
> +
> +       If we imagine a short as being composed of two blocks of bytes then
> +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +       adding 1 to each sub component:
> +
> +	    short value of 16-bits
> +       ┌──────────────┬────────────────┐
> +       │              │                │
> +       └──────────────┴────────────────┘
> +	 8-bit part1 ▲  8-bit part2   ▲
> +		     │                │
> +		     │                │
> +		    +1               +1
> +
> +       after the first addition, we have to shift right by 8, and narrow the
> +       results back to a byte.  Remember that the addition must be done in
> +       double the precision of the input.  However if we know that the addition
> +       `x + 257` does not overflow then we can do the operation in the current
> +       precision.  In which case we don't need the pack and unpacks.  */
> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))
> +	{
> +	  wide_int min,max;
> +	  /* If we're in a pattern we need to find the orginal definition.  */
> +	  tree op0 = oprnd0;
> +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> +	  if (is_pattern_stmt_p (stmt_info))
> +	    {
> +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> +	    }
> +
> +	  /* Check that no overflow will occur.  If we don't have range
> +	     information we can't perform the optimization.  */
> +	  if (vect_get_range_info (op0, &min, &max))
> +	    {
> +	      wide_int one = wi::to_wide (build_one_cst (itype));
> +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> +	      wi::overflow_type ovf;
> +	      /* We need adder and max in the same precision.  */
> +	      wide_int zadder
> +		= wide_int_storage::from (adder, wi::get_precision (max),
> +					  UNSIGNED);
> +	      wi::add (max, zadder, UNSIGNED, &ovf);
> +	      if (ovf == wi::OVF_NONE)
> +		{
> +		  *type_out = vectype;
> +		  tree tadder = wide_int_to_tree (itype, adder);
> +		  gcall *patt1
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> tadder);
> +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (patt1, lhs);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> vectype);
> +
> +		  pattern_stmt
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (pattern_stmt, lhs);
> +
> +		  return pattern_stmt;
> +		}
> +	    }
> +	}
>      }
> 
>    if (prec > HOST_BITS_PER_WIDE_INT
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b95
> 64fc4e066e50081 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>  	}
>        target_support_p = (optab_handler (optab, vec_mode)
>  			  != CODE_FOR_nothing);
> -      tree cst;
> -      if (!target_support_p
> -	  && op1
> -	  && (cst = uniform_integer_cst_p (op1)))
> -	target_support_p
> -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> -							wi::to_wide (cst),
> -							NULL, NULL_RTX,
> -							NULL_RTX);
>      }
> 
>    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
> 
> 
> 
> 
> --
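
As a rough C sketch of the range condition used in the vect_recog_divmod_pattern
hunk quoted above (the helper name and its parameters are illustrative
assumptions, not GCC's internal API): for element precision PREC and a divisor
of (1 << (PREC / 2)) - 1, the rewrite may stay in the original precision only
when the first addition cannot wrap.

#include <stdbool.h>
#include <stdint.h>

/* Return true if RANGE_MAX (the upper bound of the dividend's value range)
   guarantees that x + ((1 << (prec / 2)) + 1) cannot overflow PREC bits;
   the second addition then cannot overflow either.  */
static bool
addh_in_same_precision_ok (uint64_t range_max, unsigned prec)
{
  unsigned pow = prec / 2;
  uint64_t adder = (UINT64_C (1) << pow) + 1;   /* e.g. 0x101 for PREC == 16 */
  uint64_t limit = prec == 64 ? UINT64_MAX : (UINT64_C (1) << prec) - 1;
  return range_max <= limit - adder;
}

For example, addh_in_same_precision_ok (0xfefe, 16) holds while
addh_in_same_precision_ok (0xfeff, 16) does not, matching the wi::add
overflow test in the quoted hunk.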

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
@ 2023-02-10 10:35   ` Tamar Christina
  2023-02-10 14:10   ` Richard Sandiford
  1 sibling, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 10:35 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov,
	Richard Sandiford

Oops, I realized I forgot to fill in the test results; there were no issues 😊

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Tamar
> Christina via Gcc-patches
> Sent: Thursday, February 9, 2023 5:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab
> instead of target hook [PR108583]
> 
> Hi All,
> 
> This replaces the custom division hook with just an implementation through
> add_highpart.  For NEON we implement the add highpart (Addition +
> extraction of the upper highpart of the register in the same precision) as ADD
> + LSR.
> 
> This representation allows us to easily optimize the sequence using existing
> sequences. This gets us a pretty decent sequence using SRA:
> 
>         umull   v1.8h, v0.8b, v3.8b
>         umull2  v0.8h, v0.16b, v3.16b
>         add     v5.8h, v1.8h, v2.8h
>         add     v4.8h, v0.8h, v2.8h
>         usra    v1.8h, v5.8h, 8
>         usra    v0.8h, v4.8h, 8
>         uzp2    v1.16b, v1.16b, v0.16b
> 
> To get the most optimal sequence however we match (a + ((b + c) >> n))
> where n is half the precision of the mode of the operation into addhn +
> uaddw which is a general good optimization on its own and gets us back to:
> 
> .L4:
>         ldr     q0, [x3]
>         umull   v1.8h, v0.8b, v5.8b
>         umull2  v0.8h, v0.16b, v5.16b
>         addhn   v3.8b, v1.8h, v4.8h
>         addhn   v2.8b, v0.8h, v4.8h
>         uaddw   v1.8h, v1.8h, v3.8b
>         uaddw   v0.8h, v0.8h, v2.8b
>         uzp2    v1.16b, v1.16b, v0.16b
>         str     q1, [x3], 16
>         cmp     x3, x4
>         bne     .L4
> 
> For SVE2 we optimize the initial sequence to the same ADD + LSR which gets
> us:
> 
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         add     z1.h, z0.h, z3.h
>         usra    z0.h, z1.h, #8
>         lsr     z0.h, z0.h, #8
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> .L1:
>         ret
> 
> and to get the most optimal sequence I match (a + b) >> n (same constraint
> on n) to addhnb which gets us to:
> 
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         addhnb  z1.b, z0.h, z3.h
>         addhnb  z0.b, z0.h, z1.h
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> 
> There are multiple RTL representations possible for these optimizations, I did
> not represent them using a zero_extend because we seem very inconsistent
> in this in the backend.  Since they are unspecs we won't match them from
> vector ops anyway. I figured maintainers would prefer this, but my
> maintainer ouija board is still out for repairs :)
> 
> There are no new test as new correctness tests were added to the mid-end
> and the existing codegen tests for this already exist.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR target/108583
> 	* config/aarch64/aarch64-simd.md
> (@aarch64_bitmask_udiv<mode>3): Remove.
> 	(<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
> 	* config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
> 	*bitmask_shift_plus<mode>): New.
> 	(@aarch64_bitmask_udiv<mode>3): Remove.
> 	* config/aarch64/aarch64.cc
> 	(aarch64_vectorize_can_special_div_by_constant): Removed.
> 	* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
> 	UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce
> 68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand
> "aarch64_<sur><addsub>hn2<mode>"
>    }
>  )
> 
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as -;; (x +
> ((x + 257) >> 8)) >> 8 assuming the operation is done in -;; double the
> precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then -;;
> adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to -;;
> adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the -;;
> results back to a byte.  Remember that the addition must be done in -;;
> double the precision of the input.  Since 8 is half the size of a short -;; we can
> use a narrowing halfing instruction in AArch64, addhn which also -;; does the
> addition in a wider precision and narrows back to a byte.  The -;; shift itself is
> implicit in the operation as it writes back only the top -;; half of the result. i.e.
> bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for -;;
> the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimization
> +for ;; narrowing represented as shifts and so this representation will
> +allow us to ;; further optimize this should the result require
> +narrowing. The alternative ;; representation of ADDHN + UXTL is less
> +efficient and harder to futher ;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:VQN 0 "register_operand")
> +	(unspec:VQN [(match_operand:VQN 1 "register_operand")
> +		     (match_operand:VQN 2 "register_operand")]
> +		    ADD_HIGHPART))]
> +  "TARGET_SIMD"
> +{
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result,
> +shift_vector));
> +  DONE;
> +})
> +
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the
> +vector (define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +	(plus:VQN
> +	  (lshiftrt:VQN
> +	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +		      (match_operand:VQN 2 "register_operand" "w"))
> +	    (match_operand:VQN 3
> "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +	  (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val,
> <VNARROWQ2>mode));
> -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> bitsize);
> -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2,
> shift_vector));
> +  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);  emit_insn
> + (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
> emit_insn
> + (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
>    DONE;
> -})
> +}
> +  [(set_attr "type" "neon_add_halve<q>")]
> +)
> 
>  ;; pmul.
> 
> diff --git a/gcc/config/aarch64/aarch64-sve2.md
> b/gcc/config/aarch64/aarch64-sve2.md
> index
> 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c165
> 58a3c4d856813b 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
>  ;; ---- [INT] Misc optab implementations  ;; -----------------------------------------
> --------------------------------
>  ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - add_highpart
>  ;; -------------------------------------------------------------------------
> 
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as -;; (x +
> ((x + 257) >> 8)) >> 8 assuming the operation is done in -;; double the
> precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimizations
> +;; for narrowing represented as shifts and so this representation will
> +;; allow us to further optimize this should the result require narrowing.
> +;; The alternative representation of ADDHN + UXTL is less efficient and
> +;; harder to further optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
> +	(unspec:SVE_FULL_HSDI
> +	  [(match_operand:SVE_FULL_HSDI 1 "register_operand")
> +	   (match_operand:SVE_FULL_HSDI 2 "register_operand")]
> +	  ADD_HIGHPART))]
>    "TARGET_SVE2"
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
> +  DONE;
> +})
> 
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +	(unspec:SVE_FULL_HSDI [
> +	    (match_operand:<VPRED> 1 "register_operand" "Upl")
> +	    (lshiftrt:SVE_FULL_HSDI
> +	      (plus:SVE_FULL_HSDI
> +		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +	      (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +        ] UNSPEC_PRED_X))]
> +  "TARGET_SVE2"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
> +{
>    rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val,
> -					  <VNARROW>mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
> -			      addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
> -			      lowpart_subreg (<MODE>mode, tmp1,
> -					      <VNARROW>mode)));
> +  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1,
> +			      operands[2], operands[3]));
>    emit_move_insn (operands[0],
> -		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> +		  lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
>    DONE;
>  })
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
> 
>    return ret;
>  }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -					       tree vectype, wide_int cst,
> -					       rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
>  /* Generate a byte permute mask for a register of mode MODE,
>     which has NUNITS units.  */
> 
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -750,6 +750,8 @@ (define_c_enum "unspec"
>      UNSPEC_REVH		; Used in aarch64-sve.md.
>      UNSPEC_REVW		; Used in aarch64-sve.md.
>      UNSPEC_REVBHW	; Used in aarch64-sve.md.
> +    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
> +    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_FMLA		; Used in aarch64-sve.md.
> @@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
> 
>  (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
> 
> +(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
>  (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
> 
>  (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
> @@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
>  		     (UNSPEC_UNPACKUHI "u")
>  		     (UNSPEC_UNPACKSLO "s")
>  		     (UNSPEC_UNPACKULO "u")
> +		     (UNSPEC_SADD_HIGHPART "s")
> +		     (UNSPEC_UADD_HIGHPART "u")
>  		     (UNSPEC_SMUL_HIGHPART "s")
>  		     (UNSPEC_UMUL_HIGHPART "u")
>  		     (UNSPEC_COND_FCVTZS "s")
> 
> 
> 
> 
> --

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
  2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
  2023-02-10 10:34 ` [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
@ 2023-02-10 13:13 ` Richard Biener
  2023-02-10 13:36 ` Richard Sandiford
  3 siblings, 0 replies; 47+ messages in thread
From: Richard Biener @ 2023-02-10 13:13 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 27246 bytes --]

On Thu, 9 Feb 2023, Tamar Christina wrote:

> Hi All,
> 
> As discussed in the ticket, this replaces the approach for optimizing the
> div by bitmask operation from a hook into optabs implemented through
> add_highpart.
> 
> In order to be able to use this we need to check whether the current precision
> has enough bits to do the operation without any of the additions overflowing.
> 
> We use range information to determine this and only do the operation if we're
> sure an overflow won't occur.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR target/108583
> 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> 	* doc/tm.texi.in: Likewise.
> 	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
> 	* expmed.cc (expand_divmod): Likewise.
> 	* expmed.h (expand_divmod): Likewise.
> 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> 	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> 	* internal-fn.def (ADDH): New.
> 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> 	* doc/md.texi: Document them.
> 	* doc/rtl.texi: Likewise.
> 	* target.def (can_special_div_by_const): Remove.
> 	* target.h: Remove tree-core.h include
> 	* targhooks.cc (default_can_special_div_by_const): Remove.
> 	* targhooks.h (default_can_special_div_by_const): Remove.
> 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
> 	implement new optab recognition based on range.
> 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/108583
> 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
>  Similar, but the multiplication is unsigned.  This may be represented
>  in RTL using an @code{umul_highpart} RTX expression.
>  
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> +@item @samp{smul@var{m}3_highpart}
> +Perform a signed addition of operands 1 and 2, which have mode
> +@var{m}, and store the most significant half of the product in operand 0.

of the sum

> +The least significant half of the product is discarded.  This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.

likewise.

> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> +@item @samp{uadd@var{m}3_highpart}
> +Similar, but the addition is unsigned.  This may be represented
> +in RTL using an @code{uadd_highpart} RTX expression.
> +

is the highpart of the result sign- (for sadd) or zero- (for uadd)
extended to the full precision of the result mode? "store the most
significant half ... in operand 0" leaves that underspecified I think
(likewise for the mul_highpart pattern docs you copied this from).

Otherwise looks good to me.  Review would have been easier with
splitting the revert from the new implementation ...

Thanks,
Richard.

>  @cindex @code{madd@var{m}@var{n}4} instruction pattern
>  @item @samp{madd@var{m}@var{n}4}
>  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
>  of a signed multiplication, @code{umul_highpart} returns the high part
>  of an unsigned multiplication.
>  
> +@findex sadd_highpart
> +@findex uadd_highpart
> +@cindex high-part addition
> +@cindex addition high part
> +@item (sadd_highpart:@var{m} @var{x} @var{y})
> +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> +Represents the high-part addition of @var{x} and @var{y} carried
> +out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
> +of a signed addition, @code{uadd_highpart} returns the high part
> +of an unsigned addition.
> +
>  @findex fma
>  @cindex fused multiply-add
>  @item (fma:@var{m} @var{x} @var{y} @var{z})
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
>  implementation approaches itself.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> -This hook is used to test whether the target has a special method of
> -division of vectors of type @var{vectype} using the value @var{constant},
> -and producing a vector of type @var{vectype}.  The division
> -will then not be decomposed by the vectorizer and kept as a div.
> -
> -When the hook is being used to test whether the target supports a special
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
> -is being used to emit a division, @var{in0} and @var{in1} are the source
> -vectors of type @var{vecttype} and @var{output} is the destination vector of
> -type @var{vectype}.
> -
> -Return true if the operation is possible, emitting instructions for it
> -if rtxes are provided and updating @var{output}.
> -@end deftypefn
> -
>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
>  This hook should return the decl of a function that implements the
>  vectorized variant of the function with the @code{combined_fn} code
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>  
> -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> -
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>  
>  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> diff --git a/gcc/explow.cc b/gcc/explow.cc
> index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> --- a/gcc/explow.cc
> +++ b/gcc/explow.cc
> @@ -1037,7 +1037,7 @@ round_push (rtx size)
>       TRUNC_DIV_EXPR.  */
>    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>  			NULL_RTX, 1);
>    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>  
> @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
>  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>  				       Pmode),
>  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>  			  gen_int_mode (required_align / BITS_PER_UNIT,
>  					Pmode),
>  			  NULL_RTX, 1);
> diff --git a/gcc/expmed.h b/gcc/expmed.h
> index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
>  extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
>  			       int);
>  #ifdef GCC_OPTABS_H
> -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> -			  rtx, rtx, rtx, int,
> -			  enum optab_methods = OPTAB_LIB_WIDEN);
> +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> +			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
>  #endif
>  #endif
>  
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
>  
>  rtx
>  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> -	       int unsignedp, enum optab_methods methods)
> +	       rtx op0, rtx op1, rtx target, int unsignedp,
> +	       enum optab_methods methods)
>  {
>    machine_mode compute_mode;
>    rtx tquotient;
> @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  
>    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>  
> -  /* Check if the target has specific expansions for the division.  */
> -  tree cst;
> -  if (treeop0
> -      && treeop1
> -      && (cst = uniform_integer_cst_p (treeop1))
> -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> -						     wi::to_wide (cst),
> -						     &target, op0, op1))
> -    return target;
> -
> -
>    /* Now convert to the best mode to use.  */
>    if (compute_mode != mode)
>      {
> @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  			    || (optab_handler (sdivmod_optab, int_mode)
>  				!= CODE_FOR_nothing)))
>  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> -						int_mode, treeop0, treeop1,
> -						op0, gen_int_mode (abs_d,
> +						int_mode, op0,
> +						gen_int_mode (abs_d,
>  							      int_mode),
>  						NULL_RTX, 0);
>  		    else
> @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  				      size - 1, NULL_RTX, 0);
>  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>  				    NULL_RTX);
> -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> -				    treeop1, t3, op1, NULL_RTX, 0);
> +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> +				    NULL_RTX, 0);
>  		if (t4)
>  		  {
>  		    rtx t5;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>  	    return expand_divmod (0,
>  				  FLOAT_MODE_P (GET_MODE (value))
>  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> -				  GET_MODE (value), NULL, NULL, op1, op2,
> -				  target, 0);
> +				  GET_MODE (value), op1, op2, target, 0);
>  	case MOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 0);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 0);
>  	case UDIV:
> -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case UMOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case ASHIFTRT:
>  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>  				      target, 0, OPTAB_LIB_WIDEN);
> @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        bool speed_p = optimize_insn_for_speed_p ();
>        do_pending_stack_adjust ();
>        start_sequence ();
> -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 1);
> +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
>        rtx_insn *uns_insns = get_insns ();
>        end_sequence ();
>        start_sequence ();
> -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 0);
> +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
>        rtx_insn *sgn_insns = get_insns ();
>        end_sequence ();
>        unsigned uns_cost = seq_cost (uns_insns, speed_p);
> @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        emit_insn (sgn_insns);
>        return sgn_ret;
>      }
> -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -			op0, op1, target, unsignedp);
> +  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
>  }
>  
>  rtx
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
>  
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
>  			      smul_highpart, umul_highpart, binary)
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
> +			      sadd_highpart, uadd_highpart, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
>  			      smulhs, umulhs, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
>  		return NULL_RTX;
>  	    }
>  	}
> -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> -				     sum, gen_int_mode (INTVAL (op1),
> -							word_mode),
> +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> +				     gen_int_mode (INTVAL (op1), word_mode),
>  				     NULL_RTX, 1, OPTAB_DIRECT);
>        if (remainder == NULL_RTX)
>  	return NULL_RTX;
> @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>  
>    if (op11 != const1_rtx)
>      {
> -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> -				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> -				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (quot2 == NULL_RTX)
>  	return NULL_RTX;
>  
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>  
>  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
>  OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
> +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> +OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
>  
>  OPTAB_D (cmpmem_optab, "cmpmem$a")
>  OPTAB_D (cmpstr_optab, "cmpstr$a")
> diff --git a/gcc/target.def b/gcc/target.def
> index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>  	const vec_perm_indices &sel),
>   NULL)
>  
> -DEFHOOK
> -(can_special_div_by_const,
> - "This hook is used to test whether the target has a special method of\n\
> -division of vectors of type @var{vectype} using the value @var{constant},\n\
> -and producing a vector of type @var{vectype}.  The division\n\
> -will then not be decomposed by the vectorizer and kept as a div.\n\
> -\n\
> -When the hook is being used to test whether the target supports a special\n\
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
> -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> -type @var{vectype}.\n\
> -\n\
> -Return true if the operation is possible, emitting instructions for it\n\
> -if rtxes are provided and updating @var{output}.",
> - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> -	rtx in0, rtx in1),
> - default_can_special_div_by_const)
> -
>  /* Return true if the target supports misaligned store/load of a
>     specific factor denoted in the third parameter.  The last parameter
>     is true if the access is defined in a packed struct.  */
> diff --git a/gcc/target.h b/gcc/target.h
> index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -51,7 +51,6 @@
>  #include "insn-codes.h"
>  #include "tm.h"
>  #include "hard-reg-set.h"
> -#include "tree-core.h"
>  
>  #if CHECKING_P
>  
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
>  extern rtx default_addr_space_convert (rtx, tree, tree);
>  extern unsigned int default_case_values_threshold (void);
>  extern bool default_have_conditional_execution (void);
> -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> -					      rtx *, rtx, rtx);
>  
>  extern bool default_libc_has_function (enum function_class, tree);
>  extern bool default_libc_has_fast_function (int fcode);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>    return HAVE_conditional_execution;
>  }
>  
> -/* Default that no division by constant operations are special.  */
> -bool
> -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> -				  rtx)
> -{
> -  return false;
> -}
> -
>  /* By default we assume that c99 functions are present at the runtime,
>     but sincos is not.  */
>  bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t 
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> --- a/gcc/tree-vect-generic.cc
> +++ b/gcc/tree-vect-generic.cc
> @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
>  	  tree rhs2 = gimple_assign_rhs2 (assign);
>  	  tree ret;
>  
> -	  /* Check if the target was going to handle it through the special
> -	     division callback hook.  */
> -	  tree cst = uniform_integer_cst_p (rhs2);
> -	  if (cst &&
> -	      targetm.vectorize.can_special_div_by_const (code, type,
> -							  wi::to_wide (cst),
> -							  NULL,
> -							  NULL_RTX, NULL_RTX))
> -	    return NULL_TREE;
> -
> -
>  	  if (!optimize
>  	      || !VECTOR_INTEGER_TYPE_P (type)
>  	      || TREE_CODE (rhs2) != VECTOR_CST
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>        return pattern_stmt;
>      }
>    else if ((cst = uniform_integer_cst_p (oprnd1))
> -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> -							  wi::to_wide (cst),
> -							  NULL, NULL_RTX,
> -							  NULL_RTX))
> +	   && TYPE_UNSIGNED (itype)
> +	   && rhs_code == TRUNC_DIV_EXPR
> +	   && vectype
> +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> +					      OPTIMIZE_FOR_SPEED))
>      {
> -      return NULL;
> +      /* div optimizations using narrowings
> +       we can do the division e.g. shorts by 255 faster by calculating it as
> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +       double the precision of x.
> +
> +       If we imagine a short as being composed of two blocks of bytes then
> +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +       adding 1 to each sub component:
> +
> +	    short value of 16-bits
> +       ┌──────────────┬────────────────┐
> +       │              │                │
> +       └──────────────┴────────────────┘
> +	 8-bit part1 ▲  8-bit part2   ▲
> +		     │                │
> +		     │                │
> +		    +1               +1
> +
> +       after the first addition, we have to shift right by 8, and narrow the
> +       results back to a byte.  Remember that the addition must be done in
> +       double the precision of the input.  However if we know that the addition
> +       `x + 257` does not overflow then we can do the operation in the current
> +       precision.  In which case we don't need the pack and unpacks.  */
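
As a purely illustrative aside (not part of the patch): the identity the
comment just quoted describes can be checked exhaustively for 16-bit values
with a small standalone program, doing the additions in 32 bits so they
cannot overflow:

  /* Illustrative sketch only: verify x / 0xff == (x + ((x + 257) >> 8)) >> 8
     for all 16-bit x, with the arithmetic done in wider (32-bit) precision.  */
  #include <assert.h>
  #include <stdint.h>

  int main (void)
  {
    for (uint32_t x = 0; x <= 0xffff; x++)
      assert (((x + ((x + 257) >> 8)) >> 8) == x / 0xff);
    return 0;
  }
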
> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))
> +	{
> +	  wide_int min,max;
> +	  /* If we're in a pattern we need to find the original definition.  */
> +	  tree op0 = oprnd0;
> +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> +	  if (is_pattern_stmt_p (stmt_info))
> +	    {
> +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> +	    }
> +
> +	  /* Check that no overflow will occur.  If we don't have range
> +	     information we can't perform the optimization.  */
> +	  if (vect_get_range_info (op0, &min, &max))
> +	    {
> +	      wide_int one = wi::to_wide (build_one_cst (itype));
> +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> +	      wi::overflow_type ovf;
> +	      /* We need adder and max in the same precision.  */
> +	      wide_int zadder
> +		= wide_int_storage::from (adder, wi::get_precision (max),
> +					  UNSIGNED);
> +	      wi::add (max, zadder, UNSIGNED, &ovf);
> +	      if (ovf == wi::OVF_NONE)
> +		{
> +		  *type_out = vectype;
> +		  tree tadder = wide_int_to_tree (itype, adder);
> +		  gcall *patt1
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (patt1, lhs);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  pattern_stmt
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (pattern_stmt, lhs);
> +
> +		  return pattern_stmt;
> +		}
> +	    }
> +	}
>      }
>  
>    if (prec > HOST_BITS_PER_WIDE_INT
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>  	}
>        target_support_p = (optab_handler (optab, vec_mode)
>  			  != CODE_FOR_nothing);
> -      tree cst;
> -      if (!target_support_p
> -	  && op1
> -	  && (cst = uniform_integer_cst_p (op1)))
> -	target_support_p
> -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> -							wi::to_wide (cst),
> -							NULL, NULL_RTX,
> -							NULL_RTX);
>      }
>  
>    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
> 
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
                   ` (2 preceding siblings ...)
  2023-02-10 13:13 ` Richard Biener
@ 2023-02-10 13:36 ` Richard Sandiford
  2023-02-10 13:52   ` Richard Biener
  2023-02-10 14:13   ` Tamar Christina
  3 siblings, 2 replies; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 13:36 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches; +Cc: Tamar Christina, nd, rguenther, jlaw

I think I'm misunderstanding, but: it seems like we're treating the
add highpart optabs as companions to the mul highpart optabs.  But AIUI,
the add highpart optab is used such that, for an N-bit mode, we do
an N-bit addition followed by a shift by N/2.  Is that right?
The mul highpart optabs instead do a 2N-bit multiplication followed
by a shift by N.

Apart from consistency, the reason this matters is: I'm not sure what we
gain by adding the optab rather than simply open-coding the addition and
the shift directly into the vector pattern.  It seems like the AArch64
expander in 2/2 does just do an ordinary N-bit addition followed by an
ordinary shift by N/2.
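
To make the contrast concrete, a scalar sketch (mine, with uint16_t standing
in for an N = 16 element; not from either patch):

  /* Illustrative models only.  */
  #include <stdint.h>

  /* mul highpart: product formed in 2N bits, then shifted right by N.  */
  static uint16_t umul16_highpart (uint16_t a, uint16_t b)
  {
    return (uint16_t) (((uint32_t) a * (uint32_t) b) >> 16);
  }

  /* add highpart as the expander uses it: the sum stays in N bits and is
     shifted right by N/2.  */
  static uint16_t uadd16_highpart (uint16_t a, uint16_t b)
  {
    return (uint16_t) ((uint16_t) (a + b) >> 8);
  }

So the "highpart" in the add case is relative to half the element width,
rather than to a doubled intermediate width as in the mul case.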

Some comments in addition to Richard's:

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi All,
>
> As discussed in the ticket, this replaces the approach for optimizing the
> div by bitmask operation from a hook into optabs implemented through
> add_highpart.
>
> In order to be able to use this we need to check whether the current precision
> has enough bits to do the operation without any of the additions overflowing.
>
> We use range information to determine this and only do the operation if we're
> sure an overflow won't occur.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/108583
> 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> 	* doc/tm.texi.in: Likewise.
> 	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
> 	* expmed.cc (expand_divmod): Likewise.
> 	* expmed.h (expand_divmod): Likewise.
> 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> 	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> 	* internal-fn.def (ADDH): New.
> 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> 	* doc/md.texi: Document them.
> 	* doc/rtl.texi: Likewise.
> 	* target.def (can_special_div_by_const): Remove.
> 	* target.h: Remove tree-core.h include
> 	* targhooks.cc (default_can_special_div_by_const): Remove.
> 	* targhooks.h (default_can_special_div_by_const): Remove.
> 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
> 	implement new optab recognition based on range.
> 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/108583
> 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>
> --- inline copy of patch -- 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
>  Similar, but the multiplication is unsigned.  This may be represented
>  in RTL using an @code{umul_highpart} RTX expression.
>  
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> +@item @samp{smul@var{m}3_highpart}

sadd

> +Perform a signed addition of operands 1 and 2, which have mode
> +@var{m}, and store the most significant half of the product in operand 0.
> +The least significant half of the product is discarded.  This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.
> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> +@item @samp{uadd@var{m}3_highpart}
> +Similar, but the addition is unsigned.  This may be represented
> +in RTL using an @code{uadd_highpart} RTX expression.
> +
>  @cindex @code{madd@var{m}@var{n}4} instruction pattern
>  @item @samp{madd@var{m}@var{n}4}
>  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
>  of a signed multiplication, @code{umul_highpart} returns the high part
>  of an unsigned multiplication.
>  
> +@findex sadd_highpart
> +@findex uadd_highpart
> +@cindex high-part addition
> +@cindex addition high part
> +@item (sadd_highpart:@var{m} @var{x} @var{y})
> +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> +Represents the high-part addition of @var{x} and @var{y} carried
> +out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
> +of a signed addition, @code{uadd_highpart} returns the high part
> +of an unsigned addition.

The patch doesn't add these RTL codes though.

> +
>  @findex fma
>  @cindex fused multiply-add
>  @item (fma:@var{m} @var{x} @var{y} @var{z})
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
>  implementation approaches itself.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> -This hook is used to test whether the target has a special method of
> -division of vectors of type @var{vectype} using the value @var{constant},
> -and producing a vector of type @var{vectype}.  The division
> -will then not be decomposed by the vectorizer and kept as a div.
> -
> -When the hook is being used to test whether the target supports a special
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
> -is being used to emit a division, @var{in0} and @var{in1} are the source
> -vectors of type @var{vecttype} and @var{output} is the destination vector of
> -type @var{vectype}.
> -
> -Return true if the operation is possible, emitting instructions for it
> -if rtxes are provided and updating @var{output}.
> -@end deftypefn
> -
>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
>  This hook should return the decl of a function that implements the
>  vectorized variant of the function with the @code{combined_fn} code
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>  
> -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> -
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>  
>  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> diff --git a/gcc/explow.cc b/gcc/explow.cc
> index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> --- a/gcc/explow.cc
> +++ b/gcc/explow.cc
> @@ -1037,7 +1037,7 @@ round_push (rtx size)
>       TRUNC_DIV_EXPR.  */
>    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>  			NULL_RTX, 1);
>    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>  
> @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
>  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>  				       Pmode),
>  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>  			  gen_int_mode (required_align / BITS_PER_UNIT,
>  					Pmode),
>  			  NULL_RTX, 1);
> diff --git a/gcc/expmed.h b/gcc/expmed.h
> index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
>  extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
>  			       int);
>  #ifdef GCC_OPTABS_H
> -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> -			  rtx, rtx, rtx, int,
> -			  enum optab_methods = OPTAB_LIB_WIDEN);
> +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> +			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
>  #endif
>  #endif
>  
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
>  
>  rtx
>  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> -	       int unsignedp, enum optab_methods methods)
> +	       rtx op0, rtx op1, rtx target, int unsignedp,
> +	       enum optab_methods methods)
>  {
>    machine_mode compute_mode;
>    rtx tquotient;
> @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  
>    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>  
> -  /* Check if the target has specific expansions for the division.  */
> -  tree cst;
> -  if (treeop0
> -      && treeop1
> -      && (cst = uniform_integer_cst_p (treeop1))
> -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> -						     wi::to_wide (cst),
> -						     &target, op0, op1))
> -    return target;
> -
> -
>    /* Now convert to the best mode to use.  */
>    if (compute_mode != mode)
>      {
> @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  			    || (optab_handler (sdivmod_optab, int_mode)
>  				!= CODE_FOR_nothing)))
>  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> -						int_mode, treeop0, treeop1,
> -						op0, gen_int_mode (abs_d,
> +						int_mode, op0,
> +						gen_int_mode (abs_d,
>  							      int_mode),
>  						NULL_RTX, 0);
>  		    else
> @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  				      size - 1, NULL_RTX, 0);
>  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>  				    NULL_RTX);
> -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> -				    treeop1, t3, op1, NULL_RTX, 0);
> +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> +				    NULL_RTX, 0);
>  		if (t4)
>  		  {
>  		    rtx t5;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>  	    return expand_divmod (0,
>  				  FLOAT_MODE_P (GET_MODE (value))
>  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> -				  GET_MODE (value), NULL, NULL, op1, op2,
> -				  target, 0);
> +				  GET_MODE (value), op1, op2, target, 0);
>  	case MOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 0);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 0);
>  	case UDIV:
> -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case UMOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case ASHIFTRT:
>  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>  				      target, 0, OPTAB_LIB_WIDEN);
> @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        bool speed_p = optimize_insn_for_speed_p ();
>        do_pending_stack_adjust ();
>        start_sequence ();
> -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 1);
> +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
>        rtx_insn *uns_insns = get_insns ();
>        end_sequence ();
>        start_sequence ();
> -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 0);
> +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
>        rtx_insn *sgn_insns = get_insns ();
>        end_sequence ();
>        unsigned uns_cost = seq_cost (uns_insns, speed_p);
> @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        emit_insn (sgn_insns);
>        return sgn_ret;
>      }
> -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -			op0, op1, target, unsignedp);
> +  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
>  }
>  
>  rtx
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
>  
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
>  			      smul_highpart, umul_highpart, binary)
> +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
> +			      sadd_highpart, uadd_highpart, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
>  			      smulhs, umulhs, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
>  		return NULL_RTX;
>  	    }
>  	}
> -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> -				     sum, gen_int_mode (INTVAL (op1),
> -							word_mode),
> +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> +				     gen_int_mode (INTVAL (op1), word_mode),
>  				     NULL_RTX, 1, OPTAB_DIRECT);
>        if (remainder == NULL_RTX)
>  	return NULL_RTX;
> @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>  
>    if (op11 != const1_rtx)
>      {
> -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> -				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> -				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (quot2 == NULL_RTX)
>  	return NULL_RTX;
>  
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>  
>  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
>  OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
> +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> +OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
>  
>  OPTAB_D (cmpmem_optab, "cmpmem$a")
>  OPTAB_D (cmpstr_optab, "cmpstr$a")
> diff --git a/gcc/target.def b/gcc/target.def
> index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>  	const vec_perm_indices &sel),
>   NULL)
>  
> -DEFHOOK
> -(can_special_div_by_const,
> - "This hook is used to test whether the target has a special method of\n\
> -division of vectors of type @var{vectype} using the value @var{constant},\n\
> -and producing a vector of type @var{vectype}.  The division\n\
> -will then not be decomposed by the vectorizer and kept as a div.\n\
> -\n\
> -When the hook is being used to test whether the target supports a special\n\
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
> -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> -type @var{vectype}.\n\
> -\n\
> -Return true if the operation is possible, emitting instructions for it\n\
> -if rtxes are provided and updating @var{output}.",
> - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> -	rtx in0, rtx in1),
> - default_can_special_div_by_const)
> -
>  /* Return true if the target supports misaligned store/load of a
>     specific factor denoted in the third parameter.  The last parameter
>     is true if the access is defined in a packed struct.  */
> diff --git a/gcc/target.h b/gcc/target.h
> index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -51,7 +51,6 @@
>  #include "insn-codes.h"
>  #include "tm.h"
>  #include "hard-reg-set.h"
> -#include "tree-core.h"
>  
>  #if CHECKING_P
>  
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
>  extern rtx default_addr_space_convert (rtx, tree, tree);
>  extern unsigned int default_case_values_threshold (void);
>  extern bool default_have_conditional_execution (void);
> -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> -					      rtx *, rtx, rtx);
>  
>  extern bool default_libc_has_function (enum function_class, tree);
>  extern bool default_libc_has_fast_function (int fcode);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>    return HAVE_conditional_execution;
>  }
>  
> -/* Default that no division by constant operations are special.  */
> -bool
> -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> -				  rtx)
> -{
> -  return false;
> -}
> -
>  /* By default we assume that c99 functions are present at the runtime,
>     but sincos is not.  */
>  bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t 
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> --- a/gcc/tree-vect-generic.cc
> +++ b/gcc/tree-vect-generic.cc
> @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
>  	  tree rhs2 = gimple_assign_rhs2 (assign);
>  	  tree ret;
>  
> -	  /* Check if the target was going to handle it through the special
> -	     division callback hook.  */
> -	  tree cst = uniform_integer_cst_p (rhs2);
> -	  if (cst &&
> -	      targetm.vectorize.can_special_div_by_const (code, type,
> -							  wi::to_wide (cst),
> -							  NULL,
> -							  NULL_RTX, NULL_RTX))
> -	    return NULL_TREE;
> -
> -
>  	  if (!optimize
>  	      || !VECTOR_INTEGER_TYPE_P (type)
>  	      || TREE_CODE (rhs2) != VECTOR_CST
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>        return pattern_stmt;
>      }
>    else if ((cst = uniform_integer_cst_p (oprnd1))
> -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> -							  wi::to_wide (cst),
> -							  NULL, NULL_RTX,
> -							  NULL_RTX))
> +	   && TYPE_UNSIGNED (itype)
> +	   && rhs_code == TRUNC_DIV_EXPR
> +	   && vectype
> +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> +					      OPTIMIZE_FOR_SPEED))
>      {
> -      return NULL;
> +      /* div optimizations using narrowings
> +       we can do the division e.g. shorts by 255 faster by calculating it as
> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +       double the precision of x.
> +
> +       If we imagine a short as being composed of two blocks of bytes then
> +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +       adding 1 to each sub component:
> +
> +	    short value of 16-bits
> +       ┌──────────────┬────────────────┐
> +       │              │                │
> +       └──────────────┴────────────────┘
> +	 8-bit part1 ▲  8-bit part2   ▲
> +		     │                │
> +		     │                │
> +		    +1               +1
> +
> +       after the first addition, we have to shift right by 8, and narrow the
> +       results back to a byte.  Remember that the addition must be done in
> +       double the precision of the input.  However if we know that the addition
> +       `x + 257` does not overflow then we can do the operation in the current
> +       precision.  In which case we don't need the pack and unpacks.  */
> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))
> +	{
> +	  wide_int min,max;
> +	  /* If we're in a pattern we need to find the original definition.  */
> +	  tree op0 = oprnd0;
> +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> +	  if (is_pattern_stmt_p (stmt_info))
> +	    {
> +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> +	    }

If this is generally safe (I'm skipping thinking about it in the
interests of a quick review :-)), then I think it should be done in
vect_get_range_info instead.  Using gimple_get_lhs would be more
general than handling just assignments.
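
Something like this completely untested sketch is what I mean; the extra
vinfo parameter is an assumption on my part, and I've left the existing
range query itself untouched:

  /* Sketch only: look through pattern statements before doing the range
     query, so that every caller gets the range of the underlying value.
     The VINFO parameter is an assumed signature change.  */
  static bool
  vect_get_range_info (vec_info *vinfo, tree var, wide_int *min, wide_int *max)
  {
    if (TREE_CODE (var) == SSA_NAME)
      if (stmt_vec_info def_info
	    = vinfo->lookup_stmt (SSA_NAME_DEF_STMT (var)))
	if (is_pattern_stmt_p (def_info))
	  {
	    gimple *orig
	      = STMT_VINFO_STMT (STMT_VINFO_RELATED_STMT (def_info));
	    if (tree lhs = gimple_get_lhs (orig))
	      var = lhs;
	  }
    /* ... existing body, querying the ranger for VAR ... */
  }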

> +
> +	  /* Check that no overflow will occur.  If we don't have range
> +	     information we can't perform the optimization.  */
> +	  if (vect_get_range_info (op0, &min, &max))
> +	    {
> +	      wide_int one = wi::to_wide (build_one_cst (itype));
> +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> +	      wi::overflow_type ovf;
> +	      /* We need adder and max in the same precision.  */
> +	      wide_int zadder
> +		= wide_int_storage::from (adder, wi::get_precision (max),
> +					  UNSIGNED);
> +	      wi::add (max, zadder, UNSIGNED, &ovf);

Could you explain this a bit more?  When do we have mismatched precisions?

Thanks,
Richard

> +	      if (ovf == wi::OVF_NONE)
> +		{
> +		  *type_out = vectype;
> +		  tree tadder = wide_int_to_tree (itype, adder);
> +		  gcall *patt1
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (patt1, lhs);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  pattern_stmt
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (pattern_stmt, lhs);
> +
> +		  return pattern_stmt;
> +		}
> +	    }
> +	}
>      }
>  
>    if (prec > HOST_BITS_PER_WIDE_INT
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>  	}
>        target_support_p = (optab_handler (optab, vec_mode)
>  			  != CODE_FOR_nothing);
> -      tree cst;
> -      if (!target_support_p
> -	  && op1
> -	  && (cst = uniform_integer_cst_p (op1)))
> -	target_support_p
> -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> -							wi::to_wide (cst),
> -							NULL, NULL_RTX,
> -							NULL_RTX);
>      }
>  
>    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 13:36 ` Richard Sandiford
@ 2023-02-10 13:52   ` Richard Biener
  2023-02-10 14:13   ` Tamar Christina
  1 sibling, 0 replies; 47+ messages in thread
From: Richard Biener @ 2023-02-10 13:52 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, Tamar Christina, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 29489 bytes --]

On Fri, 10 Feb 2023, Richard Sandiford wrote:

> I think I'm misunderstanding, but: it seems like we're treating the
> add highpart optabs as companions to the mul highpart optabs.  But AIUI,
> the add highpart optab is used such that, for an N-bit mode, we do
> an N-bit addition followed by a shift by N/2.  Is that right?
> The mul highpart optabs instead do an 2N-bit multiplication followed
> by a shift by N.

That also confused me - and the docs add to the confusion more than they
clear it up ... I agree we should be consistent in the semantics
for add_highpart and mul_highpart.
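
To spell out how I read the two, in scalar terms (illustrative C only; the
function names are made up and this is not what either patch emits):

  #include <stdint.h>

  /* mul_highpart: 2N-bit multiply, then shift by N.  */
  static uint16_t
  umulh16 (uint16_t a, uint16_t b)
  {
    return ((uint32_t) a * b) >> 16;
  }

  /* add_highpart as patch 1/2 uses it: N-bit (wrapping) addition,
     then shift by N/2.  */
  static uint16_t
  uaddh16_as_used (uint16_t a, uint16_t b)
  {
    return (uint16_t) (a + b) >> 8;
  }

  /* What the name suggests by analogy with mul_highpart: 2N-bit
     addition, then shift by N, i.e. just the carry.  */
  static uint16_t
  uaddh16_by_analogy (uint16_t a, uint16_t b)
  {
    return ((uint32_t) a + b) >> 16;
  }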

> Apart from consistency, the reason this matters is: I'm not sure what we
> gain by adding the optab rather than simply open-coding the addition and
> the shift directly into the vector pattern.  It seems like the AArch64
> expander in 2/2 does just do an ordinary N-bit addition followed by an
> ordinary shift by N/2.
> 
> Some comments in addition to Richard's:
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi All,
> >
> > As discussed in the ticket, this replaces the approach for optimizing the
> > div by bitmask operation from a hook into optabs implemented through
> > add_highpart.
> >
> > In order to be able to use this we need to check whether the current precision
> > has enough bits to do the operation without any of the additions overflowing.
> >
> > We use range information to determine this and only do the operation if we're
> > sure an overflow won't occur.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR target/108583
> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> > 	* doc/tm.texi.in: Likewise.
> > 	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
> > 	* expmed.cc (expand_divmod): Likewise.
> > 	* expmed.h (expand_divmod): Likewise.
> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> > 	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> > 	* internal-fn.def (ADDH): New.
> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> > 	* doc/md.texi: Document them.
> > 	* doc/rtl.texi: Likewise.
> > 	* target.def (can_special_div_by_const): Remove.
> > 	* target.h: Remove tree-core.h include
> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
> > 	implement new optab recognition based on range.
> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR target/108583
> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >
> > --- inline copy of patch -- 
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
> >  Similar, but the multiplication is unsigned.  This may be represented
> >  in RTL using an @code{umul_highpart} RTX expression.
> >  
> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> > +@item @samp{smul@var{m}3_highpart}
> 
> sadd
> 
> > +Perform a signed addition of operands 1 and 2, which have mode
> > +@var{m}, and store the most significant half of the product in operand 0.
> > +The least significant half of the product is discarded.  This may be
> > +represented in RTL using a @code{sadd_highpart} RTX expression.
> > +
> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> > +@item @samp{uadd@var{m}3_highpart}
> > +Similar, but the addition is unsigned.  This may be represented
> > +in RTL using an @code{uadd_highpart} RTX expression.
> > +
> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern
> >  @item @samp{madd@var{m}@var{n}4}
> >  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> > diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> > index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> > --- a/gcc/doc/rtl.texi
> > +++ b/gcc/doc/rtl.texi
> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
> >  of a signed multiplication, @code{umul_highpart} returns the high part
> >  of an unsigned multiplication.
> >  
> > +@findex sadd_highpart
> > +@findex uadd_highpart
> > +@cindex high-part addition
> > +@cindex addition high part
> > +@item (sadd_highpart:@var{m} @var{x} @var{y})
> > +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> > +Represents the high-part addition of @var{x} and @var{y} carried
> > +out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
> > +of a signed addition, @code{uadd_highpart} returns the high part
> > +of an unsigned addition.
> 
> The patch doesn't add these RTL codes though.
> 
> > +
> >  @findex fma
> >  @cindex fused multiply-add
> >  @item (fma:@var{m} @var{x} @var{y} @var{z})
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
> >  implementation approaches itself.
> >  @end deftypefn
> >  
> > -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> > -This hook is used to test whether the target has a special method of
> > -division of vectors of type @var{vectype} using the value @var{constant},
> > -and producing a vector of type @var{vectype}.  The division
> > -will then not be decomposed by the vectorizer and kept as a div.
> > -
> > -When the hook is being used to test whether the target supports a special
> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
> > -is being used to emit a division, @var{in0} and @var{in1} are the source
> > -vectors of type @var{vecttype} and @var{output} is the destination vector of
> > -type @var{vectype}.
> > -
> > -Return true if the operation is possible, emitting instructions for it
> > -if rtxes are provided and updating @var{output}.
> > -@end deftypefn
> > -
> >  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> >  This hook should return the decl of a function that implements the
> >  vectorized variant of the function with the @code{combined_fn} code
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
> >  
> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >  
> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> > -
> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >  
> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> > diff --git a/gcc/explow.cc b/gcc/explow.cc
> > index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> > --- a/gcc/explow.cc
> > +++ b/gcc/explow.cc
> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >       TRUNC_DIV_EXPR.  */
> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >  			NULL_RTX, 1);
> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >  
> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >  				       Pmode),
> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >  					Pmode),
> >  			  NULL_RTX, 1);
> > diff --git a/gcc/expmed.h b/gcc/expmed.h
> > index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> > --- a/gcc/expmed.h
> > +++ b/gcc/expmed.h
> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
> >  extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
> >  			       int);
> >  #ifdef GCC_OPTABS_H
> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> > -			  rtx, rtx, rtx, int,
> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> > +			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
> >  #endif
> >  #endif
> >  
> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> > index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> > --- a/gcc/expmed.cc
> > +++ b/gcc/expmed.cc
> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
> >  
> >  rtx
> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> > -	       int unsignedp, enum optab_methods methods)
> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> > +	       enum optab_methods methods)
> >  {
> >    machine_mode compute_mode;
> >    rtx tquotient;
> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  
> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
> >  
> > -  /* Check if the target has specific expansions for the division.  */
> > -  tree cst;
> > -  if (treeop0
> > -      && treeop1
> > -      && (cst = uniform_integer_cst_p (treeop1))
> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> > -						     wi::to_wide (cst),
> > -						     &target, op0, op1))
> > -    return target;
> > -
> > -
> >    /* Now convert to the best mode to use.  */
> >    if (compute_mode != mode)
> >      {
> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >  				!= CODE_FOR_nothing)))
> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> > -						int_mode, treeop0, treeop1,
> > -						op0, gen_int_mode (abs_d,
> > +						int_mode, op0,
> > +						gen_int_mode (abs_d,
> >  							      int_mode),
> >  						NULL_RTX, 0);
> >  		    else
> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  				      size - 1, NULL_RTX, 0);
> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >  				    NULL_RTX);
> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> > -				    treeop1, t3, op1, NULL_RTX, 0);
> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> > +				    NULL_RTX, 0);
> >  		if (t4)
> >  		  {
> >  		    rtx t5;
> > diff --git a/gcc/expr.cc b/gcc/expr.cc
> > index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> > --- a/gcc/expr.cc
> > +++ b/gcc/expr.cc
> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >  	    return expand_divmod (0,
> >  				  FLOAT_MODE_P (GET_MODE (value))
> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> > -				  target, 0);
> > +				  GET_MODE (value), op1, op2, target, 0);
> >  	case MOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 0);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> > +				target, 0);
> >  	case UDIV:
> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> > +				target, 1);
> >  	case UMOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> > +				target, 1);
> >  	case ASHIFTRT:
> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >  				      target, 0, OPTAB_LIB_WIDEN);
> > @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> >        bool speed_p = optimize_insn_for_speed_p ();
> >        do_pending_stack_adjust ();
> >        start_sequence ();
> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 1);
> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
> >        rtx_insn *uns_insns = get_insns ();
> >        end_sequence ();
> >        start_sequence ();
> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 0);
> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
> >        rtx_insn *sgn_insns = get_insns ();
> >        end_sequence ();
> >        unsigned uns_cost = seq_cost (uns_insns, speed_p);
> > @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> >        emit_insn (sgn_insns);
> >        return sgn_ret;
> >      }
> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -			op0, op1, target, unsignedp);
> > +  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
> >  }
> >  
> >  rtx
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
> >  
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
> >  			      smul_highpart, umul_highpart, binary)
> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
> > +			      sadd_highpart, uadd_highpart, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
> >  			      smulhs, umulhs, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> > index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> > --- a/gcc/optabs.cc
> > +++ b/gcc/optabs.cc
> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
> >  		return NULL_RTX;
> >  	    }
> >  	}
> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> > -				     sum, gen_int_mode (INTVAL (op1),
> > -							word_mode),
> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> > +				     gen_int_mode (INTVAL (op1), word_mode),
> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >        if (remainder == NULL_RTX)
> >  	return NULL_RTX;
> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
> >  
> >    if (op11 != const1_rtx)
> >      {
> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> > -				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> > -				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (quot2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >  
> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
> >  OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> > +OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
> >  
> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
> > diff --git a/gcc/target.def b/gcc/target.def
> > index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >  	const vec_perm_indices &sel),
> >   NULL)
> >  
> > -DEFHOOK
> > -(can_special_div_by_const,
> > - "This hook is used to test whether the target has a special method of\n\
> > -division of vectors of type @var{vectype} using the value @var{constant},\n\
> > -and producing a vector of type @var{vectype}.  The division\n\
> > -will then not be decomposed by the vectorizer and kept as a div.\n\
> > -\n\
> > -When the hook is being used to test whether the target supports a special\n\
> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
> > -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> > -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> > -type @var{vectype}.\n\
> > -\n\
> > -Return true if the operation is possible, emitting instructions for it\n\
> > -if rtxes are provided and updating @var{output}.",
> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> > -	rtx in0, rtx in1),
> > - default_can_special_div_by_const)
> > -
> >  /* Return true if the target supports misaligned store/load of a
> >     specific factor denoted in the third parameter.  The last parameter
> >     is true if the access is defined in a packed struct.  */
> > diff --git a/gcc/target.h b/gcc/target.h
> > index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> > --- a/gcc/target.h
> > +++ b/gcc/target.h
> > @@ -51,7 +51,6 @@
> >  #include "insn-codes.h"
> >  #include "tm.h"
> >  #include "hard-reg-set.h"
> > -#include "tree-core.h"
> >  
> >  #if CHECKING_P
> >  
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
> >  extern rtx default_addr_space_convert (rtx, tree, tree);
> >  extern unsigned int default_case_values_threshold (void);
> >  extern bool default_have_conditional_execution (void);
> > -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> > -					      rtx *, rtx, rtx);
> >  
> >  extern bool default_libc_has_function (enum function_class, tree);
> >  extern bool default_libc_has_fast_function (int fcode);
> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >    return HAVE_conditional_execution;
> >  }
> >  
> > -/* Default that no division by constant operations are special.  */
> > -bool
> > -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> > -				  rtx)
> > -{
> > -  return false;
> > -}
> > -
> >  /* By default we assume that c99 functions are present at the runtime,
> >     but sincos is not.  */
> >  bool
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include "tree-vect.h"
> > +
> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> > +
> > +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> > +foo (V v, unsigned short i)
> > +{
> > +  v /= i;
> > +  return v;
> > +}
> > +
> > +int
> > +main (void)
> > +{
> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> > +    if (v[i] != 0x00010001)
> > +      __builtin_abort ();
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +#define TYPE uint8_t 
> > +
> > +#ifndef DEBUG
> > +#define DEBUG 0
> > +#endif
> > +
> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > +
> > +
> > +__attribute__((noipa, noinline, optimize("O1")))
> > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> > +{
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff;
> > +}
> > +
> > +__attribute__((noipa, noinline, optimize("O3")))
> > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> > +{
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff;
> > +}
> > +
> > +int main ()
> > +{
> > +  TYPE a[N];
> > +  TYPE b[N];
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE + i * 13;
> > +      b[i] = BASE + i * 13;
> > +      if (DEBUG)
> > +        printf ("%d: 0x%x\n", i, a[i]);
> > +    }
> > +
> > +  fun1 (a, N / 2, N);
> > +  fun2 (b, N / 2, N);
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      if (DEBUG)
> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > +
> > +      if (a[i] != b[i])
> > +        __builtin_abort ();
> > +    }
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> > index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> > --- a/gcc/tree-vect-generic.cc
> > +++ b/gcc/tree-vect-generic.cc
> > @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >  	  tree ret;
> >  
> > -	  /* Check if the target was going to handle it through the special
> > -	     division callback hook.  */
> > -	  tree cst = uniform_integer_cst_p (rhs2);
> > -	  if (cst &&
> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> > -							  wi::to_wide (cst),
> > -							  NULL,
> > -							  NULL_RTX, NULL_RTX))
> > -	    return NULL_TREE;
> > -
> > -
> >  	  if (!optimize
> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >  	      || TREE_CODE (rhs2) != VECTOR_CST
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> >        return pattern_stmt;
> >      }
> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> > -							  wi::to_wide (cst),
> > -							  NULL, NULL_RTX,
> > -							  NULL_RTX))
> > +	   && TYPE_UNSIGNED (itype)
> > +	   && rhs_code == TRUNC_DIV_EXPR
> > +	   && vectype
> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> > +					      OPTIMIZE_FOR_SPEED))
> >      {
> > -      return NULL;
> > +      /* div optimizations using narrowings
> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> > +       double the precision of x.
> > +
> > +       If we imagine a short as being composed of two blocks of bytes then
> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> > +       adding 1 to each sub component:
> > +
> > +	    short value of 16-bits
> > +       ┌──────────────┬────────────────┐
> > +       │              │                │
> > +       └──────────────┴────────────────┘
> > +	 8-bit part1 ▲  8-bit part2   ▲
> > +		     │                │
> > +		     │                │
> > +		    +1               +1
> > +
> > +       after the first addition, we have to shift right by 8, and narrow the
> > +       results back to a byte.  Remember that the addition must be done in
> > +       double the precision of the input.  However if we know that the addition
> > +       `x + 257` does not overflow then we can do the operation in the current
> > +       precision.  In which case we don't need the pack and unpacks.  */
> > +      auto wcst = wi::to_wide (cst);
> > +      int pow = wi::exact_log2 (wcst + 1);
> > +      if (pow == (int) (element_precision (vectype) / 2))
> > +	{
> > +	  wide_int min,max;
> > +	  /* If we're in a pattern we need to find the original definition.  */
> > +	  tree op0 = oprnd0;
> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> > +	  if (is_pattern_stmt_p (stmt_info))
> > +	    {
> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> > +	    }
> 
> If this is generally safe (I'm skipping thinking about it in the
> interests of a quick review :-)), then I think it should be done in
> vect_get_range_info instead.  Using gimple_get_lhs would be more
> general than handling just assignments.
> 
> > +
> > +	  /* Check that no overflow will occur.  If we don't have range
> > +	     information we can't perform the optimization.  */
> > +	  if (vect_get_range_info (op0, &min, &max))
> > +	    {
> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> > +	      wi::overflow_type ovf;
> > +	      /* We need adder and max in the same precision.  */
> > +	      wide_int zadder
> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> > +					  UNSIGNED);
> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> 
> Could you explain this a bit more?  When do we have mismatched precisions?
> 
> Thanks,
> Richard
> 
> > +	      if (ovf == wi::OVF_NONE)
> > +		{
> > +		  *type_out = vectype;
> > +		  tree tadder = wide_int_to_tree (itype, adder);
> > +		  gcall *patt1
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (patt1, lhs);
> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> > +
> > +		  pattern_stmt
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> > +
> > +		  return pattern_stmt;
> > +		}
> > +	    }
> > +	}
> >      }
> >  
> >    if (prec > HOST_BITS_PER_WIDE_INT
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >  	}
> >        target_support_p = (optab_handler (optab, vec_mode)
> >  			  != CODE_FOR_nothing);
> > -      tree cst;
> > -      if (!target_support_p
> > -	  && op1
> > -	  && (cst = uniform_integer_cst_p (op1)))
> > -	target_support_p
> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> > -							wi::to_wide (cst),
> > -							NULL, NULL_RTX,
> > -							NULL_RTX);
> >      }
> >  
> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
  2023-02-10 10:35   ` Tamar Christina
@ 2023-02-10 14:10   ` Richard Sandiford
  1 sibling, 0 replies; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 14:10 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

I was asking in the 1/2 review whether we need the optab, but that
decision doesn't affect the other patterns, so:

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This replaces the custom division hook with just an implementation through
> add_highpart.  For NEON we implement the add highpart (Addition + extraction of
> the upper highpart of the register in the same precision) as ADD + LSR.
>
> This representation allows us to easily optimize the sequence using existing
> sequences. This gets us a pretty decent sequence using SRA:
>
>         umull   v1.8h, v0.8b, v3.8b
>         umull2  v0.8h, v0.16b, v3.16b
>         add     v5.8h, v1.8h, v2.8h
>         add     v4.8h, v0.8h, v2.8h
>         usra    v1.8h, v5.8h, 8
>         usra    v0.8h, v4.8h, 8
>         uzp2    v1.16b, v1.16b, v0.16b
>
> To get the most optimal sequence however we match (a + ((b + c) >> n)) where n
> is half the precision of the mode of the operation into addhn + uaddw which is
> a general good optimization on its own and gets us back to:
>
> .L4:
>         ldr     q0, [x3]
>         umull   v1.8h, v0.8b, v5.8b
>         umull2  v0.8h, v0.16b, v5.16b
>         addhn   v3.8b, v1.8h, v4.8h
>         addhn   v2.8b, v0.8h, v4.8h
>         uaddw   v1.8h, v1.8h, v3.8b
>         uaddw   v0.8h, v0.8h, v2.8b
>         uzp2    v1.16b, v1.16b, v0.16b
>         str     q1, [x3], 16
>         cmp     x3, x4
>         bne     .L4
>
> For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         add     z1.h, z0.h, z3.h
>         usra    z0.h, z1.h, #8
>         lsr     z0.h, z0.h, #8
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> .L1:
>         ret
>
> and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
> to addhnb which gets us to:
>
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         addhnb  z1.b, z0.h, z3.h
>         addhnb  z0.b, z0.h, z1.h
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
>
> There are multiple RTL representations possible for these optimizations, I did
> not represent them using a zero_extend because we seem very inconsistent in this
> in the backend.  Since they are unspecs we won't match them from vector ops
> anyway. I figured maintainers would prefer this, but my maintainer ouija board
> is still out for repairs :)

I agree this is the best approach as things stand.  Personally, I'd like
to have some way for the target to define simplification rules based on
unspecs, so that unspecs act more like target-specific rtl codes.  But I
know others disagree, and it wouldn't really apply to this case anyway.

> There are no new tests as new correctness tests were added to the mid-end and
> the existing codegen tests for this already exist.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/108583
> 	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
> 	(<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
> 	* config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
> 	*bitmask_shift_plus<mode>): New.
> 	(@aarch64_bitmask_udiv<mode>3): Remove.
> 	* config/aarch64/aarch64.cc
> 	(aarch64_vectorize_can_special_div_by_constant): Removed.
> 	* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
> 	UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
>    }
>  )
>  
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing. The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:VQN 0 "register_operand")
> +	(unspec:VQN [(match_operand:VQN 1 "register_operand")
> +		     (match_operand:VQN 2 "register_operand")]
> +		    ADD_HIGHPART))]
> +  "TARGET_SIMD"
> +{
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
> +  DONE;
> +})
> +
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +	(plus:VQN
> +	  (lshiftrt:VQN
> +	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +		      (match_operand:VQN 2 "register_operand" "w"))
> +	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +	  (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& !reload_completed"

This is an ICE trap, since "#" forces a split while "!reload_completed"
prevents one after reload.

I think the theoretically correct way would be to use operand 0 as a
temporary when reload_completed, which in turn means making it an
earlyclobber.

However, IIUC, this pattern would only be formed from combining
three distinct patterns.  Is that right?  If so, we should be able
to handle it as a plain define_split, with no define_insn.
That should make things simpler, so would be worth trying before
the changes I mentioned above.

> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
> -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
> +  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
> +  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
>    DONE;
> -})
> +}
> +  [(set_attr "type" "neon_add_halve<q>")]

I think we should leave this out, since it's a multi-instruction pattern.

> +)
>  
>  ;; pmul.
>  
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
>  ;; ---- [INT] Misc optab implementations
>  ;; -------------------------------------------------------------------------
>  ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - add_highpart
>  ;; -------------------------------------------------------------------------
>  
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
> +;; narrowing represented as shifts and so this representation will allow us to
> +;; further optimize this should the result require narrowing. The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
> +	(unspec:SVE_FULL_HSDI
> +	  [(match_operand:SVE_FULL_HSDI 1 "register_operand")
> +	   (match_operand:SVE_FULL_HSDI 2 "register_operand")]
> +	  ADD_HIGHPART))]
>    "TARGET_SVE2"
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +							shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
> +  DONE;
> +})
>  
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +	(unspec:SVE_FULL_HSDI [
> +	    (match_operand:<VPRED> 1 "register_operand" "Upl")

Looks like this can be:

  (match_operand:<VPRED> 1)

since the predicate isn't used.

> +	    (lshiftrt:SVE_FULL_HSDI
> +	      (plus:SVE_FULL_HSDI
> +		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +	      (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +        ] UNSPEC_PRED_X))]

Very minor nit, but the formatting used in the file follows the style
in the earlier pattern above, with [ immediately before ( and ]
immediately after ).  Not that that's inherently better or anything,
it's just a consistency thing.

> +  "TARGET_SVE2"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
> +{
>    rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
> -			      addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
> -			      lowpart_subreg (<MODE>mode, tmp1,
> -					      <VNARROW>mode)));
> +  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2], operands[3]));
>    emit_move_insn (operands[0],
> -		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> +		  lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
>    DONE;
>  })

Since this is a single instruction, I'm not sure it's worth splitting it.
Perhaps there would be CSE opportunities from having a single form,
but it seems unlikely.  And doing the unsplit form is nice and safe.

But yeah, generating the patterns this way seems like a good approach.
It might even help optimise open-coded versions of the same trick.

Thanks,
Richard


>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
>  
>    return ret;
>  }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -					       tree vectype, wide_int cst,
> -					       rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
>  /* Generate a byte permute mask for a register of mode MODE,
>     which has NUNITS units.  */
>  
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -750,6 +750,8 @@ (define_c_enum "unspec"
>      UNSPEC_REVH		; Used in aarch64-sve.md.
>      UNSPEC_REVW		; Used in aarch64-sve.md.
>      UNSPEC_REVBHW	; Used in aarch64-sve.md.
> +    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
> +    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_FMLA		; Used in aarch64-sve.md.
> @@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
>  
>  (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
>  
> +(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
>  (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
>  
>  (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
> @@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
>  		     (UNSPEC_UNPACKUHI "u")
>  		     (UNSPEC_UNPACKSLO "s")
>  		     (UNSPEC_UNPACKULO "u")
> +		     (UNSPEC_SADD_HIGHPART "s")
> +		     (UNSPEC_UADD_HIGHPART "u")
>  		     (UNSPEC_SMUL_HIGHPART "s")
>  		     (UNSPEC_UMUL_HIGHPART "u")
>  		     (UNSPEC_COND_FCVTZS "s")

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 13:36 ` Richard Sandiford
  2023-02-10 13:52   ` Richard Biener
@ 2023-02-10 14:13   ` Tamar Christina
  2023-02-10 14:30     ` Richard Sandiford
  2023-02-10 15:56     ` Richard Sandiford
  1 sibling, 2 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 14:13 UTC (permalink / raw)
  To: Richard Sandiford, Tamar Christina via Gcc-patches; +Cc: nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 1:36 PM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> I think I'm misunderstanding, but: it seems like we're treating the add
> highpart optabs as companions to the mul highpart optabs.  But AIUI, the add
> highpart optab is used such that, for an N-bit mode, we do an N-bit addition
> followed by a shift by N/2.  Is that right?
> The mul highpart optabs instead do an 2N-bit multiplication followed by a
> shift by N.

Correct.
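
To make that concrete, this is the scalar shape of what the pattern builds,
as a plain C illustration (addh_u16 below just models the proposed semantics
for a 16-bit element; it is not something the patch adds):

  #include <stdint.h>

  /* Proposed uaddh semantics: N-bit (wrapping) add, then shift by N/2.  */
  static uint16_t
  addh_u16 (uint16_t a, uint16_t b)
  {
    return (uint16_t) (a + b) >> 8;
  }

  /* x / 0xff as two highpart additions.  Valid as long as x + 0x101
     cannot wrap, i.e. x <= 0xfefe, which is what the new range check
     in the pattern guarantees.  */
  static uint16_t
  div255 (uint16_t x)
  {
    return addh_u16 (x, addh_u16 (x, 0x101));
  }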

> 
> Apart from consistency, the reason this matters is: I'm not sure what we gain
> by adding the optab rather than simply open-coding the addition and the
> shift directly into the vector pattern.  It seems like the AArch64 expander in
> 2/2 does just do an ordinary N-bit addition followed by an ordinary shift by
> N/2.

I mentioned this in the implementation, but I did it this way because AArch64
has various optimizations on shifts when it comes to truncating results.  I
didn't need to represent it with shifts; in fact the original pattern did not.
But representing it directly as the final instructions is problematic because
those instructions are unspecs, and I would have needed to provide additional
optabs in order to optimize into them.

So the shift representation was more natural for AArch64.  It would not be
for, say, AArch32, which does not already have these optimizations.  SVE has
similar optimizations, and at the very worst you get a usra.

I avoided open-coding it with an add and a shift because that creates a
four-instruction dependency chain (including shifts, which are typically slow)
instead of a load and multiply.  This change, unless the target is known to
optimize it further, is unlikely to be beneficial.  And by the time we get to
costing, the only alternative is to undo the existing pattern, and so you lose
the general shift optimization.

So it seemed unwise to open-code it as shifts, given that the codegen out of
the vectorizer would be degenerate for most targets, or one would need the
more complicated route of costing during pattern matching.
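
Concretely (scalar C illustration again, for a 16-bit element divided by
0xff), the open-coded alternative is the four dependent operations below,
whereas the pattern keeps it as the two ADDH calls sketched above:

  #include <stdint.h>

  /* Open-coded add + shift version: four dependent operations per
     element, two of them shifts.  Like the pattern, this assumes
     x + 0x101 does not wrap.  */
  static uint16_t
  div255_open_coded (uint16_t x)
  {
    uint16_t t1 = x + 0x101;
    uint16_t t2 = t1 >> 8;
    uint16_t t3 = x + t2;
    return t3 >> 8;
  }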

> 
> Some comments in addition to Richard's:
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi All,
> >
> > As discussed in the ticket, this replaces the approach for optimizing
> > the div by bitmask operation from a hook into optabs implemented
> > through add_highpart.
> >
> > In order to be able to use this we need to check whether the current
> > precision has enough bits to do the operation without any of the additions
> overflowing.
> >
> > We use range information to determine this and only do the operation
> > if we're sure an overflow won't occur.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR target/108583
> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> Remove.
> > 	* doc/tm.texi.in: Likewise.
> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
> patch.
> > 	* expmed.cc (expand_divmod): Likewise.
> > 	* expmed.h (expand_divmod): Likewise.
> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> > 	* optabs.cc (expand_doubleword_mod,
> expand_doubleword_divmod): Likewise.
> > 	* internal-fn.def (ADDH): New.
> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> > 	* doc/md.texi: Document them.
> > 	* doc/rtl.texi: Likewise.
> > 	* target.def (can_special_div_by_const): Remove.
> > 	* target.h: Remove tree-core.h include
> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> and
> > 	implement new obtab recognition based on range.
> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR target/108583
> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
> 3
> > 8595e21af35d 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> @code{smul_highpart} RTX expression.
> >  Similar, but the multiplication is unsigned.  This may be represented
> > in RTL using an @code{umul_highpart} RTX expression.
> >
> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> > +@samp{smul@var{m}3_highpart}
> 
> sadd
> 
> > +Perform a signed addition of operands 1 and 2, which have mode
> > +@var{m}, and store the most significant half of the product in operand 0.
> > +The least significant half of the product is discarded.  This may be
> > +represented in RTL using a @code{sadd_highpart} RTX expression.
> > +
> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> > +This may be represented in RTL using an @code{uadd_highpart} RTX
> > +expression.
> > +
> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-extend
> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> > b/gcc/doc/rtl.texi index
> >
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
> d17
> > 1940ec4222f3 100644
> > --- a/gcc/doc/rtl.texi
> > +++ b/gcc/doc/rtl.texi
> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> > @code{smul_highpart} returns the high part  of a signed
> > multiplication, @code{umul_highpart} returns the high part  of an unsigned
> multiplication.
> >
> > +@findex sadd_highpart
> > +@findex uadd_highpart
> > +@cindex high-part addition
> > +@cindex addition high part
> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
> > +@code{sadd_highpart} returns the high part of a signed addition,
> > +@code{uadd_highpart} returns the high part of an unsigned addition.
> 
> The patch doesn't add these RTL codes though.
> 
> > +
> >  @findex fma
> >  @cindex fused multiply-add
> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
> 17e
> > 6b0d62ab077e 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the
> > hook to handle these two  implementation approaches itself.
> >  @end deftypefn
> >
> > -@deftypefn {Target Hook} bool
> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> @var{tree_code}, tree
> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> > target has a special method of -division of vectors of type @var{vectype}
> using the value @var{constant}, -and producing a vector of type
> @var{vectype}.  The division -will then not be decomposed by the vectorizer
> and kept as a div.
> > -
> > -When the hook is being used to test whether the target supports a
> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> > When the hook -is being used to emit a division, @var{in0} and
> > @var{in1} are the source -vectors of type @var{vecttype} and
> > @var{output} is the destination vector of -type @var{vectype}.
> > -
> > -Return true if the operation is possible, emitting instructions for
> > it -if rtxes are provided and updating @var{output}.
> > -@end deftypefn
> > -
> >  @deftypefn {Target Hook} tree
> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> @var{code},
> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
> > return the decl of a function that implements the  vectorized variant
> > of the function with the @code{combined_fn} code diff --git
> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
> a3a
> > bccd1c293c7b 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> strategy can generate better code.
> >
> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >
> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> > -
> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >
> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
> bef
> > a016eea4573c 100644
> > --- a/gcc/explow.cc
> > +++ b/gcc/explow.cc
> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >       TRUNC_DIV_EXPR.  */
> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size,
> > align_rtx,
> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >  			NULL_RTX, 1);
> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >
> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> required_align)
> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >  				       Pmode),
> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> > target,
> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >  					Pmode),
> >  			  NULL_RTX, 1);
> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
> 094
> > 1628068f3901 100644
> > --- a/gcc/expmed.h
> > +++ b/gcc/expmed.h
> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
> (enum tree_code, machine_mode, rtx, int, rtx,
> >  			       int);
> >  #ifdef GCC_OPTABS_H
> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
> tree,
> > -			  rtx, rtx, rtx, int,
> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
> rtx,
> > +			  rtx, int, enum optab_methods =
> OPTAB_LIB_WIDEN);
> >  #endif
> >  #endif
> >
> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
> a59
> > c169d3b7692f 100644
> > --- a/gcc/expmed.cc
> > +++ b/gcc/expmed.cc
> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx
> op0,
> > HOST_WIDE_INT d)
> >
> >  rtx
> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> mode,
> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> > -	       int unsignedp, enum optab_methods methods)
> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> > +	       enum optab_methods methods)
> >  {
> >    machine_mode compute_mode;
> >    rtx tquotient;
> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code
> > code, machine_mode mode,
> >
> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
> >
> > -  /* Check if the target has specific expansions for the division.
> > */
> > -  tree cst;
> > -  if (treeop0
> > -      && treeop1
> > -      && (cst = uniform_integer_cst_p (treeop1))
> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> (treeop0),
> > -						     wi::to_wide (cst),
> > -						     &target, op0, op1))
> > -    return target;
> > -
> > -
> >    /* Now convert to the best mode to use.  */
> >    if (compute_mode != mode)
> >      {
> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code
> code, machine_mode mode,
> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >  				!= CODE_FOR_nothing)))
> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> > -						int_mode, treeop0, treeop1,
> > -						op0, gen_int_mode (abs_d,
> > +						int_mode, op0,
> > +						gen_int_mode (abs_d,
> >  							      int_mode),
> >  						NULL_RTX, 0);
> >  		    else
> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code
> code, machine_mode mode,
> >  				      size - 1, NULL_RTX, 0);
> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >  				    NULL_RTX);
> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> treeop0,
> > -				    treeop1, t3, op1, NULL_RTX, 0);
> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> op1,
> > +				    NULL_RTX, 0);
> >  		if (t4)
> >  		  {
> >  		    rtx t5;
> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
> 2280
> > c6e277f26d72 100644
> > --- a/gcc/expr.cc
> > +++ b/gcc/expr.cc
> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >  	    return expand_divmod (0,
> >  				  FLOAT_MODE_P (GET_MODE (value))
> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> > -				  target, 0);
> > +				  GET_MODE (value), op1, op2, target, 0);
> >  	case MOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> NULL, NULL,
> > -				op1, op2, target, 0);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> op1, op2,
> > +				target, 0);
> >  	case UDIV:
> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> op1, op2,
> > +				target, 1);
> >  	case UMOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> op1, op2,
> > +				target, 1);
> >  	case ASHIFTRT:
> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
> 9170,13 +9169,11 @@
> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> treeop0,
> >        bool speed_p = optimize_insn_for_speed_p ();
> >        do_pending_stack_adjust ();
> >        start_sequence ();
> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 1);
> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> > + target, 1);
> >        rtx_insn *uns_insns = get_insns ();
> >        end_sequence ();
> >        start_sequence ();
> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 0);
> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> > + target, 0);
> >        rtx_insn *sgn_insns = get_insns ();
> >        end_sequence ();
> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@ -9198,8
> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> mode, tree treeop0,
> >        emit_insn (sgn_insns);
> >        return sgn_ret;
> >      }
> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -			op0, op1, target, unsignedp);
> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> > + unsignedp);
> >  }
> >
> >  rtx
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
> 3b
> > 8a734baa800f 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> ECF_CONST
> > | ECF_NOTHROW, first,
> >
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> ECF_NOTHROW, first,
> >  			      smul_highpart, umul_highpart, binary)
> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> ECF_NOTHROW, first,
> > +			      sadd_highpart, uadd_highpart, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> ECF_NOTHROW, first,
> >  			      smulhs, umulhs, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> ECF_NOTHROW, first,
> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
> e
> > 77082c1e617b 100644
> > --- a/gcc/optabs.cc
> > +++ b/gcc/optabs.cc
> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> mode, rtx op0, rtx op1, bool unsignedp)
> >  		return NULL_RTX;
> >  	    }
> >  	}
> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> NULL, NULL,
> > -				     sum, gen_int_mode (INTVAL (op1),
> > -							word_mode),
> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> sum,
> > +				     gen_int_mode (INTVAL (op1),
> word_mode),
> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >        if (remainder == NULL_RTX)
> >  	return NULL_RTX;
> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
> mode, rtx
> > op0, rtx op1, rtx *rem,
> >
> >    if (op11 != const1_rtx)
> >      {
> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL,
> quot1,
> > -				op11, NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> op11,
> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >
> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
> mode, rtx op0, rtx op1, rtx *rem,
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >
> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL,
> quot1,
> > -				 op11, NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (quot2 == NULL_RTX)
> >  	return NULL_RTX;
> >
> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
> ccb
> > f6147947351a 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >
> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> > (umul_highpart_optab, "umul$a3_highpart")
> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >
> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
> > diff --git a/gcc/target.def b/gcc/target.def index
> >
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
> 81a
> > fa2c2baa64a5 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >  	const vec_perm_indices &sel),
> >   NULL)
> >
> > -DEFHOOK
> > -(can_special_div_by_const,
> > - "This hook is used to test whether the target has a special method
> > of\n\ -division of vectors of type @var{vectype} using the value
> > @var{constant},\n\ -and producing a vector of type @var{vectype}.  The
> > division\n\ -will then not be decomposed by the vectorizer and kept as
> > a div.\n\ -\n\ -When the hook is being used to test whether the target
> > supports a special\n\ -divide, @var{in0}, @var{in1}, and @var{output}
> > are all null.  When the hook\n\ -is being used to emit a division,
> > @var{in0} and @var{in1} are the source\n\ -vectors of type
> > @var{vecttype} and @var{output} is the destination vector of\n\ -type
> > @var{vectype}.\n\ -\n\ -Return true if the operation is possible,
> > emitting instructions for it\n\ -if rtxes are provided and updating
> > @var{output}.",
> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> > -	rtx in0, rtx in1),
> > - default_can_special_div_by_const)
> > -
> >  /* Return true if the target supports misaligned store/load of a
> >     specific factor denoted in the third parameter.  The last parameter
> >     is true if the access is defined in a packed struct.  */ diff
> > --git a/gcc/target.h b/gcc/target.h index
> >
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
> 9f9
> > 13158c2d47b1 100644
> > --- a/gcc/target.h
> > +++ b/gcc/target.h
> > @@ -51,7 +51,6 @@
> >  #include "insn-codes.h"
> >  #include "tm.h"
> >  #include "hard-reg-set.h"
> > -#include "tree-core.h"
> >
> >  #if CHECKING_P
> >
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
> 93
> > 17a31390f0c2 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
> > (rtx, tree, tree);  extern unsigned int default_case_values_threshold
> > (void);  extern bool default_have_conditional_execution (void);
> > -extern bool default_can_special_div_by_const (enum tree_code, tree,
> wide_int,
> > -					      rtx *, rtx, rtx);
> >
> >  extern bool default_libc_has_function (enum function_class, tree);
> > extern bool default_libc_has_fast_function (int fcode); diff --git
> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
> 03
> > 877337a931e7 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >    return HAVE_conditional_execution;
> >  }
> >
> > -/* Default that no division by constant operations are special.  */
> > -bool -default_can_special_div_by_const (enum tree_code, tree,
> > wide_int, rtx *, rtx,
> > -				  rtx)
> > -{
> > -  return false;
> > -}
> > -
> >  /* By default we assume that c99 functions are present at the runtime,
> >     but sincos is not.  */
> >  bool
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
> a0
> > 4ea8c1f73e3c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include "tree-vect.h"
> > +
> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> > +
> > +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> > +foo (V v, unsigned short i) {
> > +  v /= i;
> > +  return v;
> > +}
> > +
> > +int
> > +main (void)
> > +{
> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff },
> > +0xffff);
> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> > +    if (v[i] != 0x00010001)
> > +      __builtin_abort ();
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> > +detected" "vect" { target aarch64*-*-* } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
> 4d2
> > a29b933de625
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +#define TYPE uint8_t
> > +
> > +#ifndef DEBUG
> > +#define DEBUG 0
> > +#endif
> > +
> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > +
> > +
> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > +restrict pixel, TYPE level, int n) {
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> > +
> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > +restrict pixel, TYPE level, int n) {
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> > +
> > +int main ()
> > +{
> > +  TYPE a[N];
> > +  TYPE b[N];
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE + i * 13;
> > +      b[i] = BASE + i * 13;
> > +      if (DEBUG)
> > +        printf ("%d: 0x%x\n", i, a[i]);
> > +    }
> > +
> > +  fun1 (a, N / 2, N);
> > +  fun2 (b, N / 2, N);
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      if (DEBUG)
> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > +
> > +      if (a[i] != b[i])
> > +        __builtin_abort ();
> > +    }
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" {
> > +target aarch64*-*-* } } } */
> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index
> >
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077d
> c3
> > e970bed75ef6 100644
> > --- a/gcc/tree-vect-generic.cc
> > +++ b/gcc/tree-vect-generic.cc
> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >  	  tree ret;
> >
> > -	  /* Check if the target was going to handle it through the special
> > -	     division callback hook.  */
> > -	  tree cst = uniform_integer_cst_p (rhs2);
> > -	  if (cst &&
> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> > -							  wi::to_wide (cst),
> > -							  NULL,
> > -							  NULL_RTX,
> NULL_RTX))
> > -	    return NULL_TREE;
> > -
> > -
> >  	  if (!optimize
> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> 69
> > de2afea139d6 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> >        return pattern_stmt;
> >      }
> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> vectype,
> > -							  wi::to_wide (cst),
> > -							  NULL, NULL_RTX,
> > -							  NULL_RTX))
> > +	   && TYPE_UNSIGNED (itype)
> > +	   && rhs_code == TRUNC_DIV_EXPR
> > +	   && vectype
> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> > +					      OPTIMIZE_FOR_SPEED))
> >      {
> > -      return NULL;
> > +      /* div optimizations using narrowings
> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> > +       double the precision of x.
> > +
> > +       If we imagine a short as being composed of two blocks of bytes then
> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> > +       adding 1 to each sub component:
> > +
> > +	    short value of 16-bits
> > +       ┌──────────────┬────────────────┐
> > +       │              │                │
> > +       └──────────────┴────────────────┘
> > +	 8-bit part1 ▲  8-bit part2   ▲
> > +		     │                │
> > +		     │                │
> > +		    +1               +1
> > +
> > +       after the first addition, we have to shift right by 8, and narrow the
> > +       results back to a byte.  Remember that the addition must be done in
> > +       double the precision of the input.  However if we know that the
> addition
> > +       `x + 257` does not overflow then we can do the operation in the
> current
> > +       precision.  In which case we don't need the pack and unpacks.  */
> > +      auto wcst = wi::to_wide (cst);
> > +      int pow = wi::exact_log2 (wcst + 1);
> > +      if (pow == (int) (element_precision (vectype) / 2))
> > +	{
> > +	  wide_int min,max;
> > +	  /* If we're in a pattern we need to find the orginal definition.  */
> > +	  tree op0 = oprnd0;
> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> > +	  if (is_pattern_stmt_p (stmt_info))
> > +	    {
> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> > +	    }
> 
> If this is generally safe (I'm skipping thinking about it in the interests of a
> quick review :-)), then I think it should be done in vect_get_range_info
> instead.  Using gimple_get_lhs would be more general than handling just
> assignments.
> 
> > +
> > +	  /* Check that no overflow will occur.  If we don't have range
> > +	     information we can't perform the optimization.  */
> > +	  if (vect_get_range_info (op0, &min, &max))
> > +	    {
> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> > +	      wi::overflow_type ovf;
> > +	      /* We need adder and max in the same precision.  */
> > +	      wide_int zadder
> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> > +					  UNSIGNED);
> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> 
> Could you explain this a bit more?  When do we have mismatched
> precisions?

C promotion rules will promote e.g.

void fun2(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < n; i+=1)
    pixel[i] = (pixel[i] + level) / 0xff;
}

so that the addition is done as a 32-bit integer.  The vectorizer will demote this down
to a short, but range information is not stored for patterns.  So in the above the range will
correctly be 0x1fe, but the precision will be that of the original expression, i.e. 32.  That is
a mismatch with itype, which is derived from the size in which the vectorizer will perform the
operation.
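
(Concretely, for the example above: the addition's range is [0, 0xff + 0xff]
= [0, 0x1fe], recorded at the 32-bit precision of the original int expression,
while adder = 1 + (1 << 8) = 0x101 is built in the 16-bit itype; hence adder
has to be extended to max's precision before the max + adder overflow check.)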

Thanks,
Tamar

> 
> Thanks,
> Richard
> 
> > +	      if (ovf == wi::OVF_NONE)
> > +		{
> > +		  *type_out = vectype;
> > +		  tree tadder = wide_int_to_tree (itype, adder);
> > +		  gcall *patt1
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> tadder);
> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (patt1, lhs);
> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> vectype);
> > +
> > +		  pattern_stmt
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> > +
> > +		  return pattern_stmt;
> > +		}
> > +	    }
> > +	}
> >      }
> >
> >    if (prec > HOST_BITS_PER_WIDE_INT
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> >
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b95
> 64f
> > c4e066e50081 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >  	}
> >        target_support_p = (optab_handler (optab, vec_mode)
> >  			  != CODE_FOR_nothing);
> > -      tree cst;
> > -      if (!target_support_p
> > -	  && op1
> > -	  && (cst = uniform_integer_cst_p (op1)))
> > -	target_support_p
> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> > -							wi::to_wide (cst),
> > -							NULL, NULL_RTX,
> > -							NULL_RTX);
> >      }
> >
> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 14:13   ` Tamar Christina
@ 2023-02-10 14:30     ` Richard Sandiford
  2023-02-10 14:54       ` Tamar Christina
  2023-02-27 11:09       ` Tamar Christina
  2023-02-10 15:56     ` Richard Sandiford
  1 sibling, 2 replies; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 14:30 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 1:36 PM
>> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
>> rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> I think I'm misunderstanding, but: it seems like we're treating the add
>> highpart optabs as companions to the mul highpart optabs.  But AIUI, the add
>> highpart optab is used such that, for an N-bit mode, we do an N-bit addition
>> followed by a shift by N/2.  Is that right?
>> The mul highpart optabs instead do an 2N-bit multiplication followed by a
>> shift by N.
>
> Correct.
>
>> 
>> Apart from consistency, the reason this matters is: I'm not sure what we gain
>> by adding the optab rather than simply open-coding the addition and the
>> shift directly into the vector pattern.  It seems like the AArch64 expander in
>> 2/2 does just do an ordinary N-bit addition followed by an ordinary shift by
>> N/2.
>
> I mentioned this in the implementation: I did it this way because AArch64 has various
> optimizations on shifts when it comes to truncating results.  I didn't need to
> represent it with shifts; in fact the original pattern did not.  But representing it
> directly in the final instructions is problematic because those instructions are
> unspecs, and I would have needed to provide additional optabs to optimize them.
>
> So the shift representation was more natural for AArch64.  It would not be for, say,
> AArch32, which does not already have these optimizations.  SVE has similar optimizations,
> and at the very worst you get a usra.
>
> I avoided open-coding it with an add and a shift because that creates a four-instruction
> dependency chain (with shifts, which are typically slow) instead of a load and a multiply.
> Unless the target is known to optimize it further, this change is unlikely to be beneficial.
> And by the time we get to costing, the only alternative is to undo the existing pattern, so
> you lose the general shift optimization.
>
> So it seemed unwise to open-code it as shifts, given that the codegen out of the vectorizer
> would be degenerate for most targets, or one would need the more complicated route of costing
> during pattern matching.

Hmm, OK.  That seems like a cost-model thing though, rather than
something that should be exposed through optabs.  And I imagine
the open-coded version would still be better than nothing on
targets without highpart multiply.

So how about replacing the hook with one that simply asks whether
division through highpart multiplication is preferred over the
add/shift sequence?  (Unfortunately it's not going to be possible
to work that out from existing information.)
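
To sketch the shape only (name, signature and default are placeholders, not a
concrete proposal), something along these lines in target.def:

/* Hypothetical hook: purely illustrative.  */
DEFHOOK
(preferred_div_as_shifts_over_mult,
 "Return true if vectorizing a division by a uniform constant with\n\
vector type @var{vectype} should prefer the open-coded add/shift\n\
sequence over a highpart multiplication.",
 bool, (const_tree vectype),
 NULL)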

Thanks,
Richard

>
>> 
>> Some comments in addition to Richard's:
>> 
>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Hi All,
>> >
>> > As discussed in the ticket, this replaces the approach for optimizing
>> > the div by bitmask operation from a hook into optabs implemented
>> > through add_highpart.
>> >
>> > In order to be able to use this we need to check whether the current
>> > precision has enough bits to do the operation without any of the additions
>> overflowing.
>> >
>> > We use range information to determine this and only do the operation
>> > if we're sure am overflow won't occur.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
>> issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> > 	PR target/108583
>> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
>> Remove.
>> > 	* doc/tm.texi.in: Likewise.
>> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
>> patch.
>> > 	* expmed.cc (expand_divmod): Likewise.
>> > 	* expmed.h (expand_divmod): Likewise.
>> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
>> > 	* optabs.cc (expand_doubleword_mod,
>> expand_doubleword_divmod): Likewise.
>> > 	* internal-fn.def (ADDH): New.
>> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
>> > 	* doc/md.texi: Document them.
>> > 	* doc/rtl.texi: Likewise.
>> > 	* target.def (can_special_div_by_const): Remove.
>> > 	* target.h: Remove tree-core.h include
>> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
>> > 	* targhooks.h (default_can_special_div_by_const): Remove.
>> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
>> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
>> and
>> > 	implement new obtab recognition based on range.
>> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 	PR target/108583
>> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
>> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
>> >
>> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
>> 3
>> > 8595e21af35d 100644
>> > --- a/gcc/doc/md.texi
>> > +++ b/gcc/doc/md.texi
>> > @@ -5668,6 +5668,18 @@ represented in RTL using a
>> @code{smul_highpart} RTX expression.
>> >  Similar, but the multiplication is unsigned.  This may be represented
>> > in RTL using an @code{umul_highpart} RTX expression.
>> >
>> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
>> > +@samp{smul@var{m}3_highpart}
>> 
>> sadd
>> 
>> > +Perform a signed addition of operands 1 and 2, which have mode
>> > +@var{m}, and store the most significant half of the product in operand 0.
>> > +The least significant half of the product is discarded.  This may be
>> > +represented in RTL using a @code{sadd_highpart} RTX expression.
>> > +
>> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
>> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
>> > +This may be represented in RTL using an @code{uadd_highpart} RTX
>> > +expression.
>> > +
>> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
>> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-extend
>> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
>> > b/gcc/doc/rtl.texi index
>> >
>> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
>> d17
>> > 1940ec4222f3 100644
>> > --- a/gcc/doc/rtl.texi
>> > +++ b/gcc/doc/rtl.texi
>> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
>> > @code{smul_highpart} returns the high part  of a signed
>> > multiplication, @code{umul_highpart} returns the high part  of an unsigned
>> multiplication.
>> >
>> > +@findex sadd_highpart
>> > +@findex uadd_highpart
>> > +@cindex high-part addition
>> > +@cindex addition high part
>> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
>> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
>> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
>> > +@code{sadd_highpart} returns the high part of a signed addition,
>> > +@code{uadd_highpart} returns the high part of an unsigned addition.
>> 
>> The patch doesn't add these RTL codes though.
>> 
>> > +
>> >  @findex fma
>> >  @cindex fused multiply-add
>> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
>> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
>> >
>> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
>> 17e
>> > 6b0d62ab077e 100644
>> > --- a/gcc/doc/tm.texi
>> > +++ b/gcc/doc/tm.texi
>> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the
>> > hook to handle these two  implementation approaches itself.
>> >  @end deftypefn
>> >
>> > -@deftypefn {Target Hook} bool
>> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
>> @var{tree_code}, tree
>> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
>> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
>> > target has a special method of -division of vectors of type @var{vectype}
>> using the value @var{constant}, -and producing a vector of type
>> @var{vectype}.  The division -will then not be decomposed by the vectorizer
>> and kept as a div.
>> > -
>> > -When the hook is being used to test whether the target supports a
>> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
>> > When the hook -is being used to emit a division, @var{in0} and
>> > @var{in1} are the source -vectors of type @var{vecttype} and
>> > @var{output} is the destination vector of -type @var{vectype}.
>> > -
>> > -Return true if the operation is possible, emitting instructions for
>> > it -if rtxes are provided and updating @var{output}.
>> > -@end deftypefn
>> > -
>> >  @deftypefn {Target Hook} tree
>> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
>> @var{code},
>> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
>> > return the decl of a function that implements the  vectorized variant
>> > of the function with the @code{combined_fn} code diff --git
>> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
>> >
>> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
>> a3a
>> > bccd1c293c7b 100644
>> > --- a/gcc/doc/tm.texi.in
>> > +++ b/gcc/doc/tm.texi.in
>> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
>> strategy can generate better code.
>> >
>> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>> >
>> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
>> > -
>> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>> >
>> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
>> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
>> >
>> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
>> bef
>> > a016eea4573c 100644
>> > --- a/gcc/explow.cc
>> > +++ b/gcc/explow.cc
>> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
>> >       TRUNC_DIV_EXPR.  */
>> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size,
>> > align_rtx,
>> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>> >  			NULL_RTX, 1);
>> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>> >
>> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
>> required_align)
>> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>> >  				       Pmode),
>> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> > target,
>> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
>> >  					Pmode),
>> >  			  NULL_RTX, 1);
>> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
>> >
>> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
>> 094
>> > 1628068f3901 100644
>> > --- a/gcc/expmed.h
>> > +++ b/gcc/expmed.h
>> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
>> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
>> (enum tree_code, machine_mode, rtx, int, rtx,
>> >  			       int);
>> >  #ifdef GCC_OPTABS_H
>> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
>> tree,
>> > -			  rtx, rtx, rtx, int,
>> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
>> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
>> rtx,
>> > +			  rtx, int, enum optab_methods =
>> OPTAB_LIB_WIDEN);
>> >  #endif
>> >  #endif
>> >
>> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
>> >
>> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
>> a59
>> > c169d3b7692f 100644
>> > --- a/gcc/expmed.cc
>> > +++ b/gcc/expmed.cc
>> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx
>> op0,
>> > HOST_WIDE_INT d)
>> >
>> >  rtx
>> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
>> mode,
>> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
>> > -	       int unsignedp, enum optab_methods methods)
>> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
>> > +	       enum optab_methods methods)
>> >  {
>> >    machine_mode compute_mode;
>> >    rtx tquotient;
>> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code
>> > code, machine_mode mode,
>> >
>> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>> >
>> > -  /* Check if the target has specific expansions for the division.
>> > */
>> > -  tree cst;
>> > -  if (treeop0
>> > -      && treeop1
>> > -      && (cst = uniform_integer_cst_p (treeop1))
>> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
>> (treeop0),
>> > -						     wi::to_wide (cst),
>> > -						     &target, op0, op1))
>> > -    return target;
>> > -
>> > -
>> >    /* Now convert to the best mode to use.  */
>> >    if (compute_mode != mode)
>> >      {
>> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code
>> code, machine_mode mode,
>> >  			    || (optab_handler (sdivmod_optab, int_mode)
>> >  				!= CODE_FOR_nothing)))
>> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
>> > -						int_mode, treeop0, treeop1,
>> > -						op0, gen_int_mode (abs_d,
>> > +						int_mode, op0,
>> > +						gen_int_mode (abs_d,
>> >  							      int_mode),
>> >  						NULL_RTX, 0);
>> >  		    else
>> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code
>> code, machine_mode mode,
>> >  				      size - 1, NULL_RTX, 0);
>> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>> >  				    NULL_RTX);
>> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
>> treeop0,
>> > -				    treeop1, t3, op1, NULL_RTX, 0);
>> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
>> op1,
>> > +				    NULL_RTX, 0);
>> >  		if (t4)
>> >  		  {
>> >  		    rtx t5;
>> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
>> >
>> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
>> 2280
>> > c6e277f26d72 100644
>> > --- a/gcc/expr.cc
>> > +++ b/gcc/expr.cc
>> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>> >  	    return expand_divmod (0,
>> >  				  FLOAT_MODE_P (GET_MODE (value))
>> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
>> > -				  GET_MODE (value), NULL, NULL, op1, op2,
>> > -				  target, 0);
>> > +				  GET_MODE (value), op1, op2, target, 0);
>> >  	case MOD:
>> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 0);
>> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 0);
>> >  	case UDIV:
>> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 1);
>> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 1);
>> >  	case UMOD:
>> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 1);
>> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 1);
>> >  	case ASHIFTRT:
>> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
>> 9170,13 +9169,11 @@
>> > expand_expr_divmod (tree_code code, machine_mode mode, tree
>> treeop0,
>> >        bool speed_p = optimize_insn_for_speed_p ();
>> >        do_pending_stack_adjust ();
>> >        start_sequence ();
>> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -				   op0, op1, target, 1);
>> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> > + target, 1);
>> >        rtx_insn *uns_insns = get_insns ();
>> >        end_sequence ();
>> >        start_sequence ();
>> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -				   op0, op1, target, 0);
>> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> > + target, 0);
>> >        rtx_insn *sgn_insns = get_insns ();
>> >        end_sequence ();
>> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@ -9198,8
>> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
>> mode, tree treeop0,
>> >        emit_insn (sgn_insns);
>> >        return sgn_ret;
>> >      }
>> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -			op0, op1, target, unsignedp);
>> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
>> > + unsignedp);
>> >  }
>> >
>> >  rtx
>> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
>> >
>> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
>> 3b
>> > 8a734baa800f 100644
>> > --- a/gcc/internal-fn.def
>> > +++ b/gcc/internal-fn.def
>> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
>> ECF_CONST
>> > | ECF_NOTHROW, first,
>> >
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
>> ECF_NOTHROW, first,
>> >  			      smul_highpart, umul_highpart, binary)
>> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
>> ECF_NOTHROW, first,
>> > +			      sadd_highpart, uadd_highpart, binary)
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
>> ECF_NOTHROW, first,
>> >  			      smulhs, umulhs, binary)
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
>> ECF_NOTHROW, first,
>> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >
>> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
>> e
>> > 77082c1e617b 100644
>> > --- a/gcc/optabs.cc
>> > +++ b/gcc/optabs.cc
>> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
>> mode, rtx op0, rtx op1, bool unsignedp)
>> >  		return NULL_RTX;
>> >  	    }
>> >  	}
>> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> NULL, NULL,
>> > -				     sum, gen_int_mode (INTVAL (op1),
>> > -							word_mode),
>> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> sum,
>> > +				     gen_int_mode (INTVAL (op1),
>> word_mode),
>> >  				     NULL_RTX, 1, OPTAB_DIRECT);
>> >        if (remainder == NULL_RTX)
>> >  	return NULL_RTX;
>> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
>> mode, rtx
>> > op0, rtx op1, rtx *rem,
>> >
>> >    if (op11 != const1_rtx)
>> >      {
>> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL,
>> quot1,
>> > -				op11, NULL_RTX, unsignedp,
>> OPTAB_DIRECT);
>> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
>> op11,
>> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >        if (rem2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
>> mode, rtx op0, rtx op1, rtx *rem,
>> >        if (rem2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL,
>> quot1,
>> > -				 op11, NULL_RTX, unsignedp,
>> OPTAB_DIRECT);
>> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
>> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >        if (quot2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
>> >
>> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
>> ccb
>> > f6147947351a 100644
>> > --- a/gcc/optabs.def
>> > +++ b/gcc/optabs.def
>> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>> >
>> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
>> > (umul_highpart_optab, "umul$a3_highpart")
>> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
>> > +(uadd_highpart_optab, "uadd$a3_highpart")
>> >
>> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
>> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
>> > diff --git a/gcc/target.def b/gcc/target.def index
>> >
>> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
>> 81a
>> > fa2c2baa64a5 100644
>> > --- a/gcc/target.def
>> > +++ b/gcc/target.def
>> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>> >  	const vec_perm_indices &sel),
>> >   NULL)
>> >
>> > -DEFHOOK
>> > -(can_special_div_by_const,
>> > - "This hook is used to test whether the target has a special method
>> > of\n\ -division of vectors of type @var{vectype} using the value
>> > @var{constant},\n\ -and producing a vector of type @var{vectype}.  The
>> > division\n\ -will then not be decomposed by the vectorizer and kept as
>> > a div.\n\ -\n\ -When the hook is being used to test whether the target
>> > supports a special\n\ -divide, @var{in0}, @var{in1}, and @var{output}
>> > are all null.  When the hook\n\ -is being used to emit a division,
>> > @var{in0} and @var{in1} are the source\n\ -vectors of type
>> > @var{vecttype} and @var{output} is the destination vector of\n\ -type
>> > @var{vectype}.\n\ -\n\ -Return true if the operation is possible,
>> > emitting instructions for it\n\ -if rtxes are provided and updating
>> > @var{output}.",
>> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
>> > -	rtx in0, rtx in1),
>> > - default_can_special_div_by_const)
>> > -
>> >  /* Return true if the target supports misaligned store/load of a
>> >     specific factor denoted in the third parameter.  The last parameter
>> >     is true if the access is defined in a packed struct.  */ diff
>> > --git a/gcc/target.h b/gcc/target.h index
>> >
>> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
>> 9f9
>> > 13158c2d47b1 100644
>> > --- a/gcc/target.h
>> > +++ b/gcc/target.h
>> > @@ -51,7 +51,6 @@
>> >  #include "insn-codes.h"
>> >  #include "tm.h"
>> >  #include "hard-reg-set.h"
>> > -#include "tree-core.h"
>> >
>> >  #if CHECKING_P
>> >
>> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
>> >
>> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
>> 93
>> > 17a31390f0c2 100644
>> > --- a/gcc/targhooks.h
>> > +++ b/gcc/targhooks.h
>> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
>> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
>> > (rtx, tree, tree);  extern unsigned int default_case_values_threshold
>> > (void);  extern bool default_have_conditional_execution (void);
>> > -extern bool default_can_special_div_by_const (enum tree_code, tree,
>> wide_int,
>> > -					      rtx *, rtx, rtx);
>> >
>> >  extern bool default_libc_has_function (enum function_class, tree);
>> > extern bool default_libc_has_fast_function (int fcode); diff --git
>> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
>> >
>> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
>> 03
>> > 877337a931e7 100644
>> > --- a/gcc/targhooks.cc
>> > +++ b/gcc/targhooks.cc
>> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>> >    return HAVE_conditional_execution;
>> >  }
>> >
>> > -/* Default that no division by constant operations are special.  */
>> > -bool -default_can_special_div_by_const (enum tree_code, tree,
>> > wide_int, rtx *, rtx,
>> > -				  rtx)
>> > -{
>> > -  return false;
>> > -}
>> > -
>> >  /* By default we assume that c99 functions are present at the runtime,
>> >     but sincos is not.  */
>> >  bool
>> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > new file mode 100644
>> > index
>> >
>> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
>> a0
>> > 4ea8c1f73e3c
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > @@ -0,0 +1,25 @@
>> > +/* { dg-require-effective-target vect_int } */
>> > +
>> > +#include <stdint.h>
>> > +#include "tree-vect.h"
>> > +
>> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
>> > +
>> > +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
>> > +foo (V v, unsigned short i) {
>> > +  v /= i;
>> > +  return v;
>> > +}
>> > +
>> > +int
>> > +main (void)
>> > +{
>> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff },
>> > +0xffff);
>> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
>> > +    if (v[i] != 0x00010001)
>> > +      __builtin_abort ();
>> > +  return 0;
>> > +}
>> > +
>> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
>> > +detected" "vect" { target aarch64*-*-* } } } */
>> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > new file mode 100644
>> > index
>> >
>> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
>> 4d2
>> > a29b933de625
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > @@ -0,0 +1,58 @@
>> > +/* { dg-require-effective-target vect_int } */
>> > +
>> > +#include <stdint.h>
>> > +#include <stdio.h>
>> > +#include "tree-vect.h"
>> > +
>> > +#define N 50
>> > +#define TYPE uint8_t
>> > +
>> > +#ifndef DEBUG
>> > +#define DEBUG 0
>> > +#endif
>> > +
>> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
>> > +
>> > +
>> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
>> > +restrict pixel, TYPE level, int n) {
>> > +  for (int i = 0; i < n; i+=1)
>> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> > +
>> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
>> > +restrict pixel, TYPE level, int n) {
>> > +  for (int i = 0; i < n; i+=1)
>> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> > +
>> > +int main ()
>> > +{
>> > +  TYPE a[N];
>> > +  TYPE b[N];
>> > +
>> > +  for (int i = 0; i < N; ++i)
>> > +    {
>> > +      a[i] = BASE + i * 13;
>> > +      b[i] = BASE + i * 13;
>> > +      if (DEBUG)
>> > +        printf ("%d: 0x%x\n", i, a[i]);
>> > +    }
>> > +
>> > +  fun1 (a, N / 2, N);
>> > +  fun2 (b, N / 2, N);
>> > +
>> > +  for (int i = 0; i < N; ++i)
>> > +    {
>> > +      if (DEBUG)
>> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
>> > +
>> > +      if (a[i] != b[i])
>> > +        __builtin_abort ();
>> > +    }
>> > +  return 0;
>> > +}
>> > +
>> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" {
>> > +target aarch64*-*-* } } } */
>> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index
>> >
>> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077d
>> c3
>> > e970bed75ef6 100644
>> > --- a/gcc/tree-vect-generic.cc
>> > +++ b/gcc/tree-vect-generic.cc
>> > @@ -1237,17 +1237,6 @@ expand_vector_operation
>> (gimple_stmt_iterator *gsi, tree type, tree compute_type
>> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
>> >  	  tree ret;
>> >
>> > -	  /* Check if the target was going to handle it through the special
>> > -	     division callback hook.  */
>> > -	  tree cst = uniform_integer_cst_p (rhs2);
>> > -	  if (cst &&
>> > -	      targetm.vectorize.can_special_div_by_const (code, type,
>> > -							  wi::to_wide (cst),
>> > -							  NULL,
>> > -							  NULL_RTX,
>> NULL_RTX))
>> > -	    return NULL_TREE;
>> > -
>> > -
>> >  	  if (!optimize
>> >  	      || !VECTOR_INTEGER_TYPE_P (type)
>> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
>> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> 69
>> > de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> vectype,
>> > -							  wi::to_wide (cst),
>> > -							  NULL, NULL_RTX,
>> > -							  NULL_RTX))
>> > +	   && TYPE_UNSIGNED (itype)
>> > +	   && rhs_code == TRUNC_DIV_EXPR
>> > +	   && vectype
>> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +					      OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* div optimizations using narrowings
>> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +       double the precision of x.
>> > +
>> > +       If we imagine a short as being composed of two blocks of bytes then
>> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +       adding 1 to each sub component:
>> > +
>> > +	    short value of 16-bits
>> > +       ┌──────────────┬────────────────┐
>> > +       │              │                │
>> > +       └──────────────┴────────────────┘
>> > +	 8-bit part1 ▲  8-bit part2   ▲
>> > +		     │                │
>> > +		     │                │
>> > +		    +1               +1
>> > +
>> > +       after the first addition, we have to shift right by 8, and narrow the
>> > +       results back to a byte.  Remember that the addition must be done in
>> > +       double the precision of the input.  However if we know that the
>> addition
>> > +       `x + 257` does not overflow then we can do the operation in the
>> current
>> > +       precision.  In which case we don't need the pack and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +	{
>> > +	  wide_int min,max;
>> > +	  /* If we're in a pattern we need to find the orginal definition.  */
>> > +	  tree op0 = oprnd0;
>> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +	  if (is_pattern_stmt_p (stmt_info))
>> > +	    {
>> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +	    }
>> 
>> If this is generally safe (I'm skipping thinking about it in the interests of a
>> quick review :-)), then I think it should be done in vect_get_range_info
>> instead.  Using gimple_get_lhs would be more general than handling just
>> assignments.
>> 
>> > +
>> > +	  /* Check that no overflow will occur.  If we don't have range
>> > +	     information we can't perform the optimization.  */
>> > +	  if (vect_get_range_info (op0, &min, &max))
>> > +	    {
>> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +	      wi::overflow_type ovf;
>> > +	      /* We need adder and max in the same precision.  */
>> > +	      wide_int zadder
>> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> > +					  UNSIGNED);
>> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> 
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> And have the addition be done as a 32 bit integer.  The vectorizer will demote this down
> to a short, but range information is not stored for patterns.  So In the above the range will
> correctly be 0x1fe but the precision will be that of the original expression, so 32.  This will
> be a mismatch with itype which is derived from the size the vectorizer will perform the
> operation in.
>
> Thanks,
> Tamar
>
>> 
>> Thanks,
>> Richard
>> 
>> > +	      if (ovf == wi::OVF_NONE)
>> > +		{
>> > +		  *type_out = vectype;
>> > +		  tree tadder = wide_int_to_tree (itype, adder);
>> > +		  gcall *patt1
>> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
>> tadder);
>> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
>> > +		  gimple_call_set_lhs (patt1, lhs);
>> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
>> vectype);
>> > +
>> > +		  pattern_stmt
>> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
>> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
>> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
>> > +
>> > +		  return pattern_stmt;
>> > +		}
>> > +	    }
>> > +	}
>> >      }
>> >
>> >    if (prec > HOST_BITS_PER_WIDE_INT
>> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
>> >
>> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b95
>> 64f
>> > c4e066e50081 100644
>> > --- a/gcc/tree-vect-stmts.cc
>> > +++ b/gcc/tree-vect-stmts.cc
>> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>> >  	}
>> >        target_support_p = (optab_handler (optab, vec_mode)
>> >  			  != CODE_FOR_nothing);
>> > -      tree cst;
>> > -      if (!target_support_p
>> > -	  && op1
>> > -	  && (cst = uniform_integer_cst_p (op1)))
>> > -	target_support_p
>> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
>> > -							wi::to_wide (cst),
>> > -							NULL, NULL_RTX,
>> > -							NULL_RTX);
>> >      }
>> >
>> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);


* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 14:30     ` Richard Sandiford
@ 2023-02-10 14:54       ` Tamar Christina
  2023-02-27 11:09       ` Tamar Christina
  1 sibling, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 14:54 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 2:31 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 1:36 PM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> >> rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> I think I'm misunderstanding, but: it seems like we're treating the
> >> add highpart optabs as companions to the mul highpart optabs.  But
> >> AIUI, the add highpart optab is used such that, for an N-bit mode, we
> >> do an N-bit addition followed by a shift by N/2.  Is that right?
> >> The mul highpart optabs instead do an 2N-bit multiplication followed
> >> by a shift by N.
> >
> > Correct.
> >
> >>
> >> Apart from consistency, the reason this matters is: I'm not sure what
> >> we gain by adding the optab rather than simply open-coding the
> >> addition and the shift directly into the vector pattern.  It seems
> >> like the AArch64 expander in
> >> 2/2 does just do an ordinary N-bit addition followed by an ordinary
> >> shift by N/2.
> >
> > I mentioned in the implementation, but I did so because AArch64 has
> > various optimization on shifts when it comes to truncating results.  I
> > didn't need to represent it with shifts, in fact the original pattern
> > did not. But representing it directly in the final instructions are
> > problematic because these instructions are unspecs and I would have
> needed to provide additional optabs to optimize them in.
> >
> > So the shift representation was more natural for AArch64. It would not
> > be say for
> > AArch32 which does not have these optimizations already. SVE has
> > similar optimizations and at the very worse you get an usra.
> >
> > I avoided open coding it with add and shift because it creates a 4
> > instructions (and shifts which are typically slow) dependency chain
> > instead of a load and multiply.  This change, unless the target is
> > known to optimize it further is unlikely to be beneficial.  And by the
> > time we get to costing the only alternative is to undo the existing pattern
> and so you lose the general shift optimization.
> >
> > So it seemed unwise to open code as shifts, given the codegen out of
> > the vectorizer would be degenerate for most targets or one needs the
> > more complicated route of costing during pattern matching already.
> 
> Hmm, OK.  That seems like a cost-model thing though, rather than something
> that should be exposed through optabs.  And I imagine the open-coded
> version would still be better than nothing on targets without highpart
> multiply.

Yeah, but I don't think we've ever done costing on patterns during matching.
It's always been commit-and-go, under the assumption that the replacement
is always going to be cheaper.

> 
> So how about replacing the hook with one that simply asks whether division
> through highpart multiplication is preferred over the add/shift sequence?
> (Unfortunately it's not going to be possible to work that out from existing
> information.)

If Richi has no objections to it, I can do that instead then.
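
Something with roughly this shape, I'm guessing (the hook name, signature
and wording below are all made up just to illustrate what I have in mind,
not a final spelling):

DEFHOOK
(preferred_div_as_shifts_over_mult,
 "Return true if the vectorizer should lower a division of vectors of\n\
type @var{vectype} by a (2^N)-1 constant through the add-highpart and\n\
shift sequence rather than through a highpart multiply.",
 bool, (const_tree vectype),
 default_preferred_div_as_shifts_over_mult)

with the default returning false, so only targets that want the add/shift
form opt in.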

Just to clarify, are you satisfied with the answer on the mixed precisions?

Thanks,
Tamar

> 
> Thanks,
> Richard
> 
> >
> >>
> >> Some comments in addition to Richard's:
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Hi All,
> >> >
> >> > As discussed in the ticket, this replaces the approach for
> >> > optimizing the div by bitmask operation from a hook into optabs
> >> > implemented through add_highpart.
> >> >
> >> > In order to be able to use this we need to check whether the
> >> > current precision has enough bits to do the operation without any
> >> > of the additions
> >> overflowing.
> >> >
> >> > We use range information to determine this and only do the
> >> > operation if we're sure am overflow won't occur.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> issues.
> >> >
> >> > Ok for master?
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> Remove.
> >> > 	* doc/tm.texi.in: Likewise.
> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
> >> patch.
> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> > 	* expmed.h (expand_divmod): Likewise.
> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> > 	* optabs.cc (expand_doubleword_mod,
> >> expand_doubleword_divmod): Likewise.
> >> > 	* internal-fn.def (ADDH): New.
> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> > 	* doc/md.texi: Document them.
> >> > 	* doc/rtl.texi: Likewise.
> >> > 	* target.def (can_special_div_by_const): Remove.
> >> > 	* target.h: Remove tree-core.h include
> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> >> and
> >> > 	implement new obtab recognition based on range.
> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >
> >> > --- inline copy of patch --
> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> >
> >>
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
> >> 3
> >> > 8595e21af35d 100644
> >> > --- a/gcc/doc/md.texi
> >> > +++ b/gcc/doc/md.texi
> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> @code{smul_highpart} RTX expression.
> >> >  Similar, but the multiplication is unsigned.  This may be
> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >
> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{smul@var{m}3_highpart}
> >>
> >> sadd
> >>
> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> > +@var{m}, and store the most significant half of the product in operand
> 0.
> >> > +The least significant half of the product is discarded.  This may
> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
> >> > +
> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
> >> > +expression.
> >> > +
> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
> extend
> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> >> > b/gcc/doc/rtl.texi index
> >> >
> >>
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
> >> d17
> >> > 1940ec4222f3 100644
> >> > --- a/gcc/doc/rtl.texi
> >> > +++ b/gcc/doc/rtl.texi
> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> > @code{smul_highpart} returns the high part  of a signed
> >> > multiplication, @code{umul_highpart} returns the high part  of an
> >> > unsigned
> >> multiplication.
> >> >
> >> > +@findex sadd_highpart
> >> > +@findex uadd_highpart
> >> > +@cindex high-part addition
> >> > +@cindex addition high part
> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> >> > +addition of @var{x} and @var{y} carried out in machine mode
> @var{m}.
> >> > +@code{sadd_highpart} returns the high part of a signed addition,
> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
> >>
> >> The patch doesn't add these RTL codes though.
> >>
> >> > +
> >> >  @findex fma
> >> >  @cindex fused multiply-add
> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >> >
> >>
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
> >> 17e
> >> > 6b0d62ab077e 100644
> >> > --- a/gcc/doc/tm.texi
> >> > +++ b/gcc/doc/tm.texi
> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
> >> > the hook to handle these two  implementation approaches itself.
> >> >  @end deftypefn
> >> >
> >> > -@deftypefn {Target Hook} bool
> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> @var{tree_code}, tree
> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> > target has a special method of -division of vectors of type
> >> > @var{vectype}
> >> using the value @var{constant}, -and producing a vector of type
> >> @var{vectype}.  The division -will then not be decomposed by the
> >> vectorizer and kept as a div.
> >> > -
> >> > -When the hook is being used to test whether the target supports a
> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> > When the hook -is being used to emit a division, @var{in0} and
> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> > -
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it -if rtxes are provided and updating @var{output}.
> >> > -@end deftypefn
> >> > -
> >> >  @deftypefn {Target Hook} tree
> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> @var{code},
> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
> >> > return the decl of a function that implements the  vectorized
> >> > variant of the function with the @code{combined_fn} code diff --git
> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >> >
> >>
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
> >> a3a
> >> > bccd1c293c7b 100644
> >> > --- a/gcc/doc/tm.texi.in
> >> > +++ b/gcc/doc/tm.texi.in
> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> strategy can generate better code.
> >> >
> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >
> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> > -
> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >
> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> >
> >>
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
> >> bef
> >> > a016eea4573c 100644
> >> > --- a/gcc/explow.cc
> >> > +++ b/gcc/explow.cc
> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >       TRUNC_DIV_EXPR.  */
> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > size, align_rtx,
> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >> >  			NULL_RTX, 1);
> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >
> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> >> required_align)
> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >> >  				       Pmode),
> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > target,
> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >> >  					Pmode),
> >> >  			  NULL_RTX, 1);
> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> >
> >>
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
> >> 094
> >> > 1628068f3901 100644
> >> > --- a/gcc/expmed.h
> >> > +++ b/gcc/expmed.h
> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >  			       int);
> >> >  #ifdef GCC_OPTABS_H
> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode,
> tree,
> >> tree,
> >> > -			  rtx, rtx, rtx, int,
> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
> >> rtx,
> >> > +			  rtx, int, enum optab_methods =
> >> OPTAB_LIB_WIDEN);
> >> >  #endif
> >> >  #endif
> >> >
> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> >
> >>
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
> >> a59
> >> > c169d3b7692f 100644
> >> > --- a/gcc/expmed.cc
> >> > +++ b/gcc/expmed.cc
> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
> rtx
> >> op0,
> >> > HOST_WIDE_INT d)
> >> >
> >> >  rtx
> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> mode,
> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> > -	       int unsignedp, enum optab_methods methods)
> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> > +	       enum optab_methods methods)
> >> >  {
> >> >    machine_mode compute_mode;
> >> >    rtx tquotient;
> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> > code, machine_mode mode,
> >> >
> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> > 0;
> >> >
> >> > -  /* Check if the target has specific expansions for the division.
> >> > */
> >> > -  tree cst;
> >> > -  if (treeop0
> >> > -      && treeop1
> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> (treeop0),
> >> > -						     wi::to_wide (cst),
> >> > -						     &target, op0, op1))
> >> > -    return target;
> >> > -
> >> > -
> >> >    /* Now convert to the best mode to use.  */
> >> >    if (compute_mode != mode)
> >> >      {
> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >> >  				!= CODE_FOR_nothing)))
> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> > -						int_mode, treeop0, treeop1,
> >> > -						op0, gen_int_mode (abs_d,
> >> > +						int_mode, op0,
> >> > +						gen_int_mode (abs_d,
> >> >  							      int_mode),
> >> >  						NULL_RTX, 0);
> >> >  		    else
> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  				      size - 1, NULL_RTX, 0);
> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >> >  				    NULL_RTX);
> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> treeop0,
> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> >> op1,
> >> > +				    NULL_RTX, 0);
> >> >  		if (t4)
> >> >  		  {
> >> >  		    rtx t5;
> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> >
> >>
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
> >> 2280
> >> > c6e277f26d72 100644
> >> > --- a/gcc/expr.cc
> >> > +++ b/gcc/expr.cc
> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >  	    return expand_divmod (0,
> >> >  				  FLOAT_MODE_P (GET_MODE (value))
> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> >> > -				  target, 0);
> >> > +				  GET_MODE (value), op1, op2, target, 0);
> >> >  	case MOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 0);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 0);
> >> >  	case UDIV:
> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case UMOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case ASHIFTRT:
> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
> >> 9170,13 +9169,11 @@
> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> treeop0,
> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >        do_pending_stack_adjust ();
> >> >        start_sequence ();
> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> > -				   op0, op1, target, 1);
> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 1);
> >> >        rtx_insn *uns_insns = get_insns ();
> >> >        end_sequence ();
> >> >        start_sequence ();
> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> > -				   op0, op1, target, 0);
> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 0);
> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >        end_sequence ();
> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> > -9198,8
> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> mode, tree treeop0,
> >> >        emit_insn (sgn_insns);
> >> >        return sgn_ret;
> >> >      }
> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -			op0, op1, target, unsignedp);
> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> > + unsignedp);
> >> >  }
> >> >
> >> >  rtx
> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
> >> 3b
> >> > 8a734baa800f 100644
> >> > --- a/gcc/internal-fn.def
> >> > +++ b/gcc/internal-fn.def
> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> >> ECF_CONST
> >> > | ECF_NOTHROW, first,
> >> >
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smul_highpart, umul_highpart, binary)
> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smulhs, umulhs, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
> >> e
> >> > 77082c1e617b 100644
> >> > --- a/gcc/optabs.cc
> >> > +++ b/gcc/optabs.cc
> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >  		return NULL_RTX;
> >> >  	    }
> >> >  	}
> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> NULL, NULL,
> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> > -							word_mode),
> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> sum,
> >> > +				     gen_int_mode (INTVAL (op1),
> >> word_mode),
> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >        if (remainder == NULL_RTX)
> >> >  	return NULL_RTX;
> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod
> (machine_mode
> >> mode, rtx
> >> > op0, rtx op1, rtx *rem,
> >> >
> >> >    if (op11 != const1_rtx)
> >> >      {
> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> op11,
> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod
> (machine_mode
> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				 op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
> op11,
> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (quot2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
> >> ccb
> >> > f6147947351a 100644
> >> > --- a/gcc/optabs.def
> >> > +++ b/gcc/optabs.def
> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >
> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >
> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
> >> >
> >>
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
> >> 81a
> >> > fa2c2baa64a5 100644
> >> > --- a/gcc/target.def
> >> > +++ b/gcc/target.def
> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >  	const vec_perm_indices &sel),
> >> >   NULL)
> >> >
> >> > -DEFHOOK
> >> > -(can_special_div_by_const,
> >> > - "This hook is used to test whether the target has a special
> >> > method of\n\ -division of vectors of type @var{vectype} using the
> >> > value @var{constant},\n\ -and producing a vector of type
> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
> >> > to test whether the target supports a special\n\ -divide,
> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> > -	rtx in0, rtx in1),
> >> > - default_can_special_div_by_const)
> >> > -
> >> >  /* Return true if the target supports misaligned store/load of a
> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >     is true if the access is defined in a packed struct.  */ diff
> >> > --git a/gcc/target.h b/gcc/target.h index
> >> >
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
> >> 9f9
> >> > 13158c2d47b1 100644
> >> > --- a/gcc/target.h
> >> > +++ b/gcc/target.h
> >> > @@ -51,7 +51,6 @@
> >> >  #include "insn-codes.h"
> >> >  #include "tm.h"
> >> >  #include "hard-reg-set.h"
> >> > -#include "tree-core.h"
> >> >
> >> >  #if CHECKING_P
> >> >
> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
> >> 93
> >> > 17a31390f0c2 100644
> >> > --- a/gcc/targhooks.h
> >> > +++ b/gcc/targhooks.h
> >> > @@ -209,8 +209,6 @@ extern void
> default_addr_space_diagnose_usage
> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
> >> > (rtx, tree, tree);  extern unsigned int
> >> > default_case_values_threshold (void);  extern bool
> >> > default_have_conditional_execution (void); -extern bool
> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> wide_int,
> >> > -					      rtx *, rtx, rtx);
> >> >
> >> >  extern bool default_libc_has_function (enum function_class, tree);
> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
> >> 03
> >> > 877337a931e7 100644
> >> > --- a/gcc/targhooks.cc
> >> > +++ b/gcc/targhooks.cc
> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >> >    return HAVE_conditional_execution;  }
> >> >
> >> > -/* Default that no division by constant operations are special.
> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
> >> > wide_int, rtx *, rtx,
> >> > -				  rtx)
> >> > -{
> >> > -  return false;
> >> > -}
> >> > -
> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >     but sincos is not.  */
> >> >  bool
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
> >> a0
> >> > 4ea8c1f73e3c
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > @@ -0,0 +1,25 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> > +
> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
> >> > +V foo (V v, unsigned short i) {
> >> > +  v /= i;
> >> > +  return v;
> >> > +}
> >> > +
> >> > +int
> >> > +main (void)
> >> > +{
> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
> >> > +}, 0xffff);
> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> > +    if (v[i] != 0x00010001)
> >> > +      __builtin_abort ();
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
> >> 4d2
> >> > a29b933de625
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > @@ -0,0 +1,58 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include <stdio.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +#define N 50
> >> > +#define TYPE uint8_t
> >> > +
> >> > +#ifndef DEBUG
> >> > +#define DEBUG 0
> >> > +#endif
> >> > +
> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> > +
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +int main ()
> >> > +{
> >> > +  TYPE a[N];
> >> > +  TYPE b[N];
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      a[i] = BASE + i * 13;
> >> > +      b[i] = BASE + i * 13;
> >> > +      if (DEBUG)
> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> > +    }
> >> > +
> >> > +  fun1 (a, N / 2, N);
> >> > +  fun2 (b, N / 2, N);
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      if (DEBUG)
> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> > +
> >> > +      if (a[i] != b[i])
> >> > +        __builtin_abort ();
> >> > +    }
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> > +{ target aarch64*-*-* } } } */
> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> > index
> >> >
> >>
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077d
> >> c3
> >> > e970bed75ef6 100644
> >> > --- a/gcc/tree-vect-generic.cc
> >> > +++ b/gcc/tree-vect-generic.cc
> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >  	  tree ret;
> >> >
> >> > -	  /* Check if the target was going to handle it through the special
> >> > -	     division callback hook.  */
> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> > -	  if (cst &&
> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL,
> >> > -							  NULL_RTX,
> >> NULL_RTX))
> >> > -	    return NULL_TREE;
> >> > -
> >> > -
> >> >  	  if (!optimize
> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> 69
> >> > de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes
> then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent
> to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done
> in
> >> > +       double the precision of the input.  However if we know that
> >> > + the
> >> addition
> >> > +       `x + 257` does not overflow then we can do the operation in
> >> > + the
> >> current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the orginal definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >
> > And have the addition be done as a 32 bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So In the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype which is derived from the size the vectorizer
> will perform the operation in.
> >
> > Thanks,
> > Tamar
> >
> >>
> >> Thanks,
> >> Richard
> >>
> >> > +	      if (ovf == wi::OVF_NONE)
> >> > +		{
> >> > +		  *type_out = vectype;
> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> > +		  gcall *patt1
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> >> tadder);
> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> >> vectype);
> >> > +
> >> > +		  pattern_stmt
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> > +
> >> > +		  return pattern_stmt;
> >> > +		}
> >> > +	    }
> >> > +	}
> >> >      }
> >> >
> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> >> >
> >>
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b95
> >> 64f
> >> > c4e066e50081 100644
> >> > --- a/gcc/tree-vect-stmts.cc
> >> > +++ b/gcc/tree-vect-stmts.cc
> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >  	}
> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >  			  != CODE_FOR_nothing);
> >> > -      tree cst;
> >> > -      if (!target_support_p
> >> > -	  && op1
> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> > -	target_support_p
> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> >> > -							wi::to_wide (cst),
> >> > -							NULL, NULL_RTX,
> >> > -							NULL_RTX);
> >> >      }
> >> >
> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
> >> > (vectype);


* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 14:13   ` Tamar Christina
  2023-02-10 14:30     ` Richard Sandiford
@ 2023-02-10 15:56     ` Richard Sandiford
  2023-02-10 16:09       ` Tamar Christina
  1 sibling, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 15:56 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> 69
>> > de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> vectype,
>> > -							  wi::to_wide (cst),
>> > -							  NULL, NULL_RTX,
>> > -							  NULL_RTX))
>> > +	   && TYPE_UNSIGNED (itype)
>> > +	   && rhs_code == TRUNC_DIV_EXPR
>> > +	   && vectype
>> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +					      OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* div optimizations using narrowings
>> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +       double the precision of x.
>> > +
>> > +       If we imagine a short as being composed of two blocks of bytes then
>> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +       adding 1 to each sub component:
>> > +
>> > +	    short value of 16-bits
>> > +       ┌──────────────┬────────────────┐
>> > +       │              │                │
>> > +       └──────────────┴────────────────┘
>> > +	 8-bit part1 ▲  8-bit part2   ▲
>> > +		     │                │
>> > +		     │                │
>> > +		    +1               +1
>> > +
>> > +       after the first addition, we have to shift right by 8, and narrow the
>> > +       results back to a byte.  Remember that the addition must be done in
>> > +       double the precision of the input.  However if we know that the
>> addition
>> > +       `x + 257` does not overflow then we can do the operation in the
>> current
>> > +       precision.  In which case we don't need the pack and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +	{
>> > +	  wide_int min,max;
>> > +	  /* If we're in a pattern we need to find the orginal definition.  */
>> > +	  tree op0 = oprnd0;
>> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +	  if (is_pattern_stmt_p (stmt_info))
>> > +	    {
>> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +	    }
>> 
>> If this is generally safe (I'm skipping thinking about it in the interests of a
>> quick review :-)), then I think it should be done in vect_get_range_info
>> instead.  Using gimple_get_lhs would be more general than handling just
>> assignments.
>> 
>> > +
>> > +	  /* Check that no overflow will occur.  If we don't have range
>> > +	     information we can't perform the optimization.  */
>> > +	  if (vect_get_range_info (op0, &min, &max))
>> > +	    {
>> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +	      wi::overflow_type ovf;
>> > +	      /* We need adder and max in the same precision.  */
>> > +	      wide_int zadder
>> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> > +					  UNSIGNED);
>> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> 
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> And have the addition be done as a 32 bit integer.  The vectorizer will demote this down
> to a short, but range information is not stored for patterns.  So In the above the range will
> correctly be 0x1fe but the precision will be that of the original expression, so 32.  This will
> be a mismatch with itype which is derived from the size the vectorizer will perform the
> operation in.

Gah, missed this first time round, sorry.

Richi would know better than me, but I think it's dangerous to rely on
the orig/pattern link for range information.  The end result of a pattern
(vect_stmt_to_vectorize) has to have the same type as the lhs of the
original statement.  But the other statements in the pattern sequence
can do arbitrary things.  Their range isn't predictable from the range
of the original statement result.

IIRC, the addition above is converted to:

  a' = (uint16_t) pixel[i]
  b' = (uint16_t) level
  sum' = a' + b'
  sum = (int) sum'

where sum is the direct replacement of "pixel[i] + level", with the
same type and range.  The division then uses sum' instead of sum.

But the fact that sum' is part of the same pattern as sum doesn't
guarantee that sum' has the same range as sum.  E.g. the pattern
statements added by the division optimisation wouldn't have this
property.
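
To make that concrete with a throwaway C model (addh() and all the names
below are just my illustration, not the real IL or the real expansion of
the IFN), the statements the divmod pattern would add are roughly:

  #include <stdint.h>

  /* Model of the unsigned 16-bit add-highpart: a wrapping 16-bit addition
     followed by a shift by 8.  */
  static uint16_t
  addh (uint16_t a, uint16_t b)
  {
    return (uint16_t) (a + b) >> 8;
  }

  uint16_t
  div255 (uint16_t sum)                /* the sum' from the sequence above.  */
  {
    uint16_t t1 = addh (sum, 0x101);   /* new pattern stmt.  */
    return addh (sum, t1);             /* replaces the division; only valid
                                          because sum' is known <= 0x1fe.  */
  }

Here t1 happens to end up with a narrow range as well, but only because we
know the range of the narrowed sum; in general a statement like t1 can have
any range, independent of the range of the original statement result.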

Is it possible to tell ranger to compute the range of expressions that
haven't been added to the IL?  (Genuine question, haven't looked.
It seems pretty powerful though.)

Thanks,
Richard


* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 15:56     ` Richard Sandiford
@ 2023-02-10 16:09       ` Tamar Christina
  2023-02-10 16:25         ` Richard Sandiford
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 16:09 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 3:57 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> 69
> >> > de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes
> then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent
> to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done
> in
> >> > +       double the precision of the input.  However if we know that
> >> > + the
> >> addition
> >> > +       `x + 257` does not overflow then we can do the operation in
> >> > + the
> >> current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the orginal definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >
> > And have the addition be done as a 32 bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So In the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype which is derived from the size the vectorizer
> will perform the operation in.
> 
> Gah, missed this first time round, sorry.
> 
> Richi would know better than me, but I think it's dangerous to rely on the
> orig/pattern link for range information.  The end result of a pattern
> (vect_stmt_to_vectorize) has to have the same type as the lhs of the original
> statement.  But the other statements in the pattern sequence can do
> arbitrary things.  Their range isn't predictable from the range of the original
> statement result.
> 
> IIRC, the addition above is converted to:
> 
>   a' = (uint16_t) pixel[i]
>   b' = (uint16_t) level
>   sum' = a' + b'
>   sum = (int) sum'
> 
> where sum is the direct replacement of "pixel[i] + level", with the same type
> and range.  The division then uses sum' instead of sum.
> 
> But the fact that sum' is part of the same pattern as sum doesn't guarantee
> that sum' has the same range as sum.  E.g. the pattern statements added by
> the division optimisation wouldn't have this property.

So my assumption is that no pattern would replace a statement with something
that has higher precision than the C statement. The pattern above is demoted
by the vectorizer based on range information already. My assumption was that
the precision can only ever be smaller, because otherwise the pattern has violated
the semantics of the C code, which would be dangerous if e.g. the expression escapes?

> 
> Is it possible to tell ranger to compute the range of expressions that haven't
> been added to the IL?  (Genuine question, haven't looked.
> It seems pretty powerful though.)

I don't know either. I guess for things it has explicit knowledge about it's
OK, so +w or *w would be fine, but with a random IFN_ it'll likely have to
punt as varying.

I guess it's theoretically possible, but I don't see a case where the
vectorizer would introduce a higher precision, as this would reduce your VF.

The only place I can think of where this would be unsafe is if the division
is introduced as part of another pattern, but in that case the pattern won't
have a related statement, so we'll punt.
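
And just to put concrete numbers on the case at hand (a throwaway check I
wrote while thinking about this, not part of the patch):

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint32_t max_sum = 0xff + 0xff;  /* worst case of pixel[i] + level.  */
    uint32_t adder   = 0x101;        /* 1 + (1 << 8) for 16-bit elements.  */

    /* Even after adding the magic constant we stay well within 16 bits,
       so the highpart additions cannot wrap in the demoted precision.  */
    assert (max_sum + adder <= 0xffff);

    /* And the add/shift form agrees with the division at the extreme.  */
    assert (((max_sum + ((max_sum + adder) >> 8)) >> 8) == max_sum / 0xff);

    return 0;
  }

i.e. there's plenty of headroom for this kind of input, whatever precision
the range information happens to be recorded in.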

Regards,
Tamar

> 
> Thanks,
> Richard


* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 16:09       ` Tamar Christina
@ 2023-02-10 16:25         ` Richard Sandiford
  2023-02-10 16:33           ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 16:25 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 3:57 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> 69
>> >> > de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> vectype,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL, NULL_RTX,
>> >> > -							  NULL_RTX))
>> >> > +	   && TYPE_UNSIGNED (itype)
>> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> > +	   && vectype
>> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* div optimizations using narrowings
>> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> > +       double the precision of x.
>> >> > +
>> >> > +       If we imagine a short as being composed of two blocks of bytes
>> then
>> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent
>> to
>> >> > +       adding 1 to each sub component:
>> >> > +
>> >> > +	    short value of 16-bits
>> >> > +       ┌──────────────┬────────────────┐
>> >> > +       │              │                │
>> >> > +       └──────────────┴────────────────┘
>> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> > +		     │                │
>> >> > +		     │                │
>> >> > +		    +1               +1
>> >> > +
>> >> > +       after the first addition, we have to shift right by 8, and narrow the
>> >> > +       results back to a byte.  Remember that the addition must be done
>> in
>> >> > +       double the precision of the input.  However if we know that
>> >> > + the
>> >> addition
>> >> > +       `x + 257` does not overflow then we can do the operation in
>> >> > + the
>> >> current
>> >> > +       precision.  In which case we don't need the pack and unpacks.  */
>> >> > +      auto wcst = wi::to_wide (cst);
>> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> > +	{
>> >> > +	  wide_int min,max;
>> >> > +	  /* If we're in a pattern we need to find the orginal definition.  */
>> >> > +	  tree op0 = oprnd0;
>> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> > +	    {
>> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +	    }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
>> >>
>> >> > +
>> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> > +	     information we can't perform the optimization.  */
>> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> > +	    {
>> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> > +	      wi::overflow_type ovf;
>> >> > +	      /* We need adder and max in the same precision.  */
>> >> > +	      wide_int zadder
>> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +					  UNSIGNED);
>> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i = 0; i < n; i+=1)
>> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >
>> > And have the addition be done as a 32 bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So In the above the range will correctly be 0x1fe but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype which is derived from the size the vectorizer
>> will perform the operation in.
>> 
>> Gah, missed this first time round, sorry.
>> 
>> Richi would know better than me, but I think it's dangerous to rely on the
>> orig/pattern link for range information.  The end result of a pattern
>> (vect_stmt_to_vectorize) has to have the same type as the lhs of the original
>> statement.  But the other statements in the pattern sequence can do
>> arbitrary things.  Their range isn't predictable from the range of the original
>> statement result.
>> 
>> IIRC, the addition above is converted to:
>> 
>>   a' = (uint16_t) pixel[i]
>>   b' = (uint16_t) level
>>   sum' = a' + b'
>>   sum = (int) sum'
>> 
>> where sum is the direct replacement of "pixel[i] + level", with the same type
>> and range.  The division then uses sum' instead of sum.
>> 
>> But the fact that sum' is part of the same pattern as sum doesn't guarantee
>> that sum' has the same range as sum.  E.g. the pattern statements added by
>> the division optimisation wouldn't have this property.
>
So my assumption is that no pattern would replace a statement with something
that has higher precision than the C statement.  The pattern above is demoted
by the vectorizer based on range information already.  My assumption was that
the precision can only ever be smaller, because otherwise the pattern has violated
the semantics of the C code, which would be dangerous if e.g. the expression escapes?

IMO the difference in precisions was a symptom of the problem rather
than the direct cause.

The point is more that "B = vect_orig_stmt(A)" just says "A is used
somehow in a new calculation of B".  A might equal B (if A replaces B),
or A might be an arbitrary temporary result.  The code above is instead
using it to mean "A equals B, expressed in a different type".  That
happens to be true for sum' in the sequence above, but it isn't true of
non-final pattern statements in general.

In other words, the code hasn't proved that the path from A to
vect_stmt_to_vectorize(B) just involves conversions.

Applying the range of a pattern result to all temporary results in
the pattern could lead to wrong results even when the precisions
are all the same.
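
A small scalar illustration of that point (not from the patch; the input
range matches the fun2 example, everything else is made up purely for
demonstration):

  #include <cstdint>

  /* x is the 16-bit sum of two uint8_t values, so x is in [0, 0x1fe].  */
  static uint16_t div255 (uint16_t x)
  {
    uint16_t t1 = x + 0x101;    /* temporary: range [0x101, 0x2ff]  */
    uint16_t t2 = t1 >> 8;      /* temporary: range [1, 2]          */
    return (x + t2) >> 8;       /* final result: range [0, 2]       */
  }

Projecting the final [0, 2] range onto t1 or t2 would clearly be wrong,
even though all three values have the same 16-bit precision.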

>> Is it possible to tell ranger to compute the range of expressions that haven't
>> been added to the IL?  (Genuine question, haven't looked.
>> It seems pretty powerful though.)
>
> I don't know either, I guess for things it has explicit knowledge about it's ok, so
> +w or *w would be fine, but with a random IFN_ it'll likely have to punt as varying.

Yeah.  But sum' above involves simple arithmetic and conversions,
so IFNs shouldn't be a problem.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 16:25         ` Richard Sandiford
@ 2023-02-10 16:33           ` Tamar Christina
  2023-02-10 16:57             ` Richard Sandiford
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 16:33 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 4:25 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 3:57 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> 69
> >> >> > de2afea139d6 100644
> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> >> *vinfo,
> >> >> >        return pattern_stmt;
> >> >> >      }
> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> vectype,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> NULL_RTX,
> >> >> > -							  NULL_RTX))
> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> > +	   && vectype
> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >      {
> >> >> > -      return NULL;
> >> >> > +      /* div optimizations using narrowings
> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it
> as
> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> > +       double the precision of x.
> >> >> > +
> >> >> > +       If we imagine a short as being composed of two blocks of
> >> >> > + bytes
> >> then
> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> > + equivalent
> >> to
> >> >> > +       adding 1 to each sub component:
> >> >> > +
> >> >> > +	    short value of 16-bits
> >> >> > +       ┌──────────────┬────────────────┐
> >> >> > +       │              │                │
> >> >> > +       └──────────────┴────────────────┘
> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> > +		     │                │
> >> >> > +		     │                │
> >> >> > +		    +1               +1
> >> >> > +
> >> >> > +       after the first addition, we have to shift right by 8, and narrow
> the
> >> >> > +       results back to a byte.  Remember that the addition must
> >> >> > + be done
> >> in
> >> >> > +       double the precision of the input.  However if we know
> >> >> > + that the
> >> >> addition
> >> >> > +       `x + 257` does not overflow then we can do the operation
> >> >> > + in the
> >> >> current
> >> >> > +       precision.  In which case we don't need the pack and unpacks.
> */
> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> > +	{
> >> >> > +	  wide_int min,max;
> >> >> > +	  /* If we're in a pattern we need to find the orginal
> definition.  */
> >> >> > +	  tree op0 = oprnd0;
> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> > +	    {
> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> (stmt_info);
> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> (orig_stmt));
> >> >> > +	    }
> >> >>
> >> >> If this is generally safe (I'm skipping thinking about it in the
> >> >> interests of a quick review :-)), then I think it should be done
> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
> >> >> more general than handling just assignments.
> >> >>
> >> >> > +
> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> > +	     information we can't perform the optimization.  */
> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> > +	    {
> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> > +	      wi::overflow_type ovf;
> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> > +	      wide_int zadder
> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> (max),
> >> >> > +					  UNSIGNED);
> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >>
> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> precisions?
> >> >
> >> > C promotion rules will promote e.g.
> >> >
> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >   for (int i = 0; i < n; i+=1)
> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >
> >> > And have the addition be done as a 32 bit integer.  The vectorizer
> >> > will demote this down to a short, but range information is not
> >> > stored for patterns.  So In the above the range will correctly be
> >> > 0x1fe but the precision will be that of the original expression, so
> >> > 32.  This will be a mismatch with itype which is derived from the
> >> > size the vectorizer
> >> will perform the operation in.
> >>
> >> Gah, missed this first time round, sorry.
> >>
> >> Richi would know better than me, but I think it's dangerous to rely
> >> on the orig/pattern link for range information.  The end result of a
> >> pattern
> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the
> >> original statement.  But the other statements in the pattern sequence
> >> can do arbitrary things.  Their range isn't predictable from the
> >> range of the original statement result.
> >>
> >> IIRC, the addition above is converted to:
> >>
> >>   a' = (uint16_t) pixel[i]
> >>   b' = (uint16_t) level
> >>   sum' = a' + b'
> >>   sum = (int) sum'
> >>
> >> where sum is the direct replacement of "pixel[i] + level", with the
> >> same type and range.  The division then uses sum' instead of sum.
> >>
> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> statements added by the division optimisation wouldn't have this
> property.
> >
> > So my assumption is that no pattern would replace a statement with
> > something That has higher precision than the C statement. The pattern
> > above is demoted By the vectorizer based on range information already.
> > My assumption was that the precision can only ever be smaller, because
> > otherwise the pattern has violated the semantics of the C code, which
> would be dangerous if e.g. the expression escapes?
> 
> IMO the difference in precisions was a symptom of the problem rather than
> the direct cause.
> 
> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
> in a new calculation of B".  A might equal B (if A replaces B), or A might be an
> arbitrary temporary result.  The code above is instead using it to mean "A
> equals B, expressed in a different type".  That happens to be true for sum' in
> the sequence above, but it isn't true of non-final pattern statements in
> general.
> 

Sorry for being dense, but I thought that's exactly what the code does and what I
tried to explain before.  If B isn't a final statement then it won't have an original
statement.  AFAIK, the only place we set the original statement is the root of the
pattern expression.

> In other words, the code hasn't proved that the path from A to
> vect_stmt_to_vectorize(B) just involves conversions.
> 
> Applying the range of a pattern result to all temporary results in the pattern
> could lead to wrong results even when the precisions are all the same.

But maybe I'm misremembering here.  I don't believe we'd ever match in the middle of
a multi-pattern sequence because the additional patterns are not emitted in the instruction
stream.  That's why we have append_pattern_def_seq, which appends the additional
statements to the pattern's def sequence.

Unlike the original seed for the pattern, these aren't materialized until codegen or SLP build.

But I could be wrong...

Tamar

> 
> >> Is it possible to tell ranger to compute the range of expressions
> >> that haven't been added to the IL?  (Genuine question, haven't looked.
> >> It seems pretty powerful though.)
> >
> > I don't know either, I guess for things it has explicit knowledge
> > about it's ok, so
> > +w or *w would be fine, but with a random IFN_ it'll likely have to punt as
> varying.
> 
> Yeah.  But sum' above involves simple arithmetic and conversions, so IFNs
> shouldn't be a problem.
> 
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 16:33           ` Tamar Christina
@ 2023-02-10 16:57             ` Richard Sandiford
  2023-02-10 17:01               ` Richard Sandiford
  2023-02-10 17:14               ` Tamar Christina
  0 siblings, 2 replies; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 16:57 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 4:25 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >> >
>> >> >>
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> >> 69
>> >> >> > de2afea139d6 100644
>> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> >> *vinfo,
>> >> >> >        return pattern_stmt;
>> >> >> >      }
>> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> >> vectype,
>> >> >> > -							  wi::to_wide
>> (cst),
>> >> >> > -							  NULL,
>> NULL_RTX,
>> >> >> > -							  NULL_RTX))
>> >> >> > +	   && TYPE_UNSIGNED (itype)
>> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> >> > +	   && vectype
>> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >> >      {
>> >> >> > -      return NULL;
>> >> >> > +      /* div optimizations using narrowings
>> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it
>> as
>> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> >> > +       double the precision of x.
>> >> >> > +
>> >> >> > +       If we imagine a short as being composed of two blocks of
>> >> >> > + bytes
>> >> then
>> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
>> >> >> > + equivalent
>> >> to
>> >> >> > +       adding 1 to each sub component:
>> >> >> > +
>> >> >> > +	    short value of 16-bits
>> >> >> > +       ┌──────────────┬────────────────┐
>> >> >> > +       │              │                │
>> >> >> > +       └──────────────┴────────────────┘
>> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> >> > +		     │                │
>> >> >> > +		     │                │
>> >> >> > +		    +1               +1
>> >> >> > +
>> >> >> > +       after the first addition, we have to shift right by 8, and narrow
>> the
>> >> >> > +       results back to a byte.  Remember that the addition must
>> >> >> > + be done
>> >> in
>> >> >> > +       double the precision of the input.  However if we know
>> >> >> > + that the
>> >> >> addition
>> >> >> > +       `x + 257` does not overflow then we can do the operation
>> >> >> > + in the
>> >> >> current
>> >> >> > +       precision.  In which case we don't need the pack and unpacks.
>> */
>> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> > +	{
>> >> >> > +	  wide_int min,max;
>> >> >> > +	  /* If we're in a pattern we need to find the orginal
>> definition.  */
>> >> >> > +	  tree op0 = oprnd0;
>> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> >> > +	    {
>> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
>> (stmt_info);
>> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
>> (orig_stmt));
>> >> >> > +	    }
>> >> >>
>> >> >> If this is generally safe (I'm skipping thinking about it in the
>> >> >> interests of a quick review :-)), then I think it should be done
>> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
>> >> >> more general than handling just assignments.
>> >> >>
>> >> >> > +
>> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> >> > +	     information we can't perform the optimization.  */
>> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> >> > +	    {
>> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> > +	      wi::overflow_type ovf;
>> >> >> > +	      /* We need adder and max in the same precision.  */
>> >> >> > +	      wide_int zadder
>> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
>> (max),
>> >> >> > +					  UNSIGNED);
>> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >>
>> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> precisions?
>> >> >
>> >> > C promotion rules will promote e.g.
>> >> >
>> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >   for (int i = 0; i < n; i+=1)
>> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >> >
>> >> > And have the addition be done as a 32 bit integer.  The vectorizer
>> >> > will demote this down to a short, but range information is not
>> >> > stored for patterns.  So In the above the range will correctly be
>> >> > 0x1fe but the precision will be that of the original expression, so
>> >> > 32.  This will be a mismatch with itype which is derived from the
>> >> > size the vectorizer
>> >> will perform the operation in.
>> >>
>> >> Gah, missed this first time round, sorry.
>> >>
>> >> Richi would know better than me, but I think it's dangerous to rely
>> >> on the orig/pattern link for range information.  The end result of a
>> >> pattern
>> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the
>> >> original statement.  But the other statements in the pattern sequence
>> >> can do arbitrary things.  Their range isn't predictable from the
>> >> range of the original statement result.
>> >>
>> >> IIRC, the addition above is converted to:
>> >>
>> >>   a' = (uint16_t) pixel[i]
>> >>   b' = (uint16_t) level
>> >>   sum' = a' + b'
>> >>   sum = (int) sum'
>> >>
>> >> where sum is the direct replacement of "pixel[i] + level", with the
>> >> same type and range.  The division then uses sum' instead of sum.
>> >>
>> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> statements added by the division optimisation wouldn't have this
>> property.
>> >
>> > So my assumption is that no pattern would replace a statement with
>> > something That has higher precision than the C statement. The pattern
>> > above is demoted By the vectorizer based on range information already.
>> > My assumption was that the precision can only ever be smaller, because
>> > otherwise the pattern has violated the semantics of the C code, which
>> would be dangerous if e.g. the expression escapes?
>> 
>> IMO the difference in precisions was a symptom of the problem rather than
>> the direct cause.
>> 
>> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
>> in a new calculation of B".  A might equal B (if A replaces B), or A might be an
>> arbitrary temporary result.  The code above is instead using it to mean "A
>> equals B, expressed in a different type".  That happens to be true for sum' in
>> the sequence above, but it isn't true of non-final pattern statements in
>> general.
>> 
>
> Sorry for being dense, but I though that's exactly what the code does and what I
> tried explain before. If B isn't a final statement than it won't have an original statement.
> AFAIK, the only places we set original statement is the root of the pattern expression.

Final pattern statements (those not in DEF_SEQ) always have the same
type and value as the original statements.  We wouldn't see mismatched
precisions if we were only looking at final pattern statements.

Like you say, the 16-bit addition didn't exist before vectorisation
(it was a 32-bit addition instead).  So to make things type-correct,
the 32-bit addition:

   A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)

is replaced with:

   DEF_SEQ:
     A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
   A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)

(using different notation from before, just to confuse things).
Here, A2 is the final pattern statement for A and A1 is just a
temporary result.  sum == sum'.

Later, we do a similar thing for the division itself.  We have:

   B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)

We realise that this can be a 16-bit division, so (IIRC) we use
vect_look_through_possible_promotion on sum to find the best
starting point.  This should give:

   DEF_SEQ:
     B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
   B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)

Both changes are done by vect_widened_op_tree.

We then apply the division pattern to B1.  B1 is a nonfinal pattern
statement that uses the result (tmp) of another nonfinal pattern
statement (A1).

The code does:

	  if (is_pattern_stmt_p (stmt_info))
	    {
	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
	    }

is_pattern_stmt_p is true for both A1 and A2, and STMT_VINFO_RELATED_STMT
is A for both A1 and A2.  I would expect:

  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));

(testing for a final pattern) to fail for the motivating example.
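
Or, written as a hypothetical helper predicate (sketch only, using the
helpers already named in this thread; not a committed fix):

  /* True iff STMT_INFO is the statement that directly replaces its
     original scalar statement, rather than a temporary in the pattern's
     def sequence.  */
  static bool
  final_pattern_stmt_p (stmt_vec_info stmt_info)
  {
    if (!is_pattern_stmt_p (stmt_info))
      return false;
    stmt_vec_info orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
    return orig_stmt && stmt_info == vect_stmt_to_vectorize (orig_stmt);
  }

For the motivating example this would be false for both A1 and B1, which
is why reusing the original statement's range there is not obviously safe.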

Thanks,
Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 16:57             ` Richard Sandiford
@ 2023-02-10 17:01               ` Richard Sandiford
  2023-02-10 17:14               ` Tamar Christina
  1 sibling, 0 replies; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 17:01 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Richard Sandiford <richard.sandiford@arm.com> writes:
> Final pattern statements (those not in DEF_SEQ) always have the same
> type and value as the original statements.  We wouldn't see mismatched
> precisions if we were only looking at final pattern statements.
>
> Like you say, the 16-bit addition didn't exist before vectorisation
> (it was a 32-bit addition instead).  So to make things type-correct,
> the 32-bit addition:
>
>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>
> is replaced with:
>
>    DEF_SEQ:
>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>
> (using different notation from before, just to confuse things).
> Here, A2 is the final pattern statement for A and A1 is just a
> temporary result.  sum == sum'.
>
> Later, we do a similar thing for the division itself.  We have:
>
>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>
> We realise that this can be a 16-bit division, so (IIRC) we use
> vect_look_through_possible_promotion on sum to find the best
> starting point.  This should give:
>
>    DEF_SEQ:
>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>
> Both changes are done by vect_widened_op_tree.

Eh, I meant vect_recog_over_widening_pattern.

Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 16:57             ` Richard Sandiford
  2023-02-10 17:01               ` Richard Sandiford
@ 2023-02-10 17:14               ` Tamar Christina
  2023-02-10 18:12                 ` Richard Sandiford
  1 sibling, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-10 17:14 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 4:57 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 4:25 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Friday, February 10, 2023 3:57 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >> >
> >> >> >>
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> >> 69
> >> >> >> > de2afea139d6 100644
> >> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
> (vec_info
> >> >> *vinfo,
> >> >> >> >        return pattern_stmt;
> >> >> >> >      }
> >> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> >> vectype,
> >> >> >> > -							  wi::to_wide
> >> (cst),
> >> >> >> > -							  NULL,
> >> NULL_RTX,
> >> >> >> > -							  NULL_RTX))
> >> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> >> > +	   && vectype
> >> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >> >      {
> >> >> >> > -      return NULL;
> >> >> >> > +      /* div optimizations using narrowings
> >> >> >> > +       we can do the division e.g. shorts by 255 faster by
> >> >> >> > + calculating it
> >> as
> >> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> >> > +       double the precision of x.
> >> >> >> > +
> >> >> >> > +       If we imagine a short as being composed of two blocks
> >> >> >> > + of bytes
> >> >> then
> >> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> >> > + equivalent
> >> >> to
> >> >> >> > +       adding 1 to each sub component:
> >> >> >> > +
> >> >> >> > +	    short value of 16-bits
> >> >> >> > +       ┌──────────────┬────────────────┐
> >> >> >> > +       │              │                │
> >> >> >> > +       └──────────────┴────────────────┘
> >> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> >> > +		     │                │
> >> >> >> > +		     │                │
> >> >> >> > +		    +1               +1
> >> >> >> > +
> >> >> >> > +       after the first addition, we have to shift right by
> >> >> >> > + 8, and narrow
> >> the
> >> >> >> > +       results back to a byte.  Remember that the addition
> >> >> >> > + must be done
> >> >> in
> >> >> >> > +       double the precision of the input.  However if we
> >> >> >> > + know that the
> >> >> >> addition
> >> >> >> > +       `x + 257` does not overflow then we can do the
> >> >> >> > + operation in the
> >> >> >> current
> >> >> >> > +       precision.  In which case we don't need the pack and
> unpacks.
> >> */
> >> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> >> > +	{
> >> >> >> > +	  wide_int min,max;
> >> >> >> > +	  /* If we're in a pattern we need to find the orginal
> >> definition.  */
> >> >> >> > +	  tree op0 = oprnd0;
> >> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> >> > +	    {
> >> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> >> (stmt_info);
> >> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> >> (orig_stmt));
> >> >> >> > +	    }
> >> >> >>
> >> >> >> If this is generally safe (I'm skipping thinking about it in
> >> >> >> the interests of a quick review :-)), then I think it should be
> >> >> >> done in vect_get_range_info instead.  Using gimple_get_lhs
> >> >> >> would be more general than handling just assignments.
> >> >> >>
> >> >> >> > +
> >> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> >> > +	     information we can't perform the optimization.  */
> >> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> >> > +	    {
> >> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> >> > +	      wi::overflow_type ovf;
> >> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> >> > +	      wide_int zadder
> >> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> >> (max),
> >> >> >> > +					  UNSIGNED);
> >> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >> >>
> >> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> >> precisions?
> >> >> >
> >> >> > C promotion rules will promote e.g.
> >> >> >
> >> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >> >   for (int i = 0; i < n; i+=1)
> >> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> >
> >> >> > And have the addition be done as a 32 bit integer.  The
> >> >> > vectorizer will demote this down to a short, but range
> >> >> > information is not stored for patterns.  So In the above the
> >> >> > range will correctly be 0x1fe but the precision will be that of
> >> >> > the original expression, so 32.  This will be a mismatch with
> >> >> > itype which is derived from the size the vectorizer
> >> >> will perform the operation in.
> >> >>
> >> >> Gah, missed this first time round, sorry.
> >> >>
> >> >> Richi would know better than me, but I think it's dangerous to
> >> >> rely on the orig/pattern link for range information.  The end
> >> >> result of a pattern
> >> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of
> >> >> the original statement.  But the other statements in the pattern
> >> >> sequence can do arbitrary things.  Their range isn't predictable
> >> >> from the range of the original statement result.
> >> >>
> >> >> IIRC, the addition above is converted to:
> >> >>
> >> >>   a' = (uint16_t) pixel[i]
> >> >>   b' = (uint16_t) level
> >> >>   sum' = a' + b'
> >> >>   sum = (int) sum'
> >> >>
> >> >> where sum is the direct replacement of "pixel[i] + level", with
> >> >> the same type and range.  The division then uses sum' instead of sum.
> >> >>
> >> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> >> statements added by the division optimisation wouldn't have this
> >> property.
> >> >
> >> > So my assumption is that no pattern would replace a statement with
> >> > something That has higher precision than the C statement. The
> >> > pattern above is demoted By the vectorizer based on range information
> already.
> >> > My assumption was that the precision can only ever be smaller,
> >> > because otherwise the pattern has violated the semantics of the C
> >> > code, which
> >> would be dangerous if e.g. the expression escapes?
> >>
> >> IMO the difference in precisions was a symptom of the problem rather
> >> than the direct cause.
> >>
> >> The point is more that "B = vect_orig_stmt(A)" just says "A is used
> >> somehow in a new calculation of B".  A might equal B (if A replaces
> >> B), or A might be an arbitrary temporary result.  The code above is
> >> instead using it to mean "A equals B, expressed in a different type".
> >> That happens to be true for sum' in the sequence above, but it isn't
> >> true of non-final pattern statements in general.
> >>
> >
> > Sorry for being dense, but I though that's exactly what the code does
> > and what I tried explain before. If B isn't a final statement than it won't
> have an original statement.
> > AFAIK, the only places we set original statement is the root of the pattern
> expression.
> 
> Final pattern statements (those not in DEF_SEQ) always have the same type
> and value as the original statements.  We wouldn't see mismatched
> precisions if we were only looking at final pattern statements.

We would, because the entire problem is that pattern statements have no ranges.
Ranger does not track them after they have been created.  This could of course
trivially be solved if we told ranger about the demotion we did, but we don't do so
at the moment; it will just return varying here.  This is the root cause of the issue.
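
For reference, this is roughly what the query looks like today (sketch only;
the exact entry point and the name of the 16-bit sum are illustrative):

  /* Asking the range query about an SSA name that is only defined by a
     pattern statement gives VARYING, because ranger never analysed the
     demoted statement.  */
  int_range_max r;
  if (!get_range_query (cfun)->range_of_expr (r, tmp)
      || r.varying_p ())
    /* No usable range for the demoted 16-bit sum.  */;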

> 
> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
> bit addition instead).  So to make things type-correct, the 32-bit addition:
> 
>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
> 
> is replaced with:
> 
>    DEF_SEQ:
>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
> 
> (using different notation from before, just to confuse things).
> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
> sum == sum'.
> 
> Later, we do a similar thing for the division itself.  We have:
> 
>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
> 
> We realise that this can be a 16-bit division, so (IIRC) we use
> vect_look_through_possible_promotion on sum to find the best starting
> point.  This should give:
> 
>    DEF_SEQ:
>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
> 
> Both changes are done by vect_widened_op_tree.
> 
> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
> that uses the result (tmp) of another nonfinal pattern statement (A1).
> 
> The code does:
> 
> 	  if (is_pattern_stmt_p (stmt_info))
> 	    {
> 	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> 	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> 		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> 	    }
> 
> is_pattern_stmt_p is true for both A1 and A2, and
> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
> 
>   gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
> 
> (testing for a final pattern) to fail for the motivating example.
> 

I think we're actually saying the same thing.  All I'm saying is that looking
at the original statement is a safe alternative, as it will conservatively
overestimate to VARYING or give a wider range than the pattern would have.

I'm saying it's conservatively safe, though not very accurate.  The alternative would be
to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.

But for this to work the general widening patterns would also have to update the range
information.

I think where we're disagreeing is that I consider looking at the original scalar statement
a safe, conservative estimate.  It will fail in some cases, but that's a missed optimization,
not a mis-optimization.

In any case, if you disagree I don't really see a way forward aside from making this its own
pattern and running it before the overwidening pattern.

Alternatively I'd love to know how to proceed.

Tamar

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 17:14               ` Tamar Christina
@ 2023-02-10 18:12                 ` Richard Sandiford
  2023-02-10 18:34                   ` Richard Biener
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-10 18:12 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 4:57 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Friday, February 10, 2023 4:25 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> -----Original Message-----
>> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> >> div-bitmask by using new optabs [PR108583]
>> >> >>
>> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> >> >> 69
>> >> >> >> > de2afea139d6 100644
>> >> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
>> (vec_info
>> >> >> *vinfo,
>> >> >> >> >        return pattern_stmt;
>> >> >> >> >      }
>> >> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> >> >> vectype,
>> >> >> >> > -							  wi::to_wide
>> >> (cst),
>> >> >> >> > -							  NULL,
>> >> NULL_RTX,
>> >> >> >> > -							  NULL_RTX))
>> >> >> >> > +	   && TYPE_UNSIGNED (itype)
>> >> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> >> >> > +	   && vectype
>> >> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >> >> >      {
>> >> >> >> > -      return NULL;
>> >> >> >> > +      /* div optimizations using narrowings
>> >> >> >> > +       we can do the division e.g. shorts by 255 faster by
>> >> >> >> > + calculating it
>> >> as
>> >> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> >> >> > +       double the precision of x.
>> >> >> >> > +
>> >> >> >> > +       If we imagine a short as being composed of two blocks
>> >> >> >> > + of bytes
>> >> >> then
>> >> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
>> >> >> >> > + equivalent
>> >> >> to
>> >> >> >> > +       adding 1 to each sub component:
>> >> >> >> > +
>> >> >> >> > +	    short value of 16-bits
>> >> >> >> > +       ┌──────────────┬────────────────┐
>> >> >> >> > +       │              │                │
>> >> >> >> > +       └──────────────┴────────────────┘
>> >> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> >> >> > +		     │                │
>> >> >> >> > +		     │                │
>> >> >> >> > +		    +1               +1
>> >> >> >> > +
>> >> >> >> > +       after the first addition, we have to shift right by
>> >> >> >> > + 8, and narrow
>> >> the
>> >> >> >> > +       results back to a byte.  Remember that the addition
>> >> >> >> > + must be done
>> >> >> in
>> >> >> >> > +       double the precision of the input.  However if we
>> >> >> >> > + know that the
>> >> >> >> addition
>> >> >> >> > +       `x + 257` does not overflow then we can do the
>> >> >> >> > + operation in the
>> >> >> >> current
>> >> >> >> > +       precision.  In which case we don't need the pack and
>> unpacks.
>> >> */
>> >> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> >> > +	{
>> >> >> >> > +	  wide_int min,max;
>> >> >> >> > +	  /* If we're in a pattern we need to find the orginal
>> >> definition.  */
>> >> >> >> > +	  tree op0 = oprnd0;
>> >> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> >> >> > +	    {
>> >> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
>> >> (stmt_info);
>> >> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
>> >> (orig_stmt));
>> >> >> >> > +	    }
>> >> >> >>
>> >> >> >> If this is generally safe (I'm skipping thinking about it in
>> >> >> >> the interests of a quick review :-)), then I think it should be
>> >> >> >> done in vect_get_range_info instead.  Using gimple_get_lhs
>> >> >> >> would be more general than handling just assignments.
>> >> >> >>
>> >> >> >> > +
>> >> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> >> >> > +	     information we can't perform the optimization.  */
>> >> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> >> >> > +	    {
>> >> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> >> > +	      wi::overflow_type ovf;
>> >> >> >> > +	      /* We need adder and max in the same precision.  */
>> >> >> >> > +	      wide_int zadder
>> >> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
>> >> (max),
>> >> >> >> > +					  UNSIGNED);
>> >> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >> >>
>> >> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> >> precisions?
>> >> >> >
>> >> >> > C promotion rules will promote e.g.
>> >> >> >
>> >> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >> >   for (int i = 0; i < n; i+=1)
>> >> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >> >> >
>> >> >> > And have the addition be done as a 32 bit integer.  The
>> >> >> > vectorizer will demote this down to a short, but range
>> >> >> > information is not stored for patterns.  So In the above the
>> >> >> > range will correctly be 0x1fe but the precision will be that of
>> >> >> > the original expression, so 32.  This will be a mismatch with
>> >> >> > itype which is derived from the size the vectorizer
>> >> >> will perform the operation in.
>> >> >>
>> >> >> Gah, missed this first time round, sorry.
>> >> >>
>> >> >> Richi would know better than me, but I think it's dangerous to
>> >> >> rely on the orig/pattern link for range information.  The end
>> >> >> result of a pattern
>> >> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of
>> >> >> the original statement.  But the other statements in the pattern
>> >> >> sequence can do arbitrary things.  Their range isn't predictable
>> >> >> from the range of the original statement result.
>> >> >>
>> >> >> IIRC, the addition above is converted to:
>> >> >>
>> >> >>   a' = (uint16_t) pixel[i]
>> >> >>   b' = (uint16_t) level
>> >> >>   sum' = a' + b'
>> >> >>   sum = (int) sum'
>> >> >>
>> >> >> where sum is the direct replacement of "pixel[i] + level", with
>> >> >> the same type and range.  The division then uses sum' instead of sum.
>> >> >>
>> >> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> >> statements added by the division optimisation wouldn't have this
>> >> property.
>> >> >
>> >> > So my assumption is that no pattern would replace a statement with
>> >> > something That has higher precision than the C statement. The
>> >> > pattern above is demoted By the vectorizer based on range information
>> already.
>> >> > My assumption was that the precision can only ever be smaller,
>> >> > because otherwise the pattern has violated the semantics of the C
>> >> > code, which
>> >> would be dangerous if e.g. the expression escapes?
>> >>
>> >> IMO the difference in precisions was a symptom of the problem rather
>> >> than the direct cause.
>> >>
>> >> The point is more that "B = vect_orig_stmt(A)" just says "A is used
>> >> somehow in a new calculation of B".  A might equal B (if A replaces
>> >> B), or A might be an arbitrary temporary result.  The code above is
>> >> instead using it to mean "A equals B, expressed in a different type".
>> >> That happens to be true for sum' in the sequence above, but it isn't
>> >> true of non-final pattern statements in general.
>> >>
>> >
>> > Sorry for being dense, but I though that's exactly what the code does
>> > and what I tried explain before. If B isn't a final statement than it won't
>> have an original statement.
>> > AFAIK, the only places we set original statement is the root of the pattern
>> expression.
>> 
>> Final pattern statements (those not in DEF_SEQ) always have the same type
>> and value as the original statements.  We wouldn't see mismatched
>> precisions if we were only looking at final pattern statements.
>
> We would because the entire problem is that pattern statement have no ranges.
> Ranger does not track them after they have been created.  This could of course
> Trivially be solved if we tell ranger about the demotion we did, but we don't do so
> at the moment. It will just return varying here.  This is the root cause of the issue.
>
>> 
>> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
>> bit addition instead).  So to make things type-correct, the 32-bit addition:
>> 
>>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>> 
>> is replaced with:
>> 
>>    DEF_SEQ:
>>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>> 
>> (using different notation from before, just to confuse things).
>> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
>> sum == sum'.
>> 
>> Later, we do a similar thing for the division itself.  We have:
>> 
>>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>> 
>> We realise that this can be a 16-bit division, so (IIRC) we use
>> vect_look_through_possible_promotion on sum to find the best starting
>> point.  This should give:
>> 
>>    DEF_SEQ:
>>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>> 
>> Both changes are done by vect_widened_op_tree.
>> 
>> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
>> that uses the result (tmp) of another nonfinal pattern statement (A1).
>> 
>> The code does:
>> 
>> 	  if (is_pattern_stmt_p (stmt_info))
>> 	    {
>> 	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> 	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> 		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> 	    }
>> 
>> is_pattern_stmt_p is true for both A1 and A2, and
>> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
>> 
>>   gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
>> 
>> (testing for a final pattern) to fail for the motivating example.
>> 
>
> I think we're actually saying the same thing. I believe all I'm saying is that looking
> at the original statement is a safe alternative as it conservatively will overestimate
> to VARYING or give a wider range than the pattern would have.

Hmm, but you said "If B isn't a final statement than it won't have an
original statement. AFAIK, the only places we set original statement
is the root of the pattern expression."  My point was that that isn't true.
All statements in the pattern have an original statement, not just the root.
And we're specifically relying on that for the motivating example to work.

> I'm saying it's conservatively safe, while not overly accurate.  The alternative would be
> to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.
>
> But for this to work the general widening pattern also have to update the range information.
>
> I think where we're disagreeing is that I think looking at the original scalar statement is a safe
> conservative estimate.  It will fail in some cases, but that's a missed optimization, not a miss-optimization.

Yeah, like you say, I disagree that it's conservatively correct.
It means that we're hoping (without proving) that the only things
between stmt_info and the final pattern statement are conversions.
I don't think there's any reason in principle why that must hold.

What would be conservatively correct would be to start from the
final pattern statement and work our way down to the value that
is actually being used.  That seems a bit convoluted though,
so I'd prefer not to do that...
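
Just to make the shape of that concrete (a rough sketch, not a proposal;
the walk and the helper usage here are assumptions):

  /* Start from the final pattern statement for ORIG_STMT and accept
     OPRND0 only if every step down to it is a conversion; only then is
     it safe to reuse the original scalar lhs's range.  */
  stmt_vec_info final_info = vect_stmt_to_vectorize (orig_stmt);
  gimple *g = STMT_VINFO_STMT (final_info);
  tree val = gimple_get_lhs (g);
  while (val != oprnd0
	 && is_gimple_assign (g)
	 && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (g)))
    {
      val = gimple_assign_rhs1 (g);
      if (TREE_CODE (val) != SSA_NAME)
	break;
      g = SSA_NAME_DEF_STMT (val);
    }
  bool proved = (val == oprnd0);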

> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
> running it before the overwidening pattern.

I think we should look to see if ranger can be persuaded to provide the
range of the 16-bit addition, even though the statement that produces it
isn't part of a BB.  It shouldn't matter that the addition originally
came from a 32-bit one: the range follows directly from the ranges of
the operands (i.e. the fact that the operands are the results of
widening conversions).
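
For this particular shape the arithmetic is simple enough to spell out
(sketch only; numbers from the fun2 example, variable names illustrative):

  /* Both operands of the 16-bit addition are zero-extended uint8_t
     values, so their ranges are known without the statement being in a
     BB at all.  */
  wide_int op_max = wi::mask (8, false, 16);                   /* 0x00ff  */
  wi::overflow_type ovf;
  wide_int sum_max = wi::add (op_max, op_max, UNSIGNED, &ovf); /* 0x01fe  */
  wide_int adder = wi::uhwi (0x101, 16);                       /* 257     */
  wi::add (sum_max, adder, UNSIGNED, &ovf);
  /* ovf == wi::OVF_NONE: 0x1fe + 0x101 = 0x2ff fits in 16 bits, so the
     highpart-addition form cannot overflow for this input range.  */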

Thanks,
Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 18:12                 ` Richard Sandiford
@ 2023-02-10 18:34                   ` Richard Biener
  2023-02-10 20:58                     ` Andrew MacLeod
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Biener @ 2023-02-10 18:34 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, jlaw,
	Andrew MacLeod



> On 10.02.2023, at 19:12, Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Friday, February 10, 2023 4:57 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>>> by using new optabs [PR108583]
>>> 
>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>> Sent: Friday, February 10, 2023 4:25 PM
>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>> div-bitmask by using new optabs [PR108583]
>>>>> 
>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>> -----Original Message-----
>>>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>>>> Sent: Friday, February 10, 2023 3:57 PM
>>>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>>>> div-bitmask by using new optabs [PR108583]
>>>>>>> 
>>>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>>>>> a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>>>>>>>>> 69
>>>>>>>>>> de2afea139d6 100644
>>>>>>>>>> --- a/gcc/tree-vect-patterns.cc
>>>>>>>>>> +++ b/gcc/tree-vect-patterns.cc
>>>>>>>>>> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
>>> (vec_info
>>>>>>> *vinfo,
>>>>>>>>>>       return pattern_stmt;
>>>>>>>>>>     }
>>>>>>>>>>   else if ((cst = uniform_integer_cst_p (oprnd1))
>>>>>>>>>> -       && targetm.vectorize.can_special_div_by_const (rhs_code,
>>>>>>>>> vectype,
>>>>>>>>>> -                              wi::to_wide
>>>>> (cst),
>>>>>>>>>> -                              NULL,
>>>>> NULL_RTX,
>>>>>>>>>> -                              NULL_RTX))
>>>>>>>>>> +       && TYPE_UNSIGNED (itype)
>>>>>>>>>> +       && rhs_code == TRUNC_DIV_EXPR
>>>>>>>>>> +       && vectype
>>>>>>>>>> +       && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>>>>>>>>>> +                          OPTIMIZE_FOR_SPEED))
>>>>>>>>>>     {
>>>>>>>>>> -      return NULL;
>>>>>>>>>> +      /* div optimizations using narrowings
>>>>>>>>>> +       we can do the division e.g. shorts by 255 faster by
>>>>>>>>>> + calculating it
>>>>> as
>>>>>>>>>> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>>>>>>>>>> +       double the precision of x.
>>>>>>>>>> +
>>>>>>>>>> +       If we imagine a short as being composed of two blocks
>>>>>>>>>> + of bytes
>>>>>>> then
>>>>>>>>>> +       adding 257 or 0b0000_0001_0000_0001 to the number is
>>>>>>>>>> + equivalent
>>>>>>> to
>>>>>>>>>> +       adding 1 to each sub component:
>>>>>>>>>> +
>>>>>>>>>> +        short value of 16-bits
>>>>>>>>>> +       ┌──────────────┬────────────────┐
>>>>>>>>>> +       │              │                │
>>>>>>>>>> +       └──────────────┴────────────────┘
>>>>>>>>>> +     8-bit part1 ▲  8-bit part2   ▲
>>>>>>>>>> +             │                │
>>>>>>>>>> +             │                │
>>>>>>>>>> +            +1               +1
>>>>>>>>>> +
>>>>>>>>>> +       after the first addition, we have to shift right by
>>>>>>>>>> + 8, and narrow
>>>>> the
>>>>>>>>>> +       results back to a byte.  Remember that the addition
>>>>>>>>>> + must be done
>>>>>>> in
>>>>>>>>>> +       double the precision of the input.  However if we
>>>>>>>>>> + know that the
>>>>>>>>> addition
>>>>>>>>>> +       `x + 257` does not overflow then we can do the
>>>>>>>>>> + operation in the
>>>>>>>>> current
>>>>>>>>>> +       precision.  In which case we don't need the pack and
>>> unpacks.
>>>>> */
>>>>>>>>>> +      auto wcst = wi::to_wide (cst);
>>>>>>>>>> +      int pow = wi::exact_log2 (wcst + 1);
>>>>>>>>>> +      if (pow == (int) (element_precision (vectype) / 2))
>>>>>>>>>> +    {
>>>>>>>>>> +      wide_int min,max;
>>>>>>>>>> +      /* If we're in a pattern we need to find the orginal
>>>>> definition.  */
>>>>>>>>>> +      tree op0 = oprnd0;
>>>>>>>>>> +      gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>>>>>>>>>> +      stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>>>>>>>>>> +      if (is_pattern_stmt_p (stmt_info))
>>>>>>>>>> +        {
>>>>>>>>>> +          auto orig_stmt = STMT_VINFO_RELATED_STMT
>>>>> (stmt_info);
>>>>>>>>>> +          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>>>>>>>> +        op0 = gimple_assign_lhs (STMT_VINFO_STMT
>>>>> (orig_stmt));
>>>>>>>>>> +        }
>>>>>>>>> 
>>>>>>>>> If this is generally safe (I'm skipping thinking about it in
>>>>>>>>> the interests of a quick review :-)), then I think it should be
>>>>>>>>> done in vect_get_range_info instead.  Using gimple_get_lhs
>>>>>>>>> would be more general than handling just assignments.
>>>>>>>>> 
>>>>>>>>>> +
>>>>>>>>>> +      /* Check that no overflow will occur.  If we don't have range
>>>>>>>>>> +         information we can't perform the optimization.  */
>>>>>>>>>> +      if (vect_get_range_info (op0, &min, &max))
>>>>>>>>>> +        {
>>>>>>>>>> +          wide_int one = wi::to_wide (build_one_cst (itype));
>>>>>>>>>> +          wide_int adder = wi::add (one, wi::lshift (one, pow));
>>>>>>>>>> +          wi::overflow_type ovf;
>>>>>>>>>> +          /* We need adder and max in the same precision.  */
>>>>>>>>>> +          wide_int zadder
>>>>>>>>>> +        = wide_int_storage::from (adder, wi::get_precision
>>>>> (max),
>>>>>>>>>> +                      UNSIGNED);
>>>>>>>>>> +          wi::add (max, zadder, UNSIGNED, &ovf);
>>>>>>>>> 
>>>>>>>>> Could you explain this a bit more?  When do we have mismatched
>>>>>>>>> precisions?
>>>>>>>> 
>>>>>>>> C promotion rules will promote e.g.
>>>>>>>> 
>>>>>>>> void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>>>>>>>>  for (int i = 0; i < n; i+=1)
>>>>>>>>    pixel[i] = (pixel[i] + level) / 0xff; }
>>>>>>>> 
>>>>>>>> And have the addition be done as a 32 bit integer.  The
>>>>>>>> vectorizer will demote this down to a short, but range
>>>>>>>> information is not stored for patterns.  So In the above the
>>>>>>>> range will correctly be 0x1fe but the precision will be that of
>>>>>>>> the original expression, so 32.  This will be a mismatch with
>>>>>>>> itype which is derived from the size the vectorizer
>>>>>>> will perform the operation in.
>>>>>>> 
>>>>>>> Gah, missed this first time round, sorry.
>>>>>>> 
>>>>>>> Richi would know better than me, but I think it's dangerous to
>>>>>>> rely on the orig/pattern link for range information.  The end
>>>>>>> result of a pattern
>>>>>>> (vect_stmt_to_vectorize) has to have the same type as the lhs of
>>>>>>> the original statement.  But the other statements in the pattern
>>>>>>> sequence can do arbitrary things.  Their range isn't predictable
>>>>>>> from the range of the original statement result.
>>>>>>> 
>>>>>>> IIRC, the addition above is converted to:
>>>>>>> 
>>>>>>>  a' = (uint16_t) pixel[i]
>>>>>>>  b' = (uint16_t) level
>>>>>>>  sum' = a' + b'
>>>>>>>  sum = (int) sum'
>>>>>>> 
>>>>>>> where sum is the direct replacement of "pixel[i] + level", with
>>>>>>> the same type and range.  The division then uses sum' instead of sum.
>>>>>>> 
>>>>>>> But the fact that sum' is part of the same pattern as sum doesn't
>>>>>>> guarantee that sum' has the same range as sum.  E.g. the pattern
>>>>>>> statements added by the division optimisation wouldn't have this
>>>>> property.
>>>>>> 
>>>>>> So my assumption is that no pattern would replace a statement with
>>>>>> something that has higher precision than the C statement. The
>>>>>> pattern above is demoted by the vectorizer based on range information
>>> already.
>>>>>> My assumption was that the precision can only ever be smaller,
>>>>>> because otherwise the pattern has violated the semantics of the C
>>>>>> code, which
>>>>> would be dangerous if e.g. the expression escapes?
>>>>> 
>>>>> IMO the difference in precisions was a symptom of the problem rather
>>>>> than the direct cause.
>>>>> 
>>>>> The point is more that "B = vect_orig_stmt(A)" just says "A is used
>>>>> somehow in a new calculation of B".  A might equal B (if A replaces
>>>>> B), or A might be an arbitrary temporary result.  The code above is
>>>>> instead using it to mean "A equals B, expressed in a different type".
>>>>> That happens to be true for sum' in the sequence above, but it isn't
>>>>> true of non-final pattern statements in general.
>>>>> 
>>>> 
>>>> Sorry for being dense, but I thought that's exactly what the code does
>>>> and what I tried to explain before. If B isn't a final statement then it won't
>>> have an original statement.
>>>> AFAIK, the only places we set original statement is the root of the pattern
>>> expression.
>>> 
>>> Final pattern statements (those not in DEF_SEQ) always have the same type
>>> and value as the original statements.  We wouldn't see mismatched
>>> precisions if we were only looking at final pattern statements.
>> 
>> We would, because the entire problem is that pattern statements have no ranges.
>> Ranger does not track them after they have been created.  This could of course
>> trivially be solved if we tell ranger about the demotion we did, but we don't do so
>> at the moment. It will just return varying here.  This is the root cause of the issue.
>> 
>>> 
>>> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
>>> bit addition instead).  So to make things type-correct, the 32-bit addition:
>>> 
>>>   A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>>> 
>>> is replaced with:
>>> 
>>>   DEF_SEQ:
>>>     A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>>>   A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>>> 
>>> (using different notation from before, just to confuse things).
>>> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
>>> sum == sum'.
>>> 
>>> Later, we do a similar thing for the division itself.  We have:
>>> 
>>>   B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>>> 
>>> We realise that this can be a 16-bit division, so (IIRC) we use
>>> vect_look_through_possible_promotion on sum to find the best starting
>>> point.  This should give:
>>> 
>>>   DEF_SEQ:
>>>     B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>>>   B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>>> 
>>> Both changes are done by vect_widened_op_tree.
>>> 
>>> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
>>> that uses the result (tmp) of another nonfinal pattern statement (A1).
>>> 
>>> The code does:
>>> 
>>>      if (is_pattern_stmt_p (stmt_info))
>>>        {
>>>          auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>>>          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>        op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>>>        }
>>> 
>>> is_pattern_stmt_p is true for both A1 and A2, and
>>> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
>>> 
>>>  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
>>> 
>>> (testing for a final pattern) to fail for the motivating example.
>>> 
>> 
>> I think we're actually saying the same thing. I believe all I'm saying is that looking
>> at the original statement is a safe alternative as it conservatively will overestimate
>> to VARYING or give a wider range than the pattern would have.
> 
> Hmm, but you said "If B isn't a final statement than it won't have an
> original statement. AFAIK, the only places we set original statement
> is the root of the pattern expression."  My point was that that isn't true.
> All statements in the pattern have an original statement, not just the root.
> And we're specifically relying on that for the motivating example to work.
> 
>> I'm saying it's conservatively safe, while not overly accurate.  The alternative would be
>> to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.
>> 
>> But for this to work the general widening pattern would also have to update the range information.
>> 
>> I think where we're disagreeing is that I think looking at the original scalar statement is a safe
>> conservative estimate.  It will fail in some cases, but that's a missed optimization, not a mis-optimization.
> 
> Yeah, like you say, I disagree that it's conservatively correct.
> It means that we're hoping (without proving) that the only things
> between stmt_info and the final pattern statement are conversions.
> I don't think there's any reason in principle why that must hold.
> 
> What would be conservatively correct would be to start from the
> final pattern statement and work our way down to the value that
> is actually being used.  That seems a bit convoluted though,
> so I'd prefer not to do that...
> 
>> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
>> running it before the overwidening pattern.
> 
> I think we should look to see if ranger can be persuaded to provide the
> range of the 16-bit addition, even though the statement that produces it
> isn't part of a BB.  It shouldn't matter that the addition originally
> came from a 32-bit one: the range follows directly from the ranges of
> the operands (i.e. the fact that the operands are the results of
> widening conversions).

I think you can ask ranger about operations on names defined in the IL, so you can work yourself through the sequence of operations in the pattern sequence to compute ranges on their defs (and possibly even store them in the SSA info).  You just need to pick the correct ranger API for this… Andrew CCed.
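
Something along these lines, maybe (a rough sketch only; whether
range_of_stmt copes with statements that are not in a BB is exactly the
open question, and stmt_vinfo and the iteration details are illustrative):

  gimple_ranger ranger;
  int_range_max r;
  for (gimple_stmt_iterator gsi
	 = gsi_start (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo));
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *def = gsi_stmt (gsi);
      if (ranger.range_of_stmt (r, def, gimple_get_lhs (def)))
	/* r is computed from the operand ranges; it could be
	   recorded on the def here.  */;
    }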

Richard 

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 18:34                   ` Richard Biener
@ 2023-02-10 20:58                     ` Andrew MacLeod
  2023-02-13  9:54                       ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-10 20:58 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, jlaw


On 2/10/23 13:34, Richard Biener wrote:
>
>>> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
>>> running it before the overwidening pattern.
>> I think we should look to see if ranger can be persuaded to provide the
>> range of the 16-bit addition, even though the statement that produces it
>> isn't part of a BB.  It shouldn't matter that the addition originally
>> came from a 32-bit one: the range follows directly from the ranges of
>> the operands (i.e. the fact that the operands are the results of
>> widening conversions).
> I think you can ask ranger on operations on names defined in the IL, so you can work yourself through the sequence of operations in the pattern sequence to compute ranges on their defs (and possibly even store them in the SSA info).  You just need to pick the correct ranger API for this…. Andrew CCed
>
>
It's not clear to me what's being asked...

Expressions don't need to be in the IL to do range calculations.  I
believe we support arbitrary tree expressions via range_of_expr.

If you have 32-bit ranges that you want to do 16-bit addition on, you
can also cast those ranges to a 16-bit type,

my32bitrange.cast (my16bittype);

then invoke range-ops directly via getting the handler:

handler = range_op_handler (PLUS_EXPR, 16bittype_tree);
if (handler)
    handler->fold (result, my16bittype, mycasted32bitrange, 
myothercasted32bitrange)

There are higher-level APIs if what you have on hand is closer to IL
than random ranges.

Describe exactly what it is you want to do... and I'll try to direct you 
to the best way to do it.

Andrew




^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 20:58                     ` Andrew MacLeod
@ 2023-02-13  9:54                       ` Tamar Christina
  2023-02-15 12:51                         ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-13  9:54 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Friday, February 10, 2023 8:59 PM
> To: Richard Biener <rguenther@suse.de>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; Tamar Christina via Gcc-
> patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/10/23 13:34, Richard Biener wrote:
> >
> >>> In any case, if you disagree I don’t' really see a way forward aside
> >>> from making this its own pattern running it before the overwidening
> pattern.
> >> I think we should look to see if ranger can be persuaded to provide
> >> the range of the 16-bit addition, even though the statement that
> >> produces it isn't part of a BB.  It shouldn't matter that the
> >> addition originally came from a 32-bit one: the range follows
> >> directly from the ranges of the operands (i.e. the fact that the
> >> operands are the results of widening conversions).
> > I think you can ask ranger on operations on names defined in the IL,
> > so you can work yourself through the sequence of operations in the
> > pattern sequence to compute ranges on their defs (and possibly even
> > store them in the SSA info).  You just need to pick the correct ranger
> > API for this…. Andrew CCed
> >
> >
> Its not clear to me whats being asked...
> 
> Expressions don't need to be in the IL to do range calculations.. I believe we
> support arbitrary tree expressions via range_of_expr.
> 
> if you have 32 bit ranges that you want to do 16 bit addition on, you can also
> cast those ranges to a 16bit type,
> 
> my32bitrange.cast (my16bittype);
> 
> then invoke range-ops directly via getting the handler:
> 
> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
>     handler->fold (result, my16bittype, mycasted32bitrange,
> myothercasted32bitrange)
> 
> There are higher level APIs if what you have on hand is closer to IL than
> random ranges
> 
> Describe exactly what it is you want to do... and I'll try to direct you to the
> best way to do it.

The vectorizer has a pattern matcher that runs at the start of vectorization on the scalar code.
This pattern matcher can replace one or more statements with alternative ones;
these can be either existing tree codes or new internal functions.

One of the patterns here is an over-widening detection pattern, which reduces the
precision that an operation is to be done in during vectorization.

Another one detects widening operations, replacing e.g. PLUS_EXPR with WIDEN_PLUS_EXPR.

These can be chained, so e.g. a widening addition done on ints can be reduced to a widening addition
done on shorts.

The question is whether, given the new expression that the vectorizer has
created, ranger can tell what the range is.  get_range_query fails, presumably because
it has no idea about the new operations created and also doesn't know about any new IFNs.
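
(For reference, the query that fails is essentially of this shape (a sketch;
oprnd0/stmt stand for the value and statement being asked about):

  int_range_max r;
  get_range_query (cfun)->range_of_expr (r, oprnd0, stmt);

and for the names the patterns created it just comes back as VARYING.)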

Thanks,
Tamar

> 
> Andrew
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-13  9:54                       ` Tamar Christina
@ 2023-02-15 12:51                         ` Tamar Christina
  2023-02-15 16:05                           ` Andrew MacLeod
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-15 12:51 UTC (permalink / raw)
  To: Tamar Christina, Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

> > >>> In any case, if you disagree I don’t' really see a way forward
> > >>> aside from making this its own pattern running it before the
> > >>> overwidening
> > pattern.
> > >> I think we should look to see if ranger can be persuaded to provide
> > >> the range of the 16-bit addition, even though the statement that
> > >> produces it isn't part of a BB.  It shouldn't matter that the
> > >> addition originally came from a 32-bit one: the range follows
> > >> directly from the ranges of the operands (i.e. the fact that the
> > >> operands are the results of widening conversions).
> > > I think you can ask ranger on operations on names defined in the IL,
> > > so you can work yourself through the sequence of operations in the
> > > pattern sequence to compute ranges on their defs (and possibly even
> > > store them in the SSA info).  You just need to pick the correct
> > > ranger API for this…. Andrew CCed
> > >
> > >
> > Its not clear to me whats being asked...
> >
> > Expressions don't need to be in the IL to do range calculations.. I
> > believe we support arbitrary tree expressions via range_of_expr.
> >
> > if you have 32 bit ranges that you want to do 16 bit addition on, you
> > can also cast those ranges to a 16bit type,
> >
> > my32bitrange.cast (my16bittype);
> >
> > then invoke range-ops directly via getting the handler:
> >
> > handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
> >     handler->fold (result, my16bittype, mycasted32bitrange,
> > myothercasted32bitrange)
> >
> > There are higher level APIs if what you have on hand is closer to IL
> > than random ranges
> >
> > Describe exactly what it is you want to do... and I'll try to direct
> > you to the best way to do it.
> 
> The vectorizer has  a pattern matcher that runs at startup on the scalar code.
> This pattern matcher can replace one or more statements with alternative
> ones, these can be either existing tree_code or new internal functions.
> 
> One of the patterns here is a overwidening detection pattern which reduces
> the precision that an operation is to be done in during vectorization.
> 
> Another one is widening multiplication, which replaced PLUS_EXPR with
> WIDEN_PLUS_EXPR.
> 
> These can be chained, so e.g. a widening addition done on ints can be
> reduced to a widen addition done on shorts.
> 
> The question is whether given the new expression that the vectorizer has
> created whether ranger can tell what the precision is.  get_range_query fails
> because presumably it has no idea about the new operations created  and
> also doesn't know about any new IFNs.

Hi,

I have been trying to use ranger as requested. I've tried:

	  gimple_ranger ranger;
	  int_range_max r;
	  /* Check that no overflow will occur.  If we don't have range
	     information we can't perform the optimization.  */
	  if (ranger.range_of_expr (r, oprnd0, stmt))
	    {
	      wide_int max = r.upper_bound ();
                    ....

Which works for non-patterns, but still doesn't work for patterns.
On a stmt:
patt_27 = (_3) w+ (level_15(D));

it gives me a range:

$2 = {
  <wide_int_storage> = {
    val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] = 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] = 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] = 0x7fff95bd79f8},
    len = 0x1,
    precision = 0x10
  },
  members of generic_wide_int<wide_int_storage>:
  static is_sign_extended = 0x1
}

The precision is fine, but range seems to be -1?

Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?

Thanks,
Tamar

> 
> Thanks,
> Tamar
> 
> >
> > Andrew
> >
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 12:51                         ` Tamar Christina
@ 2023-02-15 16:05                           ` Andrew MacLeod
  2023-02-15 17:13                             ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-15 16:05 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw


On 2/15/23 07:51, Tamar Christina wrote:
>>>>>> In any case, if you disagree I don’t' really see a way forward
>>>>>> aside from making this its own pattern running it before the
>>>>>> overwidening
>>> pattern.
>>>>> I think we should look to see if ranger can be persuaded to provide
>>>>> the range of the 16-bit addition, even though the statement that
>>>>> produces it isn't part of a BB.  It shouldn't matter that the
>>>>> addition originally came from a 32-bit one: the range follows
>>>>> directly from the ranges of the operands (i.e. the fact that the
>>>>> operands are the results of widening conversions).
>>>> I think you can ask ranger on operations on names defined in the IL,
>>>> so you can work yourself through the sequence of operations in the
>>>> pattern sequence to compute ranges on their defs (and possibly even
>>>> store them in the SSA info).  You just need to pick the correct
>>>> ranger API for this…. Andrew CCed
>>>>
>>>>
>>> Its not clear to me whats being asked...
>>>
>>> Expressions don't need to be in the IL to do range calculations.. I
>>> believe we support arbitrary tree expressions via range_of_expr.
>>>
>>> if you have 32 bit ranges that you want to do 16 bit addition on, you
>>> can also cast those ranges to a 16bit type,
>>>
>>> my32bitrange.cast (my16bittype);
>>>
>>> then invoke range-ops directly via getting the handler:
>>>
>>> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
>>>      handler->fold (result, my16bittype, mycasted32bitrange,
>>> myothercasted32bitrange)
>>>
>>> There are higher level APIs if what you have on hand is closer to IL
>>> than random ranges
>>>
>>> Describe exactly what it is you want to do... and I'll try to direct
>>> you to the best way to do it.
>> The vectorizer has  a pattern matcher that runs at startup on the scalar code.
>> This pattern matcher can replace one or more statements with alternative
>> ones, these can be either existing tree_code or new internal functions.
>>
>> One of the patterns here is a overwidening detection pattern which reduces
>> the precision that an operation is to be done in during vectorization.
>>
>> Another one is widening multiplication, which replaced PLUS_EXPR with
>> WIDEN_PLUS_EXPR.
>>
>> These can be chained, so e.g. a widening addition done on ints can be
>> reduced to a widen addition done on shorts.
>>
>> The question is whether given the new expression that the vectorizer has
>> created whether ranger can tell what the precision is.  get_range_query fails
>> because presumably it has no idea about the new operations created  and
>> also doesn't know about any new IFNs.
> Hi,
>
> I have been trying to use ranger as requested. I've tried:
>
> 	  gimple_ranger ranger;
> 	  int_range_max r;
> 	  /* Check that no overflow will occur.  If we don't have range
> 	     information we can't perform the optimization.  */
> 	  if (ranger.range_of_expr (r, oprnd0, stmt))
> 	    {
> 	      wide_int max = r.upper_bound ();
>                      ....
>
> Which works for non-patterns, but still doesn't work for patterns.
> On a stmt:
> patt_27 = (_3) w+ (level_15(D));
>
> it gives me a range:
>
> $2 = {
>    <wide_int_storage> = {
>      val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] = 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] = 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] = 0x7fff95bd79f8},
>      len = 0x1,
>      precision = 0x10
>    },
>    members of generic_wide_int<wide_int_storage>:
>    static is_sign_extended = 0x1
> }
>
> The precision is fine, but range seems to be -1?
>
> Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?

It's easier to see the range if you dump it, i.e.:

p r.dump(stderr)

I'm way behind the curve on exactly what's going on.  I'm not sure how the
above 2 things relate.  I presume $2 is 'max'?  I have no context:
what did you expect the range of _3 to be?

We have no entry in range-ops.cc for a WIDEN_PLUS_EXPR, so ranger would
only give back VARYING for that, no doubt.  However, I doubt it would be
too difficult to write the fold_range() method for it.

It's unclear to me what you mean by "it doesn't work on patterns", so let's
do some basics.

You have a stmt  "patt_27 = (_3) w+ (level_15(D));"

I gather that's a WIDEN_PLUS_EXPR, and if I read it right, patt_27 is of a
type that is twice as wide as _3, and will contain the value "_3 +
level_15"?

Your query above is asking for the range of _3 at this stmt in the IL.

And you are trying to determine whether the expression "_3 + level_15" 
would still fit in the type of _3, and thus you could avoid the WIDEN_* 
paradigm and revert to a simple plus?

And you also want to be able to do this for expressions which are not 
currently in the IL?

----  If that is all true, then I would suggest one of 2 possible routes.
1) we add WIDEN_PLUS_EXPR to range-ops.  This involves writing
fold_range() for it whereby it would create a range of a type double the 
precision of _3, then take the 2 ranges for op1 and op2, cast them to 
this new type and add them.

2) manually doing the same thing.   But if you are going to manually do
it, we might as well put that same code into fold_range, and then the entire
ecosystem will benefit.

Once the operation can be performed in range-ops, you can cast the new
range back to the type of _3 and see if it's fully represented, i.e.:

int_range_max r1, r2;
if (ranger.range_of_stmt (r1, stmt))
   {
     r2 = r1;
     r2.cast (TREE_TYPE (_3));
     r2.cast (TREE_TYPE (patt_27));
     if (r1 == r2)
       // No info was lost casting back and forth, so r1 must fit into 
type of _3

That should work within the IL.  And if you want to do the same
thing outside of the IL, you have to come up with the values you want to
use for op1 and op2, and replace the ranger query with a direct range-op fold:

range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27));
if (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
   {
     // same casting song and dance


If you don't want to go through this process, in theory, you could try
simply adding _3 and level_15 in their own precision, and if max/min
aren't +INF/-INF then you can probably assume there is no overflow?
In which case, the path you are on above for within a stmt
should work:

	  gimple_ranger ranger;
	  int_range_max r0, r1, def;
	  /* Check that no overflow will occur.  If we don't have range
	     information we can't perform the optimization.  */
	  if (ranger.range_of_expr (r0, oprnd0, stmt) && ranger.range_of_expr (r1, oprnd1, stmt))
	    {
	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
	      if (handler && handler->fold_range (def, r0, r1))
		// examine def.upper_bound() and def.lower_bound()

Am I grasping some of the issue here?

Andrew




^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 16:05                           ` Andrew MacLeod
@ 2023-02-15 17:13                             ` Tamar Christina
  2023-02-15 17:50                               ` Andrew MacLeod
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-15 17:13 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

> On 2/15/23 07:51, Tamar Christina wrote:
> >>>>>> In any case, if you disagree I don’t' really see a way forward
> >>>>>> aside from making this its own pattern running it before the
> >>>>>> overwidening
> >>> pattern.
> >>>>> I think we should look to see if ranger can be persuaded to
> >>>>> provide the range of the 16-bit addition, even though the
> >>>>> statement that produces it isn't part of a BB.  It shouldn't
> >>>>> matter that the addition originally came from a 32-bit one: the
> >>>>> range follows directly from the ranges of the operands (i.e. the
> >>>>> fact that the operands are the results of widening conversions).
> >>>> I think you can ask ranger on operations on names defined in the
> >>>> IL, so you can work yourself through the sequence of operations in
> >>>> the pattern sequence to compute ranges on their defs (and possibly
> >>>> even store them in the SSA info).  You just need to pick the
> >>>> correct ranger API for this…. Andrew CCed
> >>>>
> >>>>
> >>> Its not clear to me whats being asked...
> >>>
> >>> Expressions don't need to be in the IL to do range calculations.. I
> >>> believe we support arbitrary tree expressions via range_of_expr.
> >>>
> >>> if you have 32 bit ranges that you want to do 16 bit addition on,
> >>> you can also cast those ranges to a 16bit type,
> >>>
> >>> my32bitrange.cast (my16bittype);
> >>>
> >>> then invoke range-ops directly via getting the handler:
> >>>
> >>> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
> >>>      handler->fold (result, my16bittype, mycasted32bitrange,
> >>> myothercasted32bitrange)
> >>>
> >>> There are higher level APIs if what you have on hand is closer to IL
> >>> than random ranges
> >>>
> >>> Describe exactly what it is you want to do... and I'll try to direct
> >>> you to the best way to do it.
> >> The vectorizer has  a pattern matcher that runs at startup on the scalar
> code.
> >> This pattern matcher can replace one or more statements with
> >> alternative ones, these can be either existing tree_code or new internal
> functions.
> >>
> >> One of the patterns here is a overwidening detection pattern which
> >> reduces the precision that an operation is to be done in during
> vectorization.
> >>
> >> Another one is widening multiplication, which replaced PLUS_EXPR with
> >> WIDEN_PLUS_EXPR.
> >>
> >> These can be chained, so e.g. a widening addition done on ints can be
> >> reduced to a widen addition done on shorts.
> >>
> >> The question is whether given the new expression that the vectorizer
> >> has created whether ranger can tell what the precision is.
> >> get_range_query fails because presumably it has no idea about the new
> >> operations created  and also doesn't know about any new IFNs.
> > Hi,
> >
> > I have been trying to use ranger as requested. I've tried:
> >
> > 	  gimple_ranger ranger;
> > 	  int_range_max r;
> > 	  /* Check that no overflow will occur.  If we don't have range
> > 	     information we can't perform the optimization.  */
> > 	  if (ranger.range_of_expr (r, oprnd0, stmt))
> > 	    {
> > 	      wide_int max = r.upper_bound ();
> >                      ....
> >
> > Which works for non-patterns, but still doesn't work for patterns.
> > On a stmt:
> > patt_27 = (_3) w+ (level_15(D));
> >
> > it gives me a range:
> >
> > $2 = {
> >    <wide_int_storage> = {
> >      val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] =
> 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] =
> 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] =
> 0x7fff95bd79f8},
> >      len = 0x1,
> >      precision = 0x10
> >    },
> >    members of generic_wide_int<wide_int_storage>:
> >    static is_sign_extended = 0x1
> > }
> >
> > The precision is fine, but range seems to be -1?
> >
> > Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?
> 
> Its easier to see the range if you dump it.. ie:
> 
> p r.dump(stderr)
> 
> Im way behind the curve on exactly whats going on.  Im not sure how the
> above 2 things relate..  I presume $2 is is 'max'?  I have no context, what did
> you expect the range of _3 to be?

Yes, $2 is max, and the expected range is 0x1fe as it's unsigned addition.
I'll expand below.

> 
> We have no entry in range-ops.cc for a WIDEN_PLUS_EXPR,  so ranger would
> only give back a VARYING for that no doubt.. however I doubt it would be
> too difficult to write the fold_range() method for it.
> 
> Its unclear to me what you mean by it doesnt work on patterns. so lets do
> some basics.
> 
> You have a stmt  "patt_27 = (_3) w+ (level_15(D));"
> 
> I gather thats a WIDEN_PLUS_EXPR, and if I read it right, patt_27 is a type
> that is twice as wide as _3, and will contain the value "_3 + level_15"?
> 
> You query above is asking for the range of _3 at this stmt in the IL.
> 
> And you are trying to determine whether the expression "_3 + level_15"
> would still fit in the type of _3, and thus you could avoid the WIDEN_*
> paradigm and revert to a simply plus?
> 
> And you also want to be able to do this for expressions which are not
> currently in the IL?

A pattern is an alternative IL that the vectorizer introduces for when it vectorizes
a particular scalar statement (or statements).  The scalar statement is not replaced
in the original scalar IL, but it is replaced in the IL the vectorizer uses.

The example I'm working with here is this

#define N 16
void fun2(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < n; i+=1)
    pixel[i] = (pixel[i] + level) / 0xff;
}

Where the C promotion rules promote the operands to int.  However, when
vectorizing we try to increase the VF of the loop, that is, do the operation in the
smallest type possible.  In this case it is safe to do the operation as a short.

So the vectorizer demotes 

  _28 = (int) level_14(D);
  _4 = (int) _3;
  _6 = _4 + _28;
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

Into

  _28 = (short) level_14(D);
  _4 = (short) _3;
  _6 = _4 + _28;
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

This is done in the scalar pattern matcher the vectorizer has, based on range information.
The new instructions replace the old ones in the vectorizer IL.

There is then a second pattern matcher that runs because some targets have
operations that can perform a widening during a mathematical operation.
There are many such operations: +w, *w, >>w, <<w, etc.  Some are tree codes,
others are represented as internal function calls.

This second pattern replaces the above into:

  _6 = _3 +w level_14(D);
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

Thus removing the need to promote before the addition.  What I'm working on
is an optimization for division, so I am after what the range of _6 is.  In my
example, oprnd0 is the first operand of the division.

I need to know the range of _6 because, based on the range, we can optimize this
division into something much more efficient.
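
(Concretely, for the uint8_t example above: _3 and level_14 are both in
[0, 255], so _6 is in [0, 0x1fe].  That is what makes the trick from the
patch safe in 16 bits: with x <= 0x1fe, x + 257 <= 0x2ff, so
(x + ((x + 257) >> 8)) >> 8 never overflows the 16-bit type.)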

> 
> ----  IF that is all true, then I would suggest one of 2 possible routes.
> 1) we add WIDEN_PLUS_EXPR to range-ops.  THIs involves writing
> fold_range() for it whereby it would create a range of a type double the
> precision of _3, then take the 2 ranges for op1 and op2, cast them to this new
> type and add them.
> 

Right, so I guess none of the widening operations are currently there.  Can you
point me in the right direction of where I need to add them?

> 2) manually doing the same thing.   BUt if you are goignto manually do it, we
> might as well put that same code into fold_range then the entire ecosystem
> will benefit.
> 
> Once the operation can be performed in range ops, you can cast the new
> range back to the type of _3 and see if its fully represented. ie
> 
> int_range_max r1, r2
> if (ranger.range_of_stmt (r1, stmt))
>    {
>      r2 = r1;
>      r2.cast (TREE_TYPE (_3));
>      r2.cast (TREE_TYPE (patt_27));
>      if (r1 == r2)
>        // No info was lost casting back and forth, so r1 must fit into type of _3
> 
> That should work for within the IL.  And if you want to do the same thing
> outside of the IL, you have to come up with the values you want to use for
> op1 and op2, replace the ranger query with a direct range-opfold:
> 
> range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27)); if
> (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
>    {
>      // same casting song and dance
> 
> 

Just for my own understanding, does the fold_range here update the information
in the IL? Or is it just for this computation? So when I hit this pattern again it
recomputes it?

> If you dont want to go thru this process, in theory, you could try simply
> adding _3 and level_15 in their own precision, and if max/min aren't +INF/-
> INF then you can probably assume there is no overflow?
> in which case, all you do is the path you are on above for within a stmt should
> work:
> 
> 	  gimple_ranger ranger;
> 	  int_range_max r0, r1, def;
> 	  /* Check that no overflow will occur.  If we don't have range
> 	     information we can't perform the optimization.  */
> 	  if (ranger.range_of_expr (r0, oprnd0, stmt) &&
> ranger.range_of_expr (r1,oprnd1, stmt)
> 	    {
> 	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
> 	      if (handler && handler->fold_range (def, r0, r1))
> 		// examine def.upper_bound() and def.lower_bound()
> 
> Am I grasping some of the issue here?

You are, and this was helpful.  I would imagine that Richard wouldn't accept
doing it locally though.  So I guess, if it's safe to do for this PR fix, I can add the basic
widening operations to range-ops if you can show me where.

Thanks,
Tamar

> 
> Andrew
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 17:13                             ` Tamar Christina
@ 2023-02-15 17:50                               ` Andrew MacLeod
  2023-02-15 18:42                                 ` Andrew MacLeod
  2023-02-22 13:06                                 ` Tamar Christina
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-15 17:50 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 6043 bytes --]


On 2/15/23 12:13, Tamar Christina wrote:
>> On 2/15/23 07:51, Tamar Christina wrote:
>>
Thanks, lots of useful context there.


> This second pattern replaces the above into:
>
>    _6 = _3 +w level_14(D);
>    _7 = _6 / 255;
>    _8 = (unsigned char) _7;
>
> Thus removing the need to promote before the addition.  What I'm working on
> is an optimization for division.  So I am after what the range of _6 is. oprnd0 in my
> example is the 1rst operand of the division.
>
> I need to know the range of_6 because based on the range we can optimize this
> division into something much more efficient.
>
>> ----  IF that is all true, then I would suggest one of 2 possible routes.
>> 1) we add WIDEN_PLUS_EXPR to range-ops.  THIs involves writing
>> fold_range() for it whereby it would create a range of a type double the
>> precision of _3, then take the 2 ranges for op1 and op2, cast them to this new
>> type and add them.
>>
> Right, so I guess none of the widening operations are currently there.  Can you
> point me in the right direction of where I need to add them?

sure, details below


>> 2) manually doing the same thing.   BUt if you are goignto manually do it, we
>> might as well put that same code into fold_range then the entire ecosystem
>> will benefit.
>>
>> Once the operation can be performed in range ops, you can cast the new
>> range back to the type of _3 and see if its fully represented. ie
>>
>> int_range_max r1, r2
>> if (ranger.range_of_stmt (r1, stmt))
>>     {
>>       r2 = r1;
>>       r2.cast (TREE_TYPE (_3));
>>       r2.cast (TREE_TYPE (patt_27));
>>       if (r1 == r2)
>>         // No info was lost casting back and forth, so r1 must fit into type of _3
>>
>> That should work for within the IL.  And if you want to do the same thing
>> outside of the IL, you have to come up with the values you want to use for
>> op1 and op2, replace the ranger query with a direct range-opfold:
>>
>> range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27)); if
>> (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
>>     {
>>       // same casting song and dance
>>
>>
> Just for my own understanding, does the fold_range here update the information
> in the IL? Or is it just for this computation? So when I hit this pattern again it
> recomputes it?

fold_range does not update anything.  It just performs the calculation, 
and passes like VRP etc are responsible for if, and when, that is 
reflected in some way/transformation in the IL. The IL is primarily used 
for context to look back and try to determine the range of the inputs to 
the statement.   That's why, if you aren't using an expression in the IL,
you need to provide the ranges yourself.   By default, you end up with
the full range for the type, i.e. VARYING, but if ranger can determine
through branches and such that it's something different, it will.  E.g., if
your case is preceded by

if (_3 < 20 && level_15 < 20)
   //  the range of _3 will be [0, 19] and _15 will be [0, 19], and the
   //  addition will end up with a range of [0, 38]

In your case, I see the ranges are the range of the 8-bit type: [irange]
int [0, 255] NONZERO 0xff



>> If you dont want to go thru this process, in theory, you could try simply
>> adding _3 and level_15 in their own precision, and if max/min aren't +INF/-
>> INF then you can probably assume there is no overflow?
>> in which case, all you do is the path you are on above for within a stmt should
>> work:
>>
>> 	  gimple_ranger ranger;
>> 	  int_range_max r0, r1, def;
>> 	  /* Check that no overflow will occur.  If we don't have range
>> 	     information we can't perform the optimization.  */
>> 	  if (ranger.range_of_expr (r0, oprnd0, stmt) &&
>> ranger.range_of_expr (r1,oprnd1, stmt)
>> 	    {
>> 	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
>> 	      if (handler && handler->fold_range (def, r0, r1))
>> 		// examine def.upper_bound() and def.lower_bound()
>>
>> Am I grasping some of the issue here?
> You are, and this was helpful.  I would imagine that Richard wouldn't accept me
> to do it locally though.  So I guess if it's safe to do for this PR fix, I can add the basic
> widening operations to ranger-ops if you can show me where.
>

All the range-op integer code is in gcc/range-op.cc.  As this is a basic
binary operation, you should be able to get away with implementing a
single routine, wi_fold (), which adds 2 wide-int bounds together and
returns a result.  This is the implementation for operator_plus.

void
operator_plus::wi_fold (irange &r, tree type,
                         const wide_int &lh_lb, const wide_int &lh_ub,
                         const wide_int &rh_lb, const wide_int &rh_ub) const
{
   wi::overflow_type ov_lb, ov_ub;
   signop s = TYPE_SIGN (type);
   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
   value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
}


You shouldn't have to do any of the overflow stuff at the end; just take
the 2 sets of wide ints, double their precision to start, add them
together (it can't possibly overflow, right?) and then return an
int_range<2> with those bounds...
i.e.

void
operator_plus::wi_fold (irange &r, tree type,
                         const wide_int &lh_lb, const wide_int &lh_ub,
                         const wide_int &rh_lb, const wide_int &rh_ub) const
{
   wi::overflow_type ov_lb, ov_ub;
   signop s = TYPE_SIGN (type);

   // Do whatever wideint magic is required to do these adds in higher
   // precision
   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);

   r = int_range<2> (type, new_lb, new_ub);
}


The operator needs to be registered; I've attached the skeleton for it.
You should just have to finish implementing wi_fold().

in theory :-)
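
For concreteness, the "wideint magic" above could be filled in roughly like
this (an untested sketch; extending with the sign of the result type is an
assumption, and mixed-sign operands would need more thought):

void
operator_widen_plus::wi_fold (irange &r, tree type,
			      const wide_int &lh_lb, const wide_int &lh_ub,
			      const wide_int &rh_lb, const wide_int &rh_ub) const
{
  wi::overflow_type ov_lb, ov_ub;
  signop s = TYPE_SIGN (type);

  // Extend the operand bounds to twice their precision, i.e. to the
  // precision of the widened result type.
  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);

  // The adds cannot overflow in the doubled precision.
  wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
  wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);

  r = int_range<2> (type, new_lb, new_ub);
}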


[-- Attachment #2: tam.diff --]
[-- Type: text/x-patch, Size: 1227 bytes --]

diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index 5c67bce6d3a..c425c496c25 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -1730,6 +1730,29 @@ operator_minus::op2_range (irange &r, tree type,
   return fold_range (r, type, op1, lhs);
 }
 
+class operator_widen_plus : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus;
+
+void
+operator_widen_plus::wi_fold (irange &r, tree type,
+			      const wide_int &lh_lb,
+			      const wide_int &lh_ub,
+			      const wide_int &rh_lb,
+			      const wide_int &rh_ub) const
+{
+  wi::overflow_type ov_lb, ov_ub;
+  signop s = TYPE_SIGN (type);
+  wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
+  wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
+  r = int_range<2> (type, new_lb, new_ub);
+}
 
 class operator_pointer_diff : public range_operator
 {
@@ -4505,6 +4528,7 @@ integral_table::integral_table ()
   set (ABSU_EXPR, op_absu);
   set (NEGATE_EXPR, op_negate);
   set (ADDR_EXPR, op_addr);
+  set (WIDEN_PLUS_EXPR, op_widen_plus);
 }
 
 // Instantiate a range op table for pointer operations.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 17:50                               ` Andrew MacLeod
@ 2023-02-15 18:42                                 ` Andrew MacLeod
  2023-02-22 12:51                                   ` Tamar Christina
  2023-02-22 16:41                                   ` Andrew MacLeod
  2023-02-22 13:06                                 ` Tamar Christina
  1 sibling, 2 replies; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-15 18:42 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw


On 2/15/23 12:50, Andrew MacLeod wrote:
>
> On 2/15/23 12:13, Tamar Christina wrote:
>>> On 2/15/23 07:51, Tamar Christina wrote:
> void
> operator_plus::wi_fold (irange &r, tree type,
>                         const wide_int &lh_lb, const wide_int &lh_ub,
>                         const wide_int &rh_lb, const wide_int &rh_ub) 
> const
> {
>   wi::overflow_type ov_lb, ov_ub;
>   signop s = TYPE_SIGN (type);
>
>   // Do whatever wideint magic is required to do this adds in higher 
> precision
>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>
>   r = int_range<2> (type, new_lb, new_ub);
> }
>
>
> The operator needs to be registered, I've attached the skeleton for 
> it.  you should just have to finish implementing wi_fold().
>
> in theory :-)
>
You also mentioned earlier that some were tree codes, some were internal 
function calls?  We have some initial support for built-in functions, 
but I am not familiar with all the various forms they can take.  We 
currently support CFN_ functions in

   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()

Basically this is part of a "gimple_range_op_handler" wrapper for
range-ops which can provide a range-ops class for stmts that don't map
to a binary or unary form, such as built-in functions.

If you get to the point where you need this for a builtin function, I 
can help you through that too.  Although someone may have to also help 
me through what differentiates the different kinds of internal function 
:-)    I presume they are all similar in some way.

Andrew



^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 18:42                                 ` Andrew MacLeod
@ 2023-02-22 12:51                                   ` Tamar Christina
  2023-02-22 16:41                                   ` Andrew MacLeod
  1 sibling, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-22 12:51 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw



> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Wednesday, February 15, 2023 6:43 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/15/23 12:50, Andrew MacLeod wrote:
> >
> > On 2/15/23 12:13, Tamar Christina wrote:
> >>> On 2/15/23 07:51, Tamar Christina wrote:
> > void
> > operator_plus::wi_fold (irange &r, tree type,
> >                         const wide_int &lh_lb, const wide_int &lh_ub,
> >                         const wide_int &rh_lb, const wide_int &rh_ub)
> > const {
> >   wi::overflow_type ov_lb, ov_ub;
> >   signop s = TYPE_SIGN (type);
> >
> >   // Do whatever wideint magic is required to do this adds in higher
> > precision
> >   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
> >   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> >
> >   r = int_range<2> (type, new_lb, new_ub); }
> >
> >
> > The operator needs to be registered, I've attached the skeleton for
> > it.  you should just have to finish implementing wi_fold().
> >
> > in theory :-)
> >
> You also mentioned earlier that some were tree codes, some were internal
> function calls?  We have some initial support for built in functions,
> but I am not familiar with all the various forms they can take.  We
> currently support CFN_ functions in

Ah, then this should work.  CFN is a helper class that combines compiler builtins
and internal functions in one structure.  So with support for CFN_, both should
be supported.  Probably just a matter of adding the new ops then.

> 
>    gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
> 
> Basically this is part of a "gimple_range_op_handler"  wrapper for
> range-ops which can provide a range-ops class for stmts that don't map
> to a binary or unary form.. such as built in functions.
> 
> If you get to the point where you need this for a builtin function, I
> can help you through that too.  Although someone may have to also help
> me through what differentiates the different kinds of internal function
> :-)    I presume they are all similar in some way.

Will do! I'm hoping to address some range-related vectorizer missed optimizations with this in
GCC 14, so I'll be back 😊

Cheers,
Tamar
> 
> Andrew
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 17:50                               ` Andrew MacLeod
  2023-02-15 18:42                                 ` Andrew MacLeod
@ 2023-02-22 13:06                                 ` Tamar Christina
  2023-02-22 15:19                                   ` Andrew MacLeod
  1 sibling, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-22 13:06 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

Hi Andrew,

> 
> all the range-op integer code is in gcc/range-op.cc.  As this is a basic
> binary operation, you should be able to get away with implementing a
> single routine,  wi_fold () which adds 2 wide int bounds  together and
> returns a result.  THis si the implelemntaion for operator_plus.
> 
> void
> operator_plus::wi_fold (irange &r, tree type,
>                          const wide_int &lh_lb, const wide_int &lh_ub,
>                          const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    wi::overflow_type ov_lb, ov_ub;
>    signop s = TYPE_SIGN (type);
>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>    value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
> }
> 
> 
> you shouldn't have to do any of the overflow stuff at the end, just take
> the 2 sets of wide int, double their precision to start, add them
> together (it cant possible overflow right) and then return an
> int_range<2> with those bounds...
> ie
> 
> void
> operator_plus::wi_fold (irange &r, tree type,
>                          const wide_int &lh_lb, const wide_int &lh_ub,
>                          const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    wi::overflow_type ov_lb, ov_ub;
>    signop s = TYPE_SIGN (type);
> 
>    // Do whatever wideint magic is required to do this adds in higher
> precision
>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> 
>    r = int_range<2> (type, new_lb, new_ub);
> }
> 

So I've been working on adding support for widening plus and widening multiplication,
and my examples all work now, but during bootstrap I hit a problem.

Say you have a mixed-sign widening multiplication, such as in:

int decMultiplyOp_zacc, decMultiplyOp_iacc;
int *decMultiplyOp_lp;
void decMultiplyOp() {
  decMultiplyOp_lp = &decMultiplyOp_zacc;
  for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
       decMultiplyOp_lp++)
    *decMultiplyOp_lp = 0;
}

Eventually the pointer arithmetic will generate:

intD.7 decMultiplyOp_iacc.2_13;
long unsigned intD.11 _15;
_15 = decMultiplyOp_iacc.2_13 w* 4;
and it'll try to get the range from this.

My implementation is just:

void
operator_widen_mult::wi_fold (irange &r, tree type,
			const wide_int &lh_lb, const wide_int &lh_ub,
			const wide_int &rh_lb, const wide_int &rh_ub) const
{
  signop s = TYPE_SIGN (type);

  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);

  /* We don't expect a widening multiplication to be able to overflow but range
     calculations for multiplications are complicated.  After widening the
     operands, let's call the base class.  */
  return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
}

But in this case the operands are different types and wi_fold only gets the
type of the operation.  The issue is that when increasing the precision for lh_*
I need to sign extend the value and not zero extend it, but I don't seem to have
enough context here to know which one is needed.  I'm missing the type of the operands.
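
To make the difference concrete, here's a tiny illustration (just a sketch using
the wide_int API, with -4 as a made-up bound):

// A lower bound of -4 in a 32-bit operand.
wide_int lb = wi::shwi (-4, 32);

// Extending with the sign of the operation (unsigned, since the w* result is
// long unsigned int) zero extends, turning the bound into a huge positive value:
wide_int zext = wide_int::from (lb, 64, UNSIGNED);  // 0x00000000fffffffc

// Extending with the sign of the operand keeps the value:
wide_int sext = wide_int::from (lb, 64, SIGNED);    // still -4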

For non-widening operations this doesn't matter as the precision stays the same.

Is there a way to get the information I need?

Thanks,
Tamar

> 
> The operator needs to be registered, I've attached the skeleton for it.
> you should just have to finish implementing wi_fold().
> 
> in theory :-)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-22 13:06                                 ` Tamar Christina
@ 2023-02-22 15:19                                   ` Andrew MacLeod
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-22 15:19 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 5050 bytes --]


On 2/22/23 08:06, Tamar Christina wrote:
> Hi Andrew,
>
>> all the range-op integer code is in gcc/range-op.cc.  As this is a basic
>> binary operation, you should be able to get away with implementing a
>> single routine,  wi_fold () which adds 2 wide int bounds  together and
>> returns a result.  THis si the implelemntaion for operator_plus.
>>
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                           const wide_int &lh_lb, const wide_int &lh_ub,
>>                           const wide_int &rh_lb, const wide_int &rh_ub) const
>> {
>>     wi::overflow_type ov_lb, ov_ub;
>>     signop s = TYPE_SIGN (type);
>>     wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>     wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>     value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
>> }
>>
>>
>> you shouldn't have to do any of the overflow stuff at the end, just take
>> the 2 sets of wide int, double their precision to start, add them
>> together (it cant possible overflow right) and then return an
>> int_range<2> with those bounds...
>> ie
>>
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                           const wide_int &lh_lb, const wide_int &lh_ub,
>>                           const wide_int &rh_lb, const wide_int &rh_ub) const
>> {
>>     wi::overflow_type ov_lb, ov_ub;
>>     signop s = TYPE_SIGN (type);
>>
>>     // Do whatever wideint magic is required to do this adds in higher
>> precision
>>     wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>     wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>
>>     r = int_range<2> (type, new_lb, new_ub);
>> }
>>
> So I've been working on adding support for widening plus and widening multiplication,
> and my examples all work now.. but during bootstrap I hit a problem.
>
> Say you have a mixed sign widening multiplication, such as in:
>
> int decMultiplyOp_zacc, decMultiplyOp_iacc;
> int *decMultiplyOp_lp;
> void decMultiplyOp() {
>    decMultiplyOp_lp = &decMultiplyOp_zacc;
>    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
>         decMultiplyOp_lp++)
>      *decMultiplyOp_lp = 0;
> }
>
> Eventually the pointer arithmetic will generate:
>
> intD.7 decMultiplyOp_iacc.2_13;
> long unsigned intD.11 _15;
> _15 = decMultiplyOp_iacc.2_13 w* 4;
> and it'll try to get the range from this.
>
> My implementation is just:
>
> void
> operator_widen_mult::wi_fold (irange &r, tree type,
> 			const wide_int &lh_lb, const wide_int &lh_ub,
> 			const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    signop s = TYPE_SIGN (type);
>
>    wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
>    wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
>    wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
>    wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
>
>    /* We don't expect a widening multiplication to be able to overflow but range
>       calculations for multiplications are complicated.  After widening the
>       operands lets call the base class.  */
>    return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> }
>
> But in this case the operands are different types and the wi_fold only gets the
> type of the operation. The issue is that when increasing the precision for lh_*
> I need to sign extend the value and not zero extend, but I don't seem to have
> enough context here to know that I do.  I'm missing the type of the operands.
>
> For non-widening operations this doesn't matter as the precision stays the same.
>
> Is there a way to get the information I need?
>
>
We haven't had this situation before, if I understand it correctly:

The LHS is a different type than both the operands, and your problem is 
you need to know the sign of at least operand1 in order to know whether 
to zero extend or to sign extend it?  Huh, I haven't run into that with 
any other bit of IL before :-P

Let me think about it.  I am loath to change range-ops itself, but we 
may be able to leverage the builtin-function approach for dealing with 
something non-standard, at least for the moment, to keep you going.

For the builtins, we provide a range-ops handler *after* we look at the 
operands from within a gimple context where we can still see the types, 
and choose an appropriate handler.  So I'm thinking we provide 2 handlers,

operator_widen_mult_signed
operator_widen_mult_unsigned

chosen based on whether to sign extend or zero extend op1.  Look at the 
type of operand one, and return the appropriate handler.  Let me give you 
a skeleton.  I *think* this should do it.

You can provide 2 versions of operator_widen_mult in range-ops (so you 
can still inherit from operator_mult).  They should be exported, I think, 
and the appropriate one should be called...

Give it a try and see if it works :-P.





[-- Attachment #2: tamar.diff --]
[-- Type: text/x-patch, Size: 1586 bytes --]

diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index d9dfdc56939..e4391f4a616 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
   // statements.
   if (is_a <gcall *> (m_stmt))
     maybe_builtin_call ();
+  else
+    maybe_non_standard ();
 }
 
 // Calculate what we can determine of the range of this unary
@@ -764,6 +766,29 @@ public:
   }
 } op_cfn_parity;
 
+// Set up a gimple_range_op_handler for any nonstandard function which can be
+// supported via range-ops.
+
+void
+gimple_range_op_handler::maybe_non_standard ()
+{
+  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
+    switch (gimple_assign_rhs_code (m_stmt))
+      {
+	case WIDEN_MULT_EXPR:
+	  extern class range_operator &op_widen_mult_signed;
+	  extern class range_operator &op_widen_mult_unsigned;
+	  m_valid = true;
+	  m_op1 = gimple_assign_rhs1 (m_stmt);
+	  m_op2 = gimple_assign_rhs2 (m_stmt);
+	  if (TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED)
+	    m_int = &op_widen_mult_signed;
+	  else
+	    m_int = &op_widen_mult_unsigned;
+	default:
+	  break;
+      }
+}
 // Set up a gimple_range_op_handler for any built in function which can be
 // supported via range-ops.
 
diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
index 743b858126e..1bf63c5ce6f 100644
--- a/gcc/gimple-range-op.h
+++ b/gcc/gimple-range-op.h
@@ -41,6 +41,7 @@ public:
 		 relation_trio = TRIO_VARYING);
 private:
   void maybe_builtin_call ();
+  void maybe_non_standard ();
   gimple *m_stmt;
   tree m_op1, m_op2;
 };

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-15 18:42                                 ` Andrew MacLeod
  2023-02-22 12:51                                   ` Tamar Christina
@ 2023-02-22 16:41                                   ` Andrew MacLeod
  2023-02-22 18:03                                     ` Tamar Christina
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-22 16:41 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw


On 2/15/23 13:42, Andrew MacLeod wrote:
>
> On 2/15/23 12:50, Andrew MacLeod wrote:
>>
>> On 2/15/23 12:13, Tamar Christina wrote:
>>>> On 2/15/23 07:51, Tamar Christina wrote:
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                         const wide_int &lh_lb, const wide_int &lh_ub,
>>                         const wide_int &rh_lb, const wide_int &rh_ub) 
>> const
>> {
>>   wi::overflow_type ov_lb, ov_ub;
>>   signop s = TYPE_SIGN (type);
>>
>>   // Do whatever wideint magic is required to do this adds in higher 
>> precision
>>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>
>>   r = int_range<2> (type, new_lb, new_ub);
>> }
>>
>>
>> The operator needs to be registered, I've attached the skeleton for 
>> it.  you should just have to finish implementing wi_fold().
>>
>> in theory :-)
>>
> You also mentioned earlier that some were tree codes, some were 
> internal function calls?  We have some initial support for built in 
> functions, but I am not familiar with all the various forms they can 
> take.  We currently support CFN_ functions in
>
>   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
>
> Basically this is part of a "gimple_range_op_handler"  wrapper for 
> range-ops which can provide a range-ops class for stmts that don't map 
> to a binary or unary form.. such as built in functions.
>
> If you get to the point where you need this for a builtin function, I 
> can help you through that too.  Although someone may have to also help 
> me through what differentiates the different kinds of internal 
> function :-)    I presume they are all similar in some way.
>
> Andrew
>
Oh yeah, and in case you haven't figured it out on your own, you'll have 
to remove WIDEN_MULT_EXPR from the range-ops init table.   This 
non-standard mechanism only gets checked if there is no standard 
range-op table entry for the tree code :-P

Andrew

Andrew


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-22 16:41                                   ` Andrew MacLeod
@ 2023-02-22 18:03                                     ` Tamar Christina
  2023-02-22 18:33                                       ` Andrew MacLeod
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-22 18:03 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Wednesday, February 22, 2023 4:42 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/15/23 13:42, Andrew MacLeod wrote:
> >
> > On 2/15/23 12:50, Andrew MacLeod wrote:
> >>
> >> On 2/15/23 12:13, Tamar Christina wrote:
> >>>> On 2/15/23 07:51, Tamar Christina wrote:
> >> void
> >> operator_plus::wi_fold (irange &r, tree type,
> >>                         const wide_int &lh_lb, const wide_int &lh_ub,
> >>                         const wide_int &rh_lb, const wide_int &rh_ub)
> >> const {
> >>   wi::overflow_type ov_lb, ov_ub;
> >>   signop s = TYPE_SIGN (type);
> >>
> >>   // Do whatever wideint magic is required to do this adds in higher
> >> precision
> >>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
> >>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> >>
> >>   r = int_range<2> (type, new_lb, new_ub); }
> >>
> >>
> >> The operator needs to be registered, I've attached the skeleton for
> >> it.  you should just have to finish implementing wi_fold().
> >>
> >> in theory :-)
> >>
> > You also mentioned earlier that some were tree codes, some were
> > internal function calls?  We have some initial support for built in
> > functions, but I am not familiar with all the various forms they can
> > take.  We currently support CFN_ functions in
> >
> >   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
> >
> > Basically this is part of a "gimple_range_op_handler"  wrapper for
> > range-ops which can provide a range-ops class for stmts that don't map
> > to a binary or unary form.. such as built in functions.
> >
> > If you get to the point where you need this for a builtin function, I
> > can help you through that too.  Although someone may have to also help
> > me through what differentiates the different kinds of internal
> > function :-)    I presume they are all similar in some way.
> >
> > Andrew
> >
> Oh yeah, and in case you haven't figured it out on your own, you'll have
> to remove WIDEN_MULT_EXPR from the range-ops init table.   This
> non-standard mechanism only gets checked if there is no standard
> range-op table entry for the tree code :-P
> 

Hmm it looks like it'll work, but it keeps segfaulting in:

bool
range_op_handler::fold_range (vrange &r, tree type,
			      const vrange &lh,
			      const vrange &rh,
			      relation_trio rel) const
{
  gcc_checking_assert (m_valid);
  if (m_int)
    return m_int->fold_range (as_a <irange> (r), type,
			   as_a <irange> (lh),
			   as_a <irange> (rh), rel);

while trying to call fold_range.

But m_int is set to the right instance. Probably something I'm missing,
I'll double check it all.

Cheers,
Tamar
> Andrew
> 
> Andrew


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-22 18:03                                     ` Tamar Christina
@ 2023-02-22 18:33                                       ` Andrew MacLeod
  2023-02-23  8:36                                         ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-22 18:33 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw


On 2/22/23 13:03, Tamar Christina wrote:
>> -----Original Message-----
>> From: Andrew MacLeod <amacleod@redhat.com>
>> Sent: Wednesday, February 22, 2023 4:42 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
>> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>>
>>
>> On 2/15/23 13:42, Andrew MacLeod wrote:
>>> On 2/15/23 12:50, Andrew MacLeod wrote:
>>>> On 2/15/23 12:13, Tamar Christina wrote:
>>>>>> On 2/15/23 07:51, Tamar Christina wrote:
>>>> void
>>>> operator_plus::wi_fold (irange &r, tree type,
>>>>                          const wide_int &lh_lb, const wide_int &lh_ub,
>>>>                          const wide_int &rh_lb, const wide_int &rh_ub)
>>>> const {
>>>>    wi::overflow_type ov_lb, ov_ub;
>>>>    signop s = TYPE_SIGN (type);
>>>>
>>>>    // Do whatever wideint magic is required to do this adds in higher
>>>> precision
>>>>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>>>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>>>
>>>>    r = int_range<2> (type, new_lb, new_ub); }
>>>>
>>>>
>>>> The operator needs to be registered, I've attached the skeleton for
>>>> it.  you should just have to finish implementing wi_fold().
>>>>
>>>> in theory :-)
>>>>
>>> You also mentioned earlier that some were tree codes, some were
>>> internal function calls?  We have some initial support for built in
>>> functions, but I am not familiar with all the various forms they can
>>> take.  We currently support CFN_ functions in
>>>
>>>    gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
>>>
>>> Basically this is part of a "gimple_range_op_handler"  wrapper for
>>> range-ops which can provide a range-ops class for stmts that don't map
>>> to a binary or unary form.. such as built in functions.
>>>
>>> If you get to the point where you need this for a builtin function, I
>>> can help you through that too.  Although someone may have to also help
>>> me through what differentiates the different kinds of internal
>>> function :-)    I presume they are all similar in some way.
>>>
>>> Andrew
>>>
>> Oh yeah, and in case you haven't figured it out on your own, you'll have
>> to remove WIDEN_MULT_EXPR from the range-ops init table.   This
>> non-standard mechanism only gets checked if there is no standard
>> range-op table entry for the tree code :-P
>>
> Hmm it looks like it'll work, but it keeps segfaulting in:
>
> bool
> range_op_handler::fold_range (vrange &r, tree type,
> 			      const vrange &lh,
> 			      const vrange &rh,
> 			      relation_trio rel) const
> {
>    gcc_checking_assert (m_valid);
>    if (m_int)
>      return m_int->fold_range (as_a <irange> (r), type,
> 			   as_a <irange> (lh),
> 			   as_a <irange> (rh), rel);
>
> while trying to call fold_range.
>
> But m_int is set to the right instance. Probably something I'm missing,
> I'll double check it all.
>
Hmm.  What does your class operator_widen_mult* look like? What are you 
inheriting from?  Send me your patch and I'll have a look if you want.  
This is somewhat new territory :-)

I can't imagine it being a linkage thing between the 2 files, since the 
operator is defined in another file and the address is taken in this one? 
That should work, but it's strange that it can't make the call...

Andrew


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-22 18:33                                       ` Andrew MacLeod
@ 2023-02-23  8:36                                         ` Tamar Christina
  2023-02-23 16:39                                           ` Andrew MacLeod
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-23  8:36 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 3140 bytes --]

Hi Andrew,

> >> Oh yeah, and in case you haven't figured it out on your own, you'll
> >> have to remove WIDEN_MULT_EXPR from the range-ops init table.   This
> >> non-standard mechanism only gets checked if there is no standard
> >> range-op table entry for the tree code :-P
> >>
> > Hmm it looks like it'll work, but it keeps segfaulting in:
> >
> > bool
> > range_op_handler::fold_range (vrange &r, tree type,
> > 			      const vrange &lh,
> > 			      const vrange &rh,
> > 			      relation_trio rel) const
> > {
> >    gcc_checking_assert (m_valid);
> >    if (m_int)
> >      return m_int->fold_range (as_a <irange> (r), type,
> > 			   as_a <irange> (lh),
> > 			   as_a <irange> (rh), rel);
> >
> > while trying to call fold_range.
> >
> > But m_int is set to the right instance. Probably something I'm
> > missing, I'll double check it all.
> >
> Hmm.  whats your class operator_widen_mult* look like? what are you
> inheriting from?   Send me your patch and I'll have a look if you want. this is
> somewhat  new territory :-)

I've attached the patch, and my testcase is:

int decMultiplyOp_zacc, decMultiplyOp_iacc;
int *decMultiplyOp_lp;
void decMultiplyOp() {
  decMultiplyOp_lp = &decMultiplyOp_zacc;
  for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
       decMultiplyOp_lp++)
    *decMultiplyOp_lp = 0;
}

And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o - -Werror=stringop-overflow

Also, to explain a bit about why we're only seeing this now:

The original sequence for most of the pipeline is based on a cast and multiplication

  # RANGE [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
  _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
  # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
  _15 = _14 * 4;

But things like widening multiply are quite common, so some ISAs have it on scalars as well, not just vectors.
So there's a pass widening_mul that runs late for these targets.  This replaces the above with

  # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
  _15 = decMultiplyOp_iacc.2_13 w* 4;

And copies over the final range from the original expression.

After that there are passes, like the warning passes, that try to re-query ranges to see if any optimization has changed them.
Before my attempt to support *w this would just return VARYING and it would only use the old range.

Now, however, without taking care to sign extend when appropriate, the MIN range changes from a negative value to a large
positive one when we increase the precision.  So passes that re-query late get the wrong range.  That's why, for instance, in this
case we get an incorrect warning.
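
(For concreteness: the 18446744071562067968 in the dump above is 0xffffffff80000000, i.e. INT_MIN sign-extended to
64 bits.  Zero extending the 32-bit bound instead gives 0x0000000080000000 = 2147483648, which is how the negative
end of the range turns into a large positive one.)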

Thanks for the help!

Tamar

> 
> I cant imagine it being a linkage thing between the 2 files since the operator is
> defined in another file and the address taken in this one?
> that should work, but strange that cant make the call...
> 
> Andrew


[-- Attachment #2: rb16929.patch --]
[-- Type: application/octet-stream, Size: 7378 bytes --]

diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
index 743b858126e333ea9590c0f175aacb476260c048..1bf63c5ce6f5db924a1f5907ab4539e376281bd0 100644
--- a/gcc/gimple-range-op.h
+++ b/gcc/gimple-range-op.h
@@ -41,6 +41,7 @@ public:
 		 relation_trio = TRIO_VARYING);
 private:
   void maybe_builtin_call ();
+  void maybe_non_standard ();
   gimple *m_stmt;
   tree m_op1, m_op2;
 };
diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index d9dfdc56939bb62ade72726b15c3d5e87e4ddcd1..81089876d303f4caa16d099866ecf70bae543768 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
   // statements.
   if (is_a <gcall *> (m_stmt))
     maybe_builtin_call ();
+  else
+    maybe_non_standard ();
 }
 
 // Calculate what we can determine of the range of this unary
@@ -764,6 +766,38 @@ public:
   }
 } op_cfn_parity;
 
+// Set up a gimple_range_op_handler for any nonstandard function which can be
+// supported via range-ops.
+
+void
+gimple_range_op_handler::maybe_non_standard ()
+{
+  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
+    switch (gimple_assign_rhs_code (m_stmt))
+      {
+	case WIDEN_MULT_EXPR:
+	{
+	  extern class range_operator &op_widen_mult_signed;
+	  extern class range_operator &op_widen_mult_unsigned;
+	  m_valid = true;
+	  m_op1 = gimple_assign_rhs1 (m_stmt);
+	  m_op2 = gimple_assign_rhs2 (m_stmt);
+	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
+	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
+	  if (signed2 && !signed1)
+	    std::swap (m_op1, m_op2);
+
+	  if (signed1 || signed2)
+	    m_int = &op_widen_mult_signed;
+	  else
+	    m_int = &op_widen_mult_unsigned;
+	  break;
+	}
+	default:
+	  break;
+      }
+}
+
 // Set up a gimple_range_op_handler for any built in function which can be
 // supported via range-ops.
 
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index 5c67bce6d3aab81ad3186b902e09d6a96878d9bb..c15bd83b077ad31c5ae7db5ffe5f2831d153128e 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
   return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
 }
 
+class operator_widen_plus : public operator_plus
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus;
+
+void
+operator_widen_plus::wi_fold (irange &r, tree type,
+			const wide_int &lh_lb, const wide_int &lh_ub,
+			const wide_int &rh_lb, const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
 
 class operator_minus : public range_operator
 {
@@ -1877,10 +1905,10 @@ public:
 		        const wide_int &lh_lb,
 		        const wide_int &lh_ub,
 		        const wide_int &rh_lb,
-			const wide_int &rh_ub) const final override;
+			const wide_int &rh_ub) const;
   virtual bool wi_op_overflows (wide_int &res, tree type,
 				const wide_int &w0, const wide_int &w1)
-    const final override;
+    const;
   virtual bool op1_range (irange &r, tree type,
 			  const irange &lhs,
 			  const irange &op2,
@@ -2031,6 +2059,99 @@ operator_mult::wi_fold (irange &r, tree type,
     }
 }
 
+class operator_widen_mult_signed : public operator_mult
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+  virtual bool wi_op_overflows (wide_int &res, tree type,
+				const wide_int &w0, const wide_int &w1)
+    const;
+} op_widen_mult_signed;
+
+void
+operator_widen_mult_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, SIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
+
+bool
+operator_widen_mult_signed::wi_op_overflows (wide_int &res, tree type,
+					      const wide_int &w0,
+					      const wide_int &w1) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int ww0 = wide_int::from (w0, wi::get_precision (w0) * 2, SIGNED);
+  wide_int ww1 = wide_int::from (w1, wi::get_precision (w1) * 2, s);
+
+  return operator_mult::wi_op_overflows (res, type, ww0, ww1);
+}
+
+class operator_widen_mult_unsigned : public operator_mult
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+  virtual bool wi_op_overflows (wide_int &res, tree type,
+				const wide_int &w0, const wide_int &w1)
+    const;
+} op_widen_mult_unsigned;
+
+void
+operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, UNSIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
+
+bool
+operator_widen_mult_unsigned::wi_op_overflows (wide_int &res, tree type,
+					       const wide_int &w0,
+					       const wide_int &w1) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int ww0 = wide_int::from (w0, wi::get_precision (w0) * 2, UNSIGNED);
+  wide_int ww1 = wide_int::from (w1, wi::get_precision (w1) * 2, s);
+
+  return operator_mult::wi_op_overflows (res, type, ww0, ww1);
+}
 
 class operator_div : public cross_product_operator
 {
@@ -4473,6 +4594,7 @@ integral_table::integral_table ()
   set (GT_EXPR, op_gt);
   set (GE_EXPR, op_ge);
   set (PLUS_EXPR, op_plus);
+  set (WIDEN_PLUS_EXPR, op_widen_plus);
   set (MINUS_EXPR, op_minus);
   set (MIN_EXPR, op_min);
   set (MAX_EXPR, op_max);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-23  8:36                                         ` Tamar Christina
@ 2023-02-23 16:39                                           ` Andrew MacLeod
  2023-02-23 16:56                                             ` Tamar Christina
  2023-03-01 16:57                                             ` Andrew Carlotti
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew MacLeod @ 2023-02-23 16:39 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 4346 bytes --]


On 2/23/23 03:36, Tamar Christina wrote:
> Hi Andrew,
>
>>>> Oh yeah, and in case you haven't figured it out on your own, you'll
>>>> have to remove WIDEN_MULT_EXPR from the range-ops init table.   This
>>>> non-standard mechanism only gets checked if there is no standard
>>>> range-op table entry for the tree code :-P
>>>>
>>> Hmm it looks like it'll work, but it keeps segfaulting in:
>>>
>>> bool
>>> range_op_handler::fold_range (vrange &r, tree type,
>>> 			      const vrange &lh,
>>> 			      const vrange &rh,
>>> 			      relation_trio rel) const
>>> {
>>>     gcc_checking_assert (m_valid);
>>>     if (m_int)
>>>       return m_int->fold_range (as_a <irange> (r), type,
>>> 			   as_a <irange> (lh),
>>> 			   as_a <irange> (rh), rel);
>>>
>>> while trying to call fold_range.
>>>
>>> But m_int is set to the right instance. Probably something I'm
>>> missing, I'll double check it all.
>>>
>> Hmm.  whats your class operator_widen_mult* look like? what are you
>> inheriting from?   Send me your patch and I'll have a look if you want. this is
>> somewhat  new territory :-)
> I've attached the patch, and my testcase is:
>
> int decMultiplyOp_zacc, decMultiplyOp_iacc;
> int *decMultiplyOp_lp;
> void decMultiplyOp() {
>    decMultiplyOp_lp = &decMultiplyOp_zacc;
>    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
>         decMultiplyOp_lp++)
>      *decMultiplyOp_lp = 0;
> }
>
> And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o - -Werror=stringop-overflow
>
> Also to explain a bit on why we're only seeing this now:
>
> The original sequence for most of the pipeline is based on a cast and multiplication
>
>    # RANGE [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
>    _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
>    # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
>    _15 = _14 * 4;
>
> But things like widening multiply are quite common, so some ISAs have it on scalars as well, not just vectors.
> So there's a pass widening_mul that runs late for these targets.  This replaces the above with
>
>    # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
>    _15 = decMultiplyOp_iacc.2_13 w* 4;
>
> And copies over the final range from the original expression.
>
> After that there are passes like the warning passes that try to requery ranged to see if any optimization  has changed them.
> Before my attempt to support *w this would just return VARYING and it would only use the old range.
>
> Now however, without taking care to sign extend when appropriate the MIN range changes from a negative value to a large
> positive one when we increase the precision.  So passes that re-query late get the wrong range.  That's why for instance in this case
> we get an incorrect warning generated.
>
> Thanks for the help!
>
> Tamar
>
>> I cant imagine it being a linkage thing between the 2 files since the operator is
>> defined in another file and the address taken in this one?
>> that should work, but strange that cant make the call...
>>
>> Andrew

It is some sort of linkage/vtable thing.  The fix.diff patch applied on 
top of what you have will fix the fold issue.  This'll do for now until I 
formalize how this is going to work going forward.

Inheriting from operator_mult is also going to be hazardous because it 
also has an op1_range and op2_range...  you should at least define those 
and return VARYING to avoid other issues.  Same thing applies to 
widen_plus I think, and it has relation processing and other things as 
well.  Your widen operands are not what those classes expect, so I think 
you probably just want a fresh range operator.
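
If you do stay with the operator_mult inheritance, the stubs I mean would look 
something like this (just a sketch; if op1_range/op2_range are marked final in 
the base you would also have to relax that, the same way your patch already 
does for wi_fold):

  virtual bool op1_range (irange &r, tree type, const irange &,
			  const irange &, relation_trio) const override
  {
    // The widened operands are not what operator_mult's op1_range expects,
    // so just report that we know nothing.
    r.set_varying (type);
    return true;
  }

and the same for op2_range.  A fresh operator avoids the issue entirely.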

It also looks like the mult operation is sign/zero extending both upper 
bounds, and neither lower bound.  I think that should be the LH upper 
and lower bound?

I've attached a second patch (newversion.patch) which incorporates my 
fix, the fix to the sign of only op1's bounds, as well as a 
simplification of the classes to not inherit from operator_mult/plus.  
I think this still does what you want, and it won't get you into 
unexpected trouble later :-)

Let me know if this is still doing what you are expecting...

Andrew


[-- Attachment #2: fix.diff --]
[-- Type: text/x-patch, Size: 1825 bytes --]

diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index 81089876d30..824e0338f34 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -777,8 +777,6 @@ gimple_range_op_handler::maybe_non_standard ()
       {
 	case WIDEN_MULT_EXPR:
 	{
-	  extern class range_operator &op_widen_mult_signed;
-	  extern class range_operator &op_widen_mult_unsigned;
 	  m_valid = true;
 	  m_op1 = gimple_assign_rhs1 (m_stmt);
 	  m_op2 = gimple_assign_rhs2 (m_stmt);
@@ -788,9 +786,9 @@ gimple_range_op_handler::maybe_non_standard ()
 	    std::swap (m_op1, m_op2);
 
 	  if (signed1 || signed2)
-	    m_int = &op_widen_mult_signed;
+	    m_int = ptr_op_widen_mult_signed;
 	  else
-	    m_int = &op_widen_mult_unsigned;
+	    m_int = ptr_op_widen_mult_unsigned;
 	  break;
 	}
 	default:
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index c15bd83b077..bace915b256 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -2072,6 +2072,7 @@ public:
 				const wide_int &w0, const wide_int &w1)
     const;
 } op_widen_mult_signed;
+range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
 
 void
 operator_widen_mult_signed::wi_fold (irange &r, tree type,
@@ -2119,6 +2120,7 @@ public:
 				const wide_int &w0, const wide_int &w1)
     const;
 } op_widen_mult_unsigned;
+range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
 
 void
 operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
diff --git a/gcc/range-op.h b/gcc/range-op.h
index f00b747f08a..5fe463234ae 100644
--- a/gcc/range-op.h
+++ b/gcc/range-op.h
@@ -311,4 +311,6 @@ private:
 // This holds the range op table for floating point operations.
 extern floating_op_table *floating_tree_table;
 
+extern range_operator *ptr_op_widen_mult_signed;
+extern range_operator *ptr_op_widen_mult_unsigned;
 #endif // GCC_RANGE_OP_H

[-- Attachment #3: newversion.patch --]
[-- Type: text/x-patch, Size: 6125 bytes --]

diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index d9dfdc56939..824e0338f34 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
   // statements.
   if (is_a <gcall *> (m_stmt))
     maybe_builtin_call ();
+  else
+    maybe_non_standard ();
 }
 
 // Calculate what we can determine of the range of this unary
@@ -764,6 +766,36 @@ public:
   }
 } op_cfn_parity;
 
+// Set up a gimple_range_op_handler for any nonstandard function which can be
+// supported via range-ops.
+
+void
+gimple_range_op_handler::maybe_non_standard ()
+{
+  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
+    switch (gimple_assign_rhs_code (m_stmt))
+      {
+	case WIDEN_MULT_EXPR:
+	{
+	  m_valid = true;
+	  m_op1 = gimple_assign_rhs1 (m_stmt);
+	  m_op2 = gimple_assign_rhs2 (m_stmt);
+	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
+	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
+	  if (signed2 && !signed1)
+	    std::swap (m_op1, m_op2);
+
+	  if (signed1 || signed2)
+	    m_int = ptr_op_widen_mult_signed;
+	  else
+	    m_int = ptr_op_widen_mult_unsigned;
+	  break;
+	}
+	default:
+	  break;
+      }
+}
+
 // Set up a gimple_range_op_handler for any built in function which can be
 // supported via range-ops.
 
diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
index 743b858126e..1bf63c5ce6f 100644
--- a/gcc/gimple-range-op.h
+++ b/gcc/gimple-range-op.h
@@ -41,6 +41,7 @@ public:
 		 relation_trio = TRIO_VARYING);
 private:
   void maybe_builtin_call ();
+  void maybe_non_standard ();
   gimple *m_stmt;
   tree m_op1, m_op2;
 };
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index 5c67bce6d3a..7cd19a92d00 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
   return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
 }
 
+class operator_widen_plus : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus;
+
+void
+operator_widen_plus::wi_fold (irange &r, tree type,
+			const wide_int &lh_lb, const wide_int &lh_ub,
+			const wide_int &rh_lb, const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
 
 class operator_minus : public range_operator
 {
@@ -2031,6 +2059,70 @@ operator_mult::wi_fold (irange &r, tree type,
     }
 }
 
+class operator_widen_mult_signed : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_signed;
+range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
+
+void
+operator_widen_mult_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
+
+
+class operator_widen_mult_unsigned : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_unsigned;
+range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
+
+void
+operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
 
 class operator_div : public cross_product_operator
 {
@@ -4473,6 +4565,7 @@ integral_table::integral_table ()
   set (GT_EXPR, op_gt);
   set (GE_EXPR, op_ge);
   set (PLUS_EXPR, op_plus);
+  set (WIDEN_PLUS_EXPR, op_widen_plus);
   set (MINUS_EXPR, op_minus);
   set (MIN_EXPR, op_min);
   set (MAX_EXPR, op_max);
diff --git a/gcc/range-op.h b/gcc/range-op.h
index f00b747f08a..5fe463234ae 100644
--- a/gcc/range-op.h
+++ b/gcc/range-op.h
@@ -311,4 +311,6 @@ private:
 // This holds the range op table for floating point operations.
 extern floating_op_table *floating_tree_table;
 
+extern range_operator *ptr_op_widen_mult_signed;
+extern range_operator *ptr_op_widen_mult_unsigned;
 #endif // GCC_RANGE_OP_H

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-23 16:39                                           ` Andrew MacLeod
@ 2023-02-23 16:56                                             ` Tamar Christina
  2023-03-01 16:57                                             ` Andrew Carlotti
  1 sibling, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-02-23 16:56 UTC (permalink / raw)
  To: Andrew MacLeod, Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, nd, jlaw

> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Thursday, February 23, 2023 4:40 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/23/23 03:36, Tamar Christina wrote:
> > Hi Andrew,
> >
> >>>> Oh yeah, and in case you haven't figured it out on your own, you'll
> >>>> have to remove WIDEN_MULT_EXPR from the range-ops init table.
> >>>> This non-standard mechanism only gets checked if there is no
> >>>> standard range-op table entry for the tree code :-P
> >>>>
> >>> Hmm it looks like it'll work, but it keeps segfaulting in:
> >>>
> >>> bool
> >>> range_op_handler::fold_range (vrange &r, tree type,
> >>> 			      const vrange &lh,
> >>> 			      const vrange &rh,
> >>> 			      relation_trio rel) const
> >>> {
> >>>     gcc_checking_assert (m_valid);
> >>>     if (m_int)
> >>>       return m_int->fold_range (as_a <irange> (r), type,
> >>> 			   as_a <irange> (lh),
> >>> 			   as_a <irange> (rh), rel);
> >>>
> >>> while trying to call fold_range.
> >>>
> >>> But m_int is set to the right instance. Probably something I'm
> >>> missing, I'll double check it all.
> >>>
> >> Hmm.  whats your class operator_widen_mult* look like? what are you
> >> inheriting from?   Send me your patch and I'll have a look if you
> >> want. this is somewhat  new territory :-)
> > I've attached the patch, and my testcase is:
> >
> > int decMultiplyOp_zacc, decMultiplyOp_iacc; int *decMultiplyOp_lp;
> > void decMultiplyOp() {
> >    decMultiplyOp_lp = &decMultiplyOp_zacc;
> >    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
> >         decMultiplyOp_lp++)
> >      *decMultiplyOp_lp = 0;
> > }
> >
> > And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o -
> > -Werror=stringop-overflow
> >
> > Also to explain a bit on why we're only seeing this now:
> >
> > The original sequence for most of the pipeline is based on a cast and
> > multiplication
> >
> >    # RANGE [irange] long unsigned int [0,
> 2147483647][18446744071562067968, +INF]
> >    _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
> >    # RANGE [irange] long unsigned int [0,
> 8589934588][18446744065119617024, 18446744073709551612]
> NONZERO 0xfffffffffffffffc
> >    _15 = _14 * 4;
> >
> > But things like widening multiply are quite common, so some ISAs have it on
> scalars as well, not just vectors.
> > So there's a pass widening_mul that runs late for these targets.  This
> > replaces the above with
> >
> >    # RANGE [irange] long unsigned int [0,
> 8589934588][18446744065119617024, 18446744073709551612]
> NONZERO 0xfffffffffffffffc
> >    _15 = decMultiplyOp_iacc.2_13 w* 4;
> >
> > And copies over the final range from the original expression.
> >
> > After that there are passes like the warning passes that try to requery ranged
> to see if any optimization  has changed them.
> > Before my attempt to support *w this would just return VARYING and it
> would only use the old range.
> >
> > Now however, without taking care to sign extend when appropriate the
> > MIN range changes from a negative value to a large positive one when
> > we increase the precision.  So passes that re-query late get the wrong range.
> That's why for instance in this case we get an incorrect warning generated.
> >
> > Thanks for the help!
> >
> > Tamar
> >
> >> I cant imagine it being a linkage thing between the 2 files since the
> >> operator is defined in another file and the address taken in this one?
> >> that should work, but strange that cant make the call...
> >>
> >> Andrew
> 
> It is some sort of linkage/vtable thing.  The fix.diff patch applied on top of
> what you have will fix the fold issue. This'll do for now until I formalize how this
> is going to work goign forward.

Ah, I did see a gdb warning about the vtable 😊

> 
> Inheriting from operator_mult is also going to be hazardous because it also
> has an op1_range and op2_range...  you should at least define those and
> return VARYING to avoid other issues.  Same thing applies to widen_plus I
> think, and it has relation processing and other things as well.  Your widen
> operands are not what those classes expect, so I think you probably just want
> a fresh range operator.
> 
> It also looks like the mult operation is sign/zero extending both upper bounds,
> and neither lower bound..   I think that should be the LH upper and lower
> bound?

Ah yes, that was a typo.

> 
> I've attached a second patch  (newversion.patch) which incorporates my fix,
> the fix to the sign of only op1's bounds,  as well as a simplification of the
> classes to not inherit from operator_mult/plus.. I think this still does what you
> want?  and it wont get you into unexpected trouble later :-)
> 
> let me know if this is still doing what you are expecting...

Yes it was! And it works perfectly.  I think I'll need the same for widen_plus, so I'll
make those changes, do a full regression run, and submit the finished patch.

Thanks for all the help!

Cheers,
Tamar
> 
> Andrew


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-10 14:30     ` Richard Sandiford
  2023-02-10 14:54       ` Tamar Christina
@ 2023-02-27 11:09       ` Tamar Christina
  2023-02-27 12:11         ` Richard Sandiford
  1 sibling, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-27 11:09 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Hi,

> > I avoided open coding it with add and shift because it creates a 4
> > instructions (and shifts which are typically slow) dependency chain
> > instead of a load and multiply.  This change, unless the target is
> > known to optimize it further is unlikely to be beneficial.  And by the
> > time we get to costing the only alternative is to undo the existing pattern and
> so you lose the general shift optimization.
> >
> > So it seemed unwise to open code as shifts, given the codegen out of
> > the vectorizer would be degenerate for most targets or one needs the
> > more complicated route of costing during pattern matching already.
> 
> Hmm, OK.  That seems like a cost-model thing though, rather than something
> that should be exposed through optabs.  And I imagine the open-coded
> version would still be better than nothing on targets without highpart multiply.
> 
> So how about replacing the hook with one that simply asks whether division
> through highpart multiplication is preferred over the add/shift sequence?
> (Unfortunately it's not going to be possible to work that out from existing
> information.)

So this doesn't work for SVE.  For SVE the multiplication widening pass introduces
FMAs at the gimple level.  So in the cases where the operation is fed from a widening
multiplication we end up generating an FMA.  If that were all, I could have matched the FMA.

But it also pushes the multiplication into the second operand because it no longer has
a mul to share the result with.

In any case, the gimple code is transformed into

vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;

This transformation is much worse than the original code; it extends the dependency
chain with another expensive instruction.  I can try to correct this in RTL by matching
the FMA and shift and splitting them into MUL + ADDHNB, hoping CSE takes care of the extra mul.

But this seems like a hack, and it's basically undoing the earlier transformation.  It seems to
me that the open coding is a bad idea.

Do you still want it, Richard?

Thanks,
Tamar
> 
> Thanks,
> Richard
> 
> >
> >>
> >> Some comments in addition to Richard's:
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Hi All,
> >> >
> >> > As discussed in the ticket, this replaces the approach for
> >> > optimizing the div by bitmask operation from a hook into optabs
> >> > implemented through add_highpart.
> >> >
> >> > In order to be able to use this we need to check whether the
> >> > current precision has enough bits to do the operation without any
> >> > of the additions
> >> overflowing.
> >> >
> >> > We use range information to determine this and only do the
> >> > operation if we're sure am overflow won't occur.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> issues.
> >> >
> >> > Ok for master?
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> Remove.
> >> > 	* doc/tm.texi.in: Likewise.
> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
> >> patch.
> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> > 	* expmed.h (expand_divmod): Likewise.
> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> > 	* optabs.cc (expand_doubleword_mod,
> >> expand_doubleword_divmod): Likewise.
> >> > 	* internal-fn.def (ADDH): New.
> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> > 	* doc/md.texi: Document them.
> >> > 	* doc/rtl.texi: Likewise.
> >> > 	* target.def (can_special_div_by_const): Remove.
> >> > 	* target.h: Remove tree-core.h include
> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> >> and
> >> > 	implement new obtab recognition based on range.
> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >
> >> > --- inline copy of patch --
> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> >
> >>
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f
> 74080
> >> 3
> >> > 8595e21af35d 100644
> >> > --- a/gcc/doc/md.texi
> >> > +++ b/gcc/doc/md.texi
> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> @code{smul_highpart} RTX expression.
> >> >  Similar, but the multiplication is unsigned.  This may be
> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >
> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{smul@var{m}3_highpart}
> >>
> >> sadd
> >>
> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> > +@var{m}, and store the most significant half of the product in operand
> 0.
> >> > +The least significant half of the product is discarded.  This may
> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
> >> > +
> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
> >> > +expression.
> >> > +
> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
> extend
> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> >> > b/gcc/doc/rtl.texi index
> >> >
> >>
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00
> 343
> >> d17
> >> > 1940ec4222f3 100644
> >> > --- a/gcc/doc/rtl.texi
> >> > +++ b/gcc/doc/rtl.texi
> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> > @code{smul_highpart} returns the high part  of a signed
> >> > multiplication, @code{umul_highpart} returns the high part  of an
> >> > unsigned
> >> multiplication.
> >> >
> >> > +@findex sadd_highpart
> >> > +@findex uadd_highpart
> >> > +@cindex high-part addition
> >> > +@cindex addition high part
> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> >> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
> >> > +@code{sadd_highpart} returns the high part of a signed addition,
> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
> >>
> >> The patch doesn't add these RTL codes though.
> >>
> >> > +
> >> >  @findex fma
> >> >  @cindex fused multiply-add
> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >> >
> >>
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57
> 914840
> >> 17e
> >> > 6b0d62ab077e 100644
> >> > --- a/gcc/doc/tm.texi
> >> > +++ b/gcc/doc/tm.texi
> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
> >> > the hook to handle these two  implementation approaches itself.
> >> >  @end deftypefn
> >> >
> >> > -@deftypefn {Target Hook} bool
> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> @var{tree_code}, tree
> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> > target has a special method of -division of vectors of type
> >> > @var{vectype}
> >> using the value @var{constant}, -and producing a vector of type
> >> @var{vectype}.  The division -will then not be decomposed by the
> >> vectorizer and kept as a div.
> >> > -
> >> > -When the hook is being used to test whether the target supports a
> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> > When the hook -is being used to emit a division, @var{in0} and
> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> > -
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it -if rtxes are provided and updating @var{output}.
> >> > -@end deftypefn
> >> > -
> >> >  @deftypefn {Target Hook} tree
> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> @var{code},
> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
> >> > return the decl of a function that implements the  vectorized
> >> > variant of the function with the @code{combined_fn} code diff --git
> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >> >
> >>
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d
> 1efec0
> >> a3a
> >> > bccd1c293c7b 100644
> >> > --- a/gcc/doc/tm.texi.in
> >> > +++ b/gcc/doc/tm.texi.in
> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> strategy can generate better code.
> >> >
> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >
> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> > -
> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >
> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> >
> >>
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc
> 212f0
> >> bef
> >> > a016eea4573c 100644
> >> > --- a/gcc/explow.cc
> >> > +++ b/gcc/explow.cc
> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >       TRUNC_DIV_EXPR.  */
> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > size, align_rtx,
> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >> >  			NULL_RTX, 1);
> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >
> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> >> required_align)
> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >> >  				       Pmode),
> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > target,
> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >> >  					Pmode),
> >> >  			  NULL_RTX, 1);
> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> >
> >>
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c
> 5364
> >> 094
> >> > 1628068f3901 100644
> >> > --- a/gcc/expmed.h
> >> > +++ b/gcc/expmed.h
> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >  			       int);
> >> >  #ifdef GCC_OPTABS_H
> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
> >> tree,
> >> > -			  rtx, rtx, rtx, int,
> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
> >> rtx,
> >> > +			  rtx, int, enum optab_methods =
> >> OPTAB_LIB_WIDEN);
> >> >  #endif
> >> >  #endif
> >> >
> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> >
> >>
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025a
> b18a3
> >> a59
> >> > c169d3b7692f 100644
> >> > --- a/gcc/expmed.cc
> >> > +++ b/gcc/expmed.cc
> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
> rtx
> >> op0,
> >> > HOST_WIDE_INT d)
> >> >
> >> >  rtx
> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> mode,
> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> > -	       int unsignedp, enum optab_methods methods)
> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> > +	       enum optab_methods methods)
> >> >  {
> >> >    machine_mode compute_mode;
> >> >    rtx tquotient;
> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> > code, machine_mode mode,
> >> >
> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> > 0;
> >> >
> >> > -  /* Check if the target has specific expansions for the division.
> >> > */
> >> > -  tree cst;
> >> > -  if (treeop0
> >> > -      && treeop1
> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> (treeop0),
> >> > -						     wi::to_wide (cst),
> >> > -						     &target, op0, op1))
> >> > -    return target;
> >> > -
> >> > -
> >> >    /* Now convert to the best mode to use.  */
> >> >    if (compute_mode != mode)
> >> >      {
> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >> >  				!= CODE_FOR_nothing)))
> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> > -						int_mode, treeop0, treeop1,
> >> > -						op0, gen_int_mode (abs_d,
> >> > +						int_mode, op0,
> >> > +						gen_int_mode (abs_d,
> >> >  							      int_mode),
> >> >  						NULL_RTX, 0);
> >> >  		    else
> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  				      size - 1, NULL_RTX, 0);
> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >> >  				    NULL_RTX);
> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> treeop0,
> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> >> op1,
> >> > +				    NULL_RTX, 0);
> >> >  		if (t4)
> >> >  		  {
> >> >  		    rtx t5;
> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> >
> >>
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e75521633907
> 8d5b
> >> 2280
> >> > c6e277f26d72 100644
> >> > --- a/gcc/expr.cc
> >> > +++ b/gcc/expr.cc
> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >  	    return expand_divmod (0,
> >> >  				  FLOAT_MODE_P (GET_MODE (value))
> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> >> > -				  target, 0);
> >> > +				  GET_MODE (value), op1, op2, target, 0);
> >> >  	case MOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 0);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 0);
> >> >  	case UDIV:
> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case UMOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case ASHIFTRT:
> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
> >> 9170,13 +9169,11 @@
> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> treeop0,
> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >        do_pending_stack_adjust ();
> >> >        start_sequence ();
> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -				   op0, op1, target, 1);
> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 1);
> >> >        rtx_insn *uns_insns = get_insns ();
> >> >        end_sequence ();
> >> >        start_sequence ();
> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -				   op0, op1, target, 0);
> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 0);
> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >        end_sequence ();
> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> > -9198,8
> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> mode, tree treeop0,
> >> >        emit_insn (sgn_insns);
> >> >        return sgn_ret;
> >> >      }
> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -			op0, op1, target, unsignedp);
> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> > + unsignedp);
> >> >  }
> >> >
> >> >  rtx
> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
> 584f5a
> >> 3b
> >> > 8a734baa800f 100644
> >> > --- a/gcc/internal-fn.def
> >> > +++ b/gcc/internal-fn.def
> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> >> ECF_CONST
> >> > | ECF_NOTHROW, first,
> >> >
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smul_highpart, umul_highpart, binary)
> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smulhs, umulhs, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
> 69a6
> >> e
> >> > 77082c1e617b 100644
> >> > --- a/gcc/optabs.cc
> >> > +++ b/gcc/optabs.cc
> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >  		return NULL_RTX;
> >> >  	    }
> >> >  	}
> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> >> NULL, NULL,
> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> > -							word_mode),
> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> sum,
> >> > +				     gen_int_mode (INTVAL (op1),
> >> word_mode),
> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >        if (remainder == NULL_RTX)
> >> >  	return NULL_RTX;
> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
> >> mode, rtx
> >> > op0, rtx op1, rtx *rem,
> >> >
> >> >    if (op11 != const1_rtx)
> >> >      {
> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> op11,
> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				 op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
> op11,
> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (quot2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
> d7a5
> >> ccb
> >> > f6147947351a 100644
> >> > --- a/gcc/optabs.def
> >> > +++ b/gcc/optabs.def
> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >
> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >
> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
> >> >
> >>
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed0
> 8d1d
> >> 81a
> >> > fa2c2baa64a5 100644
> >> > --- a/gcc/target.def
> >> > +++ b/gcc/target.def
> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >  	const vec_perm_indices &sel),
> >> >   NULL)
> >> >
> >> > -DEFHOOK
> >> > -(can_special_div_by_const,
> >> > - "This hook is used to test whether the target has a special
> >> > method of\n\ -division of vectors of type @var{vectype} using the
> >> > value @var{constant},\n\ -and producing a vector of type
> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
> >> > to test whether the target supports a special\n\ -divide,
> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> > -	rtx in0, rtx in1),
> >> > - default_can_special_div_by_const)
> >> > -
> >> >  /* Return true if the target supports misaligned store/load of a
> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >     is true if the access is defined in a packed struct.  */ diff
> >> > --git a/gcc/target.h b/gcc/target.h index
> >> >
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
> a82b9
> >> 9f9
> >> > 13158c2d47b1 100644
> >> > --- a/gcc/target.h
> >> > +++ b/gcc/target.h
> >> > @@ -51,7 +51,6 @@
> >> >  #include "insn-codes.h"
> >> >  #include "tm.h"
> >> >  #include "hard-reg-set.h"
> >> > -#include "tree-core.h"
> >> >
> >> >  #if CHECKING_P
> >> >
> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
> 24454
> >> 93
> >> > 17a31390f0c2 100644
> >> > --- a/gcc/targhooks.h
> >> > +++ b/gcc/targhooks.h
> >> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
> >> > (rtx, tree, tree);  extern unsigned int
> >> > default_case_values_threshold (void);  extern bool
> >> > default_have_conditional_execution (void); -extern bool
> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> wide_int,
> >> > -					      rtx *, rtx, rtx);
> >> >
> >> >  extern bool default_libc_has_function (enum function_class, tree);
> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
> da91e
> >> 03
> >> > 877337a931e7 100644
> >> > --- a/gcc/targhooks.cc
> >> > +++ b/gcc/targhooks.cc
> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >> >    return HAVE_conditional_execution;  }
> >> >
> >> > -/* Default that no division by constant operations are special.
> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
> >> > wide_int, rtx *, rtx,
> >> > -				  rtx)
> >> > -{
> >> > -  return false;
> >> > -}
> >> > -
> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >     but sincos is not.  */
> >> >  bool
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234b
> f759e0a0
> >> a0
> >> > 4ea8c1f73e3c
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > @@ -0,0 +1,25 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> > +
> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
> >> > +V foo (V v, unsigned short i) {
> >> > +  v /= i;
> >> > +  return v;
> >> > +}
> >> > +
> >> > +int
> >> > +main (void)
> >> > +{
> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
> >> > +}, 0xffff);
> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> > +    if (v[i] != 0x00010001)
> >> > +      __builtin_abort ();
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
> 06b4991
> >> 4d2
> >> > a29b933de625
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > @@ -0,0 +1,58 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include <stdio.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +#define N 50
> >> > +#define TYPE uint8_t
> >> > +
> >> > +#ifndef DEBUG
> >> > +#define DEBUG 0
> >> > +#endif
> >> > +
> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> > +
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +int main ()
> >> > +{
> >> > +  TYPE a[N];
> >> > +  TYPE b[N];
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      a[i] = BASE + i * 13;
> >> > +      b[i] = BASE + i * 13;
> >> > +      if (DEBUG)
> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> > +    }
> >> > +
> >> > +  fun1 (a, N / 2, N);
> >> > +  fun2 (b, N / 2, N);
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      if (DEBUG)
> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> > +
> >> > +      if (a[i] != b[i])
> >> > +        __builtin_abort ();
> >> > +    }
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> > +{ target aarch64*-*-* } } } */
> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> > index
> >> >
> >>
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
> 14077d
> >> c3
> >> > e970bed75ef6 100644
> >> > --- a/gcc/tree-vect-generic.cc
> >> > +++ b/gcc/tree-vect-generic.cc
> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >  	  tree ret;
> >> >
> >> > -	  /* Check if the target was going to handle it through the special
> >> > -	     division callback hook.  */
> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> > -	  if (cst &&
> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL,
> >> > -							  NULL_RTX,
> >> NULL_RTX))
> >> > -	    return NULL_TREE;
> >> > -
> >> > -
> >> >  	  if (!optimize
> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc46
> 7f33
> >> 69
> >> > de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> equivalent to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done in
> >> > +       double the precision of the input.  However if we know that
> >> > + the
> >> addition
> >> > +       `x + 257` does not overflow then we can do the operation in
> >> > + the
> >> current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the orginal definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >
> > And have the addition be done as a 32-bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So in the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype, which is derived from the size the
> > vectorizer will perform the operation in.
> >
> > Thanks,
> > Tamar
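
For illustration, here is a minimal standalone sketch of the promoted-addition
range point quoted above.  It is not part of the patch; the concrete values
assume the quoted uint8_t example:

#include <stdint.h>
#include <assert.h>

int main (void)
{
  /* With uint8_t operands, C integer promotion performs the addition in
     int (32 bits), but the value range of the sum is only [0, 0x1fe].  */
  uint8_t pixel = 0xff, level = 0xff;
  int sum = pixel + level;                /* promoted addition */
  assert (sum == 0x1fe);

  /* The vectorizer demotes the operation to 16 bits; the pattern's
     overflow check asks whether max + (1 + (1 << 8)) still fits.  */
  unsigned adder = (1u << 8) + 1;         /* 257 */
  assert (0x1fe + adder <= UINT16_MAX);   /* no overflow, so the rewrite
					     can stay in 16 bits */
  return 0;
}
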
> >
> >>
> >> Thanks,
> >> Richard
> >>
> >> > +	      if (ovf == wi::OVF_NONE)
> >> > +		{
> >> > +		  *type_out = vectype;
> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> > +		  gcall *patt1
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> >> tadder);
> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> >> vectype);
> >> > +
> >> > +		  pattern_stmt
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> > +
> >> > +		  return pattern_stmt;
> >> > +		}
> >> > +	    }
> >> > +	}
> >> >      }
> >> >
> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> >> >
> >>
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0
> b95
> >> 64f
> >> > c4e066e50081 100644
> >> > --- a/gcc/tree-vect-stmts.cc
> >> > +++ b/gcc/tree-vect-stmts.cc
> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >  	}
> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >  			  != CODE_FOR_nothing);
> >> > -      tree cst;
> >> > -      if (!target_support_p
> >> > -	  && op1
> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> > -	target_support_p
> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> >> > -							wi::to_wide (cst),
> >> > -							NULL, NULL_RTX,
> >> > -							NULL_RTX);
> >> >      }
> >> >
> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
> >> > (vectype);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-27 11:09       ` Tamar Christina
@ 2023-02-27 12:11         ` Richard Sandiford
  2023-02-27 12:14           ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-27 12:11 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi,
>
>> > I avoided open coding it with add and shift because it creates a
>> > 4-instruction dependency chain (and shifts are typically slow)
>> > instead of a load and multiply.  This change, unless the target is
>> > known to optimize it further, is unlikely to be beneficial.  And by the
>> > time we get to costing, the only alternative is to undo the existing
>> > pattern and so you lose the general shift optimization.
>> >
>> > So it seemed unwise to open code as shifts, given the codegen out of
>> > the vectorizer would be degenerate for most targets or one needs the
>> > more complicated route of costing during pattern matching already.
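
As an aside, the add/shift sequence in question is the
(x + ((x + 257) >> 8)) >> 8 rewrite from the patch comment.  A standalone
brute-force check of that scalar identity (illustration only, not part of
the patch):

#include <stdint.h>
#include <assert.h>

int main (void)
{
  /* Verify x / 255 == (x + ((x + 257) >> 8)) >> 8 for every 16-bit x,
     with the intermediate arithmetic done in 32 bits so nothing wraps.  */
  for (uint32_t x = 0; x <= UINT16_MAX; x++)
    {
      uint32_t q = (x + ((x + 257) >> 8)) >> 8;
      assert (q == x / 255);
    }
  return 0;
}
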
>> 
>> Hmm, OK.  That seems like a cost-model thing though, rather than something
>> that should be exposed through optabs.  And I imagine the open-coded
>> version would still be better than nothing on targets without highpart multiply.
>> 
>> So how about replacing the hook with one that simply asks whether division
>> through highpart multiplication is preferred over the add/shift sequence?
>> (Unfortunately it's not going to be possible to work that out from existing
>> information.)
>
> So this doesn't work for SVE.  For SVE the multiplication widening pass introduces
> FMAs at gimple level.  So in the cases where the operation is fed from a widening
> multiplication we end up generating FMA.  If that were all, I could have matched the FMA.
>
> But it also pushes the multiplication into the second operand because it no longer has
> a mul to share the results with.
>
> In any case, the gimple code is transformed into
>
> vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
> vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
> vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
>
> This transformation is much worse than the original code: it extends the dependency
> chain with another expensive instruction.  I can try to correct this in RTL by matching
> FMA and shift and splitting into MUL + ADDHNB, hoping CSE takes care of the extra mul.
>
> But this seems like a hack, and it's basically undoing the earlier transformation.  It seems to
> me that the open coding is a bad idea.

Could you post the patch that gives this result?  I'll have a poke around.

Thanks,
Richard

> Do you still want it, Richard?
>
> Thanks,
> Tamar
>> 
>> Thanks,
>> Richard
>> 
>> >
>> >>
>> >> Some comments in addition to Richard's:
>> >>
>> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Hi All,
>> >> >
>> >> > As discussed in the ticket, this replaces the approach for
>> >> > optimizing the div by bitmask operation from a hook into optabs
>> >> > implemented through add_highpart.
>> >> >
>> >> > In order to be able to use this we need to check whether the
>> >> > current precision has enough bits to do the operation without any
>> >> > of the additions
>> >> overflowing.
>> >> >
>> >> > We use range information to determine this and only do the
>> >> > operation if we're sure an overflow won't occur.
>> >> >
>> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
>> >> issues.
>> >> >
>> >> > Ok for master?
>> >> >
>> >> > Thanks,
>> >> > Tamar
>> >> >
>> >> > gcc/ChangeLog:
>> >> >
>> >> > 	PR target/108583
>> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
>> >> Remove.
>> >> > 	* doc/tm.texi.in: Likewise.
>> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
>> >> patch.
>> >> > 	* expmed.cc (expand_divmod): Likewise.
>> >> > 	* expmed.h (expand_divmod): Likewise.
>> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
>> >> > 	* optabs.cc (expand_doubleword_mod,
>> >> expand_doubleword_divmod): Likewise.
>> >> > 	* internal-fn.def (ADDH): New.
>> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
>> >> > 	* doc/md.texi: Document them.
>> >> > 	* doc/rtl.texi: Likewise.
>> >> > 	* target.def (can_special_div_by_const): Remove.
>> >> > 	* target.h: Remove tree-core.h include
>> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
>> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
>> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
>> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
>> >> and
>> >> > 	implement new optab recognition based on range.
>> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>> >> >
>> >> > gcc/testsuite/ChangeLog:
>> >> >
>> >> > 	PR target/108583
>> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
>> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>> >> >
>> >> > --- inline copy of patch --
>> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
>> >> >
>> >>
>> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f
>> 74080
>> >> 3
>> >> > 8595e21af35d 100644
>> >> > --- a/gcc/doc/md.texi
>> >> > +++ b/gcc/doc/md.texi
>> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
>> >> @code{smul_highpart} RTX expression.
>> >> >  Similar, but the multiplication is unsigned.  This may be
>> >> > represented in RTL using an @code{umul_highpart} RTX expression.
>> >> >
>> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
>> >> > +@samp{smul@var{m}3_highpart}
>> >>
>> >> sadd
>> >>
>> >> > +Perform a signed addition of operands 1 and 2, which have mode
>> >> > +@var{m}, and store the most significant half of the product in operand
>> 0.
>> >> > +The least significant half of the product is discarded.  This may
>> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
>> >> > +
>> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
>> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
>> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
>> >> > +expression.
>> >> > +
>> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
>> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
>> extend
>> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
>> >> > b/gcc/doc/rtl.texi index
>> >> >
>> >>
>> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00
>> 343
>> >> d17
>> >> > 1940ec4222f3 100644
>> >> > --- a/gcc/doc/rtl.texi
>> >> > +++ b/gcc/doc/rtl.texi
>> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
>> >> > @code{smul_highpart} returns the high part  of a signed
>> >> > multiplication, @code{umul_highpart} returns the high part  of an
>> >> > unsigned
>> >> multiplication.
>> >> >
>> >> > +@findex sadd_highpart
>> >> > +@findex uadd_highpart
>> >> > +@cindex high-part addition
>> >> > +@cindex addition high part
>> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
>> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
>> >> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
>> >> > +@code{sadd_highpart} returns the high part of a signed addition,
>> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
>> >>
>> >> The patch doesn't add these RTL codes though.
>> >>
>> >> > +
>> >> >  @findex fma
>> >> >  @cindex fused multiply-add
>> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
>> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
>> >> >
>> >>
>> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57
>> 914840
>> >> 17e
>> >> > 6b0d62ab077e 100644
>> >> > --- a/gcc/doc/tm.texi
>> >> > +++ b/gcc/doc/tm.texi
>> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
>> >> > the hook to handle these two  implementation approaches itself.
>> >> >  @end deftypefn
>> >> >
>> >> > -@deftypefn {Target Hook} bool
>> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
>> >> @var{tree_code}, tree
>> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
>> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
>> >> > target has a special method of -division of vectors of type
>> >> > @var{vectype}
>> >> using the value @var{constant}, -and producing a vector of type
>> >> @var{vectype}.  The division -will then not be decomposed by the
>> >> vectorizer and kept as a div.
>> >> > -
>> >> > -When the hook is being used to test whether the target supports a
>> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
>> >> > When the hook -is being used to emit a division, @var{in0} and
>> >> > @var{in1} are the source -vectors of type @var{vecttype} and
>> >> > @var{output} is the destination vector of -type @var{vectype}.
>> >> > -
>> >> > -Return true if the operation is possible, emitting instructions
>> >> > for it -if rtxes are provided and updating @var{output}.
>> >> > -@end deftypefn
>> >> > -
>> >> >  @deftypefn {Target Hook} tree
>> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
>> >> @var{code},
>> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
>> >> > return the decl of a function that implements the  vectorized
>> >> > variant of the function with the @code{combined_fn} code diff --git
>> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
>> >> >
>> >>
>> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d
>> 1efec0
>> >> a3a
>> >> > bccd1c293c7b 100644
>> >> > --- a/gcc/doc/tm.texi.in
>> >> > +++ b/gcc/doc/tm.texi.in
>> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
>> >> strategy can generate better code.
>> >> >
>> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>> >> >
>> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
>> >> > -
>> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>> >> >
>> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
>> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
>> >> >
>> >>
>> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc
>> 212f0
>> >> bef
>> >> > a016eea4573c 100644
>> >> > --- a/gcc/explow.cc
>> >> > +++ b/gcc/explow.cc
>> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
>> >> >       TRUNC_DIV_EXPR.  */
>> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> >> > size, align_rtx,
>> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>> >> >  			NULL_RTX, 1);
>> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>> >> >
>> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
>> >> required_align)
>> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>> >> >  				       Pmode),
>> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> >> > target,
>> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
>> >> >  					Pmode),
>> >> >  			  NULL_RTX, 1);
>> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
>> >> >
>> >>
>> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c
>> 5364
>> >> 094
>> >> > 1628068f3901 100644
>> >> > --- a/gcc/expmed.h
>> >> > +++ b/gcc/expmed.h
>> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
>> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
>> >> (enum tree_code, machine_mode, rtx, int, rtx,
>> >> >  			       int);
>> >> >  #ifdef GCC_OPTABS_H
>> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
>> >> tree,
>> >> > -			  rtx, rtx, rtx, int,
>> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
>> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
>> >> rtx,
>> >> > +			  rtx, int, enum optab_methods =
>> >> OPTAB_LIB_WIDEN);
>> >> >  #endif
>> >> >  #endif
>> >> >
>> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
>> >> >
>> >>
>> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025a
>> b18a3
>> >> a59
>> >> > c169d3b7692f 100644
>> >> > --- a/gcc/expmed.cc
>> >> > +++ b/gcc/expmed.cc
>> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
>> rtx
>> >> op0,
>> >> > HOST_WIDE_INT d)
>> >> >
>> >> >  rtx
>> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
>> >> mode,
>> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
>> >> > -	       int unsignedp, enum optab_methods methods)
>> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
>> >> > +	       enum optab_methods methods)
>> >> >  {
>> >> >    machine_mode compute_mode;
>> >> >    rtx tquotient;
>> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> > code, machine_mode mode,
>> >> >
>> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
>> >> > 0;
>> >> >
>> >> > -  /* Check if the target has specific expansions for the division.
>> >> > */
>> >> > -  tree cst;
>> >> > -  if (treeop0
>> >> > -      && treeop1
>> >> > -      && (cst = uniform_integer_cst_p (treeop1))
>> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
>> >> (treeop0),
>> >> > -						     wi::to_wide (cst),
>> >> > -						     &target, op0, op1))
>> >> > -    return target;
>> >> > -
>> >> > -
>> >> >    /* Now convert to the best mode to use.  */
>> >> >    if (compute_mode != mode)
>> >> >      {
>> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> code, machine_mode mode,
>> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
>> >> >  				!= CODE_FOR_nothing)))
>> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
>> >> > -						int_mode, treeop0, treeop1,
>> >> > -						op0, gen_int_mode (abs_d,
>> >> > +						int_mode, op0,
>> >> > +						gen_int_mode (abs_d,
>> >> >  							      int_mode),
>> >> >  						NULL_RTX, 0);
>> >> >  		    else
>> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> code, machine_mode mode,
>> >> >  				      size - 1, NULL_RTX, 0);
>> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>> >> >  				    NULL_RTX);
>> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
>> >> treeop0,
>> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
>> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
>> >> op1,
>> >> > +				    NULL_RTX, 0);
>> >> >  		if (t4)
>> >> >  		  {
>> >> >  		    rtx t5;
>> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
>> >> >
>> >>
>> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e75521633907
>> 8d5b
>> >> 2280
>> >> > c6e277f26d72 100644
>> >> > --- a/gcc/expr.cc
>> >> > +++ b/gcc/expr.cc
>> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>> >> >  	    return expand_divmod (0,
>> >> >  				  FLOAT_MODE_P (GET_MODE (value))
>> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
>> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
>> >> > -				  target, 0);
>> >> > +				  GET_MODE (value), op1, op2, target, 0);
>> >> >  	case MOD:
>> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 0);
>> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 0);
>> >> >  	case UDIV:
>> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 1);
>> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 1);
>> >> >  	case UMOD:
>> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 1);
>> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 1);
>> >> >  	case ASHIFTRT:
>> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
>> >> 9170,13 +9169,11 @@
>> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
>> >> treeop0,
>> >> >        bool speed_p = optimize_insn_for_speed_p ();
>> >> >        do_pending_stack_adjust ();
>> >> >        start_sequence ();
>> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -				   op0, op1, target, 1);
>> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> >> > + target, 1);
>> >> >        rtx_insn *uns_insns = get_insns ();
>> >> >        end_sequence ();
>> >> >        start_sequence ();
>> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -				   op0, op1, target, 0);
>> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> >> > + target, 0);
>> >> >        rtx_insn *sgn_insns = get_insns ();
>> >> >        end_sequence ();
>> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
>> >> > -9198,8
>> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
>> >> mode, tree treeop0,
>> >> >        emit_insn (sgn_insns);
>> >> >        return sgn_ret;
>> >> >      }
>> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -			op0, op1, target, unsignedp);
>> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
>> >> > + unsignedp);
>> >> >  }
>> >> >
>> >> >  rtx
>> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
>> >> >
>> >>
>> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
>> 584f5a
>> >> 3b
>> >> > 8a734baa800f 100644
>> >> > --- a/gcc/internal-fn.def
>> >> > +++ b/gcc/internal-fn.def
>> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
>> >> ECF_CONST
>> >> > | ECF_NOTHROW, first,
>> >> >
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> >  			      smul_highpart, umul_highpart, binary)
>> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> > +			      sadd_highpart, uadd_highpart, binary)
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> >  			      smulhs, umulhs, binary)
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >> >
>> >>
>> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
>> 69a6
>> >> e
>> >> > 77082c1e617b 100644
>> >> > --- a/gcc/optabs.cc
>> >> > +++ b/gcc/optabs.cc
>> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
>> >> mode, rtx op0, rtx op1, bool unsignedp)
>> >> >  		return NULL_RTX;
>> >> >  	    }
>> >> >  	}
>> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> >> NULL, NULL,
>> >> > -				     sum, gen_int_mode (INTVAL (op1),
>> >> > -							word_mode),
>> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
>> word_mode,
>> >> sum,
>> >> > +				     gen_int_mode (INTVAL (op1),
>> >> word_mode),
>> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
>> >> >        if (remainder == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
>> >> mode, rtx
>> >> > op0, rtx op1, rtx *rem,
>> >> >
>> >> >    if (op11 != const1_rtx)
>> >> >      {
>> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
>> NULL,
>> >> quot1,
>> >> > -				op11, NULL_RTX, unsignedp,
>> >> OPTAB_DIRECT);
>> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
>> >> op11,
>> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >> >        if (rem2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
>> >> mode, rtx op0, rtx op1, rtx *rem,
>> >> >        if (rem2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
>> NULL,
>> >> quot1,
>> >> > -				 op11, NULL_RTX, unsignedp,
>> >> OPTAB_DIRECT);
>> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
>> op11,
>> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >> >        if (quot2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
>> >> >
>> >>
>> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
>> d7a5
>> >> ccb
>> >> > f6147947351a 100644
>> >> > --- a/gcc/optabs.def
>> >> > +++ b/gcc/optabs.def
>> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>> >> >
>> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
>> >> > (umul_highpart_optab, "umul$a3_highpart")
>> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
>> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
>> >> >
>> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
>> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
>> >> >
>> >>
>> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed0
>> 8d1d
>> >> 81a
>> >> > fa2c2baa64a5 100644
>> >> > --- a/gcc/target.def
>> >> > +++ b/gcc/target.def
>> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>> >> >  	const vec_perm_indices &sel),
>> >> >   NULL)
>> >> >
>> >> > -DEFHOOK
>> >> > -(can_special_div_by_const,
>> >> > - "This hook is used to test whether the target has a special
>> >> > method of\n\ -division of vectors of type @var{vectype} using the
>> >> > value @var{constant},\n\ -and producing a vector of type
>> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
>> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
>> >> > to test whether the target supports a special\n\ -divide,
>> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
>> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
>> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
>> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
>> >> > -Return true if the operation is possible, emitting instructions
>> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
>> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
>> >> > -	rtx in0, rtx in1),
>> >> > - default_can_special_div_by_const)
>> >> > -
>> >> >  /* Return true if the target supports misaligned store/load of a
>> >> >     specific factor denoted in the third parameter.  The last parameter
>> >> >     is true if the access is defined in a packed struct.  */ diff
>> >> > --git a/gcc/target.h b/gcc/target.h index
>> >> >
>> >>
>> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
>> a82b9
>> >> 9f9
>> >> > 13158c2d47b1 100644
>> >> > --- a/gcc/target.h
>> >> > +++ b/gcc/target.h
>> >> > @@ -51,7 +51,6 @@
>> >> >  #include "insn-codes.h"
>> >> >  #include "tm.h"
>> >> >  #include "hard-reg-set.h"
>> >> > -#include "tree-core.h"
>> >> >
>> >> >  #if CHECKING_P
>> >> >
>> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
>> >> >
>> >>
>> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
>> 24454
>> >> 93
>> >> > 17a31390f0c2 100644
>> >> > --- a/gcc/targhooks.h
>> >> > +++ b/gcc/targhooks.h
>> >> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
>> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
>> >> > (rtx, tree, tree);  extern unsigned int
>> >> > default_case_values_threshold (void);  extern bool
>> >> > default_have_conditional_execution (void); -extern bool
>> >> > default_can_special_div_by_const (enum tree_code, tree,
>> >> wide_int,
>> >> > -					      rtx *, rtx, rtx);
>> >> >
>> >> >  extern bool default_libc_has_function (enum function_class, tree);
>> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
>> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
>> >> >
>> >>
>> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
>> da91e
>> >> 03
>> >> > 877337a931e7 100644
>> >> > --- a/gcc/targhooks.cc
>> >> > +++ b/gcc/targhooks.cc
>> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>> >> >    return HAVE_conditional_execution;  }
>> >> >
>> >> > -/* Default that no division by constant operations are special.
>> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
>> >> > wide_int, rtx *, rtx,
>> >> > -				  rtx)
>> >> > -{
>> >> > -  return false;
>> >> > -}
>> >> > -
>> >> >  /* By default we assume that c99 functions are present at the runtime,
>> >> >     but sincos is not.  */
>> >> >  bool
>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > new file mode 100644
>> >> > index
>> >> >
>> >>
>> 0000000000000000000000000000000000000000..c81f8946922250234b
>> f759e0a0
>> >> a0
>> >> > 4ea8c1f73e3c
>> >> > --- /dev/null
>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > @@ -0,0 +1,25 @@
>> >> > +/* { dg-require-effective-target vect_int } */
>> >> > +
>> >> > +#include <stdint.h>
>> >> > +#include "tree-vect.h"
>> >> > +
>> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
>> >> > +
>> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
>> >> > +V foo (V v, unsigned short i) {
>> >> > +  v /= i;
>> >> > +  return v;
>> >> > +}
>> >> > +
>> >> > +int
>> >> > +main (void)
>> >> > +{
>> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
>> >> > +}, 0xffff);
>> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
>> >> > +    if (v[i] != 0x00010001)
>> >> > +      __builtin_abort ();
>> >> > +  return 0;
>> >> > +}
>> >> > +
>> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
>> >> > +detected" "vect" { target aarch64*-*-* } } } */
>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > new file mode 100644
>> >> > index
>> >> >
>> >>
>> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
>> 06b4991
>> >> 4d2
>> >> > a29b933de625
>> >> > --- /dev/null
>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > @@ -0,0 +1,58 @@
>> >> > +/* { dg-require-effective-target vect_int } */
>> >> > +
>> >> > +#include <stdint.h>
>> >> > +#include <stdio.h>
>> >> > +#include "tree-vect.h"
>> >> > +
>> >> > +#define N 50
>> >> > +#define TYPE uint8_t
>> >> > +
>> >> > +#ifndef DEBUG
>> >> > +#define DEBUG 0
>> >> > +#endif
>> >> > +
>> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
>> >> > +
>> >> > +
>> >> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
>> >> > +restrict pixel, TYPE level, int n) {
>> >> > +  for (int i = 0; i < n; i+=1)
>> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> >> > +
>> >> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
>> >> > +restrict pixel, TYPE level, int n) {
>> >> > +  for (int i = 0; i < n; i+=1)
>> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> >> > +
>> >> > +int main ()
>> >> > +{
>> >> > +  TYPE a[N];
>> >> > +  TYPE b[N];
>> >> > +
>> >> > +  for (int i = 0; i < N; ++i)
>> >> > +    {
>> >> > +      a[i] = BASE + i * 13;
>> >> > +      b[i] = BASE + i * 13;
>> >> > +      if (DEBUG)
>> >> > +        printf ("%d: 0x%x\n", i, a[i]);
>> >> > +    }
>> >> > +
>> >> > +  fun1 (a, N / 2, N);
>> >> > +  fun2 (b, N / 2, N);
>> >> > +
>> >> > +  for (int i = 0; i < N; ++i)
>> >> > +    {
>> >> > +      if (DEBUG)
>> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
>> >> > +
>> >> > +      if (a[i] != b[i])
>> >> > +        __builtin_abort ();
>> >> > +    }
>> >> > +  return 0;
>> >> > +}
>> >> > +
>> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
>> >> > +{ target aarch64*-*-* } } } */
>> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
>> >> > index
>> >> >
>> >>
>> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
>> 14077d
>> >> c3
>> >> > e970bed75ef6 100644
>> >> > --- a/gcc/tree-vect-generic.cc
>> >> > +++ b/gcc/tree-vect-generic.cc
>> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
>> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
>> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
>> >> >  	  tree ret;
>> >> >
>> >> > -	  /* Check if the target was going to handle it through the special
>> >> > -	     division callback hook.  */
>> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
>> >> > -	  if (cst &&
>> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL,
>> >> > -							  NULL_RTX,
>> >> NULL_RTX))
>> >> > -	    return NULL_TREE;
>> >> > -
>> >> > -
>> >> >  	  if (!optimize
>> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
>> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc46
>> 7f33
>> >> 69
>> >> > de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> vectype,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL, NULL_RTX,
>> >> > -							  NULL_RTX))
>> >> > +	   && TYPE_UNSIGNED (itype)
>> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> > +	   && vectype
>> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* div optimizations using narrowings
>> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> > +       double the precision of x.
>> >> > +
>> >> > +       If we imagine a short as being composed of two blocks of bytes then
>> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
>> equivalent to
>> >> > +       adding 1 to each sub component:
>> >> > +
>> >> > +	    short value of 16-bits
>> >> > +       ┌──────────────┬────────────────┐
>> >> > +       │              │                │
>> >> > +       └──────────────┴────────────────┘
>> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> > +		     │                │
>> >> > +		     │                │
>> >> > +		    +1               +1
>> >> > +
>> >> > +       after the first addition, we have to shift right by 8, and narrow the
>> >> > +       results back to a byte.  Remember that the addition must be done in
>> >> > +       double the precision of the input.  However if we know that
>> >> > + the
>> >> addition
>> >> > +       `x + 257` does not overflow then we can do the operation in
>> >> > + the
>> >> current
>> >> > +       precision.  In which case we don't need the pack and unpacks.  */
>> >> > +      auto wcst = wi::to_wide (cst);
>> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> > +	{
>> >> > +	  wide_int min,max;
>> >> > +	  /* If we're in a pattern we need to find the orginal definition.  */
>> >> > +	  tree op0 = oprnd0;
>> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> > +	    {
>> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +	    }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
>> >>
>> >> > +
>> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> > +	     information we can't perform the optimization.  */
>> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> > +	    {
>> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> > +	      wi::overflow_type ovf;
>> >> > +	      /* We need adder and max in the same precision.  */
>> >> > +	      wide_int zadder
>> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +					  UNSIGNED);
>> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i = 0; i < n; i+=1)
>> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >
>> > And have the addition be done as a 32 bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So In the above the range will correctly be 0x1fe but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype which is derived from the size the vectorizer
>> will perform the operation in.
>> >
>> > Thanks,
>> > Tamar
>> >
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> > +	      if (ovf == wi::OVF_NONE)
>> >> > +		{
>> >> > +		  *type_out = vectype;
>> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
>> >> > +		  gcall *patt1
>> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
>> >> tadder);
>> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
>> >> > +		  gimple_call_set_lhs (patt1, lhs);
>> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
>> >> vectype);
>> >> > +
>> >> > +		  pattern_stmt
>> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
>> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
>> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
>> >> > +
>> >> > +		  return pattern_stmt;
>> >> > +		}
>> >> > +	    }
>> >> > +	}
>> >> >      }
>> >> >
>> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
>> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
>> >> >
>> >>
>> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0
>> b95
>> >> 64f
>> >> > c4e066e50081 100644
>> >> > --- a/gcc/tree-vect-stmts.cc
>> >> > +++ b/gcc/tree-vect-stmts.cc
>> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>> >> >  	}
>> >> >        target_support_p = (optab_handler (optab, vec_mode)
>> >> >  			  != CODE_FOR_nothing);
>> >> > -      tree cst;
>> >> > -      if (!target_support_p
>> >> > -	  && op1
>> >> > -	  && (cst = uniform_integer_cst_p (op1)))
>> >> > -	target_support_p
>> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
>> >> > -							wi::to_wide (cst),
>> >> > -							NULL, NULL_RTX,
>> >> > -							NULL_RTX);
>> >> >      }
>> >> >
>> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
>> >> > (vectype);


* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-27 12:11         ` Richard Sandiford
@ 2023-02-27 12:14           ` Tamar Christina
  2023-02-27 21:33             ` Richard Sandiford
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-27 12:14 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, February 27, 2023 12:12 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi,
> >
> >> > I avoided open coding it with add and shift because it creates a 4
> >> > instructions (and shifts which are typically slow) dependency chain
> >> > instead of a load and multiply.  This change, unless the target is
> >> > known to optimize it further is unlikely to be beneficial.  And by
> >> > the time we get to costing the only alternative is to undo the
> >> > existing pattern and
> >> so you lose the general shift optimization.
> >> >
> >> > So it seemed unwise to open code as shifts, given the codegen out
> >> > of the vectorizer would be degenerate for most targets or one needs
> >> > the more complicated route of costing during pattern matching already.
> >>
> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> something that should be exposed through optabs.  And I imagine the
> >> open-coded version would still be better than nothing on targets without
> highpart multiply.
> >>
> >> So how about replacing the hook with one that simply asks whether
> >> division through highpart multiplication is preferred over the add/shift
> sequence?
> >> (Unfortunately it's not going to be possible to work that out from
> >> existing
> >> information.)
> >
> > So this doesn't work for SVE.  For SVE the multiplication widening
> > pass introduces FMAs at gimple level.  So in the cases where the
> > operation is fed from a widening multiplication we end up generating FMA.
> If that was it I could have matched FMA.
> >
> > But it also pushes the multiplication in the second operand because it
> > no longer has a mul to share the results with.
> >
> > In any case, the gimple code is transformed into
> >
> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257,
> > ... });
> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> > vect_patt_65.12_128);
> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> > vect_patt_62.14_130;
> >
> > This transformation is much worse than the original code, it extended
> > the dependency chain with another expensive instruction. I can try to
> > correct this in RTL by matching FMA and shift and splitting into MUL +
> ADDHNB and hope CSE takes care of the extra mul.
> >
> > But this seems like a hack, and it's basically undoing the earlier
> > transformation.  It seems to me that the open coding is a bad idea.
> 
> Could you post the patch that gives this result?  I'll have a poke around.

Sure, I'll post the new series, it needs all of them.

Tamar.

> 
> Thanks,
> Richard
> 
> > Do you still want it Richard?
> >
> > Thanks,
> > Tamar
> >>
> >> Thanks,
> >> Richard
> >>
> >> >
> >> >>
> >> >> Some comments in addition to Richard's:
> >> >>
> >> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Hi All,
> >> >> >
> >> >> > As discussed in the ticket, this replaces the approach for
> >> >> > optimizing the div by bitmask operation from a hook into optabs
> >> >> > implemented through add_highpart.
> >> >> >
> >> >> > In order to be able to use this we need to check whether the
> >> >> > current precision has enough bits to do the operation without
> >> >> > any of the additions
> >> >> overflowing.
> >> >> >
> >> >> > We use range information to determine this and only do the
> >> >> > operation if we're sure am overflow won't occur.
> >> >> >
> >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> >> issues.
> >> >> >
> >> >> > Ok for master?
> >> >> >
> >> >> > Thanks,
> >> >> > Tamar
> >> >> >
> >> >> > gcc/ChangeLog:
> >> >> >
> >> >> > 	PR target/108583
> >> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> >> Remove.
> >> >> > 	* doc/tm.texi.in: Likewise.
> >> >> > 	* explow.cc (round_push, align_dynamic_address): Revert
> >> >> > previous
> >> >> patch.
> >> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> >> > 	* expmed.h (expand_divmod): Likewise.
> >> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> >> > 	* optabs.cc (expand_doubleword_mod,
> >> >> expand_doubleword_divmod): Likewise.
> >> >> > 	* internal-fn.def (ADDH): New.
> >> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> >> > 	* doc/md.texi: Document them.
> >> >> > 	* doc/rtl.texi: Likewise.
> >> >> > 	* target.def (can_special_div_by_const): Remove.
> >> >> > 	* target.h: Remove tree-core.h include
> >> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove
> >> >> > hook
> >> >> and
> >> >> > 	implement new obtab recognition based on range.
> >> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >> >
> >> >> > gcc/testsuite/ChangeLog:
> >> >> >
> >> >> > 	PR target/108583
> >> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >> >
> >> >> > --- inline copy of patch --
> >> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> >> >
> >> >>
> >>
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f
> >> 74080
> >> >> 3
> >> >> > 8595e21af35d 100644
> >> >> > --- a/gcc/doc/md.texi
> >> >> > +++ b/gcc/doc/md.texi
> >> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> >> @code{smul_highpart} RTX expression.
> >> >> >  Similar, but the multiplication is unsigned.  This may be
> >> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >> >
> >> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> >> > +@samp{smul@var{m}3_highpart}
> >> >>
> >> >> sadd
> >> >>
> >> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> >> > +@var{m}, and store the most significant half of the product in
> >> >> > +operand
> >> 0.
> >> >> > +The least significant half of the product is discarded.  This
> >> >> > +may be represented in RTL using a @code{sadd_highpart} RTX
> expression.
> >> >> > +
> >> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is
> unsigned.
> >> >> > +This may be represented in RTL using an @code{uadd_highpart}
> >> >> > +RTX expression.
> >> >> > +
> >> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> >> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
> >> extend
> >> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> >> >> > b/gcc/doc/rtl.texi index
> >> >> >
> >> >>
> >>
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00
> >> 343
> >> >> d17
> >> >> > 1940ec4222f3 100644
> >> >> > --- a/gcc/doc/rtl.texi
> >> >> > +++ b/gcc/doc/rtl.texi
> >> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> >> > @code{smul_highpart} returns the high part  of a signed
> >> >> > multiplication, @code{umul_highpart} returns the high part  of
> >> >> > an unsigned
> >> >> multiplication.
> >> >> >
> >> >> > +@findex sadd_highpart
> >> >> > +@findex uadd_highpart
> >> >> > +@cindex high-part addition
> >> >> > +@cindex addition high part
> >> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the
> >> >> > +high-part addition of @var{x} and @var{y} carried out in machine
> mode @var{m}.
> >> >> > +@code{sadd_highpart} returns the high part of a signed
> >> >> > +addition, @code{uadd_highpart} returns the high part of an unsigned
> addition.
> >> >>
> >> >> The patch doesn't add these RTL codes though.
> >> >>
> >> >> > +
> >> >> >  @findex fma
> >> >> >  @cindex fused multiply-add
> >> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> >> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >> >> >
> >> >>
> >>
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57
> >> 914840
> >> >> 17e
> >> >> > 6b0d62ab077e 100644
> >> >> > --- a/gcc/doc/tm.texi
> >> >> > +++ b/gcc/doc/tm.texi
> >> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need
> >> >> > for the hook to handle these two  implementation approaches itself.
> >> >> >  @end deftypefn
> >> >> >
> >> >> > -@deftypefn {Target Hook} bool
> >> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> >> @var{tree_code}, tree
> >> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> >> > target has a special method of -division of vectors of type
> >> >> > @var{vectype}
> >> >> using the value @var{constant}, -and producing a vector of type
> >> >> @var{vectype}.  The division -will then not be decomposed by the
> >> >> vectorizer and kept as a div.
> >> >> > -
> >> >> > -When the hook is being used to test whether the target supports
> >> >> > a special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> >> > When the hook -is being used to emit a division, @var{in0} and
> >> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> >> > -
> >> >> > -Return true if the operation is possible, emitting instructions
> >> >> > for it -if rtxes are provided and updating @var{output}.
> >> >> > -@end deftypefn
> >> >> > -
> >> >> >  @deftypefn {Target Hook} tree
> >> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> >> @var{code},
> >> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook
> >> >> > should return the decl of a function that implements the
> >> >> > vectorized variant of the function with the @code{combined_fn}
> >> >> > code diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >> >> >
> >> >>
> >>
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d
> >> 1efec0
> >> >> a3a
> >> >> > bccd1c293c7b 100644
> >> >> > --- a/gcc/doc/tm.texi.in
> >> >> > +++ b/gcc/doc/tm.texi.in
> >> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> >> strategy can generate better code.
> >> >> >
> >> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >> >
> >> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> >> > -
> >> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >> >
> >> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> >> >
> >> >>
> >>
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc
> >> 212f0
> >> >> bef
> >> >> > a016eea4573c 100644
> >> >> > --- a/gcc/explow.cc
> >> >> > +++ b/gcc/explow.cc
> >> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >> >       TRUNC_DIV_EXPR.  */
> >> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> >> > size, align_rtx,
> >> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size,
> >> >> > + align_rtx,
> >> >> >  			NULL_RTX, 1);
> >> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >> >
> >> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target,
> >> >> > unsigned
> >> >> required_align)
> >> >> >  			 gen_int_mode (required_align /
> BITS_PER_UNIT - 1,
> >> >> >  				       Pmode),
> >> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> >> > target,
> >> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >> >  			  gen_int_mode (required_align /
> BITS_PER_UNIT,
> >> >> >  					Pmode),
> >> >> >  			  NULL_RTX, 1);
> >> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> >> >
> >> >>
> >>
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c
> >> 5364
> >> >> 094
> >> >> > 1628068f3901 100644
> >> >> > --- a/gcc/expmed.h
> >> >> > +++ b/gcc/expmed.h
> >> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx
> >> >> > maybe_expand_shift
> >> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >> >  			       int);
> >> >> >  #ifdef GCC_OPTABS_H
> >> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode,
> >> >> > tree,
> >> >> tree,
> >> >> > -			  rtx, rtx, rtx, int,
> >> >> > -			  enum optab_methods =
> OPTAB_LIB_WIDEN);
> >> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode,
> >> >> > +rtx,
> >> >> rtx,
> >> >> > +			  rtx, int, enum optab_methods =
> >> >> OPTAB_LIB_WIDEN);
> >> >> >  #endif
> >> >> >  #endif
> >> >> >
> >> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> >> >
> >> >>
> >>
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025a
> >> b18a3
> >> >> a59
> >> >> > c169d3b7692f 100644
> >> >> > --- a/gcc/expmed.cc
> >> >> > +++ b/gcc/expmed.cc
> >> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode
> mode,
> >> rtx
> >> >> op0,
> >> >> > HOST_WIDE_INT d)
> >> >> >
> >> >> >  rtx
> >> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> >> mode,
> >> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> >> > -	       int unsignedp, enum optab_methods methods)
> >> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> >> > +	       enum optab_methods methods)
> >> >> >  {
> >> >> >    machine_mode compute_mode;
> >> >> >    rtx tquotient;
> >> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> > code, machine_mode mode,
> >> >> >
> >> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> >> > 0;
> >> >> >
> >> >> > -  /* Check if the target has specific expansions for the division.
> >> >> > */
> >> >> > -  tree cst;
> >> >> > -  if (treeop0
> >> >> > -      && treeop1
> >> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> >> (treeop0),
> >> >> > -						     wi::to_wide (cst),
> >> >> > -						     &target, op0, op1))
> >> >> > -    return target;
> >> >> > -
> >> >> > -
> >> >> >    /* Now convert to the best mode to use.  */
> >> >> >    if (compute_mode != mode)
> >> >> >      {
> >> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> code, machine_mode mode,
> >> >> >  			    || (optab_handler (sdivmod_optab,
> int_mode)
> >> >> >  				!= CODE_FOR_nothing)))
> >> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> >> > -						int_mode, treeop0,
> treeop1,
> >> >> > -						op0, gen_int_mode
> (abs_d,
> >> >> > +						int_mode, op0,
> >> >> > +						gen_int_mode
> (abs_d,
> >> >> >  							      int_mode),
> >> >> >  						NULL_RTX, 0);
> >> >> >  		    else
> >> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> code, machine_mode mode,
> >> >> >  				      size - 1, NULL_RTX, 0);
> >> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1,
> nsign),
> >> >> >  				    NULL_RTX);
> >> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> >> treeop0,
> >> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> t3,
> >> >> op1,
> >> >> > +				    NULL_RTX, 0);
> >> >> >  		if (t4)
> >> >> >  		  {
> >> >> >  		    rtx t5;
> >> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> >> >
> >> >>
> >>
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e75521633907
> >> 8d5b
> >> >> 2280
> >> >> > c6e277f26d72 100644
> >> >> > --- a/gcc/expr.cc
> >> >> > +++ b/gcc/expr.cc
> >> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >> >  	    return expand_divmod (0,
> >> >> >  				  FLOAT_MODE_P (GET_MODE
> (value))
> >> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> >> > -				  GET_MODE (value), NULL, NULL,
> op1, op2,
> >> >> > -				  target, 0);
> >> >> > +				  GET_MODE (value), op1, op2, target,
> 0);
> >> >> >  	case MOD:
> >> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 0);
> >> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 0);
> >> >> >  	case UDIV:
> >> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 1);
> >> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 1);
> >> >> >  	case UMOD:
> >> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 1);
> >> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 1);
> >> >> >  	case ASHIFTRT:
> >> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1,
> op2,
> >> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@
> -
> >> >> 9170,13 +9169,11 @@
> >> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> >> treeop0,
> >> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >> >        do_pending_stack_adjust ();
> >> >> >        start_sequence ();
> >> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> >> > -				   op0, op1, target, 1);
> >> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> >> > + target, 1);
> >> >> >        rtx_insn *uns_insns = get_insns ();
> >> >> >        end_sequence ();
> >> >> >        start_sequence ();
> >> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> >> > -				   op0, op1, target, 0);
> >> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> >> > + target, 0);
> >> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >> >        end_sequence ();
> >> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> >> > -9198,8
> >> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> >> mode, tree treeop0,
> >> >> >        emit_insn (sgn_insns);
> >> >> >        return sgn_ret;
> >> >> >      }
> >> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> >> > -			op0, op1, target, unsignedp);
> >> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> >> > + unsignedp);
> >> >> >  }
> >> >> >
> >> >> >  rtx
> >> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >> >
> >> >>
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
> >> 584f5a
> >> >> 3b
> >> >> > 8a734baa800f 100644
> >> >> > --- a/gcc/internal-fn.def
> >> >> > +++ b/gcc/internal-fn.def
> >> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN
> (AVG_CEIL,
> >> >> ECF_CONST
> >> >> > | ECF_NOTHROW, first,
> >> >> >
> >> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> >  			      smul_highpart, umul_highpart, binary)
> >> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> >  			      smulhs, umulhs, binary)
> DEF_INTERNAL_SIGNED_OPTAB_FN
> >> >> > (MULHRS, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >> >
> >> >>
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
> >> 69a6
> >> >> e
> >> >> > 77082c1e617b 100644
> >> >> > --- a/gcc/optabs.cc
> >> >> > +++ b/gcc/optabs.cc
> >> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >> >  		return NULL_RTX;
> >> >> >  	    }
> >> >> >  	}
> >> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> >> NULL, NULL,
> >> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> >> > -							word_mode),
> >> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> >> word_mode,
> >> >> sum,
> >> >> > +				     gen_int_mode (INTVAL (op1),
> >> >> word_mode),
> >> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >> >        if (remainder == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod
> (machine_mode
> >> >> mode, rtx
> >> >> > op0, rtx op1, rtx *rem,
> >> >> >
> >> >> >    if (op11 != const1_rtx)
> >> >> >      {
> >> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> >> NULL,
> >> >> quot1,
> >> >> > -				op11, NULL_RTX, unsignedp,
> >> >> OPTAB_DIRECT);
> >> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> >> op11,
> >> >> > +				NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> >> >> >        if (rem2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod
> (machine_mode
> >> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >> >        if (rem2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> >> NULL,
> >> >> quot1,
> >> >> > -				 op11, NULL_RTX, unsignedp,
> >> >> OPTAB_DIRECT);
> >> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode,
> >> >> > + quot1,
> >> op11,
> >> >> > +				 NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> >> >> >        if (quot2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >> >
> >> >>
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
> >> d7a5
> >> >> ccb
> >> >> > f6147947351a 100644
> >> >> > --- a/gcc/optabs.def
> >> >> > +++ b/gcc/optabs.def
> >> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >> >
> >> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >> >
> >> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
> >> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
> >> >> >
> >> >>
> >>
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed0
> >> 8d1d
> >> >> 81a
> >> >> > fa2c2baa64a5 100644
> >> >> > --- a/gcc/target.def
> >> >> > +++ b/gcc/target.def
> >> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >> >  	const vec_perm_indices &sel),
> >> >> >   NULL)
> >> >> >
> >> >> > -DEFHOOK
> >> >> > -(can_special_div_by_const,
> >> >> > - "This hook is used to test whether the target has a special
> >> >> > method of\n\ -division of vectors of type @var{vectype} using
> >> >> > the value @var{constant},\n\ -and producing a vector of type
> >> >> > @var{vectype}.  The division\n\ -will then not be decomposed by
> >> >> > the vectorizer and kept as a div.\n\ -\n\ -When the hook is
> >> >> > being used to test whether the target supports a special\n\
> >> >> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> >> > When the hook\n\ -is being used to emit a division, @var{in0}
> >> >> > and @var{in1} are the source\n\ -vectors of type @var{vecttype}
> >> >> > and @var{output} is the destination vector of\n\ -type
> >> >> > @var{vectype}.\n\ -\n\ -Return true if the operation is
> >> >> > possible, emitting instructions for it\n\ -if rtxes are provided
> >> >> > and updating @var{output}.",
> >> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> >> > -	rtx in0, rtx in1),
> >> >> > - default_can_special_div_by_const)
> >> >> > -
> >> >> >  /* Return true if the target supports misaligned store/load of a
> >> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >> >     is true if the access is defined in a packed struct.  */
> >> >> > diff --git a/gcc/target.h b/gcc/target.h index
> >> >> >
> >> >>
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
> >> a82b9
> >> >> 9f9
> >> >> > 13158c2d47b1 100644
> >> >> > --- a/gcc/target.h
> >> >> > +++ b/gcc/target.h
> >> >> > @@ -51,7 +51,6 @@
> >> >> >  #include "insn-codes.h"
> >> >> >  #include "tm.h"
> >> >> >  #include "hard-reg-set.h"
> >> >> > -#include "tree-core.h"
> >> >> >
> >> >> >  #if CHECKING_P
> >> >> >
> >> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >> >
> >> >>
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
> >> 24454
> >> >> 93
> >> >> > 17a31390f0c2 100644
> >> >> > --- a/gcc/targhooks.h
> >> >> > +++ b/gcc/targhooks.h
> >> >> > @@ -209,8 +209,6 @@ extern void
> >> >> > default_addr_space_diagnose_usage (addr_space_t, location_t);
> >> >> > extern rtx default_addr_space_convert (rtx, tree, tree);  extern
> >> >> > unsigned int default_case_values_threshold (void);  extern bool
> >> >> > default_have_conditional_execution (void); -extern bool
> >> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> >> wide_int,
> >> >> > -					      rtx *, rtx, rtx);
> >> >> >
> >> >> >  extern bool default_libc_has_function (enum function_class,
> >> >> > tree); extern bool default_libc_has_fast_function (int fcode);
> >> >> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >> >
> >> >>
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
> >> da91e
> >> >> 03
> >> >> > 877337a931e7 100644
> >> >> > --- a/gcc/targhooks.cc
> >> >> > +++ b/gcc/targhooks.cc
> >> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution
> (void)
> >> >> >    return HAVE_conditional_execution;  }
> >> >> >
> >> >> > -/* Default that no division by constant operations are special.
> >> >> > */ -bool -default_can_special_div_by_const (enum tree_code,
> >> >> > tree, wide_int, rtx *, rtx,
> >> >> > -				  rtx)
> >> >> > -{
> >> >> > -  return false;
> >> >> > -}
> >> >> > -
> >> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >> >     but sincos is not.  */
> >> >> >  bool
> >> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > new file mode 100644
> >> >> > index
> >> >> >
> >> >>
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234b
> >> f759e0a0
> >> >> a0
> >> >> > 4ea8c1f73e3c
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > @@ -0,0 +1,25 @@
> >> >> > +/* { dg-require-effective-target vect_int } */
> >> >> > +
> >> >> > +#include <stdint.h>
> >> >> > +#include "tree-vect.h"
> >> >> > +
> >> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> >> > +
> >> >> > +static __attribute__((__noinline__))
> >> >> > +__attribute__((__noclone__)) V foo (V v, unsigned short i) {
> >> >> > +  v /= i;
> >> >> > +  return v;
> >> >> > +}
> >> >> > +
> >> >> > +int
> >> >> > +main (void)
> >> >> > +{
> >> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff,
> >> >> > +0xffffffff }, 0xffff);
> >> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> >> > +    if (v[i] != 0x00010001)
> >> >> > +      __builtin_abort ();
> >> >> > +  return 0;
> >> >> > +}
> >> >> > +
> >> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > new file mode 100644
> >> >> > index
> >> >> >
> >> >>
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
> >> 06b4991
> >> >> 4d2
> >> >> > a29b933de625
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > @@ -0,0 +1,58 @@
> >> >> > +/* { dg-require-effective-target vect_int } */
> >> >> > +
> >> >> > +#include <stdint.h>
> >> >> > +#include <stdio.h>
> >> >> > +#include "tree-vect.h"
> >> >> > +
> >> >> > +#define N 50
> >> >> > +#define TYPE uint8_t
> >> >> > +
> >> >> > +#ifndef DEBUG
> >> >> > +#define DEBUG 0
> >> >> > +#endif
> >> >> > +
> >> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> >> > +
> >> >> > +
> >> >> > +__attribute__((noipa, noinline, optimize("O1"))) void
> >> >> > +fun1(TYPE* restrict pixel, TYPE level, int n) {
> >> >> > +  for (int i = 0; i < n; i+=1)
> >> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> > +
> >> >> > +__attribute__((noipa, noinline, optimize("O3"))) void
> >> >> > +fun2(TYPE* restrict pixel, TYPE level, int n) {
> >> >> > +  for (int i = 0; i < n; i+=1)
> >> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> > +
> >> >> > +int main ()
> >> >> > +{
> >> >> > +  TYPE a[N];
> >> >> > +  TYPE b[N];
> >> >> > +
> >> >> > +  for (int i = 0; i < N; ++i)
> >> >> > +    {
> >> >> > +      a[i] = BASE + i * 13;
> >> >> > +      b[i] = BASE + i * 13;
> >> >> > +      if (DEBUG)
> >> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> >> > +    }
> >> >> > +
> >> >> > +  fun1 (a, N / 2, N);
> >> >> > +  fun2 (b, N / 2, N);
> >> >> > +
> >> >> > +  for (int i = 0; i < N; ++i)
> >> >> > +    {
> >> >> > +      if (DEBUG)
> >> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> >> > +
> >> >> > +      if (a[i] != b[i])
> >> >> > +        __builtin_abort ();
> >> >> > +    }
> >> >> > +  return 0;
> >> >> > +}
> >> >> > +
> >> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> >> > +{ target aarch64*-*-* } } } */
> >> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> >> > index
> >> >> >
> >> >>
> >>
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
> >> 14077d
> >> >> c3
> >> >> > e970bed75ef6 100644
> >> >> > --- a/gcc/tree-vect-generic.cc
> >> >> > +++ b/gcc/tree-vect-generic.cc
> >> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> >> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >> >  	  tree ret;
> >> >> >
> >> >> > -	  /* Check if the target was going to handle it through the
> special
> >> >> > -	     division callback hook.  */
> >> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> >> > -	  if (cst &&
> >> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> >> >> > -							  NULL_RTX,
> >> >> NULL_RTX))
> >> >> > -	    return NULL_TREE;
> >> >> > -
> >> >> > -
> >> >> >  	  if (!optimize
> >> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc46
> >> 7f33
> >> >> 69
> >> >> > de2afea139d6 100644
> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> >> *vinfo,
> >> >> >        return pattern_stmt;
> >> >> >      }
> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> vectype,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> NULL_RTX,
> >> >> > -							  NULL_RTX))
> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> > +	   && vectype
> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >      {
> >> >> > -      return NULL;
> >> >> > +      /* div optimizations using narrowings
> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> > +       double the precision of x.
> >> >> > +
> >> >> > +       If we imagine a short as being composed of two blocks of bytes
> then
> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> equivalent to
> >> >> > +       adding 1 to each sub component:
> >> >> > +
> >> >> > +	    short value of 16-bits
> >> >> > +       ┌──────────────┬────────────────┐
> >> >> > +       │              │                │
> >> >> > +       └──────────────┴────────────────┘
> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> > +		     │                │
> >> >> > +		     │                │
> >> >> > +		    +1               +1
> >> >> > +
> >> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> >> > +       results back to a byte.  Remember that the addition must be done
> in
> >> >> > +       double the precision of the input.  However if we know
> >> >> > + that the
> >> >> addition
> >> >> > +       `x + 257` does not overflow then we can do the operation
> >> >> > + in the
> >> >> current
> >> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> > +	{
> >> >> > +	  wide_int min,max;
> >> >> > +	  /* If we're in a pattern we need to find the orginal definition.
> */
> >> >> > +	  tree op0 = oprnd0;
> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> > +	    {
> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> (stmt_info);
> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> (orig_stmt));
> >> >> > +	    }
> >> >>
> >> >> If this is generally safe (I'm skipping thinking about it in the
> >> >> interests of a quick review :-)), then I think it should be done
> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
> >> >> more general than handling just assignments.
> >> >>
> >> >> > +
> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> > +	     information we can't perform the optimization.  */
> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> > +	    {
> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> > +	      wi::overflow_type ovf;
> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> > +	      wide_int zadder
> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> (max),
> >> >> > +					  UNSIGNED);
> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >>
> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> precisions?
> >> >
> >> > C promotion rules will promote e.g.
> >> >
> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >   for (int i = 0; i < n; i+=1)
> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >
> >> > And have the addition be done as a 32 bit integer.  The vectorizer
> >> > will demote this down to a short, but range information is not
> >> > stored for patterns.  So In the above the range will correctly be
> >> > 0x1fe but the precision will be that of the original expression, so
> >> > 32.  This will be a mismatch with itype which is derived from the
> >> > size the vectorizer
> >> will perform the operation in.
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> >>
> >> >> Thanks,
> >> >> Richard
> >> >>
> >> >> > +	      if (ovf == wi::OVF_NONE)
> >> >> > +		{
> >> >> > +		  *type_out = vectype;
> >> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> >> > +		  gcall *patt1
> >> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> >> >> tadder);
> >> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> >> >> vectype);
> >> >> > +
> >> >> > +		  pattern_stmt
> >> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> lhs);
> >> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> >> > +
> >> >> > +		  return pattern_stmt;
> >> >> > +		}
> >> >> > +	    }
> >> >> > +	}
> >> >> >      }
> >> >> >
> >> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
> >> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> >> >> >
> >> >>
> >>
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0
> >> b95
> >> >> 64f
> >> >> > c4e066e50081 100644
> >> >> > --- a/gcc/tree-vect-stmts.cc
> >> >> > +++ b/gcc/tree-vect-stmts.cc
> >> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >> >  	}
> >> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >> >  			  != CODE_FOR_nothing);
> >> >> > -      tree cst;
> >> >> > -      if (!target_support_p
> >> >> > -	  && op1
> >> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> >> > -	target_support_p
> >> >> > -	  = targetm.vectorize.can_special_div_by_const (code,
> vectype,
> >> >> > -							wi::to_wide
> (cst),
> >> >> > -							NULL,
> NULL_RTX,
> >> >> > -							NULL_RTX);
> >> >> >      }
> >> >> >
> >> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
> >> >> > (vectype);


* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-27 12:14           ` Tamar Christina
@ 2023-02-27 21:33             ` Richard Sandiford
  2023-02-27 22:10               ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-27 21:33 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches; +Cc: Tamar Christina, nd, rguenther, jlaw

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, February 27, 2023 12:12 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> > Hi,
>> >
>> >> > I avoided open coding it with add and shift because it creates a 4
>> >> > instructions (and shifts which are typically slow) dependency chain
>> >> > instead of a load and multiply.  This change, unless the target is
>> >> > known to optimize it further is unlikely to be beneficial.  And by
>> >> > the time we get to costing the only alternative is to undo the
>> >> > existing pattern and
>> >> so you lose the general shift optimization.
>> >> >
>> >> > So it seemed unwise to open code as shifts, given the codegen out
>> >> > of the vectorizer would be degenerate for most targets or one needs
>> >> > the more complicated route of costing during pattern matching already.
>> >>
>> >> Hmm, OK.  That seems like a cost-model thing though, rather than
>> >> something that should be exposed through optabs.  And I imagine the
>> >> open-coded version would still be better than nothing on targets without
>> highpart multiply.
>> >>
>> >> So how about replacing the hook with one that simply asks whether
>> >> division through highpart multiplication is preferred over the add/shift
>> sequence?
>> >> (Unfortunately it's not going to be possible to work that out from
>> >> existing
>> >> information.)
>> >
>> > So this doesn't work for SVE.  For SVE the multiplication widening
>> > pass introduces FMAs at gimple level.  So in the cases where the
>> > operation is fed from a widening multiplication we end up generating FMA.
>> If that was it I could have matched FMA.
>> >
>> > But it also pushes the multiplication in the second operand because it
>> > no longer has a mul to share the results with.
>> >
>> > In any case, the gimple code is transformed into
>> >
>> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
>> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257,
>> > ... });
>> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
>> > vect_patt_65.12_128);
>> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
>> > vect_patt_62.14_130;
>> >
>> > This transformation is much worse than the original code, it extended
>> > the dependency chain with another expensive instruction. I can try to
>> > correct this in RTL by matching FMA and shift and splitting into MUL +
>> ADDHNB and hope CSE takes care of the extra mul.
>> >
>> > But this seems like a hack, and it's basically undoing the earlier
>> > transformation.  It seems to me that the open coding is a bad idea.
>> 
>> Could you post the patch that gives this result?  I'll have a poke around.
>
> Sure, I'll post the new series, it needs all of them.

Thanks.  Which testcase did you use to get the above?

But since SVE does have highpart multiply, and since the assumption for
SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just
be one of the targets for which the hook that "asks whether division
through highpart multiplication is preferred over the add/shift
sequence" returns true?

For extra conservativeness, we could make the hook default to true
and explicitly return false for Advanced SIMD and for SVE2.
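
Concretely, something along these lines on the aarch64 side (the hook name
and signature below are invented purely for illustration, not a proposed
interface):

/* Sketch only: return true if dividing by (1 << N) - 1 should go via a
   highpart multiply rather than the open-coded add/shift sequence.
   The generic default would return true.  */
static bool
aarch64_prefer_div_via_mul_highpart (tree vectype ATTRIBUTE_UNUSED)
{
  /* SVE2 has ADDHNB and Advanced SIMD has ADDHN, so the add/shift
     sequence narrows cheaply there; plain SVE should keep MULH.  */
  if (TARGET_SVE2)
    return false;
  if (TARGET_SVE)
    return true;
  return false;
}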

Richard


* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-27 21:33             ` Richard Sandiford
@ 2023-02-27 22:10               ` Tamar Christina
  2023-02-28 11:08                 ` Richard Sandiford
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-27 22:10 UTC (permalink / raw)
  To: Richard Sandiford, Tamar Christina via Gcc-patches; +Cc: nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, February 27, 2023 9:33 PM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, February 27, 2023 12:12 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> > Hi,
> >> >
> >> >> > I avoided open coding it with add and shift because it creates a
> >> >> > 4 instructions (and shifts which are typically slow) dependency
> >> >> > chain instead of a load and multiply.  This change, unless the
> >> >> > target is known to optimize it further is unlikely to be
> >> >> > beneficial.  And by the time we get to costing the only
> >> >> > alternative is to undo the existing pattern and
> >> >> so you lose the general shift optimization.
> >> >> >
> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> > out of the vectorizer would be degenerate for most targets or
> >> >> > one needs the more complicated route of costing during pattern
> matching already.
> >> >>
> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> >> something that should be exposed through optabs.  And I imagine
> >> >> the open-coded version would still be better than nothing on
> >> >> targets without
> >> highpart multiply.
> >> >>
> >> >> So how about replacing the hook with one that simply asks whether
> >> >> division through highpart multiplication is preferred over the
> >> >> add/shift
> >> sequence?
> >> >> (Unfortunately it's not going to be possible to work that out from
> >> >> existing
> >> >> information.)
> >> >
> >> > So this doesn't work for SVE.  For SVE the multiplication widening
> >> > pass introduces FMAs at gimple level.  So in the cases where the
> >> > operation is fed from a widening multiplication we end up generating
> FMA.
> >> If that was it I could have matched FMA.
> >> >
> >> > But it also pushes the multiplication in the second operand because
> >> > it no longer has a mul to share the results with.
> >> >
> >> > In any case, the gimple code is transformed into
> >> >
> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
> >> > 257, ... });
> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> >> > vect_patt_65.12_128);
> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> >> > vect_patt_62.14_130;
> >> >
> >> > This transformation is much worse than the original code, it
> >> > extended the dependency chain with another expensive instruction. I
> >> > can try to correct this in RTL by matching FMA and shift and
> >> > splitting into MUL +
> >> ADDHNB and hope CSE takes care of the extra mul.
> >> >
> >> > But this seems like a hack, and it's basically undoing the earlier
> >> > transformation.  It seems to me that the open coding is a bad idea.
> >>
> >> Could you post the patch that gives this result?  I'll have a poke around.
> >
> > Sure, I'll post the new series, it needs all of them.
> 
> Thanks.  Which testcase did you use to get the above?
> 

#include <stdint.h>

#define N 16
#define TYPE uint8_t

void fun3(TYPE* restrict pixel, TYPE level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

> But since SVE does have highpart multiply, and since the assumption for SVE is
> that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just be one of
> the targets for which the hook that "asks whether division through highpart
> multiplication is preferred over the add/shift sequence" returns true?
> 

Yes (it's also two adds not 3), but it's not correct for SVE2, which has addhnb, in which case 2x addhnb is
much faster than MULH+shift.  And the problem is that widening_mul will not
allow add+shift to reach the backend because the ADD+shift were open coded.

They are now subjected to further optimization.

To summarize:

Other targets: false
SVE: false
SVE2: true
NEON: true

SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.

If you're saying you don't want the optimization for SVE2, then sure, happy to turn it off.

But UMULH+LSR is 6 cycles on Neoverse-N2 with a throughput of 1, whereas
2x ADDHNB is 4 cycles with a throughput of 2.
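
For reference, a standalone scalar sketch of the identity the two ADDHNB
steps compute (illustration only, not taken from the patch series):

#include <stdint.h>
#include <assert.h>

/* x / 0xff for 16-bit x via two narrowing adds:
     t = (x + 0x101) >> 8;   q = (x + t) >> 8;
   exact in 16-bit precision as long as x + 0x101 does not overflow,
   i.e. x <= 0xfefe.  */
static uint8_t
div255 (uint16_t x)
{
  uint16_t t = (uint16_t) (x + 0x101) >> 8;
  return (uint16_t) (x + t) >> 8;
}

int
main (void)
{
  for (uint32_t x = 0; x <= 0xfefe; x++)
    assert (div255 ((uint16_t) x) == x / 0xff);
  return 0;
}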

Tamar.

> 
> Richard


* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-27 22:10               ` Tamar Christina
@ 2023-02-28 11:08                 ` Richard Sandiford
  2023-02-28 11:12                   ` Tamar Christina
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-28 11:08 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, February 27, 2023 9:33 PM
>> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
>> rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Monday, February 27, 2023 12:12 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> > Hi,
>> >> >
>> >> >> > I avoided open coding it with add and shift because it creates a
>> >> >> > 4 instructions (and shifts which are typically slow) dependency
>> >> >> > chain instead of a load and multiply.  This change, unless the
>> >> >> > target is known to optimize it further is unlikely to be
>> >> >> > beneficial.  And by the time we get to costing the only
>> >> >> > alternative is to undo the existing pattern and
>> >> >> so you lose the general shift optimization.
>> >> >> >
>> >> >> > So it seemed unwise to open code as shifts, given the codegen
>> >> >> > out of the vectorizer would be degenerate for most targets or
>> >> >> > one needs the more complicated route of costing during pattern
>> matching already.
>> >> >>
>> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
>> >> >> something that should be exposed through optabs.  And I imagine
>> >> >> the open-coded version would still be better than nothing on
>> >> >> targets without
>> >> highpart multiply.
>> >> >>
>> >> >> So how about replacing the hook with one that simply asks whether
>> >> >> division through highpart multiplication is preferred over the
>> >> >> add/shift
>> >> sequence?
>> >> >> (Unfortunately it's not going to be possible to work that out from
>> >> >> existing
>> >> >> information.)
>> >> >
>> >> > So this doesn't work for SVE.  For SVE the multiplication widening
>> >> > pass introduces FMAs at gimple level.  So in the cases where the
>> >> > operation is fed from a widening multiplication we end up generating
>> FMA.
>> >> If that was it I could have matched FMA.
>> >> >
>> >> > But it also pushes the multiplication in the second operand because
>> >> > it no longer has a mul to share the results with.
>> >> >
>> >> > In any case, the gimple code is transformed into
>> >> >
>> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
>> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
>> >> > 257, ... });
>> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
>> >> > vect_patt_65.12_128);
>> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
>> >> > vect_patt_62.14_130;
>> >> >
>> >> > This transformation is much worse than the original code, it
>> >> > extended the dependency chain with another expensive instruction. I
>> >> > can try to correct this in RTL by matching FMA and shift and
>> >> > splitting into MUL +
>> >> ADDHNB and hope CSE takes care of the extra mul.
>> >> >
>> >> > But this seems like a hack, and it's basically undoing the earlier
>> >> > transformation.  It seems to me that the open coding is a bad idea.
>> >>
>> >> Could you post the patch that gives this result?  I'll have a poke around.
>> >
>> > Sure, I'll post the new series, it needs all of them.
>> 
>> Thanks.  Which testcase did you use to get the above?
>> 
>
> #include <stdint.h>
>
> #define N 16
> #define TYPE uint8_t
>
> void fun3(TYPE* restrict pixel, TYPE level, int n)
> {
>   for (int i = 0; i < (n & -16); i+=1)
>     pixel[i] = (pixel[i] * level) / 0xff;
> }
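
(For context, the division by 0xff above is the operation being lowered.  A
minimal scalar sketch of the open-coded add/shift form it can take, assuming
the 8-bit product has been widened to 16 bits; this is an illustration only,
not code from the patch or testsuite:

  #include <stdint.h>

  /* x / 0xff for x <= 0xff * 0xff, using two adds and two shifts
     instead of a highpart multiply: x / 255 == (x + (x >> 8) + 1) >> 8.  */
  static inline uint8_t div255 (uint16_t x)
  {
    uint16_t t = x + (x >> 8);          /* first add and shift */
    return (uint8_t) ((t + 1) >> 8);    /* second add and shift */
  }
)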

Thanks.  In that testcase, isn't the FMA handling an anti-optimisation
in its own right though?  It's duplicating a multiplication into two
points on a dependency chain.

E.g. for:

unsigned int
f1 (unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int d = a * b;
  return d + ((c + d) >> 1);
}
unsigned int
g1 (unsigned int a, unsigned int b, unsigned int c)
{
  return a * b + c;
}

__Uint32x4_t
f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  __Uint32x4_t d = a * b;
  return d + ((c + d) >> 1);
}
__Uint32x4_t
g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  return a * b + c;
}

typedef unsigned int vec __attribute__((vector_size(32)));
vec
f3 (vec a, vec b, vec c)
{
  vec d = a * b;
  return d + ((c + d) >> 1);
}
vec
g3 (vec a, vec b, vec c)
{
  return a * b + c;
}

compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve,
all the g functions use multiply-add (as expected), but the
f functions are:

f1:
        mul     w1, w0, w1
        add     w0, w1, w2
        add     w0, w1, w0, lsr 1
        ret

f2:
        mul     v0.4s, v0.4s, v1.4s
        add     v2.4s, v0.4s, v2.4s
        usra    v0.4s, v2.4s, 1
        ret

f3:
        ...
        mla     z0.s, p0/m, z1.s, z2.s
        lsr     z0.s, z0.s, #1
        mad     z1.s, p0/m, z2.s, z0.s
        ...

What we do for f3 doesn't seem like a good idea.

I can see that duplicating an integer multiplication might make sense if
the integer FMAs are done in parallel.  But if one is a dependency of
the other, then at least for integer FMA, I think we should punt,
especially since we don't know what the target's late-forwarding
restrictions are.  I guess fp-contract comes into play for the
FP FMAs though.
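
To make the parallel-versus-dependent distinction concrete, a small
illustrative pair of functions (a sketch, not taken from the thread's
testcases):

  unsigned int
  par (unsigned int a, unsigned int b, unsigned int c, unsigned int d,
       unsigned int *r2)
  {
    /* Parallel uses: the two multiply-adds are independent, so duplicating
       a * b into two FMAs can pay off.  */
    *r2 = a * b + d;
    return a * b + c;
  }

  unsigned int
  dep (unsigned int a, unsigned int b, unsigned int c)
  {
    /* Dependent uses: the second multiply-add has to wait for the first,
       so duplicating a * b only lengthens the critical path.  */
    unsigned int t = a * b + c;
    return a * b + (t >> 1);
  }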

>> But since SVE does have highpart multiply, and since the assumption for SVE is
>> that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just be one of
>> the targets for which the hook that "asks whether division through highpart
>> multiplication is preferred over the add/shift sequence" returns true?
>> 
>
> Yes (it's also two adds not 3), but it's not correct for SVE2, which has addhnb, in which case 2x addhnb is
> much faster than MULH+shift.  And the problem is that widening_mul will not
> allow add+shift to reach the backend because the ADD+shift were open coded.
>
> They are now subjected to further optimization.
>
> To summarize:
>
> Other targets: false
> SVE: false
> SVE2: true
> NEON: true

Yeah, looks good.

> SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.
>
> If you're saying you don't want the optimization for SVE2, then sure, happy to turn it off.
>
> But  UMULH+LSR == 6 cycles on Neoverse-N2 and throughput of 1.
> 2x ADDHNB = 4 cycles and throughput of 2.

No, I meant the same as what you said in the summary above.

Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-28 11:08                 ` Richard Sandiford
@ 2023-02-28 11:12                   ` Tamar Christina
  2023-02-28 12:03                     ` Richard Sandiford
  0 siblings, 1 reply; 47+ messages in thread
From: Tamar Christina @ 2023-02-28 11:12 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, February 28, 2023 11:09 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, February 27, 2023 9:33 PM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> >> rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Monday, February 27, 2023 12:12 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> > Hi,
> >> >> >
> >> >> >> > I avoided open coding it with add and shift because it
> >> >> >> > creates a
> >> >> >> > 4 instructions (and shifts which are typically slow)
> >> >> >> > dependency chain instead of a load and multiply.  This
> >> >> >> > change, unless the target is known to optimize it further is
> >> >> >> > unlikely to be beneficial.  And by the time we get to costing
> >> >> >> > the only alternative is to undo the existing pattern and
> >> >> >> so you lose the general shift optimization.
> >> >> >> >
> >> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> >> > out of the vectorizer would be degenerate for most targets or
> >> >> >> > one needs the more complicated route of costing during
> >> >> >> > pattern
> >> matching already.
> >> >> >>
> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather
> >> >> >> than something that should be exposed through optabs.  And I
> >> >> >> imagine the open-coded version would still be better than
> >> >> >> nothing on targets without
> >> >> highpart multiply.
> >> >> >>
> >> >> >> So how about replacing the hook with one that simply asks
> >> >> >> whether division through highpart multiplication is preferred
> >> >> >> over the add/shift
> >> >> sequence?
> >> >> >> (Unfortunately it's not going to be possible to work that out
> >> >> >> from existing
> >> >> >> information.)
> >> >> >
> >> >> > So this doesn't work for SVE.  For SVE the multiplication
> >> >> > widening pass introduces FMAs at gimple level.  So in the cases
> >> >> > where the operation is fed from a widening multiplication we end
> >> >> > up generating
> >> FMA.
> >> >> If that was it I could have matched FMA.
> >> >> >
> >> >> > But it also pushes the multiplication in the second operand
> >> >> > because it no longer has a mul to share the results with.
> >> >> >
> >> >> > In any case, the gimple code is transformed into
> >> >> >
> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short)
> >> >> > vect__3.8_122;
> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
> >> >> > 257, ... });
> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> >> >> > vect_patt_65.12_128);
> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> >> >> > vect_patt_62.14_130;
> >> >> >
> >> >> > This transformation is much worse than the original code, it
> >> >> > extended the dependency chain with another expensive
> >> >> > instruction. I can try to correct this in RTL by matching FMA
> >> >> > and shift and splitting into MUL +
> >> >> ADDHNB and hope CSE takes care of the extra mul.
> >> >> >
> >> >> > But this seems like a hack, and it's basically undoing the
> >> >> > earlier transformation.  It seems to me that the open coding is a bad
> idea.
> >> >>
> >> >> Could you post the patch that gives this result?  I'll have a poke around.
> >> >
> >> > Sure, I'll post the new series, it needs all of them.
> >>
> >> Thanks.  Which testcase did you use to get the above?
> >>
> >
> > #include <stdint.h>
> >
> > #define N 16
> > #define TYPE uint8_t
> >
> > void fun3(TYPE* restrict pixel, TYPE level, int n) {
> >   for (int i = 0; i < (n & -16); i+=1)
> >     pixel[i] = (pixel[i] * level) / 0xff; }
> 
> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in its
> own right though?  It's duplicating a multiplication into two points on a
> dependency chain.

Most definitely, that's what I meant above. The "optimization" doesn't take into
account the effect on the rest of the chain.

> 
> E.g. for:
> 
> unsigned int
> f1 (unsigned int a, unsigned int b, unsigned int c) {
>   unsigned int d = a * b;
>   return d + ((c + d) >> 1);
> }
> unsigned int
> g1 (unsigned int a, unsigned int b, unsigned int c) {
>   return a * b + c;
> }
> 
> __Uint32x4_t
> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>   __Uint32x4_t d = a * b;
>   return d + ((c + d) >> 1);
> }
> __Uint32x4_t
> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>   return a * b + c;
> }
> 
> typedef unsigned int vec __attribute__((vector_size(32))); vec
> f3 (vec a, vec b, vec c)
> {
>   vec d = a * b;
>   return d + ((c + d) >> 1);
> }
> vec
> g3 (vec a, vec b, vec c)
> {
>   return a * b + c;
> }
> 
> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
> functions use multiply-add (as expected), but the f functions are:
> 
> f1:
>         mul     w1, w0, w1
>         add     w0, w1, w2
>         add     w0, w1, w0, lsr 1
>         ret
> 
> f2:
>         mul     v0.4s, v0.4s, v1.4s
>         add     v2.4s, v0.4s, v2.4s
>         usra    v0.4s, v2.4s, 1
>         ret
> 
> f3:
>         ...
>         mla     z0.s, p0/m, z1.s, z2.s
>         lsr     z0.s, z0.s, #1
>         mad     z1.s, p0/m, z2.s, z0.s
>         ...
> 
> What we do for f3 doesn't seem like a good idea.

Agreed,  I guess this means I have to fix that as well? ☹

I'll go take a look then..

Tamar.

> 
> I can see that duplicating an integer multiplication might make sense if the
> integer FMAs are done in parallel.  But if one is a dependency of the other,
> then at least for integer FMA, I think we should punt, especially since we don't
> know what the target's late-forwarding restrictions are.  I guess fp-contract
> comes into play for the FP FMAs though.
> 
> >> But since SVE does have highpart multiply, and since the assumption
> >> for SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't
> >> SVE just be one of the targets for which the hook that "asks whether
> >> division through highpart multiplication is preferred over the add/shift
> sequence" returns true?
> >>
> >
> > Yes (it's also two adds not 3), but it's not correct for SVE2, which
> > has addhnb, in which case 2x addhnb is much faster than MULH+shift.
> > And the problem is that widening_mul will not allow add+shift to reach the
> backend because the ADD+shift were open coded.
> >
> > They are now subjected to further optimization.
> >
> > To summarize:
> >
> > Other targets: false
> > SVE: false
> > SVE2: true
> > NEON: true
> 
> Yeah, looks good.
> 
> > SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.
> >
> > If you're saying you don't want the optimization for SVE2, then sure, happy
> to turn it off.
> >
> > But  UMULH+LSR == 6 cycles on Neoverse-N2 and throughput of 1.
> > 2x ADDHNB = 4 cycles and throughput of 2.
> 
> No, I meant the same as what you said in the summary above.
> 
> Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-28 11:12                   ` Tamar Christina
@ 2023-02-28 12:03                     ` Richard Sandiford
  2023-03-01 11:30                       ` Richard Biener
  0 siblings, 1 reply; 47+ messages in thread
From: Richard Sandiford @ 2023-02-28 12:03 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Tamar Christina via Gcc-patches, nd, rguenther, jlaw

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Tuesday, February 28, 2023 11:09 AM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Monday, February 27, 2023 9:33 PM
>> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
>> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
>> >> rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> >> -----Original Message-----
>> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> >> Sent: Monday, February 27, 2023 12:12 PM
>> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> >> div-bitmask by using new optabs [PR108583]
>> >> >>
>> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> > Hi,
>> >> >> >
>> >> >> >> > I avoided open coding it with add and shift because it
>> >> >> >> > creates a
>> >> >> >> > 4 instructions (and shifts which are typically slow)
>> >> >> >> > dependency chain instead of a load and multiply.  This
>> >> >> >> > change, unless the target is known to optimize it further is
>> >> >> >> > unlikely to be beneficial.  And by the time we get to costing
>> >> >> >> > the only alternative is to undo the existing pattern and
>> >> >> >> so you lose the general shift optimization.
>> >> >> >> >
>> >> >> >> > So it seemed unwise to open code as shifts, given the codegen
>> >> >> >> > out of the vectorizer would be degenerate for most targets or
>> >> >> >> > one needs the more complicated route of costing during
>> >> >> >> > pattern
>> >> matching already.
>> >> >> >>
>> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather
>> >> >> >> than something that should be exposed through optabs.  And I
>> >> >> >> imagine the open-coded version would still be better than
>> >> >> >> nothing on targets without
>> >> >> highpart multiply.
>> >> >> >>
>> >> >> >> So how about replacing the hook with one that simply asks
>> >> >> >> whether division through highpart multiplication is preferred
>> >> >> >> over the add/shift
>> >> >> sequence?
>> >> >> >> (Unfortunately it's not going to be possible to work that out
>> >> >> >> from existing
>> >> >> >> information.)
>> >> >> >
>> >> >> > So this doesn't work for SVE.  For SVE the multiplication
>> >> >> > widening pass introduces FMAs at gimple level.  So in the cases
>> >> >> > where the operation is fed from a widening multiplication we end
>> >> >> > up generating
>> >> FMA.
>> >> >> If that was it I could have matched FMA.
>> >> >> >
>> >> >> > But it also pushes the multiplication in the second operand
>> >> >> > because it no longer has a mul to share the results with.
>> >> >> >
>> >> >> > In any case, the gimple code is transformed into
>> >> >> >
>> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short)
>> >> >> > vect__3.8_122;
>> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
>> >> >> > 257, ... });
>> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
>> >> >> > vect_patt_65.12_128);
>> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
>> >> >> > vect_patt_62.14_130;
>> >> >> >
>> >> >> > This transformation is much worse than the original code, it
>> >> >> > extended the dependency chain with another expensive
>> >> >> > instruction. I can try to correct this in RTL by matching FMA
>> >> >> > and shift and splitting into MUL +
>> >> >> ADDHNB and hope CSE takes care of the extra mul.
>> >> >> >
>> >> >> > But this seems like a hack, and it's basically undoing the
>> >> >> > earlier transformation.  It seems to me that the open coding is a bad
>> idea.
>> >> >>
>> >> >> Could you post the patch that gives this result?  I'll have a poke around.
>> >> >
>> >> > Sure, I'll post the new series, it needs all of them.
>> >>
>> >> Thanks.  Which testcase did you use to get the above?
>> >>
>> >
>> > #include <stdint.h>
>> >
>> > #define N 16
>> > #define TYPE uint8_t
>> >
>> > void fun3(TYPE* restrict pixel, TYPE level, int n) {
>> >   for (int i = 0; i < (n & -16); i+=1)
>> >     pixel[i] = (pixel[i] * level) / 0xff; }
>> 
>> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in its
>> own right though?  It's duplicating a multiplication into two points on a
>> dependency chain.
>
> Most definitely, that's what I meant above. The "optimization" doesn't take into
> account the effect on the rest of the chain.
>
>> 
>> E.g. for:
>> 
>> unsigned int
>> f1 (unsigned int a, unsigned int b, unsigned int c) {
>>   unsigned int d = a * b;
>>   return d + ((c + d) >> 1);
>> }
>> unsigned int
>> g1 (unsigned int a, unsigned int b, unsigned int c) {
>>   return a * b + c;
>> }
>> 
>> __Uint32x4_t
>> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>>   __Uint32x4_t d = a * b;
>>   return d + ((c + d) >> 1);
>> }
>> __Uint32x4_t
>> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>>   return a * b + c;
>> }
>> 
>> typedef unsigned int vec __attribute__((vector_size(32))); vec
>> f3 (vec a, vec b, vec c)
>> {
>>   vec d = a * b;
>>   return d + ((c + d) >> 1);
>> }
>> vec
>> g3 (vec a, vec b, vec c)
>> {
>>   return a * b + c;
>> }
>> 
>> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
>> functions use multiply-add (as expected), but the f functions are:
>> 
>> f1:
>>         mul     w1, w0, w1
>>         add     w0, w1, w2
>>         add     w0, w1, w0, lsr 1
>>         ret
>> 
>> f2:
>>         mul     v0.4s, v0.4s, v1.4s
>>         add     v2.4s, v0.4s, v2.4s
>>         usra    v0.4s, v2.4s, 1
>>         ret
>> 
>> f3:
>>         ...
>>         mla     z0.s, p0/m, z1.s, z2.s
>>         lsr     z0.s, z0.s, #1
>>         mad     z1.s, p0/m, z2.s, z0.s
>>         ...
>> 
>> What we do for f3 doesn't seem like a good idea.
>
> Agreed,  I guess this means I have to fix that as well? ☹
>
> I'll go take a look then..

How about something like this, before the main loop in
convert_mult_to_fma:

  /* There is no numerical difference between fused and unfused integer FMAs,
     and the assumption below that FMA is as cheap as addition is unlikely
     to be true, especially if the multiplication occurs multiple times on
     the same chain.  E.g., for something like:

         (((a * b) + c) >> 1) + (a * b)

     we do not want to duplicate the a * b into two additions, not least
     because the result is not a natural FMA chain.  */
  if (ANY_INTEGRAL_TYPE_P (type)
      && !has_single_use (mul_result))
    return false;

?  Richi, would that be OK with you?

From a quick check, it passes the aarch64-sve{,2}.exp tests.
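
For the f3 example above, the multiply result then has two uses, so the
expectation (a sketch of the intended gimple shape, not verified dump output)
is that the shared multiply is kept rather than duplicated into two dependent
.FMAs:

  d_1 = a_2 * b_3;
  _4 = c_5 + d_1;
  _6 = _4 >> 1;
  _7 = d_1 + _6;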

Thanks,
Richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-28 12:03                     ` Richard Sandiford
@ 2023-03-01 11:30                       ` Richard Biener
  0 siblings, 0 replies; 47+ messages in thread
From: Richard Biener @ 2023-03-01 11:30 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, jlaw

[-- Attachment #1: Type: text/plain, Size: 8186 bytes --]

On Tue, 28 Feb 2023, Richard Sandiford wrote:

> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Tuesday, February 28, 2023 11:09 AM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> >> by using new optabs [PR108583]
> >> 
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Monday, February 27, 2023 9:33 PM
> >> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> >> >> rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> >> -----Original Message-----
> >> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> >> Sent: Monday, February 27, 2023 12:12 PM
> >> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> >> div-bitmask by using new optabs [PR108583]
> >> >> >>
> >> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> >> > I avoided open coding it with add and shift because it
> >> >> >> >> > creates a
> >> >> >> >> > 4 instructions (and shifts which are typically slow)
> >> >> >> >> > dependency chain instead of a load and multiply.  This
> >> >> >> >> > change, unless the target is known to optimize it further is
> >> >> >> >> > unlikely to be beneficial.  And by the time we get to costing
> >> >> >> >> > the only alternative is to undo the existing pattern and
> >> >> >> >> so you lose the general shift optimization.
> >> >> >> >> >
> >> >> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> >> >> > out of the vectorizer would be degenerate for most targets or
> >> >> >> >> > one needs the more complicated route of costing during
> >> >> >> >> > pattern
> >> >> matching already.
> >> >> >> >>
> >> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather
> >> >> >> >> than something that should be exposed through optabs.  And I
> >> >> >> >> imagine the open-coded version would still be better than
> >> >> >> >> nothing on targets without
> >> >> >> highpart multiply.
> >> >> >> >>
> >> >> >> >> So how about replacing the hook with one that simply asks
> >> >> >> >> whether division through highpart multiplication is preferred
> >> >> >> >> over the add/shift
> >> >> >> sequence?
> >> >> >> >> (Unfortunately it's not going to be possible to work that out
> >> >> >> >> from existing
> >> >> >> >> information.)
> >> >> >> >
> >> >> >> > So this doesn't work for SVE.  For SVE the multiplication
> >> >> >> > widening pass introduces FMAs at gimple level.  So in the cases
> >> >> >> > where the operation is fed from a widening multiplication we end
> >> >> >> > up generating
> >> >> FMA.
> >> >> >> If that was it I could have matched FMA.
> >> >> >> >
> >> >> >> > But it also pushes the multiplication in the second operand
> >> >> >> > because it no longer has a mul to share the results with.
> >> >> >> >
> >> >> >> > In any case, the gimple code is transformed into
> >> >> >> >
> >> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short)
> >> >> >> > vect__3.8_122;
> >> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
> >> >> >> > 257, ... });
> >> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> >> >> >> > vect_patt_65.12_128);
> >> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> >> >> >> > vect_patt_62.14_130;
> >> >> >> >
> >> >> >> > This transformation is much worse than the original code, it
> >> >> >> > extended the dependency chain with another expensive
> >> >> >> > instruction. I can try to correct this in RTL by matching FMA
> >> >> >> > and shift and splitting into MUL +
> >> >> >> ADDHNB and hope CSE takes care of the extra mul.
> >> >> >> >
> >> >> >> > But this seems like a hack, and it's basically undoing the
> >> >> >> > earlier transformation.  It seems to me that the open coding is a bad
> >> idea.
> >> >> >>
> >> >> >> Could you post the patch that gives this result?  I'll have a poke around.
> >> >> >
> >> >> > Sure, I'll post the new series, it needs all of them.
> >> >>
> >> >> Thanks.  Which testcase did you use to get the above?
> >> >>
> >> >
> >> > #include <stdint.h>
> >> >
> >> > #define N 16
> >> > #define TYPE uint8_t
> >> >
> >> > void fun3(TYPE* restrict pixel, TYPE level, int n) {
> >> >   for (int i = 0; i < (n & -16); i+=1)
> >> >     pixel[i] = (pixel[i] * level) / 0xff; }
> >> 
> >> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in its
> >> own right though?  It's duplicating a multiplication into two points on a
> >> dependency chain.
> >
> > Most definitely, that's what I meant above. The "optimization" doesn't take into
> > account the effect on the rest of the chain.
> >
> >> 
> >> E.g. for:
> >> 
> >> unsigned int
> >> f1 (unsigned int a, unsigned int b, unsigned int c) {
> >>   unsigned int d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> unsigned int
> >> g1 (unsigned int a, unsigned int b, unsigned int c) {
> >>   return a * b + c;
> >> }
> >> 
> >> __Uint32x4_t
> >> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
> >>   __Uint32x4_t d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> __Uint32x4_t
> >> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
> >>   return a * b + c;
> >> }
> >> 
> >> typedef unsigned int vec __attribute__((vector_size(32))); vec
> >> f3 (vec a, vec b, vec c)
> >> {
> >>   vec d = a * b;
> >>   return d + ((c + d) >> 1);
> >> }
> >> vec
> >> g3 (vec a, vec b, vec c)
> >> {
> >>   return a * b + c;
> >> }
> >> 
> >> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
> >> functions use multiply-add (as expected), but the f functions are:
> >> 
> >> f1:
> >>         mul     w1, w0, w1
> >>         add     w0, w1, w2
> >>         add     w0, w1, w0, lsr 1
> >>         ret
> >> 
> >> f2:
> >>         mul     v0.4s, v0.4s, v1.4s
> >>         add     v2.4s, v0.4s, v2.4s
> >>         usra    v0.4s, v2.4s, 1
> >>         ret
> >> 
> >> f3:
> >>         ...
> >>         mla     z0.s, p0/m, z1.s, z2.s
> >>         lsr     z0.s, z0.s, #1
> >>         mad     z1.s, p0/m, z2.s, z0.s
> >>         ...
> >> 
> >> What we do for f3 doesn't seem like a good idea.
> >
> > Agreed,  I guess this means I have to fix that as well? ☹
> >
> > I'll go take a look then..
> 
> How about something like this, before the main loop in
> convert_mult_to_fma:
> 
>   /* There is no numerical difference between fused and unfused integer FMAs,
>      and the assumption below that FMA is as cheap as addition is unlikely
>      to be true, especially if the multiplication occurs multiple times on
>      the same chain.  E.g., for something like:
> 
>          (((a * b) + c) >> 1) + (a * b)
> 
>      we do not want to duplicate the a * b into two additions, not least
>      because the result is not a natural FMA chain.  */
>   if (ANY_INTEGRAL_TYPE_P (type)
>       && !has_single_use (mul_result))
>     return false;
> 
> ?  Richi, would that be OK with you?

Yes, I think that would be OK.  I would assume that integer FMA is
as cheap as multiplication (also for FP), but then the question is
how CPUs implement integer FMA, i.e. whether they split it into two uops or not.

Richard.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-02-23 16:39                                           ` Andrew MacLeod
  2023-02-23 16:56                                             ` Tamar Christina
@ 2023-03-01 16:57                                             ` Andrew Carlotti
  2023-03-01 18:16                                               ` Tamar Christina
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Carlotti @ 2023-03-01 16:57 UTC (permalink / raw)
  To: Andrew MacLeod
  Cc: Tamar Christina, Richard Biener, Richard Sandiford,
	Tamar Christina via Gcc-patches, nd, jlaw

On Thu, Feb 23, 2023 at 11:39:51AM -0500, Andrew MacLeod via Gcc-patches wrote:
> 
> 
> Inheriting from operator_mult is also going to be hazardous because it also
> has an op1_range and op2_range...  you should at least define those and
> return VARYING to avoid other issues.  Same thing applies to widen_plus I
> think, and it has relation processing and other things as well.  Your widen
> operands are not what those classes expect, so I think you probably just
> want a fresh range operator.
> 
> It also looks like the mult operation is sign/zero extending both upper
> bounds, and neither lower bound...  I think that should be the LH upper and
> lower bound?
> 
> I've attached a second patch (newversion.patch) which incorporates my fix,
> the fix to the sign of only op1's bounds, as well as a simplification of
> the classes to not inherit from operator_mult/plus...  I think this still
> does what you want?  and it wont get you into unexpected trouble later :-)
> 
> let me know if this is still doing what you are expecting...
> 
> Andrew
> 

Hi,

This patch still uses the wrong signedness for some of the extensions in
WIDEN_MULT_EXPR. It currently bases its promotion decisions on whether there
is any signed argument, and whether the result is signed - i.e.:

result/op1/op2 signs     Patch extends op1/op2 as:
UUU                      UU
UUS -> USU               (operands swapped; see USU)
USU                      SU
USS                      SU     wrong
SUU                      US     wrong
SUS -> SSU               (operands swapped; see SSU)
SSU                      SS     wrong
SSS                      SS

The documentation in tree.def is unclear about whether the output signedness is
linked to the input signedness, but at least the SSU case seems valid, and is
mishandled here.
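
To see why the extension signedness matters for the computed bounds, here is a
small self-contained illustration (assuming 8-bit operands widened to 16 bits;
not code from the patch):

  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    uint8_t op = 0xff;               /* upper bound of an unsigned operand */
    int16_t sext = (int8_t) op;      /* sign-extended:  -1 */
    uint16_t zext = op;              /* zero-extended: 255 */
    printf ("%d %u\n", (int) sext, (unsigned) zext);   /* prints: -1 255 */
    return 0;
  }

Sign-extending an operand that is really unsigned feeds -1 instead of 255 into
the widened multiply, so the resulting range bounds are wrong.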

I think it would be clearer and simpler to have four (or three) different
versions, one for each combination of signedness of the input operands. This could be
implemented without extra code duplication by creating four different instances
of an operator_widen_mult class (perhaps extending a range_operator_mixed_sign
class), with the signedness indicated by two additional class members.

The documentation for WIDEN_PLUS_EXPR (and several other expressions added in
the same commit) is completely missing. If the signs are required to be
matching, then this should be clarified; otherwise it would need the same
special handling as WIDEN_MULT_EXPR.

Andrew

> diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
> index d9dfdc56939..824e0338f34 100644
> --- a/gcc/gimple-range-op.cc
> +++ b/gcc/gimple-range-op.cc
> @@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
>    // statements.
>    if (is_a <gcall *> (m_stmt))
>      maybe_builtin_call ();
> +  else
> +    maybe_non_standard ();
>  }
>  
>  // Calculate what we can determine of the range of this unary
> @@ -764,6 +766,36 @@ public:
>    }
>  } op_cfn_parity;
>  
> +// Set up a gimple_range_op_handler for any nonstandard function which can be
> +// supported via range-ops.
> +
> +void
> +gimple_range_op_handler::maybe_non_standard ()
> +{
> +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> +    switch (gimple_assign_rhs_code (m_stmt))
> +      {
> +	case WIDEN_MULT_EXPR:
> +	{
> +	  m_valid = true;
> +	  m_op1 = gimple_assign_rhs1 (m_stmt);
> +	  m_op2 = gimple_assign_rhs2 (m_stmt);
> +	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> +	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> +	  if (signed2 && !signed1)
> +	    std::swap (m_op1, m_op2);
> +
> +	  if (signed1 || signed2)
> +	    m_int = ptr_op_widen_mult_signed;
> +	  else
> +	    m_int = ptr_op_widen_mult_unsigned;
> +	  break;
> +	}
> +	default:
> +	  break;
> +      }
> +}
> +
>  // Set up a gimple_range_op_handler for any built in function which can be
>  // supported via range-ops.
>  
> diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
> index 743b858126e..1bf63c5ce6f 100644
> --- a/gcc/gimple-range-op.h
> +++ b/gcc/gimple-range-op.h
> @@ -41,6 +41,7 @@ public:
>  		 relation_trio = TRIO_VARYING);
>  private:
>    void maybe_builtin_call ();
> +  void maybe_non_standard ();
>    gimple *m_stmt;
>    tree m_op1, m_op2;
>  };
> diff --git a/gcc/range-op.cc b/gcc/range-op.cc
> index 5c67bce6d3a..7cd19a92d00 100644
> --- a/gcc/range-op.cc
> +++ b/gcc/range-op.cc
> @@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
>    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
>  }
>  
> +class operator_widen_plus : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub) const;
> +} op_widen_plus;
> +
> +void
> +operator_widen_plus::wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb, const wide_int &lh_ub,
> +			const wide_int &rh_lb, const wide_int &rh_ub) const
> +{
> +   wi::overflow_type ov_lb, ov_ub;
> +   signop s = TYPE_SIGN (type);
> +
> +   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
> +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
> +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> +
> +   r = int_range<2> (type, new_lb, new_ub);
> +}
>  
>  class operator_minus : public range_operator
>  {
> @@ -2031,6 +2059,70 @@ operator_mult::wi_fold (irange &r, tree type,
>      }
>  }
>  
> +class operator_widen_mult_signed : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_signed;
> +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> +
> +void
> +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> +				     const wide_int &lh_lb,
> +				     const wide_int &lh_ub,
> +				     const wide_int &rh_lb,
> +				     const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow but range
> +     calculations for multiplications are complicated.  After widening the
> +     operands lets call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
> +
> +
> +class operator_widen_mult_unsigned : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_unsigned;
> +range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
> +
> +void
> +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> +				       const wide_int &lh_lb,
> +				       const wide_int &lh_ub,
> +				       const wide_int &rh_lb,
> +				       const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow but range
> +     calculations for multiplications are complicated.  After widening the
> +     operands lets call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
>  
>  class operator_div : public cross_product_operator
>  {
> @@ -4473,6 +4565,7 @@ integral_table::integral_table ()
>    set (GT_EXPR, op_gt);
>    set (GE_EXPR, op_ge);
>    set (PLUS_EXPR, op_plus);
> +  set (WIDEN_PLUS_EXPR, op_widen_plus);
>    set (MINUS_EXPR, op_minus);
>    set (MIN_EXPR, op_min);
>    set (MAX_EXPR, op_max);
> diff --git a/gcc/range-op.h b/gcc/range-op.h
> index f00b747f08a..5fe463234ae 100644
> --- a/gcc/range-op.h
> +++ b/gcc/range-op.h
> @@ -311,4 +311,6 @@ private:
>  // This holds the range op table for floating point operations.
>  extern floating_op_table *floating_tree_table;
>  
> +extern range_operator *ptr_op_widen_mult_signed;
> +extern range_operator *ptr_op_widen_mult_unsigned;
>  #endif // GCC_RANGE_OP_H


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]
  2023-03-01 16:57                                             ` Andrew Carlotti
@ 2023-03-01 18:16                                               ` Tamar Christina
  0 siblings, 0 replies; 47+ messages in thread
From: Tamar Christina @ 2023-03-01 18:16 UTC (permalink / raw)
  To: Andrew Carlotti, Andrew MacLeod
  Cc: Richard Biener, Richard Sandiford,
	Tamar Christina via Gcc-patches, nd, jlaw

> -----Original Message-----
> From: Andrew Carlotti <Andrew.Carlotti@arm.com>
> Sent: Wednesday, March 1, 2023 4:58 PM
> To: Andrew MacLeod <amacleod@redhat.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>;
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> On Thu, Feb 23, 2023 at 11:39:51AM -0500, Andrew MacLeod via Gcc-
> patches wrote:
> >
> >
> > Inheriting from operator_mult is also going to be hazardous because it
> > also has an op1_range and op2_range...  you should at least define
> > those and return VARYING to avoid other issues.  Same thing applies
> > to widen_plus I think, and it has relation processing and other things
> > as well.  Your widen operands are not what those classes expect, so
> > I think you probably just want a fresh range operator.
> >
> > It also looks like the mult operation is sign/zero extending both
> > upper bounds, and neither lower bound...  I think that should be
> > the LH upper and lower bound?
> >
> > I've attached a second patch (newversion.patch) which incorporates
> > my fix, the fix to the sign of only op1's bounds, as well as a
> > simplification of the classes to not inherit from
> > operator_mult/plus...  I think this still does what you want?
> > and it wont get you into unexpected trouble later :-)
> >
> > let me know if this is still doing what you are expecting...
> >
> > Andrew
> >
> 
> Hi,
> 
> This patch still uses the wrong signedness for some of the extensions in
> WIDEN_MULT_EXPR. It currently bases it's promotion decisions on whether
> there is any signed argument, and whether the result is signed - i.e.:
> 
> 		Patch extends as:
> UUU		UU
> UUS -> USU
> USU		SU
> USS		SU	wrong
> SUU		US	wrong
> SUS -> SSU
> SSU		SS	wrong
> SSS		SS
> 
> The documentation in tree.def is unclear about whether the output
> signedness is linked to the input signedness, but at least the SSU case seems
> valid, and is mishandled here.

Hi,

Thanks for the concern, but I don't think those "wrong" cases are valid.
There's only one explicit carve-out for this mismatch that I'm aware of which is
for constants that fit in the source type.  convert_mult_to_widen doesn't accept
them otherwise.

For every other mismatched sign it will fold an explicit convert into the sequence
to ensure all three types match.

i.e. 

long unsigned d1(int x, int y)
{
    return (long unsigned)x * y;
}

Requires a cast.

long unsigned d1(int x, int y)
{
    return (long unsigned)x * 4;
}

Does not, and

long unsigned d1(int x, int y)
{
    return (long unsigned)x * -4;
}

Does not fit and so is not accepted.  The reason it can happen is that the unsigned
cast on a positive constant is discarded.

Furthermore, the operation that introduces this widening only looks at the sign of the
leftmost operand and that of the result.

So this is correctly handling the normal cases, and the abnormal ones are only those the
compiler itself introduces as specific optimizations.

Tamar.


> 
> I think it would be clearer and simpler to have four (or three) different versions
> for each combnation of signedness of the input operands. This could be
> implemented without extra code duplication by creating four different
> instances of an operator_widen_mult class (perhaps extending a
> range_operator_mixed_sign class), with the signedness indicated by two
> additional class members.
> 
> The documentation for WIDEN_PLUS_EXPR (and several other expressions
> added in the same commit) is completely missing. If the signs are required to
> be matching, then this should be clarified; otherwise it would need the same
> special handling as WIDEN_MULT_EXPR.
> 
> Andrew
> 
> > diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc index
> > d9dfdc56939..824e0338f34 100644
> > --- a/gcc/gimple-range-op.cc
> > +++ b/gcc/gimple-range-op.cc
> > @@ -179,6 +179,8 @@
> gimple_range_op_handler::gimple_range_op_handler (gimple *s)
> >    // statements.
> >    if (is_a <gcall *> (m_stmt))
> >      maybe_builtin_call ();
> > +  else
> > +    maybe_non_standard ();
> >  }
> >
> >  // Calculate what we can determine of the range of this unary @@
> > -764,6 +766,36 @@ public:
> >    }
> >  } op_cfn_parity;
> >
> > +// Set up a gimple_range_op_handler for any nonstandard function
> > +which can be // supported via range-ops.
> > +
> > +void
> > +gimple_range_op_handler::maybe_non_standard () {
> > +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> > +    switch (gimple_assign_rhs_code (m_stmt))
> > +      {
> > +	case WIDEN_MULT_EXPR:
> > +	{
> > +	  m_valid = true;
> > +	  m_op1 = gimple_assign_rhs1 (m_stmt);
> > +	  m_op2 = gimple_assign_rhs2 (m_stmt);
> > +	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> > +	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> > +	  if (signed2 && !signed1)
> > +	    std::swap (m_op1, m_op2);
> > +
> > +	  if (signed1 || signed2)
> > +	    m_int = ptr_op_widen_mult_signed;
> > +	  else
> > +	    m_int = ptr_op_widen_mult_unsigned;
> > +	  break;
> > +	}
> > +	default:
> > +	  break;
> > +      }
> > +}
> > +
> >  // Set up a gimple_range_op_handler for any built in function which
> > can be  // supported via range-ops.
> >
> > diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h index
> > 743b858126e..1bf63c5ce6f 100644
> > --- a/gcc/gimple-range-op.h
> > +++ b/gcc/gimple-range-op.h
> > @@ -41,6 +41,7 @@ public:
> >  		 relation_trio = TRIO_VARYING);
> >  private:
> >    void maybe_builtin_call ();
> > +  void maybe_non_standard ();
> >    gimple *m_stmt;
> >    tree m_op1, m_op2;
> >  };
> > diff --git a/gcc/range-op.cc b/gcc/range-op.cc index
> > 5c67bce6d3a..7cd19a92d00 100644
> > --- a/gcc/range-op.cc
> > +++ b/gcc/range-op.cc
> > @@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
> >    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());  }
> >
> > +class operator_widen_plus : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub) const;
> > +} op_widen_plus;
> > +
> > +void
> > +operator_widen_plus::wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb, const wide_int &lh_ub,
> > +			const wide_int &rh_lb, const wide_int &rh_ub) const {
> > +   wi::overflow_type ov_lb, ov_ub;
> > +   signop s = TYPE_SIGN (type);
> > +
> > +   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
> > +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
> > +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub)
> > + * 2, s);
> > +
> > +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> > +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> > +
> > +   r = int_range<2> (type, new_lb, new_ub); }
> >
> >  class operator_minus : public range_operator  { @@ -2031,6 +2059,70
> > @@ operator_mult::wi_fold (irange &r, tree type,
> >      }
> >  }
> >
> > +class operator_widen_mult_signed : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_signed;
> > +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> > +
> > +void
> > +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> > +				     const wide_int &lh_lb,
> > +				     const wide_int &lh_ub,
> > +				     const wide_int &rh_lb,
> > +				     const wide_int &rh_ub) const {
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb)
> > + * 2, SIGNED);  wide_int lh_wub = wide_int::from (lh_ub,
> > + wi::get_precision (lh_ub) * 2, SIGNED);  wide_int rh_wlb =
> > + wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);  wide_int
> > + rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> range
> > +     calculations for multiplications are complicated.  After widening the
> > +     operands lets call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub); }
> > +
> > +
> > +class operator_widen_mult_unsigned : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_unsigned;
> > +range_operator *ptr_op_widen_mult_unsigned =
> &op_widen_mult_unsigned;
> > +
> > +void
> > +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> > +				       const wide_int &lh_lb,
> > +				       const wide_int &lh_ub,
> > +				       const wide_int &rh_lb,
> > +				       const wide_int &rh_ub) const {
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb)
> > + * 2, UNSIGNED);  wide_int lh_wub = wide_int::from (lh_ub,
> > + wi::get_precision (lh_ub) * 2, UNSIGNED);  wide_int rh_wlb =
> > + wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);  wide_int
> > + rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> range
> > +     calculations for multiplications are complicated.  After widening the
> > +     operands lets call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub); }
> >
> >  class operator_div : public cross_product_operator  { @@ -4473,6
> > +4565,7 @@ integral_table::integral_table ()
> >    set (GT_EXPR, op_gt);
> >    set (GE_EXPR, op_ge);
> >    set (PLUS_EXPR, op_plus);
> > +  set (WIDEN_PLUS_EXPR, op_widen_plus);
> >    set (MINUS_EXPR, op_minus);
> >    set (MIN_EXPR, op_min);
> >    set (MAX_EXPR, op_max);
> > diff --git a/gcc/range-op.h b/gcc/range-op.h index
> > f00b747f08a..5fe463234ae 100644
> > --- a/gcc/range-op.h
> > +++ b/gcc/range-op.h
> > @@ -311,4 +311,6 @@ private:
> >  // This holds the range op table for floating point operations.
> >  extern floating_op_table *floating_tree_table;
> >
> > +extern range_operator *ptr_op_widen_mult_signed; extern
> > +range_operator *ptr_op_widen_mult_unsigned;
> >  #endif // GCC_RANGE_OP_H


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-03-01 18:17 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-09 17:22 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
2023-02-10 10:35   ` Tamar Christina
2023-02-10 14:10   ` Richard Sandiford
2023-02-10 10:34 ` [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-10 13:13 ` Richard Biener
2023-02-10 13:36 ` Richard Sandiford
2023-02-10 13:52   ` Richard Biener
2023-02-10 14:13   ` Tamar Christina
2023-02-10 14:30     ` Richard Sandiford
2023-02-10 14:54       ` Tamar Christina
2023-02-27 11:09       ` Tamar Christina
2023-02-27 12:11         ` Richard Sandiford
2023-02-27 12:14           ` Tamar Christina
2023-02-27 21:33             ` Richard Sandiford
2023-02-27 22:10               ` Tamar Christina
2023-02-28 11:08                 ` Richard Sandiford
2023-02-28 11:12                   ` Tamar Christina
2023-02-28 12:03                     ` Richard Sandiford
2023-03-01 11:30                       ` Richard Biener
2023-02-10 15:56     ` Richard Sandiford
2023-02-10 16:09       ` Tamar Christina
2023-02-10 16:25         ` Richard Sandiford
2023-02-10 16:33           ` Tamar Christina
2023-02-10 16:57             ` Richard Sandiford
2023-02-10 17:01               ` Richard Sandiford
2023-02-10 17:14               ` Tamar Christina
2023-02-10 18:12                 ` Richard Sandiford
2023-02-10 18:34                   ` Richard Biener
2023-02-10 20:58                     ` Andrew MacLeod
2023-02-13  9:54                       ` Tamar Christina
2023-02-15 12:51                         ` Tamar Christina
2023-02-15 16:05                           ` Andrew MacLeod
2023-02-15 17:13                             ` Tamar Christina
2023-02-15 17:50                               ` Andrew MacLeod
2023-02-15 18:42                                 ` Andrew MacLeod
2023-02-22 12:51                                   ` Tamar Christina
2023-02-22 16:41                                   ` Andrew MacLeod
2023-02-22 18:03                                     ` Tamar Christina
2023-02-22 18:33                                       ` Andrew MacLeod
2023-02-23  8:36                                         ` Tamar Christina
2023-02-23 16:39                                           ` Andrew MacLeod
2023-02-23 16:56                                             ` Tamar Christina
2023-03-01 16:57                                             ` Andrew Carlotti
2023-03-01 18:16                                               ` Tamar Christina
2023-02-22 13:06                                 ` Tamar Christina
2023-02-22 15:19                                   ` Andrew MacLeod

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).