public inbox for gcc-patches@gcc.gnu.org
* [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583]
@ 2023-02-27 12:32 Tamar Christina
  2023-02-27 12:33 ` [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583] Tamar Christina
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Tamar Christina @ 2023-02-27 12:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jlaw

[-- Attachment #1: Type: text/plain, Size: 15556 bytes --]

Hi All,

This reverts the changes for the CAN_SPECIAL_DIV_BY_CONST hook.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
	* doc/tm.texi.in: Likewise.
	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
	* expmed.cc (expand_divmod): Likewise.
	* expmed.h (expand_divmod): Likewise.
	* expr.cc (force_operand, expand_expr_divmod): Likewise.
	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
	* target.def (can_special_div_by_const): Remove.
	* target.h: Remove tree-core.h include.
	* targhooks.cc (default_can_special_div_by_const): Remove.
	* targhooks.h (default_can_special_div_by_const): Remove.
	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook.
	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.

--- inline copy of patch -- 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c6c891972d1e58cd163b259ba96a599d62326865..50a8872a6695b18b9bed0d393bacf733833633db 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,20 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
-This hook is used to test whether the target has a special method of
-division of vectors of type @var{vectype} using the value @var{constant},
-and producing a vector of type @var{vectype}.  The division
-will then not be decomposed by the vectorizer and kept as a div.
-
-When the hook is being used to test whether the target supports a special
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
-is being used to emit a division, @var{in0} and @var{in1} are the source
-vectors of type @var{vecttype} and @var{output} is the destination vector of
-type @var{vectype}.
-
-Return true if the operation is possible, emitting instructions for it
-if rtxes are provided and updating @var{output}.
 @end deftypefn
 
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 613b2534149415f442163d599503efaf423b673b..3e07978a02f4e6077adae6cadc93ea4273295f1f 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,7 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
-@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -1037,7 +1037,7 @@ round_push (rtx size)
      TRUNC_DIV_EXPR.  */
   size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
 		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
+  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
 			NULL_RTX, 1);
   size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
 
@@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
 			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
 				       Pmode),
 			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
+  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
 			  gen_int_mode (required_align / BITS_PER_UNIT,
 					Pmode),
 			  NULL_RTX, 1);
diff --git a/gcc/expmed.h b/gcc/expmed.h
index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
 extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
 			       int);
 #ifdef GCC_OPTABS_H
-extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
-			  rtx, rtx, rtx, int,
-			  enum optab_methods = OPTAB_LIB_WIDEN);
+extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
+			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
 #endif
 #endif
 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
 
 rtx
 expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
-	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
-	       int unsignedp, enum optab_methods methods)
+	       rtx op0, rtx op1, rtx target, int unsignedp,
+	       enum optab_methods methods)
 {
   machine_mode compute_mode;
   rtx tquotient;
@@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 
   last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
 
-  /* Check if the target has specific expansions for the division.  */
-  tree cst;
-  if (treeop0
-      && treeop1
-      && (cst = uniform_integer_cst_p (treeop1))
-      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
-						     wi::to_wide (cst),
-						     &target, op0, op1))
-    return target;
-
-
   /* Now convert to the best mode to use.  */
   if (compute_mode != mode)
     {
@@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 			    || (optab_handler (sdivmod_optab, int_mode)
 				!= CODE_FOR_nothing)))
 		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
-						int_mode, treeop0, treeop1,
-						op0, gen_int_mode (abs_d,
+						int_mode, op0,
+						gen_int_mode (abs_d,
 							      int_mode),
 						NULL_RTX, 0);
 		    else
@@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 				      size - 1, NULL_RTX, 0);
 		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
 				    NULL_RTX);
-		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
-				    treeop1, t3, op1, NULL_RTX, 0);
+		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
+				    NULL_RTX, 0);
 		if (t4)
 		  {
 		    rtx t5;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
 	    return expand_divmod (0,
 				  FLOAT_MODE_P (GET_MODE (value))
 				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
-				  GET_MODE (value), NULL, NULL, op1, op2,
-				  target, 0);
+				  GET_MODE (value), op1, op2, target, 0);
 	case MOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 0);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 0);
 	case UDIV:
-	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case UMOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case ASHIFTRT:
 	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
 				      target, 0, OPTAB_LIB_WIDEN);
@@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       bool speed_p = optimize_insn_for_speed_p ();
       do_pending_stack_adjust ();
       start_sequence ();
-      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 1);
+      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
       rtx_insn *uns_insns = get_insns ();
       end_sequence ();
       start_sequence ();
-      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 0);
+      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
       rtx_insn *sgn_insns = get_insns ();
       end_sequence ();
       unsigned uns_cost = seq_cost (uns_insns, speed_p);
@@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       emit_insn (sgn_insns);
       return sgn_ret;
     }
-  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
-			op0, op1, target, unsignedp);
+  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
 }
 
 rtx
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
 		return NULL_RTX;
 	    }
 	}
-      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
-				     sum, gen_int_mode (INTVAL (op1),
-							word_mode),
+      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
+				     gen_int_mode (INTVAL (op1), word_mode),
 				     NULL_RTX, 1, OPTAB_DIRECT);
       if (remainder == NULL_RTX)
 	return NULL_RTX;
@@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
 
   if (op11 != const1_rtx)
     {
-      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
-				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
+				NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
@@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
-      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
-				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
+				 NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (quot2 == NULL_RTX)
 	return NULL_RTX;
 
diff --git a/gcc/target.def b/gcc/target.def
index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1905,25 +1905,6 @@ implementation approaches itself.",
 	const vec_perm_indices &sel),
  NULL)
 
-DEFHOOK
-(can_special_div_by_const,
- "This hook is used to test whether the target has a special method of\n\
-division of vectors of type @var{vectype} using the value @var{constant},\n\
-and producing a vector of type @var{vectype}.  The division\n\
-will then not be decomposed by the vectorizer and kept as a div.\n\
-\n\
-When the hook is being used to test whether the target supports a special\n\
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
-is being used to emit a division, @var{in0} and @var{in1} are the source\n\
-vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
-type @var{vectype}.\n\
-\n\
-Return true if the operation is possible, emitting instructions for it\n\
-if rtxes are provided and updating @var{output}.",
- bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
-	rtx in0, rtx in1),
- default_can_special_div_by_const)
-
 /* Return true if the target supports misaligned store/load of a
    specific factor denoted in the third parameter.  The last parameter
    is true if the access is defined in a packed struct.  */
diff --git a/gcc/target.h b/gcc/target.h
index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -51,7 +51,6 @@
 #include "insn-codes.h"
 #include "tm.h"
 #include "hard-reg-set.h"
-#include "tree-core.h"
 
 #if CHECKING_P
 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
 extern rtx default_addr_space_convert (rtx, tree, tree);
 extern unsigned int default_case_values_threshold (void);
 extern bool default_have_conditional_execution (void);
-extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
-					      rtx *, rtx, rtx);
 
 extern bool default_libc_has_function (enum function_class, tree);
 extern bool default_libc_has_fast_function (int fcode);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
   return HAVE_conditional_execution;
 }
 
-/* Default that no division by constant operations are special.  */
-bool
-default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
-				  rtx)
-{
-  return false;
-}
-
 /* By default we assume that c99 functions are present at the runtime,
    but sincos is not.  */
 bool
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
 	  tree rhs2 = gimple_assign_rhs2 (assign);
 	  tree ret;
 
-	  /* Check if the target was going to handle it through the special
-	     division callback hook.  */
-	  tree cst = uniform_integer_cst_p (rhs2);
-	  if (cst &&
-	      targetm.vectorize.can_special_div_by_const (code, type,
-							  wi::to_wide (cst),
-							  NULL,
-							  NULL_RTX, NULL_RTX))
-	    return NULL_TREE;
-
-
 	  if (!optimize
 	      || !VECTOR_INTEGER_TYPE_P (type)
 	      || TREE_CODE (rhs2) != VECTOR_CST
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 6934aebc69f231af24668f0a1c3d140e97f55487..1766ce277d6b88d8aa3be77e7c8abb504a10a735 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3913,14 +3913,6 @@ vect_recog_divmod_pattern (vec_info *vinfo,
 
       return pattern_stmt;
     }
-  else if ((cst = uniform_integer_cst_p (oprnd1))
-	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
-							  wi::to_wide (cst),
-							  NULL, NULL_RTX,
-							  NULL_RTX))
-    {
-      return NULL;
-    }
 
   if (prec > HOST_BITS_PER_WIDE_INT
       || integer_zerop (oprnd1))
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
 	}
       target_support_p = (optab_handler (optab, vec_mode)
 			  != CODE_FOR_nothing);
-      tree cst;
-      if (!target_support_p
-	  && op1
-	  && (cst = uniform_integer_cst_p (op1)))
-	target_support_p
-	  = targetm.vectorize.can_special_div_by_const (code, vectype,
-							wi::to_wide (cst),
-							NULL, NULL_RTX,
-							NULL_RTX);
     }
 
   bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);




-- 


* [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-02-27 12:32 [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Tamar Christina
@ 2023-02-27 12:33 ` Tamar Christina
  2023-03-06 11:20   ` Tamar Christina
  2023-02-27 12:33 ` [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583] Tamar Christina
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-02-27 12:33 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, amacleod, aldyh

[-- Attachment #1: Type: text/plain, Size: 8976 bytes --]

Hi All,

This adds range-ops for widening addition and widening multiplication.
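
To make this concrete, here is a standalone sketch in plain C of what the
new wi_fold implementations compute for WIDEN_PLUS_EXPR (illustrative only;
the patch itself operates on wide_ints extended to twice the operand
precision):

  /* Widening-plus range fold, illustrated with fixed-width integers:
     for 8-bit operands with ranges [10, 200] and [100, 250], the
     16-bit WIDEN_PLUS result ranges over [110, 450].  Because both
     operands are widened before the add, the addition cannot
     overflow.  */
  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint16_t lb = 10 + 100;    /* 110 */
    uint16_t ub = 200 + 250;   /* 450 */
    for (unsigned a = 10; a <= 200; a++)
      for (unsigned b = 100; b <= 250; b++)
        {
          uint16_t sum = (uint16_t) a + (uint16_t) b;
          assert (sum >= lb && sum <= ub);
        }
    return 0;
  }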

I couldn't figure out how to write a test for this.  It looks like there are
self-tests but not a way to write standalone ones?  I did create testcases in
patch 3/4 which test the end result.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* gimple-range-op.h (gimple_range_op_handler): Add maybe_non_standard.
	* gimple-range-op.cc (gimple_range_op_handler::gimple_range_op_handler):
	Use it.
	(gimple_range_op_handler::maybe_non_standard): New.
	* range-op.cc (class operator_widen_plus_signed,
	operator_widen_plus_signed::wi_fold, class operator_widen_plus_unsigned,
	operator_widen_plus_unsigned::wi_fold, class operator_widen_mult_signed,
	operator_widen_mult_signed::wi_fold, class operator_widen_mult_unsigned,
	operator_widen_mult_unsigned::wi_fold,
	ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
	ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
	* range-op.h (ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
	ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.

Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>

--- inline copy of patch -- 
diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
index 743b858126e333ea9590c0f175aacb476260c048..1bf63c5ce6f5db924a1f5907ab4539e376281bd0 100644
--- a/gcc/gimple-range-op.h
+++ b/gcc/gimple-range-op.h
@@ -41,6 +41,7 @@ public:
 		 relation_trio = TRIO_VARYING);
 private:
   void maybe_builtin_call ();
+  void maybe_non_standard ();
   gimple *m_stmt;
   tree m_op1, m_op2;
 };
diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index d9dfdc56939bb62ade72726b15c3d5e87e4ddcd1..ad13c873c6303db5b68b74db1562c0db6763101f 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
   // statements.
   if (is_a <gcall *> (m_stmt))
     maybe_builtin_call ();
+  else
+    maybe_non_standard ();
 }
 
 // Calculate what we can determine of the range of this unary
@@ -764,6 +766,44 @@ public:
   }
 } op_cfn_parity;
 
+// Set up a gimple_range_op_handler for any nonstandard function which can be
+// supported via range-ops.
+
+void
+gimple_range_op_handler::maybe_non_standard ()
+{
+  range_operator *signed_op = ptr_op_widen_mult_signed;
+  range_operator *unsigned_op = ptr_op_widen_mult_unsigned;
+  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
+    switch (gimple_assign_rhs_code (m_stmt))
+      {
+	case WIDEN_PLUS_EXPR:
+	{
+	  signed_op = ptr_op_widen_plus_signed;
+	  unsigned_op = ptr_op_widen_plus_unsigned;
+	}
+	gcc_fallthrough ();
+	case WIDEN_MULT_EXPR:
+	{
+	  m_valid = true;
+	  m_op1 = gimple_assign_rhs1 (m_stmt);
+	  m_op2 = gimple_assign_rhs2 (m_stmt);
+	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
+	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
+	  if (signed2 && !signed1)
+	    std::swap (m_op1, m_op2);
+
+	  if (signed1 || signed2)
+	    m_int = signed_op;
+	  else
+	    m_int = unsigned_op;
+	  break;
+	}
+	default:
+	  break;
+      }
+}
+
 // Set up a gimple_range_op_handler for any built in function which can be
 // supported via range-ops.
 
diff --git a/gcc/range-op.h b/gcc/range-op.h
index f00b747f08a1fa8404c63bfe5a931b4048008b03..b1eeac70df81f2bdf228af7adff5399e7ac5e5d6 100644
--- a/gcc/range-op.h
+++ b/gcc/range-op.h
@@ -311,4 +311,8 @@ private:
 // This holds the range op table for floating point operations.
 extern floating_op_table *floating_tree_table;
 
+extern range_operator *ptr_op_widen_mult_signed;
+extern range_operator *ptr_op_widen_mult_unsigned;
+extern range_operator *ptr_op_widen_plus_signed;
+extern range_operator *ptr_op_widen_plus_unsigned;
 #endif // GCC_RANGE_OP_H
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index 5c67bce6d3aab81ad3186b902e09d6a96878d9bb..718ccb6f074e1a2a9ef1b7a5d4e879898d4a7fc3 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -1556,6 +1556,73 @@ operator_plus::op2_range (irange &r, tree type,
   return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
 }
 
+class operator_widen_plus_signed : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus_signed;
+range_operator *ptr_op_widen_plus_signed = &op_widen_plus_signed;
+
+void
+operator_widen_plus_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb
+     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+   wide_int lh_wub
+     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
+
+class operator_widen_plus_unsigned : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus_unsigned;
+range_operator *ptr_op_widen_plus_unsigned = &op_widen_plus_unsigned;
+
+void
+operator_widen_plus_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb
+     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+   wide_int lh_wub
+     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
 
 class operator_minus : public range_operator
 {
@@ -2031,6 +2098,70 @@ operator_mult::wi_fold (irange &r, tree type,
     }
 }
 
+class operator_widen_mult_signed : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_signed;
+range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
+
+void
+operator_widen_mult_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
+
+
+class operator_widen_mult_unsigned : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_unsigned;
+range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
+
+void
+operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow but range
+     calculations for multiplications are complicated.  After widening the
+     operands lets call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
 
 class operator_div : public cross_product_operator
 {




-- 


* [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
  2023-02-27 12:32 [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Tamar Christina
  2023-02-27 12:33 ` [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583] Tamar Christina
@ 2023-02-27 12:33 ` Tamar Christina
  2023-03-06 11:23   ` Tamar Christina
  2023-02-27 12:34 ` [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
  2023-02-27 14:07 ` [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Richard Biener
  3 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-02-27 12:33 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jlaw, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 10865 bytes --]

Hi All,

As Richard S wanted, this now implements a hook
preferred_div_as_shifts_over_mult that indicates whether a target prefers the
vectorizer to decompose division as shifts rather than multiplication when
possible.
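
For reference, opting in from a backend is just a matter of returning true
from the new hook, since it takes no arguments.  A minimal sketch of how a
target could wire it up (the function name below is made up for illustration
and is not taken from this series):

  /* Illustrative only: state a preference for shift-based division
     decomposition.  */
  static bool
  example_preferred_div_as_shifts_over_mult (void)
  {
    return true;
  }

  #undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
  #define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
    example_preferred_div_as_shifts_over_mult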

In order to be able to use this we need to check whether the current precision
has enough bits to do the operation without any of the additions overflowing.
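
To illustrate the decomposition and the overflow condition, here is a
standalone scalar sketch (plain C, not the vectorized gimple the pattern
emits) of the identity from the patch comment, x / 0xff computed as
(x + ((x + 257) >> 8)) >> 8 for a 16-bit x:

  /* The `wide' form uses 32-bit intermediates and always holds; the
     `narrow' form uses only 16-bit arithmetic and is exercised when
     x + 257 still fits in 16 bits, which is essentially the overflow
     condition the pattern checks with ranger.  */
  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    for (uint32_t x = 0; x <= 0xffff; x++)
      {
        uint32_t wide = (x + ((x + 257) >> 8)) >> 8;
        assert (wide == x / 0xff);

        if (x + 257 <= 0xffff)
          {
            uint16_t n = (uint16_t) x;
            uint16_t t1 = (uint16_t) (n + 257);
            uint16_t t2 = t1 >> 8;
            uint16_t narrow = (uint16_t) (n + t2) >> 8;
            assert (narrow == x / 0xff);
          }
      }
    return 0;
  }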

We use range information to determine this and only do the operation if we're
sure an overflow won't occur.  This now uses ranger to do the range check.

This seems to work better than vect_get_range_info, which uses range_query,
but I have not switched the interface of vect_get_range_info over as part of
this PR.

As Andy said before, initializing a ranger instance is cheap but not free, and
if the intention is to call it often during a pass it should be instantiated
at pass startup and passed along to the places that need it.  That is a bigger
refactoring and doesn't seem right to do in this PR, but we should do it for
GCC 14.

Currently we only instantiate it after a long series of much cheaper checks.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* target.def (preferred_div_as_shifts_over_mult): New.
	* doc/tm.texi.in: Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
	* targhooks.h (default_preferred_div_as_shifts_over_mult): New.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.

gcc/testsuite/ChangeLog:

	PR target/108583
	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
	* gcc.dg/vect/vect-div-bitmask-5.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 50a8872a6695b18b9bed0d393bacf733833633db..c85196015e2e53047fcc65d32ef2d3203d2a6bab 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,6 +6137,9 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (void)
+When decomposing a division operation, if possible prefer to decompose the
+operation as shifts rather than multiplication by magic constants.
 @end deftypefn
 
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
+@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
diff --git a/gcc/target.def b/gcc/target.def
index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..8cc18b1f3c5de24c21faf891b9d4d0b6fd5b59d7 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1868,6 +1868,15 @@ correct for most targets.",
  poly_uint64, (const_tree type),
  default_preferred_vector_alignment)
 
+/* Returns whether the target has a preference for decomposing divisions using
+   shifts rather than multiplies.  */
+DEFHOOK
+(preferred_div_as_shifts_over_mult,
+ "When decomposing a division operation, if possible prefer to decompose the\n\
+operation as shifts rather than multiplication by magic constants.",
+ bool, (void),
+ default_preferred_div_as_shifts_over_mult)
+
 /* Return true if vector alignment is reachable (by peeling N
    iterations) for the given scalar type.  */
 DEFHOOK
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a6a4809ca91baa5d7fad2244549317a31390f0c2..dda011c59fbd5973ee648dfea195619cc41c71bc 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
 extern unsigned HOST_WIDE_INT default_shift_truncation_mask
   (machine_mode);
 extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
+extern bool
+default_preferred_div_as_shifts_over_mult (void);
 extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
 
 extern tree default_stack_protect_guard (void);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 211525720a620d6f533e2da91e03877337a931e7..6396f344eef09dd61f358938846a1c02a70b31d8 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
   return TYPE_ALIGN (type);
 }
 
+/* The default implementation of
+   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+bool
+default_preferred_div_as_shifts_over_mult (void)
+{
+  return false;
+}
+
 /* By default assume vectors of element TYPE require a multiple of the natural
    alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..31f2a6753b4faccb77351c8c5afed9775888b60f 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3913,6 +3913,84 @@ vect_recog_divmod_pattern (vec_info *vinfo,
 
       return pattern_stmt;
     }
+  else if ((cst = uniform_integer_cst_p (oprnd1))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && targetm.vectorize.preferred_div_as_shifts_over_mult ())
+    {
+      /* div optimizations using narrowings
+       we can do the division e.g. shorts by 255 faster by calculating it as
+       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
+       double the precision of x.
+
+       If we imagine a short as being composed of two blocks of bytes then
+       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+       adding 1 to each sub component:
+
+	    short value of 16-bits
+       ┌──────────────┬────────────────┐
+       │              │                │
+       └──────────────┴────────────────┘
+	 8-bit part1 ▲  8-bit part2   ▲
+		     │                │
+		     │                │
+		    +1               +1
+
+       after the first addition, we have to shift right by 8, and narrow the
+       results back to a byte.  Remember that the addition must be done in
+       double the precision of the input.  However if we know that the addition
+       `x + 257` does not overflow then we can do the operation in the current
+       precision.  In which case we don't need the pack and unpacks.  */
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == (int) (element_precision (vectype) / 2))
+	{
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+
+	  gimple_ranger ranger;
+	  int_range_max r;
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+
+	  if (ranger.range_of_expr (r, oprnd0, stmt))
+	    {
+	      wide_int max = r.upper_bound ();
+	      wide_int one = wi::to_wide (build_one_cst (itype));
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      wi::add (max, adder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  tree rshift = wide_int_to_tree (itype, pow);
+
+		  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
+		  gassign *patt1
+		    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
+					       rshift);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
+					       oprnd0);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
+		  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
+						      new_lhs3, rshift);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
+    }
 
   if (prec > HOST_BITS_PER_WIDE_INT
       || integer_zerop (oprnd1))




-- 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-02-27 12:32 [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Tamar Christina
  2023-02-27 12:33 ` [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583] Tamar Christina
  2023-02-27 12:33 ` [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583] Tamar Christina
@ 2023-02-27 12:34 ` Tamar Christina
  2023-03-06 11:21   ` Tamar Christina
  2023-02-27 14:07 ` [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Richard Biener
  3 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-02-27 12:34 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 12412 bytes --]

Hi All,

This replaces the custom division hook with an implementation based on
add_highpart.  For NEON we implement the add highpart (an addition followed by
extraction of the upper half of each element, kept in the same precision) as
ADD + LSR.
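
As a scalar sketch of the per-lane operation (my own illustration, not code
from the patch), the add highpart on 16-bit lanes is simply:

#include <stdint.h>

/* One 16-bit lane: add, keep only the upper half, still in 16-bit
   precision, i.e. an ADD followed by an LSR #8.  */
static inline uint16_t
add_highpart_u16 (uint16_t a, uint16_t b)
{
  return (uint16_t) (a + b) >> 8;
}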

This representation allows us to easily optimize the sequence using existing
patterns. This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the best sequence, however, we match (a + ((b + c) >> n)), where n is
half the precision of the mode of the operation, into addhn + uaddw. This is a
generally good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4
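
For reference, here is a quick standalone check of the underlying identity (my
own sketch, not part of the patch), with the intermediate additions done in 32
bits so nothing overflows:

#include <stdint.h>
#include <assert.h>

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    assert (x / 255 == (x + ((x + 257) >> 8)) >> 8);
  return 0;
}

This is the same identity the mid-end pattern from patch 3/4 emits directly in
the narrow type once ranger proves the first addition cannot overflow.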

For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the best sequence I match (a + b) >> n (with the same constraint on
n) to addhnb, which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
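
A scalar model of what the two addhnb steps compute per 16-bit lane (again my
own sketch; it assumes x + 257 cannot overflow 16 bits, which is the condition
the mid-end pattern proves with ranger):

#include <stdint.h>

/* One lane of addhnb: add, then keep the high half of the 16-bit result.  */
static inline uint8_t
addhnb_lane (uint16_t a, uint16_t b)
{
  return (uint16_t) (a + b) >> 8;
}

/* x / 255 for x <= 0xffff - 257, i.e. when x + 257 cannot wrap.  */
static inline uint8_t
div255_lane (uint16_t x)
{
  uint8_t t = addhnb_lane (x, 257);   /* (x + 257) >> 8  */
  return addhnb_lane (x, t);          /* (x + t) >> 8    */
}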

There are multiple possible RTL representations for these optimizations.  I did
not represent them using a zero_extend because we seem very inconsistent about
this in the backend, and since they are unspecs we won't match them from vector
ops anyway.  I figured maintainers would prefer this, but my maintainer ouija
board is still out for repairs :)

There are no new tests, as the correctness tests were added to the mid-end
patches and codegen tests for this already exist.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,60 +4867,27 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_operand")
-   (match_operand:VQN 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector element
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+  [(set (match_operand:VQN 0 "register_operand" "=&w")
+	(plus:VQN
+	  (lshiftrt:VQN
+	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+		      (match_operand:VQN 2 "register_operand" "w"))
+	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+	  (match_operand:VQN 4 "register_operand" "w")))]
   "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
-  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
-  rtx tmp2 = gen_reg_rtx (<MODE>mode);
-  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
-  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
-  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
-  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
-  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<VNARROWQ>mode);
+  else
+    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
+  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d6873877386222d56144cc115e3953a61 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,41 +2317,24 @@ (define_insn "@aarch64_sve_<optab><mode>"
 ;; ---- [INT] Misc optab implementations
 ;; -------------------------------------------------------------------------
 ;; Includes:
-;; - aarch64_bitmask_udiv
+;; - bitmask_shift_plus
 ;; -------------------------------------------------------------------------
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
-   (match_operand:SVE_FULL_HSDI 1 "register_operand")
-   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector element
+(define_insn "*bitmask_shift_plus<mode>"
+  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+	(unspec:SVE_FULL_HSDI
+	   [(match_operand:<VPRED> 1)
+	    (lshiftrt:SVE_FULL_HSDI
+	      (plus:SVE_FULL_HSDI
+		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+	      (match_operand:SVE_FULL_HSDI 4
+		 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))]
+          UNSPEC_PRED_X))]
   "TARGET_SVE2"
-{
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
-  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
-			      addend));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
-			      lowpart_subreg (<MODE>mode, tmp1,
-					      <VNARROW>mode)));
-  emit_move_insn (operands[0],
-		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
-  DONE;
-})
+  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
+)
 
 ;; =========================================================================
 ;; == Permutation
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..2728fb347c0df1756b237f4d6268908eef6bdd2a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3849,6 +3849,13 @@ aarch64_vectorize_related_mode (machine_mode vector_mode,
   return default_vectorize_related_mode (vector_mode, element_mode, nunits);
 }
 
+/* Implement TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+static bool aarch64_vectorize_preferred_div_as_shifts_over_mult (void)
+{
+  return true;
+}
+
 /* Implement TARGET_PREFERRED_ELSE_VALUE.  For binary operations,
    prefer to use the first arithmetic operand as the else value if
    the else value doesn't matter, since that exactly matches the SVE
@@ -24363,46 +24370,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
 
   return ret;
 }
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
-					       tree vectype, wide_int cst,
-					       rtx *output, rtx in0, rtx in1)
-{
-  if (code != TRUNC_DIV_EXPR
-      || !TYPE_UNSIGNED (vectype))
-    return false;
-
-  machine_mode mode = TYPE_MODE (vectype);
-  unsigned int flags = aarch64_classify_vector_mode (mode);
-  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
-    return false;
-
-  int pow = wi::exact_log2 (cst + 1);
-  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
-  /* SVE actually has a div operator, we may have gotten here through
-     that route.  */
-  if (pow != (int) (element_precision (vectype) / 2)
-      || insn_code == CODE_FOR_nothing)
-    return false;
-
-  /* We can use the optimized pattern.  */
-  if (in0 == NULL_RTX && in1 == NULL_RTX)
-    return true;
-
-  gcc_assert (output);
-
-  expand_operand ops[3];
-  create_output_operand (&ops[0], *output, mode);
-  create_input_operand (&ops[1], in0, mode);
-  create_fixed_operand (&ops[2], in1);
-  expand_insn (insn_code, 3, ops);
-  *output = ops[0].value;
-  return true;
-}
-
 /* Generate a byte permute mask for a register of mode MODE,
    which has NUNITS units.  */
 
@@ -27904,13 +27871,13 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_MAX_ANCHOR_OFFSET
 #define TARGET_MAX_ANCHOR_OFFSET 4095
 
+#undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
+#define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
+  aarch64_vectorize_preferred_div_as_shifts_over_mult
+
 #undef TARGET_VECTOR_ALIGNMENT
 #define TARGET_VECTOR_ALIGNMENT aarch64_simd_vector_alignment
 
-#undef TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-#define TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST \
-  aarch64_vectorize_can_special_div_by_constant
-
 #undef TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT
 #define TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT \
   aarch64_vectorize_preferred_vector_alignment




-- 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583]
  2023-02-27 12:32 [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Tamar Christina
                   ` (2 preceding siblings ...)
  2023-02-27 12:34 ` [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
@ 2023-02-27 14:07 ` Richard Biener
  3 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-02-27 14:07 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, jlaw

On Mon, 27 Feb 2023, Tamar Christina wrote:

> Hi All,
> 
> This reverts the changes for the CAN_SPECIAL_DIV_BY_CONST hook.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

OK (you don't need approval for such reversion).

Thanks,
Richard.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR target/108583
> 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> 	* doc/tm.texi.in: Likewise.
> 	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
> 	* expmed.cc (expand_divmod): Likewise.
> 	* expmed.h (expand_divmod): Likewise.
> 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> 	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> 	* target.def (can_special_div_by_const): Remove.
> 	* target.h: Remove tree-core.h include
> 	* targhooks.cc (default_can_special_div_by_const): Remove.
> 	* targhooks.h (default_can_special_div_by_const): Remove.
> 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook.
> 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c6c891972d1e58cd163b259ba96a599d62326865..50a8872a6695b18b9bed0d393bacf733833633db 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,20 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
>  implementation approaches itself.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> -This hook is used to test whether the target has a special method of
> -division of vectors of type @var{vectype} using the value @var{constant},
> -and producing a vector of type @var{vectype}.  The division
> -will then not be decomposed by the vectorizer and kept as a div.
> -
> -When the hook is being used to test whether the target supports a special
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
> -is being used to emit a division, @var{in0} and @var{in1} are the source
> -vectors of type @var{vecttype} and @var{output} is the destination vector of
> -type @var{vectype}.
> -
> -Return true if the operation is possible, emitting instructions for it
> -if rtxes are provided and updating @var{output}.
>  @end deftypefn
>  
>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 613b2534149415f442163d599503efaf423b673b..3e07978a02f4e6077adae6cadc93ea4273295f1f 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,7 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>  
> -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
>  
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>  
> diff --git a/gcc/explow.cc b/gcc/explow.cc
> index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> --- a/gcc/explow.cc
> +++ b/gcc/explow.cc
> @@ -1037,7 +1037,7 @@ round_push (rtx size)
>       TRUNC_DIV_EXPR.  */
>    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>  			NULL_RTX, 1);
>    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>  
> @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
>  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>  				       Pmode),
>  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>  			  gen_int_mode (required_align / BITS_PER_UNIT,
>  					Pmode),
>  			  NULL_RTX, 1);
> diff --git a/gcc/expmed.h b/gcc/expmed.h
> index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> --- a/gcc/expmed.h
> +++ b/gcc/expmed.h
> @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
>  extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
>  			       int);
>  #ifdef GCC_OPTABS_H
> -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> -			  rtx, rtx, rtx, int,
> -			  enum optab_methods = OPTAB_LIB_WIDEN);
> +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> +			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
>  #endif
>  #endif
>  
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
>  
>  rtx
>  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> -	       int unsignedp, enum optab_methods methods)
> +	       rtx op0, rtx op1, rtx target, int unsignedp,
> +	       enum optab_methods methods)
>  {
>    machine_mode compute_mode;
>    rtx tquotient;
> @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  
>    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>  
> -  /* Check if the target has specific expansions for the division.  */
> -  tree cst;
> -  if (treeop0
> -      && treeop1
> -      && (cst = uniform_integer_cst_p (treeop1))
> -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> -						     wi::to_wide (cst),
> -						     &target, op0, op1))
> -    return target;
> -
> -
>    /* Now convert to the best mode to use.  */
>    if (compute_mode != mode)
>      {
> @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  			    || (optab_handler (sdivmod_optab, int_mode)
>  				!= CODE_FOR_nothing)))
>  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> -						int_mode, treeop0, treeop1,
> -						op0, gen_int_mode (abs_d,
> +						int_mode, op0,
> +						gen_int_mode (abs_d,
>  							      int_mode),
>  						NULL_RTX, 0);
>  		    else
> @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
>  				      size - 1, NULL_RTX, 0);
>  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>  				    NULL_RTX);
> -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> -				    treeop1, t3, op1, NULL_RTX, 0);
> +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> +				    NULL_RTX, 0);
>  		if (t4)
>  		  {
>  		    rtx t5;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>  	    return expand_divmod (0,
>  				  FLOAT_MODE_P (GET_MODE (value))
>  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> -				  GET_MODE (value), NULL, NULL, op1, op2,
> -				  target, 0);
> +				  GET_MODE (value), op1, op2, target, 0);
>  	case MOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 0);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 0);
>  	case UDIV:
> -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case UMOD:
> -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> -				op1, op2, target, 1);
> +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> +				target, 1);
>  	case ASHIFTRT:
>  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>  				      target, 0, OPTAB_LIB_WIDEN);
> @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        bool speed_p = optimize_insn_for_speed_p ();
>        do_pending_stack_adjust ();
>        start_sequence ();
> -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 1);
> +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
>        rtx_insn *uns_insns = get_insns ();
>        end_sequence ();
>        start_sequence ();
> -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -				   op0, op1, target, 0);
> +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
>        rtx_insn *sgn_insns = get_insns ();
>        end_sequence ();
>        unsigned uns_cost = seq_cost (uns_insns, speed_p);
> @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        emit_insn (sgn_insns);
>        return sgn_ret;
>      }
> -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -			op0, op1, target, unsignedp);
> +  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
>  }
>  
>  rtx
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
>  		return NULL_RTX;
>  	    }
>  	}
> -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> -				     sum, gen_int_mode (INTVAL (op1),
> -							word_mode),
> +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> +				     gen_int_mode (INTVAL (op1), word_mode),
>  				     NULL_RTX, 1, OPTAB_DIRECT);
>        if (remainder == NULL_RTX)
>  	return NULL_RTX;
> @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>  
>    if (op11 != const1_rtx)
>      {
> -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> -				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
>        if (rem2 == NULL_RTX)
>  	return NULL_RTX;
>  
> -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> -				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>        if (quot2 == NULL_RTX)
>  	return NULL_RTX;
>  
> diff --git a/gcc/target.def b/gcc/target.def
> index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>  	const vec_perm_indices &sel),
>   NULL)
>  
> -DEFHOOK
> -(can_special_div_by_const,
> - "This hook is used to test whether the target has a special method of\n\
> -division of vectors of type @var{vectype} using the value @var{constant},\n\
> -and producing a vector of type @var{vectype}.  The division\n\
> -will then not be decomposed by the vectorizer and kept as a div.\n\
> -\n\
> -When the hook is being used to test whether the target supports a special\n\
> -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
> -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> -type @var{vectype}.\n\
> -\n\
> -Return true if the operation is possible, emitting instructions for it\n\
> -if rtxes are provided and updating @var{output}.",
> - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> -	rtx in0, rtx in1),
> - default_can_special_div_by_const)
> -
>  /* Return true if the target supports misaligned store/load of a
>     specific factor denoted in the third parameter.  The last parameter
>     is true if the access is defined in a packed struct.  */
> diff --git a/gcc/target.h b/gcc/target.h
> index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -51,7 +51,6 @@
>  #include "insn-codes.h"
>  #include "tm.h"
>  #include "hard-reg-set.h"
> -#include "tree-core.h"
>  
>  #if CHECKING_P
>  
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
>  extern rtx default_addr_space_convert (rtx, tree, tree);
>  extern unsigned int default_case_values_threshold (void);
>  extern bool default_have_conditional_execution (void);
> -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> -					      rtx *, rtx, rtx);
>  
>  extern bool default_libc_has_function (enum function_class, tree);
>  extern bool default_libc_has_fast_function (int fcode);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>    return HAVE_conditional_execution;
>  }
>  
> -/* Default that no division by constant operations are special.  */
> -bool
> -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> -				  rtx)
> -{
> -  return false;
> -}
> -
>  /* By default we assume that c99 functions are present at the runtime,
>     but sincos is not.  */
>  bool
> diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> --- a/gcc/tree-vect-generic.cc
> +++ b/gcc/tree-vect-generic.cc
> @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
>  	  tree rhs2 = gimple_assign_rhs2 (assign);
>  	  tree ret;
>  
> -	  /* Check if the target was going to handle it through the special
> -	     division callback hook.  */
> -	  tree cst = uniform_integer_cst_p (rhs2);
> -	  if (cst &&
> -	      targetm.vectorize.can_special_div_by_const (code, type,
> -							  wi::to_wide (cst),
> -							  NULL,
> -							  NULL_RTX, NULL_RTX))
> -	    return NULL_TREE;
> -
> -
>  	  if (!optimize
>  	      || !VECTOR_INTEGER_TYPE_P (type)
>  	      || TREE_CODE (rhs2) != VECTOR_CST
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 6934aebc69f231af24668f0a1c3d140e97f55487..1766ce277d6b88d8aa3be77e7c8abb504a10a735 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3913,14 +3913,6 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>  
>        return pattern_stmt;
>      }
> -  else if ((cst = uniform_integer_cst_p (oprnd1))
> -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> -							  wi::to_wide (cst),
> -							  NULL, NULL_RTX,
> -							  NULL_RTX))
> -    {
> -      return NULL;
> -    }
>  
>    if (prec > HOST_BITS_PER_WIDE_INT
>        || integer_zerop (oprnd1))
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>  	}
>        target_support_p = (optab_handler (optab, vec_mode)
>  			  != CODE_FOR_nothing);
> -      tree cst;
> -      if (!target_support_p
> -	  && op1
> -	  && (cst = uniform_integer_cst_p (op1)))
> -	target_support_p
> -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> -							wi::to_wide (cst),
> -							NULL, NULL_RTX,
> -							NULL_RTX);
>      }
>  
>    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
> 
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-02-27 12:33 ` [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583] Tamar Christina
@ 2023-03-06 11:20   ` Tamar Christina
  2023-03-08  8:57     ` Aldy Hernandez
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-06 11:20 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, amacleod, aldyh

[-- Attachment #1: Type: text/plain, Size: 9623 bytes --]

Ping.

I have also updated the patch to reject cases that we don't expect or can't handle cleanly for now.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* gimple-range-op.h (gimple_range_op_handler): Add maybe_non_standard.
	* gimple-range-op.cc (gimple_range_op_handler::gimple_range_op_handler):
	Use it.
	(gimple_range_op_handler::maybe_non_standard): New.
	* range-op.cc (class operator_widen_plus_signed,
	operator_widen_plus_signed::wi_fold, class operator_widen_plus_unsigned,
	operator_widen_plus_unsigned::wi_fold, class operator_widen_mult_signed,
	operator_widen_mult_signed::wi_fold, class operator_widen_mult_unsigned,
	operator_widen_mult_unsigned::wi_fold,
	ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
	ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
	* range-op.h (ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
	ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New

Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>
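
As a rough scalar illustration (my own sketch, not part of the patch) of why
the widening operations folded below can never overflow the double-width
result type:

#include <stdint.h>
#include <assert.h>

int
main (void)
{
  /* Extreme 8-bit operand values.  */
  uint8_t a = 255, b = 255;
  int8_t c = -128, d = -128;

  uint16_t wadd = (uint16_t) a + (uint16_t) b;  /* 510 max  */
  uint16_t wmul = (uint16_t) a * (uint16_t) b;  /* 65025 max  */
  int16_t swmul = (int16_t) c * (int16_t) d;    /* 16384 max  */

  assert (wadd == 510 && wmul == 65025 && swmul == 16384);
  return 0;
}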

--- Inline copy of patch ---

diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
index 743b858126e333ea9590c0f175aacb476260c048..1bf63c5ce6f5db924a1f5907ab4539e376281bd0 100644
--- a/gcc/gimple-range-op.h
+++ b/gcc/gimple-range-op.h
@@ -41,6 +41,7 @@ public:
 		 relation_trio = TRIO_VARYING);
 private:
   void maybe_builtin_call ();
+  void maybe_non_standard ();
   gimple *m_stmt;
   tree m_op1, m_op2;
 };
diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
index d9dfdc56939bb62ade72726b15c3d5e87e4ddcd1..a5d625387e712c170e1e68f6a7d494027f6ef0d0 100644
--- a/gcc/gimple-range-op.cc
+++ b/gcc/gimple-range-op.cc
@@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
   // statements.
   if (is_a <gcall *> (m_stmt))
     maybe_builtin_call ();
+  else
+    maybe_non_standard ();
 }
 
 // Calculate what we can determine of the range of this unary
@@ -764,6 +766,57 @@ public:
   }
 } op_cfn_parity;
 
+// Set up a gimple_range_op_handler for any nonstandard function which can be
+// supported via range-ops.
+
+void
+gimple_range_op_handler::maybe_non_standard ()
+{
+  range_operator *signed_op = ptr_op_widen_mult_signed;
+  range_operator *unsigned_op = ptr_op_widen_mult_unsigned;
+  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
+    switch (gimple_assign_rhs_code (m_stmt))
+      {
+	case WIDEN_PLUS_EXPR:
+	{
+	  signed_op = ptr_op_widen_plus_signed;
+	  unsigned_op = ptr_op_widen_plus_unsigned;
+	}
+	gcc_fallthrough ();
+	case WIDEN_MULT_EXPR:
+	{
+	  m_valid = false;
+	  m_op1 = gimple_assign_rhs1 (m_stmt);
+	  m_op2 = gimple_assign_rhs2 (m_stmt);
+	  tree ret = gimple_assign_lhs (m_stmt);
+	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
+	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
+	  bool signed_ret = TYPE_SIGN (TREE_TYPE (ret)) == SIGNED;
+
+	  /* Normally these operands should all have the same sign, but
+	     some passes do violate this by taking mismatched sign args.  At
+	     the moment the only one that's possible is mismatched inputs and
+	     unsigned output.  Once ranger supports signs for the operands we
+	     can properly fix it,  for now only accept the case we can do
+	     correctly.  */
+	  if ((signed1 ^ signed2) && signed_ret)
+	    return;
+
+	  m_valid = true;
+	  if (signed2 && !signed1)
+	    std::swap (m_op1, m_op2);
+
+	  if (signed1 || signed2)
+	    m_int = signed_op;
+	  else
+	    m_int = unsigned_op;
+	  break;
+	}
+	default:
+	  break;
+      }
+}
+
 // Set up a gimple_range_op_handler for any built in function which can be
 // supported via range-ops.
 
diff --git a/gcc/range-op.h b/gcc/range-op.h
index f00b747f08a1fa8404c63bfe5a931b4048008b03..b1eeac70df81f2bdf228af7adff5399e7ac5e5d6 100644
--- a/gcc/range-op.h
+++ b/gcc/range-op.h
@@ -311,4 +311,8 @@ private:
 // This holds the range op table for floating point operations.
 extern floating_op_table *floating_tree_table;
 
+extern range_operator *ptr_op_widen_mult_signed;
+extern range_operator *ptr_op_widen_mult_unsigned;
+extern range_operator *ptr_op_widen_plus_signed;
+extern range_operator *ptr_op_widen_plus_unsigned;
 #endif // GCC_RANGE_OP_H
diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index 5c67bce6d3aab81ad3186b902e09d6a96878d9bb..718ccb6f074e1a2a9ef1b7a5d4e879898d4a7fc3 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -1556,6 +1556,73 @@ operator_plus::op2_range (irange &r, tree type,
   return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
 }
 
+class operator_widen_plus_signed : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus_signed;
+range_operator *ptr_op_widen_plus_signed = &op_widen_plus_signed;
+
+void
+operator_widen_plus_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb
+     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+   wide_int lh_wub
+     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
+
+class operator_widen_plus_unsigned : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub) const;
+} op_widen_plus_unsigned;
+range_operator *ptr_op_widen_plus_unsigned = &op_widen_plus_unsigned;
+
+void
+operator_widen_plus_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+   wi::overflow_type ov_lb, ov_ub;
+   signop s = TYPE_SIGN (type);
+
+   wide_int lh_wlb
+     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+   wide_int lh_wub
+     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
+   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
+   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
+
+   r = int_range<2> (type, new_lb, new_ub);
+}
 
 class operator_minus : public range_operator
 {
@@ -2031,6 +2098,70 @@ operator_mult::wi_fold (irange &r, tree type,
     }
 }
 
+class operator_widen_mult_signed : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_signed;
+range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
+
+void
+operator_widen_mult_signed::wi_fold (irange &r, tree type,
+				     const wide_int &lh_lb,
+				     const wide_int &lh_ub,
+				     const wide_int &rh_lb,
+				     const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow, but
+     range calculations for multiplications are complicated.  After widening
+     the operands, let's call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
+
+
+class operator_widen_mult_unsigned : public range_operator
+{
+public:
+  virtual void wi_fold (irange &r, tree type,
+			const wide_int &lh_lb,
+			const wide_int &lh_ub,
+			const wide_int &rh_lb,
+			const wide_int &rh_ub)
+    const;
+} op_widen_mult_unsigned;
+range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
+
+void
+operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
+				       const wide_int &lh_lb,
+				       const wide_int &lh_ub,
+				       const wide_int &rh_lb,
+				       const wide_int &rh_ub) const
+{
+  signop s = TYPE_SIGN (type);
+
+  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
+  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
+  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
+  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
+
+  /* We don't expect a widening multiplication to be able to overflow, but
+     range calculations for multiplications are complicated.  After widening
+     the operands, let's call the base class.  */
+  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
+}
 
 class operator_div : public cross_product_operator
 {


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-02-27 12:34 ` [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
@ 2023-03-06 11:21   ` Tamar Christina
  2023-03-08  9:17     ` Richard Sandiford
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-06 11:21 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov,
	Richard Sandiford

Ping,

And updated the hook.

There are no new tests, as correctness tests were added to the mid-end and
codegen tests for this already exist.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,60 +4867,27 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_operand")
-   (match_operand:VQN 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+  [(set (match_operand:VQN 0 "register_operand" "=&w")
+	(plus:VQN
+	  (lshiftrt:VQN
+	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+		      (match_operand:VQN 2 "register_operand" "w"))
+	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+	  (match_operand:VQN 4 "register_operand" "w")))]
   "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
-  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
-  rtx tmp2 = gen_reg_rtx (<MODE>mode);
-  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
-  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
-  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
-  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
-  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<VNARROWQ>mode);
+  else
+    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
+  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d6873877386222d56144cc115e3953a61 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,41 +2317,24 @@ (define_insn "@aarch64_sve_<optab><mode>"
 ;; ---- [INT] Misc optab implementations
 ;; -------------------------------------------------------------------------
 ;; Includes:
-;; - aarch64_bitmask_udiv
+;; - bitmask_shift_plus
 ;; -------------------------------------------------------------------------
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
-   (match_operand:SVE_FULL_HSDI 1 "register_operand")
-   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
+(define_insn "*bitmask_shift_plus<mode>"
+  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+	(unspec:SVE_FULL_HSDI
+	   [(match_operand:<VPRED> 1)
+	    (lshiftrt:SVE_FULL_HSDI
+	      (plus:SVE_FULL_HSDI
+		(match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+		(match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+	      (match_operand:SVE_FULL_HSDI 4
+		 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))]
+          UNSPEC_PRED_X))]
   "TARGET_SVE2"
-{
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
-  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
-			      addend));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
-			      lowpart_subreg (<MODE>mode, tmp1,
-					      <VNARROW>mode)));
-  emit_move_insn (operands[0],
-		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
-  DONE;
-})
+  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
+)
 
 ;; =========================================================================
 ;; == Permutation
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..eb4f99ee524844ed5b3684c6fe807a4128685423 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3849,6 +3849,19 @@ aarch64_vectorize_related_mode (machine_mode vector_mode,
   return default_vectorize_related_mode (vector_mode, element_mode, nunits);
 }
 
+/* Implement TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+static bool
+aarch64_vectorize_preferred_div_as_shifts_over_mult (const_tree type)
+{
+  machine_mode mode = TYPE_MODE (type);
+  unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+  bool sve_p = (vec_flags & VEC_ANY_SVE);
+  bool simd_p = (vec_flags & VEC_ADVSIMD);
+
+  return (sve_p && TARGET_SVE2) || (simd_p && TARGET_SIMD);
+}
+
 /* Implement TARGET_PREFERRED_ELSE_VALUE.  For binary operations,
    prefer to use the first arithmetic operand as the else value if
    the else value doesn't matter, since that exactly matches the SVE
@@ -24363,46 +24376,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
 
   return ret;
 }
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
-					       tree vectype, wide_int cst,
-					       rtx *output, rtx in0, rtx in1)
-{
-  if (code != TRUNC_DIV_EXPR
-      || !TYPE_UNSIGNED (vectype))
-    return false;
-
-  machine_mode mode = TYPE_MODE (vectype);
-  unsigned int flags = aarch64_classify_vector_mode (mode);
-  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
-    return false;
-
-  int pow = wi::exact_log2 (cst + 1);
-  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
-  /* SVE actually has a div operator, we may have gotten here through
-     that route.  */
-  if (pow != (int) (element_precision (vectype) / 2)
-      || insn_code == CODE_FOR_nothing)
-    return false;
-
-  /* We can use the optimized pattern.  */
-  if (in0 == NULL_RTX && in1 == NULL_RTX)
-    return true;
-
-  gcc_assert (output);
-
-  expand_operand ops[3];
-  create_output_operand (&ops[0], *output, mode);
-  create_input_operand (&ops[1], in0, mode);
-  create_fixed_operand (&ops[2], in1);
-  expand_insn (insn_code, 3, ops);
-  *output = ops[0].value;
-  return true;
-}
-
 /* Generate a byte permute mask for a register of mode MODE,
    which has NUNITS units.  */
 
@@ -27904,13 +27877,13 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_MAX_ANCHOR_OFFSET
 #define TARGET_MAX_ANCHOR_OFFSET 4095
 
+#undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
+#define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
+  aarch64_vectorize_preferred_div_as_shifts_over_mult
+
 #undef TARGET_VECTOR_ALIGNMENT
 #define TARGET_VECTOR_ALIGNMENT aarch64_simd_vector_alignment
 
-#undef TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-#define TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST \
-  aarch64_vectorize_can_special_div_by_constant
-
 #undef TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT
 #define TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT \
   aarch64_vectorize_preferred_vector_alignment

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
  2023-02-27 12:33 ` [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583] Tamar Christina
@ 2023-03-06 11:23   ` Tamar Christina
  2023-03-08  8:55     ` Richard Sandiford
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-06 11:23 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 11839 bytes --]

Ping,

And updated the hook to allow differentiating between ISAs.

As Andy said before, initializing a ranger instance is cheap but not free, and
if the intention is to call it often during a pass, it should be instantiated
at pass startup and passed along to the places that need it.  This is a big
refactoring and doesn't seem right to do in this PR, but we should do it in
GCC 14.

Currently we only instantiate it after a long series of much cheaper checks.
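
For illustration only (not part of this patch), the GCC 14 shape I mean is
roughly the following: one ranger owned by the pass and handed down to the
helpers.  The function names here are made up; the gimple-range types and
calls are the ones the patch below already uses:

  /* Hypothetical helper: takes the shared ranger instead of building one.  */
  static void
  process_stmt_sketch (gimple_ranger *ranger, tree op, gimple *stmt)
  {
    int_range_max r;
    if (ranger->range_of_expr (r, op, stmt))
      {
	/* ... use r.upper_bound () etc. as in the patch below ...  */
      }
  }

  /* Hypothetical pass body: the ranger is constructed once at startup
     and reused for every statement.  */
  static unsigned int
  execute_sketch (void)
  {
    gimple_ranger ranger;
    /* ... walk the IL, calling process_stmt_sketch (&ranger, ...) ...  */
    return 0;
  }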

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* target.def (preferred_div_as_shifts_over_mult): New.
	* doc/tm.texi.in: Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
	* targhooks.h (default_preferred_div_as_shifts_over_mult): New.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.

gcc/testsuite/ChangeLog:

	PR target/108583
	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
	* gcc.dg/vect/vect-div-bitmask-5.c: New test.

--- inline copy of patch ---

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 50a8872a6695b18b9bed0d393bacf733833633db..f69f7f036272e867ea1c3fee851b117f057f68c5 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,6 +6137,10 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
+If possible, when decomposing a division operation of vectors of
+type @var{type} during vectorization, prefer to use shifts rather than
+multiplication by magic constants.
 @end deftypefn
 
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
+@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
diff --git a/gcc/target.def b/gcc/target.def
index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..bdee9b7f9c941508738fac49593b5baa525e2915 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1868,6 +1868,16 @@ correct for most targets.",
  poly_uint64, (const_tree type),
  default_preferred_vector_alignment)
 
+/* Returns whether the target has a preference for decomposing divisions using
+   shifts rather than multiplies.  */
+DEFHOOK
+(preferred_div_as_shifts_over_mult,
+ "If possible, when decomposing a division operation of vectors of\n\
+type @var{type} during vectorization, prefer to use shifts rather than\n\
+multiplication by magic constants.",
+ bool, (const_tree type),
+ default_preferred_div_as_shifts_over_mult)
+
 /* Return true if vector alignment is reachable (by peeling N
    iterations) for the given scalar type.  */
 DEFHOOK
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
 extern unsigned HOST_WIDE_INT default_shift_truncation_mask
   (machine_mode);
 extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
+extern bool default_preferred_div_as_shifts_over_mult
+  (const_tree);
 extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
 
 extern tree default_stack_protect_guard (void);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 211525720a620d6f533e2da91e03877337a931e7..becea6ef4b6329cfa0b676f8d844630fbdc97f20 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
   return TYPE_ALIGN (type);
 }
 
+/* The default implementation of
+   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+bool
+default_preferred_div_as_shifts_over_mult (const_tree /* type */)
+{
+  return false;
+}
+
 /* By default assume vectors of element TYPE require a multiple of the natural
    alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 5ab5b944a573ad24ce8427aff24fc5215bf05dac..26ed91d58fa4709a67c903ad446d267a3113c172 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3346,6 +3346,20 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
 		    param_avoid_fma_max_bits));
   bool defer = check_defer;
   bool seen_negate_p = false;
+
+  /* There is no numerical difference between fused and unfused integer FMAs,
+     and the assumption below that FMA is as cheap as addition is unlikely
+     to be true, especially if the multiplication occurs multiple times on
+     the same chain.  E.g., for something like:
+
+	 (((a * b) + c) >> 1) + (a * b)
+
+     we do not want to duplicate the a * b into two additions, not least
+     because the result is not a natural FMA chain.  */
+  if (ANY_INTEGRAL_TYPE_P (type)
+      && !has_single_use (mul_result))
+    return false;
+
   /* Make sure that the multiplication statement becomes dead after
      the transformation, thus that all uses are transformed to FMAs.
      This means we assume that an FMA operation has the same cost
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..27fb4c6a59f0182c4a836f96ab6b5e2e405a18a0 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3913,6 +3913,84 @@ vect_recog_divmod_pattern (vec_info *vinfo,
 
       return pattern_stmt;
     }
+  else if ((cst = uniform_integer_cst_p (oprnd1))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))
+    {
+      /* div optimizations using narrowings
+       we can do the division e.g. shorts by 255 faster by calculating it as
+       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
+       double the precision of x.
+
+       If we imagine a short as being composed of two blocks of bytes then
+       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+       adding 1 to each sub component:
+
+	    short value of 16-bits
+       ┌──────────────┬────────────────┐
+       │              │                │
+       └──────────────┴────────────────┘
+	 8-bit part1 ▲  8-bit part2   ▲
+		     │                │
+		     │                │
+		    +1               +1
+
+       after the first addition, we have to shift right by 8, and narrow the
+       results back to a byte.  Remember that the addition must be done in
+       double the precision of the input.  However if we know that the addition
+       `x + 257` does not overflow then we can do the operation in the current
+       precision.  In which case we don't need the pack and unpacks.  */
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == (int) (element_precision (vectype) / 2))
+	{
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+
+	  gimple_ranger ranger;
+	  int_range_max r;
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+
+	  if (ranger.range_of_expr (r, oprnd0, stmt))
+	    {
+	      wide_int max = r.upper_bound ();
+	      wide_int one = wi::to_wide (build_one_cst (itype));
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      wi::add (max, adder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  tree rshift = wide_int_to_tree (itype, pow);
+
+		  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
+		  gassign *patt1
+		    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
+					       rshift);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
+					       oprnd0);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
+		  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
+						      new_lhs3, rshift);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
+    }
 
   if (prec > HOST_BITS_PER_WIDE_INT
       || integer_zerop (oprnd1))

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
  2023-03-06 11:23   ` Tamar Christina
@ 2023-03-08  8:55     ` Richard Sandiford
  2023-03-09 19:39       ` Tamar Christina
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Sandiford @ 2023-03-08  8:55 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, rguenther

Tamar Christina <Tamar.Christina@arm.com> writes:
> Ping,
>
> And updated the hook to allow differentiating between ISAs.
>
> As Andy said before, initializing a ranger instance is cheap but not free, and
> if the intention is to call it often during a pass, it should be instantiated
> at pass startup and passed along to the places that need it.  This is a big
> refactoring and doesn't seem right to do in this PR, but we should do it in
> GCC 14.
>
> Currently we only instantiate it after a long series of much cheaper checks.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         PR target/108583
>         * target.def (preferred_div_as_shifts_over_mult): New.
>         * doc/tm.texi.in: Document it.
>         * doc/tm.texi: Regenerate.
>         * targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
>         * targhooks.h (default_preferred_div_as_shifts_over_mult): New.
>         * tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/108583
>         * gcc.dg/vect/vect-div-bitmask-4.c: New test.
>         * gcc.dg/vect/vect-div-bitmask-5.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 50a8872a6695b18b9bed0d393bacf733833633db..f69f7f036272e867ea1c3fee851b117f057f68c5 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,6 +6137,10 @@ instruction pattern.  There is no need for the hook to handle these two
>  implementation approaches itself.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
> +If possible, when decomposing a division operation of vectors of
> +type @var{type} during vectorization, prefer to use shifts rather than
> +multiplication by magic constants.
>  @end deftypefn
>
>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
>
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>
> +@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
>
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>
> diff --git a/gcc/target.def b/gcc/target.def
> index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..bdee9b7f9c941508738fac49593b5baa525e2915 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1868,6 +1868,16 @@ correct for most targets.",
>   poly_uint64, (const_tree type),
>   default_preferred_vector_alignment)
>
> +/* Returns whether the target has a preference for decomposing divisions using
> +   shifts rather than multiplies.  */
> +DEFHOOK
> +(preferred_div_as_shifts_over_mult,
> + "If possible, when decomposing a division operation of vectors of\n\
> +type @var{type} during vectorization, prefer to use shifts rather than\n\
> +multiplication by magic constants.",

Both approaches require shifts though.  How about:

  Sometimes it is possible to implement a vector division using a sequence
  of two addition-shift pairs, giving four instructions in total.
  Return true if taking this approach for @var{vectype} is likely
  to be better than using a sequence involving highpart multiplication.

It should also say what the default is, more below.

> + bool, (const_tree type),
> + default_preferred_div_as_shifts_over_mult)
> +
>  /* Return true if vector alignment is reachable (by peeling N
>     iterations) for the given scalar type.  */
>  DEFHOOK
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
>  extern unsigned HOST_WIDE_INT default_shift_truncation_mask
>    (machine_mode);
>  extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
> +extern bool default_preferred_div_as_shifts_over_mult
> +  (const_tree);
>  extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
>
>  extern tree default_stack_protect_guard (void);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index 211525720a620d6f533e2da91e03877337a931e7..becea6ef4b6329cfa0b676f8d844630fbdc97f20 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
>    return TYPE_ALIGN (type);
>  }
>
> +/* The default implementation of
> +   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
> +
> +bool
> +default_preferred_div_as_shifts_over_mult (const_tree /* type */)
> +{
> +  return false;
> +}
> +

I think the default should be true for targets without highpart multiplication,
since the fallback isn't possible then.  Either that, or we should skip
calling the hook when the fallback isn't possible.  E.g. maybe we could
test and record can_mult_highpart_p before the new code, and skip the
hook test when can_mult_highpart_p is false.
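
For illustration (not part of the patch), that gating could look roughly like
this, reusing the existing can_mult_highpart_p query; just a sketch, and the
variable name is invented:

  /* Only consult the hook when the highpart-multiply fallback is actually
     available; otherwise the shift sequence is the only option.  */
  bool use_shift_sequence
    = (!can_mult_highpart_p (TYPE_MODE (vectype), TYPE_UNSIGNED (itype))
       || targetm.vectorize.preferred_div_as_shifts_over_mult (vectype));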

>  /* By default assume vectors of element TYPE require a multiple of the natural
>     alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
>  bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
> index 5ab5b944a573ad24ce8427aff24fc5215bf05dac..26ed91d58fa4709a67c903ad446d267a3113c172 100644
> --- a/gcc/tree-ssa-math-opts.cc
> +++ b/gcc/tree-ssa-math-opts.cc
> @@ -3346,6 +3346,20 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
>                     param_avoid_fma_max_bits));
>    bool defer = check_defer;
>    bool seen_negate_p = false;
> +
> +  /* There is no numerical difference between fused and unfused integer FMAs,
> +     and the assumption below that FMA is as cheap as addition is unlikely
> +     to be true, especially if the multiplication occurs multiple times on
> +     the same chain.  E.g., for something like:
> +
> +        (((a * b) + c) >> 1) + (a * b)
> +
> +     we do not want to duplicate the a * b into two additions, not least
> +     because the result is not a natural FMA chain.  */
> +  if (ANY_INTEGRAL_TYPE_P (type)
> +      && !has_single_use (mul_result))
> +    return false;
> +
>    /* Make sure that the multiplication statement becomes dead after
>       the transformation, thus that all uses are transformed to FMAs.
>       This means we assume that an FMA operation has the same cost

I think this should be a separate patch, with its own testcase.
Sorry for not saying that until now.  The testcase from that thread
would be enough.

> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..27fb4c6a59f0182c4a836f96ab6b5e2e405a18a0 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3913,6 +3913,84 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>
>        return pattern_stmt;
>      }
> +  else if ((cst = uniform_integer_cst_p (oprnd1))
> +          && TYPE_UNSIGNED (itype)
> +          && rhs_code == TRUNC_DIV_EXPR
> +          && vectype
> +          && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))

The else isn't really necessary here.

> +    {
> +      /* div optimizations using narrowings
> +       we can do the division e.g. shorts by 255 faster by calculating it as
> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +       double the precision of x.
> +
> +       If we imagine a short as being composed of two blocks of bytes then
> +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +       adding 1 to each sub component:
> +
> +           short value of 16-bits
> +       ┌──────────────┬────────────────┐
> +       │              │                │
> +       └──────────────┴────────────────┘
> +        8-bit part1 ▲  8-bit part2   ▲
> +                    │                │
> +                    │                │
> +                   +1               +1
> +
> +       after the first addition, we have to shift right by 8, and narrow the
> +       results back to a byte. Remember that the addition must be done in
> +       double the precision of the input.  However if we know that the addition
> +       `x + 257` does not overflow then we can do the operation in the current
> +       precision.  In which case we don't need the pack and unpacks.  */

I think this needs rewording.  The last two sentences describe
the real constraint: x + 257 must not overflow.  So the parts
about "assuming the operation is done in double the precision of x"
and narrowing "the results back to a byte" don't apply.

AIUI, this is really an instance of the general transform:

  x // N == ((x+N+2) // (N+1) + x) // (N+1)  for 0 <= x < N(N+3)

(Hope I've got that right.  Proof sketch below if this isn't an
off-the-shelf result.)

So when 0 <= x < N(N+3) is guaranteed by the precision of the type,
the question becomes whether the operation overflows, like you say.
For smaller N, it's the N(N+3) bound that matters.

How about:

      /* We can use the relationship:

	   x // N == ((x+N+2) // (N+1) + x) // (N+1)  for 0 <= x < N(N+3)

         to optimize cases where N+1 is a power of 2, and where // (N+1)
         is therefore a shift right.  When operating in modes that are
         multiples of a byte in size, there are two cases:

         (1) N(N+3) is not representable, in which case the question
             becomes whether the replacement expression overflows.
             It is enough to test that x+N+2 does not overflow,
             i.e. that x < MAX-(N+1).

         (2) N(N+3) is representable, in which case it is the (only)
             bound that we need to check.

         ??? For now we just handle the case where // (N+1) is a shift
         right by half the precision, since some architectures can
         optimize the associated addition and shift combinations
         into single instructions.  */
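
FWIW, for the half-precision case handled here the N(N+3) bound never
actually bites: with N = 2^(prec/2) - 1,

  N(N+3) = (2^(prec/2) - 1)(2^(prec/2) + 2) = 2^prec + 2^(prec/2) - 2

which is always greater than the type maximum 2^prec - 1.  So we are
always in case (1), and the only remaining question is whether x+N+2
overflows -- which is exactly what the ranger check below tests.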

> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))

Seems like this should be equivalent to:

  if (pow == prec / 2)

which would come in handy for the comment below.

> +       {
> +         gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +
> +         gimple_ranger ranger;
> +         int_range_max r;
> +
> +         /* Check that no overflow will occur.  If we don't have range
> +            information we can't perform the optimization.  */
> +
> +         if (ranger.range_of_expr (r, oprnd0, stmt))
> +           {
> +             wide_int max = r.upper_bound ();
> +             wide_int one = wi::to_wide (build_one_cst (itype));

Looks like this could be:

  auto one = wi::shwi (1, prec);

We shouldn't build trees just to convert them to wide_ints.

> +             wide_int adder = wi::add (one, wi::lshift (one, pow));
> +             wi::overflow_type ovf;
> +             wi::add (max, adder, UNSIGNED, &ovf);
> +             if (ovf == wi::OVF_NONE)
> +               {
> +                 *type_out = vectype;
> +                 tree tadder = wide_int_to_tree (itype, adder);
> +                 tree rshift = wide_int_to_tree (itype, pow);
> +
> +                 tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
> +                 gassign *patt1
> +                   = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
> +                 append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                 tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
> +                 patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
> +                                              rshift);
> +                 append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                 tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
> +                 patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
> +                                              oprnd0);
> +                 append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +                 tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
> +                 pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
> +                                                     new_lhs3, rshift);
> +
> +                 return pattern_stmt;
> +               }
> +           }
> +       }
> +    }
>
>    if (prec > HOST_BITS_PER_WIDE_INT
>        || integer_zerop (oprnd1))

LGTM otherwise, but would prefer to see the updated patch before acking.

Thanks,
Richard

----

Proof sketch, probably unnecessarily verbose/indirect:

The transform is:

  x // N  ==  ((x+N+2) // (N+1) + x) // (N+1)           [A]

For this to be valid, we need:

  0 <= x - N(((x+N+2) // (N+1) + x) // (N+1)) < N       [B]

Dividing into two cases:

(1) when 0 <= x < N:

  Then N+1 < x+N+2 < 2(N+1),
  so (x+N+2) // (N+1) == 1

  Substituting into [B] gives:

    0 <= x - N((1+x) // (N+1)) < N                      [C]

  And 0 <= x < N implies 1 <= 1+x < N+1,
  so (1+x) // (N+1) == 0.

  Substituting into [C] gives:

    0 <= x < N

  which is given.

(2) when x = K(N+1)-1+L for integral K and L, 0 <= L <= N:

  Then:

    (x+N+2) // (N+1) == (K(N+1)-1+L+N+2) // (N+1)
                     == ((K+1)(N+1)+L) // (N+1)
                     == K+1

  due to the range of L.

  Substituting into [B] gives:

    0 <= K(N+1)-1+L - N((K+1+K(N+1)-1+L) // (N+1)) < N
      <= K(N+1)-1+L - N((K+L+K(N+1)) // (N+1)) < N
      <= K(N+1)-1+L - N((K+L) // (N+1) + K) < N
      <= K+L-1 - N((K+L) // (N+1)) < N
      <= K+L - N((K+L) // (N+1)) < N+1                  [D]

  Dividing into three subcases:

  (2a) when (K+L) // (N+1) == 0

     This division result implies 0 <= K+L < N+1.

     Also, substituting the division into [D] gives:

       0 <= K+L < N+1

     which is the same condition.

  (2b) when (K+L) // (N+1) == 1

     This division result implies N+1 <= K+L < 2N+2.

     Also, substituting the division into [D] gives:

       0 <= K+L < 2N+1

     So for this case, [A] gives the wrong result iff K+L == 2N+1.

     Since L is bounded by N, the minimum K for which this is true
     is K == N+1.

     Therefore, for this case, the minimum invalid value of x is
     (N+1)(N+1)-1+N == N(N+3).

  (2c) when (K+L) // (N+1) >= 2

     This division result implies 2N+2 <= K+L.

     Since L is bounded by N, K >= N+2, and so x >= (N+2)(N+1)-1,
     i.e. x >= N(N+3)+1.  So the minimum invalid x for this case
     is greater than the one for (2b).

So [A] is valid if (but not only if) 0 <= x < N(N+3).
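
For anyone who wants to double-check the bound mechanically, here is a
brute-force sketch (standalone, not part of any patch) that verifies
the relation below N(N+3) and that it first breaks exactly at
x == N(N+3):

  #include <assert.h>
  #include <stdio.h>

  int
  main (void)
  {
    for (unsigned long long n = 1; n <= 256; n++)
      {
        unsigned long long bound = n * (n + 3);
        /* Valid everywhere below the bound...  */
        for (unsigned long long x = 0; x < bound; x++)
          {
            unsigned long long q = ((x + n + 2) / (n + 1) + x) / (n + 1);
            assert (q == x / n);
          }
        /* ...and wrong at the bound itself.  */
        unsigned long long q = ((bound + n + 2) / (n + 1) + bound) / (n + 1);
        assert (q != bound / n);
      }
    printf ("relation holds for x < N(N+3), N = 1..256\n");
    return 0;
  }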

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-03-06 11:20   ` Tamar Christina
@ 2023-03-08  8:57     ` Aldy Hernandez
  2023-03-09 19:37       ` Tamar Christina
  0 siblings, 1 reply; 19+ messages in thread
From: Aldy Hernandez @ 2023-03-08  8:57 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, amacleod

As Andrew has been advising on this one, I'd prefer for him to review
it.  However, he's on vacation this week.  FYI...

Aldy

On Mon, Mar 6, 2023 at 12:22 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> Ping.
>
> And updated the patch to reject cases that we don't expect or can't handle cleanly for now.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         PR target/108583
>         * gimple-range-op.h (gimple_range_op_handler): Add maybe_non_standard.
>         * gimple-range-op.cc (gimple_range_op_handler::gimple_range_op_handler):
>         Use it.
>         (gimple_range_op_handler::maybe_non_standard): New.
>         * range-op.cc (class operator_widen_plus_signed,
>         operator_widen_plus_signed::wi_fold, class operator_widen_plus_unsigned,
>         operator_widen_plus_unsigned::wi_fold, class operator_widen_mult_signed,
>         operator_widen_mult_signed::wi_fold, class operator_widen_mult_unsigned,
>         operator_widen_mult_unsigned::wi_fold,
>         ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
>         ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
>         * range-op.h (ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
>         ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New
>
> Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>
>
> --- Inline copy of patch ---
>
> diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
> index 743b858126e333ea9590c0f175aacb476260c048..1bf63c5ce6f5db924a1f5907ab4539e376281bd0 100644
> --- a/gcc/gimple-range-op.h
> +++ b/gcc/gimple-range-op.h
> @@ -41,6 +41,7 @@ public:
>                  relation_trio = TRIO_VARYING);
>  private:
>    void maybe_builtin_call ();
> +  void maybe_non_standard ();
>    gimple *m_stmt;
>    tree m_op1, m_op2;
>  };
> diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
> index d9dfdc56939bb62ade72726b15c3d5e87e4ddcd1..a5d625387e712c170e1e68f6a7d494027f6ef0d0 100644
> --- a/gcc/gimple-range-op.cc
> +++ b/gcc/gimple-range-op.cc
> @@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
>    // statements.
>    if (is_a <gcall *> (m_stmt))
>      maybe_builtin_call ();
> +  else
> +    maybe_non_standard ();
>  }
>
>  // Calculate what we can determine of the range of this unary
> @@ -764,6 +766,57 @@ public:
>    }
>  } op_cfn_parity;
>
> +// Set up a gimple_range_op_handler for any nonstandard function which can be
> +// supported via range-ops.
> +
> +void
> +gimple_range_op_handler::maybe_non_standard ()
> +{
> +  range_operator *signed_op = ptr_op_widen_mult_signed;
> +  range_operator *unsigned_op = ptr_op_widen_mult_unsigned;
> +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> +    switch (gimple_assign_rhs_code (m_stmt))
> +      {
> +       case WIDEN_PLUS_EXPR:
> +       {
> +         signed_op = ptr_op_widen_plus_signed;
> +         unsigned_op = ptr_op_widen_plus_unsigned;
> +       }
> +       gcc_fallthrough ();
> +       case WIDEN_MULT_EXPR:
> +       {
> +         m_valid = false;
> +         m_op1 = gimple_assign_rhs1 (m_stmt);
> +         m_op2 = gimple_assign_rhs2 (m_stmt);
> +         tree ret = gimple_assign_lhs (m_stmt);
> +         bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> +         bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> +         bool signed_ret = TYPE_SIGN (TREE_TYPE (ret)) == SIGNED;
> +
> +         /* Normally these operands should all have the same sign, but
> +            some passes violate this by taking mismatched sign args.  At
> +            the moment the only one that's possible is mismatched inputs
> +            and an unsigned output.  Once ranger supports signs for the
> +            operands we can properly fix it; for now only accept the case
> +            we can do correctly.  */
> +         if ((signed1 ^ signed2) && signed_ret)
> +           return;
> +
> +         m_valid = true;
> +         if (signed2 && !signed1)
> +           std::swap (m_op1, m_op2);
> +
> +         if (signed1 || signed2)
> +           m_int = signed_op;
> +         else
> +           m_int = unsigned_op;
> +         break;
> +       }
> +       default:
> +         break;
> +      }
> +}
> +
>  // Set up a gimple_range_op_handler for any built in function which can be
>  // supported via range-ops.
>
> diff --git a/gcc/range-op.h b/gcc/range-op.h
> index f00b747f08a1fa8404c63bfe5a931b4048008b03..b1eeac70df81f2bdf228af7adff5399e7ac5e5d6 100644
> --- a/gcc/range-op.h
> +++ b/gcc/range-op.h
> @@ -311,4 +311,8 @@ private:
>  // This holds the range op table for floating point operations.
>  extern floating_op_table *floating_tree_table;
>
> +extern range_operator *ptr_op_widen_mult_signed;
> +extern range_operator *ptr_op_widen_mult_unsigned;
> +extern range_operator *ptr_op_widen_plus_signed;
> +extern range_operator *ptr_op_widen_plus_unsigned;
>  #endif // GCC_RANGE_OP_H
> diff --git a/gcc/range-op.cc b/gcc/range-op.cc
> index 5c67bce6d3aab81ad3186b902e09d6a96878d9bb..718ccb6f074e1a2a9ef1b7a5d4e879898d4a7fc3 100644
> --- a/gcc/range-op.cc
> +++ b/gcc/range-op.cc
> @@ -1556,6 +1556,73 @@ operator_plus::op2_range (irange &r, tree type,
>    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
>  }
>
> +class operator_widen_plus_signed : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +                       const wide_int &lh_lb,
> +                       const wide_int &lh_ub,
> +                       const wide_int &rh_lb,
> +                       const wide_int &rh_ub) const;
> +} op_widen_plus_signed;
> +range_operator *ptr_op_widen_plus_signed = &op_widen_plus_signed;
> +
> +void
> +operator_widen_plus_signed::wi_fold (irange &r, tree type,
> +                                    const wide_int &lh_lb,
> +                                    const wide_int &lh_ub,
> +                                    const wide_int &rh_lb,
> +                                    const wide_int &rh_ub) const
> +{
> +   wi::overflow_type ov_lb, ov_ub;
> +   signop s = TYPE_SIGN (type);
> +
> +   wide_int lh_wlb
> +     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> +   wide_int lh_wub
> +     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> +
> +   r = int_range<2> (type, new_lb, new_ub);
> +}
> +
> +class operator_widen_plus_unsigned : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +                       const wide_int &lh_lb,
> +                       const wide_int &lh_ub,
> +                       const wide_int &rh_lb,
> +                       const wide_int &rh_ub) const;
> +} op_widen_plus_unsigned;
> +range_operator *ptr_op_widen_plus_unsigned = &op_widen_plus_unsigned;
> +
> +void
> +operator_widen_plus_unsigned::wi_fold (irange &r, tree type,
> +                                      const wide_int &lh_lb,
> +                                      const wide_int &lh_ub,
> +                                      const wide_int &rh_lb,
> +                                      const wide_int &rh_ub) const
> +{
> +   wi::overflow_type ov_lb, ov_ub;
> +   signop s = TYPE_SIGN (type);
> +
> +   wide_int lh_wlb
> +     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> +   wide_int lh_wub
> +     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> +
> +   r = int_range<2> (type, new_lb, new_ub);
> +}
>
>  class operator_minus : public range_operator
>  {
> @@ -2031,6 +2098,70 @@ operator_mult::wi_fold (irange &r, tree type,
>      }
>  }
>
> +class operator_widen_mult_signed : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +                       const wide_int &lh_lb,
> +                       const wide_int &lh_ub,
> +                       const wide_int &rh_lb,
> +                       const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_signed;
> +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> +
> +void
> +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> +                                    const wide_int &lh_lb,
> +                                    const wide_int &lh_ub,
> +                                    const wide_int &rh_lb,
> +                                    const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow, but
> +     range calculations for multiplications are complicated.  After widening
> +     the operands, let's call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
> +
> +
> +class operator_widen_mult_unsigned : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +                       const wide_int &lh_lb,
> +                       const wide_int &lh_ub,
> +                       const wide_int &rh_lb,
> +                       const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_unsigned;
> +range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
> +
> +void
> +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> +                                      const wide_int &lh_lb,
> +                                      const wide_int &lh_ub,
> +                                      const wide_int &rh_lb,
> +                                      const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow, but
> +     range calculations for multiplications are complicated.  After widening
> +     the operands, let's call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
>
>  class operator_div : public cross_product_operator
>  {
>
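
As a side note on the "we don't expect a widening multiplication to be
able to overflow" comments in the patch: after widening k-bit operands
to 2k bits, the largest-magnitude products are (2^k - 1)^2 (unsigned)
and 2^(2k-2) (signed), and the largest-magnitude sums are 2^(k+1) - 2
(unsigned) and 2^k (signed), all of which fit in 2k bits, so the
widened additions and multiplications indeed cannot overflow there.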


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-03-06 11:21   ` Tamar Christina
@ 2023-03-08  9:17     ` Richard Sandiford
  2023-03-08  9:25       ` Tamar Christina
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Sandiford @ 2023-03-08  9:17 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> Ping,
>
> And updating the hook.
>
> There are no new tests, as new correctness tests were added to the mid-end and
> the existing codegen tests for this already exist.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         PR target/108583
>         * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
>         (*bitmask_shift_plus<mode>): New.
>         * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
>         (@aarch64_bitmask_udiv<mode>3): Remove.
>         * config/aarch64/aarch64.cc
>         (aarch64_vectorize_can_special_div_by_constant,
>         TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
>         (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
>         aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,60 +4867,27 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
>    }
>  )
>
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:VQN 0 "register_operand" "=&w")
> +       (plus:VQN
> +         (lshiftrt:VQN
> +           (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +                     (match_operand:VQN 2 "register_operand" "w"))
> +           (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))

I guess this is personal preference, sorry, but I think we should drop
the constraint.  The predicate does the real check, and the operand is
never reloaded, so "Dr" isn't any more helpful than an empty constraint,
and IMO can be confusing.

> +         (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& true"
> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
> -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
> +  rtx tmp;
> +  if (can_create_pseudo_p ())
> +    tmp = gen_reg_rtx (<VNARROWQ>mode);
> +  else
> +    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
> +  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
>    DONE;
>  })

In the previous review, I said:

  However, IIUC, this pattern would only be formed from combining
  three distinct patterns.  Is that right?  If so, we should be able
  to handle it as a plain define_split, with no define_insn.
  That should make things simpler, so would be worth trying before
  the changes I mentioned above.

Did you try that?  I still think it'd be preferable to defining a new insn.

> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d6873877386222d56144cc115e3953a61 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,41 +2317,24 @@ (define_insn "@aarch64_sve_<optab><mode>"
>  ;; ---- [INT] Misc optab implementations
>  ;; -------------------------------------------------------------------------
>  ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - bitmask_shift_plus

This is no longer an optab.

The original purpose of the "Includes:" comments was to list the
ISA instructions that are actually being generated, as a short-cut
to working through all the abstractions.  Just listing define_insn
names doesn't really add anything over reading the insns themselves.

Since the new pattern is an alternative way of generating ADDHNB,
it probably belongs in the "Narrowing binary arithmetic" section.

>  ;; -------------------------------------------------------------------------
>
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +       (unspec:SVE_FULL_HSDI
> +          [(match_operand:<VPRED> 1)
> +           (lshiftrt:SVE_FULL_HSDI
> +             (plus:SVE_FULL_HSDI
> +               (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +               (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +             (match_operand:SVE_FULL_HSDI 4
> +                "aarch64_simd_shift_imm_vec_exact_top" "Dr"))]

Same comment about the constraints here.

> +          UNSPEC_PRED_X))]
>    "TARGET_SVE2"
> -{
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
> -                             addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
> -                             lowpart_subreg (<MODE>mode, tmp1,
> -                                             <VNARROW>mode)));
> -  emit_move_insn (operands[0],
> -                 lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> -  DONE;
> -})
> +  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
> +)

The pattern LGTM otherwise.

>  ;; =========================================================================
>  ;; == Permutation
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..eb4f99ee524844ed5b3684c6fe807a4128685423 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3849,6 +3849,19 @@ aarch64_vectorize_related_mode (machine_mode vector_mode,
>    return default_vectorize_related_mode (vector_mode, element_mode, nunits);
>  }
>
> +/* Implement TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
> +
> +static bool
> +aarch64_vectorize_preferred_div_as_shifts_over_mult (const_tree type)
> +{
> +  machine_mode mode = TYPE_MODE (type);
> +  unsigned int vec_flags = aarch64_classify_vector_mode (mode);
> +  bool sve_p = (vec_flags & VEC_ANY_SVE);
> +  bool simd_p = (vec_flags & VEC_ADVSIMD);
> +
> +  return (sve_p && TARGET_SVE2) || (simd_p && TARGET_SIMD);
> +}
> +

And the hook LGTM too.

Thanks,
Richard

>  /* Implement TARGET_PREFERRED_ELSE_VALUE.  For binary operations,
>     prefer to use the first arithmetic operand as the else value if
>     the else value doesn't matter, since that exactly matches the SVE
> @@ -24363,46 +24376,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
>
>    return ret;
>  }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -                                              tree vectype, wide_int cst,
> -                                              rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
>  /* Generate a byte permute mask for a register of mode MODE,
>     which has NUNITS units.  */
>
> @@ -27904,13 +27877,13 @@ aarch64_libgcc_floating_mode_supported_p
>  #undef TARGET_MAX_ANCHOR_OFFSET
>  #define TARGET_MAX_ANCHOR_OFFSET 4095
>
> +#undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
> +#define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
> +  aarch64_vectorize_preferred_div_as_shifts_over_mult
> +
>  #undef TARGET_VECTOR_ALIGNMENT
>  #define TARGET_VECTOR_ALIGNMENT aarch64_simd_vector_alignment
>
> -#undef TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> -#define TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST \
> -  aarch64_vectorize_can_special_div_by_constant
> -
>  #undef TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT
>  #define TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT \
>    aarch64_vectorize_preferred_vector_alignment

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-03-08  9:17     ` Richard Sandiford
@ 2023-03-08  9:25       ` Tamar Christina
  2023-03-08 10:44         ` Richard Sandiford
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-08  9:25 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, March 8, 2023 9:18 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new
> optab instead of target hook [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Ping,
> >
> > And updating the hook.
> >
> > There are no new tests, as new correctness tests were added to the
> > mid-end and the existing codegen tests for this already exist.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >         PR target/108583
> >         * config/aarch64/aarch64-simd.md
> (@aarch64_bitmask_udiv<mode>3): Remove.
> >         (*bitmask_shift_plus<mode>): New.
> >         * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
> >         (@aarch64_bitmask_udiv<mode>3): Remove.
> >         * config/aarch64/aarch64.cc
> >         (aarch64_vectorize_can_special_div_by_constant,
> >         TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
> >         (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
> >         aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index
> >
> 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534
> 093ea6
> > 599dc7278108 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -4867,60 +4867,27 @@ (define_expand
> "aarch64_<sur><addsub>hn2<mode>"
> >    }
> >  )
> >
> > -;; div optimizations using narrowings -;; we can do the division e.g.
> > shorts by 255 faster by calculating it as -;; (x + ((x + 257) >> 8))
> > >> 8 assuming the operation is done in -;; double the precision of x.
> > -;;
> > -;; If we imagine a short as being composed of two blocks of bytes
> > then -;; adding 257 or 0b0000_0001_0000_0001 to the number is
> > equivalent to -;; adding 1 to each sub component:
> > -;;
> > -;;      short value of 16-bits
> > -;; ┌──────────────┬────────────────┐
> > -;; │              │                │
> > -;; └──────────────┴────────────────┘
> > -;;   8-bit part1 ▲  8-bit part2   ▲
> > -;;               │                │
> > -;;               │                │
> > -;;              +1               +1
> > -;;
> > -;; after the first addition, we have to shift right by 8, and narrow
> > the -;; results back to a byte.  Remember that the addition must be
> > done in -;; double the precision of the input.  Since 8 is half the
> > size of a short -;; we can use a narrowing halfing instruction in
> > AArch64, addhn which also -;; does the addition in a wider precision
> > and narrows back to a byte.  The -;; shift itself is implicit in the
> > operation as it writes back only the top -;; half of the result. i.e. bits 2*esize-
> 1:esize.
> > -;;
> > -;; Since we have narrowed the result of the first part back to a
> > byte, for -;; the second addition we can use a widening addition, uaddw.
> > -;;
> > -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> > -;;
> > -;; The shift is later optimized by combine to a uzp2 with movi #0.
> > -(define_expand "@aarch64_bitmask_udiv<mode>3"
> > -  [(match_operand:VQN 0 "register_operand")
> > -   (match_operand:VQN 1 "register_operand")
> > -   (match_operand:VQN 2 "immediate_operand")]
> > +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the
> > +vector (define_insn_and_split "*bitmask_shift_plus<mode>"
> > +  [(set (match_operand:VQN 0 "register_operand" "=&w")
> > +       (plus:VQN
> > +         (lshiftrt:VQN
> > +           (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> > +                     (match_operand:VQN 2 "register_operand" "w"))
> > +           (match_operand:VQN 3
> > +"aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> 
> I guess this is personal preference, sorry, but I think we should drop the
> constraint.  The predicate does the real check, and the operand is never
> reloaded, so "Dr" isn't any more helpful than an empty constraint, and IMO
> can be confusing.
> 
> > +         (match_operand:VQN 4 "register_operand" "w")))]
> >    "TARGET_SIMD"
> > +  "#"
> > +  "&& true"
> > +  [(const_int 0)]
> >  {
> > -  unsigned HOST_WIDE_INT size
> > -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> > -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> > -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> > -    FAIL;
> > -
> > -  rtx addend = gen_reg_rtx (<MODE>mode);
> > -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> > -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val,
> > <VNARROWQ2>mode));
> > -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> > -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> > -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> > -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> > -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> > bitsize);
> > -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> > -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2,
> > shift_vector));
> > +  rtx tmp;
> > +  if (can_create_pseudo_p ())
> > +    tmp = gen_reg_rtx (<VNARROWQ>mode);  else
> > +    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
> > + emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1],
> operands[2]));
> > + emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4],
> > + tmp));
> >    DONE;
> >  })
> 
> In the previous review, I said:
> 
>   However, IIUC, this pattern would only be formed from combining
>   three distinct patterns.  Is that right?  If so, we should be able
>   to handle it as a plain define_split, with no define_insn.
>   That should make things simpler, so would be worth trying before
>   the changes I mentioned above.
> 
> Did you try that?  I still think it'd be preferable to defining a new insn.

Yes I did! Sorry I forgot to mention that.  When I made it a split, for some
reason it wasn't matching anymore.

Regards,
Tamar
> 
> > diff --git a/gcc/config/aarch64/aarch64-sve2.md
> > b/gcc/config/aarch64/aarch64-sve2.md
> > index
> >
> 40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d687387738622
> 2d56144c
> > c115e3953a61 100644
> > --- a/gcc/config/aarch64/aarch64-sve2.md
> > +++ b/gcc/config/aarch64/aarch64-sve2.md
> > @@ -2317,41 +2317,24 @@ (define_insn
> "@aarch64_sve_<optab><mode>"
> >  ;; ---- [INT] Misc optab implementations  ;;
> > ----------------------------------------------------------------------
> > ---
> >  ;; Includes:
> > -;; - aarch64_bitmask_udiv
> > +;; - bitmask_shift_plus
> 
> This is no longer an optab.
> 
> The original purpose of the "Includes:" comments was to list the ISA
> instructions that are actually being generated, as a short-cut to working
> through all the abstractions.  Just listing define_insn names doesn't really add
> anything over reading the insns themselves.
> 
> Since the new pattern is an alternative way of generating ADDHNB, it probably
> belongs in the "Narrowing binary arithmetic" section.
> 
> >  ;;
> > ----------------------------------------------------------------------
> > ---
> >
> > -;; div optimizations using narrowings -;; we can do the division e.g.
> > shorts by 255 faster by calculating it as -;; (x + ((x + 257) >> 8))
> > >> 8 assuming the operation is done in -;; double the precision of x.
> > -;;
> > -;; See aarch64-simd.md for bigger explanation.
> > -(define_expand "@aarch64_bitmask_udiv<mode>3"
> > -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> > -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> > -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> > +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> > +(define_insn "*bitmask_shift_plus<mode>"
> > +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> > +       (unspec:SVE_FULL_HSDI
> > +          [(match_operand:<VPRED> 1)
> > +           (lshiftrt:SVE_FULL_HSDI
> > +             (plus:SVE_FULL_HSDI
> > +               (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> > +               (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> > +             (match_operand:SVE_FULL_HSDI 4
> > +                "aarch64_simd_shift_imm_vec_exact_top" "Dr"))]
> 
> Same comment about the constraints here.
> 
> > +          UNSPEC_PRED_X))]
> >    "TARGET_SVE2"
> > -{
> > -  unsigned HOST_WIDE_INT size
> > -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> > -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> > -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> > -    FAIL;
> > -
> > -  rtx addend = gen_reg_rtx (<MODE>mode);
> > -  rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> > -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> > -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> > -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val,
> > <VNARROW>mode));
> > -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1,
> operands[1],
> > -                             addend));
> > -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2,
> operands[1],
> > -                             lowpart_subreg (<MODE>mode, tmp1,
> > -                                             <VNARROW>mode)));
> > -  emit_move_insn (operands[0],
> > -                 lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> > -  DONE;
> > -})
> > +  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
> > +)
> 
> The pattern LGTM otherwise.
> 
> >  ;;
> >
> ===================================================================
> ===
> > ===
> >  ;; == Permutation
> > diff --git a/gcc/config/aarch64/aarch64.cc
> > b/gcc/config/aarch64/aarch64.cc index
> >
> e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..eb4f99ee524844ed5b36
> 84c6fe80
> > 7a4128685423 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -3849,6 +3849,19 @@ aarch64_vectorize_related_mode
> (machine_mode vector_mode,
> >    return default_vectorize_related_mode (vector_mode, element_mode,
> > nunits);  }
> >
> > +/* Implement
> TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
> > +
> > +static bool
> > +aarch64_vectorize_preferred_div_as_shifts_over_mult (const_tree type)
> > +{
> > +  machine_mode mode = TYPE_MODE (type);
> > +  unsigned int vec_flags = aarch64_classify_vector_mode (mode);
> > +  bool sve_p = (vec_flags & VEC_ANY_SVE);
> > +  bool simd_p = (vec_flags & VEC_ADVSIMD);
> > +
> > +  return (sve_p && TARGET_SVE2) || (simd_p && TARGET_SIMD); }
> > +
> 
> And the hook LGTM too.
> 
> Thanks,
> Richard
> 
> >  /* Implement TARGET_PREFERRED_ELSE_VALUE.  For binary operations,
> >     prefer to use the first arithmetic operand as the else value if
> >     the else value doesn't matter, since that exactly matches the SVE
> > @@ -24363,46 +24376,6 @@ aarch64_vectorize_vec_perm_const
> > (machine_mode vmode, machine_mode op_mode,
> >
> >    return ret;
> >  }
> > -
> > -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> > -
> > -bool
> > -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> > -                                              tree vectype, wide_int cst,
> > -                                              rtx *output, rtx in0, rtx in1)
> > -{
> > -  if (code != TRUNC_DIV_EXPR
> > -      || !TYPE_UNSIGNED (vectype))
> > -    return false;
> > -
> > -  machine_mode mode = TYPE_MODE (vectype);
> > -  unsigned int flags = aarch64_classify_vector_mode (mode);
> > -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> > -    return false;
> > -
> > -  int pow = wi::exact_log2 (cst + 1);
> > -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE
> > (vectype));
> > -  /* SVE actually has a div operator, we may have gotten here through
> > -     that route.  */
> > -  if (pow != (int) (element_precision (vectype) / 2)
> > -      || insn_code == CODE_FOR_nothing)
> > -    return false;
> > -
> > -  /* We can use the optimized pattern.  */
> > -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> > -    return true;
> > -
> > -  gcc_assert (output);
> > -
> > -  expand_operand ops[3];
> > -  create_output_operand (&ops[0], *output, mode);
> > -  create_input_operand (&ops[1], in0, mode);
> > -  create_fixed_operand (&ops[2], in1);
> > -  expand_insn (insn_code, 3, ops);
> > -  *output = ops[0].value;
> > -  return true;
> > -}
> > -
> >  /* Generate a byte permute mask for a register of mode MODE,
> >     which has NUNITS units.  */
> >
> > @@ -27904,13 +27877,13 @@
> aarch64_libgcc_floating_mode_supported_p
> >  #undef TARGET_MAX_ANCHOR_OFFSET
> >  #define TARGET_MAX_ANCHOR_OFFSET 4095
> >
> > +#undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
> > +#define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
> > +  aarch64_vectorize_preferred_div_as_shifts_over_mult
> > +
> >  #undef TARGET_VECTOR_ALIGNMENT
> >  #define TARGET_VECTOR_ALIGNMENT aarch64_simd_vector_alignment
> >
> > -#undef TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> > -#define TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST \
> > -  aarch64_vectorize_can_special_div_by_constant
> > -
> >  #undef TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT
> >  #define TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT \
> >    aarch64_vectorize_preferred_vector_alignment

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
  2023-03-08  9:25       ` Tamar Christina
@ 2023-03-08 10:44         ` Richard Sandiford
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Sandiford @ 2023-03-08 10:44 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> > +         (match_operand:VQN 4 "register_operand" "w")))]
>> >    "TARGET_SIMD"
>> > +  "#"
>> > +  "&& true"
>> > +  [(const_int 0)]
>> >  {
>> > -  unsigned HOST_WIDE_INT size
>> > -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
>> > -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
>> > -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
>> > -    FAIL;
>> > -
>> > -  rtx addend = gen_reg_rtx (<MODE>mode);
>> > -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
>> > -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val,
>> > <VNARROWQ2>mode));
>> > -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
>> > -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
>> > -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
>> > -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
>> > -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
>> > bitsize);
>> > -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
>> > -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2,
>> > shift_vector));
>> > +  rtx tmp;
>> > +  if (can_create_pseudo_p ())
>> > +    tmp = gen_reg_rtx (<VNARROWQ>mode);  else
>> > +    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
>> > + emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1],
>> operands[2]));
>> > + emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4],
>> > + tmp));
>> >    DONE;
>> >  })
>> 
>> In the previous review, I said:
>> 
>>   However, IIUC, this pattern would only be formed from combining
>>   three distinct patterns.  Is that right?  If so, we should be able
>>   to handle it as a plain define_split, with no define_insn.
>>   That should make things simpler, so would be worth trying before
>>   the changes I mentioned above.
>> 
>> Did you try that?  I still think it'd be preferable to defining a new insn.
>
> Yes I did! Sorry I forgot to mention that.  When I made it a split, for some
> reason it wasn't matching anymore.

I was hoping for a bit more detail than that :-)  But it seems that
the reason is that we match SRA first, so the final combination
is a 2-to-1 rather than 3-to-1.

So yeah, the patch is OK with the other changes mentioned in the review.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-03-08  8:57     ` Aldy Hernandez
@ 2023-03-09 19:37       ` Tamar Christina
  2023-03-10 13:32         ` Andrew MacLeod
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-09 19:37 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: gcc-patches, nd, amacleod

Cheers,

Thanks! I'll wait for him to come back then 😊

Thanks,
Tamar

> -----Original Message-----
> From: Aldy Hernandez <aldyh@redhat.com>
> Sent: Wednesday, March 8, 2023 8:57 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; amacleod@redhat.com
> Subject: Re: [PATCH 2/4][ranger]: Add range-ops for widen addition and
> widen multiplication [PR108583]
> 
> As Andrew has been advising on this one, I'd prefer for him to review it.
> However, he's on vacation this week.  FYI...
> 
> Aldy
> 
> On Mon, Mar 6, 2023 at 12:22 PM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> >
> > Ping.
> >
> > And updated the patch to reject cases that we don't expect or can't handle
> cleanly for now.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >         PR target/108583
> >         * gimple-range-op.h (gimple_range_op_handler): Add
> maybe_non_standard.
> >         * gimple-range-op.cc
> (gimple_range_op_handler::gimple_range_op_handler):
> >         Use it.
> >         (gimple_range_op_handler::maybe_non_standard): New.
> >         * range-op.cc (class operator_widen_plus_signed,
> >         operator_widen_plus_signed::wi_fold, class
> operator_widen_plus_unsigned,
> >         operator_widen_plus_unsigned::wi_fold, class
> operator_widen_mult_signed,
> >         operator_widen_mult_signed::wi_fold, class
> operator_widen_mult_unsigned,
> >         operator_widen_mult_unsigned::wi_fold,
> >         ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
> >         ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
> >         * range-op.h (ptr_op_widen_mult_signed,
> ptr_op_widen_mult_unsigned,
> >         ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New
> >
> > Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>
> >
> > --- Inline copy of patch ---
> >
> > diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h index
> >
> 743b858126e333ea9590c0f175aacb476260c048..1bf63c5ce6f5db924a1f5
> 907ab45
> > 39e376281bd0 100644
> > --- a/gcc/gimple-range-op.h
> > +++ b/gcc/gimple-range-op.h
> > @@ -41,6 +41,7 @@ public:
> >                  relation_trio = TRIO_VARYING);
> >  private:
> >    void maybe_builtin_call ();
> > +  void maybe_non_standard ();
> >    gimple *m_stmt;
> >    tree m_op1, m_op2;
> >  };
> > diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc index
> >
> d9dfdc56939bb62ade72726b15c3d5e87e4ddcd1..a5d625387e712c170e1e
> 68f6a7d4
> > 94027f6ef0d0 100644
> > --- a/gcc/gimple-range-op.cc
> > +++ b/gcc/gimple-range-op.cc
> > @@ -179,6 +179,8 @@
> gimple_range_op_handler::gimple_range_op_handler (gimple *s)
> >    // statements.
> >    if (is_a <gcall *> (m_stmt))
> >      maybe_builtin_call ();
> > +  else
> > +    maybe_non_standard ();
> >  }
> >
> >  // Calculate what we can determine of the range of this unary @@
> > -764,6 +766,57 @@ public:
> >    }
> >  } op_cfn_parity;
> >
> > +// Set up a gimple_range_op_handler for any nonstandard function
> > +which can be // supported via range-ops.
> > +
> > +void
> > +gimple_range_op_handler::maybe_non_standard () {
> > +  range_operator *signed_op = ptr_op_widen_mult_signed;
> > +  range_operator *unsigned_op = ptr_op_widen_mult_unsigned;
> > +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> > +    switch (gimple_assign_rhs_code (m_stmt))
> > +      {
> > +       case WIDEN_PLUS_EXPR:
> > +       {
> > +         signed_op = ptr_op_widen_plus_signed;
> > +         unsigned_op = ptr_op_widen_plus_unsigned;
> > +       }
> > +       gcc_fallthrough ();
> > +       case WIDEN_MULT_EXPR:
> > +       {
> > +         m_valid = false;
> > +         m_op1 = gimple_assign_rhs1 (m_stmt);
> > +         m_op2 = gimple_assign_rhs2 (m_stmt);
> > +         tree ret = gimple_assign_lhs (m_stmt);
> > +         bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> > +         bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> > +         bool signed_ret = TYPE_SIGN (TREE_TYPE (ret)) == SIGNED;
> > +
> > +         /* Normally these operands should all have the same sign, but
> > +            some passes violate this by taking mismatched sign args.  At
> > +            the moment the only one that's possible is mismatched inputs
> > +            and an unsigned output.  Once ranger supports signs for the
> > +            operands we can properly fix it; for now only accept the case
> > +            we can do correctly.  */
> > +         if ((signed1 ^ signed2) && signed_ret)
> > +           return;
> > +
> > +         m_valid = true;
> > +         if (signed2 && !signed1)
> > +           std::swap (m_op1, m_op2);
> > +
> > +         if (signed1 || signed2)
> > +           m_int = signed_op;
> > +         else
> > +           m_int = unsigned_op;
> > +         break;
> > +       }
> > +       default:
> > +         break;
> > +      }
> > +}
> > +
> >  // Set up a gimple_range_op_handler for any built in function which
> > can be  // supported via range-ops.
> >
> > diff --git a/gcc/range-op.h b/gcc/range-op.h index
> >
> f00b747f08a1fa8404c63bfe5a931b4048008b03..b1eeac70df81f2bdf228af
> 7adff5
> > 399e7ac5e5d6 100644
> > --- a/gcc/range-op.h
> > +++ b/gcc/range-op.h
> > @@ -311,4 +311,8 @@ private:
> >  // This holds the range op table for floating point operations.
> >  extern floating_op_table *floating_tree_table;
> >
> > +extern range_operator *ptr_op_widen_mult_signed; extern
> > +range_operator *ptr_op_widen_mult_unsigned; extern range_operator
> > +*ptr_op_widen_plus_signed; extern range_operator
> > +*ptr_op_widen_plus_unsigned;
> >  #endif // GCC_RANGE_OP_H
> > diff --git a/gcc/range-op.cc b/gcc/range-op.cc index
> >
> 5c67bce6d3aab81ad3186b902e09d6a96878d9bb..718ccb6f074e1a2a9ef1
> b7a5d4e8
> > 79898d4a7fc3 100644
> > --- a/gcc/range-op.cc
> > +++ b/gcc/range-op.cc
> > @@ -1556,6 +1556,73 @@ operator_plus::op2_range (irange &r, tree type,
> >    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());  }
> >
> > +class operator_widen_plus_signed : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +                       const wide_int &lh_lb,
> > +                       const wide_int &lh_ub,
> > +                       const wide_int &rh_lb,
> > +                       const wide_int &rh_ub) const; }
> > +op_widen_plus_signed; range_operator *ptr_op_widen_plus_signed =
> > +&op_widen_plus_signed;
> > +
> > +void
> > +operator_widen_plus_signed::wi_fold (irange &r, tree type,
> > +                                    const wide_int &lh_lb,
> > +                                    const wide_int &lh_ub,
> > +                                    const wide_int &rh_lb,
> > +                                    const wide_int &rh_ub) const {
> > +   wi::overflow_type ov_lb, ov_ub;
> > +   signop s = TYPE_SIGN (type);
> > +
> > +   wide_int lh_wlb
> > +     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> > +   wide_int lh_wub
> > +     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> > +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub)
> > + * 2, s);
> > +
> > +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> > +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> > +
> > +   r = int_range<2> (type, new_lb, new_ub); }
> > +
> > +class operator_widen_plus_unsigned : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +                       const wide_int &lh_lb,
> > +                       const wide_int &lh_ub,
> > +                       const wide_int &rh_lb,
> > +                       const wide_int &rh_ub) const;
> > +} op_widen_plus_unsigned;
> > +range_operator *ptr_op_widen_plus_unsigned = &op_widen_plus_unsigned;
> > +
> > +void
> > +operator_widen_plus_unsigned::wi_fold (irange &r, tree type,
> > +                                      const wide_int &lh_lb,
> > +                                      const wide_int &lh_ub,
> > +                                      const wide_int &rh_lb,
> > +                                      const wide_int &rh_ub) const
> > +{
> > +   wi::overflow_type ov_lb, ov_ub;
> > +   signop s = TYPE_SIGN (type);
> > +
> > +   wide_int lh_wlb
> > +     = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> > +   wide_int lh_wub
> > +     = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> > +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> > +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> > +
> > +   r = int_range<2> (type, new_lb, new_ub);
> > +}
> >
> >  class operator_minus : public range_operator
> >  {
> > @@ -2031,6 +2098,70 @@ operator_mult::wi_fold (irange &r, tree type,
> >      }
> >  }
> >
> > +class operator_widen_mult_signed : public range_operator
> > +{
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +                       const wide_int &lh_lb,
> > +                       const wide_int &lh_ub,
> > +                       const wide_int &rh_lb,
> > +                       const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_signed;
> > +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> > +
> > +void
> > +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> > +                                    const wide_int &lh_lb,
> > +                                    const wide_int &lh_ub,
> > +                                    const wide_int &rh_lb,
> > +                                    const wide_int &rh_ub) const
> > +{
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb
> > +    = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> > +  wide_int lh_wub
> > +    = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> > +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> > +     range calculations for multiplications are complicated.  After widening
> > +     the operands let's call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> > +}
> > +
> > +
> > +class operator_widen_mult_unsigned : public range_operator
> > +{
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +                       const wide_int &lh_lb,
> > +                       const wide_int &lh_ub,
> > +                       const wide_int &rh_lb,
> > +                       const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_unsigned;
> > +range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
> > +
> > +void
> > +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> > +                                      const wide_int &lh_lb,
> > +                                      const wide_int &lh_ub,
> > +                                      const wide_int &rh_lb,
> > +                                      const wide_int &rh_ub) const
> > +{
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb
> > +    = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> > +  wide_int lh_wub
> > +    = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> > +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> > +     range calculations for multiplications are complicated.  After widening
> > +     the operands let's call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> > +}
> >
> >  class operator_div : public cross_product_operator
> >  {
> >
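
As a side note, the widening folds above all boil down to widening both
operands' bounds to twice the precision and then folding as a normal add or
mult, which by construction cannot wrap.  A minimal standalone sketch of that
idea, using plain C++ fixed-width types instead of wide_int (the function
name and the uint16_t -> uint32_t shape are only for illustration):

  #include <cstdint>
  #include <utility>

  /* Range of a widening add of two uint16_t ranges into a uint32_t result.
     Because both bounds are widened before the addition, the sums can never
     wrap, so no overflow handling is needed.  */
  static std::pair<uint32_t, uint32_t>
  widen_plus_range (uint16_t lh_lb, uint16_t lh_ub,
                    uint16_t rh_lb, uint16_t rh_ub)
  {
    uint32_t new_lb = (uint32_t) lh_lb + (uint32_t) rh_lb;
    uint32_t new_ub = (uint32_t) lh_ub + (uint32_t) rh_ub;
    return std::make_pair (new_lb, new_ub);
  }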


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
  2023-03-08  8:55     ` Richard Sandiford
@ 2023-03-09 19:39       ` Tamar Christina
  2023-03-10  8:39         ` Richard Sandiford
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-09 19:39 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, nd, rguenther

[-- Attachment #1: Type: text/plain, Size: 10180 bytes --]

Hi,

Here's the respun patch.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* target.def (preferred_div_as_shifts_over_mult): New.
	* doc/tm.texi.in: Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
	* targhooks.h (default_preferred_div_as_shifts_over_mult): New.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.

gcc/testsuite/ChangeLog:

	PR target/108583
	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
	* gcc.dg/vect/vect-div-bitmask-5.c: New test.

--- inline copy of patch ---

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 50a8872a6695b18b9bed0d393bacf733833633db..bf7269e323de1a065d4d04376e5a2703cbb0f9fa 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,6 +6137,12 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
+Sometimes it is possible to implement a vector division using a sequence
+of two addition-shift pairs, giving four instructions in total.
+Return true if taking this approach for @var{vectype} is likely
+to be better than using a sequence involving highpart multiplication.
+Default is false if @code{can_mult_highpart_p}, otherwise true.
 @end deftypefn
 
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
+@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
diff --git a/gcc/target.def b/gcc/target.def
index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..e4474a3ed6bd2f5f5c010bf0d40c2a371370490c 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1868,6 +1868,18 @@ correct for most targets.",
  poly_uint64, (const_tree type),
  default_preferred_vector_alignment)
 
+/* Returns whether the target has a preference for decomposing divisions using
+   shifts rather than multiplies.  */
+DEFHOOK
+(preferred_div_as_shifts_over_mult,
+ "Sometimes it is possible to implement a vector division using a sequence\n\
+of two addition-shift pairs, giving four instructions in total.\n\
+Return true if taking this approach for @var{vectype} is likely\n\
+to be better than using a sequence involving highpart multiplication.\n\
+Default is false if @code{can_mult_highpart_p}, otherwise true.",
+ bool, (const_tree type),
+ default_preferred_div_as_shifts_over_mult)
+
 /* Return true if vector alignment is reachable (by peeling N
    iterations) for the given scalar type.  */
 DEFHOOK
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
 extern unsigned HOST_WIDE_INT default_shift_truncation_mask
   (machine_mode);
 extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
+extern bool default_preferred_div_as_shifts_over_mult
+  (const_tree);
 extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
 
 extern tree default_stack_protect_guard (void);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 211525720a620d6f533e2da91e03877337a931e7..7f39ff9b7ec2bf66625d48a47bb76e96c05a3233 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
   return TYPE_ALIGN (type);
 }
 
+/* The default implementation of
+   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+bool
+default_preferred_div_as_shifts_over_mult (const_tree type)
+{
+  return can_mult_highpart_p (TYPE_MODE (type), TYPE_UNSIGNED (type));
+}
+
 /* By default assume vectors of element TYPE require a multiple of the natural
    alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..183f1a623fbde34f505259cf8f4fb4d34e069614 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3914,6 +3914,83 @@ vect_recog_divmod_pattern (vec_info *vinfo,
       return pattern_stmt;
     }
 
+  if ((cst = uniform_integer_cst_p (oprnd1))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))
+    {
+      /* We can use the relationship:
+
+	   x // N == ((x+N+2) // (N+1) + x) // (N+1)  for 0 <= x < N(N+3)
+
+	 to optimize cases where N+1 is a power of 2, and where // (N+1)
+	 is therefore a shift right.  When operating in modes that are
+	 multiples of a byte in size, there are two cases:
+
+	 (1) N(N+3) is not representable, in which case the question
+	     becomes whether the replacement expression overflows.
+	     It is enough to test that x+N+2 does not overflow,
+	     i.e. that x < MAX-(N+1).
+
+	 (2) N(N+3) is representable, in which case it is the (only)
+	     bound that we need to check.
+
+	 ??? For now we just handle the case where // (N+1) is a shift
+	 right by half the precision, since some architectures can
+	 optimize the associated addition and shift combinations
+	 into single instructions.  */
+
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == prec / 2)
+	{
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+
+	  gimple_ranger ranger;
+	  int_range_max r;
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+
+	  if (ranger.range_of_expr (r, oprnd0, stmt))
+	    {
+	      wide_int max = r.upper_bound ();
+	      wide_int one = wi::shwi (1, prec);
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      wi::add (max, adder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  tree rshift = wide_int_to_tree (itype, pow);
+
+		  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
+		  gassign *patt1
+		    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
+					       rshift);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
+					       oprnd0);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
+		  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
+						      new_lhs3, rshift);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
+    }
+
   if (prec > HOST_BITS_PER_WIDE_INT
       || integer_zerop (oprnd1))
     return NULL;
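
As a quick sanity check on the identity used above (not part of the patch),
the prec = 16, N = 0xff case, which is the only shape the pattern handles for
now, can be brute-forced; the program below is illustrative only:

  #include <cstdint>
  #include <cstdlib>

  /* Verify x / N == ((x + N + 2) / (N + 1) + x) / (N + 1) for N = 0xff,
     for every 16-bit x where x + N + 2 does not overflow 16 bits, which is
     the same condition the ranger-based overflow check enforces.  */
  int
  main ()
  {
    const uint32_t N = 0xff;
    for (uint32_t x = 0; x + N + 2 <= 0xffff; x++)
      if (x / N != ((x + N + 2) / (N + 1) + x) / (N + 1))
        std::abort ();
    return 0;
  }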

[-- Attachment #2: rb16930.patch --]
[-- Type: application/octet-stream, Size: 9243 bytes --]

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 50a8872a6695b18b9bed0d393bacf733833633db..bf7269e323de1a065d4d04376e5a2703cbb0f9fa 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,6 +6137,12 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
+Sometimes it is possible to implement a vector division using a sequence
+of two addition-shift pairs, giving four instructions in total.
+Return true if taking this approach for @var{vectype} is likely
+to be better than using a sequence involving highpart multiplication.
+Default is false if @code{can_mult_highpart_p}, otherwise true.
 @end deftypefn
 
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
+@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
 
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
diff --git a/gcc/target.def b/gcc/target.def
index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..e4474a3ed6bd2f5f5c010bf0d40c2a371370490c 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1868,6 +1868,18 @@ correct for most targets.",
  poly_uint64, (const_tree type),
  default_preferred_vector_alignment)
 
+/* Returns whether the target has a preference for decomposing divisions using
+   shifts rather than multiplies.  */
+DEFHOOK
+(preferred_div_as_shifts_over_mult,
+ "Sometimes it is possible to implement a vector division using a sequence\n\
+of two addition-shift pairs, giving four instructions in total.\n\
+Return true if taking this approach for @var{vectype} is likely\n\
+to be better than using a sequence involving highpart multiplication.\n\
+Default is false if @code{can_mult_highpart_p}, otherwise true.",
+ bool, (const_tree type),
+ default_preferred_div_as_shifts_over_mult)
+
 /* Return true if vector alignment is reachable (by peeling N
    iterations) for the given scalar type.  */
 DEFHOOK
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
 extern unsigned HOST_WIDE_INT default_shift_truncation_mask
   (machine_mode);
 extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
+extern bool default_preferred_div_as_shifts_over_mult
+  (const_tree);
 extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
 
 extern tree default_stack_protect_guard (void);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 211525720a620d6f533e2da91e03877337a931e7..7f39ff9b7ec2bf66625d48a47bb76e96c05a3233 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
   return TYPE_ALIGN (type);
 }
 
+/* The default implementation of
+   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+bool
+default_preferred_div_as_shifts_over_mult (const_tree type)
+{
+  return can_mult_highpart_p (TYPE_MODE (type), TYPE_UNSIGNED (type));
+}
+
 /* By default assume vectors of element TYPE require a multiple of the natural
    alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..183f1a623fbde34f505259cf8f4fb4d34e069614 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3914,6 +3914,83 @@ vect_recog_divmod_pattern (vec_info *vinfo,
       return pattern_stmt;
     }
 
+  if ((cst = uniform_integer_cst_p (oprnd1))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))
+    {
+      /* We can use the relationship:
+
+	   x // N == ((x+N+2) // (N+1) + x) // (N+1)  for 0 <= x < N(N+3)
+
+	 to optimize cases where N+1 is a power of 2, and where // (N+1)
+	 is therefore a shift right.  When operating in modes that are
+	 multiples of a byte in size, there are two cases:
+
+	 (1) N(N+3) is not representable, in which case the question
+	     becomes whether the replacement expression overflows.
+	     It is enough to test that x+N+2 does not overflow,
+	     i.e. that x < MAX-(N+1).
+
+	 (2) N(N+3) is representable, in which case it is the (only)
+	     bound that we need to check.
+
+	 ??? For now we just handle the case where // (N+1) is a shift
+	 right by half the precision, since some architectures can
+	 optimize the associated addition and shift combinations
+	 into single instructions.  */
+
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == prec / 2)
+	{
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+
+	  gimple_ranger ranger;
+	  int_range_max r;
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+
+	  if (ranger.range_of_expr (r, oprnd0, stmt))
+	    {
+	      wide_int max = r.upper_bound ();
+	      wide_int one = wi::shwi (1, prec);
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      wi::add (max, adder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  tree rshift = wide_int_to_tree (itype, pow);
+
+		  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
+		  gassign *patt1
+		    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
+					       rshift);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
+		  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
+					       oprnd0);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
+		  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
+						      new_lhs3, rshift);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
+    }
+
   if (prec > HOST_BITS_PER_WIDE_INT
       || integer_zerop (oprnd1))
     return NULL;
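
For reference, the scalar equivalent of the four-statement sequence the
pattern builds, specialized to prec = 16 and N = 0xff (so pow = 8 and
adder = 0x101); the helper name is made up, and the function is only valid
when x + 0x101 does not wrap, which is exactly what the range check above
guarantees:

  #include <cstdint>

  static uint16_t
  div_by_0xff (uint16_t x)
  {
    uint16_t t1 = x + 0x101;   /* PLUS_EXPR with tadder             */
    uint16_t t2 = t1 >> 8;     /* RSHIFT_EXPR by pow                */
    uint16_t t3 = t2 + x;      /* PLUS_EXPR with oprnd0             */
    return t3 >> 8;            /* final RSHIFT_EXPR (pattern_stmt)  */
  }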

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
  2023-03-09 19:39       ` Tamar Christina
@ 2023-03-10  8:39         ` Richard Sandiford
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Sandiford @ 2023-03-10  8:39 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, rguenther

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi,
>
> Here's the respun patch.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/108583
> 	* target.def (preferred_div_as_shifts_over_mult): New.
> 	* doc/tm.texi.in: Document it.
> 	* doc/tm.texi: Regenerate.
> 	* targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
> 	* targhooks.h (default_preferred_div_as_shifts_over_mult): New.
> 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/108583
> 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 50a8872a6695b18b9bed0d393bacf733833633db..bf7269e323de1a065d4d04376e5a2703cbb0f9fa 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6137,6 +6137,12 @@ instruction pattern.  There is no need for the hook to handle these two
>  implementation approaches itself.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT (const_tree @var{type})
> +Sometimes it is possible to implement a vector division using a sequence
> +of two addition-shift pairs, giving four instructions in total.
> +Return true if taking this approach for @var{vectype} is likely
> +to be better than using a sequence involving highpart multiplication.
> +Default is false if @code{can_mult_highpart_p}, otherwise true.
>  @end deftypefn
>  
>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 3e07978a02f4e6077adae6cadc93ea4273295f1f..0051017a7fd67691a343470f36ad4fc32c8e7e15 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4173,6 +4173,7 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>  
> +@hook TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
>  
>  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>  
> diff --git a/gcc/target.def b/gcc/target.def
> index e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5..e4474a3ed6bd2f5f5c010bf0d40c2a371370490c 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1868,6 +1868,18 @@ correct for most targets.",
>   poly_uint64, (const_tree type),
>   default_preferred_vector_alignment)
>  
> +/* Returns whether the target has a preference for decomposing divisions using
> +   shifts rather than multiplies.  */
> +DEFHOOK
> +(preferred_div_as_shifts_over_mult,
> + "Sometimes it is possible to implement a vector division using a sequence\n\
> +of two addition-shift pairs, giving four instructions in total.\n\
> +Return true if taking this approach for @var{vectype} is likely\n\
> +to be better than using a sequence involving highpart multiplication.\n\
> +Default is false if @code{can_mult_highpart_p}, otherwise true.",
> + bool, (const_tree type),
> + default_preferred_div_as_shifts_over_mult)
> +
>  /* Return true if vector alignment is reachable (by peeling N
>     iterations) for the given scalar type.  */
>  DEFHOOK
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index a6a4809ca91baa5d7fad2244549317a31390f0c2..a207963b9e6eb9300df0043e1b79aa6c941d0f7f 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -53,6 +53,8 @@ extern scalar_int_mode default_unwind_word_mode (void);
>  extern unsigned HOST_WIDE_INT default_shift_truncation_mask
>    (machine_mode);
>  extern unsigned int default_min_divisions_for_recip_mul (machine_mode);
> +extern bool default_preferred_div_as_shifts_over_mult
> +  (const_tree);
>  extern int default_mode_rep_extended (scalar_int_mode, scalar_int_mode);
>  
>  extern tree default_stack_protect_guard (void);
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index 211525720a620d6f533e2da91e03877337a931e7..7f39ff9b7ec2bf66625d48a47bb76e96c05a3233 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1483,6 +1483,15 @@ default_preferred_vector_alignment (const_tree type)
>    return TYPE_ALIGN (type);
>  }
>  
> +/* The default implementation of
> +   TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
> +
> +bool
> +default_preferred_div_as_shifts_over_mult (const_tree type)
> +{
> +  return can_mult_highpart_p (TYPE_MODE (type), TYPE_UNSIGNED (type));

The return value should be inverted.

> +}
> +
>  /* By default assume vectors of element TYPE require a multiple of the natural
>     alignment of TYPE.  TYPE is naturally aligned if IS_PACKED is false.  */
>  bool
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include "tree-vect.h"
> +
> +typedef unsigned __attribute__((__vector_size__ (16))) V;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> +foo (V v, unsigned short i)
> +{
> +  v /= i;
> +  return v;
> +}
> +
> +int
> +main (void)
> +{
> +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> +    if (v[i] != 0x00010001)
> +      __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> @@ -0,0 +1,58 @@
> +/* { dg-require-effective-target vect_int } */
> +
> +#include <stdint.h>
> +#include <stdio.h>
> +#include "tree-vect.h"
> +
> +#define N 50
> +#define TYPE uint8_t 
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +
> +__attribute__((noipa, noinline, optimize("O1")))
> +void fun1(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +__attribute__((noipa, noinline, optimize("O3")))
> +void fun2(TYPE* restrict pixel, TYPE level, int n)
> +{
> +  for (int i = 0; i < n; i+=1)
> +    pixel[i] = (pixel[i] + level) / 0xff;
> +}
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N / 2, N);
> +  fun2 (b, N / 2, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 1766ce277d6b88d8aa3be77e7c8abb504a10a735..183f1a623fbde34f505259cf8f4fb4d34e069614 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,6 +3914,83 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>        return pattern_stmt;
>      }
>  
> +  if ((cst = uniform_integer_cst_p (oprnd1))
> +	   && TYPE_UNSIGNED (itype)
> +	   && rhs_code == TRUNC_DIV_EXPR
> +	   && vectype
> +	   && targetm.vectorize.preferred_div_as_shifts_over_mult (vectype))

Needs reindenting.

OK with those changes, thanks.

Richard

> +    {
> +      /* We can use the relationship:
> +
> +	   x // N == ((x+N+2) // (N+1) + x) // (N+1)  for 0 <= x < N(N+3)
> +
> +	 to optimize cases where N+1 is a power of 2, and where // (N+1)
> +	 is therefore a shift right.  When operating in modes that are
> +	 multiples of a byte in size, there are two cases:
> +
> +	 (1) N(N+3) is not representable, in which case the question
> +	     becomes whether the replacement expression overflows.
> +	     It is enough to test that x+N+2 does not overflow,
> +	     i.e. that x < MAX-(N+1).
> +
> +	 (2) N(N+3) is representable, in which case it is the (only)
> +	     bound that we need to check.
> +
> +	 ??? For now we just handle the case where // (N+1) is a shift
> +	 right by half the precision, since some architectures can
> +	 optimize the associated addition and shift combinations
> +	 into single instructions.  */
> +
> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == prec / 2)
> +	{
> +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +
> +	  gimple_ranger ranger;
> +	  int_range_max r;
> +
> +	  /* Check that no overflow will occur.  If we don't have range
> +	     information we can't perform the optimization.  */
> +
> +	  if (ranger.range_of_expr (r, oprnd0, stmt))
> +	    {
> +	      wide_int max = r.upper_bound ();
> +	      wide_int one = wi::shwi (1, prec);
> +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> +	      wi::overflow_type ovf;
> +	      wi::add (max, adder, UNSIGNED, &ovf);
> +	      if (ovf == wi::OVF_NONE)
> +		{
> +		  *type_out = vectype;
> +		  tree tadder = wide_int_to_tree (itype, adder);
> +		  tree rshift = wide_int_to_tree (itype, pow);
> +
> +		  tree new_lhs1 = vect_recog_temp_ssa_var (itype, NULL);
> +		  gassign *patt1
> +		    = gimple_build_assign (new_lhs1, PLUS_EXPR, oprnd0, tadder);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  tree new_lhs2 = vect_recog_temp_ssa_var (itype, NULL);
> +		  patt1 = gimple_build_assign (new_lhs2, RSHIFT_EXPR, new_lhs1,
> +					       rshift);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  tree new_lhs3 = vect_recog_temp_ssa_var (itype, NULL);
> +		  patt1 = gimple_build_assign (new_lhs3, PLUS_EXPR, new_lhs2,
> +					       oprnd0);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  tree new_lhs4 = vect_recog_temp_ssa_var (itype, NULL);
> +		  pattern_stmt = gimple_build_assign (new_lhs4, RSHIFT_EXPR,
> +						      new_lhs3, rshift);
> +
> +		  return pattern_stmt;
> +		}
> +	    }
> +	}
> +    }
> +
>    if (prec > HOST_BITS_PER_WIDE_INT
>        || integer_zerop (oprnd1))
>      return NULL;
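
Given the note above that the return value should be inverted, the default
would presumably end up looking something like the sketch below, matching
the documented behaviour of "false if can_mult_highpart_p, otherwise true"
(the committed version may differ):

  /* In targhooks.cc: prefer the shift-based sequence only when the target
     has no cheap highpart multiply for this mode.  */
  bool
  default_preferred_div_as_shifts_over_mult (const_tree type)
  {
    return !can_mult_highpart_p (TYPE_MODE (type), TYPE_UNSIGNED (type));
  }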

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-03-09 19:37       ` Tamar Christina
@ 2023-03-10 13:32         ` Andrew MacLeod
  2023-03-10 14:11           ` Tamar Christina
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew MacLeod @ 2023-03-10 13:32 UTC (permalink / raw)
  To: Tamar Christina, Aldy Hernandez; +Cc: gcc-patches, nd

On 3/9/23 14:37, Tamar Christina wrote:
> Cheers,
>
> Thanks! I'll wait for him to come back then 😊
>
> Thanks,
> Tamar
>
>> -----Original Message-----
>> From: Aldy Hernandez <aldyh@redhat.com>
>> Sent: Wednesday, March 8, 2023 8:57 AM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; amacleod@redhat.com
>> Subject: Re: [PATCH 2/4][ranger]: Add range-ops for widen addition and
>> widen multiplication [PR108583]
>>
>> As Andrew has been advising on this one, I'd prefer for him to review it.
>> However, he's on vacation this week.  FYI...
>>
>> Aldy
>>
>> On Mon, Mar 6, 2023 at 12:22 PM Tamar Christina
>> <Tamar.Christina@arm.com> wrote:
>>> Ping.
>>>
>>> And updated the patch to reject cases that we don't expect or can handle
>> cleanly for now.
>>
It's OK by me...  but I think a release manager has to sign off on it
for this stage.  Next stage 1 I will formalize the process a bit more for
nonstandard range-ops.

Andrew


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-03-10 13:32         ` Andrew MacLeod
@ 2023-03-10 14:11           ` Tamar Christina
  2023-03-10 14:30             ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Tamar Christina @ 2023-03-10 14:11 UTC (permalink / raw)
  To: Andrew MacLeod, Aldy Hernandez, rguenther; +Cc: gcc-patches, nd

> >> As Andrew has been advising on this one, I'd prefer for him to review it.
> >> However, he's on vacation this week.  FYI...
> >>
> >> Aldy
> >>
> >> On Mon, Mar 6, 2023 at 12:22 PM Tamar Christina
> >> <Tamar.Christina@arm.com> wrote:
> >>> Ping.
> >>>
> >>> And updated the patch to reject cases that we don't expect or can
> >>> handle
> >> cleanly for now.
> >>
> It's OK by me...  but I think a release manager has to sign off on it for this
> stage.  Next stage 1 I will formalize the process a bit more for nonstandard
> range-ops.
> 

Thanks!

Richi is this change OK with you?

Thanks,
Tamar
> Andrew


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583]
  2023-03-10 14:11           ` Tamar Christina
@ 2023-03-10 14:30             ` Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2023-03-10 14:30 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches; +Cc: Andrew MacLeod, Aldy Hernandez, nd



> Am 10.03.2023 um 15:12 schrieb Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>:
> 
> 
>> 
>>>> As Andrew has been advising on this one, I'd prefer for him to review it.
>>>> However, he's on vacation this week.  FYI...
>>>> 
>>>> Aldy
>>>> 
>>>> On Mon, Mar 6, 2023 at 12:22 PM Tamar Christina
>>>> <Tamar.Christina@arm.com> wrote:
>>>>> Ping.
>>>>> 
>>>>> And updated the patch to reject cases that we don't expect or can
>>>>> handle
>>>> cleanly for now.
>>>> 
>> It's OK by me...  but I think a release manager has to sign off on it for this
>> stage.  Next stage 1 I will formalize the process a bit more for nonstandard
>> range-ops.
>> 
> 
> Thanks!
> 
> Richi is this change OK with you?

Yes.

Richard 

> Thanks,
> Tamar
>> Andrew
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-03-10 14:30 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27 12:32 [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Tamar Christina
2023-02-27 12:33 ` [PATCH 2/4][ranger]: Add range-ops for widen addition and widen multiplication [PR108583] Tamar Christina
2023-03-06 11:20   ` Tamar Christina
2023-03-08  8:57     ` Aldy Hernandez
2023-03-09 19:37       ` Tamar Christina
2023-03-10 13:32         ` Andrew MacLeod
2023-03-10 14:11           ` Tamar Christina
2023-03-10 14:30             ` Richard Biener
2023-02-27 12:33 ` [PATCH 3/4]middle-end: Implement preferred_div_as_shifts_over_mult [PR108583] Tamar Christina
2023-03-06 11:23   ` Tamar Christina
2023-03-08  8:55     ` Richard Sandiford
2023-03-09 19:39       ` Tamar Christina
2023-03-10  8:39         ` Richard Sandiford
2023-02-27 12:34 ` [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
2023-03-06 11:21   ` Tamar Christina
2023-03-08  9:17     ` Richard Sandiford
2023-03-08  9:25       ` Tamar Christina
2023-03-08 10:44         ` Richard Sandiford
2023-02-27 14:07 ` [PATCH 1/4]middle-end: Revert can_special_div_by_const changes [PR108583] Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).