public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [RFC/PATCH] ira: Consider matching constraints with param [PR100328]
@ 2021-06-09  5:18 Kewen.Lin
  2021-06-28  6:26 ` [RFC/PATCH v3] ira: Support more matching constraint forms " Kewen.Lin
  0 siblings, 1 reply; 16+ messages in thread
From: Kewen.Lin @ 2021-06-09  5:18 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, Bill Schmidt, Vladimir Makarov, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 3392 bytes --]

Hi,

PR100328 describes this issue in detail; I will try to
summarize it here.  In the hottest function LBM_performStreamCollideTRT
of SPEC2017 benchmark 519.lbm_r, there are many FMA style expressions
(27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
insn has two flavors: FLOAT_REG and VSX_REG.  The VSX_REG reg
class has 64 registers, whose first 32 make up the whole of
FLOAT_REG.  There are some differences between these two
flavors; taking "*fma<mode>4_fpr" as an example:

(define_insn "*fma<mode>4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
	(fma:SFDF
	  (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
	  (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
	  (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]

// wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
// <Ff> (f/d) => A floating point register, aka. FLOAT_REG.

So for VSX_REG we only have the destructive form: when a VSX_REG
alternative is used, operand 2 or operand 3 is required to be the
same as operand 0.  reload has to take care of this constraint and
create some non-free register copies if required.

Assuming one fma insn looks like:
  op0 = FMA (op1, op2, op3)

The best regclass for all of them is VSX_REG.  When op1, op2 and op3
are all dead, IRA simply creates three shuffle copies for them (here
the operand order matters, since with the same freq the one with the
smaller number takes preference), but IMO both op2 and op3 should take
higher priority in the copy queue due to the matching constraint.

I noticed the function ira_get_dup_out_num, which is meant to
create this kind of constraint copy, but the code below looks like
it refuses to create one if there is an alternative that has a valid
register class which is not likely to be spilled:

      default:
	{
	  enum constraint_num cn = lookup_constraint (str);
	  enum reg_class cl = reg_class_for_constraint (cn);
	  if (cl != NO_REGS
	      && !targetm.class_likely_spilled_p (cl))
	    goto fail;

	 ...

I cooked up the attached patch to make IRA respect this kind of
matching constraint, guarded by a parameter.  As I stated in the PR,
I am not sure this is on the right track.  The RFC patch checks the
matching constraint in all alternatives; if there is one alternative
with a matching constraint that matches the current preferred regclass
(or the best one of the allocno?), it records the output operand number
and further creates one constraint copy for it.  Normally that copy
gets priority over shuffle copies and the matching constraint is
satisfied with higher probability, so reload doesn't have to create
extra copies to meet the matching constraint or the desired register
class.

For FMA A,B,C,D, I think ideally the copies A/B, A/C, A/D can at first
stay as shuffle copies; later, if any of A,B,C,D gets assigned a
hardware register which is a VSX register (VSX_REG) but not an FP
register (FLOAT_REG), we can no longer use the FLOAT_REG alternative
and have to pay the copy cost, so at that point it becomes important
to respect the matching constraint, and we can increase the freq of
the remaining copies related to this insn (A/B, A/C, A/D).  This idea
requires some side tables to record extra information and seems a bit
complicated in the current framework, so the proposed patch
aggressively emphasizes the matching constraint at the time of
creating the copies.

Any comments are highly appreciated!

BR,
Kewen

[-- Attachment #2: ira_matching_cstr.diff --]
[-- Type: text/plain, Size: 4467 bytes --]

---
 gcc/config/rs6000/rs6000.c |  3 ++
 gcc/ira.c                  | 69 ++++++++++++++++++++++++++++++++++----
 gcc/params.opt             |  4 +++
 3 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5ae40d6f4ce..eb9c4284f91 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4852,6 +4852,9 @@ rs6000_option_override_internal (bool global_init_p)
 	 ap = __builtin_next_arg (0).  */
       if (DEFAULT_ABI != ABI_V4)
 	targetm.expand_builtin_va_start = NULL;
+
+      SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+			   param_ira_consider_dup_in_all_alts, 1);
     }
 
   rs6000_override_options_after_change ();
diff --git a/gcc/ira.c b/gcc/ira.c
index b93588d8a9f..beebee7499b 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1937,10 +1939,16 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
     return -1;
   str = recog_data.constraints[op_num];
   use_commut_op_p = false;
+
+  rtx op = recog_data.operand[op_num];
+  int op_no = reg_or_subregno (op);
+  enum reg_class op_pref_cl = reg_preferred_class (op_no);
+  machine_mode op_mode = GET_MODE (op);
+
   for (;;)
     {
-      rtx op = recog_data.operand[op_num];
-      
+      bool saw_reg_cstr = false;
+
       for (curr_alt = 0, ignore_p = !TEST_BIT (alts, curr_alt),
 	   original = -1;;)
 	{
@@ -1963,9 +1971,25 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		{
 		  enum constraint_num cn = lookup_constraint (str);
 		  enum reg_class cl = reg_class_for_constraint (cn);
-		  if (cl != NO_REGS
-		      && !targetm.class_likely_spilled_p (cl))
-		    goto fail;
+		  if (cl != NO_REGS && !targetm.class_likely_spilled_p (cl))
+		    {
+		      if (param_ira_consider_dup_in_all_alts
+			  && op_pref_cl != NO_REGS)
+			{
+			  /* If moving from the preferred class to this class
+			     (which has no matching constraint) is free, we
+			     don't need to respect the constraint at a cost.  */
+			  if (cl != op_pref_cl
+			      && (ira_reg_class_intersect[cl][op_pref_cl]
+				  != NO_REGS)
+			      && (ira_may_move_in_cost[op_mode][op_pref_cl][cl]
+				  == 0))
+			    goto fail;
+			  saw_reg_cstr = true;
+			}
+		      else
+			goto fail;
+		    }
 		  if (constraint_satisfied_p (op, cn))
 		    goto fail;
 		  break;
@@ -1979,7 +2003,40 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		  str = end;
 		  if (original != -1 && original != n)
 		    goto fail;
-		  original = n;
+		  if (param_ira_consider_dup_in_all_alts && saw_reg_cstr)
+		    {
+		      rtx out = recog_data.operand[n];
+		      if (!REG_P (out)
+			  && (GET_CODE (out) != SUBREG
+			      || !REG_P (SUBREG_REG (out))))
+			goto fail;
+		      int out_no = reg_or_subregno (out);
+		      if (out_no >= FIRST_PSEUDO_REGISTER)
+			{
+			  const char *out_alts = recog_data.constraints[n];
+			  int tot = curr_alt;
+			  while (tot > 0)
+			    {
+			      if (out_alts[0] == ',')
+				tot--;
+			      out_alts++;
+			    }
+			  enum reg_class out_cl = NO_REGS;
+			  while (*out_alts != '\0' && *out_alts != ',')
+			    {
+			      enum constraint_num cn
+				= lookup_constraint (out_alts);
+			      out_cl = reg_class_for_constraint (cn);
+			      if (out_cl != NO_REGS)
+				break;
+			    }
+			  /* Respect this as it's for the preferred regclass.  */
+			  if (out_cl == op_pref_cl)
+			    original = n;
+			}
+		    }
+		  else
+		    original = n;
 		  continue;
 		}
 	      }
diff --git a/gcc/params.opt b/gcc/params.opt
index 7c7aa78992a..7d9d3a5876d 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -326,6 +326,10 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.
 
+-param=ira-consider-dup-in-all-alts=
+Common Joined UInteger Var(param_ira_consider_dup_in_all_alts) Init(0) IntegerRange(0, 1) Param Optimization
Control whether IRA continues to look for a matching constraint (duplicated operand number) even if it has encountered some constraint that has an appropriate register class; it skips those alternatives whose constraints don't have the same register class as the one the operand prefers.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-09  5:18 [RFC/PATCH] ira: Consider matching constraints with param [PR100328] Kewen.Lin
@ 2021-06-28  6:26 ` Kewen.Lin
  2021-06-28  7:12   ` Hongtao Liu
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Kewen.Lin @ 2021-06-28  6:26 UTC (permalink / raw)
  To: GCC Patches
  Cc: Vladimir Makarov, bergner, Bill Schmidt, Segher Boessenkool,
	Richard Sandiford, crazylht

[-- Attachment #1: Type: text/plain, Size: 6442 bytes --]

Hi!

on 2021/6/9 at 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> PR100328 has some details about this issue, I am trying to
> brief it here.  In the hottest function LBM_performStreamCollideTRT
> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
> class have 64 registers whose foregoing 32 ones make up the
> whole FLOAT_REG.  There are some differences for these two
> flavors, taking "*fma<mode>4_fpr" as example:
> 
> (define_insn "*fma<mode>4_fpr"
>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> 	(fma:SFDF
> 	  (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> 	  (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> 	  (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> 
> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> 
> So for VSX_REG, we only have the destructive form, when VSX_REG
> alternative being used, the operand 2 or operand 3 is required
> to be the same as operand 0.  reload has to take care of this
> constraint and create some non-free register copies if required.
> 
> Assuming one fma insn looks like:
>   op0 = FMA (op1, op2, op3)
> 
> The best regclass of them are VSX_REG, when op1,op2,op3 are all dead,
> IRA simply creates three shuffle copies for them (here the operand
> order matters, since with the same freq, the one with smaller number
> takes preference), but IMO both op2 and op3 should take higher priority
> in copy queue due to the matching constraint.
> 
> I noticed that there is one function ira_get_dup_out_num, which meant
> to create this kind of constraint copy, but the below code looks to
> refuse to create if there is an alternative which has valid regclass
> without spilled need. 
> 
>       default:
> 	{
> 	  enum constraint_num cn = lookup_constraint (str);
> 	  enum reg_class cl = reg_class_for_constraint (cn);
> 	  if (cl != NO_REGS
> 	      && !targetm.class_likely_spilled_p (cl))
> 	    goto fail
> 
> 	 ...
> 
> I cooked one patch attached to make ira respect this kind of matching
> constraint guarded with one parameter.  As I stated in the PR, I was
> not sure this is on the right track.  The RFC patch is to check the
> matching constraint in all alternatives, if there is one alternative
> with matching constraint and matches the current preferred regclass
> (or best of allocno?), it will record the output operand number and
> further create one constraint copy for it.  Normally it can get the
> priority against shuffle copies and the matching constraint will get
> satisfied with higher possibility, reload doesn't create extra copies
> to meet the matching constraint or the desirable register class when
> it has to.
> 
> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
> as shuffle copies, and later any of A,B,C,D gets assigned by one
> hardware register which is a VSX register (VSX_REG) but not a FP
> register (FLOAT_REG), which means it has to pay costs once we can NOT
> go with VSX alternatives, so at that time it's important to respect
> the matching constraint then we can increase the freq for the remaining
> copies related to this (A/B, A/C, A/D).  This idea requires some side
> tables to record some information and seems a bit complicated in the
> current framework, so the proposed patch aggressively emphasizes the
> matching constraint at the time of creating copies.
> 

Compared with the original patch (v1), patch v3 has considered
the following (this should be v2 for this mailing list, but I
bumped the version to be consistent with the PR's):

  - Excluding the case where, for one preferred register class,
    there are two or more alternatives, one of which has the
    matching constraint while another doesn't.  For such an
    operand, even if it's assigned a hardware reg which doesn't
    meet the matching constraint, it can simply use the
    alternative without the matching constraint, so no register
    move is needed.  One typical case is define_insn
    *mov<mode>_internal2 on rs6000.  So we shouldn't create a
    constraint copy for it.

  - The possibility of a free register move within the same
    register class; the new handling is disabled in that case
    since the register move to meet the constraint is free.

  - Making it on by default, as suggested by Segher & Vladimir;
    we hope to get rid of the parameter if the benchmarking
    results look good on major targets.

  - Tweaking the cost when either side of the matching constraint
    is a hardware register.  Before this patch, the constraint
    copy was simply taken as a real move insn for the pref and
    conflict cost with one hardware register; after this patch,
    several input operands are allowed to respect the same
    matching constraint (but in different alternatives), so we
    should treat it like a shuffle copy in some cases to avoid
    over-preferring/disparaging.

Please check the PR comments for more details.

This patch can be bootstrapped & regtested on
powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but there are
some "XFAIL->XPASS" changes on aarch64-linux-gnu.  The list of
those was attached in the PR, and I think the new assembly looks
improved (as expected).

With -Ofast and unrolling, this patch improves SPEC2017
bmk 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without
any notable degradations.

Since this patch likely benefits x86_64 and aarch64 but I don't
have performance machines with those arches at hand, could
someone kindly help to benchmark it if possible?

Many thanks in advance!

btw, you can simply ignore the part about the parameter
ira-consider-dup-in-all-alts (its name/description); it's somewhat
stale, but I left it as is for now since we will likely remove it.

BR,
Kewen
-----
gcc/ChangeLog:

	* doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
	parameter.
	* ira.c (ira_get_dup_out_num): Adjust for the new parameter
	param_ira_consider_dup_in_all_alts.
	* params.opt (ira-consider-dup-in-all-alts): New.
	* ira-conflicts.c (process_regs_for_copy): Add one parameter
	single_input_op_has_cstr_p.
	(get_freq_for_shuffle_copy): New function.
	(add_insn_allocno_copies): Adjust for single_input_op_has_cstr_p.
	* ira-int.h (ira_get_dup_out_num): Add one bool parameter.

[-- Attachment #2: ira-v3.diff --]
[-- Type: text/plain, Size: 14831 bytes --]

---
 gcc/doc/invoke.texi |   6 +++
 gcc/ira-conflicts.c |  91 +++++++++++++++++++++++++++-------
 gcc/ira-int.h       |   2 +-
 gcc/ira.c           | 118 ++++++++++++++++++++++++++++++++++++++++----
 gcc/params.opt      |   4 ++
 5 files changed, 194 insertions(+), 27 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 510f24e55ab..d54cc991d18 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13845,6 +13845,12 @@ of available registers reserved for some other purposes is given
 by this parameter.  Default of the parameter
 is the best found from numerous experiments.
 
+@item ira-consider-dup-in-all-alts
+Make IRA consider the matching constraint (duplicated operand number)
+heavily if the operand is in the preferred register class, even if
+there is some other choice with an appropriate register class, no
+matter whether it is preferred or not.
+
 @item lra-inheritance-ebb-probability-cutoff
 LRA tries to reuse values reloaded in registers in subsequent insns.
 This optimization is called inheritance.  EBB is used as a region to
diff --git a/gcc/ira-conflicts.c b/gcc/ira-conflicts.c
index d83cfc1c1a7..67c4cdcbc8d 100644
--- a/gcc/ira-conflicts.c
+++ b/gcc/ira-conflicts.c
@@ -233,6 +233,15 @@ go_through_subreg (rtx x, int *offset)
   return reg;
 }
 
+/* Return the adjusted frequency for this shuffle copy or a similar
+   case; since it's not for a real move insn, make it smaller.  */
+
+static int
+get_freq_for_shuffle_copy (int freq)
+{
+  return freq < 8 ? 1 : freq / 8;
+}
+
 /* Process registers REG1 and REG2 in move INSN with execution
    frequency FREQ.  The function also processes the registers in a
    potential move insn (INSN == NULL in this case) with frequency
@@ -240,12 +249,14 @@ go_through_subreg (rtx x, int *offset)
    corresponding allocnos or create a copy involving the corresponding
    allocnos.  The function does nothing if the both registers are hard
    registers.  When nothing is changed, the function returns
-   FALSE.  */
+   FALSE.  SINGLE_INPUT_OP_HAS_CSTR_P is only meaningful when
+   CONSTRAINT_P is true; see function ira_get_dup_out_num for details.  */
 static bool
-process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
-		       rtx_insn *insn, int freq)
+process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p, rtx_insn *insn,
+		       int freq, bool single_input_op_has_cstr_p = true)
 {
-  int allocno_preferenced_hard_regno, cost, index, offset1, offset2;
+  int allocno_preferenced_hard_regno, index, offset1, offset2;
+  int cost, conflict_cost, move_cost;
   bool only_regs_p;
   ira_allocno_t a;
   reg_class_t rclass, aclass;
@@ -306,9 +317,52 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
     return false;
   ira_init_register_move_cost_if_necessary (mode);
   if (HARD_REGISTER_P (reg1))
-    cost = ira_register_move_cost[mode][aclass][rclass] * freq;
+    move_cost = ira_register_move_cost[mode][aclass][rclass];
+  else
+    move_cost = ira_register_move_cost[mode][rclass][aclass];
+
+  if (!single_input_op_has_cstr_p)
+    {
+      /* When this is a constraint copy and the matching constraint
+	 doesn't only exist for this given operand but also for some
+	 other operand(s), it means reg1 and reg2 don't have to use
+	 the same hardware register to save the possible move cost;
+	 this hardware preference isn't required to be fixed.  To
+	 avoid over-preferring this hardware register, and
+	 over-disparaging this hardware register on conflicting
+	 objects, we need some cost tweaking here, similar to what
+	 we do for a shuffle copy.  */
+      gcc_assert (constraint_p);
+      int reduced_freq = get_freq_for_shuffle_copy (freq);
+      if (HARD_REGISTER_P (reg1))
+	/* For reg2 = opcode(reg1, reg3 ...), assume that reg3 is a
+	   pseudo register which has a matching constraint on reg2;
+	   even if reg2 isn't assigned reg1, it's still possible to
+	   have no register moves if reg2 and reg3 use the same
+	   hardware register.  So to avoid the allocation
+	   over-preferring reg1, we can just take it as a shuffle copy.  */
+	cost = conflict_cost = move_cost * reduced_freq;
+      else
+	{
+	  /* For reg1 = opcode(reg2, reg3 ...), assume that reg3 is a
+	     pseudo register which has a matching constraint on reg2.
+	     To save the register move, it's better to assign reg1
+	     to either of reg2 and reg3 (or one of the other pseudos
+	     like reg3), so it's reasonable to use freq for the cost.
+	     But for conflict_cost, since reg2 and reg3 conflict with
+	     each other, both of them have a chance to be assigned
+	     reg1; assuming reg3 has one copy which also conflicts
+	     with reg2, we shouldn't make it less preferred on reg1
+	     since reg3 has the same chance to be assigned reg1.
+	     So adjust the conflict_cost to make it the same as what
+	     we use for a shuffle copy.  */
+	  cost = move_cost * freq;
+	  conflict_cost = move_cost * reduced_freq;
+	}
+    }
   else
-    cost = ira_register_move_cost[mode][rclass][aclass] * freq;
+    cost = conflict_cost = move_cost * freq;
+
   do
     {
       ira_allocate_and_set_costs
@@ -317,7 +371,7 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
       ira_allocate_and_set_costs
 	(&ALLOCNO_CONFLICT_HARD_REG_COSTS (a), aclass, 0);
       ALLOCNO_HARD_REG_COSTS (a)[index] -= cost;
-      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= cost;
+      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= conflict_cost;
       if (ALLOCNO_HARD_REG_COSTS (a)[index] < ALLOCNO_CLASS_COST (a))
 	ALLOCNO_CLASS_COST (a) = ALLOCNO_HARD_REG_COSTS (a)[index];
       ira_add_allocno_pref (a, allocno_preferenced_hard_regno, freq);
@@ -420,7 +474,8 @@ add_insn_allocno_copies (rtx_insn *insn)
       operand = recog_data.operand[i];
       if (! REG_SUBREG_P (operand))
 	continue;
-      if ((n = ira_get_dup_out_num (i, alts)) >= 0)
+      bool single_input_op_has_cstr_p;
+      if ((n = ira_get_dup_out_num (i, alts, single_input_op_has_cstr_p)) >= 0)
 	{
 	  bound_p[n] = true;
 	  dup = recog_data.operand[n];
@@ -429,8 +484,8 @@ add_insn_allocno_copies (rtx_insn *insn)
 				REG_P (operand)
 				? operand
 				: SUBREG_REG (operand)) != NULL_RTX)
-	    process_regs_for_copy (operand, dup, true, NULL,
-				   freq);
+	    process_regs_for_copy (operand, dup, true, NULL, freq,
+				   single_input_op_has_cstr_p);
 	}
     }
   for (i = 0; i < recog_data.n_operands; i++)
@@ -440,13 +495,15 @@ add_insn_allocno_copies (rtx_insn *insn)
 	  && find_reg_note (insn, REG_DEAD,
 			    REG_P (operand)
 			    ? operand : SUBREG_REG (operand)) != NULL_RTX)
-	/* If an operand dies, prefer its hard register for the output
-	   operands by decreasing the hard register cost or creating
-	   the corresponding allocno copies.  The cost will not
-	   correspond to a real move insn cost, so make the frequency
-	   smaller.  */
-	process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
-			      bound_p);
+	{
+	  /* If an operand dies, prefer its hard register for the output
+	     operands by decreasing the hard register cost or creating
+	     the corresponding allocno copies.  The cost will not
+	     correspond to a real move insn cost, so make the frequency
+	     smaller.  */
+	  int new_freq = get_freq_for_shuffle_copy (freq);
+	  process_reg_shuffles (insn, operand, i, new_freq, bound_p);
+	}
     }
 }
 
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 31e013b0461..da748626e31 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -971,7 +971,7 @@ extern void ira_debug_disposition (void);
 extern void ira_debug_allocno_classes (void);
 extern void ira_init_register_move_cost (machine_mode);
 extern alternative_mask ira_setup_alts (rtx_insn *);
-extern int ira_get_dup_out_num (int, alternative_mask);
+extern int ira_get_dup_out_num (int, alternative_mask, bool &);
 
 /* ira-build.c */
 
diff --git a/gcc/ira.c b/gcc/ira.c
index 638ef4ea17e..75033a45963 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1922,9 +1922,15 @@ ira_setup_alts (rtx_insn *insn)
 /* Return the number of the output non-early clobber operand which
    should be the same in any case as operand with number OP_NUM (or
    negative value if there is no such operand).  ALTS is the mask
-   of alternatives that we should consider.  */
+   of alternatives that we should consider.  SINGLE_INPUT_OP_HAS_CSTR_P
+   should be set by this function; it indicates whether there is only
+   a single input operand which has the matching constraint on the
+   output operand at the returned position.  If the pattern allows any
+   one of several input operands to hold the matching constraint, it's
+   set to false.  One typical case is the FMA insn on rs6000.  */
 int
-ira_get_dup_out_num (int op_num, alternative_mask alts)
+ira_get_dup_out_num (int op_num, alternative_mask alts,
+		     bool &single_input_op_has_cstr_p)
 {
   int curr_alt, c, original;
   bool ignore_p, use_commut_op_p;
@@ -1937,10 +1943,42 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
     return -1;
   str = recog_data.constraints[op_num];
   use_commut_op_p = false;
+  single_input_op_has_cstr_p = true;
+
+  rtx op = recog_data.operand[op_num];
+  int op_no = reg_or_subregno (op);
+  enum reg_class op_pref_cl = reg_preferred_class (op_no);
+  machine_mode op_mode = GET_MODE (op);
+
+  ira_init_register_move_cost_if_necessary (op_mode);
+  /* If the preferred regclass isn't NO_REGS, continue to find the matching
+     constraint in all available alternatives with the preferred regclass,
+     even if we have found or will find one alternative whose constraint
+     stands for a register (non-NO_REGS) regclass.  Note that it would be
+     fine not to respect the matching constraint if the register copy is
+     free, so exclude that case.  */
+  bool respect_dup_despite_reg_cstr
+    = param_ira_consider_dup_in_all_alts
+      && op_pref_cl != NO_REGS
+      && ira_register_move_cost[op_mode][op_pref_cl][op_pref_cl] > 0;
+
+  /* Record the alternatives whose constraints use the same regclass as
+     the preferred regclass.  Later, if we find one matching constraint
+     for this operand with the preferred regclass, we will visit these
+     recorded alternatives to check whether there is one alternative in
+     which no input operand has a matching constraint the same as our
+     candidate.  If yes, it means there is one alternative which is
+     perfectly fine without satisfying this matching constraint.  If no,
+     it means in every such alternative there is one other input operand
+     holding this matching constraint, so it's fine to respect this
+     matching constraint and create this constraint copy, since it
+     becomes harmless once some other operand takes preference.  */
+  alternative_mask pref_cl_alts;
+
   for (;;)
     {
-      rtx op = recog_data.operand[op_num];
-      
+      pref_cl_alts = 0;
+
       for (curr_alt = 0, ignore_p = !TEST_BIT (alts, curr_alt),
 	   original = -1;;)
 	{
@@ -1963,9 +2001,25 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		{
 		  enum constraint_num cn = lookup_constraint (str);
 		  enum reg_class cl = reg_class_for_constraint (cn);
-		  if (cl != NO_REGS
-		      && !targetm.class_likely_spilled_p (cl))
-		    goto fail;
+		  if (cl != NO_REGS && !targetm.class_likely_spilled_p (cl))
+		    {
+		      if (respect_dup_despite_reg_cstr)
+			{
+			  /* If moving from the preferred class to this class
+			     (which has no matching constraint) is free, we
+			     don't need to respect the constraint at a cost.  */
+			  if (cl != op_pref_cl
+			      && (ira_reg_class_intersect[cl][op_pref_cl]
+				  != NO_REGS)
+			      && (ira_may_move_in_cost[op_mode][op_pref_cl][cl]
+				  == 0))
+			    goto fail;
+			  else if (cl == op_pref_cl)
+			    pref_cl_alts |= ALTERNATIVE_BIT (curr_alt);
+			}
+		      else
+			goto fail;
+		    }
 		  if (constraint_satisfied_p (op, cn))
 		    goto fail;
 		  break;
@@ -1979,7 +2033,21 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		  str = end;
 		  if (original != -1 && original != n)
 		    goto fail;
-		  original = n;
+		  gcc_assert (n < recog_data.n_operands);
+		  if (respect_dup_despite_reg_cstr)
+		    {
+		      const operand_alternative *op_alt
+			= &recog_op_alt[curr_alt * recog_data.n_operands];
+		      /* Only respect the one with the preferred regclass;
+			 with respect_dup_despite_reg_cstr, it's possible
+			 to get one whose regclass isn't preferred first,
+			 but it would fail since there should be other
+			 alternatives with the preferred regclass.  */
+		      if (op_alt[n].cl == op_pref_cl)
+			original = n;
+		    }
+		  else
+		    original = n;
 		  continue;
 		}
 	      }
@@ -1988,7 +2056,39 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
       if (original == -1)
 	goto fail;
       if (recog_data.operand_type[original] == OP_OUT)
-	return original;
+	{
+	  if (pref_cl_alts == 0)
+	    return original;
+	  /* Visit these recorded alternatives to check whether
+	     there is one alternative in which no input operand
+	     has a matching constraint the same as our candidate.
+	     Give up on this candidate if so.  */
+	  int nop, nalt;
+	  for (nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+	    {
+	      if (!TEST_BIT (pref_cl_alts, nalt))
+		continue;
+	      const operand_alternative *op_alt
+		= &recog_op_alt[nalt * recog_data.n_operands];
+	      bool dup_in_other = false;
+	      for (nop = 0; nop < recog_data.n_operands; nop++)
+		{
+		  if (recog_data.operand_type[nop] != OP_IN)
+		    continue;
+		  if (nop == op_num)
+		    continue;
+		  if (op_alt[nop].matches == original)
+		    {
+		      dup_in_other = true;
+		      break;
+		    }
+		}
+	      if (!dup_in_other)
+		return -1;
+	    }
+	  single_input_op_has_cstr_p = false;
+	  return original;
+	}
     fail:
       if (use_commut_op_p)
 	break;
diff --git a/gcc/params.opt b/gcc/params.opt
index 18e6036c4f4..5121e3ddc80 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -330,6 +330,10 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.
 
+-param=ira-consider-dup-in-all-alts=
+Common Joined UInteger Var(param_ira_consider_dup_in_all_alts) Init(1) IntegerRange(0, 1) Param Optimization
Control whether IRA considers the matching constraint (duplicated operand number) heavily if that operand is in the preferred register class, even if there is some other choice with an appropriate register class, no matter whether it is preferred or not.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  6:26 ` [RFC/PATCH v3] ira: Support more matching constraint forms " Kewen.Lin
@ 2021-06-28  7:12   ` Hongtao Liu
  2021-06-28  7:20     ` Hongtao Liu
  2021-06-30 15:24   ` Vladimir Makarov
  2021-06-30 15:25   ` [RFC/PATCH v3] " Vladimir Makarov
  2 siblings, 1 reply; 16+ messages in thread
From: Hongtao Liu @ 2021-06-28  7:12 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi!
>
> on 2021/6/9 at 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> > Hi,
> >
> > PR100328 has some details about this issue, I am trying to
> > brief it here.  In the hottest function LBM_performStreamCollideTRT
> > of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> > (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
> > insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
> > class have 64 registers whose foregoing 32 ones make up the
> > whole FLOAT_REG.  There are some differences for these two
> > flavors, taking "*fma<mode>4_fpr" as example:
> >
> > (define_insn "*fma<mode>4_fpr"
> >   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> >       (fma:SFDF
> >         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> >         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> >         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> >
> > // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> > // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> >
> > So for VSX_REG we only have the destructive form: when a VSX_REG
> > alternative is used, operand 2 or operand 3 is required
> > to be the same as operand 0.  reload has to take care of this
> > constraint and create some non-free register copies if required.
> >
> > Assuming one fma insn looks like:
> >   op0 = FMA (op1, op2, op3)
> >
> > The best regclass for all of them is VSX_REG.  When op1, op2, op3
> > are all dead, IRA simply creates three shuffle copies for them
> > (here the operand order matters: with the same freq, the one with
> > the smaller number takes preference), but IMO both op2 and op3
> > should take higher priority in the copy queue due to the matching
> > constraint.
> >
> > I noticed that there is one function, ira_get_dup_out_num, which
> > is meant to create this kind of constraint copy, but the code
> > below looks like it refuses to create one if there is an
> > alternative with a valid regclass that isn't likely to be spilled.
> >
> >       default:
> >       {
> >         enum constraint_num cn = lookup_constraint (str);
> >         enum reg_class cl = reg_class_for_constraint (cn);
> >         if (cl != NO_REGS
> >             && !targetm.class_likely_spilled_p (cl))
> >           goto fail
> >
> >        ...
> >
> > I cooked up one patch (attached) to make IRA respect this kind of
> > matching constraint, guarded with one parameter.  As I stated in
> > the PR, I was not sure this is on the right track.  The RFC patch
> > checks the matching constraint in all alternatives; if there is
> > one alternative with a matching constraint that matches the
> > current preferred regclass (or best of allocno?), it records the
> > output operand number and further creates one constraint copy for
> > it.  Normally that copy gets priority over shuffle copies, so the
> > matching constraint gets satisfied with higher probability, and
> > reload doesn't have to create extra copies to meet the matching
> > constraint or the desired register class.
> >
> > For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first
> > stay as shuffle copies, and later, when any of A,B,C,D gets
> > assigned a hardware register which is a VSX register (VSX_REG)
> > but not an FP register (FLOAT_REG), it has to pay costs since the
> > FLOAT_REG alternative can no longer be used, so at that time it's
> > important to respect the matching constraint and increase the
> > freq for the remaining copies related to this (A/B, A/C, A/D).
> > This idea requires some side tables to record the information and
> > seems a bit complicated in the current framework, so the proposed
> > patch aggressively emphasizes the matching constraint at the time
> > of creating copies.
> >
>
> Compared with the original patch (v1), this patch v3 has
> considered the following (it should be v2 for this mailing list,
> but I bumped it to be consistent with the PR's numbering):
>
>   - Excluding the case where, for one preferred register class,
>     there can be two or more alternatives, one of which has the
>     matching constraint while another doesn't.  For such an
>     operand, even if it's assigned a hardware reg which doesn't
>     meet the matching constraint, it can simply use the
>     alternative without the matching constraint, so no register
>     move is needed.  One typical case is define_insn
>     *mov<mode>_internal2 on rs6000, so we shouldn't create a
>     constraint copy for it.
>
>   - Disabling this when a free register move within the same
>     register class is possible, since the register move to meet
>     the constraint is then considered free.
>
>   - Making it on by default, as suggested by Segher & Vladimir;
>     we hope to get rid of the parameter if the benchmarking
>     results look good on major targets.
>
>   - Tweaking the cost when either side of the matching constraint
>     is a hardware register.  Before this patch, the constraint
>     copy is simply taken as a real move insn for pref and
>     conflict cost with one hardware register; after this patch,
>     several input operands are allowed to respect the same
>     matching constraint (but in different alternatives), so we
>     should treat it like a shuffle copy in some cases to avoid
>     over preferring/disparaging.
>
> Please check the PR comments for more details.
>
> This patch can be bootstrapped & regtested on
> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
> was attached in the PR, and I think the new assembly looks
> improved (as expected).
>
> With -Ofast and unrolling, this patch helps improve SPEC2017
> bmk 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
> 508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without
> any notable degradations.
>
> Since this patch likely benefits x86_64 and aarch64, but I don't
> have performance machines for these arches at hand, could someone
> kindly help to benchmark it if possible?
I can help test it on Intel cascade lake and AMD milan.
>
> Many thanks in advance!
>
> btw, you can simply ignore the part about the parameter
> ira-consider-dup-in-all-alts (its name/description); it's somewhat
> stale, and I've left it for now since we will likely get rid of it.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
>         parameter.
>         * ira.c (ira_get_dup_out_num): Adjust as parameter
>         param_ira_consider_dup_in_all_alts.
>         * params.opt (ira-consider-dup-in-all-alts): New.
>         * ira-conflicts.c (process_regs_for_copy): Add one parameter
>         single_input_op_has_cstr_p.
>         (get_freq_for_shuffle_copy): New function.
>         (add_insn_allocno_copies): Adjust as single_input_op_has_cstr_p.
>         * ira-int.h (ira_get_dup_out_num): Add one bool parameter.



-- 
BR,
Hongtao


* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  7:12   ` Hongtao Liu
@ 2021-06-28  7:20     ` Hongtao Liu
  2021-06-28  7:27       ` Kewen.Lin
  0 siblings, 1 reply; 16+ messages in thread
From: Hongtao Liu @ 2021-06-28  7:20 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >
> > Hi!
> >
> > on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> > > Hi,
> > >
> > > PR100328 has some details about this issue, I am trying to
> > > brief it here.  In the hottest function LBM_performStreamCollideTRT
> > > of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> > > (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
> > > insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
> > > class have 64 registers whose foregoing 32 ones make up the
> > > whole FLOAT_REG.  There are some differences for these two
> > > flavors, taking "*fma<mode>4_fpr" as example:
> > >
> > > (define_insn "*fma<mode>4_fpr"
> > >   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> > >       (fma:SFDF
> > >         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> > >         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> > >         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> > >
> > > // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> > > // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> > >
> > > So for VSX_REG, we only have the destructive form, when VSX_REG
> > > alternative being used, the operand 2 or operand 3 is required
> > > to be the same as operand 0.  reload has to take care of this
> > > constraint and create some non-free register copies if required.
> > >
> > > Assuming one fma insn looks like:
> > >   op0 = FMA (op1, op2, op3)
> > >
> > > The best regclass of them are VSX_REG, when op1,op2,op3 are all dead,
> > > IRA simply creates three shuffle copies for them (here the operand
> > > order matters, since with the same freq, the one with smaller number
> > > takes preference), but IMO both op2 and op3 should take higher priority
> > > in copy queue due to the matching constraint.
> > >
> > > I noticed that there is one function ira_get_dup_out_num, which meant
> > > to create this kind of constraint copy, but the below code looks to
> > > refuse to create if there is an alternative which has valid regclass
> > > without spilled need.
> > >
> > >       default:
> > >       {
> > >         enum constraint_num cn = lookup_constraint (str);
> > >         enum reg_class cl = reg_class_for_constraint (cn);
> > >         if (cl != NO_REGS
> > >             && !targetm.class_likely_spilled_p (cl))
> > >           goto fail
> > >
> > >        ...
> > >
> > > I cooked one patch attached to make ira respect this kind of matching
> > > constraint guarded with one parameter.  As I stated in the PR, I was
> > > not sure this is on the right track.  The RFC patch is to check the
> > > matching constraint in all alternatives, if there is one alternative
> > > with matching constraint and matches the current preferred regclass
> > > (or best of allocno?), it will record the output operand number and
> > > further create one constraint copy for it.  Normally it can get the
> > > priority against shuffle copies and the matching constraint will get
> > > satisfied with higher possibility, reload doesn't create extra copies
> > > to meet the matching constraint or the desirable register class when
> > > it has to.
> > >
> > > For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
> > > as shuffle copies, and later any of A,B,C,D gets assigned by one
> > > hardware register which is a VSX register (VSX_REG) but not a FP
> > > register (FLOAT_REG), which means it has to pay costs once we can NOT
> > > go with VSX alternatives, so at that time it's important to respect
> > > the matching constraint then we can increase the freq for the remaining
> > > copies related to this (A/B, A/C, A/D).  This idea requires some side
> > > tables to record some information and seems a bit complicated in the
> > > current framework, so the proposed patch aggressively emphasizes the
> > > matching constraint at the time of creating copies.
> > >
> >
> > Comparing with the original patch (v1), this patch v3 has
> > considered: (this should be v2 for this mail list, but bump
> > it to be consistent as PR's).
> >
> >   - Excluding the case where for one preferred register class
> >     there can be two or more alternatives, one of them has the
> >     matching constraint, while another doesn't have.  So for
> >     the given operand, even if it's assigned by a hardware reg
> >     which doesn't meet the matching constraint, it can simply
> >     use the alternative which doesn't have matching constraint
> >     so no register move is needed.  One typical case is
> >     define_insn *mov<mode>_internal2 on rs6000.  So we
> >     shouldn't create constraint copy for it.
> >
> >   - The possible free register move in the same register class,
> >     disable this if so since the register move to meet the
> >     constraint is considered as free.
> >
> >   - Making it on by default, suggested by Segher & Vladimir, we
> >     hope to get rid of the parameter if the benchmarking result
> >     looks good on major targets.
> >
> >   - Tweaking cost when either of matching constraint two sides
> >     is hardware register.  Before this patch, the constraint
> >     copy is simply taken as a real move insn for pref and
> >     conflict cost with one hardware register, after this patch,
> >     it's allowed that there are several input operands
> >     respecting the same matching constraint (but in different
> >     alternatives), so we should take it to be like shuffle copy
> >     for some cases to avoid over preferring/disparaging.
> >
> > Please check the PR comments for more details.
> >
> > This patch can be bootstrapped & regtested on
> > powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but have some
> > "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
> > was attached in the PR and thought the new assembly looks
> > improved (expected).
> >
> > With option Ofast unroll, this patch can help to improve SPEC2017
> > bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
> > 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
> > remarkable degradations.
> >
> > Since this patch likely benefits x86_64 and aarch64, but I don't
> > have performance machines with these arches at hand, could
> > someone kindly help to benchmark it if possible?
> I can help test it on Intel cascade lake and AMD milan.
And could you rebase your patch on the latest trunk? I got several
failures when applying the patch:
~ git apply ira-v3.diff
error: patch failed: gcc/doc/invoke.texi:13845
error: gcc/doc/invoke.texi: patch does not apply
error: patch failed: gcc/ira-conflicts.c:233
error: gcc/ira-conflicts.c: patch does not apply
error: patch failed: gcc/ira-int.h:971
error: gcc/ira-int.h: patch does not apply
error: patch failed: gcc/ira.c:1922
error: gcc/ira.c: patch does not apply
error: patch failed: gcc/params.opt:330
error: gcc/params.opt: patch does not apply

> >
> > Many thanks in advance!
> >
> > btw, you can simply ignore the part about parameter
> > ira-consider-dup-in-all-alts (its name/description), it's sort of
> > stale, I let it be for now as we will likely get rid of it.
> >
> > BR,
> > Kewen
> > -----
> > gcc/ChangeLog:
> >
> >         * doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
> >         parameter.
> >         * ira.c (ira_get_dup_out_num): Adjust as parameter
> >         param_ira_consider_dup_in_all_alts.
> >         * params.opt (ira-consider-dup-in-all-alts): New.
> >         * ira-conflicts.c (process_regs_for_copy): Add one parameter
> >         single_input_op_has_cstr_p.
> >         (get_freq_for_shuffle_copy): New function.
> >         (add_insn_allocno_copies): Adjust as single_input_op_has_cstr_p.
> >         * ira-int.h (ira_get_dup_out_num): Add one bool parameter.
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  7:20     ` Hongtao Liu
@ 2021-06-28  7:27       ` Kewen.Lin
  2021-06-30  8:53         ` Hongtao Liu
  2021-06-30 15:42         ` Richard Sandiford
  0 siblings, 2 replies; 16+ messages in thread
From: Kewen.Lin @ 2021-06-28  7:27 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 7222 bytes --]

on 2021/6/28 3:20 PM, Hongtao Liu wrote:
> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi!
>>>
>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> PR100328 has some details about this issue, I am trying to
>>>> brief it here.  In the hottest function LBM_performStreamCollideTRT
>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>>>> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
>>>> class have 64 registers whose foregoing 32 ones make up the
>>>> whole FLOAT_REG.  There are some differences for these two
>>>> flavors, taking "*fma<mode>4_fpr" as example:
>>>>
>>>> (define_insn "*fma<mode>4_fpr"
>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>>>>       (fma:SFDF
>>>>         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>>>>         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>>>>         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>>>
>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>>>
>>>> So for VSX_REG, we only have the destructive form, when VSX_REG
>>>> alternative being used, the operand 2 or operand 3 is required
>>>> to be the same as operand 0.  reload has to take care of this
>>>> constraint and create some non-free register copies if required.
>>>>
>>>> Assuming one fma insn looks like:
>>>>   op0 = FMA (op1, op2, op3)
>>>>
>>>> The best regclass of them are VSX_REG, when op1,op2,op3 are all dead,
>>>> IRA simply creates three shuffle copies for them (here the operand
>>>> order matters, since with the same freq, the one with smaller number
>>>> takes preference), but IMO both op2 and op3 should take higher priority
>>>> in copy queue due to the matching constraint.
>>>>
>>>> I noticed that there is one function ira_get_dup_out_num, which meant
>>>> to create this kind of constraint copy, but the below code looks to
>>>> refuse to create if there is an alternative which has valid regclass
>>>> without spilled need.
>>>>
>>>>       default:
>>>>       {
>>>>         enum constraint_num cn = lookup_constraint (str);
>>>>         enum reg_class cl = reg_class_for_constraint (cn);
>>>>         if (cl != NO_REGS
>>>>             && !targetm.class_likely_spilled_p (cl))
>>>>           goto fail
>>>>
>>>>        ...
>>>>
>>>> I cooked one patch attached to make ira respect this kind of matching
>>>> constraint guarded with one parameter.  As I stated in the PR, I was
>>>> not sure this is on the right track.  The RFC patch is to check the
>>>> matching constraint in all alternatives, if there is one alternative
>>>> with matching constraint and matches the current preferred regclass
>>>> (or best of allocno?), it will record the output operand number and
>>>> further create one constraint copy for it.  Normally it can get the
>>>> priority against shuffle copies and the matching constraint will get
>>>> satisfied with higher possibility, reload doesn't create extra copies
>>>> to meet the matching constraint or the desirable register class when
>>>> it has to.
>>>>
>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
>>>> as shuffle copies, and later any of A,B,C,D gets assigned by one
>>>> hardware register which is a VSX register (VSX_REG) but not a FP
>>>> register (FLOAT_REG), which means it has to pay costs once we can NOT
>>>> go with VSX alternatives, so at that time it's important to respect
>>>> the matching constraint then we can increase the freq for the remaining
>>>> copies related to this (A/B, A/C, A/D).  This idea requires some side
>>>> tables to record some information and seems a bit complicated in the
>>>> current framework, so the proposed patch aggressively emphasizes the
>>>> matching constraint at the time of creating copies.
>>>>
>>>
>>> Comparing with the original patch (v1), this patch v3 has
>>> considered: (this should be v2 for this mail list, but bump
>>> it to be consistent as PR's).
>>>
>>>   - Excluding the case where for one preferred register class
>>>     there can be two or more alternatives, one of them has the
>>>     matching constraint, while another doesn't have.  So for
>>>     the given operand, even if it's assigned by a hardware reg
>>>     which doesn't meet the matching constraint, it can simply
>>>     use the alternative which doesn't have matching constraint
>>>     so no register move is needed.  One typical case is
>>>     define_insn *mov<mode>_internal2 on rs6000.  So we
>>>     shouldn't create constraint copy for it.
>>>
>>>   - The possible free register move in the same register class,
>>>     disable this if so since the register move to meet the
>>>     constraint is considered as free.
>>>
>>>   - Making it on by default, suggested by Segher & Vladimir, we
>>>     hope to get rid of the parameter if the benchmarking result
>>>     looks good on major targets.
>>>
>>>   - Tweaking cost when either of matching constraint two sides
>>>     is hardware register.  Before this patch, the constraint
>>>     copy is simply taken as a real move insn for pref and
>>>     conflict cost with one hardware register, after this patch,
>>>     it's allowed that there are several input operands
>>>     respecting the same matching constraint (but in different
>>>     alternatives), so we should take it to be like shuffle copy
>>>     for some cases to avoid over preferring/disparaging.
>>>
>>> Please check the PR comments for more details.
>>>
>>> This patch can be bootstrapped & regtested on
>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but have some
>>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
>>> was attached in the PR and thought the new assembly looks
>>> improved (expected).
>>>
>>> With option Ofast unroll, this patch can help to improve SPEC2017
>>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
>>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
>>> remarkable degradations.
>>>
>>> Since this patch likely benefits x86_64 and aarch64, but I don't
>>> have performance machines with these arches at hand, could
>>> someone kindly help to benchmark it if possible?
>> I can help test it on Intel cascade lake and AMD milan.


Thanks for your help, Hongtao!


> And could you rebase your patch on the lastest trunk, i got several
> failures when applying the patch
> ~ git apply ira-v3.diff
> error: patch failed: gcc/doc/invoke.texi:13845
> error: gcc/doc/invoke.texi: patch does not apply
> error: patch failed: gcc/ira-conflicts.c:233
> error: gcc/ira-conflicts.c: patch does not apply
> error: patch failed: gcc/ira-int.h:971
> error: gcc/ira-int.h: patch does not apply
> error: patch failed: gcc/ira.c:1922
> error: gcc/ira.c: patch does not apply
> error: patch failed: gcc/params.opt:330
> error: gcc/params.opt: patch does not apply
> 

I think it's due to unexpected git stat lines in the previously attached diff.

I have attached the format-patch file.  Please take a look.  Thanks!


BR,
Kewen

[-- Attachment #2: 0001-ira-Support-more-matching-constraint-forms-with-para.patch --]
[-- Type: text/plain, Size: 15544 bytes --]

From 60271c7ea61f3b958a3497b18f41ed16a25d82eb Mon Sep 17 00:00:00 2001
From: Kewen Lin <linkw@linux.ibm.com>
Date: Mon, 21 Jun 2021 22:51:09 -0500
Subject: [PATCH] ira: Support more matching constraint forms with param

gcc/ChangeLog:

	* doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
	parameter.
	* ira.c (ira_get_dup_out_num): Adjust as parameter
	param_ira_consider_dup_in_all_alts.
	* params.opt (ira-consider-dup-in-all-alts): New.
	* ira-conflicts.c (process_regs_for_copy): Add one parameter
	single_input_op_has_cstr_p.
	(get_freq_for_shuffle_copy): New function.
	(add_insn_allocno_copies): Adjust as single_input_op_has_cstr_p.
	* ira-int.h (ira_get_dup_out_num): Add one bool parameter.
---
 gcc/doc/invoke.texi |   6 +++
 gcc/ira-conflicts.c |  91 +++++++++++++++++++++++++++-------
 gcc/ira-int.h       |   2 +-
 gcc/ira.c           | 118 ++++++++++++++++++++++++++++++++++++++++----
 gcc/params.opt      |   4 ++
 5 files changed, 194 insertions(+), 27 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 510f24e55ab..d54cc991d18 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13845,6 +13845,12 @@ of available registers reserved for some other purposes is given
 by this parameter.  Default of the parameter
 is the best found from numerous experiments.
 
+@item ira-consider-dup-in-all-alts
+Make IRA consider a matching constraint (duplicated operand number)
+heavily if the operand is in its preferred register class, even if
+there is some other alternative with an appropriate register class,
+whether preferred or not.
+
 @item lra-inheritance-ebb-probability-cutoff
 LRA tries to reuse values reloaded in registers in subsequent insns.
 This optimization is called inheritance.  EBB is used as a region to
diff --git a/gcc/ira-conflicts.c b/gcc/ira-conflicts.c
index d83cfc1c1a7..67c4cdcbc8d 100644
--- a/gcc/ira-conflicts.c
+++ b/gcc/ira-conflicts.c
@@ -233,6 +233,15 @@ go_through_subreg (rtx x, int *offset)
   return reg;
 }
 
+/* Return the recomputed frequency for this shuffle copy or similar
+   case; since it's not for a real move insn, make it smaller.  */
+
+static int
+get_freq_for_shuffle_copy (int freq)
+{
+  return freq < 8 ? 1 : freq / 8;
+}
+
 /* Process registers REG1 and REG2 in move INSN with execution
    frequency FREQ.  The function also processes the registers in a
    potential move insn (INSN == NULL in this case) with frequency
@@ -240,12 +249,14 @@ go_through_subreg (rtx x, int *offset)
    corresponding allocnos or create a copy involving the corresponding
    allocnos.  The function does nothing if the both registers are hard
    registers.  When nothing is changed, the function returns
-   FALSE.  */
+   FALSE.  SINGLE_INPUT_OP_HAS_CSTR_P is only meaningful when CONSTRAINT_P
+   is true; see function ira_get_dup_out_num for details.  */
 static bool
-process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
-		       rtx_insn *insn, int freq)
+process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p, rtx_insn *insn,
+		       int freq, bool single_input_op_has_cstr_p = true)
 {
-  int allocno_preferenced_hard_regno, cost, index, offset1, offset2;
+  int allocno_preferenced_hard_regno, index, offset1, offset2;
+  int cost, conflict_cost, move_cost;
   bool only_regs_p;
   ira_allocno_t a;
   reg_class_t rclass, aclass;
@@ -306,9 +317,52 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
     return false;
   ira_init_register_move_cost_if_necessary (mode);
   if (HARD_REGISTER_P (reg1))
-    cost = ira_register_move_cost[mode][aclass][rclass] * freq;
+    move_cost = ira_register_move_cost[mode][aclass][rclass];
+  else
+    move_cost = ira_register_move_cost[mode][rclass][aclass];
+
+  if (!single_input_op_has_cstr_p)
+    {
+      /* When this is a constraint copy and the matching constraint
+	 exists not only for this given operand but also for some
+	 other operand(s), reg1 and reg2 don't have to use the same
+	 hardware register to save the possible move cost, so this
+	 hardware preference isn't required to be fixed.  To avoid
+	 over-preferring this hardware register, and over-disparaging
+	 this hardware register on conflicting objects, we need some
+	 cost tweaking here, similar to what we do for a shuffle
+	 copy.  */
+      gcc_assert (constraint_p);
+      int reduced_freq = get_freq_for_shuffle_copy (freq);
+      if (HARD_REGISTER_P (reg1))
+	/* For reg2 = opcode(reg1, reg3 ...), assume that reg3 is a
+	   pseudo register which has a matching constraint with reg2;
+	   even if reg2 isn't assigned reg1, it's still possible to
+	   have no register moves if reg2 and reg3 use the same
+	   hardware register.  So to avoid the allocation
+	   over-preferring reg1, we can just take it as a shuffle
+	   copy.  */
+	cost = conflict_cost = move_cost * reduced_freq;
+      else
+	{
+	  /* For reg1 = opcode(reg2, reg3 ...), assume that reg3 is a
+	     pseudo register which has a matching constraint with
+	     reg2.  To save the register move, it's better to assign
+	     reg1 to either reg2 or reg3 (or another pseudo like
+	     reg3), so it's reasonable to use freq for the cost.  But
+	     for conflict_cost, since reg2 and reg3 conflict with
+	     each other, both of them have the chance to be assigned
+	     reg1; assuming reg3 has one copy which also conflicts
+	     with reg2, we shouldn't make it less preferred on reg1
+	     since reg3 has the same chance to be assigned reg1.  So
+	     we adjust the conflict_cost to make it the same as what
+	     we use for a shuffle copy.  */
+	  cost = move_cost * freq;
+	  conflict_cost = move_cost * reduced_freq;
+	}
+    }
   else
-    cost = ira_register_move_cost[mode][rclass][aclass] * freq;
+    cost = conflict_cost = move_cost * freq;
+
   do
     {
       ira_allocate_and_set_costs
@@ -317,7 +371,7 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
       ira_allocate_and_set_costs
 	(&ALLOCNO_CONFLICT_HARD_REG_COSTS (a), aclass, 0);
       ALLOCNO_HARD_REG_COSTS (a)[index] -= cost;
-      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= cost;
+      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= conflict_cost;
       if (ALLOCNO_HARD_REG_COSTS (a)[index] < ALLOCNO_CLASS_COST (a))
 	ALLOCNO_CLASS_COST (a) = ALLOCNO_HARD_REG_COSTS (a)[index];
       ira_add_allocno_pref (a, allocno_preferenced_hard_regno, freq);
@@ -420,7 +474,8 @@ add_insn_allocno_copies (rtx_insn *insn)
       operand = recog_data.operand[i];
       if (! REG_SUBREG_P (operand))
 	continue;
-      if ((n = ira_get_dup_out_num (i, alts)) >= 0)
+      bool single_input_op_has_cstr_p;
+      if ((n = ira_get_dup_out_num (i, alts, single_input_op_has_cstr_p)) >= 0)
 	{
 	  bound_p[n] = true;
 	  dup = recog_data.operand[n];
@@ -429,8 +484,8 @@ add_insn_allocno_copies (rtx_insn *insn)
 				REG_P (operand)
 				? operand
 				: SUBREG_REG (operand)) != NULL_RTX)
-	    process_regs_for_copy (operand, dup, true, NULL,
-				   freq);
+	    process_regs_for_copy (operand, dup, true, NULL, freq,
+				   single_input_op_has_cstr_p);
 	}
     }
   for (i = 0; i < recog_data.n_operands; i++)
@@ -440,13 +495,15 @@ add_insn_allocno_copies (rtx_insn *insn)
 	  && find_reg_note (insn, REG_DEAD,
 			    REG_P (operand)
 			    ? operand : SUBREG_REG (operand)) != NULL_RTX)
-	/* If an operand dies, prefer its hard register for the output
-	   operands by decreasing the hard register cost or creating
-	   the corresponding allocno copies.  The cost will not
-	   correspond to a real move insn cost, so make the frequency
-	   smaller.  */
-	process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
-			      bound_p);
+	{
+	  /* If an operand dies, prefer its hard register for the output
+	     operands by decreasing the hard register cost or creating
+	     the corresponding allocno copies.  The cost will not
+	     correspond to a real move insn cost, so make the frequency
+	     smaller.  */
+	  int new_freq = get_freq_for_shuffle_copy (freq);
+	  process_reg_shuffles (insn, operand, i, new_freq, bound_p);
+	}
     }
 }
 
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 31e013b0461..da748626e31 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -971,7 +971,7 @@ extern void ira_debug_disposition (void);
 extern void ira_debug_allocno_classes (void);
 extern void ira_init_register_move_cost (machine_mode);
 extern alternative_mask ira_setup_alts (rtx_insn *);
-extern int ira_get_dup_out_num (int, alternative_mask);
+extern int ira_get_dup_out_num (int, alternative_mask, bool &);
 
 /* ira-build.c */
 
diff --git a/gcc/ira.c b/gcc/ira.c
index 638ef4ea17e..75033a45963 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1922,9 +1922,15 @@ ira_setup_alts (rtx_insn *insn)
 /* Return the number of the output non-early clobber operand which
    should be the same in any case as operand with number OP_NUM (or
    negative value if there is no such operand).  ALTS is the mask
-   of alternatives that we should consider.  */
+   of alternatives that we should consider.  SINGLE_INPUT_OP_HAS_CSTR_P
+   is set by this function; it indicates whether there is only
+   a single input operand which has the matching constraint on the
+   output operand at the returned position.  If the pattern allows
+   any one of several input operands to hold the matching
+   constraint, it's set to false.  One typical case is the FMA insn
+   on rs6000.  */
 int
-ira_get_dup_out_num (int op_num, alternative_mask alts)
+ira_get_dup_out_num (int op_num, alternative_mask alts,
+		     bool &single_input_op_has_cstr_p)
 {
   int curr_alt, c, original;
   bool ignore_p, use_commut_op_p;
@@ -1937,10 +1943,42 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
     return -1;
   str = recog_data.constraints[op_num];
   use_commut_op_p = false;
+  single_input_op_has_cstr_p = true;
+
+  rtx op = recog_data.operand[op_num];
+  int op_no = reg_or_subregno (op);
+  enum reg_class op_pref_cl = reg_preferred_class (op_no);
+  machine_mode op_mode = GET_MODE (op);
+
+  ira_init_register_move_cost_if_necessary (op_mode);
+  /* If the preferred regclass isn't NO_REGS, continue to look for the
+     matching constraint in all available alternatives with the preferred
+     regclass, even if we have found (or will find) an alternative whose
+     constraint stands for a non-NO_REGS regclass.  Note that it would be
+     fine not to respect a matching constraint if the register copy is
+     free, so exclude that case.  */
+  bool respect_dup_despite_reg_cstr
+    = param_ira_consider_dup_in_all_alts
+      && op_pref_cl != NO_REGS
+      && ira_register_move_cost[op_mode][op_pref_cl][op_pref_cl] > 0;
+
+  /* Record the alternatives whose constraint uses the same regclass as
+     the preferred regclass.  Later, if we find one matching constraint
+     for this operand with the preferred regclass, we will visit these
+     recorded alternatives to check whether there is one alternative in
+     which no input operand has a matching constraint identical to our
+     candidate.  If so, there is one alternative which is perfectly fine
+     without satisfying this matching constraint, so give up.  If not,
+     every alternative has some other input operand holding this matching
+     constraint, so it's fine to respect this matching constraint and
+     further create the constraint copy, since it becomes harmless once
+     some other operand takes preference and it's interfered.  */
+  alternative_mask pref_cl_alts;
+
   for (;;)
     {
-      rtx op = recog_data.operand[op_num];
-      
+      pref_cl_alts = 0;
+
       for (curr_alt = 0, ignore_p = !TEST_BIT (alts, curr_alt),
 	   original = -1;;)
 	{
@@ -1963,9 +2001,25 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		{
 		  enum constraint_num cn = lookup_constraint (str);
 		  enum reg_class cl = reg_class_for_constraint (cn);
-		  if (cl != NO_REGS
-		      && !targetm.class_likely_spilled_p (cl))
-		    goto fail;
+		  if (cl != NO_REGS && !targetm.class_likely_spilled_p (cl))
+		    {
+		      if (respect_dup_despite_reg_cstr)
+			{
+			  /* If it's free to move from the preferred class
+			     to a class without the matching constraint,
+			     there is no need to pay any cost to respect
+			     it.  */
+			  if (cl != op_pref_cl
+			      && (ira_reg_class_intersect[cl][op_pref_cl]
+				  != NO_REGS)
+			      && (ira_may_move_in_cost[op_mode][op_pref_cl][cl]
+				  == 0))
+			    goto fail;
+			  else if (cl == op_pref_cl)
+			    pref_cl_alts |= ALTERNATIVE_BIT (curr_alt);
+			}
+		      else
+			goto fail;
+		    }
 		  if (constraint_satisfied_p (op, cn))
 		    goto fail;
 		  break;
@@ -1979,7 +2033,21 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		  str = end;
 		  if (original != -1 && original != n)
 		    goto fail;
-		  original = n;
+		  gcc_assert (n < recog_data.n_operands);
+		  if (respect_dup_despite_reg_cstr)
+		    {
+		      const operand_alternative *op_alt
+			= &recog_op_alt[curr_alt * recog_data.n_operands];
+		      /* Only respect the one with the preferred regclass;
+			 without this check it's possible to pick up one
+			 whose regclass isn't preferred first, which would
+			 then fail even though other alternatives with the
+			 preferred regclass exist.  */
+		      if (op_alt[n].cl == op_pref_cl)
+			original = n;
+		    }
+		  else
+		    original = n;
 		  continue;
 		}
 	      }
@@ -1988,7 +2056,39 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
       if (original == -1)
 	goto fail;
       if (recog_data.operand_type[original] == OP_OUT)
-	return original;
+	{
+	  if (pref_cl_alts == 0)
+	    return original;
+	  /* Visit these recorded alternatives to check whether
+	     there is an alternative in which no input operand has
+	     a matching constraint identical to our candidate.
+	     Give up this candidate if so.  */
+	  int nop, nalt;
+	  for (nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+	    {
+	      if (!TEST_BIT (pref_cl_alts, nalt))
+		continue;
+	      const operand_alternative *op_alt
+		= &recog_op_alt[nalt * recog_data.n_operands];
+	      bool dup_in_other = false;
+	      for (nop = 0; nop < recog_data.n_operands; nop++)
+		{
+		  if (recog_data.operand_type[nop] != OP_IN)
+		    continue;
+		  if (nop == op_num)
+		    continue;
+		  if (op_alt[nop].matches == original)
+		    {
+		      dup_in_other = true;
+		      break;
+		    }
+		}
+	      if (!dup_in_other)
+		return -1;
+	    }
+	  single_input_op_has_cstr_p = false;
+	  return original;
+	}
     fail:
       if (use_commut_op_p)
 	break;
diff --git a/gcc/params.opt b/gcc/params.opt
index 18e6036c4f4..5121e3ddc80 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -330,6 +330,10 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.
 
+-param=ira-consider-dup-in-all-alts=
+Common Joined UInteger Var(param_ira_consider_dup_in_all_alts) Init(1) IntegerRange(0, 1) Param Optimization
Make IRA respect a matching constraint (duplicated operand number) heavily when the operand has the preferred register class, even if some other alternative offers an appropriate register class, whether preferred or not.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  7:27       ` Kewen.Lin
@ 2021-06-30  8:53         ` Hongtao Liu
  2021-06-30  9:42           ` Kewen.Lin
  2021-06-30 15:42         ` Richard Sandiford
  1 sibling, 1 reply; 16+ messages in thread
From: Hongtao Liu @ 2021-06-30  8:53 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

On Mon, Jun 28, 2021 at 3:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> > on 2021/6/28 3:20 PM, Hongtao Liu wrote:
> > On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >>
> >> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>
> >>> Hi!
> >>>
> >>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> >>>> Hi,
> >>>>
> >>>> PR100328 has some details about this issue, I am trying to
> >>>> brief it here.  In the hottest function LBM_performStreamCollideTRT
> >>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> >>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
> >>>> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
> >>>> class have 64 registers whose foregoing 32 ones make up the
> >>>> whole FLOAT_REG.  There are some differences for these two
> >>>> flavors, taking "*fma<mode>4_fpr" as example:
> >>>>
> >>>> (define_insn "*fma<mode>4_fpr"
> >>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> >>>>       (fma:SFDF
> >>>>         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> >>>>         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> >>>>         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> >>>>
> >>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> >>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> >>>>
> >>>> So for VSX_REG, we only have the destructive form, when VSX_REG
> >>>> alternative being used, the operand 2 or operand 3 is required
> >>>> to be the same as operand 0.  reload has to take care of this
> >>>> constraint and create some non-free register copies if required.
> >>>>
> >>>> Assuming one fma insn looks like:
> >>>>   op0 = FMA (op1, op2, op3)
> >>>>
> >>>> The best regclass of them are VSX_REG, when op1,op2,op3 are all dead,
> >>>> IRA simply creates three shuffle copies for them (here the operand
> >>>> order matters, since with the same freq, the one with smaller number
> >>>> takes preference), but IMO both op2 and op3 should take higher priority
> >>>> in copy queue due to the matching constraint.
> >>>>
> >>>> I noticed that there is one function ira_get_dup_out_num, which meant
> >>>> to create this kind of constraint copy, but the below code looks to
> >>>> refuse to create if there is an alternative which has valid regclass
> >>>> without spilled need.
> >>>>
> >>>>       default:
> >>>>       {
> >>>>         enum constraint_num cn = lookup_constraint (str);
> >>>>         enum reg_class cl = reg_class_for_constraint (cn);
> >>>>         if (cl != NO_REGS
> >>>>             && !targetm.class_likely_spilled_p (cl))
> >>>>           goto fail
> >>>>
> >>>>        ...
> >>>>
> >>>> I cooked one patch attached to make ira respect this kind of matching
> >>>> constraint guarded with one parameter.  As I stated in the PR, I was
> >>>> not sure this is on the right track.  The RFC patch is to check the
> >>>> matching constraint in all alternatives, if there is one alternative
> >>>> with matching constraint and matches the current preferred regclass
> >>>> (or best of allocno?), it will record the output operand number and
> >>>> further create one constraint copy for it.  Normally it can get the
> >>>> priority against shuffle copies and the matching constraint will get
> >>>> satisfied with higher possibility, reload doesn't create extra copies
> >>>> to meet the matching constraint or the desirable register class when
> >>>> it has to.
> >>>>
> >>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
> >>>> as shuffle copies, and later any of A,B,C,D gets assigned by one
> >>>> hardware register which is a VSX register (VSX_REG) but not a FP
> >>>> register (FLOAT_REG), which means it has to pay costs once we can NOT
> >>>> go with VSX alternatives, so at that time it's important to respect
> >>>> the matching constraint then we can increase the freq for the remaining
> >>>> copies related to this (A/B, A/C, A/D).  This idea requires some side
> >>>> tables to record some information and seems a bit complicated in the
> >>>> current framework, so the proposed patch aggressively emphasizes the
> >>>> matching constraint at the time of creating copies.
> >>>>
> >>>
> >>> Comparing with the original patch (v1), this patch v3 has
> >>> considered: (this should be v2 for this mail list, but bump
> >>> it to be consistent as PR's).
> >>>
> >>>   - Excluding the case where for one preferred register class
> >>>     there can be two or more alternatives, one of them has the
> >>>     matching constraint, while another doesn't have.  So for
> >>>     the given operand, even if it's assigned by a hardware reg
> >>>     which doesn't meet the matching constraint, it can simply
> >>>     use the alternative which doesn't have matching constraint
> >>>     so no register move is needed.  One typical case is
> >>>     define_insn *mov<mode>_internal2 on rs6000.  So we
> >>>     shouldn't create constraint copy for it.
> >>>
> >>>   - The possible free register move in the same register class,
> >>>     disable this if so since the register move to meet the
> >>>     constraint is considered as free.
> >>>
> >>>   - Making it on by default, suggested by Segher & Vladimir, we
> >>>     hope to get rid of the parameter if the benchmarking result
> >>>     looks good on major targets.
> >>>
> >>>   - Tweaking cost when either of matching constraint two sides
> >>>     is hardware register.  Before this patch, the constraint
> >>>     copy is simply taken as a real move insn for pref and
> >>>     conflict cost with one hardware register, after this patch,
> >>>     it's allowed that there are several input operands
> >>>     respecting the same matching constraint (but in different
> >>>     alternatives), so we should take it to be like shuffle copy
> >>>     for some cases to avoid over preferring/disparaging.
> >>>
> >>> Please check the PR comments for more details.
> >>>
> >>> This patch can be bootstrapped & regtested on
> >>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but have some
> >>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
> >>> was attached in the PR and thought the new assembly looks
> >>> improved (expected).
> >>>
> >>> With option Ofast unroll, this patch can help to improve SPEC2017
> >>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
> >>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
> >>> remarkable degradations.

Here's SPEC2017  rate result tested on AMD milan
option is: -march=znver2 -Ofast -funroll-loops  -mfpmath=sse -flto

fprate:
      503.bwaves_r                 0.01    (A)  shliclel219
      507.cactuBSSN_r             -0.19    (A)  shliclel219
      508.namd_r                   0.02    (A)  shliclel219
      510.parest_r                -0.68    (A)  shliclel219
      511.povray_r                 1.59    (A)  shliclel219
      521.wrf_r                    0.19    (A)  shliclel219
      526.blender_r                0.68    (A)  shliclel219
      527.cam4_r                  -0.30    (A)  shliclel219
      538.imagick_r               -3.81 <- (A)  shliclel219
      544.nab_r                    0.02    (A)  shliclel219
      549.fotonik3d_r              0.02    (A)  shliclel219
      554.roms_r                  -0.43    (A)  shliclel219
      997.specrand_fr             -3.80 <- (A)  shliclel219
                                    Geometric mean:  -0.52
intrate:
      500.perlbench_r             -1.54    (A)  shliclel219
      502.gcc_r                   -0.38    (A)  shliclel219
      505.mcf_r                   -0.10    (A)  shliclel219
      520.omnetpp_r               -0.24    (A)  shliclel219
      523.xalancbmk_r             -1.04    (A)  shliclel219
      525.x264_r                   0.31    (A)  shliclel219
      531.deepsjeng_r             -0.02    (A)  shliclel219
      541.leela_r                  0.95    (A)  shliclel219
      548.exchange2_r              0.08    (A)  shliclel219
      557.xz_r                    -0.40    (A)  shliclel219
                                    Geometric mean:  -0.24
> >>>
> >>> Since this patch likely benefits x86_64 and aarch64, but I don't
> >>> have performance machines with these arches at hand, could
> >>> someone kindly help to benchmark it if possible?
> >> I can help test it on Intel cascade lake and AMD milan.
>
>
> Thanks for your help, Hongtao!
>
>
> > And could you rebase your patch on the lastest trunk, i got several
> > failures when applying the patch
> > ~ git apply ira-v3.diff
> > error: patch failed: gcc/doc/invoke.texi:13845
> > error: gcc/doc/invoke.texi: patch does not apply
> > error: patch failed: gcc/ira-conflicts.c:233
> > error: gcc/ira-conflicts.c: patch does not apply
> > error: patch failed: gcc/ira-int.h:971
> > error: gcc/ira-int.h: patch does not apply
> > error: patch failed: gcc/ira.c:1922
> > error: gcc/ira.c: patch does not apply
> > error: patch failed: gcc/params.opt:330
> > error: gcc/params.opt: patch does not apply
> >
>
> I think it's due to unexpected git stat lines in previously attached diff.
>
> I have attached the format-patch file.  Please have a check.  Thanks!
>
>
> BR,
> Kewen



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-30  8:53         ` Hongtao Liu
@ 2021-06-30  9:42           ` Kewen.Lin
  2021-06-30 10:18             ` Hongtao Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Kewen.Lin @ 2021-06-30  9:42 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

on 2021/6/30 4:53 PM, Hongtao Liu wrote:
> 
> Here's SPEC2017  rate result tested on AMD milan
> option is: -march=znver2 -Ofast -funroll-loops  -mfpmath=sse -flto
> 
> fprate:
>       503.bwaves_r                 0.01    (A)  shliclel219
>       507.cactuBSSN_r             -0.19    (A)  shliclel219
>       508.namd_r                   0.02    (A)  shliclel219
>       510.parest_r                -0.68    (A)  shliclel219
>       511.povray_r                 1.59    (A)  shliclel219
>       521.wrf_r                    0.19    (A)  shliclel219
>       526.blender_r                0.68    (A)  shliclel219
>       527.cam4_r                  -0.30    (A)  shliclel219
>       538.imagick_r               -3.81 <- (A)  shliclel219
>       544.nab_r                    0.02    (A)  shliclel219
>       549.fotonik3d_r              0.02    (A)  shliclel219
>       554.roms_r                  -0.43    (A)  shliclel219
>       997.specrand_fr             -3.80 <- (A)  shliclel219
>                                     Geometric mean:  -0.52
> intrate:
>       500.perlbench_r             -1.54    (A)  shliclel219
>       502.gcc_r                   -0.38    (A)  shliclel219
>       505.mcf_r                   -0.10    (A)  shliclel219
>       520.omnetpp_r               -0.24    (A)  shliclel219
>       523.xalancbmk_r             -1.04    (A)  shliclel219
>       525.x264_r                   0.31    (A)  shliclel219
>       531.deepsjeng_r             -0.02    (A)  shliclel219
>       541.leela_r                  0.95    (A)  shliclel219
>       548.exchange2_r              0.08    (A)  shliclel219
>       557.xz_r                    -0.40    (A)  shliclel219
>                                     Geometric mean:  -0.24


Roger, thanks!  The results don't look good, so I think I'll disable
it for target x86_64 in the next version.  By the way, bmk 519.lbm_r
seems to be missing; just curious, is that because it failed to build
even with the baseline?

BR,
Kewen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-30  9:42           ` Kewen.Lin
@ 2021-06-30 10:18             ` Hongtao Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Hongtao Liu @ 2021-06-30 10:18 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Vladimir Makarov, bergner, Bill Schmidt,
	Segher Boessenkool, Richard Sandiford

On Wed, Jun 30, 2021 at 5:42 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Roger, thanks!  The result doesn't look good; I think I'll disable it
> for target x86_64 in the next version.  By the way, bmk 519.lbm_r
> seemed to be missing; just curious whether that's because it failed
> to build even with the baseline?
519.lbm_r           0  ------    ------    BuildSame on milan

here is fprate on CLX:
      503.bwaves_r               -0.12
      507.cactuBSSN_r            -0.02
      508.namd_r                 -0.57
      510.parest_r                0.40
      511.povray_r               -0.37
      519.lbm_r                   0.10
      521.wrf_r                   0.61
      526.blender_r              -0.50
      527.cam4_r                 -0.45
      538.imagick_r              -6.61 <-
      544.nab_r                  -0.11
      549.fotonik3d_r             0.16
      554.roms_r                  0.22
      997.specrand_fr            -0.18

And something is broken on my local Cascade Lake machine, so the
intrate test results for CLX will come later.
>
> BR,
> Kewen



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  6:26 ` [RFC/PATCH v3] ira: Support more matching constraint forms " Kewen.Lin
  2021-06-28  7:12   ` Hongtao Liu
@ 2021-06-30 15:24   ` Vladimir Makarov
  2021-07-02  2:11     ` [PATCH v4] " Kewen.Lin
  2021-06-30 15:25   ` [RFC/PATCH v3] " Vladimir Makarov
  2 siblings, 1 reply; 16+ messages in thread
From: Vladimir Makarov @ 2021-06-30 15:24 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches
  Cc: bergner, Bill Schmidt, Segher Boessenkool, Richard Sandiford, crazylht


On 2021-06-28 2:26 a.m., Kewen.Lin wrote:
> Hi!
>
> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> PR100328 has some details about this issue, I am trying to
>> brief it here.  In the hottest function LBM_performStreamCollideTRT
>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>> insn has two flavors: FLOAT_REG and VSX_REG; the VSX_REG register
>> class has 64 registers, the first 32 of which make up the whole of
>> FLOAT_REG.  There are some differences between these two
>> flavors, taking "*fma<mode>4_fpr" as example:
>>
>> (define_insn "*fma<mode>4_fpr"
>>    [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>> 	(fma:SFDF
>> 	  (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>> 	  (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>> 	  (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>
>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>
>> So for VSX_REG we only have the destructive form: when a VSX_REG
>> alternative is used, operand 2 or operand 3 is required
>> to be the same as operand 0.  reload has to take care of this
>> constraint and create some non-free register copies if required.
>>
>> Assuming one fma insn looks like:
>>    op0 = FMA (op1, op2, op3)
>>
>> The best register class for them is VSX_REG.  When op1, op2, op3 are
>> all dead, IRA simply creates three shuffle copies for them (here the
>> operand order matters: with the same freq, the one with the smaller
>> number takes preference), but IMO both op2 and op3 should take higher
>> priority in the copy queue due to the matching constraint.
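For illustration, source code of the following shape produces exactly such insns; with contraction enabled, each multiply-add becomes one op0 = FMA (op1, op2, op3).  This is a hypothetical reduction in the spirit of the lbm hot loop, not the actual benchmark code:

```c
/* Hypothetical reduction of an lbm-style hot loop: with contraction
   enabled (e.g. -Ofast), each statement below maps to one FMA-style
   insn, op0 = FMA (op1, op2, op3).  */
void
fma_chain (double *restrict out, const double *restrict a,
           const double *restrict b, const double *restrict c, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = a[i] * b[i] + c[i];
}
```

When only the destructive VSX alternatives apply, the allocator must tie one of the inputs to the output register for every such statement, which is where these copy priorities matter.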
>>
>> I noticed that there is one function, ira_get_dup_out_num, which is
>> meant to create this kind of constraint copy, but the code below
>> looks to refuse to create one if there is an alternative whose
>> register class is valid and not likely spilled.
>>
>>        default:
>> 	{
>> 	  enum constraint_num cn = lookup_constraint (str);
>> 	  enum reg_class cl = reg_class_for_constraint (cn);
>> 	  if (cl != NO_REGS
>> 	      && !targetm.class_likely_spilled_p (cl))
>> 	    goto fail;
>>
>> 	 ...
>>
>> I cooked up one patch (attached) to make IRA respect this kind of
>> matching constraint, guarded by one parameter.  As I stated in the
>> PR, I was not sure this is on the right track.  The RFC patch checks
>> the matching constraint in all alternatives; if there is one
>> alternative with a matching constraint that matches the current
>> preferred regclass (or best of allocno?), it records the output
>> operand number and further creates one constraint copy for it.
>> Normally it then gets priority over shuffle copies and the matching
>> constraint is satisfied with higher probability, so reload doesn't
>> create extra copies to meet the matching constraint or the desirable
>> register class when it has to.
>>
>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first stay
>> as shuffle copies; later, if any of A,B,C,D gets assigned a hardware
>> register which is a VSX register (VSX_REG) but not an FP register
>> (FLOAT_REG), it has to pay costs once we can NOT go with the VSX
>> alternatives, so at that time it's important to respect the matching
>> constraint, and we can increase the freq for the remaining related
>> copies (A/B, A/C, A/D).  This idea requires some side tables to
>> record information and seems a bit complicated in the current
>> framework, so the proposed patch aggressively emphasizes the matching
>> constraint at the time of creating copies.
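To make this concrete: a matching constraint appears as a digit in an operand's constraint string, one alternative per comma-separated position.  A toy scan in that spirit — a deliberately simplified sketch, not the real ira_get_dup_out_num, which also handles modifiers such as '%', '=', '+' and multi-letter constraint classification — could look like:

```c
#include <ctype.h>

/* Return the matched output operand number if any alternative of
   CONSTRAINT is a matching constraint '0'..'9', otherwise -1.
   E.g. for operand 3 of *fma<mode>4_fpr, "<Ff>,0,wa" matches
   operand 0 in its second alternative.  */
static int
find_matching_operand (const char *constraint)
{
  for (const char *p = constraint; *p != '\0'; p++)
    if (isdigit ((unsigned char) *p))
      return *p - '0';
  return -1;
}
```

For the fma pattern above, operands 2 and 3 each report a match against operand 0 in one of the VSX alternatives, which is why the patch wants their copies prioritized.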
>>
> Compared with the original patch (v1), this patch v3 has considered
> the following (this should be v2 for this mailing list, but I bumped
> it to be consistent with the PR's numbering):
>
>    - Excluding the case where, for one preferred register class,
>      there can be two or more alternatives, one of which has the
>      matching constraint while another doesn't.  For such an operand,
>      even if it's assigned a hardware reg which doesn't meet the
>      matching constraint, it can simply use the alternative without
>      the matching constraint, so no register move is needed.  One
>      typical case is define_insn *mov<mode>_internal2 on rs6000, so
>      we shouldn't create a constraint copy for it.
>
>    - Disabling this when a free register move within the same
>      register class is possible, since the register move needed to
>      meet the constraint is then considered free.
>
>    - Making it on by default, as suggested by Segher & Vladimir; we
>      hope to get rid of the parameter if the benchmarking results
>      look good on major targets.
>
>    - Tweaking the cost when either side of the matching constraint is
>      a hardware register.  Before this patch, the constraint copy was
>      simply taken as a real move insn for preference and conflict
>      cost with one hardware register; after this patch, several input
>      operands are allowed to respect the same matching constraint
>      (but in different alternatives), so in some cases we should
>      treat it like a shuffle copy to avoid over-preferring/disparaging.
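The last bullet can be sketched roughly like this; both the helper name and the numbers are illustrative assumptions, not the actual get_freq_for_shuffle_copy logic in ira-conflicts.c:

```c
/* Illustrative sketch only: when several input operands could satisfy
   the same matching constraint (in different alternatives), charge the
   constraint copy like a discounted shuffle copy rather than a full
   move, so that no single input is over-preferred or over-disparaged.  */
static int
constraint_copy_freq (int insn_freq, int n_matching_inputs)
{
  if (n_matching_inputs <= 1)
    /* Only one candidate input: the tie is unavoidable, so charge the
       full move frequency.  */
    return insn_freq;
  /* Several candidates: split the weight among them (the divisor is an
     arbitrary illustration, not the patch's actual formula).  */
  return insn_freq / n_matching_inputs;
}
```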
>
> Please check the PR comments for more details.
>
> This patch can be bootstrapped & regtested on
> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but shows some
> "XFAIL->XPASS" results on aarch64-linux-gnu.  The list was attached
> in the PR, and the new assembly looks improved (as expected).
>
> With -Ofast and unrolling, this patch improves SPEC2017
> 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
> 508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without any
> notable degradations.
>
> Since this patch likely benefits x86_64 and aarch64, but I don't
> have performance machines with these arches at hand, could
> someone kindly help to benchmark it if possible?
>
> Many thanks in advance!
>
> BTW, you can simply ignore the part about the parameter
> ira-consider-dup-in-all-alts (its name/description); it's somewhat
> stale, and I've left it as-is for now since we will likely get rid of it.

Kewen, thank you for addressing the remarks on the previous version of
the patch.  The patch is OK to commit with some minor changes:

o The comment for function ira_get_dup_out_num does not mention the
effect of the param on the function's return value, nor on the return
value of single_input_op_has_cstr_p; IMHO this makes the function
interface description misleading.

o It would still be nice to rename op_no to op_regno in
ira_get_dup_out_num.

It is OK to commit the patch to mainline on the condition that you
submit a patch switching off the parameter for x86-64 right after it,
as Hongtao Liu has shown its negative effect on x86-64 SPEC2017.

Thank you again for working on this issue.

> gcc/ChangeLog:
>
> 	* doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
> 	parameter.
> 	* ira.c (ira_get_dup_out_num): Adjust as parameter
> 	param_ira_consider_dup_in_all_alts.
> 	* params.opt (ira-consider-dup-in-all-alts): New.
> 	* ira-conflicts.c (process_regs_for_copy): Add one parameter
> 	single_input_op_has_cstr_p.
> 	(get_freq_for_shuffle_copy): New function.
> 	(add_insn_allocno_copies): Adjust as single_input_op_has_cstr_p.
> 	* ira-int.h (ira_get_dup_out_num): Add one bool parameter.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-28  7:27       ` Kewen.Lin
  2021-06-30  8:53         ` Hongtao Liu
@ 2021-06-30 15:42         ` Richard Sandiford
  2021-07-02  2:18           ` Kewen.Lin
  1 sibling, 1 reply; 16+ messages in thread
From: Richard Sandiford @ 2021-06-30 15:42 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Hongtao Liu, GCC Patches, Vladimir Makarov, bergner,
	Bill Schmidt, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2021/6/28 3:20 PM, Hongtao Liu wrote:
>> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>>
>>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>>>> Hi,
>>>>>
>>>>> PR100328 has some details about this issue, I am trying to
>>>>> brief it here.  In the hottest function LBM_performStreamCollideTRT
>>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>>>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>>>>> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
>>>>> class have 64 registers whose foregoing 32 ones make up the
>>>>> whole FLOAT_REG.  There are some differences for these two
>>>>> flavors, taking "*fma<mode>4_fpr" as example:
>>>>>
>>>>> (define_insn "*fma<mode>4_fpr"
>>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>>>>>       (fma:SFDF
>>>>>         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>>>>>         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>>>>>         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>>>>
>>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>>>>
>>>>> So for VSX_REG we only have the destructive form: when a VSX_REG
>>>>> alternative is used, operand 2 or operand 3 is required
>>>>> to be the same as operand 0.  reload has to take care of this
>>>>> constraint and create some non-free register copies if required.
>>>>>
>>>>> Assuming one fma insn looks like:
>>>>>   op0 = FMA (op1, op2, op3)
>>>>>
>>>>> The best register class for them is VSX_REG.  When op1, op2, op3 are
>>>>> all dead, IRA simply creates three shuffle copies for them (here the
>>>>> operand order matters: with the same freq, the one with the smaller
>>>>> number takes preference), but IMO both op2 and op3 should take higher
>>>>> priority in the copy queue due to the matching constraint.
>>>>>
>>>>> I noticed that there is one function, ira_get_dup_out_num, which is
>>>>> meant to create this kind of constraint copy, but the code below
>>>>> looks to refuse to create one if there is an alternative whose
>>>>> register class is valid and not likely spilled.
>>>>>
>>>>>       default:
>>>>>       {
>>>>>         enum constraint_num cn = lookup_constraint (str);
>>>>>         enum reg_class cl = reg_class_for_constraint (cn);
>>>>>         if (cl != NO_REGS
>>>>>             && !targetm.class_likely_spilled_p (cl))
>>>>>           goto fail;
>>>>>
>>>>>        ...
>>>>>
>>>>> I cooked up one patch (attached) to make IRA respect this kind of
>>>>> matching constraint, guarded by one parameter.  As I stated in the
>>>>> PR, I was not sure this is on the right track.  The RFC patch checks
>>>>> the matching constraint in all alternatives; if there is one
>>>>> alternative with a matching constraint that matches the current
>>>>> preferred regclass (or best of allocno?), it records the output
>>>>> operand number and further creates one constraint copy for it.
>>>>> Normally it then gets priority over shuffle copies and the matching
>>>>> constraint is satisfied with higher probability, so reload doesn't
>>>>> create extra copies to meet the matching constraint or the desirable
>>>>> register class when it has to.
>>>>>
>>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first stay
>>>>> as shuffle copies; later, if any of A,B,C,D gets assigned a hardware
>>>>> register which is a VSX register (VSX_REG) but not an FP register
>>>>> (FLOAT_REG), it has to pay costs once we can NOT go with the VSX
>>>>> alternatives, so at that time it's important to respect the matching
>>>>> constraint, and we can increase the freq for the remaining related
>>>>> copies (A/B, A/C, A/D).  This idea requires some side tables to
>>>>> record information and seems a bit complicated in the current
>>>>> framework, so the proposed patch aggressively emphasizes the matching
>>>>> constraint at the time of creating copies.
>>>>>
>>>>
>>>> Compared with the original patch (v1), this patch v3 has considered
>>>> the following (this should be v2 for this mailing list, but I bumped
>>>> it to be consistent with the PR's numbering):
>>>>
>>>>   - Excluding the case where, for one preferred register class,
>>>>     there can be two or more alternatives, one of which has the
>>>>     matching constraint while another doesn't.  For such an operand,
>>>>     even if it's assigned a hardware reg which doesn't meet the
>>>>     matching constraint, it can simply use the alternative without
>>>>     the matching constraint, so no register move is needed.  One
>>>>     typical case is define_insn *mov<mode>_internal2 on rs6000, so
>>>>     we shouldn't create a constraint copy for it.
>>>>
>>>>   - Disabling this when a free register move within the same
>>>>     register class is possible, since the register move needed to
>>>>     meet the constraint is then considered free.
>>>>
>>>>   - Making it on by default, as suggested by Segher & Vladimir; we
>>>>     hope to get rid of the parameter if the benchmarking results
>>>>     look good on major targets.
>>>>
>>>>   - Tweaking the cost when either side of the matching constraint is
>>>>     a hardware register.  Before this patch, the constraint copy was
>>>>     simply taken as a real move insn for preference and conflict
>>>>     cost with one hardware register; after this patch, several input
>>>>     operands are allowed to respect the same matching constraint
>>>>     (but in different alternatives), so in some cases we should
>>>>     treat it like a shuffle copy to avoid over-preferring/disparaging.
>>>>
>>>> Please check the PR comments for more details.
>>>>
>>>> This patch can be bootstrapped & regtested on
>>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but shows some
>>>> "XFAIL->XPASS" results on aarch64-linux-gnu.  The list was attached
>>>> in the PR, and the new assembly looks improved (as expected).
>>>>
>>>> With -Ofast and unrolling, this patch improves SPEC2017
>>>> 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
>>>> 508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without any
>>>> notable degradations.
>>>>
>>>> Since this patch likely benefits x86_64 and aarch64, but I don't
>>>> have performance machines with these arches at hand, could
>>>> someone kindly help to benchmark it if possible?
>>> I can help test it on Intel cascade lake and AMD milan.
>
>
> Thanks for your help, Hongtao!
>
>
>> And could you rebase your patch on the latest trunk? I got several
>> failures when applying the patch:
>> ~ git apply ira-v3.diff
>> error: patch failed: gcc/doc/invoke.texi:13845
>> error: gcc/doc/invoke.texi: patch does not apply
>> error: patch failed: gcc/ira-conflicts.c:233
>> error: gcc/ira-conflicts.c: patch does not apply
>> error: patch failed: gcc/ira-int.h:971
>> error: gcc/ira-int.h: patch does not apply
>> error: patch failed: gcc/ira.c:1922
>> error: gcc/ira.c: patch does not apply
>> error: patch failed: gcc/params.opt:330
>> error: gcc/params.opt: patch does not apply
>> 
>
> I think it's due to unexpected git stat lines in the previously attached diff.
>
> I have attached the format-patch file.  Please have a check.  Thanks!

FWIW, this seems to be neutral for SPEC 2017 on AArch64.  The SVE
XFAIL->XPASS transitions mean it's definitely a good thing for
AArch64 in that respect though.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4] ira: Support more matching constraint forms with param [PR100328]
  2021-06-30 15:24   ` Vladimir Makarov
@ 2021-07-02  2:11     ` Kewen.Lin
  2021-07-02  2:28       ` [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328] Kewen.Lin
  2021-07-05 13:04       ` [PATCH v4] ira: Support more matching constraint forms with param [PR100328] Vladimir Makarov
  0 siblings, 2 replies; 16+ messages in thread
From: Kewen.Lin @ 2021-07-02  2:11 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: bergner, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	crazylht, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 7413 bytes --]

Hi Vladimir,

on 2021/6/30 11:24 PM, Vladimir Makarov wrote:
> 
> On 2021-06-28 2:26 a.m., Kewen.Lin wrote:
>> Hi!
>>
>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> PR100328 has some details about this issue, I am trying to
>>> brief it here.  In the hottest function LBM_performStreamCollideTRT
>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>>> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
>>> class have 64 registers whose foregoing 32 ones make up the
>>> whole FLOAT_REG.  There are some differences for these two
>>> flavors, taking "*fma<mode>4_fpr" as example:
>>>
>>> (define_insn "*fma<mode>4_fpr"
>>>    [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>>>     (fma:SFDF
>>>       (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>>>       (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>>>       (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>>
>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>>
>>> So for VSX_REG we only have the destructive form: when a VSX_REG
>>> alternative is used, operand 2 or operand 3 is required
>>> to be the same as operand 0.  reload has to take care of this
>>> constraint and create some non-free register copies if required.
>>>
>>> Assuming one fma insn looks like:
>>>    op0 = FMA (op1, op2, op3)
>>>
>>> The best register class for them is VSX_REG.  When op1, op2, op3 are
>>> all dead, IRA simply creates three shuffle copies for them (here the
>>> operand order matters: with the same freq, the one with the smaller
>>> number takes preference), but IMO both op2 and op3 should take higher
>>> priority in the copy queue due to the matching constraint.
>>>
>>> I noticed that there is one function, ira_get_dup_out_num, which is
>>> meant to create this kind of constraint copy, but the code below
>>> looks to refuse to create one if there is an alternative whose
>>> register class is valid and not likely spilled.
>>>
>>>        default:
>>>     {
>>>       enum constraint_num cn = lookup_constraint (str);
>>>       enum reg_class cl = reg_class_for_constraint (cn);
>>>       if (cl != NO_REGS
>>>           && !targetm.class_likely_spilled_p (cl))
>>>         goto fail;
>>>
>>>      ...
>>>
>>> I cooked up one patch (attached) to make IRA respect this kind of
>>> matching constraint, guarded by one parameter.  As I stated in the
>>> PR, I was not sure this is on the right track.  The RFC patch checks
>>> the matching constraint in all alternatives; if there is one
>>> alternative with a matching constraint that matches the current
>>> preferred regclass (or best of allocno?), it records the output
>>> operand number and further creates one constraint copy for it.
>>> Normally it then gets priority over shuffle copies and the matching
>>> constraint is satisfied with higher probability, so reload doesn't
>>> create extra copies to meet the matching constraint or the desirable
>>> register class when it has to.
>>>
>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first stay
>>> as shuffle copies; later, if any of A,B,C,D gets assigned a hardware
>>> register which is a VSX register (VSX_REG) but not an FP register
>>> (FLOAT_REG), it has to pay costs once we can NOT go with the VSX
>>> alternatives, so at that time it's important to respect the matching
>>> constraint, and we can increase the freq for the remaining related
>>> copies (A/B, A/C, A/D).  This idea requires some side tables to
>>> record information and seems a bit complicated in the current
>>> framework, so the proposed patch aggressively emphasizes the matching
>>> constraint at the time of creating copies.
>>>
>> Compared with the original patch (v1), this patch v3 has considered
>> the following (this should be v2 for this mailing list, but I bumped
>> it to be consistent with the PR's numbering):
>>
>>    - Excluding the case where for one preferred register class
>>      there can be two or more alternatives, one of them has the
>>      matching constraint, while another doesn't have.  So for
>>      the given operand, even if it's assigned by a hardware reg
>>      which doesn't meet the matching constraint, it can simply
>>      use the alternative which doesn't have matching constraint
>>      so no register move is needed.  One typical case is
>>      define_insn *mov<mode>_internal2 on rs6000.  So we
>>      shouldn't create constraint copy for it.
>>
>>    - Disabling this for a possibly free register move within the
>>      same register class, since the register move to meet the
>>      constraint is then considered free.
>>
>>    - Making it on by default, as suggested by Segher & Vladimir;
>>      we hope to get rid of the parameter if the benchmarking
>>      results look good on major targets.
>>
>>    - Tweaking the cost when either side of the matching constraint
>>      is a hardware register.  Before this patch, the constraint
>>      copy was simply taken as a real move insn for the preference
>>      and conflict cost with one hardware register.  After this
>>      patch, several input operands are allowed to respect the same
>>      matching constraint (but in different alternatives), so we
>>      should treat it like a shuffle copy in some cases to avoid
>>      over-preferring/disparaging.
>>
>> Please check the PR comments for more details.
>>
>> This patch can be bootstrapped & regtested on
>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
>> was attached in the PR, and I think the new assembly looks
>> improved (as expected).
>>
>> With option Ofast unroll, this patch helps to improve SPEC2017
>> bmk 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
>> 508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without
>> any remarkable degradations.
>>
>> This patch likely benefits x86_64 and aarch64 as well, but I
>> don't have performance machines for these arches at hand; could
>> someone kindly help to benchmark it if possible?
>>
>> Many thanks in advance!
>>
>> btw, you can simply ignore the part about the parameter
>> ira-consider-dup-in-all-alts (its name/description); it's somewhat
>> stale, and I'm leaving it for now as we will likely get rid of it.
> 
> Kewen, thank you for addressing remarks for the previous version of the patch.  The patch is ok to commit with some minor changes:
> 
> o In the comment for function ira_get_dup_out_num, there is no mention of the effect of the param on the function's return value and on the return value of single_input_op_has_cstr_p; imho this creates a wrong function interface description.
> 
> o It would still be nice to change the name op_no to op_regno in ira_get_dup_out_num.
> 
> It is ok to commit the patch to the mainline on condition that you submit a patch switching off the parameter for x86-64 right after that, as Hongtao Liu has shown its negative effect on x86-64 SPEC2017.
> 

Many thanks for your review!  I've updated the patch according to your comments and also polished some comments and documentation wording a bit.  Does it look better to you?


BR,
Kewen

[-- Attachment #2: 0001-ira-Support-more-matching-constraint-forms-with-para.patch --]
[-- Type: text/plain, Size: 43782 bytes --]

From 2a338d5160676e58cf13b3e83bd131725bcb4234 Mon Sep 17 00:00:00 2001
From: Kewen Lin <linkw@linux.ibm.com>
Date: Mon, 21 Jun 2021 22:51:09 -0500
Subject: [PATCH 1/2] ira: Support more matching constraint forms with param
 [PR100328]

This patch makes IRA consider the matching constraint heavily:
even if there is at least one other alternative with a non-NO_REG
register class constraint, it continues and checks the matching
constraint in all available alternatives and respects the
matching constraint with the preferred register class.

One typical case is the destructive FMA style instruction on rs6000.
Without this patch, for the mentioned FMA instruction, IRA won't
respect the matching constraint on VSX_REG since there are some
alternatives with FLOAT_REG which don't have a matching constraint.
This can cause extra register copies since reload later has to
generate code to respect the constraint.  This patch makes IRA
respect this matching constraint on VSX_REG, which is the preferred
regclass, but it excludes some cases where for one preferred
register class there can be two or more alternatives, one of which
has the matching constraint while another doesn't.  It also
considers the possibility of a free register copy.

With option Ofast unroll, this patch helps to improve SPEC2017
bmk 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without any
remarkable degradations.  It also shows improvements on SVE, as the
testcase changes show and as Richard confirmed.

Bootstrapped & regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.

gcc/ChangeLog:

	PR rtl-optimization/100328
	* doc/invoke.texi (ira-consider-dup-in-all-alts): Document new
	parameter.
	* ira.c (ira_get_dup_out_num): Adjust for parameter
	param_ira_consider_dup_in_all_alts.
	* params.opt (ira-consider-dup-in-all-alts): New.
	* ira-conflicts.c (process_regs_for_copy): Add one parameter
	single_input_op_has_cstr_p.
	(get_freq_for_shuffle_copy): New function.
	(add_insn_allocno_copies): Adjust for single_input_op_has_cstr_p.
	* ira-int.h (ira_get_dup_out_num): Add one bool parameter.

gcc/testsuite/ChangeLog:

	PR rtl-optimization/100328
	* gcc.target/aarch64/sve/acle/asm/div_f16.c: Remove one xfail.
	* gcc.target/aarch64/sve/acle/asm/div_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/div_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/divr_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/divr_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/divr_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mad_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mad_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mad_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mla_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mla_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mla_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mls_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mls_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mls_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/msb_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/msb_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/msb_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mulx_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mulx_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/mulx_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmad_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmad_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmad_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmla_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmla_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmla_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmls_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmls_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmls_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmsb_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmsb_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/nmsb_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/sub_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/sub_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/sub_f64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/subr_f16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/subr_f32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/subr_f64.c: Likewise.
---
 gcc/doc/invoke.texi                           |  10 ++
 gcc/ira-conflicts.c                           |  93 ++++++++++---
 gcc/ira-int.h                                 |   2 +-
 gcc/ira.c                                     | 128 ++++++++++++++++--
 gcc/params.opt                                |   4 +
 .../gcc.target/aarch64/sve/acle/asm/div_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/div_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/div_f64.c |   2 +-
 .../aarch64/sve/acle/asm/divr_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/divr_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/divr_f64.c           |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mad_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mad_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mad_f64.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mla_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mla_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mla_f64.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mls_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mls_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/mls_f64.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/msb_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/msb_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/msb_f64.c |   2 +-
 .../aarch64/sve/acle/asm/mulx_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/mulx_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/mulx_f64.c           |   2 +-
 .../aarch64/sve/acle/asm/nmad_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/nmad_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/nmad_f64.c           |   2 +-
 .../aarch64/sve/acle/asm/nmla_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/nmla_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/nmla_f64.c           |   2 +-
 .../aarch64/sve/acle/asm/nmls_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/nmls_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/nmls_f64.c           |   2 +-
 .../aarch64/sve/acle/asm/nmsb_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/nmsb_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/nmsb_f64.c           |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/sub_f16.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/sub_f32.c |   2 +-
 .../gcc.target/aarch64/sve/acle/asm/sub_f64.c |   2 +-
 .../aarch64/sve/acle/asm/subr_f16.c           |   2 +-
 .../aarch64/sve/acle/asm/subr_f32.c           |   2 +-
 .../aarch64/sve/acle/asm/subr_f64.c           |   2 +-
 44 files changed, 248 insertions(+), 67 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index a9fd5fdc104..f470fc6be58 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13917,6 +13917,16 @@ of available registers reserved for some other purposes is given
 by this parameter.  Default of the parameter
 is the best found from numerous experiments.
 
+@item ira-consider-dup-in-all-alts
+Make IRA consider the matching constraint (duplicated operand number)
+heavily in all available alternatives for the preferred register class.
+If it is set to zero, IRA only respects the matching constraint when
+it is in the only available alternative with an appropriate register
+class.  Otherwise, IRA checks all available alternatives for the
+preferred register class, even if it has already found some choice
+with an appropriate register class, and respects the qualified
+matching constraint that it finds.
+
 @item lra-inheritance-ebb-probability-cutoff
 LRA tries to reuse values reloaded in registers in subsequent insns.
 This optimization is called inheritance.  EBB is used as a region to
diff --git a/gcc/ira-conflicts.c b/gcc/ira-conflicts.c
index d83cfc1c1a7..86c6f242f18 100644
--- a/gcc/ira-conflicts.c
+++ b/gcc/ira-conflicts.c
@@ -233,19 +233,30 @@ go_through_subreg (rtx x, int *offset)
   return reg;
 }
 
+/* Return the recomputed frequency for this shuffle copy or similar
+   case; since it's not for a real move insn, make it smaller.  */
+
+static int
+get_freq_for_shuffle_copy (int freq)
+{
+  return freq < 8 ? 1 : freq / 8;
+}
+
 /* Process registers REG1 and REG2 in move INSN with execution
    frequency FREQ.  The function also processes the registers in a
    potential move insn (INSN == NULL in this case) with frequency
    FREQ.  The function can modify hard register costs of the
    corresponding allocnos or create a copy involving the corresponding
    allocnos.  The function does nothing if the both registers are hard
-   registers.  When nothing is changed, the function returns
-   FALSE.  */
+   registers.  When nothing is changed, the function returns FALSE.
+   SINGLE_INPUT_OP_HAS_CSTR_P is only meaningful when CONSTRAINT_P
+   is true; see function ira_get_dup_out_num for its meaning.  */
 static bool
-process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
-		       rtx_insn *insn, int freq)
+process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p, rtx_insn *insn,
+		       int freq, bool single_input_op_has_cstr_p = true)
 {
-  int allocno_preferenced_hard_regno, cost, index, offset1, offset2;
+  int allocno_preferenced_hard_regno, index, offset1, offset2;
+  int cost, conflict_cost, move_cost;
   bool only_regs_p;
   ira_allocno_t a;
   reg_class_t rclass, aclass;
@@ -306,9 +317,52 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
     return false;
   ira_init_register_move_cost_if_necessary (mode);
   if (HARD_REGISTER_P (reg1))
-    cost = ira_register_move_cost[mode][aclass][rclass] * freq;
+    move_cost = ira_register_move_cost[mode][aclass][rclass];
+  else
+    move_cost = ira_register_move_cost[mode][rclass][aclass];
+
+  if (!single_input_op_has_cstr_p)
+    {
+      /* When this is a constraint copy and the matching constraint
+	 doesn't exist only for this given operand but also for some
+	 other operand(s), saving the possible move cost does NOT
+	 require reg1 and reg2 to use the same hardware register, so
+	 this hardware preference isn't required to be fixed.  To
+	 avoid over-preferring this hardware register, and
+	 over-disparaging it on conflicting objects, we need some
+	 cost tweaking here, similar to what we do for a shuffle
+	 copy.  */
+      gcc_assert (constraint_p);
+      int reduced_freq = get_freq_for_shuffle_copy (freq);
+      if (HARD_REGISTER_P (reg1))
+	/* For reg2 = opcode(reg1, reg3 ...), assume that reg3 is a
+	   pseudo register which has a matching constraint on reg2;
+	   even if reg2 isn't assigned reg1, it's still possible to
+	   have no register moves if reg2 and reg3 use the same
+	   hardware register.  So to avoid the allocation over
+	   preferring reg1, we can just take it as a shuffle copy.  */
+	cost = conflict_cost = move_cost * reduced_freq;
+      else
+	{
+	  /* For reg1 = opcode(reg2, reg3 ...), assume that reg3 is a
+	     pseudo register which has a matching constraint on reg2;
+	     to save the register move, it's better to assign reg1
+	     to either of reg2 and reg3 (or one of the other pseudos
+	     like reg3), so it's reasonable to use freq for the cost.
+	     But for conflict_cost, since reg2 and reg3 conflict with
+	     each other, both of them have the chance to be assigned
+	     reg1; assume reg3 has one copy which also conflicts
+	     with reg2, we shouldn't make it less preferred on reg1
+	     since reg3 has the same chance to be assigned reg1.
+	     So adjust the conflict_cost to make it the same as what
+	     we use for a shuffle copy.  */
+	  cost = move_cost * freq;
+	  conflict_cost = move_cost * reduced_freq;
+	}
+    }
   else
-    cost = ira_register_move_cost[mode][rclass][aclass] * freq;
+    cost = conflict_cost = move_cost * freq;
+
   do
     {
       ira_allocate_and_set_costs
@@ -317,7 +371,7 @@ process_regs_for_copy (rtx reg1, rtx reg2, bool constraint_p,
       ira_allocate_and_set_costs
 	(&ALLOCNO_CONFLICT_HARD_REG_COSTS (a), aclass, 0);
       ALLOCNO_HARD_REG_COSTS (a)[index] -= cost;
-      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= cost;
+      ALLOCNO_CONFLICT_HARD_REG_COSTS (a)[index] -= conflict_cost;
       if (ALLOCNO_HARD_REG_COSTS (a)[index] < ALLOCNO_CLASS_COST (a))
 	ALLOCNO_CLASS_COST (a) = ALLOCNO_HARD_REG_COSTS (a)[index];
       ira_add_allocno_pref (a, allocno_preferenced_hard_regno, freq);
@@ -420,7 +474,8 @@ add_insn_allocno_copies (rtx_insn *insn)
       operand = recog_data.operand[i];
       if (! REG_SUBREG_P (operand))
 	continue;
-      if ((n = ira_get_dup_out_num (i, alts)) >= 0)
+      bool single_input_op_has_cstr_p;
+      if ((n = ira_get_dup_out_num (i, alts, single_input_op_has_cstr_p)) >= 0)
 	{
 	  bound_p[n] = true;
 	  dup = recog_data.operand[n];
@@ -429,8 +484,8 @@ add_insn_allocno_copies (rtx_insn *insn)
 				REG_P (operand)
 				? operand
 				: SUBREG_REG (operand)) != NULL_RTX)
-	    process_regs_for_copy (operand, dup, true, NULL,
-				   freq);
+	    process_regs_for_copy (operand, dup, true, NULL, freq,
+				   single_input_op_has_cstr_p);
 	}
     }
   for (i = 0; i < recog_data.n_operands; i++)
@@ -440,13 +495,15 @@ add_insn_allocno_copies (rtx_insn *insn)
 	  && find_reg_note (insn, REG_DEAD,
 			    REG_P (operand)
 			    ? operand : SUBREG_REG (operand)) != NULL_RTX)
-	/* If an operand dies, prefer its hard register for the output
-	   operands by decreasing the hard register cost or creating
-	   the corresponding allocno copies.  The cost will not
-	   correspond to a real move insn cost, so make the frequency
-	   smaller.  */
-	process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
-			      bound_p);
+	{
+	  /* If an operand dies, prefer its hard register for the output
+	     operands by decreasing the hard register cost or creating
+	     the corresponding allocno copies.  The cost will not
+	     correspond to a real move insn cost, so make the frequency
+	     smaller.  */
+	  int new_freq = get_freq_for_shuffle_copy (freq);
+	  process_reg_shuffles (insn, operand, i, new_freq, bound_p);
+	}
     }
 }
 
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 31e013b0461..da748626e31 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -971,7 +971,7 @@ extern void ira_debug_disposition (void);
 extern void ira_debug_allocno_classes (void);
 extern void ira_init_register_move_cost (machine_mode);
 extern alternative_mask ira_setup_alts (rtx_insn *);
-extern int ira_get_dup_out_num (int, alternative_mask);
+extern int ira_get_dup_out_num (int, alternative_mask, bool &);
 
 /* ira-build.c */
 
diff --git a/gcc/ira.c b/gcc/ira.c
index 638ef4ea17e..866fb98f2e5 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -1922,9 +1922,25 @@ ira_setup_alts (rtx_insn *insn)
 /* Return the number of the output non-early clobber operand which
    should be the same in any case as operand with number OP_NUM (or
    negative value if there is no such operand).  ALTS is the mask
-   of alternatives that we should consider.  */
+   of alternatives that we should consider.  SINGLE_INPUT_OP_HAS_CSTR_P
+   should be set in this function; it indicates whether there is only
+   a single input operand which has the matching constraint on the
+   output operand at the position specified in the return value.  If
+   the pattern allows any one of several input operands to hold the
+   matching constraint, it's set to false; one typical case is the
+   destructive FMA instruction on target rs6000.  Note that for a
+   non-NO_REG preferred register class with no free register move
+   copy, if the parameter PARAM_IRA_CONSIDER_DUP_IN_ALL_ALTS is set
+   to one, this function checks all available alternatives for
+   matching constraints, even if it has found or will find one
+   alternative with a non-NO_REG regclass, so it can respect more
+   cases with matching constraints.  If the parameter is set to
+   zero, SINGLE_INPUT_OP_HAS_CSTR_P is always true, and the search
+   for the matching constraint stops once it hits some alternative
+   with some non-NO_REG regclass.  */
 int
-ira_get_dup_out_num (int op_num, alternative_mask alts)
+ira_get_dup_out_num (int op_num, alternative_mask alts,
+		     bool &single_input_op_has_cstr_p)
 {
   int curr_alt, c, original;
   bool ignore_p, use_commut_op_p;
@@ -1937,10 +1953,42 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
     return -1;
   str = recog_data.constraints[op_num];
   use_commut_op_p = false;
+  single_input_op_has_cstr_p = true;
+
+  rtx op = recog_data.operand[op_num];
+  int op_regno = reg_or_subregno (op);
+  enum reg_class op_pref_cl = reg_preferred_class (op_regno);
+  machine_mode op_mode = GET_MODE (op);
+
+  ira_init_register_move_cost_if_necessary (op_mode);
+  /* If the preferred regclass isn't NO_REG, continue to find the matching
+     constraint in all available alternatives with preferred regclass, even
+     if we have found or will find one alternative whose constraint stands
+     for a REG (non-NO_REG) regclass.  Note that it would be fine not to
+     respect matching constraint if the register copy is free, so exclude
+     it.  */
+  bool respect_dup_despite_reg_cstr
+    = param_ira_consider_dup_in_all_alts
+      && op_pref_cl != NO_REGS
+      && ira_register_move_cost[op_mode][op_pref_cl][op_pref_cl] > 0;
+
+  /* Record the alternatives whose constraints use the same regclass as
+     the preferred regclass.  Later, if we find one matching constraint
+     for this operand with the preferred regclass, we will visit these
+     recorded alternatives to check whether there is one alternative in
+     which no INPUT operand has a matching constraint the same as our
+     candidate.  If yes, there is one alternative which is perfectly
+     fine without satisfying this matching constraint.  If no, in every
+     alternative some other INPUT operand holds this matching
+     constraint, so it's fine to respect this matching constraint and
+     further create this constraint copy, since it becomes harmless
+     once some other operand takes preference and it's interfered.  */
+  alternative_mask pref_cl_alts;
+
   for (;;)
     {
-      rtx op = recog_data.operand[op_num];
-      
+      pref_cl_alts = 0;
+
       for (curr_alt = 0, ignore_p = !TEST_BIT (alts, curr_alt),
 	   original = -1;;)
 	{
@@ -1963,9 +2011,25 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		{
 		  enum constraint_num cn = lookup_constraint (str);
 		  enum reg_class cl = reg_class_for_constraint (cn);
-		  if (cl != NO_REGS
-		      && !targetm.class_likely_spilled_p (cl))
-		    goto fail;
+		  if (cl != NO_REGS && !targetm.class_likely_spilled_p (cl))
+		    {
+		      if (respect_dup_despite_reg_cstr)
+			{
+			  /* If it's free to move from the preferred class to
+			     the one without the matching constraint, there is
+			     no need to respect this constraint with costs.  */
+			  if (cl != op_pref_cl
+			      && (ira_reg_class_intersect[cl][op_pref_cl]
+				  != NO_REGS)
+			      && (ira_may_move_in_cost[op_mode][op_pref_cl][cl]
+				  == 0))
+			    goto fail;
+			  else if (cl == op_pref_cl)
+			    pref_cl_alts |= ALTERNATIVE_BIT (curr_alt);
+			}
+		      else
+			goto fail;
+		    }
 		  if (constraint_satisfied_p (op, cn))
 		    goto fail;
 		  break;
@@ -1979,7 +2043,21 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
 		  str = end;
 		  if (original != -1 && original != n)
 		    goto fail;
-		  original = n;
+		  gcc_assert (n < recog_data.n_operands);
+		  if (respect_dup_despite_reg_cstr)
+		    {
+		      const operand_alternative *op_alt
+			= &recog_op_alt[curr_alt * recog_data.n_operands];
+		      /* Only respect the one with the preferred rclass;
+			 without respect_dup_despite_reg_cstr it's possible
+			 to pick one whose regclass isn't preferred first,
+			 but that would fail since there should be other
+			 alternatives with the preferred regclass.  */
+		      if (op_alt[n].cl == op_pref_cl)
+			original = n;
+		    }
+		  else
+		    original = n;
 		  continue;
 		}
 	      }
@@ -1988,7 +2066,39 @@ ira_get_dup_out_num (int op_num, alternative_mask alts)
       if (original == -1)
 	goto fail;
       if (recog_data.operand_type[original] == OP_OUT)
-	return original;
+	{
+	  if (pref_cl_alts == 0)
+	    return original;
+	  /* Visit these recorded alternatives to check whether
+	     there is one alternative in which no INPUT operand
+	     has a matching constraint the same as our candidate.
+	     Give up this candidate if so.  */
+	  int nop, nalt;
+	  for (nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+	    {
+	      if (!TEST_BIT (pref_cl_alts, nalt))
+		continue;
+	      const operand_alternative *op_alt
+		= &recog_op_alt[nalt * recog_data.n_operands];
+	      bool dup_in_other = false;
+	      for (nop = 0; nop < recog_data.n_operands; nop++)
+		{
+		  if (recog_data.operand_type[nop] != OP_IN)
+		    continue;
+		  if (nop == op_num)
+		    continue;
+		  if (op_alt[nop].matches == original)
+		    {
+		      dup_in_other = true;
+		      break;
+		    }
+		}
+	      if (!dup_in_other)
+		return -1;
+	    }
+	  single_input_op_has_cstr_p = false;
+	  return original;
+	}
     fail:
       if (use_commut_op_p)
 	break;
diff --git a/gcc/params.opt b/gcc/params.opt
index 18e6036c4f4..577cd42c173 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -330,6 +330,10 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.
 
+-param=ira-consider-dup-in-all-alts=
+Common Joined UInteger Var(param_ira_consider_dup_in_all_alts) Init(1) IntegerRange(0, 1) Param Optimization
+Make IRA consider the matching constraint (duplicated operand number) heavily in all available alternatives for the preferred register class.  If it is set to zero, IRA only respects the matching constraint when it's in the only available alternative with an appropriate register class.  Otherwise, IRA checks all available alternatives for the preferred register class even if it has already found some choice with an appropriate register class, and respects the qualified matching constraint that it finds.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f16.c
index 35f5c158911..8bcd094c996 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f16.c
@@ -218,7 +218,7 @@ TEST_UNIFORM_ZD (div_h4_f16_x_tied1, svfloat16_t, __fp16,
 		 z0 = svdiv_x (p0, z0, d4))
 
 /*
-** div_h4_f16_x_untied: { xfail *-*-* }
+** div_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fdivr	z0\.h, p0/m, z0\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f32.c
index 40cc203da67..546c61dc783 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f32.c
@@ -218,7 +218,7 @@ TEST_UNIFORM_ZD (div_s4_f32_x_tied1, svfloat32_t, float,
 		 z0 = svdiv_x (p0, z0, d4))
 
 /*
-** div_s4_f32_x_untied: { xfail *-*-* }
+** div_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fdivr	z0\.s, p0/m, z0\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f64.c
index 56acbbe9550..1e24bc26484 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_f64.c
@@ -218,7 +218,7 @@ TEST_UNIFORM_ZD (div_d4_f64_x_tied1, svfloat64_t, double,
 		 z0 = svdiv_x (p0, z0, d4))
 
 /*
-** div_d4_f64_x_untied: { xfail *-*-* }
+** div_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fdivr	z0\.d, p0/m, z0\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f16.c
index 03cc0343bd2..e293be65a06 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f16.c
@@ -239,7 +239,7 @@ TEST_UNIFORM_ZD (divr_h4_f16_x_tied1, svfloat16_t, __fp16,
 		 z0 = svdivr_x (p0, z0, d4))
 
 /*
-** divr_h4_f16_x_untied: { xfail *-*-* }
+** divr_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fdiv	z0\.h, p0/m, z0\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f32.c
index c2b65fc33fa..04a7ac40bb2 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f32.c
@@ -239,7 +239,7 @@ TEST_UNIFORM_ZD (divr_s4_f32_x_tied1, svfloat32_t, float,
 		 z0 = svdivr_x (p0, z0, d4))
 
 /*
-** divr_s4_f32_x_untied: { xfail *-*-* }
+** divr_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fdiv	z0\.s, p0/m, z0\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f64.c
index 0a72a37b1d5..bef1a9b059c 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/divr_f64.c
@@ -239,7 +239,7 @@ TEST_UNIFORM_ZD (divr_d4_f64_x_tied1, svfloat64_t, double,
 		 z0 = svdivr_x (p0, z0, d4))
 
 /*
-** divr_d4_f64_x_untied: { xfail *-*-* }
+** divr_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fdiv	z0\.d, p0/m, z0\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f16.c
index 7656f9e5410..4b3148419c5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mad_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svmad_x (p0, z1, z0, d4))
 
 /*
-** mad_h4_f16_x_untied: { xfail *-*-* }
+** mad_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fmla	z0\.h, p0/m, z1\.h, z2\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f32.c
index dbdd2b9d10b..d5dbc85d5a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mad_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svmad_x (p0, z1, z0, d4))
 
 /*
-** mad_s4_f32_x_untied: { xfail *-*-* }
+** mad_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fmla	z0\.s, p0/m, z1\.s, z2\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f64.c
index 978281295e8..7b5dc22826e 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mad_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mad_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svmad_x (p0, z1, z0, d4))
 
 /*
-** mad_d4_f64_x_untied: { xfail *-*-* }
+** mad_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fmla	z0\.d, p0/m, z1\.d, z2\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f16.c
index f22a582efa6..d32ce5845d1 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mla_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svmla_x (p0, z1, z0, d4))
 
 /*
-** mla_h4_f16_x_untied: { xfail *-*-* }
+** mla_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fmad	z0\.h, p0/m, z2\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f32.c
index 1d95eb0a724..d10ba69a53e 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mla_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svmla_x (p0, z1, z0, d4))
 
 /*
-** mla_s4_f32_x_untied: { xfail *-*-* }
+** mla_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fmad	z0\.s, p0/m, z2\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f64.c
index 74fd2926710..94c1e0b0753 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mla_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mla_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svmla_x (p0, z1, z0, d4))
 
 /*
-** mla_d4_f64_x_untied: { xfail *-*-* }
+** mla_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fmad	z0\.d, p0/m, z2\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f16.c
index 87fba3da7ff..b58104d5eaf 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mls_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svmls_x (p0, z1, z0, d4))
 
 /*
-** mls_h4_f16_x_untied: { xfail *-*-* }
+** mls_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fmsb	z0\.h, p0/m, z2\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f32.c
index 04ce1ec46e0..7d6e60519b0 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mls_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svmls_x (p0, z1, z0, d4))
 
 /*
-** mls_s4_f32_x_untied: { xfail *-*-* }
+** mls_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fmsb	z0\.s, p0/m, z2\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f64.c
index 1e2108af671..a6ed28eec5c 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mls_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (mls_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svmls_x (p0, z1, z0, d4))
 
 /*
-** mls_d4_f64_x_untied: { xfail *-*-* }
+** mls_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fmsb	z0\.d, p0/m, z2\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f16.c
index fe11457c4f8..894961a9ec5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (msb_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svmsb_x (p0, z1, z0, d4))
 
 /*
-** msb_h4_f16_x_untied: { xfail *-*-* }
+** msb_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fmls	z0\.h, p0/m, z1\.h, z2\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f32.c
index f7a9f2767e8..0d0915958a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (msb_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svmsb_x (p0, z1, z0, d4))
 
 /*
-** msb_s4_f32_x_untied: { xfail *-*-* }
+** msb_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fmls	z0\.s, p0/m, z1\.s, z2\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f64.c
index e3ff414d81a..52dc3968e24 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/msb_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (msb_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svmsb_x (p0, z1, z0, d4))
 
 /*
-** msb_d4_f64_x_untied: { xfail *-*-* }
+** msb_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fmls	z0\.d, p0/m, z1\.d, z2\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f16.c
index ce02c3caa39..b8d6bf5d92c 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f16.c
@@ -303,7 +303,7 @@ TEST_UNIFORM_ZD (mulx_h4_f16_x_tied1, svfloat16_t, __fp16,
 		 z0 = svmulx_x (p0, z0, d4))
 
 /*
-** mulx_h4_f16_x_untied: { xfail *-*-* }
+** mulx_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fmulx	z0\.h, p0/m, z0\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f32.c
index e0d3695932c..b8f5c1310d7 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f32.c
@@ -303,7 +303,7 @@ TEST_UNIFORM_ZD (mulx_s4_f32_x_tied1, svfloat32_t, float,
 		 z0 = svmulx_x (p0, z0, d4))
 
 /*
-** mulx_s4_f32_x_untied: { xfail *-*-* }
+** mulx_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fmulx	z0\.s, p0/m, z0\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f64.c
index 6af5703ffaf..746cc94143d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mulx_f64.c
@@ -303,7 +303,7 @@ TEST_UNIFORM_ZD (mulx_d4_f64_x_tied1, svfloat64_t, double,
 		 z0 = svmulx_x (p0, z0, d4))
 
 /*
-** mulx_d4_f64_x_untied: { xfail *-*-* }
+** mulx_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fmulx	z0\.d, p0/m, z0\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f16.c
index abfe0a0c056..92e0664e647 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmad_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svnmad_x (p0, z1, z0, d4))
 
 /*
-** nmad_h4_f16_x_untied: { xfail *-*-* }
+** nmad_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fnmla	z0\.h, p0/m, z1\.h, z2\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f32.c
index ab86385c382..cef731ebcfe 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmad_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svnmad_x (p0, z1, z0, d4))
 
 /*
-** nmad_s4_f32_x_untied: { xfail *-*-* }
+** nmad_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fnmla	z0\.s, p0/m, z1\.s, z2\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f64.c
index c236ff5a1a1..43b97c0de50 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmad_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmad_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svnmad_x (p0, z1, z0, d4))
 
 /*
-** nmad_d4_f64_x_untied: { xfail *-*-* }
+** nmad_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fnmla	z0\.d, p0/m, z1\.d, z2\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f16.c
index f7ac377fdc2..75d0ec7d3ab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmla_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svnmla_x (p0, z1, z0, d4))
 
 /*
-** nmla_h4_f16_x_untied: { xfail *-*-* }
+** nmla_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fnmad	z0\.h, p0/m, z2\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f32.c
index ef9542d7405..da594d3eb95 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmla_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svnmla_x (p0, z1, z0, d4))
 
 /*
-** nmla_s4_f32_x_untied: { xfail *-*-* }
+** nmla_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fnmad	z0\.s, p0/m, z2\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f64.c
index 441821f606b..73f15f41762 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmla_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmla_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svnmla_x (p0, z1, z0, d4))
 
 /*
-** nmla_d4_f64_x_untied: { xfail *-*-* }
+** nmla_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fnmad	z0\.d, p0/m, z2\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f16.c
index 8aa6c750970..ccf7e51ffc9 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmls_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svnmls_x (p0, z1, z0, d4))
 
 /*
-** nmls_h4_f16_x_untied: { xfail *-*-* }
+** nmls_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fnmsb	z0\.h, p0/m, z2\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f32.c
index 42ea13faca8..10d345026f7 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmls_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svnmls_x (p0, z1, z0, d4))
 
 /*
-** nmls_s4_f32_x_untied: { xfail *-*-* }
+** nmls_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fnmsb	z0\.s, p0/m, z2\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f64.c
index 994c2a74eeb..bf2a4418a9f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmls_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmls_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svnmls_x (p0, z1, z0, d4))
 
 /*
-** nmls_d4_f64_x_untied: { xfail *-*-* }
+** nmls_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fnmsb	z0\.d, p0/m, z2\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f16.c
index c1140148522..5311ceb4408 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f16.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmsb_h4_f16_x_tied2, svfloat16_t, __fp16,
 		 z0 = svnmsb_x (p0, z1, z0, d4))
 
 /*
-** nmsb_h4_f16_x_untied: { xfail *-*-* }
+** nmsb_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fnmls	z0\.h, p0/m, z1\.h, z2\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f32.c
index c2204e040ee..6f1407a8717 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f32.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmsb_s4_f32_x_tied2, svfloat32_t, float,
 		 z0 = svnmsb_x (p0, z1, z0, d4))
 
 /*
-** nmsb_s4_f32_x_untied: { xfail *-*-* }
+** nmsb_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fnmls	z0\.s, p0/m, z1\.s, z2\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f64.c
index 56592d3ae2e..5e4e1dd7ea6 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/nmsb_f64.c
@@ -281,7 +281,7 @@ TEST_UNIFORM_ZD (nmsb_d4_f64_x_tied2, svfloat64_t, double,
 		 z0 = svnmsb_x (p0, z1, z0, d4))
 
 /*
-** nmsb_d4_f64_x_untied: { xfail *-*-* }
+** nmsb_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fnmls	z0\.d, p0/m, z1\.d, z2\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f16.c
index bf4a0ab1ef5..48a57466f9d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f16.c
@@ -336,7 +336,7 @@ TEST_UNIFORM_ZD (sub_h4_f16_x_tied1, svfloat16_t, __fp16,
 		 z0 = svsub_x (p0, z0, d4))
 
 /*
-** sub_h4_f16_x_untied: { xfail *-*-* }
+** sub_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fsubr	z0\.h, p0/m, z0\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f32.c
index 05be52bade8..32d57be9a35 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f32.c
@@ -336,7 +336,7 @@ TEST_UNIFORM_ZD (sub_s4_f32_x_tied1, svfloat32_t, float,
 		 z0 = svsub_x (p0, z0, d4))
 
 /*
-** sub_s4_f32_x_untied: { xfail *-*-* }
+** sub_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fsubr	z0\.s, p0/m, z0\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f64.c
index 2179382c3f5..cdc25582649 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sub_f64.c
@@ -336,7 +336,7 @@ TEST_UNIFORM_ZD (sub_d4_f64_x_tied1, svfloat64_t, double,
 		 z0 = svsub_x (p0, z0, d4))
 
 /*
-** sub_d4_f64_x_untied: { xfail *-*-* }
+** sub_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fsubr	z0\.d, p0/m, z0\.d, z1\.d
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f16.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f16.c
index e14357db27f..6929b286218 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f16.c
@@ -285,7 +285,7 @@ TEST_UNIFORM_ZD (subr_h4_f16_x_tied1, svfloat16_t, __fp16,
 		 z0 = svsubr_x (p0, z0, d4))
 
 /*
-** subr_h4_f16_x_untied: { xfail *-*-* }
+** subr_h4_f16_x_untied:
 **	mov	z0\.h, h4
 **	fsub	z0\.h, p0/m, z0\.h, z1\.h
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f32.c
index 98dc7ad2b97..5bf90a39145 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f32.c
@@ -285,7 +285,7 @@ TEST_UNIFORM_ZD (subr_s4_f32_x_tied1, svfloat32_t, float,
 		 z0 = svsubr_x (p0, z0, d4))
 
 /*
-** subr_s4_f32_x_untied: { xfail *-*-* }
+** subr_s4_f32_x_untied:
 **	mov	z0\.s, s4
 **	fsub	z0\.s, p0/m, z0\.s, z1\.s
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f64.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f64.c
index 81f1112d762..7091c40bbb2 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f64.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_f64.c
@@ -285,7 +285,7 @@ TEST_UNIFORM_ZD (subr_d4_f64_x_tied1, svfloat64_t, double,
 		 z0 = svsubr_x (p0, z0, d4))
 
 /*
-** subr_d4_f64_x_untied: { xfail *-*-* }
+** subr_d4_f64_x_untied:
 **	mov	z0\.d, d4
 **	fsub	z0\.d, p0/m, z0\.d, z1\.d
 **	ret
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]
  2021-06-30 15:42         ` Richard Sandiford
@ 2021-07-02  2:18           ` Kewen.Lin
  0 siblings, 0 replies; 16+ messages in thread
From: Kewen.Lin @ 2021-07-02  2:18 UTC (permalink / raw)
  To: richard.sandiford
  Cc: Hongtao Liu, GCC Patches, Vladimir Makarov, bergner,
	Bill Schmidt, Segher Boessenkool

Hi Richard,

on 2021/6/30 11:42 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2021/6/28 3:20 PM, Hongtao Liu wrote:
>>> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>>>
>>>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>>>>> Hi,
>>>>>>
>>>>>> PR100328 has some details about this issue, I am trying to
>>>>>> brief it here.  In the hottest function LBM_performStreamCollideTRT
>>>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>>>>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>>>>>> insn has two flavors: FLOAT_REG and VSX_REG, the VSX_REG reg
>>>>>> class have 64 registers whose foregoing 32 ones make up the
>>>>>> whole FLOAT_REG.  There are some differences for these two
>>>>>> flavors, taking "*fma<mode>4_fpr" as example:
>>>>>>
>>>>>> (define_insn "*fma<mode>4_fpr"
>>>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>>>>>>       (fma:SFDF
>>>>>>         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>>>>>>         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>>>>>>         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>>>>>
>>>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>>>>>
>>>>>> So for VSX_REG, we only have the destructive form, when VSX_REG
>>>>>> alternative being used, the operand 2 or operand 3 is required
>>>>>> to be the same as operand 0.  reload has to take care of this
>>>>>> constraint and create some non-free register copies if required.
>>>>>>
>>>>>> Assuming one fma insn looks like:
>>>>>>   op0 = FMA (op1, op2, op3)
>>>>>>
>>>>>> The best regclass of them are VSX_REG, when op1,op2,op3 are all dead,
>>>>>> IRA simply creates three shuffle copies for them (here the operand
>>>>>> order matters, since with the same freq, the one with smaller number
>>>>>> takes preference), but IMO both op2 and op3 should take higher priority
>>>>>> in copy queue due to the matching constraint.
>>>>>>
>>>>>> I noticed that there is one function ira_get_dup_out_num, which meant
>>>>>> to create this kind of constraint copy, but the below code looks to
>>>>>> refuse to create if there is an alternative which has valid regclass
>>>>>> without spilled need.
>>>>>>
>>>>>>       default:
>>>>>>       {
>>>>>>         enum constraint_num cn = lookup_constraint (str);
>>>>>>         enum reg_class cl = reg_class_for_constraint (cn);
>>>>>>         if (cl != NO_REGS
>>>>>>             && !targetm.class_likely_spilled_p (cl))
>>>>>>           goto fail
>>>>>>
>>>>>>        ...
>>>>>>
>>>>>> I cooked one patch attached to make ira respect this kind of matching
>>>>>> constraint guarded with one parameter.  As I stated in the PR, I was
>>>>>> not sure this is on the right track.  The RFC patch is to check the
>>>>>> matching constraint in all alternatives, if there is one alternative
>>>>>> with matching constraint and matches the current preferred regclass
>>>>>> (or best of allocno?), it will record the output operand number and
>>>>>> further create one constraint copy for it.  Normally it can get the
>>>>>> priority against shuffle copies and the matching constraint will get
>>>>>> satisfied with higher possibility, reload doesn't create extra copies
>>>>>> to meet the matching constraint or the desirable register class when
>>>>>> it has to.
>>>>>>
>>>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
>>>>>> as shuffle copies, and later any of A,B,C,D gets assigned by one
>>>>>> hardware register which is a VSX register (VSX_REG) but not a FP
>>>>>> register (FLOAT_REG), which means it has to pay costs once we can NOT
>>>>>> go with VSX alternatives, so at that time it's important to respect
>>>>>> the matching constraint then we can increase the freq for the remaining
>>>>>> copies related to this (A/B, A/C, A/D).  This idea requires some side
>>>>>> tables to record some information and seems a bit complicated in the
>>>>>> current framework, so the proposed patch aggressively emphasizes the
>>>>>> matching constraint at the time of creating copies.
>>>>>>
>>>>>
>>>>> Comparing with the original patch (v1), this patch v3 has
>>>>> considered: (this should be v2 for this mail list, but bump
>>>>> it to be consistent as PR's).
>>>>>
>>>>>   - Excluding the case where for one preferred register class
>>>>>     there can be two or more alternatives, one of them has the
>>>>>     matching constraint, while another doesn't have.  So for
>>>>>     the given operand, even if it's assigned by a hardware reg
>>>>>     which doesn't meet the matching constraint, it can simply
>>>>>     use the alternative which doesn't have matching constraint
>>>>>     so no register move is needed.  One typical case is
>>>>>     define_insn *mov<mode>_internal2 on rs6000.  So we
>>>>>     shouldn't create constraint copy for it.
>>>>>
>>>>>   - The possible free register move in the same register class,
>>>>>     disable this if so since the register move to meet the
>>>>>     constraint is considered as free.
>>>>>
>>>>>   - Making it on by default, suggested by Segher & Vladimir, we
>>>>>     hope to get rid of the parameter if the benchmarking result
>>>>>     looks good on major targets.
>>>>>
>>>>>   - Tweaking cost when either of matching constraint two sides
>>>>>     is hardware register.  Before this patch, the constraint
>>>>>     copy is simply taken as a real move insn for pref and
>>>>>     conflict cost with one hardware register, after this patch,
>>>>>     it's allowed that there are several input operands
>>>>>     respecting the same matching constraint (but in different
>>>>>     alternatives), so we should take it to be like shuffle copy
>>>>>     for some cases to avoid over preferring/disparaging.
>>>>>
>>>>> Please check the PR comments for more details.
>>>>>
>>>>> This patch can be bootstrapped & regtested on
>>>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but have some
>>>>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
>>>>> was attached in the PR and thought the new assembly looks
>>>>> improved (expected).
>>>>>
>>>>> With option Ofast unroll, this patch can help to improve SPEC2017
>>>>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
>>>>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
>>>>> remarkable degradations.
>>>>>
>>>>> Since this patch likely benefits x86_64 and aarch64, but I don't
>>>>> have performance machines with these arches at hand, could
>>>>> someone kindly help to benchmark it if possible?
>>>> I can help test it on Intel cascade lake and AMD milan.
>>
>>
>> Thanks for your help, Hongtao!
>>
>>
>>> And could you rebase your patch on the lastest trunk, i got several
>>> failures when applying the patch
>>> ~ git apply ira-v3.diff
>>> error: patch failed: gcc/doc/invoke.texi:13845
>>> error: gcc/doc/invoke.texi: patch does not apply
>>> error: patch failed: gcc/ira-conflicts.c:233
>>> error: gcc/ira-conflicts.c: patch does not apply
>>> error: patch failed: gcc/ira-int.h:971
>>> error: gcc/ira-int.h: patch does not apply
>>> error: patch failed: gcc/ira.c:1922
>>> error: gcc/ira.c: patch does not apply
>>> error: patch failed: gcc/params.opt:330
>>> error: gcc/params.opt: patch does not apply
>>>
>>
>> I think it's due to unexpected git stat lines in previously attached diff.
>>
>> I have attached the format-patch file.  Please have a check.  Thanks!
> 
> FWIW, this seems to be neutral for SPEC 2017 on AArch64.  The SVE
> XFAIL->XPASS transitions mean it's definitely a good thing for
> AArch64 in that respect though.

Thanks for the information!  It gives us the confidence to turn it on by
default now; I've removed those xfails in the latest version:
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574315.html

BR,
Kewen


* [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328]
  2021-07-02  2:11     ` [PATCH v4] " Kewen.Lin
@ 2021-07-02  2:28       ` Kewen.Lin
  2021-07-02  8:05         ` Uros Bizjak
  2021-07-05 13:04       ` [PATCH v4] ira: Support more matching constraint forms with param [PR100328] Vladimir Makarov
  1 sibling, 1 reply; 16+ messages in thread
From: Kewen.Lin @ 2021-07-02  2:28 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

Hi,

With Hongtao's help (thanks), we got the SPEC2017 performance
evaluation result on x86_64 (see [1]); the new parameter
ira-consider-dup-in-all-alts has negative effects on i386.
Since we observed that it benefits the aarch64 and rs6000 ports,
the param is set to 1 by default; this patch disables it on
i386 explicitly to avoid performance degradation there.

Bootstrapped & regtested on x86_64-redhat-linux.

Is it ok for trunk?

BR,
Kewen

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573986.html
-----

[-- Attachment #2: 0002-i386-Disable-param-ira-consider-dup-in-all-alts-PR10.patch --]
[-- Type: text/plain, Size: 1255 bytes --]

From 457c7b3032e20ea0f9d8c8d2980e7da6daeedb13 Mon Sep 17 00:00:00 2001
From: Kewen Lin <linkw@linux.ibm.com>
Date: Mon, 21 Jun 2021 22:51:09 -0500
Subject: [PATCH 2/2] i386: Disable param ira-consider-dup-in-all-alts
 [PR100328]

With Hongtao's SPEC2017 performance evaluation result here:
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573986.html
the new parameter ira-consider-dup-in-all-alts has negative
effects on i386, so this patch disables it explicitly on
i386.

Bootstrapped & regtested on x86_64-redhat-linux.

gcc/ChangeLog:

	PR rtl-optimization/100328
	* config/i386/i386-options.c (ix86_option_override_internal):
	Set param_ira_consider_dup_in_all_alts to 0.
---
 gcc/config/i386/i386-options.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 0eccb549c22..7a35c468da3 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2831,6 +2831,8 @@ ix86_option_override_internal (bool main_args_p,
   if (ix86_indirect_branch != indirect_branch_keep)
     SET_OPTION_IF_UNSET (opts, opts_set, flag_jump_tables, 0);
 
+  SET_OPTION_IF_UNSET (opts, opts_set, param_ira_consider_dup_in_all_alts, 0);
+
   return true;
 }
 
-- 
2.17.1



* Re: [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328]
  2021-07-02  2:28       ` [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328] Kewen.Lin
@ 2021-07-02  8:05         ` Uros Bizjak
  0 siblings, 0 replies; 16+ messages in thread
From: Uros Bizjak @ 2021-07-02  8:05 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: gcc-patches, Vladimir Makarov, Hongtao Liu

On Fri, Jul 2, 2021 at 4:28 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi,
>
> With Hongtao's help (thanks), we got the SPEC2017 performance
> evaluation result on x86_64 (see [1]), this new parameter
> ira-consider-dup-in-all-alts has negative effects on i386.
> Since we observed it can benefit ports aarch64 and rs6000, the
> param is set as 1 by default, this patch is to disable it on
> i386 explicitly to avoid performance degradation there.
>
> Bootstrapped & regtested on x86_64-redhat-linux.
>
> Is it ok for trunk?

OK.

Thanks,
Uros.


* Re: [PATCH v4] ira: Support more matching constraint forms with param [PR100328]
  2021-07-02  2:11     ` [PATCH v4] " Kewen.Lin
  2021-07-02  2:28       ` [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328] Kewen.Lin
@ 2021-07-05 13:04       ` Vladimir Makarov
  1 sibling, 0 replies; 16+ messages in thread
From: Vladimir Makarov @ 2021-07-05 13:04 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: bergner, Bill Schmidt, Segher Boessenkool, Richard Sandiford,
	crazylht, GCC Patches


On 2021-07-01 10:11 p.m., Kewen.Lin wrote:
> Hi Vladimir,
>
> on 2021/6/30 下午11:24, Vladimir Makarov wrote:
>>
>> Many thanks for your review!  I've updated the patch according to your comments and also polished some comments and document words a bit.  Does it look better to you?
>>
Sorry for the delay in answering.  The patch looks better to me now and
can be committed to trunk.

Thanks again for working on this performance issue.




end of thread, other threads:[~2021-07-05 13:04 UTC | newest]

Thread overview: 16+ messages
2021-06-09  5:18 [RFC/PATCH] ira: Consider matching constraints with param [PR100328] Kewen.Lin
2021-06-28  6:26 ` [RFC/PATCH v3] ira: Support more matching constraint forms " Kewen.Lin
2021-06-28  7:12   ` Hongtao Liu
2021-06-28  7:20     ` Hongtao Liu
2021-06-28  7:27       ` Kewen.Lin
2021-06-30  8:53         ` Hongtao Liu
2021-06-30  9:42           ` Kewen.Lin
2021-06-30 10:18             ` Hongtao Liu
2021-06-30 15:42         ` Richard Sandiford
2021-07-02  2:18           ` Kewen.Lin
2021-06-30 15:24   ` Vladimir Makarov
2021-07-02  2:11     ` [PATCH v4] " Kewen.Lin
2021-07-02  2:28       ` [PATCH] i386: Disable param ira-consider-dup-in-all-alts [PR100328] Kewen.Lin
2021-07-02  8:05         ` Uros Bizjak
2021-07-05 13:04       ` [PATCH v4] ira: Support more matching constraint forms with param [PR100328] Vladimir Makarov
2021-06-30 15:25   ` [RFC/PATCH v3] " Vladimir Makarov
