public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/6] Add a late-combine pass
@ 2024-06-20 13:34 Richard Sandiford
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

This series is a resubmission of the late-combine work.  I've fixed
some bugs that Jeff's cross-target CI found last time and some others
that I hit since then.

I've also removed a source of quadraticness (oops!).  Doing that
in turn drove some tweaks to the rtl-ssa scan routines.

The complexity of the new pass should be amortised O(n1 log(n2)), where
n1 is the total number of input operands in the function and n2 is the
number of instructions.  The log(n2) component comes from searching call
clobbers and is very much a worst case.  We therefore shouldn't need a
--param to limit the optimisation.

I think the main comment from last time was that we should enable
the pass by default on most targets.  If there is a known reason
why the pass doesn't work on a particular target, we should default
to off for that specific target and file a bug to track the problem.

The only targets that I know need to be handled in this way are
i386, rs6000 and xtensa.  See the covering note in the last patch
for details.  If the series is OK, I'll file PRs for those targets
after pushing the patches.

Tested on aarch64-linux-gnu and x86_64-linux-gnu (somewhat of a
token gesture given the default-off for x86_64).  Also tested by
compiling one target per CPU directory and comparing the assembly output
for parts of the GCC testsuite.  This is just a way of getting a flavour
of how the pass performs; it obviously isn't a meaningful benchmark.
All targets seemed to improve on average, as described in the covering
note to the last patch.

The original motivation for the pass was to fix things like PR106594.
However, it also helps to reclaim some of the optimisations that
were lost in r15-268.  Please let me know if there are some cases
that the pass fails to reclaim.

The series depends on Gui Haochen's insn_cost fix.

OK to install?

Thanks to Jeff for the help with testing the series.

Richard


Richard Sandiford (6):
  rtl-ssa: Rework _ignoring interfaces
  rtl-ssa: Don't cost no-op moves
  iq2000: Fix test and branch instructions
  sh: Make *minus_plus_one work after RA
  xstormy16: Fix xs_hi_nonmemory_operand
  Add a late-combine pass [PR106594]

 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/config/aarch64/aarch64-cc-fusion.cc       |   4 +-
 gcc/config/i386/i386-options.cc               |   4 +
 gcc/config/iq2000/iq2000.cc                   |   2 +-
 gcc/config/iq2000/iq2000.md                   |   4 +-
 gcc/config/rs6000/rs6000.cc                   |   8 +
 gcc/config/sh/sh.md                           |   6 +-
 gcc/config/stormy16/predicates.md             |   2 +-
 gcc/config/xtensa/xtensa.cc                   |  11 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/doc/rtl.texi                              |  14 +-
 gcc/late-combine.cc                           | 747 ++++++++++++++++++
 gcc/opts.cc                                   |   1 +
 gcc/pair-fusion.cc                            |  34 +-
 gcc/passes.def                                |   2 +
 gcc/rtl-ssa.h                                 |   1 +
 gcc/rtl-ssa/access-utils.h                    | 145 ++--
 gcc/rtl-ssa/change-utils.h                    |  67 +-
 gcc/rtl-ssa/changes.cc                        |   6 +-
 gcc/rtl-ssa/changes.h                         |  13 -
 gcc/rtl-ssa/functions.h                       |  16 +-
 gcc/rtl-ssa/insn-utils.h                      |   8 -
 gcc/rtl-ssa/insns.cc                          |   7 +-
 gcc/rtl-ssa/insns.h                           |  12 -
 gcc/rtl-ssa/member-fns.inl                    |  35 +-
 gcc/rtl-ssa/movement.h                        | 118 ++-
 gcc/rtl-ssa/predicates.h                      |  58 ++
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
 .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 40 files changed, 1127 insertions(+), 296 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/rtl-ssa/predicates.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

-- 
2.25.1



* [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-20 21:22   ` Alex Coplan
  2024-06-21 14:40   ` Jeff Law
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

rtl-ssa has routines for scanning forwards or backwards for something
under the control of an exclusion set.  These searches are currently
used for two main things:

- to work out where an instruction can be moved within its EBB
- to work out whether recog can add a new hard register clobber

The exclusion set was originally a callback function that returned
true for insns that should be ignored.  However, for the late-combine
work, I'd also like to be able to skip an entire definition, along
with all its uses.

This patch prepares for that by turning the exclusion set into an
object that provides predicate member functions.  Currently the
only two member functions are:

- should_ignore_insn: what the old callback did
- should_ignore_def: the new functionality

but more could be added later.

Doing this also makes it easy to remove some asymmetry that I think
in hindsight was a mistake: in forward scans, ignoring an insn meant
ignoring all definitions in that insn (ok) and all uses of those
definitions (non-obvious).  The new interface makes it possible
to select the required behaviour, with that behaviour being applied
consistently in both directions.

Now that the exclusion set is a dedicated object, rather than
just a "random" function, I think it makes sense to remove the
_ignoring suffix from the function names.  The suffix was originally
there to describe the callback, and in particular to emphasise that
a true return meant "ignore" rather than "heed".

gcc/
	* rtl-ssa.h: Include predicates.h.
	* rtl-ssa/predicates.h: New file.
	* rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to...
	(prev_call_clobbers): ...this and treat the ignore parameter as an
	object with the same interface as ignore_nothing.
	(next_call_clobbers_ignoring): Rename to...
	(next_call_clobbers): ...this and treat the ignore parameter as an
	object with the same interface as ignore_nothing.
	(first_nondebug_insn_use_ignoring): Rename to...
	(first_nondebug_insn_use): ...this and treat the ignore parameter as
	an object with the same interface as ignore_nothing.
	(last_nondebug_insn_use_ignoring): Rename to...
	(last_nondebug_insn_use): ...this and treat the ignore parameter as
	an object with the same interface as ignore_nothing.
	(last_access_ignoring): Rename to...
	(last_access): ...this and treat the ignore parameter as an object
	with the same interface as ignore_nothing.  Conditionally skip
	definitions.
	(prev_access_ignoring): Rename to...
	(prev_access): ...this and treat the ignore parameter as an object
	with the same interface as ignore_nothing.
	(first_def_ignoring): Replace with...
	(first_access): ...this new function.
	(next_access_ignoring): Rename to...
	(next_access): ...this and treat the ignore parameter as an object
	with the same interface as ignore_nothing.  Conditionally skip
	definitions.
	* rtl-ssa/change-utils.h (insn_is_changing): Delete.
	(restrict_movement_ignoring): Rename to...
	(restrict_movement): ...this and treat the ignore parameter as an
	object with the same interface as ignore_nothing.
	(recog_ignoring): Rename to...
	(recog): ...this and treat the ignore parameter as an object with
	the same interface as ignore_nothing.
	* rtl-ssa/changes.h (insn_is_changing_closure): Delete.
	* rtl-ssa/functions.h (function_info::add_regno_clobber): Treat
	the ignore parameter as an object with the same interface as
	ignore_nothing.
	* rtl-ssa/insn-utils.h (insn_is): Delete.
	* rtl-ssa/insns.h (insn_is_closure): Delete.
	* rtl-ssa/member-fns.inl
	(insn_is_changing_closure::insn_is_changing_closure): Delete.
	(insn_is_changing_closure::operator()): Likewise.
	(function_info::add_regno_clobber): Treat the ignore parameter
	as an object with the same interface as ignore_nothing.
	(ignore_changing_insns::ignore_changing_insns): New function.
	(ignore_changing_insns::should_ignore_insn): Likewise.
	* rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat
	the ignore parameter as an object with the same interface as
	ignore_nothing.
	(restrict_movement_for_defs_ignoring): Rename to...
	(restrict_movement_for_defs): ...this and treat the ignore parameter
	as an object with the same interface as ignore_nothing.
	(restrict_movement_for_uses_ignoring): Rename to...
	(restrict_movement_for_uses): ...this and treat the ignore parameter
	as an object with the same interface as ignore_nothing.  Conditionally
	skip definitions.
	* doc/rtl.texi: Update for above name changes.  Use
	ignore_changing_insns instead of insn_is_changing.
	* config/aarch64/aarch64-cc-fusion.cc (cc_fusion::parallelize_insns):
	Likewise.
	* pair-fusion.cc (no_ignore): Delete.
	(latest_hazard_before, first_hazard_after): Update for above name
	changes.  Use ignore_nothing instead of no_ignore.
	(pair_fusion_bb_info::fuse_pair): Update for above name changes.
	Use ignore_changing_insns instead of insn_is_changing.
	(pair_fusion::try_promote_writeback): Likewise.
---
 gcc/config/aarch64/aarch64-cc-fusion.cc |   4 +-
 gcc/doc/rtl.texi                        |  14 +--
 gcc/pair-fusion.cc                      |  34 +++---
 gcc/rtl-ssa.h                           |   1 +
 gcc/rtl-ssa/access-utils.h              | 145 +++++++++++++-----------
 gcc/rtl-ssa/change-utils.h              |  67 +++++------
 gcc/rtl-ssa/changes.h                   |  13 ---
 gcc/rtl-ssa/functions.h                 |  16 ++-
 gcc/rtl-ssa/insn-utils.h                |   8 --
 gcc/rtl-ssa/insns.h                     |  12 --
 gcc/rtl-ssa/member-fns.inl              |  35 +++---
 gcc/rtl-ssa/movement.h                  | 118 +++++++++----------
 gcc/rtl-ssa/predicates.h                |  58 ++++++++++
 13 files changed, 275 insertions(+), 250 deletions(-)
 create mode 100644 gcc/rtl-ssa/predicates.h

diff --git a/gcc/config/aarch64/aarch64-cc-fusion.cc b/gcc/config/aarch64/aarch64-cc-fusion.cc
index a4f43295680..e97c26682d0 100644
--- a/gcc/config/aarch64/aarch64-cc-fusion.cc
+++ b/gcc/config/aarch64/aarch64-cc-fusion.cc
@@ -183,7 +183,7 @@ cc_fusion::parallelize_insns (def_info *cc_def, rtx cc_set,
   auto other_change = insn_change::delete_insn (other_insn);
   insn_change *changes[] = { &other_change, &cc_change };
   cc_change.move_range = cc_insn->ebb ()->insn_range ();
-  if (!restrict_movement_ignoring (cc_change, insn_is_changing (changes)))
+  if (!restrict_movement (cc_change, ignore_changing_insns (changes)))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "-- cannot satisfy all definitions and uses\n");
@@ -205,7 +205,7 @@ cc_fusion::parallelize_insns (def_info *cc_def, rtx cc_set,
   validate_change (cc_rtl, &PATTERN (cc_rtl), m_parallel, 1);
 
   // These routines report failures themselves.
-  if (!recog_ignoring (attempt, cc_change, insn_is_changing (changes))
+  if (!recog (attempt, cc_change, ignore_changing_insns (changes))
       || !changes_are_worthwhile (changes)
       || !crtl->ssa->verify_insn_changes (changes))
     return false;
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index aa10b5235b5..846a043bdc7 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -5073,7 +5073,7 @@ in the correct order with respect to each other.
 The way to do this is:
 
 @smallexample
-if (!rtl_ssa::restrict_movement_ignoring (change, insn_is_changing (changes)))
+if (!rtl_ssa::restrict_movement (change, ignore_changing_insns (changes)))
   return false;
 @end smallexample
 
@@ -5085,7 +5085,7 @@ changing instructions (which might, for example, no longer need
 to clobber the flags register).  The way to do this is:
 
 @smallexample
-if (!rtl_ssa::recog_ignoring (attempt, change, insn_is_changing (changes)))
+if (!rtl_ssa::recog (attempt, change, ignore_changing_insns (changes)))
   return false;
 @end smallexample
 
@@ -5137,16 +5137,16 @@ change2.move_range = @dots{};
 
 rtl_ssa::insn_change *changes[] = @{ &change1, &change2 @};
 
-auto is_changing = insn_is_changing (changes);
-if (!rtl_ssa::restrict_movement_ignoring (change1, is_changing)
-    || !rtl_ssa::restrict_movement_ignoring (change2, is_changing))
+auto ignore = ignore_changing_insns (changes);
+if (!rtl_ssa::restrict_movement (change1, ignore)
+    || !rtl_ssa::restrict_movement (change2, ignore))
   return false;
 
 insn_change_watermark watermark;
 // Use validate_change etc. to change INSN1's and INSN2's patterns.
 @dots{}
-if (!rtl_ssa::recog_ignoring (attempt, change1, is_changing)
-    || !rtl_ssa::recog_ignoring (attempt, change2, is_changing)
+if (!rtl_ssa::recog (attempt, change1, ignore)
+    || !rtl_ssa::recog (attempt, change2, ignore)
     || !rtl_ssa::changes_are_worthwhile (changes)
     || !crtl->ssa->verify_insn_changes (changes))
   return false;
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 26b2284ed37..31d2c21c88f 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -563,9 +563,6 @@ pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
     }
 }
 
-// Dummy predicate that never ignores any insns.
-static bool no_ignore (insn_info *) { return false; }
-
 // Return the latest dataflow hazard before INSN.
 //
 // If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
@@ -643,9 +640,8 @@ latest_hazard_before (insn_info *insn, rtx *ignore,
 	  if (!call_group->clobbers (def->resource ()))
 	    continue;
 
-	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
-							   def->insn (),
-							   no_ignore);
+	  auto clobber_insn = prev_call_clobbers (*call_group, def->insn (),
+						  ignore_nothing ());
 	  if (clobber_insn)
 	    hazard (clobber_insn);
 	}
@@ -704,9 +700,8 @@ first_hazard_after (insn_info *insn, rtx *ignore)
 	  if (!call_group->clobbers (def->resource ()))
 	    continue;
 
-	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
-							   def->insn (),
-							   no_ignore);
+	  auto clobber_insn = next_call_clobbers (*call_group, def->insn (),
+						  ignore_nothing ());
 	  if (clobber_insn)
 	    hazard (clobber_insn);
 	}
@@ -733,16 +728,15 @@ first_hazard_after (insn_info *insn, rtx *ignore)
 
       // Also need to handle call clobbers of our uses (again WaR).
       //
-      // See restrict_movement_for_uses_ignoring for why we don't
-      // need to check backwards for call clobbers.
+      // See restrict_movement_for_uses for why we don't need to check
+      // backwards for call clobbers.
       for (auto call_group : use->ebb ()->call_clobbers ())
 	{
 	  if (!call_group->clobbers (use->resource ()))
 	    continue;
 
-	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
-							   use->insn (),
-							   no_ignore);
+	  auto clobber_insn = next_call_clobbers (*call_group, use->insn (),
+						  ignore_nothing ());
 	  if (clobber_insn)
 	    hazard (clobber_insn);
 	}
@@ -1965,12 +1959,12 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
 	}
     }
 
-  auto is_changing = insn_is_changing (changes);
+  auto ignore = ignore_changing_insns (changes);
   for (unsigned i = 0; i < changes.length (); i++)
-    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
+    gcc_assert (rtl_ssa::restrict_movement (*changes[i], ignore));
 
   // Check the pair pattern is recog'd.
-  if (!rtl_ssa::recog_ignoring (attempt, *pair_change, is_changing))
+  if (!rtl_ssa::recog (attempt, *pair_change, ignore))
     {
       if (dump_file)
 	fprintf (dump_file, "  failed to form pair, recog failed\n");
@@ -2953,11 +2947,11 @@ pair_fusion::try_promote_writeback (insn_info *insn, bool load_p)
 					pair_change.new_defs);
   gcc_assert (pair_change.new_defs.is_valid ());
 
-  auto is_changing = insn_is_changing (changes);
+  auto ignore = ignore_changing_insns (changes);
   for (unsigned i = 0; i < ARRAY_SIZE (changes); i++)
-    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
+    gcc_assert (rtl_ssa::restrict_movement (*changes[i], ignore));
 
-  if (!rtl_ssa::recog_ignoring (attempt, pair_change, is_changing))
+  if (!rtl_ssa::recog (attempt, pair_change, ignore))
     {
       if (dump_file)
 	fprintf (dump_file, "i%d: recog failed on wb pair, bailing out\n",
diff --git a/gcc/rtl-ssa.h b/gcc/rtl-ssa.h
index 17337639ae8..2718d97b6d9 100644
--- a/gcc/rtl-ssa.h
+++ b/gcc/rtl-ssa.h
@@ -63,6 +63,7 @@
 #include "rtl-ssa/blocks.h"
 #include "rtl-ssa/changes.h"
 #include "rtl-ssa/functions.h"
+#include "rtl-ssa/predicates.h"
 #include "rtl-ssa/is-a.inl"
 #include "rtl-ssa/access-utils.h"
 #include "rtl-ssa/insn-utils.h"
diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
index f889300666d..8805eec1d7f 100644
--- a/gcc/rtl-ssa/access-utils.h
+++ b/gcc/rtl-ssa/access-utils.h
@@ -321,19 +321,22 @@ int lookup_def (def_splay_tree &, insn_info *);
 int lookup_clobber (clobber_tree &, insn_info *);
 int lookup_call_clobbers (insn_call_clobbers_tree &, insn_info *);
 
-// Search backwards from immediately before INSN for the first instruction
-// recorded in TREE, ignoring any instruction I for which IGNORE (I) is true.
-// Return null if no such instruction exists.
-template<typename IgnorePredicate>
+// Search backwards from immediately before INSN for the first "relevant"
+// instruction recorded in TREE.  IGNORE is an object that provides the same
+// interface as ignore_nothing; it defines which insns are "relevant"
+// and which should be ignored.
+//
+// Return null if no such relevant instruction exists.
+template<typename IgnorePredicates>
 insn_info *
-prev_call_clobbers_ignoring (insn_call_clobbers_tree &tree, insn_info *insn,
-			     IgnorePredicate ignore)
+prev_call_clobbers (insn_call_clobbers_tree &tree, insn_info *insn,
+		    IgnorePredicates ignore)
 {
   if (!tree)
     return nullptr;
 
   int comparison = lookup_call_clobbers (tree, insn);
-  while (comparison <= 0 || ignore (tree->insn ()))
+  while (comparison <= 0 || ignore.should_ignore_insn (tree->insn ()))
     {
       if (!tree.splay_prev_node ())
 	return nullptr;
@@ -343,19 +346,22 @@ prev_call_clobbers_ignoring (insn_call_clobbers_tree &tree, insn_info *insn,
   return tree->insn ();
 }
 
-// Search forwards from immediately after INSN for the first instruction
-// recorded in TREE, ignoring any instruction I for which IGNORE (I) is true.
-// Return null if no such instruction exists.
-template<typename IgnorePredicate>
+// Search forwards from immediately after INSN for the first "relevant"
+// instruction recorded in TREE.  IGNORE is an object that provides the
+// same interface as ignore_nothing; it defines which insns are "relevant"
+// and which should be ignored.
+//
+// Return null if no such relevant instruction exists.
+template<typename IgnorePredicates>
 insn_info *
-next_call_clobbers_ignoring (insn_call_clobbers_tree &tree, insn_info *insn,
-			     IgnorePredicate ignore)
+next_call_clobbers (insn_call_clobbers_tree &tree, insn_info *insn,
+		    IgnorePredicates ignore)
 {
   if (!tree)
     return nullptr;
 
   int comparison = lookup_call_clobbers (tree, insn);
-  while (comparison >= 0 || ignore (tree->insn ()))
+  while (comparison >= 0 || ignore.should_ignore_insn (tree->insn ()))
     {
       if (!tree.splay_next_node ())
 	return nullptr;
@@ -370,17 +376,18 @@ next_call_clobbers_ignoring (insn_call_clobbers_tree &tree, insn_info *insn,
 inline insn_info *
 next_call_clobbers (insn_call_clobbers_tree &tree, insn_info *insn)
 {
-  auto ignore = [](const insn_info *) { return false; };
-  return next_call_clobbers_ignoring (tree, insn, ignore);
+  return next_call_clobbers (tree, insn, ignore_nothing ());
 }
 
-// If ACCESS is a set, return the first use of ACCESS by a nondebug insn I
-// for which IGNORE (I) is false.  Return null if ACCESS is not a set or if
-// no such use exists.
-template<typename IgnorePredicate>
+// If ACCESS is a set, return the first "relevant" use of ACCESS by a
+// nondebug insn.  IGNORE is an object that provides the same interface
+// as ignore_nothing; it defines which accesses and insns are "relevant"
+// and which should be ignored.
+//
+// Return null if ACCESS is not a set or if no such relevant use exists.
+template<typename IgnorePredicates>
 inline use_info *
-first_nondebug_insn_use_ignoring (const access_info *access,
-				  IgnorePredicate ignore)
+first_nondebug_insn_use (const access_info *access, IgnorePredicates ignore)
 {
   if (const set_info *set = set_with_nondebug_insn_uses (access))
     {
@@ -389,7 +396,7 @@ first_nondebug_insn_use_ignoring (const access_info *access,
       use_info *use = set->first_use ();
       do
 	{
-	  if (!ignore (use->insn ()))
+	  if (!ignore.should_ignore_insn (use->insn ()))
 	    return use;
 	  use = use->next_nondebug_insn_use ();
 	}
@@ -398,13 +405,15 @@ first_nondebug_insn_use_ignoring (const access_info *access,
   return nullptr;
 }
 
-// If ACCESS is a set, return the last use of ACCESS by a nondebug insn I for
-// which IGNORE (I) is false.  Return null if ACCESS is not a set or if no
-// such use exists.
-template<typename IgnorePredicate>
+// If ACCESS is a set, return the last "relevant" use of ACCESS by a
+// nondebug insn.  IGNORE is an object that provides the same interface
+// as ignore_nothing; it defines which accesses and insns are "relevant"
+// and which should be ignored.
+//
+// Return null if ACCESS is not a set or if no such relevant use exists.
+template<typename IgnorePredicates>
 inline use_info *
-last_nondebug_insn_use_ignoring (const access_info *access,
-				 IgnorePredicate ignore)
+last_nondebug_insn_use (const access_info *access, IgnorePredicates ignore)
 {
   if (const set_info *set = set_with_nondebug_insn_uses (access))
     {
@@ -413,7 +422,7 @@ last_nondebug_insn_use_ignoring (const access_info *access,
       use_info *use = set->last_nondebug_insn_use ();
       do
 	{
-	  if (!ignore (use->insn ()))
+	  if (!ignore.should_ignore_insn (use->insn ()))
 	    return use;
 	  use = use->prev_use ();
 	}
@@ -427,7 +436,8 @@ last_nondebug_insn_use_ignoring (const access_info *access,
 // Otherwise, search backwards for an access to DEF->resource (), starting at
 // the end of DEF's live range.  Ignore clobbers if IGNORE_CLOBBERS_SETTING
 // is YES, otherwise treat them like any other access.  Also ignore any
-// access A for which IGNORE (access_insn (A)) is true.
+// accesses and insns that IGNORE says should be ignored, where IGNORE
+// is an object that provides the same interface as ignore_nothing.
 //
 // Thus if DEF is a set that is used by nondebug insns, the first access
 // that the function considers is the last such use of the set.  Otherwise,
@@ -438,23 +448,21 @@ last_nondebug_insn_use_ignoring (const access_info *access,
 //
 // Note that this function does not consider separately-recorded call clobbers,
 // although such clobbers are only relevant if IGNORE_CLOBBERS_SETTING is NO.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 access_info *
-last_access_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
-		      IgnorePredicate ignore)
+last_access (def_info *def, ignore_clobbers ignore_clobbers_setting,
+	     IgnorePredicates ignore)
 {
   while (def)
     {
       auto *clobber = dyn_cast<clobber_info *> (def);
       if (clobber && ignore_clobbers_setting == ignore_clobbers::YES)
 	def = first_clobber_in_group (clobber);
-      else
+      else if (!ignore.should_ignore_def (def))
 	{
-	  if (use_info *use = last_nondebug_insn_use_ignoring (def, ignore))
+	  if (use_info *use = last_nondebug_insn_use (def, ignore))
 	    return use;
-
-	  insn_info *insn = def->insn ();
-	  if (!ignore (insn))
+	  if (!ignore.should_ignore_insn (def->insn ()))
 	    return def;
 	}
       def = def->prev_def ();
@@ -465,8 +473,9 @@ last_access_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
 // Search backwards for an access to DEF->resource (), starting
 // immediately before the point at which DEF occurs.  Ignore clobbers
 // if IGNORE_CLOBBERS_SETTING is YES, otherwise treat them like any other
-// access.  Also ignore any access A for which IGNORE (access_insn (A))
-// is true.
+// access.  Also ignore any accesses and insns that IGNORE says should be
+// ignored, where IGNORE is an object that provides the same interface as
+// ignore_nothing.
 //
 // Thus if DEF->insn () uses DEF->resource (), that use is the first access
 // that the function considers, since an instruction's uses occur strictly
@@ -474,40 +483,44 @@ last_access_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
 //
 // Note that this function does not consider separately-recorded call clobbers,
 // although such clobbers are only relevant if IGNORE_CLOBBERS_SETTING is NO.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 inline access_info *
-prev_access_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
-		      IgnorePredicate ignore)
+prev_access (def_info *def, ignore_clobbers ignore_clobbers_setting,
+	     IgnorePredicates ignore)
 {
-  return last_access_ignoring (def->prev_def (), ignore_clobbers_setting,
-			       ignore);
+  return last_access (def->prev_def (), ignore_clobbers_setting, ignore);
 }
 
 // If DEF is null, return null.
 //
-// Otherwise, search forwards for a definition of DEF->resource (),
+// Otherwise, search forwards for an access to DEF->resource (),
 // starting at DEF itself.  Ignore clobbers if IGNORE_CLOBBERS_SETTING
 // is YES, otherwise treat them like any other access.  Also ignore any
-// definition D for which IGNORE (D->insn ()) is true.
+// accesses and insns that IGNORE says should be ignored, where IGNORE
+// is an object that provides the same interface as ignore_nothing.
 //
 // Return the definition found, or null if there is no access that meets
 // the criteria.
 //
 // Note that this function does not consider separately-recorded call clobbers,
 // although such clobbers are only relevant if IGNORE_CLOBBERS_SETTING is NO.
-template<typename IgnorePredicate>
-def_info *
-first_def_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
-		    IgnorePredicate ignore)
+template<typename IgnorePredicates>
+access_info *
+first_access (def_info *def, ignore_clobbers ignore_clobbers_setting,
+	      IgnorePredicates ignore)
 {
   while (def)
     {
       auto *clobber = dyn_cast<clobber_info *> (def);
       if (clobber && ignore_clobbers_setting == ignore_clobbers::YES)
 	def = last_clobber_in_group (clobber);
-      else if (!ignore (def->insn ()))
-	return def;
-
+      else if (!ignore.should_ignore_def (def))
+	{
+	  if (!ignore.should_ignore_insn (def->insn ()))
+	    return def;
+	  if (use_info *use = first_nondebug_insn_use (def, ignore))
+	    return use;
+	}
       def = def->next_def ();
     }
   return nullptr;
@@ -516,27 +529,29 @@ first_def_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
 // Search forwards for the next access to DEF->resource (),
 // starting immediately after DEF's instruction.  Ignore clobbers if
 // IGNORE_CLOBBERS_SETTING is YES, otherwise treat them like any other access.
-// Also ignore any access A for which IGNORE (access_insn (A)) is true;
-// in this context, ignoring a set includes ignoring all uses of the set.
+// Also ignore any accesses and insns that IGNORE says should be ignored,
+// where IGNORE is an object that provides the same interface as
+// ignore_nothing.
 //
 // Thus if DEF is a set with uses by nondebug insns, the first access that the
-// function considers is the first such use of the set.
+// function considers is the first such use of the set.  Otherwise, the first
+// access that the function considers is the definition after DEF.
 //
 // Return the access found, or null if there is no access that meets the
 // criteria.
 //
 // Note that this function does not consider separately-recorded call clobbers,
 // although such clobbers are only relevant if IGNORE_CLOBBERS_SETTING is NO.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 access_info *
-next_access_ignoring (def_info *def, ignore_clobbers ignore_clobbers_setting,
-		      IgnorePredicate ignore)
+next_access (def_info *def, ignore_clobbers ignore_clobbers_setting,
+	     IgnorePredicates ignore)
 {
-  if (use_info *use = first_nondebug_insn_use_ignoring (def, ignore))
-    return use;
+  if (!ignore.should_ignore_def (def))
+    if (use_info *use = first_nondebug_insn_use (def, ignore))
+      return use;
 
-  return first_def_ignoring (def->next_def (), ignore_clobbers_setting,
-			     ignore);
+  return first_access (def->next_def (), ignore_clobbers_setting, ignore);
 }
 
 // Return true if ACCESS1 should come before ACCESS2 in an access_array.
diff --git a/gcc/rtl-ssa/change-utils.h b/gcc/rtl-ssa/change-utils.h
index fce41b0157a..fa27d1ad047 100644
--- a/gcc/rtl-ssa/change-utils.h
+++ b/gcc/rtl-ssa/change-utils.h
@@ -30,25 +30,15 @@ insn_is_changing (array_slice<insn_change *const> changes,
   return false;
 }
 
-// Return a closure of insn_is_changing, for use as a predicate.
-// This could be done using local lambdas instead, but the predicate is
-// used often enough that having a class should be more convenient and allow
-// reuse of template instantiations.
-//
-// We don't use std::bind because it would involve an indirect function call,
-// whereas this function is used in relatively performance-critical code.
-inline insn_is_changing_closure
-insn_is_changing (array_slice<insn_change *const> changes)
-{
-  return insn_is_changing_closure (changes);
-}
-
 // Restrict CHANGE.move_range so that the changed instruction can perform
-// all its definitions and uses.  Assume that if:
+// all its definitions and uses.
+//
+// IGNORE is an object that provides the same interface as ignore_nothing.
+// Assume that if:
 //
 // - CHANGE contains an access A1 of resource R;
 // - an instruction I2 contains another access A2 to R; and
-// - IGNORE (I2) is true
+// - IGNORE says that I2 should be ignored
 //
 // then either:
 //
@@ -56,31 +46,33 @@ insn_is_changing (array_slice<insn_change *const> changes)
 // - something will ensure that A1 and A2 maintain their current order,
 //   without this having to be enforced by CHANGE's move range.
 //
-// IGNORE should return true for CHANGE.insn ().
+// Assume the same thing about a definition D of R, and about all uses of D,
+// if IGNORE says that D should be ignored.
+//
+// IGNORE should ignore CHANGE.insn ().
 //
 // Return true on success, otherwise leave CHANGE.move_range in an invalid
 // state.
 //
 // This function only works correctly for instructions that remain within
 // the same extended basic block.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 bool
-restrict_movement_ignoring (insn_change &change, IgnorePredicate ignore)
+restrict_movement (insn_change &change, IgnorePredicates ignore)
 {
   // Uses generally lead to failure quicker, so test those first.
-  return (restrict_movement_for_uses_ignoring (change.move_range,
-					       change.new_uses, ignore)
-	  && restrict_movement_for_defs_ignoring (change.move_range,
-						  change.new_defs, ignore)
+  return (restrict_movement_for_uses (change.move_range,
+				      change.new_uses, ignore)
+	  && restrict_movement_for_defs (change.move_range,
+					 change.new_defs, ignore)
 	  && canonicalize_move_range (change.move_range, change.insn ()));
 }
 
-// Like restrict_movement_ignoring, but ignore only the instruction
-// that is being changed.
+// As above, but ignore only the instruction that is being changed.
 inline bool
 restrict_movement (insn_change &change)
 {
-  return restrict_movement_ignoring (change, insn_is (change.insn ()));
+  return restrict_movement (change, ignore_insn (change.insn ()));
 }
 
 using add_regno_clobber_fn = std::function<bool (insn_change &,
@@ -91,18 +83,22 @@ bool recog_internal (insn_change &, add_regno_clobber_fn);
 // tweaking the pattern or adding extra clobbers in order to make it match.
 //
 // When adding an extra clobber for register R, restrict CHANGE.move_range
-// to a range of instructions for which R is not live.  When determining
-// whether R is live, ignore accesses made by an instruction I if
-// IGNORE (I) is true.  The caller then assumes the responsibility
-// of ensuring that CHANGE and I are placed in a valid order.
+// to a range of instructions for which R is not live.  Use IGNORE to guide
+// this process, where IGNORE is an object that provides the same interface
+// as ignore_nothing.  When determining whether R is live, ignore accesses
+// made by an instruction I if IGNORE says that I should be ignored.
+// The caller then assumes the responsibility of ensuring that CHANGE
+// and I are placed in a valid order.  Similarly, ignore live ranges
+// associated with a definition of R if IGNORE says that that definition
+// should be ignored.
 //
-// IGNORE should return true for CHANGE.insn ().
+// IGNORE should ignore CHANGE.insn ().
 //
 // Return true on success.  Leave CHANGE unmodified on failure.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 inline bool
-recog_ignoring (obstack_watermark &watermark, insn_change &change,
-		IgnorePredicate ignore)
+recog (obstack_watermark &watermark, insn_change &change,
+       IgnorePredicates ignore)
 {
   auto add_regno_clobber = [&](insn_change &change, unsigned int regno)
     {
@@ -111,12 +107,11 @@ recog_ignoring (obstack_watermark &watermark, insn_change &change,
   return recog_internal (change, add_regno_clobber);
 }
 
-// As for recog_ignoring, but ignore only the instruction that is being
-// changed.
+// As above, but ignore only the instruction that is being changed.
 inline bool
 recog (obstack_watermark &watermark, insn_change &change)
 {
-  return recog_ignoring (watermark, change, insn_is (change.insn ()));
+  return recog (watermark, change, ignore_insn (change.insn ()));
 }
 
 // Check whether insn costs indicate that the net effect of the changes
diff --git a/gcc/rtl-ssa/changes.h b/gcc/rtl-ssa/changes.h
index 35ab02243a9..0bcd962fa77 100644
--- a/gcc/rtl-ssa/changes.h
+++ b/gcc/rtl-ssa/changes.h
@@ -98,19 +98,6 @@ private:
   bool m_is_deletion;
 };
 
-// A class that represents a closure of the two-argument form of
-// insn_is_changing.  See the comment above the one-argument form
-// for details.
-class insn_is_changing_closure
-{
-public:
-  insn_is_changing_closure (array_slice<insn_change *const> changes);
-  bool operator() (const insn_info *) const;
-
-private:
-  array_slice<insn_change *const> m_changes;
-};
-
 void pp_insn_change (pretty_printer *, const insn_change &);
 
 }
diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
index f5aca643beb..479c6992e97 100644
--- a/gcc/rtl-ssa/functions.h
+++ b/gcc/rtl-ssa/functions.h
@@ -165,16 +165,22 @@ public:
 
   // If CHANGE doesn't already clobber REGNO, try to add such a clobber,
   // limiting the movement range in order to make the clobber valid.
-  // When determining whether REGNO is live, ignore accesses made by an
-  // instruction I if IGNORE (I) is true.  The caller then assumes the
-  // responsibility of ensuring that CHANGE and I are placed in a valid order.
+  // Use IGNORE to guide this process, where IGNORE is an object that
+  // provides the same interface as ignore_nothing.
+  //
+  // That is, when determining whether REGNO is live, ignore accesses made
+  // by an instruction I if IGNORE says that I should be ignored.  The caller
+  // then assumes the responsibility of ensuring that CHANGE and I are placed
+  // in a valid order.  Similarly, ignore live ranges associated with a
+  // definition of REGNO if IGNORE says that that definition should be
+  // ignored.
   //
   // Return true on success.  Leave CHANGE unmodified when returning false.
   //
   // WATERMARK is a watermark returned by new_change_attempt ().
-  template<typename IgnorePredicate>
+  template<typename IgnorePredicates>
   bool add_regno_clobber (obstack_watermark &watermark, insn_change &change,
-			  unsigned int regno, IgnorePredicate ignore);
+			  unsigned int regno, IgnorePredicates ignore);
 
   // Return true if change_insns will be able to perform the changes
   // described by CHANGES.
diff --git a/gcc/rtl-ssa/insn-utils.h b/gcc/rtl-ssa/insn-utils.h
index bd3a4cbdcfa..1c54fe662e3 100644
--- a/gcc/rtl-ssa/insn-utils.h
+++ b/gcc/rtl-ssa/insn-utils.h
@@ -35,12 +35,4 @@ later_insn (insn_info *insn1, insn_info *insn2)
   return *insn1 < *insn2 ? insn2 : insn1;
 }
 
-// Return a closure of operator== for INSN.  See insn_is_changing for
-// the rationale for defining the function this way.
-inline insn_is_closure
-insn_is (const insn_info *insn)
-{
-  return insn_is_closure (insn);
-}
-
 }
diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h
index 334d02888ca..1ba56abc2ca 100644
--- a/gcc/rtl-ssa/insns.h
+++ b/gcc/rtl-ssa/insns.h
@@ -493,18 +493,6 @@ public:
   insn_info *last;
 };
 
-// A class that represents a closure of operator== for instructions.
-// This is used by insn_is; see there for details.
-class insn_is_closure
-{
-public:
-  insn_is_closure (const insn_info *insn) : m_insn (insn) {}
-  bool operator() (const insn_info *other) const { return m_insn == other; }
-
-private:
-  const insn_info *m_insn;
-};
-
 void pp_insn (pretty_printer *, const insn_info *);
 
 }
diff --git a/gcc/rtl-ssa/member-fns.inl b/gcc/rtl-ssa/member-fns.inl
index e4825ad2a18..833907b62c9 100644
--- a/gcc/rtl-ssa/member-fns.inl
+++ b/gcc/rtl-ssa/member-fns.inl
@@ -870,21 +870,6 @@ inline insn_change::insn_change (insn_info *insn, delete_action)
 {
 }
 
-inline insn_is_changing_closure::
-insn_is_changing_closure (array_slice<insn_change *const> changes)
-  : m_changes (changes)
-{
-}
-
-inline bool
-insn_is_changing_closure::operator() (const insn_info *insn) const
-{
-  for (const insn_change *change : m_changes)
-    if (change->insn () == insn)
-      return true;
-  return false;
-}
-
 inline iterator_range<bb_iterator>
 function_info::bbs () const
 {
@@ -963,11 +948,11 @@ function_info::single_dominating_def (unsigned int regno) const
   return nullptr;
 }
 
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 bool
 function_info::add_regno_clobber (obstack_watermark &watermark,
 				  insn_change &change, unsigned int regno,
-				  IgnorePredicate ignore)
+				  IgnorePredicates ignore)
 {
   // Check whether CHANGE already clobbers REGNO.
   if (find_access (change.new_defs, regno))
@@ -1003,4 +988,20 @@ function_info::change_alloc (obstack_watermark &wm, Ts... args)
   return new (addr) T (std::forward<Ts> (args)...);
 }
 
+inline
+ignore_changing_insns::
+ignore_changing_insns (array_slice<insn_change *const> changes)
+  : m_changes (changes)
+{
+}
+
+inline bool
+ignore_changing_insns::should_ignore_insn (const insn_info *insn)
+{
+  for (const insn_change *change : m_changes)
+    if (change->insn () == insn)
+      return true;
+  return false;
+}
+
 }
diff --git a/gcc/rtl-ssa/movement.h b/gcc/rtl-ssa/movement.h
index f55c234e824..17d31e0b5cb 100644
--- a/gcc/rtl-ssa/movement.h
+++ b/gcc/rtl-ssa/movement.h
@@ -83,10 +83,13 @@ canonicalize_move_range (insn_range_info &move_range, insn_info *insn)
 }
 
 // Try to restrict movement range MOVE_RANGE of INSN so that it can set
-// or clobber REGNO.  Assume that if:
+// or clobber REGNO.
+//
+// IGNORE is an object that provides the same interface as ignore_nothing.
+// Assume that if:
 //
 // - an instruction I2 contains another access A to REGNO; and
-// - IGNORE (I2) is true
+// - IGNORE says that I2 should be ignored
 //
 // then either:
 //
@@ -94,15 +97,18 @@ canonicalize_move_range (insn_range_info &move_range, insn_info *insn)
 // - something will ensure that the new definition of REGNO does not
 //   interfere with A, without this having to be enforced by I1's move range.
 //
+// If IGNORE says that a definition D of REGNO should be ignored, assume that
+// the new definition of REGNO will not conflict with D.
+//
 // Return true on success, otherwise leave MOVE_RANGE in an invalid state.
 //
 // This function only works correctly for instructions that remain within
 // the same extended basic block.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 bool
 restrict_movement_for_dead_range (insn_range_info &move_range,
 				  unsigned int regno, insn_info *insn,
-				  IgnorePredicate ignore)
+				  IgnorePredicates ignore)
 {
   // Find a definition at or neighboring INSN.
   resource_info resource = full_register (regno);
@@ -141,17 +147,18 @@ restrict_movement_for_dead_range (insn_range_info &move_range,
     {
       // Stop the instruction moving beyond the previous relevant access
       // to REGNO.
-      access_info *prev_access
-	= last_access_ignoring (prev, ignore_clobbers::YES, ignore);
+      access_info *prev_access = last_access (prev, ignore_clobbers::YES,
+					      ignore);
       if (prev_access)
 	move_range = move_later_than (move_range, access_insn (prev_access));
     }
 
-  // Stop the instruction moving beyond the next relevant definition of REGNO.
+  // Stop the instruction moving beyond the next relevant use or definition
+  // of REGNO.
   def_info *next = dl.matching_set_or_first_def_of_next_group ();
-  next = first_def_ignoring (next, ignore_clobbers::YES, ignore);
-  if (next)
-    move_range = move_earlier_than (move_range, next->insn ());
+  access_info *next_access = first_access (next, ignore_clobbers::YES, ignore);
+  if (next_access)
+    move_range = move_earlier_than (move_range, access_insn (next_access));
 
   return canonicalize_move_range (move_range, insn);
 }
@@ -159,11 +166,14 @@ restrict_movement_for_dead_range (insn_range_info &move_range,
 // Try to restrict movement range MOVE_RANGE so that it is possible for the
 // instruction being moved ("instruction I1") to perform all the definitions
 // in DEFS while still preserving dependencies between those definitions
-// and surrounding instructions.  Assume that if:
+// and surrounding instructions.
+//
+// IGNORE is an object that provides the same interface as ignore_nothing.
+// Assume that if:
 //
 // - DEFS contains a definition D of resource R;
 // - an instruction I2 contains another access A to R; and
-// - IGNORE (I2) is true
+// - IGNORE says that I2 should be ignored
 //
 // then either:
 //
@@ -171,14 +181,17 @@ restrict_movement_for_dead_range (insn_range_info &move_range,
 // - something will ensure that D and A maintain their current order,
 //   without this having to be enforced by I1's move range.
 //
+// Assume the same thing about a definition D and all uses of D if IGNORE
+// says that D should be ignored.
+//
 // Return true on success, otherwise leave MOVE_RANGE in an invalid state.
 //
 // This function only works correctly for instructions that remain within
 // the same extended basic block.
-template<typename IgnorePredicate>
+template<typename IgnorePredicates>
 bool
-restrict_movement_for_defs_ignoring (insn_range_info &move_range,
-				     def_array defs, IgnorePredicate ignore)
+restrict_movement_for_defs (insn_range_info &move_range, def_array defs,
+			    IgnorePredicates ignore)
 {
   for (def_info *def : defs)
     {
@@ -194,29 +207,16 @@ restrict_movement_for_defs_ignoring (insn_range_info &move_range,
       // are being moved at once.
       bool is_clobber = is_a<clobber_info *> (def);
 
-      // Search back for the first unfiltered use or definition of the
+      // Search back for the first relevant use or definition of the
       // same resource.
       access_info *access;
-      access = prev_access_ignoring (def, ignore_clobbers (is_clobber),
-				     ignore);
+      access = prev_access (def, ignore_clobbers (is_clobber), ignore);
       if (access)
 	move_range = move_later_than (move_range, access_insn (access));
 
-      // Search forward for the first unfiltered use of DEF,
-      // or the first unfiltered definition that follows DEF.
-      //
-      // We don't need to consider uses of following definitions, since
-      // if IGNORE (D->insn ()) is true for some definition D, the caller
-      // is guarantees that either
-      //
-      // - D will be removed, and thus its uses will be removed; or
-      // - D will occur after DEF, and thus D's uses will also occur
-      //   after DEF.
-      //
-      // This is purely a simplification: we could also process D's uses,
-      // but we don't need to.
-      access = next_access_ignoring (def, ignore_clobbers (is_clobber),
-				     ignore);
+      // Search forward for the next relevant use or definition of the
+      // same resource.
+      access = next_access (def, ignore_clobbers (is_clobber), ignore);
       if (access)
 	move_range = move_earlier_than (move_range, access_insn (access));
 
@@ -238,13 +238,11 @@ restrict_movement_for_defs_ignoring (insn_range_info &move_range,
 	    return false;
 
 	  insn_info *insn;
-	  insn = prev_call_clobbers_ignoring (*call_group, def->insn (),
-					      ignore);
+	  insn = prev_call_clobbers (*call_group, def->insn (), ignore);
 	  if (insn)
 	    move_range = move_later_than (move_range, insn);
 
-	  insn = next_call_clobbers_ignoring (*call_group, def->insn (),
-					      ignore);
+	  insn = next_call_clobbers (*call_group, def->insn (), ignore);
 	  if (insn)
 	    move_range = move_earlier_than (move_range, insn);
 	}
@@ -262,11 +260,11 @@ restrict_movement_for_defs_ignoring (insn_range_info &move_range,
   return bool (move_range);
 }
 
-// Like restrict_movement_for_defs_ignoring, but for the uses in USES.
-template<typename IgnorePredicate>
+// Like restrict_movement_for_defs, but for the uses in USES.
+template<typename IgnorePredicates>
 bool
-restrict_movement_for_uses_ignoring (insn_range_info &move_range,
-				     use_array uses, IgnorePredicate ignore)
+restrict_movement_for_uses (insn_range_info &move_range, use_array uses,
+			    IgnorePredicates ignore)
 {
   for (const use_info *use : uses)
     {
@@ -284,31 +282,21 @@ restrict_movement_for_uses_ignoring (insn_range_info &move_range,
       if (use->is_in_debug_insn ())
 	continue;
 
-      // If the used value is defined by an instruction I2 for which
-      // IGNORE (I2) is true, the caller guarantees that I2 will occur
-      // before change.insn ().  Otherwise, make sure that the use occurs
-      // after the definition.
+      // If the used value is defined by an ignored instruction I2,
+      // the caller guarantees that I2 will occur before change.insn ()
+      // and that its value will still be available at change.insn ().
+      // Otherwise, make sure that the use occurs after the definition.
       insn_info *insn = set->insn ();
-      if (!ignore (insn))
+      if (!ignore.should_ignore_def (set)
+	  && !ignore.should_ignore_insn (insn))
 	move_range = move_later_than (move_range, insn);
 
-      // Search forward for the first unfiltered definition that follows SET.
-      //
-      // We don't need to consider the uses of these definitions, since
-      // if IGNORE (D->insn ()) is true for some definition D, the caller
-      // is guarantees that either
-      //
-      // - D will be removed, and thus its uses will be removed; or
-      // - D will occur after USE, and thus D's uses will also occur
-      //   after USE.
-      //
-      // This is purely a simplification: we could also process D's uses,
-      // but we don't need to.
-      def_info *def;
-      def = first_def_ignoring (set->next_def (), ignore_clobbers::NO,
-				ignore);
-      if (def)
-	move_range = move_earlier_than (move_range, def->insn ());
+      // Search forward after SET's live range for the first relevant
+      // use or definition of the same resource.
+      access_info *access;
+      access = first_access (set->next_def (), ignore_clobbers::NO, ignore);
+      if (access)
+	move_range = move_earlier_than (move_range, access_insn (access));
 
       // If USE uses a hard register, take any call clobbers into account too.
       // SET will necessarily occur after any previous call clobber, so we
@@ -326,8 +314,8 @@ restrict_movement_for_uses_ignoring (insn_range_info &move_range,
 	  if (!move_range)
 	    return false;
 
-	  insn_info *insn = next_call_clobbers_ignoring (*call_group,
-							 use->insn (), ignore);
+	  insn_info *insn = next_call_clobbers (*call_group, use->insn (),
+						ignore);
 	  if (insn)
 	    move_range = move_earlier_than (move_range, insn);
 	}
diff --git a/gcc/rtl-ssa/predicates.h b/gcc/rtl-ssa/predicates.h
new file mode 100644
index 00000000000..225a8c658b4
--- /dev/null
+++ b/gcc/rtl-ssa/predicates.h
@@ -0,0 +1,58 @@
+// RTL SSA predicate classes                                        -*- C++ -*-
+// Copyright (C) 2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+namespace rtl_ssa {
+
+// Collects predicates that affect a scan over the IR, specifying what
+// (if anything) should be ignored.
+struct ignore_nothing
+{
+  // Return true if the scan should ignore the given definition
+  // and all uses of the definition.
+  bool should_ignore_def (const def_info *) { return false; }
+
+  // Return true if the scan should ignore the given instruction.
+  bool should_ignore_insn (const insn_info *) { return false; }
+};
+
+// Predicates that ignore the instruction passed to the constructor
+// (and nothing else).
+class ignore_insn : public ignore_nothing
+{
+public:
+  ignore_insn (const insn_info *insn) : m_insn (insn) {}
+  bool should_ignore_insn (const insn_info *insn) { return insn == m_insn; }
+
+private:
+  const insn_info *m_insn;
+};
+
+// Predicates that ignore all the instructions being changed by a set
+// of insn_changes.
+class ignore_changing_insns : public ignore_nothing
+{
+public:
+  ignore_changing_insns (array_slice<insn_change *const>);
+  bool should_ignore_insn (const insn_info *);
+
+private:
+  array_slice<insn_change *const> m_changes;
+};
+
+}
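
The new classes rely on duck typing rather than virtual dispatch: any object
that provides should_ignore_def and should_ignore_insn members can be passed
to the template scan routines, so each instantiation is resolved statically
with no indirect call.  A minimal standalone sketch of the pattern (the
insn_info/def_info stand-ins and first_relevant_insn are simplified
illustrations, not the real rtl-ssa types):

```cpp
#include <vector>

// Simplified stand-ins for rtl-ssa's insn_info and def_info.
struct insn_info { int uid; };
struct def_info { insn_info *insn; };

// Matches the shape of rtl_ssa::ignore_nothing.
struct ignore_nothing
{
  bool should_ignore_def (const def_info *) { return false; }
  bool should_ignore_insn (const insn_info *) { return false; }
};

// Matches the shape of rtl_ssa::ignore_insn: ignore one instruction,
// inherit the "ignore no definitions" behaviour from ignore_nothing.
class ignore_insn : public ignore_nothing
{
public:
  ignore_insn (const insn_info *insn) : m_insn (insn) {}
  bool should_ignore_insn (const insn_info *insn) { return insn == m_insn; }

private:
  const insn_info *m_insn;
};

// A scan routine in the style of first_access etc., statically
// parameterized on the predicates object.
template<typename IgnorePredicates>
const insn_info *
first_relevant_insn (const std::vector<insn_info *> &insns,
		     IgnorePredicates ignore)
{
  for (const insn_info *insn : insns)
    if (!ignore.should_ignore_insn (insn))
      return insn;
  return nullptr;
}
```

Because the predicates are a template parameter, passing ignore_nothing
compiles down to an unconditional loop, which is why the old closure-based
insn_is/insn_is_changing helpers can be removed without a performance cost.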
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/6] rtl-ssa: Don't cost no-op moves
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
  2024-06-20 13:34 ` [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-21 14:32   ` Jeff Law
  2024-06-20 13:34 ` [PATCH 3/6] iq2000: Fix test and branch instructions Richard Sandiford
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan
to delete them later.  Such insns shouldn't be costed, partly
because they're going to disappear, and partly because targets
won't recognise the insn code.

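The costing rule above can be sketched as follows.  This is an illustrative
stand-in, not the GCC implementation: the sentinel value and the insn
structure are assumptions, and real insn_cost () queries the target.

```cpp
// Assumed sentinel for a move that is scheduled for deletion;
// the real NOOP_MOVE_INSN_CODE value differs.
const int NOOP_MOVE_INSN_CODE = -3;

// Simplified stand-in for an insn: its code and a target-reported cost.
struct insn { int code; int target_cost; };

// Sketch of the costing rule: never ask the target to cost a no-op
// move, both because it will disappear and because the target would
// not recognise the insn code.  0 doubles as "don't know", as in
// insn_cost; callers that care must check the insn code.
static int
cost_insn (const insn &i)
{
  if (i.code == NOOP_MOVE_INSN_CODE)
    return 0;
  return i.target_cost;
}
```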
gcc/
	* rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't
	cost no-op moves.
	* rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise.
---
 gcc/rtl-ssa/changes.cc | 6 +++++-
 gcc/rtl-ssa/insns.cc   | 7 ++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index c5ac4956a19..bc80d7da829 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -177,13 +177,17 @@ rtl_ssa::changes_are_worthwhile (array_slice<insn_change *const> changes,
   auto entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
   for (insn_change *change : changes)
     {
+      // Count zero for the old cost if the old instruction was a no-op
+      // move or had an unknown cost.  This should reduce the chances of
+      // making an unprofitable change.
       old_cost += change->old_cost ();
       basic_block cfg_bb = change->bb ()->cfg_bb ();
       bool for_speed = optimize_bb_for_speed_p (cfg_bb);
       if (for_speed)
 	weighted_old_cost += (cfg_bb->count.to_sreal_scale (entry_count)
 			      * change->old_cost ());
-      if (!change->is_deletion ())
+      if (!change->is_deletion ()
+	  && INSN_CODE (change->rtl ()) != NOOP_MOVE_INSN_CODE)
 	{
 	  change->new_cost = insn_cost (change->rtl (), for_speed);
 	  /* If the cost is unknown, replacement is not worthwhile.  */
diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 0171d93c357..68365e323ec 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -48,7 +48,12 @@ insn_info::calculate_cost () const
 {
   basic_block cfg_bb = BLOCK_FOR_INSN (m_rtl);
   temporarily_undo_changes (0);
-  m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
+  if (INSN_CODE (m_rtl) == NOOP_MOVE_INSN_CODE)
+    // insn_cost also uses 0 to mean "don't know".  Callers that
+    // want to distinguish the cases will need to check INSN_CODE.
+    m_cost_or_uid = 0;
+  else
+    m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
   redo_changes (0);
 }
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/6] iq2000: Fix test and branch instructions
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
  2024-06-20 13:34 ` [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces Richard Sandiford
  2024-06-20 13:34 ` [PATCH 2/6] rtl-ssa: Don't cost no-op moves Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-21 14:33   ` Jeff Law
  2024-06-20 13:34 ` [PATCH 4/6] sh: Make *minus_plus_one work after RA Richard Sandiford
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

The iq2000 test and branch instructions had patterns like:

  [(set (pc)
	(if_then_else
	 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
		     (match_operand:SI 1 "power_of_2_operand" "I"))
	      (const_int 0))
	 (match_operand 2 "pc_or_label_operand" "")
	 (match_operand 3 "pc_or_label_operand" "")))]

power_of_2_operand allows any 32-bit power of 2, whereas "I" only
accepts 16-bit signed constants.  This meant that any power of 2
greater than 32768 would cause an "insn does not satisfy its
constraints" ICE.

Also, the %p operand modifier barfed on 1<<31, which is sign-
rather than zero-extended to 64 bits.  The code is inherently
limited to 32-bit operands -- power_of_2_operand contains a test
involving "unsigned" -- so this patch just ands with 0xffffffff.

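The sign-extension problem can be seen in a standalone sketch.  exact_log2
below is a simplified stand-in for GCC's helper, and p_modifier_value is a
hypothetical name for the masked computation the patch introduces:

```cpp
#include <cstdint>

// Simplified stand-in for GCC's exact_log2: return log2 (x) if x is a
// power of 2, otherwise -1.
static int
exact_log2 (uint64_t x)
{
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  int n = 0;
  while (x >>= 1)
    ++n;
  return n;
}

// The masked form used by the patch: the operand is known to be a
// 32-bit power of 2, so only the low 32 bits are meaningful.
static int
p_modifier_value (int64_t intval)
{
  return exact_log2 ((uint64_t) intval & 0xffffffff);
}
```

On a 64-bit host, a CONST_INT holding 1<<31 is stored sign-extended as
0xFFFFFFFF80000000, which is not a 64-bit power of 2, so the unmasked
exact_log2 fails; masking first recovers 31.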
gcc/
	* config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 1<<31.
	* config/iq2000/iq2000.md: Remove "I" constraints on
	power_of_2_operands.
---
 gcc/config/iq2000/iq2000.cc | 2 +-
 gcc/config/iq2000/iq2000.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/iq2000/iq2000.cc b/gcc/config/iq2000/iq2000.cc
index f9f8c417841..136675d0fbb 100644
--- a/gcc/config/iq2000/iq2000.cc
+++ b/gcc/config/iq2000/iq2000.cc
@@ -3127,7 +3127,7 @@ iq2000_print_operand (FILE *file, rtx op, int letter)
     {
       int value;
       if (code != CONST_INT
-	  || (value = exact_log2 (INTVAL (op))) < 0)
+	  || (value = exact_log2 (UINTVAL (op) & 0xffffffff)) < 0)
 	output_operand_lossage ("invalid %%p value");
       else
 	fprintf (file, "%d", value);
diff --git a/gcc/config/iq2000/iq2000.md b/gcc/config/iq2000/iq2000.md
index 8617efac3c6..e62c250ce8c 100644
--- a/gcc/config/iq2000/iq2000.md
+++ b/gcc/config/iq2000/iq2000.md
@@ -1175,7 +1175,7 @@ (define_insn ""
   [(set (pc)
 	(if_then_else
 	 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
-		     (match_operand:SI 1 "power_of_2_operand" "I"))
+		     (match_operand:SI 1 "power_of_2_operand"))
 	      (const_int 0))
 	 (match_operand 2 "pc_or_label_operand" "")
 	 (match_operand 3 "pc_or_label_operand" "")))]
@@ -1189,7 +1189,7 @@ (define_insn ""
   [(set (pc)
 	(if_then_else
 	 (ne (and:SI (match_operand:SI 0 "register_operand" "r")
-		     (match_operand:SI 1 "power_of_2_operand" "I"))
+		     (match_operand:SI 1 "power_of_2_operand"))
 	     (const_int 0))
 	 (match_operand 2 "pc_or_label_operand" "")
 	 (match_operand 3 "pc_or_label_operand" "")))]
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/6] sh: Make *minus_plus_one work after RA
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
                   ` (2 preceding siblings ...)
  2024-06-20 13:34 ` [PATCH 3/6] iq2000: Fix test and branch instructions Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-21  0:15   ` Oleg Endo
  2024-06-20 13:34 ` [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand Richard Sandiford
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

*minus_plus_one had no constraints, which meant that it could be
matched after RA with operands 0, 1 and 2 all being different.
The associated split instead requires operand 0 to be tied to
operand 1.

gcc/
	* config/sh/sh.md (*minus_plus_one): Add constraints.
---
 gcc/config/sh/sh.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md
index 92a1efeb811..9491b49e55b 100644
--- a/gcc/config/sh/sh.md
+++ b/gcc/config/sh/sh.md
@@ -1642,9 +1642,9 @@ (define_insn_and_split "*addc"
 ;; matched.  Split this up into a simple sub add sequence, as this will save
 ;; us one sett insn.
 (define_insn_and_split "*minus_plus_one"
-  [(set (match_operand:SI 0 "arith_reg_dest" "")
-	(plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "")
-			   (match_operand:SI 2 "arith_reg_operand" ""))
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+	(plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0")
+			   (match_operand:SI 2 "arith_reg_operand" "r"))
 		 (const_int 1)))]
   "TARGET_SH1"
   "#"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
                   ` (3 preceding siblings ...)
  2024-06-20 13:34 ` [PATCH 4/6] sh: Make *minus_plus_one work after RA Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-21 14:33   ` Jeff Law
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
  2024-06-28 12:25 ` LoongArch vs. [PATCH 0/6] Add a late-combine pass Xi Ruoyao
  6 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

All uses of xs_hi_nonmemory_operand allow constraint "i",
which means that they allow consts, symbol_refs and label_refs.
The definition of xs_hi_nonmemory_operand accounted for consts,
but not for symbol_refs and label_refs.

gcc/
	* config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle
	symbol_ref and label_ref.
---
 gcc/config/stormy16/predicates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/stormy16/predicates.md b/gcc/config/stormy16/predicates.md
index 67c2ddc107c..085c9c5ed2d 100644
--- a/gcc/config/stormy16/predicates.md
+++ b/gcc/config/stormy16/predicates.md
@@ -152,7 +152,7 @@ (define_predicate "xstormy16_carry_plus_operand"
 })
 
 (define_predicate "xs_hi_nonmemory_operand"
-  (match_code "const_int,reg,subreg,const")
+  (match_code "const_int,reg,subreg,const,symbol_ref,label_ref")
 {
   return nonmemory_operand (op, mode);
 })
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
                   ` (4 preceding siblings ...)
  2024-06-20 13:34 ` [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand Richard Sandiford
@ 2024-06-20 13:34 ` Richard Sandiford
  2024-06-21  0:17   ` Oleg Endo
                     ` (4 more replies)
  2024-06-28 12:25 ` LoongArch vs. [PATCH 0/6] Add a late-combine pass Xi Ruoyao
  6 siblings, 5 replies; 36+ messages in thread
From: Richard Sandiford @ 2024-06-20 13:34 UTC (permalink / raw)
  To: jlaw, gcc-patches; +Cc: Richard Sandiford

This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.

The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.

This is just a first step.  I'm hoping that the pass could be
used for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure.  If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.

On most targets, the pass is enabled by default at -O2 and above.
However, it has a tendency to undo x86's STV and RPAD passes,
by folding the more complex post-STV/RPAD form back into the
simpler pre-pass form.

Also, running a pass after register allocation means that we can
now match define_insn_and_splits that were previously only matched
before register allocation.  This trips things like:

  (define_insn_and_split "..."
    [...pattern...]
    "...cond..."
    "#"
    "&& 1"
    [...pattern...]
    {
      ...unconditional use of gen_reg_rtx ()...;
    }

because matching and splitting after RA will call gen_reg_rtx when
pseudos are no longer allowed.  rs6000 has several instances of this.

xtensa has a variation in which the split condition is:

    "&& can_create_pseudo_p ()"

The failure then is that, if we match after RA, we'll never be
able to split the instruction.

The patch therefore disables the pass by default on i386, rs6000
and xtensa.  Hopefully we can fix those ports later (if their
maintainers want).  It seems easier to add the pass first, though,
to make it easier to test any such fixes.
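For reference, the usual idiom for making such a pattern safe to match
after RA (just a sketch of the general technique, not something this
series does) is to carry the temporary as a match_scratch clobber rather
than calling gen_reg_rtx in the split body:

  (define_insn_and_split "..."
    [...pattern...
     (clobber (match_scratch:SI 2 "=&r"))]
    "...cond..."
    "#"
    "&& reload_completed"
    [...pattern that uses operand 2 as the temporary...])

Before RA the scratch is just (scratch:SI); after RA it is a real hard
register, so the split no longer needs to create a pseudo.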

gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
quite a few updates for the late-combine output.  That might be
worth doing, but it seems too complex to do as part of this patch.

I tried compiling at least one target per CPU directory and comparing
the assembly output for parts of the GCC testsuite.  This is just a way
of getting a flavour of how the pass performs; it obviously isn't a
meaningful benchmark.  All targets seemed to improve on average:

Target                 Tests   Good    Bad   %Good   Delta  Median
======                 =====   ====    ===   =====   =====  ======
aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
arc-elf                 2166   1932    234  89.20%  -37742      -1
arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
avr-elf                 4789   4330    459  90.42% -441276      -4
bfin-elf                2795   2394    401  85.65%  -19252      -1
bpf-elf                 3122   2928    194  93.79%   -8785      -1
c6x-elf                 2227   1929    298  86.62%  -17339      -1
cris-elf                3464   3270    194  94.40%  -23263      -2
csky-elf                2915   2591    324  88.89%  -22146      -1
epiphany-elf            2399   2304     95  96.04%  -28698      -2
fr30-elf                7712   7299    413  94.64%  -99830      -2
frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
ft32-elf                2775   2667    108  96.11%  -25029      -1
h8300-elf               3176   2862    314  90.11%  -29305      -2
hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
iq2000-elf              9684   9637     47  99.51% -126557      -2
lm32-elf                2681   2608     73  97.28%  -59884      -3
loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
m32r-elf                1626   1517    109  93.30%   -9323      -2
m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
mcore-elf               2315   2085    230  90.06%  -24160      -1
microblaze-elf          2782   2585    197  92.92%  -16530      -1
mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
mmix                    4914   4814    100  97.96%  -63021      -1
mn10300-elf             3639   3320    319  91.23%  -34752      -2
moxie-rtems             3497   3252    245  92.99%  -87305      -3
msp430-elf              4353   3876    477  89.04%  -23780      -1
nds32le-elf             3042   2780    262  91.39%  -27320      -1
nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
nvptx-none              2114   1781    333  84.25%  -12589      -2
or1k-elf                3045   2699    346  88.64%  -14328      -2
pdp11                   4515   4146    369  91.83%  -26047      -2
pru-elf                 1585   1245    340  78.55%   -5225      -1
riscv32-elf             2122   2000    122  94.25% -101162      -2
riscv64-elf             1841   1726    115  93.75%  -49997      -2
rl78-elf                2823   2530    293  89.62%  -40742      -4
rx-elf                  2614   2480    134  94.87%  -18863      -1
s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
v850-elf                1897   1728    169  91.09%  -11078      -1
vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
visium-elf              1392   1106    286  79.45%   -7984      -2
xstormy16-elf           2577   2071    506  80.36%  -13061      -1

gcc/
	PR rtl-optimization/106594
	* Makefile.in (OBJS): Add late-combine.o.
	* common.opt (flate-combine-instructions): New option.
	* doc/invoke.texi: Document it.
	* opts.cc (default_options_table): Enable it by default at -O2
	and above.
	* tree-pass.h (make_pass_late_combine): Declare.
	* late-combine.cc: New file.
	* passes.def: Add two instances of late_combine.
	* config/i386/i386-options.cc (ix86_override_options_after_change):
	Disable late-combine by default.
	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Likewise.
	* config/xtensa/xtensa.cc (xtensa_option_override): Likewise.

gcc/testsuite/
	PR rtl-optimization/106594
	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
	targets.
	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
	* gcc.target/aarch64/bitfield-bitint-abi-align16.c: Add
	-fno-late-combine-instructions.
	* gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
	described in the comment.
	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
	* gcc.target/aarch64/pr106594_1.c: New test.
---
 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/config/i386/i386-options.cc               |   4 +
 gcc/config/rs6000/rs6000.cc                   |   8 +
 gcc/config/xtensa/xtensa.cc                   |  11 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/late-combine.cc                           | 747 ++++++++++++++++++
 gcc/opts.cc                                   |   1 +
 gcc/passes.def                                |   2 +
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
 .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 21 files changed, 834 insertions(+), 37 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index f5adb647d3f..5e29ddb5690 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1574,6 +1574,7 @@ OBJS = \
 	ira-lives.o \
 	jump.o \
 	langhooks.o \
+	late-combine.o \
 	lcm.o \
 	lists.o \
 	loop-doloop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index f2bc47fdc5e..327230967ea 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1796,6 +1796,11 @@ Common Var(flag_large_source_files) Init(0)
 Improve GCC's ability to track column numbers in large source files,
 at the expense of slower compilation.
 
+flate-combine-instructions
+Common Var(flag_late_combine_instructions) Optimization Init(0)
+Run two instruction combination passes late in the pass pipeline;
+one before register allocation and one after.
+
 floop-parallelize-all
 Common Var(flag_loop_parallelize_all) Optimization
 Mark all loops as parallel.
diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index f2cecc0e254..4620bf8e9e6 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
 	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
     }
 
+  /* Late combine tends to undo some of the effects of STV and RPAD,
+     by combining instructions back to their original form.  */
+  if (!OPTION_SET_P (flag_late_combine_instructions))
+    flag_late_combine_instructions = 0;
 }
 
 /* Clear stack slot assignments remembered from previous functions.
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index e4dc629ddcc..f39b8909925 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
 	targetm.expand_builtin_va_start = NULL;
     }
 
+  /* One of the late-combine passes runs after register allocation
+     and can match define_insn_and_splits that were previously used
+     only before register allocation.  Some of those define_insn_and_splits
+     use gen_reg_rtx unconditionally.  Disable late-combine by default
+     until the define_insn_and_splits are fixed.  */
+  if (!OPTION_SET_P (flag_late_combine_instructions))
+    flag_late_combine_instructions = 0;
+
   rs6000_override_options_after_change ();
 
   /* If not explicitly specified via option, decide whether to generate indexed
diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index 45dc1be3ff5..308dc62e0f8 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -59,6 +59,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "print-rtl.h"
 #include <math.h>
+#include "opts.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -2916,6 +2917,16 @@ xtensa_option_override (void)
       flag_reorder_blocks_and_partition = 0;
       flag_reorder_blocks = 1;
     }
+
+  /* One of the late-combine passes runs after register allocation
+     and can match define_insn_and_splits that were previously used
+     only before register allocation.  Some of those define_insn_and_splits
+     require the split to take place, but have a split condition of
+     can_create_pseudo_p, and so matching after RA will give an
+     unsplittable instruction.  Disable late-combine by default until
+     the define_insn_and_splits are fixed.  */
+  if (!OPTION_SET_P (flag_late_combine_instructions))
+    flag_late_combine_instructions = 0;
 }
 
 /* Implement TARGET_HARD_REGNO_NREGS.  */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5d7a87fde86..3b8c427d509 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -575,7 +575,7 @@ Objective-C and Objective-C++ Dialects}.
 -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
 -fipa-reference  -fipa-reference-addressable
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
--flive-patching=@var{level}
+-flate-combine-instructions  -flive-patching=@var{level}
 -fira-region=@var{region}  -fira-hoist-pressure
 -fira-loop-pressure  -fno-ira-share-save-slots
 -fno-ira-share-spill-slots
@@ -13675,6 +13675,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
 
 This flag is enabled by default at @option{-O2} and @option{-Os}.
 
+@opindex flate-combine-instructions
+@item -flate-combine-instructions
+Enable two instruction combination passes that run relatively late in the
+compilation process.  One of the passes runs before register allocation and
+the other after register allocation.  The main aim of the passes is to
+substitute definitions into all uses.
+
+Most targets enable this flag by default at @option{-O2} and @option{-Os}.
+
 @opindex flive-patching
 @item -flive-patching=@var{level}
 Control GCC's optimizations to produce output suitable for live-patching.
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
new file mode 100644
index 00000000000..22a1d81d38e
--- /dev/null
+++ b/gcc/late-combine.cc
@@ -0,0 +1,747 @@
+// Late-stage instruction combination pass.
+// Copyright (C) 2023-2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// The current purpose of this pass is to substitute definitions into
+// all uses, so that the definition can be removed.  However, it could
+// be extended to handle other combination-related optimizations in future.
+//
+// The pass can run before or after register allocation.  When running
+// before register allocation, it tries to avoid cases that are likely
+// to increase register pressure.  For the same reason, it avoids moving
+// instructions around, even if doing so would allow an optimization to
+// succeed.  These limitations are removed when running after register
+// allocation.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "print-rtl.h"
+#include "tree-pass.h"
+#include "cfgcleanup.h"
+#include "target.h"
+
+using namespace rtl_ssa;
+
+namespace {
+const pass_data pass_data_late_combine =
+{
+  RTL_PASS, // type
+  "late_combine", // name
+  OPTGROUP_NONE, // optinfo_flags
+  TV_NONE, // tv_id
+  0, // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // todo_flags_start
+  TODO_df_finish, // todo_flags_finish
+};
+
+// Represents an attempt to substitute a single-set definition into all
+// uses of the definition.
+class insn_combination
+{
+public:
+  insn_combination (set_info *, rtx, rtx);
+  bool run ();
+  array_slice<insn_change *const> use_changes () const;
+
+private:
+  use_array get_new_uses (use_info *);
+  bool substitute_nondebug_use (use_info *);
+  bool substitute_nondebug_uses (set_info *);
+  bool try_to_preserve_debug_info (insn_change &, use_info *);
+  void substitute_debug_use (use_info *);
+  bool substitute_note (insn_info *, rtx, bool);
+  void substitute_notes (insn_info *, bool);
+  void substitute_note_uses (use_info *);
+  void substitute_optional_uses (set_info *);
+
+  // Represents the state of the function's RTL at the start of this
+  // combination attempt.
+  insn_change_watermark m_rtl_watermark;
+
+  // Represents the rtl-ssa state at the start of this combination attempt.
+  obstack_watermark m_attempt;
+
+  // The instruction that contains the definition, and that we're trying
+  // to delete.
+  insn_info *m_def_insn;
+
+  // The definition itself.
+  set_info *m_def;
+
+  // The destination and source of the single set that defines m_def.
+  // The destination is known to be a plain REG.
+  rtx m_dest;
+  rtx m_src;
+
+  // Contains the full list of changes that we want to make, in reverse
+  // postorder.
+  auto_vec<insn_change *> m_nondebug_changes;
+};
+
+// Class that represents one run of the pass.
+class late_combine
+{
+public:
+  unsigned int execute (function *);
+
+private:
+  rtx optimizable_set (insn_info *);
+  bool check_register_pressure (insn_info *, rtx);
+  bool check_uses (set_info *, rtx);
+  bool combine_into_uses (insn_info *, insn_info *);
+
+  auto_vec<insn_info *> m_worklist;
+};
+
+insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
+  : m_rtl_watermark (),
+    m_attempt (crtl->ssa->new_change_attempt ()),
+    m_def_insn (def->insn ()),
+    m_def (def),
+    m_dest (dest),
+    m_src (src),
+    m_nondebug_changes ()
+{
+}
+
+array_slice<insn_change *const>
+insn_combination::use_changes () const
+{
+  return { m_nondebug_changes.address () + 1,
+	   m_nondebug_changes.length () - 1 };
+}
+
+// USE is a direct or indirect use of m_def.  Return the list of uses
+// that would be needed after substituting m_def into the instruction.
+// The returned list is marked as invalid if USE's insn and m_def_insn
+// use different definitions for the same resource (register or memory).
+use_array
+insn_combination::get_new_uses (use_info *use)
+{
+  auto *def = use->def ();
+  auto *use_insn = use->insn ();
+
+  use_array new_uses = use_insn->uses ();
+  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
+  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
+  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
+    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
+					       use_insn->is_debug_insn ());
+  return new_uses;
+}
+
+// Start the process of trying to replace USE by substitution, given that
+// USE occurs in a non-debug instruction.  Check:
+//
+// - that the substitution can be represented in RTL
+//
+// - that each use of a resource (register or memory) within the new
+//   instruction has a consistent definition
+//
+// - that the new instruction is a recognized pattern
+//
+// - that the instruction can be placed somewhere that makes all definitions
+//   and uses valid, and that permits any new hard-register clobbers added
+//   during the recognition process
+//
+// Return true on success.
+bool
+insn_combination::substitute_nondebug_use (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_insn_slim (dump_file, use->insn ()->rtl ());
+
+  // Check that we can change the instruction pattern.  Leave recognition
+  // of the result till later.
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
+      || prop.num_replacements == 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- RTL substitution failed\n");
+      return false;
+    }
+
+  use_array new_uses = get_new_uses (use);
+  if (!new_uses.is_valid ())
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- could not prove that all sources"
+		 " are available\n");
+      return false;
+    }
+
+  // Create a tentative change for the use.
+  auto *where = XOBNEW (m_attempt, insn_change);
+  auto *use_change = new (where) insn_change (use_insn);
+  m_nondebug_changes.safe_push (use_change);
+  use_change->new_uses = new_uses;
+
+  struct local_ignore : ignore_nothing
+  {
+    local_ignore (const set_info *def, const insn_info *use_insn)
+      : m_def (def), m_use_insn (use_insn) {}
+
+    // We don't limit the number of insns per optimization, so ignoring all
+    // insns for every use would lead to quadratic complexity.  Just ignore
+    // the use and definition, which should be enough for most purposes.
+    bool
+    should_ignore_insn (const insn_info *insn)
+    {
+      return insn == m_def->insn () || insn == m_use_insn;
+    }
+
+    // Ignore the definition that we're removing, and all uses of it.
+    bool should_ignore_def (const def_info *def) { return def == m_def; }
+
+    const set_info *m_def;
+    const insn_info *m_use_insn;
+  };
+
+  auto ignore = local_ignore (m_def, use_insn);
+
+  // Moving instructions before register allocation could increase
+  // register pressure.  Only try moving them after RA.
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change->move_range = { use_insn->bb ()->head_insn (),
+			       use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement (*use_change, ignore))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+		 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog (m_attempt, *use_change, ignore))
+    return false;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+	&& !use->only_occurs_in_notes ()
+	&& !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+					      use_info *use)
+{
+  // Punt on unsimplified subregs of hard registers.  In that case,
+  // propagation can succeed and create a wider reg than the one we
+  // started with.
+  if (HARD_REGISTER_NUM_P (use->regno ())
+      && use->includes_subregs ())
+    return false;
+
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+				   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+	      && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+	ptr = &XEXP (note, 1);
+      else
+	*ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+	  || !restrict_movement (use_change))
+	{
+	  use_change.move_range = use_insn;
+	  use_change.new_uses = remove_uses_of_def (m_attempt,
+						    use_insn->uses (),
+						    use->def ());
+	  can_propagate = false;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "%s notes in:\n",
+		   can_propagate ? "updating" : "removing");
+	  dump_insn_slim (dump_file, use_insn->rtl ());
+	}
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    // We've already decided to update the insn's pattern and know that m_src
+    // will be available at the insn's new location.  Now update its notes.
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+	{
+	  use_info *next_use = use->next_any_insn_use ();
+	  if (use->is_in_debug_insn ())
+	    substitute_debug_use (use);
+	  else if (!use->is_live_out_use ())
+	    substitute_note_uses (use);
+	  use = next_use;
+	}
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+	       m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+  m_nondebug_changes.safe_push (&def_change);
+
+  if (!substitute_nondebug_uses (m_def)
+      || !changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+	&& !HARD_REGISTER_NUM_P (use->regno ())
+	&& !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+	return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of the R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&] (unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+	return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+	return true;
+
+      if (SUBREG_P (op))
+	op = SUBREG_REG (op);
+      if (REG_P (op))
+	{
+	  // Ignore hard registers.  We've already rejected uses of non-fixed
+	  // hard registers in the SET_SRC.
+	  if (HARD_REGISTER_P (op))
+	    return true;
+
+	  // Make sure that the source operand's class is at least as
+	  // permissive as the destination operand's class.
+	  auto src_class = alternative_class (alt, i);
+	  if (!reg_class_subset_p (dest_class, src_class))
+	    return false;
+
+	  // Make sure that the source operand occupies no more hard
+	  // registers than the destination operand.  This mostly matters
+	  // for subregs.
+	  if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+	      < targetm.class_max_nregs (src_class, GET_MODE (op)))
+	    return false;
+
+	  return true;
+	}
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (recog_data.operand_type[i] != OP_OUT && !check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *prev_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+	continue;
+      if (use->only_occurs_in_notes ())
+	continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+	return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+	  && !reload_completed
+	  && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+	      || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+	return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+	return false;
+
+      // Reject cases where one of the uses is a function argument.
+      // The combine attempt should fail anyway, but this is a common
+      // case that is easy to check early.
+      if (use_insn->is_call ()
+	  && HARD_REGISTER_P (SET_DEST (set))
+	  && find_reg_fusage (use_insn->rtl (), USE, SET_DEST (set)))
+	return false;
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (prev_use && prev_use->ebb () == use->ebb ())
+	{
+	  def_info *ultimate_def = look_through_degenerate_phi (def);
+	  if (insn_clobbers_resources (prev_use->insn (),
+				       ultimate_def->insn ()->uses ()))
+	    return false;
+	}
+
+      prev_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+	|| !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worse register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+	{
+	  insn_info *prev = insn->prev_nondebug_insn ();
+	  if (combine_into_uses (insn, prev))
+	    {
+	      // Any instructions that get added to the worklist were
+	      // previously after PREV.  Thus if we were able to move
+	      // an instruction X before PREV during one combination,
+	      // X cannot depend on any instructions that we move before
+	      // PREV during subsequent combinations.  This means that
+	      // the worklist should be free of backwards dependencies,
+	      // even if it isn't necessarily in RPO.
+	      for (unsigned int i = 0; i < m_worklist.length (); ++i)
+		combine_into_uses (m_worklist[i], prev);
+	      m_worklist.truncate (0);
+	      insn = prev;
+	    }
+	}
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make the recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new CC fusion pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 1b1b46455af..915bce88fd6 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -664,6 +664,7 @@ static const struct default_options default_options_table[] =
       VECT_COST_MODEL_VERY_CHEAP },
     { OPT_LEVELS_2_PLUS, OPT_finline_functions, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 
     /* -O2 and above optimizations, but not -Os or -Og.  */
     { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_falign_functions, NULL, 1 },
diff --git a/gcc/passes.def b/gcc/passes.def
index 041229e47a6..13c9dc34ddf 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -493,6 +493,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_late_combine);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
@@ -512,6 +513,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
 	  NEXT_PASS (pass_postreload_cse);
+	  NEXT_PASS (pass_late_combine);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
 	  NEXT_PASS (pass_ree);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index f290b9ccbdc..a95637abbe5 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -25,5 +25,5 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
index 6212c95585d..0690e036eaa 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -30,6 +30,6 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* XFAIL due to PR70681.  */ 
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
index b0c5c61972f..052d2abc2f1 100644
--- a/gcc/testsuite/gcc.dg/stack-check-4.c
+++ b/gcc/testsuite/gcc.dg/stack-check-4.c
@@ -20,7 +20,7 @@
    scan for.   We scan for both the positive and negative cases.  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
+/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
 /* { dg-require-effective-target supports_stack_clash_protection } */
 
 extern void arf (char *);
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
index 4a228b0a1ce..c29a230a771 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target bitint } } */
-/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
 /* { dg-final { check-function-bodies "**" "" "" } } */
 
 #define ALIGN 16
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
index e7f773640f0..13ffbf416ca 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target bitint } } */
-/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
 /* { dg-final { check-function-bodies "**" "" "" } } */
 
 #define ALIGN 8
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
new file mode 100644
index 00000000000..71bcafcb44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
index 0d620a30d5d..b537c6154a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
@@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
 
-/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmov\tz} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
index a294effd4a9..cff806c278d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
index 6541a2ea49d..abf0a2e832f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
index e66477b3bce..401201b315a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
@@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /Z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow zero operands.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
index a491f899088..cbb957bffa4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
@@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index edebb2be245..38902b1b01b 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -615,6 +615,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces
  2024-06-20 13:34 ` [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces Richard Sandiford
@ 2024-06-20 21:22   ` Alex Coplan
  2024-06-21  8:11     ` Richard Sandiford
  2024-06-21 14:40   ` Jeff Law
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Coplan @ 2024-06-20 21:22 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: jlaw, gcc-patches

Hi Richard,

I had a quick look through the patch and noticed a couple of minor typos.
Otherwise looks like a nice cleanup!

On 20/06/2024 14:34, Richard Sandiford wrote:
> rtl-ssa has routines for scanning forwards or backwards for something
> under the control of an exclusion set.  These searches are currently
> used for two main things:
> 
> - to work out where an instruction can be moved within its EBB
> - to work out whether recog can add a new hard register clobber
> 
> The exclusion set was originally a callback function that returned
> true for insns that should be ignored.  However, for the late-combine
> work, I'd also like to be able to skip an entire definition, along
> with all its uses.
> 
> This patch prepares for that by turning the exclusion set into an
> object that provides predicate member functions.  Currently the
> only two member functions are:
> 
> - should_ignore_insn: what the old callback did
> - should_ignore_def: the new functionality
> 
> but more could be added later.
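To make the shape of this interface concrete, here is a minimal, self-contained C++ sketch of what such predicate objects and a scan routine templated on them might look like.  The names (`insn_info`, `ignore_nothing`, `should_ignore_insn`, `should_ignore_def`, `IgnorePredicates`) mirror the description above, but everything below is illustrative stand-in code rather than the real rtl-ssa implementation:

```cpp
#include <vector>

// Illustrative stand-ins for the rtl-ssa types.
struct insn_info { int uid; };
struct def_info { const insn_info *insn; };

// The "ignore nothing" predicate object: every query returns false.
struct ignore_nothing
{
  bool should_ignore_insn (const insn_info *) const { return false; }
  bool should_ignore_def (const def_info *) const { return false; }
};

// A predicate object that skips a fixed set of insns, in the spirit of
// the ignore_changing_insns object mentioned in the changelog.
struct ignore_listed_insns
{
  std::vector<const insn_info *> insns;

  bool should_ignore_insn (const insn_info *insn) const
  {
    for (auto *i : insns)
      if (i == insn)
        return true;
    return false;
  }
  bool should_ignore_def (const def_info *) const { return false; }
};

// A scan routine templated on the predicate object, replacing the old
// single-callback parameter.  It returns the first insn that the
// predicates do not tell us to skip.
template<typename IgnorePredicates>
const insn_info *
first_relevant_insn (const std::vector<insn_info> &insns,
                     IgnorePredicates ignore)
{
  for (const auto &insn : insns)
    if (!ignore.should_ignore_insn (&insn))
      return &insn;
  return nullptr;
}
```

The gain over the callback version is that one argument can answer several different questions (`should_ignore_insn`, `should_ignore_def`, and possibly more later), which a single `bool (*) (insn_info *)` callback could not express.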
> 
> Doing this also makes it easy to remove some assymmetry that I think

s/assymmetry/asymmetry/

> in hindsight was a mistake: in forward scans, ignoring an insn meant
> ignoring all definitions in that insn (ok) and all uses of those
> definitions (non-obvious).  The new interface makes it possible
> to select the required behaviour, with that behaviour being applied
> consistently in both directions.
> 
> Now that the exclusion set is a dedicated object, rather than
> just a "random" function, I think it makes sense to remove the
> _ignoring suffix from the function names.  The suffix was originally
> there to describe the callback, and in particular to emphasise that
> a true return meant "ignore" rather than "heed".
> 
> gcc/
> 	* rtl-ssa.h: Include predicates.h.
> 	* rtl-ssa/predicates.h: New file.
> 	* rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to...
> 	(prev_call_clobbers): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(next_call_clobbers_ignoring): Rename to...
> 	(next_call_clobbers): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(first_nondebug_insn_use_ignoring): Rename to...
> 	(first_nondebug_insn_use): ...this and treat the ignore parameter as
> 	an object with the same interface as ignore_nothing.
> 	(last_nondebug_insn_use_ignoring): Rename to...
> 	(last_nondebug_insn_use): ...this and treat the ignore parameter as
> 	an object with the same interface as ignore_nothing.
> 	(last_access_ignoring): Rename to...
> 	(last_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.  Conditionally skip
> 	definitions.
> 	(prev_access_ignoring): Rename to...
> 	(prev_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.
> 	(first_def_ignoring): Replace with...
> 	(first_access): ...this new function.
> 	(next_access_ignoring): Rename to...
> 	(next_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.  Conditionally skip
> 	definitions.
> 	* rtl-ssa/change-utils.h (insn_is_changing): Delete.
> 	(restrict_movement_ignoring): Rename to...
> 	(restrict_movement): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(recog_ignoring): Rename to...
> 	(recog): ...this and treat the ignore parameter as an object with
> 	the same interface as ignore_nothing.
> 	* rtl-ssa/changes.h (insn_is_changing_closure): Delete.
> 	* rtl-ssa/functions.h (function_info::add_regno_clobber): Treat
> 	the ignore parameter as an object with the same interface as
> 	ignore_nothing.
> 	* rtl-ssa/insn-utils.h (insn_is): Delete.
> 	* rtl-ssa/insns.h (insn_is_closure): Delete.
> 	* rtl-ssa/member-fns.inl
> 	(insn_is_changing_closure::insn_is_changing_closure): Delete.
> 	(insn_is_changing_closure::operator()): Likewise.
> 	(function_info::add_regno_clobber): Treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.
> 	(ignore_changing_insns::ignore_changing_insns): New function.
> 	(ignore_changing_insns::should_ignore_insn): Likewise.
> 	* rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat
> 	the ignore parameter as an object with the same interface as
> 	ignore_nothing.
> 	(restrict_movement_for_defs_ignoring): Rename to...
> 	(restrict_movement_for_defs): ...this and treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.
> 	(restrict_movement_for_uses_ignoring): Rename to...
> 	(restrict_movement_for_uses): ...this and treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.  Conditionally
> 	skip definitions.
> 	* doc/rtl.texi: Update for above name changes.  Use
> 	ignore_changing_insns instead of insn_is_changing.
> 	* config/aarch64/aarch64-cc-fusion.cc (cc_fusion::parallelize_insns):
> 	Likewise.
> 	* pair-fusion.cc (no_ignore): Delete.
> 	(latest_hazard_before, first_hazard_after): Update for above name
> 	changes.  Use ignore_nothing instead of no_ignore.
> 	(pair_fusion_bb_info::fuse_pair): Update for above name changes.
> 	Use ignore_changing_insns instead of insn_is_changing.
> 	(pair_fusion::try_promote_writeback): Likewise.
> ---
>  gcc/config/aarch64/aarch64-cc-fusion.cc |   4 +-
>  gcc/doc/rtl.texi                        |  14 +--
>  gcc/pair-fusion.cc                      |  34 +++---
>  gcc/rtl-ssa.h                           |   1 +
>  gcc/rtl-ssa/access-utils.h              | 145 +++++++++++++-----------
>  gcc/rtl-ssa/change-utils.h              |  67 +++++------
>  gcc/rtl-ssa/changes.h                   |  13 ---
>  gcc/rtl-ssa/functions.h                 |  16 ++-
>  gcc/rtl-ssa/insn-utils.h                |   8 --
>  gcc/rtl-ssa/insns.h                     |  12 --
>  gcc/rtl-ssa/member-fns.inl              |  35 +++---
>  gcc/rtl-ssa/movement.h                  | 118 +++++++++----------
>  gcc/rtl-ssa/predicates.h                |  58 ++++++++++
>  13 files changed, 275 insertions(+), 250 deletions(-)
>  create mode 100644 gcc/rtl-ssa/predicates.h
> 
<snip>
> diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
> index f5aca643beb..479c6992e97 100644
> --- a/gcc/rtl-ssa/functions.h
> +++ b/gcc/rtl-ssa/functions.h
> @@ -165,16 +165,22 @@ public:
>  
>    // If CHANGE doesn't already clobber REGNO, try to add such a clobber,
>    // limiting the movement range in order to make the clobber valid.
> -  // When determining whether REGNO is live, ignore accesses made by an
> -  // instruction I if IGNORE (I) is true.  The caller then assumes the
> -  // responsibility of ensuring that CHANGE and I are placed in a valid order.
> +  // Use IGNORE to guide this process, where IGNORE is an object that
> +  // provides the same interface as ignore_nothing.
> +  //
> +  // That is, when determining whether REGNO is live, ignore accesses made
> +  // by an instruction I if IGNORE says that I should be ignored.  The caller
> +  // then assumes the responsibility of ensuring that CHANGE and I are placed
> +  // in a valid order.  Similarly, ignore live ranges associated/ with a

Stray '/' after associated?

Thanks,
Alex

> +  // definition of REGNO if IGNORE says that that definition should be
> +  // ignored.
>    //
>    // Return true on success.  Leave CHANGE unmodified when returning false.
>    //
>    // WATERMARK is a watermark returned by new_change_attempt ().
> -  template<typename IgnorePredicate>
> +  template<typename IgnorePredicates>
>    bool add_regno_clobber (obstack_watermark &watermark, insn_change &change,
> -			  unsigned int regno, IgnorePredicate ignore);
> +			  unsigned int regno, IgnorePredicates ignore);
>  
>    // Return true if change_insns will be able to perform the changes
>    // described by CHANGES.
<snip>


* Re: [PATCH 4/6] sh: Make *minus_plus_one work after RA
  2024-06-20 13:34 ` [PATCH 4/6] sh: Make *minus_plus_one work after RA Richard Sandiford
@ 2024-06-21  0:15   ` Oleg Endo
  0 siblings, 0 replies; 36+ messages in thread
From: Oleg Endo @ 2024-06-21  0:15 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches


On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> *minus_plus_one had no constraints, which meant that it could be
> matched after RA with operands 0, 1 and 2 all being different.
> The associated split instead requires operand 0 to be tied to
> operand 1.

Thanks for spotting this.  Makes sense, please install.

Best regards,
Oleg Endo

> 
> gcc/
> 	* config/sh/sh.md (*minus_plus_one): Add constraints.
> ---
>  gcc/config/sh/sh.md | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md
> index 92a1efeb811..9491b49e55b 100644
> --- a/gcc/config/sh/sh.md
> +++ b/gcc/config/sh/sh.md
> @@ -1642,9 +1642,9 @@ (define_insn_and_split "*addc"
>  ;; matched.  Split this up into a simple sub add sequence, as this will save
>  ;; us one sett insn.
>  (define_insn_and_split "*minus_plus_one"
> -  [(set (match_operand:SI 0 "arith_reg_dest" "")
> -	(plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "")
> -			   (match_operand:SI 2 "arith_reg_operand" ""))
> +  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
> +	(plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0")
> +			   (match_operand:SI 2 "arith_reg_operand" "r"))
>  		 (const_int 1)))]
>    "TARGET_SH1"
>    "#"
> -- 
> 2.25.1
> 

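As background on why adding the constraints fixes the bug: in a machine description, a matching constraint `"0"` on an input operand forces the register allocator to put it in the same register as operand 0, which is exactly what the split for this pattern assumes.  A minimal illustrative pattern (hypothetical, not from the sh port) showing the idiom:

```lisp
;; Hypothetical example of a matching constraint: "=r" makes operand 0
;; a written register, and the "0" on operand 1 ties it to the same
;; register as operand 0, so a two-address split remains valid after RA.
(define_insn "*add_tied"
  [(set (match_operand:SI 0 "register_operand" "=r")
	(plus:SI (match_operand:SI 1 "register_operand" "0")
		 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\t%0, %2")
```

Without any constraints, as in the original `*minus_plus_one`, recog can match the pattern after RA with all three operands in different registers, invalidating that assumption.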

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
@ 2024-06-21  0:17   ` Oleg Endo
  2024-06-21  8:09     ` Richard Sandiford
  2024-06-21  5:54   ` Richard Biener
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Oleg Endo @ 2024-06-21  0:17 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches


On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> 
> I tried compiling at least one target per CPU directory and comparing
> the assembly output for parts of the GCC testsuite.  This is just a way
> of getting a flavour of how the pass performs; it obviously isn't a
> meaningful benchmark.  All targets seemed to improve on average:
> 
> Target                 Tests   Good    Bad   %Good   Delta  Median
> ======                 =====   ====    ===   =====   =====  ======
> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
> arc-elf                 2166   1932    234  89.20%  -37742      -1
> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
> avr-elf                 4789   4330    459  90.42% -441276      -4
> bfin-elf                2795   2394    401  85.65%  -19252      -1
> bpf-elf                 3122   2928    194  93.79%   -8785      -1
> c6x-elf                 2227   1929    298  86.62%  -17339      -1
> cris-elf                3464   3270    194  94.40%  -23263      -2
> csky-elf                2915   2591    324  88.89%  -22146      -1
> epiphany-elf            2399   2304     95  96.04%  -28698      -2
> fr30-elf                7712   7299    413  94.64%  -99830      -2
> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
> ft32-elf                2775   2667    108  96.11%  -25029      -1
> h8300-elf               3176   2862    314  90.11%  -29305      -2
> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
> iq2000-elf              9684   9637     47  99.51% -126557      -2
> lm32-elf                2681   2608     73  97.28%  -59884      -3
> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
> m32r-elf                1626   1517    109  93.30%   -9323      -2
> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
> mcore-elf               2315   2085    230  90.06%  -24160      -1
> microblaze-elf          2782   2585    197  92.92%  -16530      -1
> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
> mmix                    4914   4814    100  97.96%  -63021      -1
> mn10300-elf             3639   3320    319  91.23%  -34752      -2
> moxie-rtems             3497   3252    245  92.99%  -87305      -3
> msp430-elf              4353   3876    477  89.04%  -23780      -1
> nds32le-elf             3042   2780    262  91.39%  -27320      -1
> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
> nvptx-none              2114   1781    333  84.25%  -12589      -2
> or1k-elf                3045   2699    346  88.64%  -14328      -2
> pdp11                   4515   4146    369  91.83%  -26047      -2
> pru-elf                 1585   1245    340  78.55%   -5225      -1
> riscv32-elf             2122   2000    122  94.25% -101162      -2
> riscv64-elf             1841   1726    115  93.75%  -49997      -2
> rl78-elf                2823   2530    293  89.62%  -40742      -4
> rx-elf                  2614   2480    134  94.87%  -18863      -1
> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
> v850-elf                1897   1728    169  91.09%  -11078      -1
> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
> visium-elf              1392   1106    286  79.45%   -7984      -2
> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
> 
> 

Since you have already briefly compared some of the code, can you share
the cases that get worse and might require follow-up patches?

Best regards,
Oleg Endo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
  2024-06-21  0:17   ` Oleg Endo
@ 2024-06-21  5:54   ` Richard Biener
  2024-06-21  8:21     ` Richard Sandiford
  2024-06-21 15:00   ` Jeff Law
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Richard Biener @ 2024-06-21  5:54 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: jlaw, gcc-patches

On Thu, Jun 20, 2024 at 3:37 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
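
As a toy illustration of the "substitute the definition into all uses,
then delete it" objective (plain C++, nothing GCC-specific; operand
matching is done by naive text search, which a real pass obviously
wouldn't do):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One "instruction": dest = expr, where expr may mention earlier dests.
struct insn { std::string dest, expr; };

// Replace every occurrence of NAME in EXPR with (VALUE).  A real pass
// would match whole operands; plain text search is enough for a toy.
static std::string
substitute (std::string expr, const std::string &name,
	    const std::string &value)
{
  std::string repl = "(" + value + ")";
  for (std::string::size_type pos = 0;
       (pos = expr.find (name, pos)) != std::string::npos; )
    {
      expr.replace (pos, name.size (), repl);
      pos += repl.size ();
    }
  return expr;
}

// Delete INSNS[I] after substituting its value into all later uses.
static void
combine_def (std::vector<insn> &insns, std::size_t i)
{
  const insn def = insns[i];
  for (std::size_t j = i + 1; j < insns.size (); ++j)
    insns[j].expr = substitute (insns[j].expr, def.dest, def.expr);
  insns.erase (insns.begin () + i);
}
```

The real pass of course works on RTL through rtl-ssa use/def chains,
revalidates each changed instruction, and applies the register-pressure
checks described above before RA; the sketch only shows the
def-into-all-uses shape.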
>
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.
>
> This is just a first step.  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
>
> On most targets, the pass is enabled by default at -O2 and above.
> However, it has a tendency to undo x86's STV and RPAD passes,
> by folding the more complex post-STV/RPAD form back into the
> simpler pre-pass form.
>
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.
>
> xtensa has a variation in which the split condition is:
>
>     "&& can_create_pseudo_p ()"
>
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.
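
Schematically, the xtensa-style shape is (a made-up pattern, just to
show the failure mode):

```lisp
(define_insn_and_split "*example"
  [...pattern...]
  "...cond..."
  "#"
  "&& can_create_pseudo_p ()"  ;; false once RA has completed, so a
                               ;; "#" insn matched after RA never splits
  [...pattern...]
{
  ...split code that calls gen_reg_rtx ()...;
})
```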
>
> The patch therefore disables the pass by default on i386, rs6000
> and xtensa.  Hopefully we can fix those ports later (if their
> maintainers want).  It seems easier to add the pass first, though,
> to make it easier to test any such fixes.
>
> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> quite a few updates for the late-combine output.  That might be
> worth doing, but it seems too complex to do as part of this patch.
>
> I tried compiling at least one target per CPU directory and comparing
> the assembly output for parts of the GCC testsuite.  This is just a way
> of getting a flavour of how the pass performs; it obviously isn't a
> meaningful benchmark.  All targets seemed to improve on average:
>
> Target                 Tests   Good    Bad   %Good   Delta  Median
> ======                 =====   ====    ===   =====   =====  ======
> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
> arc-elf                 2166   1932    234  89.20%  -37742      -1
> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
> avr-elf                 4789   4330    459  90.42% -441276      -4
> bfin-elf                2795   2394    401  85.65%  -19252      -1
> bpf-elf                 3122   2928    194  93.79%   -8785      -1
> c6x-elf                 2227   1929    298  86.62%  -17339      -1
> cris-elf                3464   3270    194  94.40%  -23263      -2
> csky-elf                2915   2591    324  88.89%  -22146      -1
> epiphany-elf            2399   2304     95  96.04%  -28698      -2
> fr30-elf                7712   7299    413  94.64%  -99830      -2
> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
> ft32-elf                2775   2667    108  96.11%  -25029      -1
> h8300-elf               3176   2862    314  90.11%  -29305      -2
> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
> iq2000-elf              9684   9637     47  99.51% -126557      -2
> lm32-elf                2681   2608     73  97.28%  -59884      -3
> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
> m32r-elf                1626   1517    109  93.30%   -9323      -2
> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
> mcore-elf               2315   2085    230  90.06%  -24160      -1
> microblaze-elf          2782   2585    197  92.92%  -16530      -1
> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
> mmix                    4914   4814    100  97.96%  -63021      -1
> mn10300-elf             3639   3320    319  91.23%  -34752      -2
> moxie-rtems             3497   3252    245  92.99%  -87305      -3
> msp430-elf              4353   3876    477  89.04%  -23780      -1
> nds32le-elf             3042   2780    262  91.39%  -27320      -1
> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
> nvptx-none              2114   1781    333  84.25%  -12589      -2
> or1k-elf                3045   2699    346  88.64%  -14328      -2
> pdp11                   4515   4146    369  91.83%  -26047      -2
> pru-elf                 1585   1245    340  78.55%   -5225      -1
> riscv32-elf             2122   2000    122  94.25% -101162      -2
> riscv64-elf             1841   1726    115  93.75%  -49997      -2
> rl78-elf                2823   2530    293  89.62%  -40742      -4
> rx-elf                  2614   2480    134  94.87%  -18863      -1
> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
> v850-elf                1897   1728    169  91.09%  -11078      -1
> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
> visium-elf              1392   1106    286  79.45%   -7984      -2
> xstormy16-elf           2577   2071    506  80.36%  -13061      -1

I wonder if you can amend doc/passes.texi, specifically noting differences
between fwprop, combine and late-combine?
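
If it helps, a possible skeleton for such a note (hypothetical wording,
not proposed text):

```texinfo
@item Late instruction combination

This pass tries to delete an instruction by substituting its
single-set definition into all uses of the definition.  It differs
from forward propagation (@code{fwprop}), which can substitute into
individual uses while keeping the definition, and from the main
@code{combine} pass, which runs only before register allocation.
There are two instances of the late pass: one before register
allocation, which restricts itself to changes that are unlikely to
increase register pressure, and one after.
```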

> gcc/
>         PR rtl-optimization/106594
>         * Makefile.in (OBJS): Add late-combine.o.
>         * common.opt (flate-combine-instructions): New option.
>         * doc/invoke.texi: Document it.
>         * opts.cc (default_options_table): Enable it by default at -O2
>         and above.
>         * tree-pass.h (make_pass_late_combine): Declare.
>         * late-combine.cc: New file.
>         * passes.def: Add two instances of late_combine.
>         * config/i386/i386-options.cc (ix86_override_options_after_change):
>         Disable late-combine by default.
>         * config/rs6000/rs6000.cc (rs6000_option_override_internal): Likewise.
>         * config/xtensa/xtensa.cc (xtensa_option_override): Likewise.
>
> gcc/testsuite/
>         PR rtl-optimization/106594
>         * gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
>         targets.
>         * gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
>         * gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
>         * gcc.target/aarch64/bitfield-bitint-abi-align16.c: Add
>         -fno-late-combine-instructions.
>         * gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
>         * gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
>         * gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
>         * gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
>         * gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
>         described in the comment.
>         * gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
>         * gcc.target/aarch64/pr106594_1.c: New test.
> ---
>  gcc/Makefile.in                               |   1 +
>  gcc/common.opt                                |   5 +
>  gcc/config/i386/i386-options.cc               |   4 +
>  gcc/config/rs6000/rs6000.cc                   |   8 +
>  gcc/config/xtensa/xtensa.cc                   |  11 +
>  gcc/doc/invoke.texi                           |  11 +-
>  gcc/late-combine.cc                           | 747 ++++++++++++++++++
>  gcc/opts.cc                                   |   1 +
>  gcc/passes.def                                |   2 +
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
>  gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
>  .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
>  .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
>  gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
>  .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
>  .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
>  .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
>  gcc/tree-pass.h                               |   1 +
>  21 files changed, 834 insertions(+), 37 deletions(-)
>  create mode 100644 gcc/late-combine.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index f5adb647d3f..5e29ddb5690 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1574,6 +1574,7 @@ OBJS = \
>         ira-lives.o \
>         jump.o \
>         langhooks.o \
> +       late-combine.o \
>         lcm.o \
>         lists.o \
>         loop-doloop.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index f2bc47fdc5e..327230967ea 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1796,6 +1796,11 @@ Common Var(flag_large_source_files) Init(0)
>  Improve GCC's ability to track column numbers in large source files,
>  at the expense of slower compilation.
>
> +flate-combine-instructions
> +Common Var(flag_late_combine_instructions) Optimization Init(0)
> +Run two instruction combination passes late in the pass pipeline;
> +one before register allocation and one after.
> +
>  floop-parallelize-all
>  Common Var(flag_loop_parallelize_all) Optimization
>  Mark all loops as parallel.
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index f2cecc0e254..4620bf8e9e6 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>         flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>      }
>
> +  /* Late combine tends to undo some of the effects of STV and RPAD,
> +     by combining instructions back to their original form.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }
>
>  /* Clear stack slot assignments remembered from previous functions.
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index e4dc629ddcc..f39b8909925 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>         targetm.expand_builtin_va_start = NULL;
>      }
>
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
> +     until the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
> +
>    rs6000_override_options_after_change ();
>
>    /* If not explicitly specified via option, decide whether to generate indexed
> diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
> index 45dc1be3ff5..308dc62e0f8 100644
> --- a/gcc/config/xtensa/xtensa.cc
> +++ b/gcc/config/xtensa/xtensa.cc
> @@ -59,6 +59,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "print-rtl.h"
>  #include <math.h>
> +#include "opts.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -2916,6 +2917,16 @@ xtensa_option_override (void)
>        flag_reorder_blocks_and_partition = 0;
>        flag_reorder_blocks = 1;
>      }
> +
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     require the split to take place, but have a split condition of
> +     can_create_pseudo_p, and so matching after RA will give an
> +     unsplittable instruction.  Disable late-combine by default until
> +     the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }
>
>  /* Implement TARGET_HARD_REGNO_NREGS.  */
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 5d7a87fde86..3b8c427d509 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -575,7 +575,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
>  -fipa-reference  -fipa-reference-addressable
>  -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
> --flive-patching=@var{level}
> +-flate-combine-instructions  -flive-patching=@var{level}
>  -fira-region=@var{region}  -fira-hoist-pressure
>  -fira-loop-pressure  -fno-ira-share-save-slots
>  -fno-ira-share-spill-slots
> @@ -13675,6 +13675,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
>
>  This flag is enabled by default at @option{-O2} and @option{-Os}.
>
> +@opindex flate-combine-instructions
> +@item -flate-combine-instructions
> +Enable two instruction combination passes that run relatively late in the
> +compilation process.  One of the passes runs before register allocation and
> +the other after register allocation.  The main aim of the passes is to
> +substitute definitions into all uses.
> +
> +Most targets enable this flag by default at @option{-O2} and @option{-Os}.
> +
>  @opindex flive-patching
>  @item -flive-patching=@var{level}
>  Control GCC's optimizations to produce output suitable for live-patching.
> diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
> new file mode 100644
> index 00000000000..22a1d81d38e
> --- /dev/null
> +++ b/gcc/late-combine.cc
> @@ -0,0 +1,747 @@
> +// Late-stage instruction combination pass.
> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it under
> +// the terms of the GNU General Public License as published by the Free
> +// Software Foundation; either version 3, or (at your option) any later
> +// version.
> +//
> +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +// WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +// for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +// The current purpose of this pass is to substitute definitions into
> +// all uses, so that the definition can be removed.  However, it could
> +// be extended to handle other combination-related optimizations in future.
> +//
> +// The pass can run before or after register allocation.  When running
> +// before register allocation, it tries to avoid cases that are likely
> +// to increase register pressure.  For the same reason, it avoids moving
> +// instructions around, even if doing so would allow an optimization to
> +// succeed.  These limitations are removed when running after register
> +// allocation.
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "print-rtl.h"
> +#include "tree-pass.h"
> +#include "cfgcleanup.h"
> +#include "target.h"
> +
> +using namespace rtl_ssa;
> +
> +namespace {
> +const pass_data pass_data_late_combine =
> +{
> +  RTL_PASS, // type
> +  "late_combine", // name
> +  OPTGROUP_NONE, // optinfo_flags
> +  TV_NONE, // tv_id
> +  0, // properties_required
> +  0, // properties_provided
> +  0, // properties_destroyed
> +  0, // todo_flags_start
> +  TODO_df_finish, // todo_flags_finish
> +};
> +
> +// Represents an attempt to substitute a single-set definition into all
> +// uses of the definition.
> +class insn_combination
> +{
> +public:
> +  insn_combination (set_info *, rtx, rtx);
> +  bool run ();
> +  array_slice<insn_change *const> use_changes () const;
> +
> +private:
> +  use_array get_new_uses (use_info *);
> +  bool substitute_nondebug_use (use_info *);
> +  bool substitute_nondebug_uses (set_info *);
> +  bool try_to_preserve_debug_info (insn_change &, use_info *);
> +  void substitute_debug_use (use_info *);
> +  bool substitute_note (insn_info *, rtx, bool);
> +  void substitute_notes (insn_info *, bool);
> +  void substitute_note_uses (use_info *);
> +  void substitute_optional_uses (set_info *);
> +
> +  // Represents the state of the function's RTL at the start of this
> +  // combination attempt.
> +  insn_change_watermark m_rtl_watermark;
> +
> +  // Represents the rtl-ssa state at the start of this combination attempt.
> +  obstack_watermark m_attempt;
> +
> +  // The instruction that contains the definition, and that we're trying
> +  // to delete.
> +  insn_info *m_def_insn;
> +
> +  // The definition itself.
> +  set_info *m_def;
> +
> +  // The destination and source of the single set that defines m_def.
> +  // The destination is known to be a plain REG.
> +  rtx m_dest;
> +  rtx m_src;
> +
> +  // Contains the full list of changes that we want to make, in reverse
> +  // postorder.
> +  auto_vec<insn_change *> m_nondebug_changes;
> +};
> +
> +// Class that represents one run of the pass.
> +class late_combine
> +{
> +public:
> +  unsigned int execute (function *);
> +
> +private:
> +  rtx optimizable_set (insn_info *);
> +  bool check_register_pressure (insn_info *, rtx);
> +  bool check_uses (set_info *, rtx);
> +  bool combine_into_uses (insn_info *, insn_info *);
> +
> +  auto_vec<insn_info *> m_worklist;
> +};
> +
> +insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
> +  : m_rtl_watermark (),
> +    m_attempt (crtl->ssa->new_change_attempt ()),
> +    m_def_insn (def->insn ()),
> +    m_def (def),
> +    m_dest (dest),
> +    m_src (src),
> +    m_nondebug_changes ()
> +{
> +}
> +
> +array_slice<insn_change *const>
> +insn_combination::use_changes () const
> +{
> +  return { m_nondebug_changes.address () + 1,
> +          m_nondebug_changes.length () - 1 };
> +}
> +
> +// USE is a direct or indirect use of m_def.  Return the list of uses
> +// that would be needed after substituting m_def into the instruction.
> +// The returned list is marked as invalid if USE's insn and m_def_insn
> +// use different definitions for the same resource (register or memory).
> +use_array
> +insn_combination::get_new_uses (use_info *use)
> +{
> +  auto *def = use->def ();
> +  auto *use_insn = use->insn ();
> +
> +  use_array new_uses = use_insn->uses ();
> +  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
> +  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
> +  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
> +    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
> +                                              use_insn->is_debug_insn ());
> +  return new_uses;
> +}
> +
> +// Start the process of trying to replace USE by substitution, given that
> +// USE occurs in a non-debug instruction.  Check:
> +//
> +// - that the substitution can be represented in RTL
> +//
> +// - that each use of a resource (register or memory) within the new
> +//   instruction has a consistent definition
> +//
> +// - that the new instruction is a recognized pattern
> +//
> +// - that the instruction can be placed somewhere that makes all definitions
> +//   and uses valid, and that permits any new hard-register clobbers added
> +//   during the recognition process
> +//
> +// Return true on success.
> +bool
> +insn_combination::substitute_nondebug_use (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    dump_insn_slim (dump_file, use->insn ()->rtl ());
> +
> +  // Check that we can change the instruction pattern.  Leave recognition
> +  // of the result till later.
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
> +      || prop.num_replacements == 0)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- RTL substitution failed\n");
> +      return false;
> +    }
> +
> +  use_array new_uses = get_new_uses (use);
> +  if (!new_uses.is_valid ())
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- could not prove that all sources"
> +                " are available\n");
> +      return false;
> +    }
> +
> +  // Create a tentative change for the use.
> +  auto *where = XOBNEW (m_attempt, insn_change);
> +  auto *use_change = new (where) insn_change (use_insn);
> +  m_nondebug_changes.safe_push (use_change);
> +  use_change->new_uses = new_uses;
> +
> +  struct local_ignore : ignore_nothing
> +  {
> +    local_ignore (const set_info *def, const insn_info *use_insn)
> +      : m_def (def), m_use_insn (use_insn) {}
> +
> +    // We don't limit the number of insns per optimization, so ignoring all
> +    // insns for all insns would lead to quadratic complexity.  Just ignore
> +    // the use and definition, which should be enough for most purposes.
> +    bool
> +    should_ignore_insn (const insn_info *insn)
> +    {
> +      return insn == m_def->insn () || insn == m_use_insn;
> +    }
> +
> +    // Ignore the definition that we're removing, and all uses of it.
> +    bool should_ignore_def (const def_info *def) { return def == m_def; }
> +
> +    const set_info *m_def;
> +    const insn_info *m_use_insn;
> +  };
> +
> +  auto ignore = local_ignore (m_def, use_insn);
> +
> +  // Moving instructions before register allocation could increase
> +  // register pressure.  Only try moving them after RA.
> +  if (reload_completed && can_move_insn_p (use_insn))
> +    use_change->move_range = { use_insn->bb ()->head_insn (),
> +                              use_insn->ebb ()->last_bb ()->end_insn () };
> +  if (!restrict_movement (*use_change, ignore))
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- cannot satisfy all definitions and uses"
> +                " in insn %d\n", INSN_UID (use_insn->rtl ()));
> +      return false;
> +    }
> +
> +  if (!recog (m_attempt, *use_change, ignore))
> +    return false;
> +
> +  return true;
> +}
> +
> +// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
> +// There will be at most one level of indirection.
> +bool
> +insn_combination::substitute_nondebug_uses (set_info *def)
> +{
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    if (!use->is_live_out_use ()
> +       && !use->only_occurs_in_notes ()
> +       && !substitute_nondebug_use (use))
> +      return false;
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!substitute_nondebug_uses (use->phi ()))
> +      return false;
> +
> +  return true;
> +}
> +
> +// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
> +// substitute the definition into the instruction and try to describe
> +// the result in USE_CHANGE.  Return true on success.  Failure means that
> +// the instruction must be reset instead.
> +bool
> +insn_combination::try_to_preserve_debug_info (insn_change &use_change,
> +                                             use_info *use)
> +{
> +  // Punt on unsimplified subregs of hard registers.  In that case,
> +  // propagation can succeed and create a wider reg than the one we
> +  // started with.
> +  if (HARD_REGISTER_NUM_P (use->regno ())
> +      && use->includes_subregs ())
> +    return false;
> +
> +  insn_info *use_insn = use_change.insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  use_change.new_uses = get_new_uses (use);
> +  if (!use_change.new_uses.is_valid ()
> +      || !restrict_movement (use_change))
> +    return false;
> +
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
> +}
> +
> +// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
> +// the fact that m_def is going to disappear.  Try to preserve the source
> +// value if possible, but reset the instruction if not.
> +void
> +insn_combination::substitute_debug_use (use_info *use)
> +{
> +  auto *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  auto use_change = insn_change (use_insn);
> +  if (!try_to_preserve_debug_info (use_change, use))
> +    {
> +      use_change.new_uses = {};
> +      use_change.move_range = use_change.insn ();
> +      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
> +    }
> +  insn_change *changes[] = { &use_change };
> +  crtl->ssa->change_insns (changes);
> +}
> +
> +// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
> +// the note to reflect the fact that m_def is going to disappear.  Return
> +// true on success, or false if the note must be deleted.
> +//
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
> +bool
> +insn_combination::substitute_note (insn_info *use_insn, rtx note,
> +                                  bool can_propagate)
> +{
> +  if (REG_NOTE_KIND (note) == REG_EQUAL
> +      || REG_NOTE_KIND (note) == REG_EQUIV)
> +    {
> +      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
> +      return (prop.apply_to_rvalue (&XEXP (note, 0))
> +             && (can_propagate || prop.num_replacements == 0));
> +    }
> +  return true;
> +}
> +
> +// Update USE_INSN's notes after deciding to go ahead with the optimization.
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
> +void
> +insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
> +{
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +  rtx *ptr = &REG_NOTES (use_rtl);
> +  while (rtx note = *ptr)
> +    {
> +      if (substitute_note (use_insn, note, can_propagate))
> +       ptr = &XEXP (note, 1);
> +      else
> +       *ptr = XEXP (note, 1);
> +    }
> +}
> +
> +// We've decided to go ahead with the substitution.  Update all REG_NOTES
> +// involving USE.
> +void
> +insn_combination::substitute_note_uses (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +
> +  bool can_propagate = true;
> +  if (use->only_occurs_in_notes ())
> +    {
> +      // The only uses are in notes.  Try to keep the note if we can,
> +      // but removing it is better than aborting the optimization.
> +      insn_change use_change (use_insn);
> +      use_change.new_uses = get_new_uses (use);
> +      if (!use_change.new_uses.is_valid ()
> +         || !restrict_movement (use_change))
> +       {
> +         use_change.move_range = use_insn;
> +         use_change.new_uses = remove_uses_of_def (m_attempt,
> +                                                   use_insn->uses (),
> +                                                   use->def ());
> +         can_propagate = false;
> +       }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "%s notes in:\n",
> +                  can_propagate ? "updating" : "removing");
> +         dump_insn_slim (dump_file, use_insn->rtl ());
> +       }
> +      substitute_notes (use_insn, can_propagate);
> +      insn_change *changes[] = { &use_change };
> +      crtl->ssa->change_insns (changes);
> +    }
> +  else
> +    // We've already decided to update the insn's pattern and know that m_src
> +    // will be available at the insn's new location.  Now update its notes.
> +    substitute_notes (use_insn, can_propagate);
> +}
> +
> +// We've decided to go ahead with the substitution and we've dealt with
> +// all uses that occur in the patterns of non-debug insns.  Update all
> +// other uses for the fact that m_def is about to disappear.
> +void
> +insn_combination::substitute_optional_uses (set_info *def)
> +{
> +  if (auto insn_uses = def->all_insn_uses ())
> +    {
> +      use_info *use = *insn_uses.begin ();
> +      while (use)
> +       {
> +         use_info *next_use = use->next_any_insn_use ();
> +         if (use->is_in_debug_insn ())
> +           substitute_debug_use (use);
> +         else if (!use->is_live_out_use ())
> +           substitute_note_uses (use);
> +         use = next_use;
> +       }
> +    }
> +  for (use_info *use : def->phi_uses ())
> +    substitute_optional_uses (use->phi ());
> +}
> +
> +// Try to perform the substitution.  Return true on success.
> +bool
> +insn_combination::run ()
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
> +              m_def->regno ());
> +      dump_insn_slim (dump_file, m_def_insn->rtl ());
> +      fprintf (dump_file, "into:\n");
> +    }
> +
> +  auto def_change = insn_change::delete_insn (m_def_insn);
> +  m_nondebug_changes.safe_push (&def_change);
> +
> +  if (!substitute_nondebug_uses (m_def)
> +      || !changes_are_worthwhile (m_nondebug_changes)
> +      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
> +    return false;
> +
> +  substitute_optional_uses (m_def);
> +
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (m_nondebug_changes);
> +  return true;
> +}
> +
> +// See whether INSN is a single_set that we can optimize.  Return the
> +// set if so, otherwise return null.
> +rtx
> +late_combine::optimizable_set (insn_info *insn)
> +{
> +  if (!insn->can_be_optimized ()
> +      || insn->is_asm ()
> +      || insn->is_call ()
> +      || insn->has_volatile_refs ()
> +      || insn->has_pre_post_modify ()
> +      || !can_move_insn_p (insn))
> +    return NULL_RTX;
> +
> +  return single_set (insn->rtl ());
> +}
> +
> +// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
> +// where SET occurs in INSN.  Return true if doing so is not likely to
> +// increase register pressure.
> +bool
> +late_combine::check_register_pressure (insn_info *insn, rtx set)
> +{
> +  // Plain register-to-register moves do not establish a register class
> +  // preference and have no well-defined effect on the register allocator.
> +  // If changes in register class are needed, the register allocator is
> +  // in the best position to place those changes.  If no change in
> +  // register class is needed, then the optimization reduces register
> +  // pressure if SET_SRC (set) was already live at uses, otherwise the
> +  // optimization is pressure-neutral.
> +  rtx src = SET_SRC (set);
> +  if (REG_P (src))
> +    return true;
> +
> +  // On the same basis, substituting a SET_SRC that contains a single
> +  // pseudo register either reduces pressure or is pressure-neutral,
> +  // subject to the constraints below.  We would need to do more
> +  // analysis for SET_SRCs that use more than one pseudo register.
> +  unsigned int nregs = 0;
> +  for (auto *use : insn->uses ())
> +    if (use->is_reg ()
> +       && !HARD_REGISTER_NUM_P (use->regno ())
> +       && !use->only_occurs_in_notes ())
> +      if (++nregs > 1)
> +       return false;
> +
> +  // If there are no pseudo registers in SET_SRC then the optimization
> +  // should improve register pressure.
> +  if (nregs == 0)
> +    return true;
> +
> +  // We'd be substituting (set (reg R1) SRC) where SRC is known to
> +  // contain a single pseudo register R2.  Assume for simplicity that
> +  // each new use of R2 would need to be in the same class C as the
> +  // current use of R2.  If, for a realistic allocation, C is a
> +  // non-strict superset of R1's register class, the effect on
> +  // register pressure should be positive or neutral.  If instead
> +  // R1 occupies a different register class from R2, or if R1 has
> +  // more allocation freedom than R2, then there's a higher risk that
> +  // the effect on register pressure could be negative.
> +  //
> +  // First use constrain_operands to get the most likely choice of
> +  // alternative.  For simplicity, just handle the case where the
> +  // output operand is operand 0.
> +  extract_insn (insn->rtl ());
> +  rtx dest = SET_DEST (set);
> +  if (recog_data.n_operands == 0
> +      || recog_data.operand[0] != dest)
> +    return false;
> +
> +  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
> +    return false;
> +
> +  preprocess_constraints (insn->rtl ());
> +  auto *alt = which_op_alt ();
> +  auto dest_class = alt[0].cl;
> +
> +  // Check operands 1 and above.
> +  auto check_src = [&] (unsigned int i)
> +    {
> +      if (recog_data.is_operator[i])
> +       return true;
> +
> +      rtx op = recog_data.operand[i];
> +      if (CONSTANT_P (op))
> +       return true;
> +
> +      if (SUBREG_P (op))
> +       op = SUBREG_REG (op);
> +      if (REG_P (op))
> +       {
> +         // Ignore hard registers.  We've already rejected uses of non-fixed
> +         // hard registers in the SET_SRC.
> +         if (HARD_REGISTER_P (op))
> +           return true;
> +
> +         // Make sure that the source operand's class is at least as
> +         // permissive as the destination operand's class.
> +         auto src_class = alternative_class (alt, i);
> +         if (!reg_class_subset_p (dest_class, src_class))
> +           return false;
> +
> +         // Make sure that the source operand occupies no more hard
> +         // registers than the destination operand.  This mostly matters
> +         // for subregs.
> +         if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
> +             < targetm.class_max_nregs (src_class, GET_MODE (op)))
> +           return false;
> +
> +         return true;
> +       }
> +      return false;
> +    };
> +  for (int i = 1; i < recog_data.n_operands; ++i)
> +    if (recog_data.operand_type[i] != OP_OUT && !check_src (i))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Check uses of DEF to see whether there is anything obvious that
> +// prevents the substitution of SET into uses of DEF.
> +bool
> +late_combine::check_uses (set_info *def, rtx set)
> +{
> +  use_info *prev_use = nullptr;
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    {
> +      insn_info *use_insn = use->insn ();
> +
> +      if (use->is_live_out_use ())
> +       continue;
> +      if (use->only_occurs_in_notes ())
> +       continue;
> +
> +      // We cannot replace all uses if the value is live on exit.
> +      if (use->is_artificial ())
> +       return false;
> +
> +      // Avoid increasing the complexity of instructions that
> +      // reference allocatable hard registers.
> +      if (!REG_P (SET_SRC (set))
> +         && !reload_completed
> +         && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
> +             || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
> +       return false;
> +
> +      // Don't substitute into a non-local goto, since it can then be
> +      // treated as a jump to a local label, e.g. in shorten_branches.
> +      // ??? But this shouldn't be necessary.
> +      if (use_insn->is_jump ()
> +         && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> +       return false;
> +
> +      // Reject cases where one of the uses is a function argument.
> +      // The combine attempt should fail anyway, but this is a common
> +      // case that is easy to check early.
> +      if (use_insn->is_call ()
> +         && HARD_REGISTER_P (SET_DEST (set))
> +         && find_reg_fusage (use_insn->rtl (), USE, SET_DEST (set)))
> +       return false;
> +
> +      // We'll keep the uses in their original order, even if we move
> +      // them relative to other instructions.  Make sure that non-final
> +      // uses do not change any values that occur in the SET_SRC.
> +      if (prev_use && prev_use->ebb () == use->ebb ())
> +       {
> +         def_info *ultimate_def = look_through_degenerate_phi (def);
> +         if (insn_clobbers_resources (prev_use->insn (),
> +                                      ultimate_def->insn ()->uses ()))
> +           return false;
> +       }
> +
> +      prev_use = use;
> +    }
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!use->phi ()->is_degenerate ()
> +       || !check_uses (use->phi (), set))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Try to remove INSN by substituting a definition into all uses.
> +// If the optimization moves any instructions before CURSOR, add those
> +// instructions to the end of m_worklist.
> +bool
> +late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
> +{
> +  // For simplicity, don't try to handle sets of multiple hard registers.
> +  // And for correctness, don't remove any assignments to the stack or
> +  // frame pointers, since that would implicitly change the set of valid
> +  // memory locations between this assignment and the next.
> +  //
> +  // Removing assignments to the hard frame pointer would invalidate
> +  // backtraces.
> +  set_info *def = single_set_info (insn);
> +  if (!def
> +      || !def->is_reg ()
> +      || def->regno () == STACK_POINTER_REGNUM
> +      || def->regno () == FRAME_POINTER_REGNUM
> +      || def->regno () == HARD_FRAME_POINTER_REGNUM)
> +    return false;
> +
> +  rtx set = optimizable_set (insn);
> +  if (!set)
> +    return false;
> +
> +  // For simplicity, don't try to handle subreg destinations.
> +  rtx dest = SET_DEST (set);
> +  if (!REG_P (dest) || def->regno () != REGNO (dest))
> +    return false;
> +
> +  // Don't prolong the live ranges of allocatable hard registers, or put
> +  // them into more complicated instructions.  Failing to prevent this
> +  // could lead to spill failures, or at least to worse register allocation.
> +  if (!reload_completed
> +      && accesses_include_nonfixed_hard_registers (insn->uses ()))
> +    return false;
> +
> +  if (!reload_completed && !check_register_pressure (insn, set))
> +    return false;
> +
> +  if (!check_uses (def, set))
> +    return false;
> +
> +  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
> +  if (!combination.run ())
> +    return false;
> +
> +  for (auto *use_change : combination.use_changes ())
> +    if (*use_change->insn () < *cursor)
> +      m_worklist.safe_push (use_change->insn ());
> +    else
> +      break;
> +  return true;
> +}
> +
> +// Run the pass on function FN.
> +unsigned int
> +late_combine::execute (function *fn)
> +{
> +  // Initialization.
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (fn);
> +  // Don't allow memory_operand to match volatile MEMs.
> +  init_recog_no_volatile ();
> +
> +  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
> +  while (insn)
> +    {
> +      if (!insn->is_artificial ())
> +       {
> +         insn_info *prev = insn->prev_nondebug_insn ();
> +         if (combine_into_uses (insn, prev))
> +           {
> +             // Any instructions that get added to the worklist were
> +             // previously after PREV.  Thus if we were able to move
> +             // an instruction X before PREV during one combination,
> +             // X cannot depend on any instructions that we move before
> +             // PREV during subsequent combinations.  This means that
> +             // the worklist should be free of backwards dependencies,
> +             // even if it isn't necessarily in RPO.
> +             for (unsigned int i = 0; i < m_worklist.length (); ++i)
> +               combine_into_uses (m_worklist[i], prev);
> +             m_worklist.truncate (0);
> +             insn = prev;
> +           }
> +       }
> +      insn = insn->next_nondebug_insn ();
> +    }
> +
> +  // Finalization.
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +  // Make the recognizer allow volatile MEMs again.
> +  init_recog ();
> +  free_dominance_info (CDI_DOMINATORS);
> +  return 0;
> +}
> +
> +class pass_late_combine : public rtl_opt_pass
> +{
> +public:
> +  pass_late_combine (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_late_combine, ctxt)
> +  {}
> +
> +  // opt_pass methods:
> +  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
> +  bool gate (function *) override { return flag_late_combine_instructions; }
> +  unsigned int execute (function *) override;
> +};
> +
> +unsigned int
> +pass_late_combine::execute (function *fn)
> +{
> +  return late_combine ().execute (fn);
> +}
> +
> +} // end namespace
> +
> +// Create a new CC fusion pass instance.
> +
> +rtl_opt_pass *
> +make_pass_late_combine (gcc::context *ctxt)
> +{
> +  return new pass_late_combine (ctxt);
> +}
> diff --git a/gcc/opts.cc b/gcc/opts.cc
> index 1b1b46455af..915bce88fd6 100644
> --- a/gcc/opts.cc
> +++ b/gcc/opts.cc
> @@ -664,6 +664,7 @@ static const struct default_options default_options_table[] =
>        VECT_COST_MODEL_VERY_CHEAP },
>      { OPT_LEVELS_2_PLUS, OPT_finline_functions, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
>
>      /* -O2 and above optimizations, but not -Os or -Og.  */
>      { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_falign_functions, NULL, 1 },
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 041229e47a6..13c9dc34ddf 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -493,6 +493,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_initialize_regs);
>        NEXT_PASS (pass_ud_rtl_dce);
>        NEXT_PASS (pass_combine);
> +      NEXT_PASS (pass_late_combine);
>        NEXT_PASS (pass_if_after_combine);
>        NEXT_PASS (pass_jump_after_combine);
>        NEXT_PASS (pass_partition_blocks);
> @@ -512,6 +513,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_postreload);
>        PUSH_INSERT_PASSES_WITHIN (pass_postreload)
>           NEXT_PASS (pass_postreload_cse);
> +         NEXT_PASS (pass_late_combine);
>           NEXT_PASS (pass_gcse2);
>           NEXT_PASS (pass_split_after_reload);
>           NEXT_PASS (pass_ree);
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> index f290b9ccbdc..a95637abbe5 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> @@ -25,5 +25,5 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> index 6212c95585d..0690e036eaa 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> @@ -30,6 +30,6 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* XFAIL due to PR70681.  */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
> index b0c5c61972f..052d2abc2f1 100644
> --- a/gcc/testsuite/gcc.dg/stack-check-4.c
> +++ b/gcc/testsuite/gcc.dg/stack-check-4.c
> @@ -20,7 +20,7 @@
>     scan for.   We scan for both the positive and negative cases.  */
>
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
> +/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
>  /* { dg-require-effective-target supports_stack_clash_protection } */
>
>  extern void arf (char *);
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> index 4a228b0a1ce..c29a230a771 100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target bitint } } */
> -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
>  /* { dg-final { check-function-bodies "**" "" "" } } */
>
>  #define ALIGN 16
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> index e7f773640f0..13ffbf416ca 100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target bitint } } */
> -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
>  /* { dg-final { check-function-bodies "**" "" "" } } */
>
>  #define ALIGN 8
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> new file mode 100644
> index 00000000000..71bcafcb44f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2" } */
> +
> +extern const int constellation_64qam[64];
> +
> +void foo(int nbits,
> +         const char *p_src,
> +         int *p_dst) {
> +
> +  while (nbits > 0U) {
> +    char first = *p_src++;
> +
> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
> +
> +    *p_dst++ = constellation_64qam[index1];
> +
> +    nbits--;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> index 0d620a30d5d..b537c6154a3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> @@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
>
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
>
> -/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tmov\tz} } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> index a294effd4a9..cff806c278d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> index 6541a2ea49d..abf0a2e832f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> index e66477b3bce..401201b315a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> @@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /Z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow zero operands.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> index a491f899088..cbb957bffa4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> @@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index edebb2be245..38902b1b01b 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -615,6 +615,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
>                                                               *ctxt);
>  extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-21  0:17   ` Oleg Endo
@ 2024-06-21  8:09     ` Richard Sandiford
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Sandiford @ 2024-06-21  8:09 UTC (permalink / raw)
  To: Oleg Endo; +Cc: jlaw, gcc-patches

Oleg Endo <oleg.endo@t-online.de> writes:
> On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>> 
>> I tried compiling at least one target per CPU directory and comparing
>> the assembly output for parts of the GCC testsuite.  This is just a way
>> of getting a flavour of how the pass performs; it obviously isn't a
>> meaningful benchmark.  All targets seemed to improve on average:
>> 
>> Target                 Tests   Good    Bad   %Good   Delta  Median
>> ======                 =====   ====    ===   =====   =====  ======
>> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
>> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
>> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
>> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
>> arc-elf                 2166   1932    234  89.20%  -37742      -1
>> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
>> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
>> avr-elf                 4789   4330    459  90.42% -441276      -4
>> bfin-elf                2795   2394    401  85.65%  -19252      -1
>> bpf-elf                 3122   2928    194  93.79%   -8785      -1
>> c6x-elf                 2227   1929    298  86.62%  -17339      -1
>> cris-elf                3464   3270    194  94.40%  -23263      -2
>> csky-elf                2915   2591    324  88.89%  -22146      -1
>> epiphany-elf            2399   2304     95  96.04%  -28698      -2
>> fr30-elf                7712   7299    413  94.64%  -99830      -2
>> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
>> ft32-elf                2775   2667    108  96.11%  -25029      -1
>> h8300-elf               3176   2862    314  90.11%  -29305      -2
>> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
>> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
>> iq2000-elf              9684   9637     47  99.51% -126557      -2
>> lm32-elf                2681   2608     73  97.28%  -59884      -3
>> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
>> m32r-elf                1626   1517    109  93.30%   -9323      -2
>> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
>> mcore-elf               2315   2085    230  90.06%  -24160      -1
>> microblaze-elf          2782   2585    197  92.92%  -16530      -1
>> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
>> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
>> mmix                    4914   4814    100  97.96%  -63021      -1
>> mn10300-elf             3639   3320    319  91.23%  -34752      -2
>> moxie-rtems             3497   3252    245  92.99%  -87305      -3
>> msp430-elf              4353   3876    477  89.04%  -23780      -1
>> nds32le-elf             3042   2780    262  91.39%  -27320      -1
>> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
>> nvptx-none              2114   1781    333  84.25%  -12589      -2
>> or1k-elf                3045   2699    346  88.64%  -14328      -2
>> pdp11                   4515   4146    369  91.83%  -26047      -2
>> pru-elf                 1585   1245    340  78.55%   -5225      -1
>> riscv32-elf             2122   2000    122  94.25% -101162      -2
>> riscv64-elf             1841   1726    115  93.75%  -49997      -2
>> rl78-elf                2823   2530    293  89.62%  -40742      -4
>> rx-elf                  2614   2480    134  94.87%  -18863      -1
>> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
>> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
>> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
>> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
>> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
>> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
>> v850-elf                1897   1728    169  91.09%  -11078      -1
>> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
>> visium-elf              1392   1106    286  79.45%   -7984      -2
>> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
>> 
>> 
>
> Since you have already briefly compared some of the code, can you share
> those cases which get worse and might require some potential follow up
> patches?

I think a lot of them are unpredictable secondary effects, such as on
register allocation, tail merging potential, and so on.  For sh, it also
includes whether delay slots are filled with useful work, or whether
they get a nop.  (Instruction combination tends to create more complex
instructions, so there will be fewer 2-byte instructions to act as delay
slot candidates.)

Also, this kind of combination can decrease the number of instructions
but increase the constant pool size.  The figures take that into account.
(The comparison is a bit ad-hoc, though, since I wasn't dedicated enough
to try to build a full source->executable toolchain for each target. :))

To give one example, the effect on gcc.c-torture/compile/20040727-1.c is:

@@ -6,18 +6,21 @@
        .global GC_dirty_init
        .type   GC_dirty_init, @function
 GC_dirty_init:
-       mov.l   .L2,r4
-       mov     r4,r6
-       mov     r4,r5
-       add     #-64,r5
-       mov.l   .L3,r0
+       mov.l   .L2,r6
+       mov.l   .L3,r5
+       mov.l   .L4,r4
+       mov.l   .L5,r0
        jmp     @r0
-       add     #-128,r4
-.L4:
+       nop
+.L6:
        .align 2
 .L2:
        .long   GC_old_exc_ports+132
 .L3:
+       .long   GC_old_exc_ports+68
+.L4:
+       .long   GC_old_exc_ports+4
+.L5:
        .long   task_get_exception_ports
        .size   GC_dirty_init, .-GC_dirty_init
        .local  GC_old_exc_ports

Thanks,
Richard

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces
  2024-06-20 21:22   ` Alex Coplan
@ 2024-06-21  8:11     ` Richard Sandiford
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Sandiford @ 2024-06-21  8:11 UTC (permalink / raw)
  To: Alex Coplan; +Cc: jlaw, gcc-patches

Alex Coplan <alex.coplan@arm.com> writes:
> Hi Richard,
>
> I had a quick look through the patch and noticed a couple of minor typos.
> Otherwise looks like a nice cleanup!

Thanks for the review!  I've fixed the typos in my local copy.

Richard

> On 20/06/2024 14:34, Richard Sandiford wrote:
>> rtl-ssa has routines for scanning forwards or backwards for something
>> under the control of an exclusion set.  These searches are currently
>> used for two main things:
>> 
>> - to work out where an instruction can be moved within its EBB
>> - to work out whether recog can add a new hard register clobber
>> 
>> The exclusion set was originally a callback function that returned
>> true for insns that should be ignored.  However, for the late-combine
>> work, I'd also like to be able to skip an entire definition, along
>> with all its uses.
>> 
>> This patch prepares for that by turning the exclusion set into an
>> object that provides predicate member functions.  Currently the
>> only two member functions are:
>> 
>> - should_ignore_insn: what the old callback did
>> - should_ignore_def: the new functionality
>> 
>> but more could be added later.
>> 
>> Doing this also makes it easy to remove some assymmetry that I think
>
> s/assymmetry/asymmetry/
>
>> in hindsight was a mistake: in forward scans, ignoring an insn meant
>> ignoring all definitions in that insn (ok) and all uses of those
>> definitions (non-obvious).  The new interface makes it possible
>> to select the required behaviour, with that behaviour being applied
>> consistently in both directions.
>> 
>> Now that the exclusion set is a dedicated object, rather than
>> just a "random" function, I think it makes sense to remove the
>> _ignoring suffix from the function names.  The suffix was originally
>> there to describe the callback, and in particular to emphasise that
>> a true return meant "ignore" rather than "heed".
>> 
>> gcc/
>> 	* rtl-ssa.h: Include predicates.h.
>> 	* rtl-ssa/predicates.h: New file.
>> 	* rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to...
>> 	(prev_call_clobbers): ...this and treat the ignore parameter as an
>> 	object with the same interface as ignore_nothing.
>> 	(next_call_clobbers_ignoring): Rename to...
>> 	(next_call_clobbers): ...this and treat the ignore parameter as an
>> 	object with the same interface as ignore_nothing.
>> 	(first_nondebug_insn_use_ignoring): Rename to...
>> 	(first_nondebug_insn_use): ...this and treat the ignore parameter as
>> 	an object with the same interface as ignore_nothing.
>> 	(last_nondebug_insn_use_ignoring): Rename to...
>> 	(last_nondebug_insn_use): ...this and treat the ignore parameter as
>> 	an object with the same interface as ignore_nothing.
>> 	(last_access_ignoring): Rename to...
>> 	(last_access): ...this and treat the ignore parameter as an object
>> 	with the same interface as ignore_nothing.  Conditionally skip
>> 	definitions.
>> 	(prev_access_ignoring): Rename to...
>> 	(prev_access): ...this and treat the ignore parameter as an object
>> 	with the same interface as ignore_nothing.
>> 	(first_def_ignoring): Replace with...
>> 	(first_access): ...this new function.
>> 	(next_access_ignoring): Rename to...
>> 	(next_access): ...this and treat the ignore parameter as an object
>> 	with the same interface as ignore_nothing.  Conditionally skip
>> 	definitions.
>> 	* rtl-ssa/change-utils.h (insn_is_changing): Delete.
>> 	(restrict_movement_ignoring): Rename to...
>> 	(restrict_movement): ...this and treat the ignore parameter as an
>> 	object with the same interface as ignore_nothing.
>> 	(recog_ignoring): Rename to...
>> 	(recog): ...this and treat the ignore parameter as an object with
>> 	the same interface as ignore_nothing.
>> 	* rtl-ssa/changes.h (insn_is_changing_closure): Delete.
>> 	* rtl-ssa/functions.h (function_info::add_regno_clobber): Treat
>> 	the ignore parameter as an object with the same interface as
>> 	ignore_nothing.
>> 	* rtl-ssa/insn-utils.h (insn_is): Delete.
>> 	* rtl-ssa/insns.h (insn_is_closure): Delete.
>> 	* rtl-ssa/member-fns.inl
>> 	(insn_is_changing_closure::insn_is_changing_closure): Delete.
>> 	(insn_is_changing_closure::operator()): Likewise.
>> 	(function_info::add_regno_clobber): Treat the ignore parameter
>> 	as an object with the same interface as ignore_nothing.
>> 	(ignore_changing_insns::ignore_changing_insns): New function.
>> 	(ignore_changing_insns::should_ignore_insn): Likewise.
>> 	* rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat
>> 	the ignore parameter as an object with the same interface as
>> 	ignore_nothing.
>> 	(restrict_movement_for_defs_ignoring): Rename to...
>> 	(restrict_movement_for_defs): ...this and treat the ignore parameter
>> 	as an object with the same interface as ignore_nothing.
>> 	(restrict_movement_for_uses_ignoring): Rename to...
>> 	(restrict_movement_for_uses): ...this and treat the ignore parameter
>> 	as an object with the same interface as ignore_nothing.  Conditionally
>> 	skip definitions.
>> 	* doc/rtl.texi: Update for above name changes.  Use
>> 	ignore_changing_insns instead of insn_is_changing.
>> 	* config/aarch64/aarch64-cc-fusion.cc (cc_fusion::parallelize_insns):
>> 	Likewise.
>> 	* pair-fusion.cc (no_ignore): Delete.
>> 	(latest_hazard_before, first_hazard_after): Update for above name
>> 	changes.  Use ignore_nothing instead of no_ignore.
>> 	(pair_fusion_bb_info::fuse_pair): Update for above name changes.
>> 	Use ignore_changing_insns instead of insn_is_changing.
>> 	(pair_fusion::try_promote_writeback): Likewise.
>> ---
>>  gcc/config/aarch64/aarch64-cc-fusion.cc |   4 +-
>>  gcc/doc/rtl.texi                        |  14 +--
>>  gcc/pair-fusion.cc                      |  34 +++---
>>  gcc/rtl-ssa.h                           |   1 +
>>  gcc/rtl-ssa/access-utils.h              | 145 +++++++++++++-----------
>>  gcc/rtl-ssa/change-utils.h              |  67 +++++------
>>  gcc/rtl-ssa/changes.h                   |  13 ---
>>  gcc/rtl-ssa/functions.h                 |  16 ++-
>>  gcc/rtl-ssa/insn-utils.h                |   8 --
>>  gcc/rtl-ssa/insns.h                     |  12 --
>>  gcc/rtl-ssa/member-fns.inl              |  35 +++---
>>  gcc/rtl-ssa/movement.h                  | 118 +++++++++----------
>>  gcc/rtl-ssa/predicates.h                |  58 ++++++++++
>>  13 files changed, 275 insertions(+), 250 deletions(-)
>>  create mode 100644 gcc/rtl-ssa/predicates.h
>> 
> <snip>
>> diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
>> index f5aca643beb..479c6992e97 100644
>> --- a/gcc/rtl-ssa/functions.h
>> +++ b/gcc/rtl-ssa/functions.h
>> @@ -165,16 +165,22 @@ public:
>>  
>>    // If CHANGE doesn't already clobber REGNO, try to add such a clobber,
>>    // limiting the movement range in order to make the clobber valid.
>> -  // When determining whether REGNO is live, ignore accesses made by an
>> -  // instruction I if IGNORE (I) is true.  The caller then assumes the
>> -  // responsibility of ensuring that CHANGE and I are placed in a valid order.
>> +  // Use IGNORE to guide this process, where IGNORE is an object that
>> +  // provides the same interface as ignore_nothing.
>> +  //
>> +  // That is, when determining whether REGNO is live, ignore accesses made
>> +  // by an instruction I if IGNORE says that I should be ignored.  The caller
>> +  // then assumes the responsibility of ensuring that CHANGE and I are placed
>> +  // in a valid order.  Similarly, ignore live ranges associated/ with a
>
> Stray '/' after associated?
>
> Thanks,
> Alex
>
>> +  // definition of REGNO if IGNORE says that that definition should be
>> +  // ignored.
>>    //
>>    // Return true on success.  Leave CHANGE unmodified when returning false.
>>    //
>>    // WATERMARK is a watermark returned by new_change_attempt ().
>> -  template<typename IgnorePredicate>
>> +  template<typename IgnorePredicates>
>>    bool add_regno_clobber (obstack_watermark &watermark, insn_change &change,
>> -			  unsigned int regno, IgnorePredicate ignore);
>> +			  unsigned int regno, IgnorePredicates ignore);
>>  
>>    // Return true if change_insns will be able to perform the changes
>>    // described by CHANGES.
> <snip>


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-21  5:54   ` Richard Biener
@ 2024-06-21  8:21     ` Richard Sandiford
  2024-06-21  9:26       ` Richard Biener
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-21  8:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: jlaw, gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> [...]
> I wonder if you can amend doc/passes.texi, specifically noting differences
> between fwprop, combine and late-combine?

Ooh, we have a doc/passes.texi? :)  Somehow missed that.

How about the patch below?

Thanks,
Richard


diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index 5746d3ec636..4ac7a2306a1 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -991,6 +991,25 @@ RTL expressions for the instructions by substitution, simplifies the
 result using algebra, and then attempts to match the result against
 the machine description.  The code is located in @file{combine.cc}.
 
+@item Late instruction combination
+
+This pass attempts to do further instruction combination, on top of
+that performed by @file{combine.cc}.  Its current purpose is to
+substitute definitions into all uses simultaneously, so that the
+definition can be removed.  This differs from the forward propagation
+pass, whose purpose is instead to simplify individual uses on the
+assumption that the definition will remain.  It differs from
+@file{combine.cc} in that there is no hard-coded limit on the number
+of instructions that can be combined at once.  It also differs from
+@file{combine.cc} in that it can move instructions, where necessary.
+
+However, the pass is not in principle limited to this form of
+combination.  It is intended to be a home for other, future
+combination approaches as well.
+
+The pass runs twice, once before register allocation and once after
+register allocation.  The code is located in @file{late-combine.cc}.
+
 @item Mode switching optimization
 
 This pass looks for instructions that require the processor to be in a


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-21  8:21     ` Richard Sandiford
@ 2024-06-21  9:26       ` Richard Biener
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Biener @ 2024-06-21  9:26 UTC (permalink / raw)
  To: Richard Biener, jlaw, gcc-patches, richard.sandiford

On Fri, Jun 21, 2024 at 10:21 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
> > I wonder if you can amend doc/passes.texi, specifically noting differences
> > between fwprop, combine and late-combine?
>
> Ooh, we have a doc/passes.texi? :)  Somehow missed that.

Yeah, I also usually forget this.

> How about the patch below?

Thanks - looks good to me.

Richard.

> Thanks,
> Richard
>
>
> diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
> index 5746d3ec636..4ac7a2306a1 100644
> --- a/gcc/doc/passes.texi
> +++ b/gcc/doc/passes.texi
> @@ -991,6 +991,25 @@ RTL expressions for the instructions by substitution, simplifies the
>  result using algebra, and then attempts to match the result against
>  the machine description.  The code is located in @file{combine.cc}.
>
> +@item Late instruction combination
> +
> +This pass attempts to do further instruction combination, on top of
> +that performed by @file{combine.cc}.  Its current purpose is to
> +substitute definitions into all uses simultaneously, so that the
> +definition can be removed.  This differs from the forward propagation
> +pass, whose purpose is instead to simplify individual uses on the
> +assumption that the definition will remain.  It differs from
> +@file{combine.cc} in that there is no hard-coded limit on the number
> +of instructions that can be combined at once.  It also differs from
> +@file{combine.cc} in that it can move instructions, where necessary.
> +
> +However, the pass is not in principle limited to this form of
> +combination.  It is intended to be a home for other, future
> +combination approaches as well.
> +
> +The pass runs twice, once before register allocation and once after
> +register allocation.  The code is located in @file{late-combine.cc}.
> +
>  @item Mode switching optimization
>
>  This pass looks for instructions that require the processor to be in a


* Re: [PATCH 2/6] rtl-ssa: Don't cost no-op moves
  2024-06-20 13:34 ` [PATCH 2/6] rtl-ssa: Don't cost no-op moves Richard Sandiford
@ 2024-06-21 14:32   ` Jeff Law
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Law @ 2024-06-21 14:32 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches



On 6/20/24 7:34 AM, Richard Sandiford wrote:
> No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan
> to delete them later.  Such insns shouldn't be costed, partly
> because they're going to disappear, and partly because targets
> won't recognise the insn code.
> 
> gcc/
> 	* rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't
> 	cost no-op moves.
> 	* rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise.
This is OK.  Your call if you want to include it now or wait for the 
full series to be ACK'd.

jeff



* Re: [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand
  2024-06-20 13:34 ` [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand Richard Sandiford
@ 2024-06-21 14:33   ` Jeff Law
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Law @ 2024-06-21 14:33 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches



On 6/20/24 7:34 AM, Richard Sandiford wrote:
> All uses of xs_hi_nonmemory_operand allow constraint "i",
> which means that they allow consts, symbol_refs and label_refs.
> The definition of xs_hi_nonmemory_operand accounted for consts,
> but not for symbol_refs and label_refs.
> 
> gcc/
> 	* config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle
> 	symbol_ref and label_ref.
OK for the trunk anytime.
jeff



* Re: [PATCH 3/6] iq2000: Fix test and branch instructions
  2024-06-20 13:34 ` [PATCH 3/6] iq2000: Fix test and branch instructions Richard Sandiford
@ 2024-06-21 14:33   ` Jeff Law
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Law @ 2024-06-21 14:33 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches



On 6/20/24 7:34 AM, Richard Sandiford wrote:
> The iq2000 test and branch instructions had patterns like:
> 
>    [(set (pc)
> 	(if_then_else
> 	 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
> 		     (match_operand:SI 1 "power_of_2_operand" "I"))
> 	      (const_int 0))
> 	 (match_operand 2 "pc_or_label_operand" "")
> 	 (match_operand 3 "pc_or_label_operand" "")))]
> 
> power_of_2_operand allows any 32-bit power of 2, whereas "I" only
> accepts 16-bit signed constants.  This meant that any power of 2
> greater than 32768 would cause an "insn does not satisfy its
> constraints" ICE.
> 
> Also, the %p operand modifier barfed on 1<<31, which is sign-
> rather than zero-extended to 64 bits.  The code is inherently
> limited to 32-bit operands -- power_of_2_operand contains a test
> involving "unsigned" -- so this patch just ands with 0xffffffff.
> 
> gcc/
> 	* config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 1<<31.
> 	* config/iq2000/iq2000.md: Remove "I" constraints on
> 	power_of_2_operands.
OK for the trunk.
jeff



* Re: [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces
  2024-06-20 13:34 ` [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces Richard Sandiford
  2024-06-20 21:22   ` Alex Coplan
@ 2024-06-21 14:40   ` Jeff Law
  1 sibling, 0 replies; 36+ messages in thread
From: Jeff Law @ 2024-06-21 14:40 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches



On 6/20/24 7:34 AM, Richard Sandiford wrote:
> rtl-ssa has routines for scanning forwards or backwards for something
> under the control of an exclusion set.  These searches are currently
> used for two main things:
> 
> - to work out where an instruction can be moved within its EBB
> - to work out whether recog can add a new hard register clobber
> 
> The exclusion set was originally a callback function that returned
> true for insns that should be ignored.  However, for the late-combine
> work, I'd also like to be able to skip an entire definition, along
> with all its uses.
> 
> This patch prepares for that by turning the exclusion set into an
> object that provides predicate member functions.  Currently the
> only two member functions are:
> 
> - should_ignore_insn: what the old callback did
> - should_ignore_def: the new functionality
> 
> but more could be added later.
> 
> Doing this also makes it easy to remove some assymmetry that I think
> in hindsight was a mistake: in forward scans, ignoring an insn meant
> ignoring all definitions in that insn (ok) and all uses of those
> definitions (non-obvious).  The new interface makes it possible
> to select the required behaviour, with that behaviour being applied
> consistently in both directions.
> 
> Now that the exclusion set is a dedicated object, rather than
> just a "random" function, I think it makes sense to remove the
> _ignoring suffix from the function names.  The suffix was originally
> there to describe the callback, and in particular to emphasise that
> a true return meant "ignore" rather than "heed".
> 
> gcc/
> 	* rtl-ssa.h: Include predicates.h.
> 	* rtl-ssa/predicates.h: New file.
> 	* rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to...
> 	(prev_call_clobbers): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(next_call_clobbers_ignoring): Rename to...
> 	(next_call_clobbers): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(first_nondebug_insn_use_ignoring): Rename to...
> 	(first_nondebug_insn_use): ...this and treat the ignore parameter as
> 	an object with the same interface as ignore_nothing.
> 	(last_nondebug_insn_use_ignoring): Rename to...
> 	(last_nondebug_insn_use): ...this and treat the ignore parameter as
> 	an object with the same interface as ignore_nothing.
> 	(last_access_ignoring): Rename to...
> 	(last_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.  Conditionally skip
> 	definitions.
> 	(prev_access_ignoring): Rename to...
> 	(prev_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.
> 	(first_def_ignoring): Replace with...
> 	(first_access): ...this new function.
> 	(next_access_ignoring): Rename to...
> 	(next_access): ...this and treat the ignore parameter as an object
> 	with the same interface as ignore_nothing.  Conditionally skip
> 	definitions.
> 	* rtl-ssa/change-utils.h (insn_is_changing): Delete.
> 	(restrict_movement_ignoring): Rename to...
> 	(restrict_movement): ...this and treat the ignore parameter as an
> 	object with the same interface as ignore_nothing.
> 	(recog_ignoring): Rename to...
> 	(recog): ...this and treat the ignore parameter as an object with
> 	the same interface as ignore_nothing.
> 	* rtl-ssa/changes.h (insn_is_changing_closure): Delete.
> 	* rtl-ssa/functions.h (function_info::add_regno_clobber): Treat
> 	the ignore parameter as an object with the same interface as
> 	ignore_nothing.
> 	* rtl-ssa/insn-utils.h (insn_is): Delete.
> 	* rtl-ssa/insns.h (insn_is_closure): Delete.
> 	* rtl-ssa/member-fns.inl
> 	(insn_is_changing_closure::insn_is_changing_closure): Delete.
> 	(insn_is_changing_closure::operator()): Likewise.
> 	(function_info::add_regno_clobber): Treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.
> 	(ignore_changing_insns::ignore_changing_insns): New function.
> 	(ignore_changing_insns::should_ignore_insn): Likewise.
> 	* rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat
> 	the ignore parameter as an object with the same interface as
> 	ignore_nothing.
> 	(restrict_movement_for_defs_ignoring): Rename to...
> 	(restrict_movement_for_defs): ...this and treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.
> 	(restrict_movement_for_uses_ignoring): Rename to...
> 	(restrict_movement_for_uses): ...this and treat the ignore parameter
> 	as an object with the same interface as ignore_nothing.  Conditionally
> 	skip definitions.
> 	* doc/rtl.texi: Update for above name changes.  Use
> 	ignore_changing_insns instead of insn_is_changing.
> 	* config/aarch64/aarch64-cc-fusion.cc (cc_fusion::parallelize_insns):
> 	Likewise.
> 	* pair-fusion.cc (no_ignore): Delete.
> 	(latest_hazard_before, first_hazard_after): Update for above name
> 	changes.  Use ignore_nothing instead of no_ignore.
> 	(pair_fusion_bb_info::fuse_pair): Update for above name changes.
> 	Use ignore_changing_insns instead of insn_is_changing.
> 	(pair_fusion::try_promote_writeback): Likewise.
> ---

OK.

jeff


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
  2024-06-21  0:17   ` Oleg Endo
  2024-06-21  5:54   ` Richard Biener
@ 2024-06-21 15:00   ` Jeff Law
  2024-06-22  5:12   ` Takayuki 'January June' Suwa
  2024-06-25  9:02   ` Thomas Schwinge
  4 siblings, 0 replies; 36+ messages in thread
From: Jeff Law @ 2024-06-21 15:00 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches



On 6/20/24 7:34 AM, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
> 
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
I would expect this to fix a problem we've seen on RISC-V as well. 
Essentially we have A, B and C.  We want to combine A->B and A->C 
generating B' and C' and eliminate A.  This shows up in the xz loop.


> 
> On most targets, the pass is enabled by default at -O2 and above.
> However, it has a tendency to undo x86's STV and RPAD passes,
> by folding the more complex post-STV/RPAD form back into the
> simpler pre-pass form.
IIRC the limited enablement was one of the things folks were unhappy 
about in the gcc-14 cycle.  Good to see that addressed.


> 
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
> 
>    (define_insn_and_split "..."
>      [...pattern...]
>      "...cond..."
>      "#"
>      "&& 1"
>      [...pattern...]
>      {
>        ...unconditional use of gen_reg_rtx ()...;
>      }
> 
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.
Interesting.  I suspect ppc won't be the only affected port.  This is 
somewhat worrisome.

> 
> xtensa has a variation in which the split condition is:
> 
>      "&& can_create_pseudo_p ()"
> 
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.
> 
> The patch therefore disables the pass by default on i386, rs6000
> and xtensa.  Hopefully we can fix those ports later (if their
> maintainers want).  It seems easier to add the pass first, though,
> to make it easier to test any such fixes.
I suspect it'll be a "does this make code better on the port, then let's 
fix the port so it can be used consistently" kind of scenario.  Given 
the data you've presented I strongly suspect it would make the code 
better on the xtensa, so hopefully Max will do the gruntwork on that one.


> 
> gcc/
> 	PR rtl-optimization/106594
> 	* Makefile.in (OBJS): Add late-combine.o.
> 	* common.opt (flate-combine-instructions): New option.
> 	* doc/invoke.texi: Document it.
> 	* opts.cc (default_options_table): Enable it by default at -O2
> 	and above.
> 	* tree-pass.h (make_pass_late_combine): Declare.
> 	* late-combine.cc: New file.
> 	* passes.def: Add two instances of late_combine.
> 	* config/i386/i386-options.cc (ix86_override_options_after_change):
> 	Disable late-combine by default.
> 	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Likewise.
> 	* config/xtensa/xtensa.cc (xtensa_option_override): Likewise.
> 
> gcc/testsuite/
> 	PR rtl-optimization/106594
> 	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
> 	targets.
> 	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
> 	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
> 	* gcc.target/aarch64/bitfield-bitint-abi-align16.c: Add
> 	-fno-late-combine-instructions.
> 	* gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
> 	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
> 	described in the comment.
> 	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
> 	* gcc.target/aarch64/pr106594_1.c: New test.
> ---


OK.  Obviously we'll need to keep an eye on testing state after this 
patch.  I do expect fallout from the splitter issue noted above, but 
IMHO those are port problems for the port maintainers to sort out.

Jeff


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
                     ` (2 preceding siblings ...)
  2024-06-21 15:00   ` Jeff Law
@ 2024-06-22  5:12   ` Takayuki 'January June' Suwa
  2024-06-22 16:49     ` Richard Sandiford
  2024-06-25  9:02   ` Thomas Schwinge
  4 siblings, 1 reply; 36+ messages in thread
From: Takayuki 'January June' Suwa @ 2024-06-22  5:12 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

Hi!

On 2024/06/20 22:34, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
> 
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
> 
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.
> 
> This is just a first step.  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
> 
> On most targets, the pass is enabled by default at -O2 and above.
> However, it has a tendency to undo x86's STV and RPAD passes,
> by folding the more complex post-STV/RPAD form back into the
> simpler pre-pass form.
> 
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
> 
>    (define_insn_and_split "..."
>      [...pattern...]
>      "...cond..."
>      "#"
>      "&& 1"
>      [...pattern...]
>      {
>        ...unconditional use of gen_reg_rtx ()...;
>      }
> 
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.

xtensa also has something like that.

> xtensa has a variation in which the split condition is:
> 
>      "&& can_create_pseudo_p ()"
> 
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.

To be honest, I'm confused by the possibility of adding a split pattern
application opportunity that depends on the optimization options after
Rel... ah, LRA and before the existing rtl-split2.

Because I just recently submitted a patch that I expected would reliably
(i.e. regardless of optimization options, etc.) apply the split pattern
first in the rtl-split2 pass after RA, and it was merged.

> 
> The patch therefore disables the pass by default on i386, rs6000
> and xtensa.  Hopefully we can fix those ports later (if their
> maintainers want).  It seems easier to add the pass first, though,
> to make it easier to test any such fixes.
> 
> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> quite a few updates for the late-combine output.  That might be
> worth doing, but it seems too complex to do as part of this patch.
> 
> I tried compiling at least one target per CPU directory and comparing
> the assembly output for parts of the GCC testsuite.  This is just a way
> of getting a flavour of how the pass performs; it obviously isn't a
> meaningful benchmark.  All targets seemed to improve on average:
> 
> Target                 Tests   Good    Bad   %Good   Delta  Median
> ======                 =====   ====    ===   =====   =====  ======
> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
> arc-elf                 2166   1932    234  89.20%  -37742      -1
> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
> avr-elf                 4789   4330    459  90.42% -441276      -4
> bfin-elf                2795   2394    401  85.65%  -19252      -1
> bpf-elf                 3122   2928    194  93.79%   -8785      -1
> c6x-elf                 2227   1929    298  86.62%  -17339      -1
> cris-elf                3464   3270    194  94.40%  -23263      -2
> csky-elf                2915   2591    324  88.89%  -22146      -1
> epiphany-elf            2399   2304     95  96.04%  -28698      -2
> fr30-elf                7712   7299    413  94.64%  -99830      -2
> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
> ft32-elf                2775   2667    108  96.11%  -25029      -1
> h8300-elf               3176   2862    314  90.11%  -29305      -2
> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
> iq2000-elf              9684   9637     47  99.51% -126557      -2
> lm32-elf                2681   2608     73  97.28%  -59884      -3
> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
> m32r-elf                1626   1517    109  93.30%   -9323      -2
> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
> mcore-elf               2315   2085    230  90.06%  -24160      -1
> microblaze-elf          2782   2585    197  92.92%  -16530      -1
> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
> mmix                    4914   4814    100  97.96%  -63021      -1
> mn10300-elf             3639   3320    319  91.23%  -34752      -2
> moxie-rtems             3497   3252    245  92.99%  -87305      -3
> msp430-elf              4353   3876    477  89.04%  -23780      -1
> nds32le-elf             3042   2780    262  91.39%  -27320      -1
> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
> nvptx-none              2114   1781    333  84.25%  -12589      -2
> or1k-elf                3045   2699    346  88.64%  -14328      -2
> pdp11                   4515   4146    369  91.83%  -26047      -2
> pru-elf                 1585   1245    340  78.55%   -5225      -1
> riscv32-elf             2122   2000    122  94.25% -101162      -2
> riscv64-elf             1841   1726    115  93.75%  -49997      -2
> rl78-elf                2823   2530    293  89.62%  -40742      -4
> rx-elf                  2614   2480    134  94.87%  -18863      -1
> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
> v850-elf                1897   1728    169  91.09%  -11078      -1
> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
> visium-elf              1392   1106    286  79.45%   -7984      -2
> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
> 
> ** snip **

To be more frank: once a split pattern is defined, it can be applied
by any of the existing five split passes and possibly by the combiners.
In most cases it is enough to apply it in just one of those places, or
at least that is what the pattern's author intended.

Wouldn't applying the split pattern indiscriminately in all of those
places waste compile time and bring about unexpected and undesired
results?

I think we need some way to properly control where a split pattern is
applied, perhaps some predicate function.
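
As far as I know, the split condition itself is the closest thing we
have to such a predicate today, e.g. (illustrative fragments only):

    "&& reload_completed"         ;; apply only in the post-RA split passes
    "&& can_create_pseudo_p ()"   ;; apply only while pseudos are allowed

but neither of these can restrict a split to one specific pass.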

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-22  5:12   ` Takayuki 'January June' Suwa
@ 2024-06-22 16:49     ` Richard Sandiford
  2024-06-23  4:40       ` Takayuki 'January June' Suwa
  2024-06-23  9:34       ` Richard Biener
  0 siblings, 2 replies; 36+ messages in thread
From: Richard Sandiford @ 2024-06-22 16:49 UTC (permalink / raw)
  To: Takayuki 'January June' Suwa; +Cc: gcc-patches

Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
> On 2024/06/20 22:34, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> There are two instances: one between combine and split1, and one
>> after postreload.
>> 
>> The pass currently has a single objective: remove definitions by
>> substituting into all uses.  The pre-RA version tries to restrict
>> itself to cases that are likely to have a neutral or beneficial
>> effect on register pressure.
>> 
>> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
>> in the aarch64 test results, mostly due to making proper use of
>> MOVPRFX in cases where we didn't previously.
>> 
>> This is just a first step.  I'm hoping that the pass could be
>> used for other combine-related optimisations in future.  In particular,
>> the post-RA version doesn't need to restrict itself to cases where all
>> uses are substitutable, since it doesn't have to worry about register
>> pressure.  If we did that, and if we extended it to handle multi-register
>> REGs, the pass might be a viable replacement for regcprop, which in
>> turn might reduce the cost of having a post-RA instance of the new pass.
>> 
>> On most targets, the pass is enabled by default at -O2 and above.
>> However, it has a tendency to undo x86's STV and RPAD passes,
>> by folding the more complex post-STV/RPAD form back into the
>> simpler pre-pass form.
>> 
>> Also, running a pass after register allocation means that we can
>> now match define_insn_and_splits that were previously only matched
>> before register allocation.  This trips things like:
>> 
>>    (define_insn_and_split "..."
>>      [...pattern...]
>>      "...cond..."
>>      "#"
>>      "&& 1"
>>      [...pattern...]
>>      {
>>        ...unconditional use of gen_reg_rtx ()...;
>>      }
>> 
>> because matching and splitting after RA will call gen_reg_rtx when
>> pseudos are no longer allowed.  rs6000 has several instances of this.
>
> xtensa also has something like that.
>
>> xtensa has a variation in which the split condition is:
>> 
>>      "&& can_create_pseudo_p ()"
>> 
>> The failure then is that, if we match after RA, we'll never be
>> able to split the instruction.
>
> To be honest, I'm confused by the possibility of adding a split pattern
> application opportunity that depends on the optimization options after
> Rel... ah, LRA and before the existing rtl-split2.
>
> Because I just recently submitted a patch that I expected would reliably
> (i.e. regardless of optimization options, etc.) apply the split pattern
> first in the rtl-split2 pass after RA, and it was merged.
>
>> 
>> The patch therefore disables the pass by default on i386, rs6000
>> and xtensa.  Hopefully we can fix those ports later (if their
>> maintainers want).  It seems easier to add the pass first, though,
>> to make it easier to test any such fixes.
>> 
>> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
>> quite a few updates for the late-combine output.  That might be
>> worth doing, but it seems too complex to do as part of this patch.
>> 
>> I tried compiling at least one target per CPU directory and comparing
>> the assembly output for parts of the GCC testsuite.  This is just a way
>> of getting a flavour of how the pass performs; it obviously isn't a
>> meaningful benchmark.  All targets seemed to improve on average:
>> 
>> Target                 Tests   Good    Bad   %Good   Delta  Median
>> ======                 =====   ====    ===   =====   =====  ======
>> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
>> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
>> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
>> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
>> arc-elf                 2166   1932    234  89.20%  -37742      -1
>> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
>> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
>> avr-elf                 4789   4330    459  90.42% -441276      -4
>> bfin-elf                2795   2394    401  85.65%  -19252      -1
>> bpf-elf                 3122   2928    194  93.79%   -8785      -1
>> c6x-elf                 2227   1929    298  86.62%  -17339      -1
>> cris-elf                3464   3270    194  94.40%  -23263      -2
>> csky-elf                2915   2591    324  88.89%  -22146      -1
>> epiphany-elf            2399   2304     95  96.04%  -28698      -2
>> fr30-elf                7712   7299    413  94.64%  -99830      -2
>> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
>> ft32-elf                2775   2667    108  96.11%  -25029      -1
>> h8300-elf               3176   2862    314  90.11%  -29305      -2
>> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
>> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
>> iq2000-elf              9684   9637     47  99.51% -126557      -2
>> lm32-elf                2681   2608     73  97.28%  -59884      -3
>> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
>> m32r-elf                1626   1517    109  93.30%   -9323      -2
>> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
>> mcore-elf               2315   2085    230  90.06%  -24160      -1
>> microblaze-elf          2782   2585    197  92.92%  -16530      -1
>> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
>> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
>> mmix                    4914   4814    100  97.96%  -63021      -1
>> mn10300-elf             3639   3320    319  91.23%  -34752      -2
>> moxie-rtems             3497   3252    245  92.99%  -87305      -3
>> msp430-elf              4353   3876    477  89.04%  -23780      -1
>> nds32le-elf             3042   2780    262  91.39%  -27320      -1
>> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
>> nvptx-none              2114   1781    333  84.25%  -12589      -2
>> or1k-elf                3045   2699    346  88.64%  -14328      -2
>> pdp11                   4515   4146    369  91.83%  -26047      -2
>> pru-elf                 1585   1245    340  78.55%   -5225      -1
>> riscv32-elf             2122   2000    122  94.25% -101162      -2
>> riscv64-elf             1841   1726    115  93.75%  -49997      -2
>> rl78-elf                2823   2530    293  89.62%  -40742      -4
>> rx-elf                  2614   2480    134  94.87%  -18863      -1
>> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
>> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
>> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
>> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
>> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
>> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
>> v850-elf                1897   1728    169  91.09%  -11078      -1
>> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
>> visium-elf              1392   1106    286  79.45%   -7984      -2
>> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
>> 
>> ** snip **
>
> To be more frank: once a split pattern is defined, it can be applied
> by any of the existing five split passes and possibly by the combiners.
> In most cases it is enough to apply it in just one of those places, or
> at least that is what the pattern's author intended.
>
> Wouldn't applying the split pattern indiscriminately in all of those
> places waste compile time and bring about unexpected and undesired
> results?
>
> I think we need some way to properly control where a split pattern is
> applied, perhaps some predicate function.

The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part.  The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2.  (If the split condition allows it, of course.)

But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload.  And that's what causes the
issue.  If:

    (define_insn_and_split "..."
      [...pattern...]
      "...cond..."
      "#"
      "&& 1"
      [...pattern...]
      {
        ...unconditional use of gen_reg_rtx ()...;
      }

is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.

Similarly, if:

    (define_insn_and_split "..."
      [...pattern...]
      "...cond..."
      "#"
      "&& can_create_pseudo_p ()"
      [...pattern...]
      {
        ...unconditional use of gen_reg_rtx ()...;
      }

is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes.  So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".

The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using match_scratches.
rs6000 has many examples of this.  E.g.:

(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
  [(set (match_operand:IEEE128 0 "register_operand" "=wa")
	(neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
   (clobber (match_scratch:V16QI 2 "=v"))]
  "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
		   (neg:IEEE128 (match_dup 1)))
	      (use (match_dup 2))])]
{
  if (GET_CODE (operands[2]) == SCRATCH)
    operands[2] = gen_reg_rtx (V16QImode);

  emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
  [(set_attr "length" "8")
   (set_attr "type" "vecsimple")])

Before RA, this is just:

  (set ...)
  (clobber (scratch:V16QI))

and the split creates a new register.  After RA, operand 2 provides
the required temporary register:

  (set ...)
  (clobber (reg:V16QI TMP))

Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition).  But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.
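
In pattern form, that (discouraged) variant would be a sketch like:

    (define_insn_and_split "..."
      [...pattern...]
      "...cond... && can_create_pseudo_p ()"
      "#"
      "&& 1"
      [...pattern...]
      {
        ...use of gen_reg_rtx ()...;
      }

The gen_reg_rtx call is then safe in the sense that the insn can only
be matched before RA.  But if anything reruns recog on such an insn
after RA, an instruction that was previously accepted now fails to
match, which is the ICE trap described above.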

Thanks,
Richard


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-22 16:49     ` Richard Sandiford
@ 2024-06-23  4:40       ` Takayuki 'January June' Suwa
  2024-06-23  9:34       ` Richard Biener
  1 sibling, 0 replies; 36+ messages in thread
From: Takayuki 'January June' Suwa @ 2024-06-23  4:40 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

Hi!

On 2024/06/23 1:49, Richard Sandiford wrote:
> Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
>> On 2024/06/20 22:34, Richard Sandiford wrote:
>>> This patch adds a combine pass that runs late in the pipeline.
>>> There are two instances: one between combine and split1, and one
>>> after postreload.
>>>
>>> The pass currently has a single objective: remove definitions by
>>> substituting into all uses.  The pre-RA version tries to restrict
>>> itself to cases that are likely to have a neutral or beneficial
>>> effect on register pressure.
>>>
>>> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
>>> in the aarch64 test results, mostly due to making proper use of
>>> MOVPRFX in cases where we didn't previously.
>>>
>>> This is just a first step.  I'm hoping that the pass could be
>>> used for other combine-related optimisations in future.  In particular,
>>> the post-RA version doesn't need to restrict itself to cases where all
>>> uses are substitutable, since it doesn't have to worry about register
>>> pressure.  If we did that, and if we extended it to handle multi-register
>>> REGs, the pass might be a viable replacement for regcprop, which in
>>> turn might reduce the cost of having a post-RA instance of the new pass.
>>>
>>> On most targets, the pass is enabled by default at -O2 and above.
>>> However, it has a tendency to undo x86's STV and RPAD passes,
>>> by folding the more complex post-STV/RPAD form back into the
>>> simpler pre-pass form.
>>>
>>> Also, running a pass after register allocation means that we can
>>> now match define_insn_and_splits that were previously only matched
>>> before register allocation.  This trips things like:
>>>
>>>     (define_insn_and_split "..."
>>>       [...pattern...]
>>>       "...cond..."
>>>       "#"
>>>       "&& 1"
>>>       [...pattern...]
>>>       {
>>>         ...unconditional use of gen_reg_rtx ()...;
>>>       }
>>>
>>> because matching and splitting after RA will call gen_reg_rtx when
>>> pseudos are no longer allowed.  rs6000 has several instances of this.
>>
>> xtensa also has something like that.
>>
>>> xtensa has a variation in which the split condition is:
>>>
>>>       "&& can_create_pseudo_p ()"
>>>
>>> The failure then is that, if we match after RA, we'll never be
>>> able to split the instruction.
>>
>> To be honest, I'm confused by the possibility of adding a split pattern
>> application opportunity that depends on the optimization options after
>> Rel... ah, LRA and before the existing rtl-split2.
>>
>> Because I just recently submitted a patch that I expected would reliably
>> (i.e. regardless of optimization options, etc.) apply the split pattern
>> first in the rtl-split2 pass after RA, and it was merged.
>>
>>>
>>> The patch therefore disables the pass by default on i386, rs6000
>>> and xtensa.  Hopefully we can fix those ports later (if their
>>> maintainers want).  It seems easier to add the pass first, though,
>>> to make it easier to test any such fixes.
>>>
>>> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
>>> quite a few updates for the late-combine output.  That might be
>>> worth doing, but it seems too complex to do as part of this patch.
>>>
>>> I tried compiling at least one target per CPU directory and comparing
>>> the assembly output for parts of the GCC testsuite.  This is just a way
>>> of getting a flavour of how the pass performs; it obviously isn't a
>>> meaningful benchmark.  All targets seemed to improve on average:
>>>
>>> Target                 Tests   Good    Bad   %Good   Delta  Median
>>> ======                 =====   ====    ===   =====   =====  ======
>>> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
>>> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
>>> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
>>> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
>>> arc-elf                 2166   1932    234  89.20%  -37742      -1
>>> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
>>> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
>>> avr-elf                 4789   4330    459  90.42% -441276      -4
>>> bfin-elf                2795   2394    401  85.65%  -19252      -1
>>> bpf-elf                 3122   2928    194  93.79%   -8785      -1
>>> c6x-elf                 2227   1929    298  86.62%  -17339      -1
>>> cris-elf                3464   3270    194  94.40%  -23263      -2
>>> csky-elf                2915   2591    324  88.89%  -22146      -1
>>> epiphany-elf            2399   2304     95  96.04%  -28698      -2
>>> fr30-elf                7712   7299    413  94.64%  -99830      -2
>>> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
>>> ft32-elf                2775   2667    108  96.11%  -25029      -1
>>> h8300-elf               3176   2862    314  90.11%  -29305      -2
>>> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
>>> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
>>> iq2000-elf              9684   9637     47  99.51% -126557      -2
>>> lm32-elf                2681   2608     73  97.28%  -59884      -3
>>> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
>>> m32r-elf                1626   1517    109  93.30%   -9323      -2
>>> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
>>> mcore-elf               2315   2085    230  90.06%  -24160      -1
>>> microblaze-elf          2782   2585    197  92.92%  -16530      -1
>>> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
>>> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
>>> mmix                    4914   4814    100  97.96%  -63021      -1
>>> mn10300-elf             3639   3320    319  91.23%  -34752      -2
>>> moxie-rtems             3497   3252    245  92.99%  -87305      -3
>>> msp430-elf              4353   3876    477  89.04%  -23780      -1
>>> nds32le-elf             3042   2780    262  91.39%  -27320      -1
>>> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
>>> nvptx-none              2114   1781    333  84.25%  -12589      -2
>>> or1k-elf                3045   2699    346  88.64%  -14328      -2
>>> pdp11                   4515   4146    369  91.83%  -26047      -2
>>> pru-elf                 1585   1245    340  78.55%   -5225      -1
>>> riscv32-elf             2122   2000    122  94.25% -101162      -2
>>> riscv64-elf             1841   1726    115  93.75%  -49997      -2
>>> rl78-elf                2823   2530    293  89.62%  -40742      -4
>>> rx-elf                  2614   2480    134  94.87%  -18863      -1
>>> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
>>> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
>>> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
>>> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
>>> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
>>> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
>>> v850-elf                1897   1728    169  91.09%  -11078      -1
>>> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
>>> visium-elf              1392   1106    286  79.45%   -7984      -2
>>> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
>>>
>>> ** snip **
>>
>> To be more frank: once a split pattern is defined, it can be applied
>> by any of the existing five split passes and possibly by the combiners.
>> In most cases it is enough to apply it in just one of those places, or
>> at least that is what the pattern's author intended.
>>
>> Wouldn't applying the split pattern indiscriminately in all of those
>> places waste compile time and bring about unexpected and undesired
>> results?
>>
>> I think we need some way to properly control where a split pattern is
>> applied, perhaps some predicate function.
> 
> The problem is more the define_insn part of the define_insn_and_split,
> rather than the define_split part.  The number and location of the split
> passes is the same: anything matched by rtl-late_combine1 will be split by
> rtl-split1 and anything matched by rtl-late_combine2 will be split by
> rtl-split2.  (If the split condition allows it, of course.)
> 
> But more things can be matched by rtl-late_combine2 than are matched by
> other post-RA passes like rtl-postreload.  And that's what causes the
> issue.  If:
> 
>      (define_insn_and_split "..."
>        [...pattern...]
>        "...cond..."
>        "#"
>        "&& 1"
>        [...pattern...]
>        {
>          ...unconditional use of gen_reg_rtx ()...;
>        }
> 
> is matched by rtl-late_combine2, the split will be done by rtl-split2.
> But the split will ICE, because it isn't valid to call gen_reg_rtx after
> register allocation.
> 
> Similarly, if:
> 
>      (define_insn_and_split "..."
>        [...pattern...]
>        "...cond..."
>        "#"
>        "&& can_create_pseudo_p ()"
>        [...pattern...]
>        {
>          ...unconditional use of gen_reg_rtx ()...;
>        }
> 
> is matched by rtl-late_combine2, the can_create_pseudo_p condition will
> be false in rtl-split2, and in all subsequent split passes.  So we'll
> still have the unsplit instruction during final, which will ICE because
> it doesn't have a valid means of implementing the "#".
> 
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using match_scratches.
> rs6000 has many examples of this.  E.g.:
> 
> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>    [(set (match_operand:IEEE128 0 "register_operand" "=wa")
> 	(neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>     (clobber (match_scratch:V16QI 2 "=v"))]
>    "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>    "#"
>    "&& 1"
>    [(parallel [(set (match_dup 0)
> 		   (neg:IEEE128 (match_dup 1)))
> 	      (use (match_dup 2))])]
> {
>    if (GET_CODE (operands[2]) == SCRATCH)
>      operands[2] = gen_reg_rtx (V16QImode);
> 
>    emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
> }
>    [(set_attr "length" "8")
>     (set_attr "type" "vecsimple")])
> 
> Before RA, this is just:
> 
>    (set ...)
>    (clobber (scratch:V16QI))
> 
> and the split creates a new register.  After RA, operand 2 provides
> the required temporary register:
> 
>    (set ...)
>    (clobber (reg:V16QI TMP))
> 
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.
> 
> Thanks,
> Richard
> 

Ah, I see; I think I now understand the standard idiom for keeping the
define_insn and split conditions consistent across RA.  However, as I
wrote before, it is true that the split passes are useful for other
purposes as well.

If relying on them in that way is unacceptable practice, then an
alternative solution would be to add a target-specific pass (in my
case, after postreload and before rtl-late_combine2).

That is obviously a very involved process, though.
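
For what it's worth, I'd expect the mechanical side of that to follow
the existing <target>-passes.def convention, something like this
(hypothetical pass name, sketch only, assuming a new
config/xtensa/xtensa-passes.def):

    ;; Insert a target-specific split pass after postreload.
    INSERT_PASS_AFTER (pass_postreload_cse, 1, pass_xtensa_presplit);

with the pass class itself implemented in the target code.  Writing
and maintaining that pass would be the involved part.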

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-22 16:49     ` Richard Sandiford
  2024-06-23  4:40       ` Takayuki 'January June' Suwa
@ 2024-06-23  9:34       ` Richard Biener
  2024-06-24  8:03         ` Richard Sandiford
  1 sibling, 1 reply; 36+ messages in thread
From: Richard Biener @ 2024-06-23  9:34 UTC (permalink / raw)
  To: Takayuki 'January June' Suwa, gcc-patches, richard.sandiford

On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
> > On 2024/06/20 22:34, Richard Sandiford wrote:
> >> This patch adds a combine pass that runs late in the pipeline.
> >> There are two instances: one between combine and split1, and one
> >> after postreload.
> >>
> >> The pass currently has a single objective: remove definitions by
> >> substituting into all uses.  The pre-RA version tries to restrict
> >> itself to cases that are likely to have a neutral or beneficial
> >> effect on register pressure.
> >>
> >> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> >> in the aarch64 test results, mostly due to making proper use of
> >> MOVPRFX in cases where we didn't previously.
> >>
> >> This is just a first step.  I'm hoping that the pass could be
> >> used for other combine-related optimisations in future.  In particular,
> >> the post-RA version doesn't need to restrict itself to cases where all
> >> uses are substitutable, since it doesn't have to worry about register
> >> pressure.  If we did that, and if we extended it to handle multi-register
> >> REGs, the pass might be a viable replacement for regcprop, which in
> >> turn might reduce the cost of having a post-RA instance of the new pass.
> >>
> >> On most targets, the pass is enabled by default at -O2 and above.
> >> However, it has a tendency to undo x86's STV and RPAD passes,
> >> by folding the more complex post-STV/RPAD form back into the
> >> simpler pre-pass form.
> >>
> >> Also, running a pass after register allocation means that we can
> >> now match define_insn_and_splits that were previously only matched
> >> before register allocation.  This trips things like:
> >>
> >>    (define_insn_and_split "..."
> >>      [...pattern...]
> >>      "...cond..."
> >>      "#"
> >>      "&& 1"
> >>      [...pattern...]
> >>      {
> >>        ...unconditional use of gen_reg_rtx ()...;
> >>      }
> >>
> >> because matching and splitting after RA will call gen_reg_rtx when
> >> pseudos are no longer allowed.  rs6000 has several instances of this.
> >
> > xtensa also has something like that.
> >
> >> xtensa has a variation in which the split condition is:
> >>
> >>      "&& can_create_pseudo_p ()"
> >>
> >> The failure then is that, if we match after RA, we'll never be
> >> able to split the instruction.
> >
> > To be honest, I'm confused by the possibility of adding a split pattern
> > application opportunity that depends on the optimization options after
> > Rel... ah, LRA and before the existing rtl-split2.
> >
> > Because I just recently submitted a patch that I expected would reliably
> > (i.e. regardless of optimization options, etc.) apply the split pattern
> > first in the rtl-split2 pass after RA, and it was merged.
> >
> >>
> >> The patch therefore disables the pass by default on i386, rs6000
> >> and xtensa.  Hopefully we can fix those ports later (if their
> >> maintainers want).  It seems easier to add the pass first, though,
> >> to make it easier to test any such fixes.
> >>
> >> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> >> quite a few updates for the late-combine output.  That might be
> >> worth doing, but it seems too complex to do as part of this patch.
> >>
> >> I tried compiling at least one target per CPU directory and comparing
> >> the assembly output for parts of the GCC testsuite.  This is just a way
> >> of getting a flavour of how the pass performs; it obviously isn't a
> >> meaningful benchmark.  All targets seemed to improve on average:
> >>
> >> Target                 Tests   Good    Bad   %Good   Delta  Median
> >> ======                 =====   ====    ===   =====   =====  ======
> >> aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
> >> aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
> >> alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
> >> amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
> >> arc-elf                 2166   1932    234  89.20%  -37742      -1
> >> arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
> >> arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
> >> avr-elf                 4789   4330    459  90.42% -441276      -4
> >> bfin-elf                2795   2394    401  85.65%  -19252      -1
> >> bpf-elf                 3122   2928    194  93.79%   -8785      -1
> >> c6x-elf                 2227   1929    298  86.62%  -17339      -1
> >> cris-elf                3464   3270    194  94.40%  -23263      -2
> >> csky-elf                2915   2591    324  88.89%  -22146      -1
> >> epiphany-elf            2399   2304     95  96.04%  -28698      -2
> >> fr30-elf                7712   7299    413  94.64%  -99830      -2
> >> frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
> >> ft32-elf                2775   2667    108  96.11%  -25029      -1
> >> h8300-elf               3176   2862    314  90.11%  -29305      -2
> >> hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
> >> ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
> >> iq2000-elf              9684   9637     47  99.51% -126557      -2
> >> lm32-elf                2681   2608     73  97.28%  -59884      -3
> >> loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
> >> m32r-elf                1626   1517    109  93.30%   -9323      -2
> >> m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
> >> mcore-elf               2315   2085    230  90.06%  -24160      -1
> >> microblaze-elf          2782   2585    197  92.92%  -16530      -1
> >> mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
> >> mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
> >> mmix                    4914   4814    100  97.96%  -63021      -1
> >> mn10300-elf             3639   3320    319  91.23%  -34752      -2
> >> moxie-rtems             3497   3252    245  92.99%  -87305      -3
> >> msp430-elf              4353   3876    477  89.04%  -23780      -1
> >> nds32le-elf             3042   2780    262  91.39%  -27320      -1
> >> nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
> >> nvptx-none              2114   1781    333  84.25%  -12589      -2
> >> or1k-elf                3045   2699    346  88.64%  -14328      -2
> >> pdp11                   4515   4146    369  91.83%  -26047      -2
> >> pru-elf                 1585   1245    340  78.55%   -5225      -1
> >> riscv32-elf             2122   2000    122  94.25% -101162      -2
> >> riscv64-elf             1841   1726    115  93.75%  -49997      -2
> >> rl78-elf                2823   2530    293  89.62%  -40742      -4
> >> rx-elf                  2614   2480    134  94.87%  -18863      -1
> >> s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
> >> s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
> >> sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
> >> sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
> >> sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
> >> sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
> >> v850-elf                1897   1728    169  91.09%  -11078      -1
> >> vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
> >> visium-elf              1392   1106    286  79.45%   -7984      -2
> >> xstormy16-elf           2577   2071    506  80.36%  -13061      -1
> >>
> >> ** snip **
> >
> > To be more frank, once a split pattern is defined, it is applied by
> > the existing five split paths and possibly by combiners. In most cases,
> > it is enough to apply it in one of these places, or that is what the
> > pattern creator intended.
> >
> > Wouldn't applying the split pattern indiscriminately in various places
> > be a waste of execution resources and bring about unexpected and undesired
> > results?
> >
> > I think we need some way to properly control the application of the split
> > pattern, perhaps some predicate function.
>
> The problem is more the define_insn part of the define_insn_and_split,
> rather than the define_split part.  The number and location of the split
> passes is the same: anything matched by rtl-late_combine1 will be split by
> rtl-split1 and anything matched by rtl-late_combine2 will be split by
> rtl-split2.  (If the split condition allows it, of course.)
>
> But more things can be matched by rtl-late_combine2 than are matched by
> other post-RA passes like rtl-postreload.  And that's what causes the
> issue.  If:
>
>     (define_insn_and_split "..."
>       [...pattern...]
>       "...cond..."
>       "#"
>       "&& 1"
>       [...pattern...]
>       {
>         ...unconditional use of gen_reg_rtx ()...;
>       }
>
> is matched by rtl-late_combine2, the split will be done by rtl-split2.
> But the split will ICE, because it isn't valid to call gen_reg_rtx after
> register allocation.
>
> Similarly, if:
>
>     (define_insn_and_split "..."
>       [...pattern...]
>       "...cond..."
>       "#"
>       "&& can_create_pseudo_p ()"
>       [...pattern...]
>       {
>         ...unconditional use of gen_reg_rtx ()...;
>       }
>
> is matched by rtl-late_combine2, the can_create_pseudo_p condition will
> be false in rtl-split2, and in all subsequent split passes.  So we'll
> still have the unsplit instruction during final, which will ICE because
> it doesn't have a valid means of implementing the "#".
>
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using match_scratches.
> rs6000 has many examples of this.  E.g.:
>
> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
>         (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>    (clobber (match_scratch:V16QI 2 "=v"))]
>   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>   "#"
>   "&& 1"
>   [(parallel [(set (match_dup 0)
>                    (neg:IEEE128 (match_dup 1)))
>               (use (match_dup 2))])]
> {
>   if (GET_CODE (operands[2]) == SCRATCH)
>     operands[2] = gen_reg_rtx (V16QImode);
>
>   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
> }
>   [(set_attr "length" "8")
>    (set_attr "type" "vecsimple")])
>
> Before RA, this is just:
>
>   (set ...)
>   (clobber (scratch:V16QI))
>
> and the split creates a new register.  After RA, operand 2 provides
> the required temporary register:
>
>   (set ...)
>   (clobber (reg:V16QI TMP))
>
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.

What about splitting immediately in late-combine?  Wouldn't that possibly
allow more combinations to immediately happen?

Richard.

> Thanks,
> Richard
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-23  9:34       ` Richard Biener
@ 2024-06-24  8:03         ` Richard Sandiford
  2024-06-24 11:22           ` Richard Biener
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-24  8:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Takayuki 'January June' Suwa, gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
>> The traditional (and IMO correct) way to handle this is to make the
>> pattern reserve the temporary registers that it needs, using match_scratches.
>> rs6000 has many examples of this.  E.g.:
>>
>> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>>   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
>>         (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>>    (clobber (match_scratch:V16QI 2 "=v"))]
>>   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>>   "#"
>>   "&& 1"
>>   [(parallel [(set (match_dup 0)
>>                    (neg:IEEE128 (match_dup 1)))
>>               (use (match_dup 2))])]
>> {
>>   if (GET_CODE (operands[2]) == SCRATCH)
>>     operands[2] = gen_reg_rtx (V16QImode);
>>
>>   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
>> }
>>   [(set_attr "length" "8")
>>    (set_attr "type" "vecsimple")])
>>
>> Before RA, this is just:
>>
>>   (set ...)
>>   (clobber (scratch:V16QI))
>>
>> and the split creates a new register.  After RA, operand 2 provides
>> the required temporary register:
>>
>>   (set ...)
>>   (clobber (reg:V16QI TMP))
>>
>> Another approach is to add can_create_pseudo_p () to the define_insn
>> condition (rather than the split condition).  But IMO that's an ICE
>> trap, since insns that have already been matched & accepted shouldn't
>> suddenly become invalid if recog is reattempted later.
>
> What about splitting immediately in late-combine?  Wouldn't that possibly
> allow more combinations to immediately happen?

It would be difficult to guarantee termination.  Often the split
instructions can be immediately recombined back to the original
instruction.  Even if we guard against that happening directly,
it'd be difficult to prove that it can't happen indirectly.

We might also run into issues like PR101523.

Combine uses define_splits (without define_insns) for 3->2 combinations,
but the current late-combine optimisation is kind-of 1/N+1->1 x N.

Personally, I think we should allow targets to use the .md file to
define match.pd-style simplification rules involving unspecs, but there
were objections to that when I last suggested it.

Thanks,
Richard


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-24  8:03         ` Richard Sandiford
@ 2024-06-24 11:22           ` Richard Biener
  2024-06-24 11:34             ` Richard Sandiford
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Biener @ 2024-06-24 11:22 UTC (permalink / raw)
  To: Richard Biener, Takayuki 'January June' Suwa,
	gcc-patches, richard.sandiford

On Mon, Jun 24, 2024 at 10:03 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
> >> The traditional (and IMO correct) way to handle this is to make the
> >> pattern reserve the temporary registers that it needs, using match_scratches.
> >> rs6000 has many examples of this.  E.g.:
> >>
> >> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
> >>   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
> >>         (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
> >>    (clobber (match_scratch:V16QI 2 "=v"))]
> >>   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
> >>   "#"
> >>   "&& 1"
> >>   [(parallel [(set (match_dup 0)
> >>                    (neg:IEEE128 (match_dup 1)))
> >>               (use (match_dup 2))])]
> >> {
> >>   if (GET_CODE (operands[2]) == SCRATCH)
> >>     operands[2] = gen_reg_rtx (V16QImode);
> >>
> >>   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
> >> }
> >>   [(set_attr "length" "8")
> >>    (set_attr "type" "vecsimple")])
> >>
> >> Before RA, this is just:
> >>
> >>   (set ...)
> >>   (clobber (scratch:V16QI))
> >>
> >> and the split creates a new register.  After RA, operand 2 provides
> >> the required temporary register:
> >>
> >>   (set ...)
> >>   (clobber (reg:V16QI TMP))
> >>
> >> Another approach is to add can_create_pseudo_p () to the define_insn
> >> condition (rather than the split condition).  But IMO that's an ICE
> >> trap, since insns that have already been matched & accepted shouldn't
> >> suddenly become invalid if recog is reattempted later.
> >
> > What about splitting immediately in late-combine?  Wouldn't that possibly
> > allow more combinations to immediately happen?
>
> It would be difficult to guarantee termination.  Often the split
> instructions can be immediately recombined back to the original
> instruction.  Even if we guard against that happening directly,
> it'd be difficult to prove that it can't happen indirectly.
>
> We might also run into issues like PR101523.
>
> Combine uses define_splits (without define_insns) for 3->2 combinations,
> but the current late-combine optimisation is kind-of 1/N+1->1 x N.
>
> Personally, I think we should allow targets to use the .md file to
> define match.pd-style simplification rules involving unspecs, but there
> were objections to that when I last suggested it.

Isn't that what basically "combine-helper" patterns do to some extent?

Richard.

>
> Thanks,
> Richard


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-24 11:22           ` Richard Biener
@ 2024-06-24 11:34             ` Richard Sandiford
  2024-06-24 12:18               ` Richard Biener
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-24 11:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: Takayuki 'January June' Suwa, gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Jun 24, 2024 at 10:03 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
>> >> The traditional (and IMO correct) way to handle this is to make the
>> >> pattern reserve the temporary registers that it needs, using match_scratches.
>> >> rs6000 has many examples of this.  E.g.:
>> >>
>> >> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>> >>   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
>> >>         (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>> >>    (clobber (match_scratch:V16QI 2 "=v"))]
>> >>   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>> >>   "#"
>> >>   "&& 1"
>> >>   [(parallel [(set (match_dup 0)
>> >>                    (neg:IEEE128 (match_dup 1)))
>> >>               (use (match_dup 2))])]
>> >> {
>> >>   if (GET_CODE (operands[2]) == SCRATCH)
>> >>     operands[2] = gen_reg_rtx (V16QImode);
>> >>
>> >>   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
>> >> }
>> >>   [(set_attr "length" "8")
>> >>    (set_attr "type" "vecsimple")])
>> >>
>> >> Before RA, this is just:
>> >>
>> >>   (set ...)
>> >>   (clobber (scratch:V16QI))
>> >>
>> >> and the split creates a new register.  After RA, operand 2 provides
>> >> the required temporary register:
>> >>
>> >>   (set ...)
>> >>   (clobber (reg:V16QI TMP))
>> >>
>> >> Another approach is to add can_create_pseudo_p () to the define_insn
>> >> condition (rather than the split condition).  But IMO that's an ICE
>> >> trap, since insns that have already been matched & accepted shouldn't
>> >> suddenly become invalid if recog is reattempted later.
>> >
>> > What about splitting immediately in late-combine?  Wouldn't that possibly
>> > allow more combinations to immediately happen?
>>
>> It would be difficult to guarantee termination.  Often the split
>> instructions can be immediately recombined back to the original
>> instruction.  Even if we guard against that happening directly,
>> it'd be difficult to prove that it can't happen indirectly.
>>
>> We might also run into issues like PR101523.
>>
>> Combine uses define_splits (without define_insns) for 3->2 combinations,
>> but the current late-combine optimisation is kind-of 1/N+1->1 x N.
>>
>> Personally, I think we should allow targets to use the .md file to
>> define match.pd-style simplification rules involving unspecs, but there
>> were objections to that when I last suggested it.
>
> Isn't that what basically "combine-helper" patterns do to some extent?

Partly, but:

(1) It's a big hammer.  It means we add all the overhead of a define_insn
    for something that is only meant to survive between one pass and the next.

(2) Unlike match.pd, it isn't designed to be applied iteratively.
    There is no attempt even in theory to ensure that match helper
    -> split -> match helper -> split -> ... would terminate.

(3) It operates at the level of complete instructions, including e.g.
    destinations of sets.  The kind of rule I had in mind would be aimed
    at arithmetic simplification, and would operate at the simplify-rtx.cc
    level.

    That is, if simplify_foo failed to apply a target-independent rule,
    it could fall back on an automatically generated target-specific rule,
    with the requirement/understanding that these rules really should be
    target-specific.  One easy way of enforcing that is to say that
    at least one side of a production rule must involve an unspec.
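For illustration, a rule of the kind described in (3) might look something
like this (entirely hypothetical syntax: neither a "define_simplify"
construct nor UNSPEC_BITREV exists; this is only a sketch of the shape
such a target-specific, match.pd-style rule could take):

    ;; Hypothetical: fold a target-specific bit-reverse unspec applied
    ;; twice back to the original operand, at the simplify-rtx.cc level.
    (define_simplify
      [(unspec:SI [(unspec:SI [(match_operand:SI 0 "register_operand")]
                              UNSPEC_BITREV)]
                  UNSPEC_BITREV)]
      ""
      [(match_dup 0)])

Because UNSPEC_BITREV would be target-defined, a rule like this is
necessarily target-specific, which is what the "at least one side must
involve an unspec" restriction would enforce.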

Richard




* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-24 11:34             ` Richard Sandiford
@ 2024-06-24 12:18               ` Richard Biener
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Biener @ 2024-06-24 12:18 UTC (permalink / raw)
  To: Richard Biener, Takayuki 'January June' Suwa,
	gcc-patches, richard.sandiford

On Mon, Jun 24, 2024 at 1:34 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Mon, Jun 24, 2024 at 10:03 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
> >> >> The traditional (and IMO correct) way to handle this is to make the
> >> >> pattern reserve the temporary registers that it needs, using match_scratches.
> >> >> rs6000 has many examples of this.  E.g.:
> >> >>
> >> >> (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
> >> >>   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
> >> >>         (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
> >> >>    (clobber (match_scratch:V16QI 2 "=v"))]
> >> >>   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
> >> >>   "#"
> >> >>   "&& 1"
> >> >>   [(parallel [(set (match_dup 0)
> >> >>                    (neg:IEEE128 (match_dup 1)))
> >> >>               (use (match_dup 2))])]
> >> >> {
> >> >>   if (GET_CODE (operands[2]) == SCRATCH)
> >> >>     operands[2] = gen_reg_rtx (V16QImode);
> >> >>
> >> >>   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
> >> >> }
> >> >>   [(set_attr "length" "8")
> >> >>    (set_attr "type" "vecsimple")])
> >> >>
> >> >> Before RA, this is just:
> >> >>
> >> >>   (set ...)
> >> >>   (clobber (scratch:V16QI))
> >> >>
> >> >> and the split creates a new register.  After RA, operand 2 provides
> >> >> the required temporary register:
> >> >>
> >> >>   (set ...)
> >> >>   (clobber (reg:V16QI TMP))
> >> >>
> >> >> Another approach is to add can_create_pseudo_p () to the define_insn
> >> >> condition (rather than the split condition).  But IMO that's an ICE
> >> >> trap, since insns that have already been matched & accepted shouldn't
> >> >> suddenly become invalid if recog is reattempted later.
> >> >
> >> > What about splitting immediately in late-combine?  Wouldn't that possibly
> >> > allow more combinations to immediately happen?
> >>
> >> It would be difficult to guarantee termination.  Often the split
> >> instructions can be immediately recombined back to the original
> >> instruction.  Even if we guard against that happening directly,
> >> it'd be difficult to prove that it can't happen indirectly.
> >>
> >> We might also run into issues like PR101523.
> >>
> >> Combine uses define_splits (without define_insns) for 3->2 combinations,
> >> but the current late-combine optimisation is kind-of 1/N+1->1 x N.
> >>
> >> Personally, I think we should allow targets to use the .md file to
> >> define match.pd-style simplification rules involving unspecs, but there
> >> were objections to that when I last suggested it.
> >
> > Isn't that what basically "combine-helper" patterns do to some extent?
>
> Partly, but:
>
> (1) It's a big hammer.  It means we add all the overhead of a define_insn
>     for something that is only meant to survive between one pass and the next.
>
> (2) Unlike match.pd, it isn't designed to be applied iteratively.
>     There is no attempt even in theory to ensure that match helper
>     -> split -> match helper -> split -> ... would terminate.
>
> (3) It operates at the level of complete instructions, including e.g.
>     destinations of sets.  The kind of rule I had in mind would be aimed
>     at arithmetic simplification, and would operate at the simplify-rtx.cc
>     level.
>
>     That is, if simplify_foo failed to apply a target-independent rule,
>     it could fall back on an automatically generated target-specific rule,
>     with the requirement/understanding that these rules really should be
>     target-specific.  One easy way of enforcing that is to say that
>     at least one side of a production rule must involve an unspec.

OK, that makes sense.  I did think of having something like match.pd
generate simplify-rtx.cc.  It probably has different constraints so that
simply translating tree codes to rtx codes and re-using match.pd patterns
isn't going to work well.

Richard.

> Richard
>
>


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
                     ` (3 preceding siblings ...)
  2024-06-22  5:12   ` Takayuki 'January June' Suwa
@ 2024-06-25  9:02   ` Thomas Schwinge
  2024-06-25  9:07     ` Richard Sandiford
  4 siblings, 1 reply; 36+ messages in thread
From: Thomas Schwinge @ 2024-06-25  9:02 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: jlaw, gcc-patches, sjames, seurer

Hi!

On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> This patch adds a combine pass that runs late in the pipeline.
> [...]

Nice!

> The patch [...] disables the pass by default on i386, rs6000
> and xtensa.

Like here:

> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>  	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>      }
>  
> +  /* Late combine tends to undo some of the effects of STV and RPAD,
> +     by combining instructions back to their original form.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }

..., I think also here:

> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>  	targetm.expand_builtin_va_start = NULL;
>      }
>  
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
> +     until the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
> +
>    rs6000_override_options_after_change ();

..., this needs to be done in 'rs6000_override_options_after_change'
instead of 'rs6000_option_override_internal', to address the PRs under
discussion.  I'm testing such a patch.


Regards
 Thomas


* Re: [PATCH 6/6] Add a late-combine pass [PR106594]
  2024-06-25  9:02   ` Thomas Schwinge
@ 2024-06-25  9:07     ` Richard Sandiford
  2024-06-25  9:23       ` rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] (was: [PATCH 6/6] Add a late-combine pass [PR106594]) Thomas Schwinge
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Sandiford @ 2024-06-25  9:07 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: jlaw, gcc-patches, sjames, seurer

Thomas Schwinge <tschwinge@baylibre.com> writes:
> Hi!
>
> On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> [...]
>
> Nice!
>
>> The patch [...] disables the pass by default on i386, rs6000
>> and xtensa.
>
> Like here:
>
>> --- a/gcc/config/i386/i386-options.cc
>> +++ b/gcc/config/i386/i386-options.cc
>> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>>  	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>>      }
>>  
>> +  /* Late combine tends to undo some of the effects of STV and RPAD,
>> +     by combining instructions back to their original form.  */
>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>> +    flag_late_combine_instructions = 0;
>>  }
>
> ..., I think also here:
>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>>  	targetm.expand_builtin_va_start = NULL;
>>      }
>>  
>> +  /* One of the late-combine passes runs after register allocation
>> +     and can match define_insn_and_splits that were previously used
>> +     only before register allocation.  Some of those define_insn_and_splits
>> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
>> +     until the define_insn_and_splits are fixed.  */
>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>> +    flag_late_combine_instructions = 0;
>> +
>>    rs6000_override_options_after_change ();
>
> ..., this needs to be done in 'rs6000_override_options_after_change'
> instead of 'rs6000_option_override_internal', to address the PRs under
> discussion.  I'm testing such a patch.

Oops!  Sorry about that, and thanks for tracking it down.

Richard


* rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] (was: [PATCH 6/6] Add a late-combine pass [PR106594])
  2024-06-25  9:07     ` Richard Sandiford
@ 2024-06-25  9:23       ` Thomas Schwinge
  2024-06-25  9:28         ` rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] Richard Sandiford
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Schwinge @ 2024-06-25  9:23 UTC (permalink / raw)
  To: Richard Sandiford, gcc-patches; +Cc: jlaw, sjames, seurer

[-- Attachment #1: Type: text/plain, Size: 2136 bytes --]

Hi!

On 2024-06-25T10:07:47+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> Thomas Schwinge <tschwinge@baylibre.com> writes:
>> On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>> This patch adds a combine pass that runs late in the pipeline.
>>> [...]
>>
>> Nice!
>>
>>> The patch [...] disables the pass by default on i386, rs6000
>>> and xtensa.
>>
>> Like here:
>>
>>> --- a/gcc/config/i386/i386-options.cc
>>> +++ b/gcc/config/i386/i386-options.cc
>>> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>>>  	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>>>      }
>>>  
>>> +  /* Late combine tends to undo some of the effects of STV and RPAD,
>>> +     by combining instructions back to their original form.  */
>>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>>> +    flag_late_combine_instructions = 0;
>>>  }
>>
>> ..., I think also here:
>>
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.cc
>>> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>>>  	targetm.expand_builtin_va_start = NULL;
>>>      }
>>>  
>>> +  /* One of the late-combine passes runs after register allocation
>>> +     and can match define_insn_and_splits that were previously used
>>> +     only before register allocation.  Some of those define_insn_and_splits
>>> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
>>> +     until the define_insn_and_splits are fixed.  */
>>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>>> +    flag_late_combine_instructions = 0;
>>> +
>>>    rs6000_override_options_after_change ();
>>
>> ..., this needs to be done in 'rs6000_override_options_after_change'
>> instead of 'rs6000_option_override_internal', to address the PRs under
>> discussion.  I'm testing such a patch.
>
> Oops!  Sorry about that, and thanks for tracking it down.

No worries.  ;-) OK to push the attached
"rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633]"?


Regards
 Thomas



[-- Attachment #2: 0001-rs6000-Properly-default-disable-late-combine-passes-.patch --]
[-- Type: text/x-diff, Size: 2227 bytes --]

From ccd12107fb06017f878384d2186ed5f01a1dab79 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Tue, 25 Jun 2024 10:55:41 +0200
Subject: [PATCH] rs6000: Properly default-disable late-combine passes
 [PR106594, PR115622, PR115633]

..., so that it also works for '__attribute__ ((optimize("[...]")))' etc.

	PR target/106594
	PR target/115622
	PR target/115633
	gcc/
	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Move
	default-disable of late-combine passes from here...
	(rs6000_override_options_after_change): ... to here.
---
 gcc/config/rs6000/rs6000.cc | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index f39b8909925..713fac75f26 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -3431,6 +3431,14 @@ rs6000_override_options_after_change (void)
   /* If we are inserting ROP-protect instructions, disable shrink wrap.  */
   if (rs6000_rop_protect)
     flag_shrink_wrap = 0;
+
+  /* One of the late-combine passes runs after register allocation
+     and can match define_insn_and_splits that were previously used
+     only before register allocation.  Some of those define_insn_and_splits
+     use gen_reg_rtx unconditionally.  Disable late-combine by default
+     until the define_insn_and_splits are fixed.  */
+  if (!OPTION_SET_P (flag_late_combine_instructions))
+    flag_late_combine_instructions = 0;
 }
 
 #ifdef TARGET_USES_LINUX64_OPT
@@ -4768,14 +4776,6 @@ rs6000_option_override_internal (bool global_init_p)
 	targetm.expand_builtin_va_start = NULL;
     }
 
-  /* One of the late-combine passes runs after register allocation
-     and can match define_insn_and_splits that were previously used
-     only before register allocation.  Some of those define_insn_and_splits
-     use gen_reg_rtx unconditionally.  Disable late-combine by default
-     until the define_insn_and_splits are fixed.  */
-  if (!OPTION_SET_P (flag_late_combine_instructions))
-    flag_late_combine_instructions = 0;
-
   rs6000_override_options_after_change ();
 
   /* If not explicitly specified via option, decide whether to generate indexed
-- 
2.34.1



* Re: rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633]
  2024-06-25  9:23       ` rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] (was: [PATCH 6/6] Add a late-combine pass [PR106594]) Thomas Schwinge
@ 2024-06-25  9:28         ` Richard Sandiford
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Sandiford @ 2024-06-25  9:28 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-patches, jlaw, sjames, seurer

Thomas Schwinge <tschwinge@baylibre.com> writes:
> Hi!
>
> On 2024-06-25T10:07:47+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>> Thomas Schwinge <tschwinge@baylibre.com> writes:
>>> On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>> This patch adds a combine pass that runs late in the pipeline.
>>>> [...]
>>>
>>> Nice!
>>>
>>>> The patch [...] disables the pass by default on i386, rs6000
>>>> and xtensa.
>>>
>>> Like here:
>>>
>>>> --- a/gcc/config/i386/i386-options.cc
>>>> +++ b/gcc/config/i386/i386-options.cc
>>>> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>>>>  	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>>>>      }
>>>>  
>>>> +  /* Late combine tends to undo some of the effects of STV and RPAD,
>>>> +     by combining instructions back to their original form.  */
>>>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>>>> +    flag_late_combine_instructions = 0;
>>>>  }
>>>
>>> ..., I think also here:
>>>
>>>> --- a/gcc/config/rs6000/rs6000.cc
>>>> +++ b/gcc/config/rs6000/rs6000.cc
>>>> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>>>>  	targetm.expand_builtin_va_start = NULL;
>>>>      }
>>>>  
>>>> +  /* One of the late-combine passes runs after register allocation
>>>> +     and can match define_insn_and_splits that were previously used
>>>> +     only before register allocation.  Some of those define_insn_and_splits
>>>> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
>>>> +     until the define_insn_and_splits are fixed.  */
>>>> +  if (!OPTION_SET_P (flag_late_combine_instructions))
>>>> +    flag_late_combine_instructions = 0;
>>>> +
>>>>    rs6000_override_options_after_change ();
>>>
>>> ..., this needs to be done in 'rs6000_override_options_after_change'
>>> instead of 'rs6000_option_override_internal', to address the PRs under
>>> discussion.  I'm testing such a patch.
>>
>> Oops!  Sorry about that, and thanks for tracking it down.
>
> No worries.  ;-) OK to push the attached
> "rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633]"?

Yes, thanks.

Richard

> Regards
>  Thomas
>
>
> From ccd12107fb06017f878384d2186ed5f01a1dab79 Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge <tschwinge@baylibre.com>
> Date: Tue, 25 Jun 2024 10:55:41 +0200
> Subject: [PATCH] rs6000: Properly default-disable late-combine passes
>  [PR106594, PR115622, PR115633]
>
> ..., so that it also works for '__attribute__ ((optimize("[...]")))' etc.
>
> 	PR target/106594
> 	PR target/115622
> 	PR target/115633
> 	gcc/
> 	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Move
> 	default-disable of late-combine passes from here...
> 	(rs6000_override_options_after_change): ... to here.
> ---
>  gcc/config/rs6000/rs6000.cc | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index f39b8909925..713fac75f26 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -3431,6 +3431,14 @@ rs6000_override_options_after_change (void)
>    /* If we are inserting ROP-protect instructions, disable shrink wrap.  */
>    if (rs6000_rop_protect)
>      flag_shrink_wrap = 0;
> +
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
> +     until the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }
>  
>  #ifdef TARGET_USES_LINUX64_OPT
> @@ -4768,14 +4776,6 @@ rs6000_option_override_internal (bool global_init_p)
>  	targetm.expand_builtin_va_start = NULL;
>      }
>  
> -  /* One of the late-combine passes runs after register allocation
> -     and can match define_insn_and_splits that were previously used
> -     only before register allocation.  Some of those define_insn_and_splits
> -     use gen_reg_rtx unconditionally.  Disable late-combine by default
> -     until the define_insn_and_splits are fixed.  */
> -  if (!OPTION_SET_P (flag_late_combine_instructions))
> -    flag_late_combine_instructions = 0;
> -
>    rs6000_override_options_after_change ();
>  
>    /* If not explicitly specified via option, decide whether to generate indexed

^ permalink raw reply	[flat|nested] 36+ messages in thread

* LoongArch vs. [PATCH 0/6] Add a late-combine pass
  2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
                   ` (5 preceding siblings ...)
  2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
@ 2024-06-28 12:25 ` Xi Ruoyao
  2024-06-28 12:34   ` chenglulu
  6 siblings, 1 reply; 36+ messages in thread
From: Xi Ruoyao @ 2024-06-28 12:25 UTC (permalink / raw)
  To: Richard Sandiford, jlaw, gcc-patches; +Cc: chenglulu

Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

But we are using a FPR ($f0) here deliberately to work around an
architectural issue in LA464 causing a direct FCC-to-GPR move very slow.

Could you suggest how to fix this issue?

On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> This series is a resubmission of the late-combine work.  I've fixed
> some bugs that Jeff's cross-target CI found last time and some others
> that I hit since then.

/* snip */

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass
  2024-06-28 12:25 ` LoongArch vs. [PATCH 0/6] Add a late-combine pass Xi Ruoyao
@ 2024-06-28 12:34   ` chenglulu
  2024-06-28 12:35     ` Xi Ruoyao
  0 siblings, 1 reply; 36+ messages in thread
From: chenglulu @ 2024-06-28 12:34 UTC (permalink / raw)
  To: Xi Ruoyao, Richard Sandiford, jlaw, gcc-patches


在 2024/6/28 下午8:25, Xi Ruoyao 写道:
> Hi Richard,
>
> The late combine pass has triggered some FAILs on LoongArch and I'm
> investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:
>
> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> The late combine pass combines these to:
>
> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> But we are using a FPR ($f0) here deliberately to work around an
> architectural issue in LA464 causing a direct FCC-to-GPR move very slow.
>
> Could you suggest how to fix this issue?

Hi, Ruoyao:

We need to define TARGET_INSN_COST and set the cost of movcf2gr/movgr2cf.

I've fixed this and am doing correctness testing now.

>
> On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>> This series is a resubmission of the late-combine work.  I've fixed
>> some bugs that Jeff's cross-target CI found last time and some others
>> that I hit since then.
> /* snip */
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass
  2024-06-28 12:34   ` chenglulu
@ 2024-06-28 12:35     ` Xi Ruoyao
  2024-06-28 12:44       ` chenglulu
  0 siblings, 1 reply; 36+ messages in thread
From: Xi Ruoyao @ 2024-06-28 12:35 UTC (permalink / raw)
  To: chenglulu, Richard Sandiford, jlaw, gcc-patches

On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
> 
> 在 2024/6/28 下午8:25, Xi Ruoyao 写道:
> > Hi Richard,
> > 
> > The late combine pass has triggered some FAILs on LoongArch and I'm
> > investigating.  One of them is movcf2gr-via-fr.c.  In
> > 315r.postreload:
> > 
> > (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 32 $f0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > The late combine pass combines these to:
> > 
> > (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > But we are using a FPR ($f0) here deliberately to work around an
> > architectural issue in LA464 causing a direct FCC-to-GPR move very
> > slow.
> > 
> > Could you suggest how to fix this issue?
> 
> Hi, Ruoyao:
> 
> We need to define TARGET_INSN_COST and set the cost of
> movcf2gr/movgr2cf.
> 
> I've fixed this and am doing correctness testing now.

Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
realize.


-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass
  2024-06-28 12:35     ` Xi Ruoyao
@ 2024-06-28 12:44       ` chenglulu
  0 siblings, 0 replies; 36+ messages in thread
From: chenglulu @ 2024-06-28 12:44 UTC (permalink / raw)
  To: Xi Ruoyao, Richard Sandiford, jlaw, gcc-patches


在 2024/6/28 下午8:35, Xi Ruoyao 写道:
> On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
>> 在 2024/6/28 下午8:25, Xi Ruoyao 写道:
>>> Hi Richard,
>>>
>>> The late combine pass has triggered some FAILs on LoongArch and I'm
>>> investigating.  One of them is movcf2gr-via-fr.c.  In
>>> 315r.postreload:
>>>
>>> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 32 $f0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> The late combine pass combines these to:
>>>
>>> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> But we are using a FPR ($f0) here deliberately to work around an
>>> architectural issue in LA464 causing a direct FCC-to-GPR move very
>>> slow.
>>>
>>> Could you suggest how to fix this issue?
>> Hi, Ruoyao:
>>
>> We need to define TARGET_INSN_COST and set the cost of
>> movcf2gr/movgr2cf.
>>
>> I've fixed this and am doing correctness testing now.
> Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
> realize.
>
>
That's right.:-D


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2024-06-28 12:44 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-20 13:34 [PATCH 0/6] Add a late-combine pass Richard Sandiford
2024-06-20 13:34 ` [PATCH 1/6] rtl-ssa: Rework _ignoring interfaces Richard Sandiford
2024-06-20 21:22   ` Alex Coplan
2024-06-21  8:11     ` Richard Sandiford
2024-06-21 14:40   ` Jeff Law
2024-06-20 13:34 ` [PATCH 2/6] rtl-ssa: Don't cost no-op moves Richard Sandiford
2024-06-21 14:32   ` Jeff Law
2024-06-20 13:34 ` [PATCH 3/6] iq2000: Fix test and branch instructions Richard Sandiford
2024-06-21 14:33   ` Jeff Law
2024-06-20 13:34 ` [PATCH 4/6] sh: Make *minus_plus_one work after RA Richard Sandiford
2024-06-21  0:15   ` Oleg Endo
2024-06-20 13:34 ` [PATCH 5/6] xstormy16: Fix xs_hi_nonmemory_operand Richard Sandiford
2024-06-21 14:33   ` Jeff Law
2024-06-20 13:34 ` [PATCH 6/6] Add a late-combine pass [PR106594] Richard Sandiford
2024-06-21  0:17   ` Oleg Endo
2024-06-21  8:09     ` Richard Sandiford
2024-06-21  5:54   ` Richard Biener
2024-06-21  8:21     ` Richard Sandiford
2024-06-21  9:26       ` Richard Biener
2024-06-21 15:00   ` Jeff Law
2024-06-22  5:12   ` Takayuki 'January June' Suwa
2024-06-22 16:49     ` Richard Sandiford
2024-06-23  4:40       ` Takayuki 'January June' Suwa
2024-06-23  9:34       ` Richard Biener
2024-06-24  8:03         ` Richard Sandiford
2024-06-24 11:22           ` Richard Biener
2024-06-24 11:34             ` Richard Sandiford
2024-06-24 12:18               ` Richard Biener
2024-06-25  9:02   ` Thomas Schwinge
2024-06-25  9:07     ` Richard Sandiford
2024-06-25  9:23       ` rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] (was: [PATCH 6/6] Add a late-combine pass [PR106594]) Thomas Schwinge
2024-06-25  9:28         ` rs6000: Properly default-disable late-combine passes [PR106594, PR115622, PR115633] Richard Sandiford
2024-06-28 12:25 ` LoongArch vs. [PATCH 0/6] Add a late-combine pass Xi Ruoyao
2024-06-28 12:34   ` chenglulu
2024-06-28 12:35     ` Xi Ruoyao
2024-06-28 12:44       ` chenglulu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).