public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398]
@ 2025-04-01 15:34 Richard Sandiford
  2025-04-02 13:04 ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2025-04-01 15:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: jakub, jlaw, rguenther, segher

The problem in PR101523 was that, after each successful 2->2
combination attempt, distribute_links would search further and
further for the next combinable use of the i2 destination.
The original patch for the PR dealt with that by disallowing such
combinations.  However, that led to various optimisation regressions,
so there was interest in allowing the combinations again, at least
until an alternative way of getting the same results is in place.

This patch does that, but limits distribute_links to a certain number
of instructions when i2 is unchanged.  As Segher said in the PR trail,
it would make more conceptual sense to apply the limit unconditionally,
but I thought it would be better to change as little as possible at
this development stage.  Logically, in stage 1, the --param should
be applied directly by distribute_links with no input from callers.

As I mentioned in:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28

I think it's safe to drop log links even if a use exists.  All
processing of log links seems to handle the absence of a link
for a particular register in a conservative way.

The initial set-up errs on the side of dropping links, since for example
create_log_links has:

             /* flow.c claimed:

                 We don't build a LOG_LINK for hard registers contained
                 in ASM_OPERANDs.  If these registers get replaced,
                 we might wind up changing the semantics of the insn,
                 even if reload can make what appear to be valid
                 assignments later.  */
              if (regno < FIRST_PSEUDO_REGISTER
                  && asm_noperands (PATTERN (use_insn)) >= 0)
                continue;

which excludes combinations by dropping log links, rather than during
try_combine.  And:

      /* If this register is being initialized using itself, and the
         register is uninitialized in this basic block, and there are
         no LOG_LINKS which set the register, then part of the
         register is uninitialized.  In that case we can't assume
         anything about the number of nonzero bits.

         ??? We could do better if we checked this in
         reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
         could avoid making assumptions about the insn which initially
         sets the register, while still using the information in other
         insns.  We would have to be careful to check every insn
         involved in the combination.  */

      if (insn
          && reg_referenced_p (x, PATTERN (insn))
          && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
                               REGNO (x)))
        {
          struct insn_link *link;

          FOR_EACH_LOG_LINK (link, insn)
            if (dead_or_set_p (link->insn, x))
              break;
          if (!link)
            {
              rsp->nonzero_bits = GET_MODE_MASK (mode);
              rsp->sign_bit_copies = 1;
              return;
            }
        }

treats the lack of a log link is a possible sign of uninitialised data,
but that would be a missed optimisation rather than a correctness issue.

One question is what the default --param value should be.  I went with
Jakub's suggestion of 3000 from:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25

Also, to answer Jakub's question in that comment, I tried bisecting:

  int limit = atoi (getenv ("BISECT"));

(so applying the limit for all calls from try_combine) with an
abort in distribute_links if the limit caused a link to be skipped.
The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
was 142, so much lower than the parameter value.  I realised too late
that --enable-checking=release would probably have been a more
interesting test.

Some of the try_combine changes come from Richi's patch in:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53

Bootstrapped & regression-tested on aarch64-linux-gnu,
powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


gcc/
	PR testsuite/116398
	* params.opt (-param=max-combine-search-insns=): New param.
	* doc/invoke.texi: Document it.
	* combine.cc (distribute_links): Add a limit parameter.
	(try_combine): Use the new param to limit distribute_links
	when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.

gcc/testsuite/
	* gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
	* gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
	* gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
	* gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
	* gcc.target/aarch64/sve/var_stride_2.c: Likewise.
	* gcc.target/aarch64/sve/var_stride_4.c: Likewise.

Co-authored-by: Richard Biener <rguenther@suse.de>
---
 gcc/combine.cc                                | 30 +++++++++----------
 gcc/doc/invoke.texi                           |  9 ++++++
 gcc/params.opt                                |  4 +++
 .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
 .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
 gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
 .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
 .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
 .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
 .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
 10 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/gcc/combine.cc b/gcc/combine.cc
index ef13f5d5e90..28403d1f9e0 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
 static bool reg_bitfield_target_p (rtx, rtx);
 static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
 			      rtx, rtx, rtx);
-static void distribute_links (struct insn_link *);
+static void distribute_links (struct insn_link *, int limit = INT_MAX);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx_insn *, rtx);
 static bool unmentioned_reg_p (rtx, rtx);
@@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       adjust_for_new_dest (i3);
     }
 
-  /* If I2 didn't change, this is not a combination (but a simplification or
-     canonicalisation with context), which should not be done here.  Doing
-     it here explodes the algorithm.  Don't.  */
-  if (rtx_equal_p (newi2pat, PATTERN (i2)))
-    {
-      if (dump_file)
-	fprintf (dump_file, "i2 didn't change, not doing this\n");
-      undo_all ();
-      return 0;
-    }
+  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
-    distribute_links (i3links);
-    distribute_links (i2links);
-    distribute_links (i1links);
-    distribute_links (i0links);
+    int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
+    distribute_links (i3links, limit);
+    distribute_links (i2links, limit);
+    distribute_links (i1links, limit);
+    distribute_links (i0links, limit);
 
     if (REG_P (i2dest))
       {
@@ -4785,6 +4777,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   combine_successes++;
   undo_commit ();
 
+  if (newi2pat && i2_unchanged)
+    return i3;
+
   rtx_insn *ret = newi2pat ? i2 : i3;
   if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
     ret = added_links_insn;
@@ -14987,7 +14982,7 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
    pointing at I3 when I3's destination is changed.  */
 
 static void
-distribute_links (struct insn_link *links)
+distribute_links (struct insn_link *links, int limit)
 {
   struct insn_link *link, *next_link;
 
@@ -15053,6 +15048,7 @@ distribute_links (struct insn_link *links)
 	 I3 to I2.  Also note that not much searching is typically done here
 	 since most links don't point very far away.  */
 
+      int count = 0;
       for (insn = NEXT_INSN (link->insn);
 	   (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
 		     || BB_HEAD (this_basic_block->next_bb) != insn));
@@ -15073,6 +15069,8 @@ distribute_links (struct insn_link *links)
 	  }
 	else if (INSN_P (insn) && reg_set_p (reg, insn))
 	  break;
+	else if (count++ == limit)
+	  break;
 
       /* If we found a place to put the link, place it there unless there
 	 is already a link to the same insn as LINK at that point.  */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0da3f296beb..4c1f903d370 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16467,6 +16467,15 @@ in combiner for a pseudo register as last known value of that register.
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item max-combine-search-insns
+The maximum number of instructions that the RTL combiner searches in order
+to find the next use of a given register definition.  If this limit is reached
+without finding such a use, the combiner will stop trying to optimize the
+definition.
+
+Currently this limit only applies after certain successful combination
+attempts, but it could be extended to other cases in future.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
diff --git a/gcc/params.opt b/gcc/params.opt
index 4f4eb4d7a2a..422d082b355 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -477,6 +477,10 @@ The maximum number of instructions to consider to unroll in a loop on average.
 Common Joined UInteger Var(param_max_combine_insns) Init(4) IntegerRange(2, 4) Param Optimization
 The maximum number of insns combine tries to combine.
 
+-param=max-combine-search-insns=
+Common Joined UInteger Var(param_max_combine_search_insns) Init(3000) Param Optimization
+The maximum number of instructions that combine searches in order to find the next use of a particular register.
+
 -param=max-completely-peel-loop-nest-depth=
 Common Joined UInteger Var(param_max_unroll_iterations) Init(8) Param Optimization
 The maximum depth of a loop nest we completely peel.
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c b/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
index b4141da982c..843fdac9fd8 100644
--- a/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
@@ -8,7 +8,7 @@
 /*
 ** le32:
 **	sub	w([0-9]+), w0, #1
-**	tst	w0, w\1
+**	tst	(?:w0, w\1|w\1, w0)
 **	cset	w0, eq
 **	ret
 */
@@ -20,7 +20,7 @@ unsigned le32 (const unsigned int a) {
 /*
 ** gt32:
 **	sub	w([0-9]+), w0, #1
-**	tst	w0, w\1
+**	tst	(?:w0, w\1|w\1, w0)
 **	cset	w0, ne
 **	ret
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c b/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
index b811e6f6e8f..3b558e95d81 100644
--- a/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
@@ -8,7 +8,7 @@
 /*
 ** le16:
 **	sub	w([0-9]+), w0, #1
-**	and	w([0-9]+), w0, w\1
+**	and	w([0-9]+), (?:w0, w\1|w\1, w0)
 **	tst	w\2, 65535
 **	cset	w0, eq
 **	ret
@@ -21,7 +21,7 @@ unsigned le16 (const unsigned short a) {
 /*
 ** gt16:
 **	sub	w([0-9]+), w0, #1
-**	and	w([0-9]+), w0, w\1
+**	and	w([0-9]+), (?:w0, w\1|w\1, w0)
 **	tst	w\2, 65535
 **	cset	w0, ne
 **	ret
diff --git a/gcc/testsuite/gcc.target/aarch64/pr100056.c b/gcc/testsuite/gcc.target/aarch64/pr100056.c
index 0b77824da45..70499772d28 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr100056.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr100056.c
@@ -1,7 +1,9 @@
 /* PR target/100056 */
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
-/* { dg-final { scan-assembler-not {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} } } */
+/* { dg-final { scan-assembler-not {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} 2 } } */
+/* { dg-final { scan-assembler-times {\tadd\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\n} 2 } } */
 
 int
 or_shift_u8 (unsigned char i)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
index a7d2795ebe2..c9a8b82c48a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
@@ -19,6 +19,6 @@ void f10(double * restrict z, double * restrict w, double * restrict x, double *
     }
 }
 
-/* { dg-final { scan-assembler-not {\tbic\t} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-9]+/z, z[0-9]+\.d, #0} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
index 20cbd7550b7..1845bd3f0f7 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
@@ -8,6 +8,6 @@ void f13(double * restrict z, double * restrict w, double * restrict x, double *
     }
 }
 
-/* { dg-final { scan-assembler-not {\tbic\t} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tbic\t} } } */
+/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-9]+/z, z[0-9]+\.d, z[0-9]+\.d} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
index 33b9f0f197e..b8afea70207 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
@@ -16,7 +16,8 @@ f (TYPE *x, TYPE *y, unsigned short n, unsigned short m)
 /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
 /* Should multiply by (257-1)*4 rather than (VF-1)*4 or (VF-2)*4.  */
-/* { dg-final { scan-assembler-times {\tadd\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x2, 10, 16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x3, 10, 16\n} 1 } } */
 /* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
 /* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
 /* { dg-final { scan-assembler-not {\tcsel\tx[0-9]+} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
index 71b826a4c1b..d2e74f9d417 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
@@ -16,7 +16,8 @@ f (TYPE *x, TYPE *y, int n, int m)
 /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
 /* Should multiply by (257-1)*4 rather than (VF-1)*4.  */
-/* { dg-final { scan-assembler-times {\t(?:lsl\tx[0-9]+, x[0-9]+, 10|sbfiz\tx[0-9]+, x[0-9]+, 10, 32)\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tsbfiz\tx[0-9]+, x2, 10, 32\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tsbfiz\tx[0-9]+, x3, 10, 32\n} 1 } } */
 /* { dg-final { scan-assembler {\tcmp\tw2, 0} } } */
 /* { dg-final { scan-assembler {\tcmp\tw3, 0} } } */
 /* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 4 } } */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398]
  2025-04-01 15:34 [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398] Richard Sandiford
@ 2025-04-02 13:04 ` Richard Biener
  2025-04-02 13:33   ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Biener @ 2025-04-02 13:04 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, jakub, jlaw, segher

On Tue, 1 Apr 2025, Richard Sandiford wrote:

> The problem in PR101523 was that, after each successful 2->2
> combination attempt, distribute_links would search further and
> further for the next combinable use of the i2 destination.
> The original patch for the PR dealt with that by disallowing such
> combinations.  However, that led to various optimisation regressions,
> so there was interest in allowing the combinations again, at least
> until an alternative way of getting the same results is in place.
> 
> This patch does that, but limits distribute_links to a certain number
> of instructions when i2 is unchanged.  As Segher said in the PR trail,
> it would make more conceptual sense to apply the limit unconditionally,
> but I thought it would be better to change as little as possible at
> this development stage.  Logically, in stage 1, the --param should
> be applied directly by distribute_links with no input from callers.
> 
> As I mentioned in:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
> 
> I think it's safe to drop log links even if a use exists.  All
> processing of log links seems to handle the absence of a link
> for a particular register in a conservative way.
> 
> The initial set-up errs on the side of dropping links, since for example
> create_log_links has:
> 
>              /* flow.c claimed:
> 
>                  We don't build a LOG_LINK for hard registers contained
>                  in ASM_OPERANDs.  If these registers get replaced,
>                  we might wind up changing the semantics of the insn,
>                  even if reload can make what appear to be valid
>                  assignments later.  */
>               if (regno < FIRST_PSEUDO_REGISTER
>                   && asm_noperands (PATTERN (use_insn)) >= 0)
>                 continue;
> 
> which excludes combinations by dropping log links, rather than during
> try_combine.  And:
> 
>       /* If this register is being initialized using itself, and the
>          register is uninitialized in this basic block, and there are
>          no LOG_LINKS which set the register, then part of the
>          register is uninitialized.  In that case we can't assume
>          anything about the number of nonzero bits.
> 
>          ??? We could do better if we checked this in
>          reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>          could avoid making assumptions about the insn which initially
>          sets the register, while still using the information in other
>          insns.  We would have to be careful to check every insn
>          involved in the combination.  */
> 
>       if (insn
>           && reg_referenced_p (x, PATTERN (insn))
>           && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>                                REGNO (x)))
>         {
>           struct insn_link *link;
> 
>           FOR_EACH_LOG_LINK (link, insn)
>             if (dead_or_set_p (link->insn, x))
>               break;
>           if (!link)
>             {
>               rsp->nonzero_bits = GET_MODE_MASK (mode);
>               rsp->sign_bit_copies = 1;
>               return;
>             }
>         }
> 
> treats the lack of a log link is a possible sign of uninitialised data,
> but that would be a missed optimisation rather than a correctness issue.
> 
> One question is what the default --param value should be.  I went with
> Jakub's suggestion of 3000 from:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
> 
> Also, to answer Jakub's question in that comment, I tried bisecting:
> 
>   int limit = atoi (getenv ("BISECT"));
> 
> (so applying the limit for all calls from try_combine) with an
> abort in distribute_links if the limit caused a link to be skipped.
> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
> was 142, so much lower than the parameter value.  I realised too late
> that --enable-checking=release would probably have been a more
> interesting test.
> 
> Some of the try_combine changes come from Richi's patch in:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
> 
> Bootstrapped & regression-tested on aarch64-linux-gnu,
> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
> 
> Richard
> 
> 
> gcc/
> 	PR testsuite/116398
> 	* params.opt (-param=max-combine-search-insns=): New param.
> 	* doc/invoke.texi: Document it.
> 	* combine.cc (distribute_links): Add a limit parameter.
> 	(try_combine): Use the new param to limit distribute_links
> 	when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
> 
> gcc/testsuite/
> 	* gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
> 	* gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
> 	* gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
> 	* gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
> 	* gcc.target/aarch64/sve/var_stride_2.c: Likewise.
> 	* gcc.target/aarch64/sve/var_stride_4.c: Likewise.
> 
> Co-authored-by: Richard Biener <rguenther@suse.de>
> ---
>  gcc/combine.cc                                | 30 +++++++++----------
>  gcc/doc/invoke.texi                           |  9 ++++++
>  gcc/params.opt                                |  4 +++
>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
>  10 files changed, 42 insertions(+), 27 deletions(-)
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index ef13f5d5e90..28403d1f9e0 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>  static bool reg_bitfield_target_p (rtx, rtx);
>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>  			      rtx, rtx, rtx);
> -static void distribute_links (struct insn_link *);
> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>  static void mark_used_regs_combine (rtx);
>  static void record_promoted_value (rtx_insn *, rtx);
>  static bool unmentioned_reg_p (rtx, rtx);
> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>        adjust_for_new_dest (i3);
>      }
>  
> -  /* If I2 didn't change, this is not a combination (but a simplification or
> -     canonicalisation with context), which should not be done here.  Doing
> -     it here explodes the algorithm.  Don't.  */
> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
> -    {
> -      if (dump_file)
> -	fprintf (dump_file, "i2 didn't change, not doing this\n");
> -      undo_all ();
> -      return 0;
> -    }
> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>  
>    /* We now know that we can do this combination.  Merge the insns and
>       update the status of registers and LOG_LINKS.  */
> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>  			    NULL_RTX, NULL_RTX, NULL_RTX);
>        }
>  
> -    distribute_links (i3links);
> -    distribute_links (i2links);
> -    distribute_links (i1links);
> -    distribute_links (i0links);
> +    int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
> +    distribute_links (i3links, limit);
> +    distribute_links (i2links, limit);
> +    distribute_links (i1links, limit);
> +    distribute_links (i0links, limit);

I'm not sure I understand how the distribute_links mixes in with
the change below.

The original issue reported in the PR was that returning i2
caused evern insn between i2 and i3 to be re-processed by the
main loop, re-trying previously failed combine attempts and
producing garbage.

The disadvantage of directly returning i3 is that if added_links_insn
was inbetween i2 and i3 we've missed to re-try on insn with changed
links.

But what you do now also limits the insn walk in distribute_links
(IMO a good thing on its own), but not taking advantage of that
when i2_unchanged (distributing the link to insns between i2 and i3
is useless in that case).

Ideally we'd distribute links to insns between i2 and i3 in a backward
walk from i3, but pick up the first use if within the limit, as we
want to limit the number of insns we reprocess in the try_combine callers
loop.

Really ideally we'd want to remember all added_links_insn and only
re-process those, not all other unaffected insns between i2 and i3
(or earlier).

That said, your change looks like two independent changes?  IIRC Jakub
at some point proposed to limit the 'return i3' below to cases where
i2 is too far away or sth like that.

Thanks,
Richard.

>      if (REG_P (i2dest))
>        {
> @@ -4785,6 +4777,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>    combine_successes++;
>    undo_commit ();
>  
> +  if (newi2pat && i2_unchanged)
> +    return i3;
> +
>    rtx_insn *ret = newi2pat ? i2 : i3;
>    if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
>      ret = added_links_insn;
> @@ -14987,7 +14982,7 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
>     pointing at I3 when I3's destination is changed.  */
>  
>  static void
> -distribute_links (struct insn_link *links)
> +distribute_links (struct insn_link *links, int limit)
>  {
>    struct insn_link *link, *next_link;
>  
> @@ -15053,6 +15048,7 @@ distribute_links (struct insn_link *links)
>  	 I3 to I2.  Also note that not much searching is typically done here
>  	 since most links don't point very far away.  */
>  
> +      int count = 0;
>        for (insn = NEXT_INSN (link->insn);
>  	   (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
>  		     || BB_HEAD (this_basic_block->next_bb) != insn));
> @@ -15073,6 +15069,8 @@ distribute_links (struct insn_link *links)
>  	  }
>  	else if (INSN_P (insn) && reg_set_p (reg, insn))
>  	  break;
> +	else if (count++ == limit)
> +	  break;
>  
>        /* If we found a place to put the link, place it there unless there
>  	 is already a link to the same insn as LINK at that point.  */
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 0da3f296beb..4c1f903d370 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -16467,6 +16467,15 @@ in combiner for a pseudo register as last known value of that register.
>  @item max-combine-insns
>  The maximum number of instructions the RTL combiner tries to combine.
>  
> +@item max-combine-search-insns
> +The maximum number of instructions that the RTL combiner searches in order
> +to find the next use of a given register definition.  If this limit is reached
> +without finding such a use, the combiner will stop trying to optimize the
> +definition.
> +
> +Currently this limit only applies after certain successful combination
> +attempts, but it could be extended to other cases in future.
> +
>  @item integer-share-limit
>  Small integer constants can use a shared data structure, reducing the
>  compiler's memory usage and increasing its speed.  This sets the maximum
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 4f4eb4d7a2a..422d082b355 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -477,6 +477,10 @@ The maximum number of instructions to consider to unroll in a loop on average.
>  Common Joined UInteger Var(param_max_combine_insns) Init(4) IntegerRange(2, 4) Param Optimization
>  The maximum number of insns combine tries to combine.
>  
> +-param=max-combine-search-insns=
> +Common Joined UInteger Var(param_max_combine_search_insns) Init(3000) Param Optimization
> +The maximum number of instructions that combine searches in order to find the next use of a particular register.
> +
>  -param=max-completely-peel-loop-nest-depth=
>  Common Joined UInteger Var(param_max_unroll_iterations) Init(8) Param Optimization
>  The maximum depth of a loop nest we completely peel.
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c b/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
> index b4141da982c..843fdac9fd8 100644
> --- a/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt-le-1.c
> @@ -8,7 +8,7 @@
>  /*
>  ** le32:
>  **	sub	w([0-9]+), w0, #1
> -**	tst	w0, w\1
> +**	tst	(?:w0, w\1|w\1, w0)
>  **	cset	w0, eq
>  **	ret
>  */
> @@ -20,7 +20,7 @@ unsigned le32 (const unsigned int a) {
>  /*
>  ** gt32:
>  **	sub	w([0-9]+), w0, #1
> -**	tst	w0, w\1
> +**	tst	(?:w0, w\1|w\1, w0)
>  **	cset	w0, ne
>  **	ret
>  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c b/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
> index b811e6f6e8f..3b558e95d81 100644
> --- a/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt-le-3.c
> @@ -8,7 +8,7 @@
>  /*
>  ** le16:
>  **	sub	w([0-9]+), w0, #1
> -**	and	w([0-9]+), w0, w\1
> +**	and	w([0-9]+), (?:w0, w\1|w\1, w0)
>  **	tst	w\2, 65535
>  **	cset	w0, eq
>  **	ret
> @@ -21,7 +21,7 @@ unsigned le16 (const unsigned short a) {
>  /*
>  ** gt16:
>  **	sub	w([0-9]+), w0, #1
> -**	and	w([0-9]+), w0, w\1
> +**	and	w([0-9]+), (?:w0, w\1|w\1, w0)
>  **	tst	w\2, 65535
>  **	cset	w0, ne
>  **	ret
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr100056.c b/gcc/testsuite/gcc.target/aarch64/pr100056.c
> index 0b77824da45..70499772d28 100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr100056.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr100056.c
> @@ -1,7 +1,9 @@
>  /* PR target/100056 */
>  /* { dg-do compile } */
>  /* { dg-options "-O2" } */
> -/* { dg-final { scan-assembler-not {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} } } */
> +/* { dg-final { scan-assembler-not {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\t[us]bfiz\tw[0-9]+, w[0-9]+, 11} 2 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tw[0-9]+, w[0-9]+, w[0-9]+, uxtb\n} 2 } } */
>  
>  int
>  or_shift_u8 (unsigned char i)
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
> index a7d2795ebe2..c9a8b82c48a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-1.c
> @@ -19,6 +19,6 @@ void f10(double * restrict z, double * restrict w, double * restrict x, double *
>      }
>  }
>  
> -/* { dg-final { scan-assembler-not {\tbic\t} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tbic\t} } } */
> +/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.d, p[0-9]+/z, z[0-9]+\.d, #0} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> index 20cbd7550b7..1845bd3f0f7 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-not-gen-4.c
> @@ -8,6 +8,6 @@ void f13(double * restrict z, double * restrict w, double * restrict x, double *
>      }
>  }
>  
> -/* { dg-final { scan-assembler-not {\tbic\t} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tbic\t} } } */
> +/* { dg-final { scan-assembler-times {\tnot\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcmuo\tp[0-9]+\.d, p[0-9]+/z, z[0-9]+\.d, z[0-9]+\.d} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
> index 33b9f0f197e..b8afea70207 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
> @@ -16,7 +16,8 @@ f (TYPE *x, TYPE *y, unsigned short n, unsigned short m)
>  /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
>  /* Should multiply by (257-1)*4 rather than (VF-1)*4 or (VF-2)*4.  */
> -/* { dg-final { scan-assembler-times {\tadd\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x2, 10, 16\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x3, 10, 16\n} 1 } } */
>  /* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
>  /* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
>  /* { dg-final { scan-assembler-not {\tcsel\tx[0-9]+} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
> index 71b826a4c1b..d2e74f9d417 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/var_stride_4.c
> @@ -16,7 +16,8 @@ f (TYPE *x, TYPE *y, int n, int m)
>  /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
>  /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
>  /* Should multiply by (257-1)*4 rather than (VF-1)*4.  */
> -/* { dg-final { scan-assembler-times {\t(?:lsl\tx[0-9]+, x[0-9]+, 10|sbfiz\tx[0-9]+, x[0-9]+, 10, 32)\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tsbfiz\tx[0-9]+, x2, 10, 32\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tsbfiz\tx[0-9]+, x3, 10, 32\n} 1 } } */
>  /* { dg-final { scan-assembler {\tcmp\tw2, 0} } } */
>  /* { dg-final { scan-assembler {\tcmp\tw3, 0} } } */
>  /* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 4 } } */
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398]
  2025-04-02 13:04 ` Richard Biener
@ 2025-04-02 13:33   ` Richard Sandiford
  2025-04-03  7:41     ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2025-04-02 13:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, jakub, jlaw, segher

Richard Biener <rguenther@suse.de> writes:
> On Tue, 1 Apr 2025, Richard Sandiford wrote:
>
>> The problem in PR101523 was that, after each successful 2->2
>> combination attempt, distribute_links would search further and
>> further for the next combinable use of the i2 destination.
>> The original patch for the PR dealt with that by disallowing such
>> combinations.  However, that led to various optimisation regressions,
>> so there was interest in allowing the combinations again, at least
>> until an alternative way of getting the same results is in place.
>> 
>> This patch does that, but limits distribute_links to a certain number
>> of instructions when i2 is unchanged.  As Segher said in the PR trail,
>> it would make more conceptual sense to apply the limit unconditionally,
>> but I thought it would be better to change as little as possible at
>> this development stage.  Logically, in stage 1, the --param should
>> be applied directly by distribute_links with no input from callers.
>> 
>> As I mentioned in:
>> 
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>> 
>> I think it's safe to drop log links even if a use exists.  All
>> processing of log links seems to handle the absence of a link
>> for a particular register in a conservative way.
>> 
>> The initial set-up errs on the side of dropping links, since for example
>> create_log_links has:
>> 
>>              /* flow.c claimed:
>> 
>>                  We don't build a LOG_LINK for hard registers contained
>>                  in ASM_OPERANDs.  If these registers get replaced,
>>                  we might wind up changing the semantics of the insn,
>>                  even if reload can make what appear to be valid
>>                  assignments later.  */
>>               if (regno < FIRST_PSEUDO_REGISTER
>>                   && asm_noperands (PATTERN (use_insn)) >= 0)
>>                 continue;
>> 
>> which excludes combinations by dropping log links, rather than during
>> try_combine.  And:
>> 
>>       /* If this register is being initialized using itself, and the
>>          register is uninitialized in this basic block, and there are
>>          no LOG_LINKS which set the register, then part of the
>>          register is uninitialized.  In that case we can't assume
>>          anything about the number of nonzero bits.
>> 
>>          ??? We could do better if we checked this in
>>          reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>>          could avoid making assumptions about the insn which initially
>>          sets the register, while still using the information in other
>>          insns.  We would have to be careful to check every insn
>>          involved in the combination.  */
>> 
>>       if (insn
>>           && reg_referenced_p (x, PATTERN (insn))
>>           && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>>                                REGNO (x)))
>>         {
>>           struct insn_link *link;
>> 
>>           FOR_EACH_LOG_LINK (link, insn)
>>             if (dead_or_set_p (link->insn, x))
>>               break;
>>           if (!link)
>>             {
>>               rsp->nonzero_bits = GET_MODE_MASK (mode);
>>               rsp->sign_bit_copies = 1;
>>               return;
>>             }
>>         }
>> 
>> treats the lack of a log link is a possible sign of uninitialised data,
>> but that would be a missed optimisation rather than a correctness issue.
>> 
>> One question is what the default --param value should be.  I went with
>> Jakub's suggestion of 3000 from:
>> 
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>> 
>> Also, to answer Jakub's question in that comment, I tried bisecting:
>> 
>>   int limit = atoi (getenv ("BISECT"));
>> 
>> (so applying the limit for all calls from try_combine) with an
>> abort in distribute_links if the limit caused a link to be skipped.
>> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>> was 142, so much lower than the parameter value.  I realised too late
>> that --enable-checking=release would probably have been a more
>> interesting test.
>> 
>> Some of the try_combine changes come from Richi's patch in:
>> 
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>> 
>> Bootstrapped & regression-tested on aarch64-linux-gnu,
>> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
>> 
>> Richard
>> 
>> 
>> gcc/
>> 	PR testsuite/116398
>> 	* params.opt (-param=max-combine-search-insns=): New param.
>> 	* doc/invoke.texi: Document it.
>> 	* combine.cc (distribute_links): Add a limit parameter.
>> 	(try_combine): Use the new param to limit distribute_links
>> 	when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
>> 
>> gcc/testsuite/
>> 	* gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>> 	* gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>> 	* gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>> 	* gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>> 	* gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>> 	* gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>> 
>> Co-authored-by: Richard Biener <rguenther@suse.de>
>> ---
>>  gcc/combine.cc                                | 30 +++++++++----------
>>  gcc/doc/invoke.texi                           |  9 ++++++
>>  gcc/params.opt                                |  4 +++
>>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
>>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
>>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
>>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
>>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
>>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
>>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
>>  10 files changed, 42 insertions(+), 27 deletions(-)
>> 
>> diff --git a/gcc/combine.cc b/gcc/combine.cc
>> index ef13f5d5e90..28403d1f9e0 100644
>> --- a/gcc/combine.cc
>> +++ b/gcc/combine.cc
>> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>>  static bool reg_bitfield_target_p (rtx, rtx);
>>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>>  			      rtx, rtx, rtx);
>> -static void distribute_links (struct insn_link *);
>> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>>  static void mark_used_regs_combine (rtx);
>>  static void record_promoted_value (rtx_insn *, rtx);
>>  static bool unmentioned_reg_p (rtx, rtx);
>> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>        adjust_for_new_dest (i3);
>>      }
>>  
>> -  /* If I2 didn't change, this is not a combination (but a simplification or
>> -     canonicalisation with context), which should not be done here.  Doing
>> -     it here explodes the algorithm.  Don't.  */
>> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
>> -    {
>> -      if (dump_file)
>> -	fprintf (dump_file, "i2 didn't change, not doing this\n");
>> -      undo_all ();
>> -      return 0;
>> -    }
>> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>>  
>>    /* We now know that we can do this combination.  Merge the insns and
>>       update the status of registers and LOG_LINKS.  */
>> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>  			    NULL_RTX, NULL_RTX, NULL_RTX);
>>        }
>>  
>> -    distribute_links (i3links);
>> -    distribute_links (i2links);
>> -    distribute_links (i1links);
>> -    distribute_links (i0links);
>> +    int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>> +    distribute_links (i3links, limit);
>> +    distribute_links (i2links, limit);
>> +    distribute_links (i1links, limit);
>> +    distribute_links (i0links, limit);
>
> I'm not sure I understand how the distribute_links mixes in with
> the change below.
>
> The original issue reported in the PR was that returning i2
> caused evern insn between i2 and i3 to be re-processed by the
> main loop, re-trying previously failed combine attempts and
> producing garbage.
>
> The disadvantage of directly returning i3 is that if added_links_insn
> was inbetween i2 and i3 we've missed to re-try on insn with changed
> links.

Can that happen though?  log links are supposed to be attached to the
next use of the register (only), so I wouldn't expect the next use to be
before i3 (putatively the previous first use).

> But what you do now also limits the insn walk in distribute_links
> (IMO a good thing on its own), but not taking advantage of that
> when i2_unchanged (distributing the link to insns between i2 and i3
> is useless in that case).

Yeah, it's true that distribute_links could potentially start the
search at a better place.  But that's IMO a third, somewhat independent
optimisation.  And it seems more dangerous than the proposed patch,
since if it turns out that the next use is before i3 in some current
or future circumstances, we could end up generating wrong code.

> Ideally we'd distribute links to insns between i2 and i3 in a backward
> walk from i3, but pick up the first use if within the limit, as we
> want to limit the number of insns we reprocess in the try_combine callers
> loop.
>
> Really ideally we'd want to remember all added_links_insn and only
> re-process those, not all other unaffected insns between i2 and i3
> (or earlier).
>
> That said, your change looks like two independent changes?  IIRC Jakub
> at some point proposed to limit the 'return i3' below to cases where
> i2 is too far away or sth like that.

Yes, it's two independent changes.  The return i3 one is taken directly
from your comment in:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53

I think that's the right thing to do, for the reason you explained there
and above, so I didn't want to lose it.  But like you say in the PR,
combine did still take a large slice of time with that change in isolation.

Jakub discussed the idea of augmenting that change with one that makes
try_combine check luids.  The distribute_links part of the patch is
supposed to be an alternative to that.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398]
  2025-04-02 13:33   ` Richard Sandiford
@ 2025-04-03  7:41     ` Richard Biener
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2025-04-03  7:41 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, jakub, jlaw, segher

On Wed, 2 Apr 2025, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
> >
> >> The problem in PR101523 was that, after each successful 2->2
> >> combination attempt, distribute_links would search further and
> >> further for the next combinable use of the i2 destination.
> >> The original patch for the PR dealt with that by disallowing such
> >> combinations.  However, that led to various optimisation regressions,
> >> so there was interest in allowing the combinations again, at least
> >> until an alternative way of getting the same results is in place.
> >> 
> >> This patch does that, but limits distribute_links to a certain number
> >> of instructions when i2 is unchanged.  As Segher said in the PR trail,
> >> it would make more conceptual sense to apply the limit unconditionally,
> >> but I thought it would be better to change as little as possible at
> >> this development stage.  Logically, in stage 1, the --param should
> >> be applied directly by distribute_links with no input from callers.
> >> 
> >> As I mentioned in:
> >> 
> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
> >> 
> >> I think it's safe to drop log links even if a use exists.  All
> >> processing of log links seems to handle the absence of a link
> >> for a particular register in a conservative way.
> >> 
> >> The initial set-up errs on the side of dropping links, since for example
> >> create_log_links has:
> >> 
> >>              /* flow.c claimed:
> >> 
> >>                  We don't build a LOG_LINK for hard registers contained
> >>                  in ASM_OPERANDs.  If these registers get replaced,
> >>                  we might wind up changing the semantics of the insn,
> >>                  even if reload can make what appear to be valid
> >>                  assignments later.  */
> >>               if (regno < FIRST_PSEUDO_REGISTER
> >>                   && asm_noperands (PATTERN (use_insn)) >= 0)
> >>                 continue;
> >> 
> >> which excludes combinations by dropping log links, rather than during
> >> try_combine.  And:
> >> 
> >>       /* If this register is being initialized using itself, and the
> >>          register is uninitialized in this basic block, and there are
> >>          no LOG_LINKS which set the register, then part of the
> >>          register is uninitialized.  In that case we can't assume
> >>          anything about the number of nonzero bits.
> >> 
> >>          ??? We could do better if we checked this in
> >>          reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
> >>          could avoid making assumptions about the insn which initially
> >>          sets the register, while still using the information in other
> >>          insns.  We would have to be careful to check every insn
> >>          involved in the combination.  */
> >> 
> >>       if (insn
> >>           && reg_referenced_p (x, PATTERN (insn))
> >>           && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
> >>                                REGNO (x)))
> >>         {
> >>           struct insn_link *link;
> >> 
> >>           FOR_EACH_LOG_LINK (link, insn)
> >>             if (dead_or_set_p (link->insn, x))
> >>               break;
> >>           if (!link)
> >>             {
> >>               rsp->nonzero_bits = GET_MODE_MASK (mode);
> >>               rsp->sign_bit_copies = 1;
> >>               return;
> >>             }
> >>         }
> >> 
> >> treats the lack of a log link is a possible sign of uninitialised data,
> >> but that would be a missed optimisation rather than a correctness issue.
> >> 
> >> One question is what the default --param value should be.  I went with
> >> Jakub's suggestion of 3000 from:
> >> 
> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
> >> 
> >> Also, to answer Jakub's question in that comment, I tried bisecting:
> >> 
> >>   int limit = atoi (getenv ("BISECT"));
> >> 
> >> (so applying the limit for all calls from try_combine) with an
> >> abort in distribute_links if the limit caused a link to be skipped.
> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
> >> was 142, so much lower than the parameter value.  I realised too late
> >> that --enable-checking=release would probably have been a more
> >> interesting test.
> >> 
> >> Some of the try_combine changes come from Richi's patch in:
> >> 
> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
> >> 
> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
> >> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
> >> 
> >> Richard
> >> 
> >> 
> >> gcc/
> >> 	PR testsuite/116398
> >> 	* params.opt (-param=max-combine-search-insns=): New param.
> >> 	* doc/invoke.texi: Document it.
> >> 	* combine.cc (distribute_links): Add a limit parameter.
> >> 	(try_combine): Use the new param to limit distribute_links
> >> 	when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
> >> 
> >> gcc/testsuite/
> >> 	* gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
> >> 	* gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
> >> 	* gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
> >> 	* gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
> >> 	* gcc.target/aarch64/sve/var_stride_2.c: Likewise.
> >> 	* gcc.target/aarch64/sve/var_stride_4.c: Likewise.
> >> 
> >> Co-authored-by: Richard Biener <rguenther@suse.de>
> >> ---
> >>  gcc/combine.cc                                | 30 +++++++++----------
> >>  gcc/doc/invoke.texi                           |  9 ++++++
> >>  gcc/params.opt                                |  4 +++
> >>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
> >>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
> >>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
> >>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
> >>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
> >>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
> >>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
> >>  10 files changed, 42 insertions(+), 27 deletions(-)
> >> 
> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
> >> index ef13f5d5e90..28403d1f9e0 100644
> >> --- a/gcc/combine.cc
> >> +++ b/gcc/combine.cc
> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
> >>  static bool reg_bitfield_target_p (rtx, rtx);
> >>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
> >>  			      rtx, rtx, rtx);
> >> -static void distribute_links (struct insn_link *);
> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
> >>  static void mark_used_regs_combine (rtx);
> >>  static void record_promoted_value (rtx_insn *, rtx);
> >>  static bool unmentioned_reg_p (rtx, rtx);
> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> >>        adjust_for_new_dest (i3);
> >>      }
> >>  
> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
> >> -     canonicalisation with context), which should not be done here.  Doing
> >> -     it here explodes the algorithm.  Don't.  */
> >> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
> >> -    {
> >> -      if (dump_file)
> >> -	fprintf (dump_file, "i2 didn't change, not doing this\n");
> >> -      undo_all ();
> >> -      return 0;
> >> -    }
> >> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
> >>  
> >>    /* We now know that we can do this combination.  Merge the insns and
> >>       update the status of registers and LOG_LINKS.  */
> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
> >>  			    NULL_RTX, NULL_RTX, NULL_RTX);
> >>        }
> >>  
> >> -    distribute_links (i3links);
> >> -    distribute_links (i2links);
> >> -    distribute_links (i1links);
> >> -    distribute_links (i0links);
> >> +    int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
> >> +    distribute_links (i3links, limit);
> >> +    distribute_links (i2links, limit);
> >> +    distribute_links (i1links, limit);
> >> +    distribute_links (i0links, limit);
> >
> > I'm not sure I understand how the distribute_links mixes in with
> > the change below.
> >
> > The original issue reported in the PR was that returning i2
> > caused evern insn between i2 and i3 to be re-processed by the
> > main loop, re-trying previously failed combine attempts and
> > producing garbage.
> >
> > The disadvantage of directly returning i3 is that if added_links_insn
> > was inbetween i2 and i3 we've missed to re-try on insn with changed
> > links.
> 
> Can that happen though?  log links are supposed to be attached to the
> next use of the register (only), so I wouldn't expect the next use to be
> before i3 (putatively the previous first use).

But log-links on i2 are for uses in i2 for defs before i2, given
i0/i1/i2 changed/have gone away we adjust those to eventually end
up on insns between i2 and i3 (or indeed after i3).  Combine then
want's to try combine the insns with changed log-links.

At least that's my simple-minded understanding.

> > But what you do now also limits the insn walk in distribute_links
> > (IMO a good thing on its own), but not taking advantage of that
> > when i2_unchanged (distributing the link to insns between i2 and i3
> > is useless in that case).
> 
> Yeah, it's true that distribute_links could potentially start the
> search at a better place.  But that's IMO a third, somewhat independent
> optimisation.  And it seems more dangerous than the proposed patch,
> since if it turns out that the next use is before i3 in some current
> or future circumstances, we could end up generating wrong code.
> 
> > Ideally we'd distribute links to insns between i2 and i3 in a backward
> > walk from i3, but pick up the first use if within the limit, as we
> > want to limit the number of insns we reprocess in the try_combine callers
> > loop.
> >
> > Really ideally we'd want to remember all added_links_insn and only
> > re-process those, not all other unaffected insns between i2 and i3
> > (or earlier).
> >
> > That said, your change looks like two independent changes?  IIRC Jakub
> > at some point proposed to limit the 'return i3' below to cases where
> > i2 is too far away or sth like that.
> 
> Yes, it's two independent changes.  The return i3 one is taken directly
> from your comment in:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
> 
> I think that's the right thing to do, for the reason you explained there
> and above, so I didn't want to lose it.  But like you say in the PR,
> combine did still take a large slice of time with that change in isolation.
> 
> Jakub discussed the idea of augmenting that change with one that makes
> try_combine check luids.  The distribute_links part of the patch is
> supposed to be an alternative to that.

I see.  I understood Jakub thinks my change is too aggressive since
it does indeed miss combine attempts from changed log-links.

I remember that when allowing added_links_insn to override the
i3 return for i2_unchaged_p the bad compile-time re-surfaced.  So
now I wonder whether, if i2_unchanged, distribute_links (i2links)
does (should do?) anything?  I don't see anything that would avoid
adjusting links when there's still a use in newi2_patt (IIRC
i0 and i1 are always gone?).

But sure, the distribute_links walk made the behavior worse than
quadratic here.

Richard.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-04-03  7:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-01 15:34 [PATCH] combine: Allow 2->2 combos but limit insn searches [PR116398] Richard Sandiford
2025-04-02 13:04 ` Richard Biener
2025-04-02 13:33   ` Richard Sandiford
2025-04-03  7:41     ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).