public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
       [not found] <20240219104736.25B63384F00A@sourceware.org>
@ 2024-02-19 10:55 ` Richard Sandiford
  2024-02-19 11:32   ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2024-02-19 10:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> The following tries to address the PHI insertion compile-time hog in
> RTL fwprop observed with the PR54052 testcase where the loop computing
> the "unfiltered" set of variables possibly needing PHI nodes for each
> block exhibits quadratic compile-time and memory-use.
>
> Instead of only pruning the set of candidate regs by LR_IN in the
> second worklist loop do this when computing "unfiltered" already.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> I'll note that in PR98863 you say in comment#39
>
> "Just to give an update on this: I have a patch that reduces the
> amount of memory consumed by fwprop so that it no longer seems
> to be outlier.  However, it involves doing more bitmap operations.
> In this testcase we have a larger number of registers that
> seem to be live but unused across a large region of code,
> so bitmap ANDs with the live in sets are expensive and hit
> the worst-case O(nblocksnregisters).  I'm still trying to find
> a way of reducing the effect of that."
>
> suggesting that the very AND operation I'm introducing below
> was an actual problem.  It's just not very clear what testcase
> this was on (the PR hasn't one, it just talks about WRF with LTO
> and then some individual TUs of it).

Yeah, like you say, I think this kind of AND was exactly the problem.
If the DEF set is much smaller than the IN set, we can spend a lot of
compile time (and cache) iterating over the leading elements of the
IN set.  So this could be trading one hog for another.

Could we use some heuristic to choose between the two?  If the IN set
is "sensible", do the AND, otherwise keep it as-is?

> Indeed the patch doesn't do anything about quadraticness but
> it seems to effectively reduce the size of 'unfiltered' (but
> bringing in LR_IN into the workset).

Yeah, this is always going to be O(blocks * registers) in the worst case.

[Was just in the process of replying to the bugzilla ticket when
this patch arrived :) ]

Thanks,
Richard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
  2024-02-19 10:55 ` [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog Richard Sandiford
@ 2024-02-19 11:32   ` Richard Biener
  2024-02-19 12:58     ` Richard Sandiford
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2024-02-19 11:32 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Mon, 19 Feb 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > The following tries to address the PHI insertion compile-time hog in
> > RTL fwprop observed with the PR54052 testcase where the loop computing
> > the "unfiltered" set of variables possibly needing PHI nodes for each
> > block exhibits quadratic compile-time and memory-use.
> >
> > Instead of only pruning the set of candidate regs by LR_IN in the
> > second worklist loop do this when computing "unfiltered" already.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >
> > I'll note that in PR98863 you say in comment#39
> >
> > "Just to give an update on this: I have a patch that reduces the
> > amount of memory consumed by fwprop so that it no longer seems
> > to be outlier.  However, it involves doing more bitmap operations.
> > In this testcase we have a larger number of registers that
> > seem to be live but unused across a large region of code,
> > so bitmap ANDs with the live in sets are expensive and hit
> > the worst-case O(nblocksnregisters).  I'm still trying to find
> > a way of reducing the effect of that."
> >
> > suggesting that the very AND operation I'm introducing below
> > was an actual problem.  It's just not very clear what testcase
> > this was on (the PR hasn't one, it just talks about WRF with LTO
> > and then some individual TUs of it).
> 
> Yeah, like you say, I think this kind of AND was exactly the problem.
> If the DEF set is much smaller than the IN set, we can spend a lot of
> compile time (and cache) iterating over the leading elements of the
> IN set.  So this could be trading one hog for another.
> 
> Could we use some heuristic to choose between the two?  If the IN set
> is "sensible", do the AND, otherwise keep it as-is?

Not sure how, I don't think DF caches the set sizes (we could
compute them, of course).  But I just made an experiment and
using DF_LR_OUT instead of DF_LR_BB_INFO->def improves compile-time
as well.  So incremental ontop of the posted:

diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6fd08602c1b 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
        continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
b1))->def;
+      bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+       if (bitmap_ior_into (&unfiltered[b2], b1_def)
            && !bitmap_empty_p (&frontiers[b2]))
          // Propagate the (potential) new phi node definitions in B2.
          bitmap_set_bit (worklist, b2);

of course that's too big (including live-through), but we could
prune by DF_LR_OUT like

diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 9d1cd1b0365..6a4dd05908f 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
        continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
b1))->def;
+      auto_bitmap b1_def;
+      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
b1))->def,
+                 DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
-                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
+       if (bitmap_ior_into (&unfiltered[b2], b1_def)
            && !bitmap_empty_p (&frontiers[b2]))
          // Propagate the (potential) new phi node definitions in B2.
          bitmap_set_bit (worklist, b2);

so for the testcase it seems we have a lot of local defined but
not "global" used defs.

Would that work for you or am I missing something?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
  2024-02-19 11:32   ` Richard Biener
@ 2024-02-19 12:58     ` Richard Sandiford
  2024-02-19 13:14       ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2024-02-19 12:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Mon, 19 Feb 2024, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > The following tries to address the PHI insertion compile-time hog in
>> > RTL fwprop observed with the PR54052 testcase where the loop computing
>> > the "unfiltered" set of variables possibly needing PHI nodes for each
>> > block exhibits quadratic compile-time and memory-use.
>> >
>> > Instead of only pruning the set of candidate regs by LR_IN in the
>> > second worklist loop do this when computing "unfiltered" already.
>> >
>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >
>> > I'll note that in PR98863 you say in comment#39
>> >
>> > "Just to give an update on this: I have a patch that reduces the
>> > amount of memory consumed by fwprop so that it no longer seems
>> > to be outlier.  However, it involves doing more bitmap operations.
>> > In this testcase we have a larger number of registers that
>> > seem to be live but unused across a large region of code,
>> > so bitmap ANDs with the live in sets are expensive and hit
>> > the worst-case O(nblocksnregisters).  I'm still trying to find
>> > a way of reducing the effect of that."
>> >
>> > suggesting that the very AND operation I'm introducing below
>> > was an actual problem.  It's just not very clear what testcase
>> > this was on (the PR hasn't one, it just talks about WRF with LTO
>> > and then some individual TUs of it).
>> 
>> Yeah, like you say, I think this kind of AND was exactly the problem.
>> If the DEF set is much smaller than the IN set, we can spend a lot of
>> compile time (and cache) iterating over the leading elements of the
>> IN set.  So this could be trading one hog for another.
>> 
>> Could we use some heuristic to choose between the two?  If the IN set
>> is "sensible", do the AND, otherwise keep it as-is?
>
> Not sure how, I don't think DF caches the set sizes (we could
> compute them, of course).  But I just made an experiment and
> using DF_LR_OUT instead of DF_LR_BB_INFO->def improves compile-time
> as well.  So incremental ontop of the posted:
>
> diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> index 9d1cd1b0365..6fd08602c1b 100644
> --- a/gcc/rtl-ssa/blocks.cc
> +++ b/gcc/rtl-ssa/blocks.cc
> @@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi)
>        if (bitmap_empty_p (&frontiers[b1]))
>         continue;
>  
> -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> b1))->def;
> +      bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1));
>        bitmap_iterator bmi;
>        unsigned int b2;
>        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> -       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
> -                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
> +       if (bitmap_ior_into (&unfiltered[b2], b1_def)
>             && !bitmap_empty_p (&frontiers[b2]))
>           // Propagate the (potential) new phi node definitions in B2.
>           bitmap_set_bit (worklist, b2);
>
> of course that's too big (including live-through), but we could
> prune by DF_LR_OUT like
>
> diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> index 9d1cd1b0365..6a4dd05908f 100644
> --- a/gcc/rtl-ssa/blocks.cc
> +++ b/gcc/rtl-ssa/blocks.cc
> @@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi)
>        if (bitmap_empty_p (&frontiers[b1]))
>         continue;
>  
> -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> b1))->def;
> +      auto_bitmap b1_def;
> +      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> b1))->def,
> +                 DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
>        bitmap_iterator bmi;
>        unsigned int b2;
>        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> -       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
> -                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
> +       if (bitmap_ior_into (&unfiltered[b2], b1_def)
>             && !bitmap_empty_p (&frontiers[b2]))
>           // Propagate the (potential) new phi node definitions in B2.
>           bitmap_set_bit (worklist, b2);
>
> so for the testcase it seems we have a lot of local defined but
> not "global" used defs.
>
> Would that work for you or am I missing something?

I suppose that's better than the first version when a block has a
large number of dominance frontiers.  But I can't remember whether
that was the case in PR98863.  I have a feeling that I tried the above
as part of the PR, since it's the obvious way of applying liveness
once per block rather than once per edge.  But I think it was more
just sheer weight of numbers.

I wonder whether we could create a new DEF bitmap on-the-fly while
creating the defs and uses, restricting it to potential PHI registers.
The test for potential PHI registers is constant time, and we could
build the bitmaps as tree views before converting to list views.
Can give it a go if that sounds OK.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
  2024-02-19 12:58     ` Richard Sandiford
@ 2024-02-19 13:14       ` Richard Biener
  2024-02-19 13:25         ` Richard Sandiford
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2024-02-19 13:14 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Mon, 19 Feb 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Mon, 19 Feb 2024, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > The following tries to address the PHI insertion compile-time hog in
> >> > RTL fwprop observed with the PR54052 testcase where the loop computing
> >> > the "unfiltered" set of variables possibly needing PHI nodes for each
> >> > block exhibits quadratic compile-time and memory-use.
> >> >
> >> > Instead of only pruning the set of candidate regs by LR_IN in the
> >> > second worklist loop do this when computing "unfiltered" already.
> >> >
> >> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >
> >> > I'll note that in PR98863 you say in comment#39
> >> >
> >> > "Just to give an update on this: I have a patch that reduces the
> >> > amount of memory consumed by fwprop so that it no longer seems
> >> > to be outlier.  However, it involves doing more bitmap operations.
> >> > In this testcase we have a larger number of registers that
> >> > seem to be live but unused across a large region of code,
> >> > so bitmap ANDs with the live in sets are expensive and hit
> >> > the worst-case O(nblocksnregisters).  I'm still trying to find
> >> > a way of reducing the effect of that."
> >> >
> >> > suggesting that the very AND operation I'm introducing below
> >> > was an actual problem.  It's just not very clear what testcase
> >> > this was on (the PR hasn't one, it just talks about WRF with LTO
> >> > and then some individual TUs of it).
> >> 
> >> Yeah, like you say, I think this kind of AND was exactly the problem.
> >> If the DEF set is much smaller than the IN set, we can spend a lot of
> >> compile time (and cache) iterating over the leading elements of the
> >> IN set.  So this could be trading one hog for another.
> >> 
> >> Could we use some heuristic to choose between the two?  If the IN set
> >> is "sensible", do the AND, otherwise keep it as-is?
> >
> > Not sure how, I don't think DF caches the set sizes (we could
> > compute them, of course).  But I just made an experiment and
> > using DF_LR_OUT instead of DF_LR_BB_INFO->def improves compile-time
> > as well.  So incremental ontop of the posted:
> >
> > diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> > index 9d1cd1b0365..6fd08602c1b 100644
> > --- a/gcc/rtl-ssa/blocks.cc
> > +++ b/gcc/rtl-ssa/blocks.cc
> > @@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi)
> >        if (bitmap_empty_p (&frontiers[b1]))
> >         continue;
> >  
> > -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> > b1))->def;
> > +      bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1));
> >        bitmap_iterator bmi;
> >        unsigned int b2;
> >        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> > -       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
> > -                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
> > +       if (bitmap_ior_into (&unfiltered[b2], b1_def)
> >             && !bitmap_empty_p (&frontiers[b2]))
> >           // Propagate the (potential) new phi node definitions in B2.
> >           bitmap_set_bit (worklist, b2);
> >
> > of course that's too big (including live-through), but we could
> > prune by DF_LR_OUT like
> >
> > diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> > index 9d1cd1b0365..6a4dd05908f 100644
> > --- a/gcc/rtl-ssa/blocks.cc
> > +++ b/gcc/rtl-ssa/blocks.cc
> > @@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi)
> >        if (bitmap_empty_p (&frontiers[b1]))
> >         continue;
> >  
> > -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> > b1))->def;
> > +      auto_bitmap b1_def;
> > +      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, 
> > b1))->def,
> > +                 DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
> >        bitmap_iterator bmi;
> >        unsigned int b2;
> >        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> > -       if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
> > -                                DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
> > +       if (bitmap_ior_into (&unfiltered[b2], b1_def)
> >             && !bitmap_empty_p (&frontiers[b2]))
> >           // Propagate the (potential) new phi node definitions in B2.
> >           bitmap_set_bit (worklist, b2);
> >
> > so for the testcase it seems we have a lot of local defined but
> > not "global" used defs.
> >
> > Would that work for you or am I missing something?
> 
> I suppose that's better than the first version when a block has a
> large number of dominance frontiers.  But I can't remember whether
> that was the case in PR98863.  I have a feeling that I tried the above
> as part of the PR, since it's the obvious way of applying liveness
> once per block rather than once per edge.  But I think it was more
> just sheer weight of numbers.

Note at least this (see below for the actual patch for trunk) is
in the linear part, so it shouldn't trigger any quadraticness.
It speeds up the testcase the same as the original one.

> I wonder whether we could create a new DEF bitmap on-the-fly while
> creating the defs and uses, restricting it to potential PHI registers.
> The test for potential PHI registers is constant time, and we could
> build the bitmaps as tree views before converting to list views.
> Can give it a go if that sounds OK.

Do that within DF itself?  Not sure about that.

Anyway, I think the below is small enough that it would also qualify
for backporting.  Unless there's a problem with it, of course.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK?

Thanks,
Richard.

From d6f57f72fbbdb832b33e4ba5ccbf60a8d90ea264 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 19 Feb 2024 11:10:50 +0100
Subject: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time
 hog
To: gcc-patches@gcc.gnu.org

The following tries to address the PHI insertion compile-time hog in
RTL fwprop observed with the PR54052 testcase where the loop computing
the "unfiltered" set of variables possibly needing PHI nodes for each
block exhibits quadratic compile-time and memory-use.

It does so by pruning the local DEFs with LR_OUT of the block, removing
regs that can never be LR_IN (defined by this block) in the dominance
frontier.

	PR rtl-optimization/54052
	* rtl-ssa/blocks.cc (function_info::place_phis): Filter
	local defs by LR_OUT.
---
 gcc/rtl-ssa/blocks.cc | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 8996443e8d5..cf4224e61ec 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -645,7 +645,12 @@ function_info::place_phis (build_info &bi)
       if (bitmap_empty_p (&frontiers[b1]))
 	continue;
 
-      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
+      // Defs in B1 that are possibly in LR_IN in the dominance frontier
+      // blocks.
+      auto_bitmap b1_def;
+      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def,
+		  DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
+
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-- 
2.35.3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
  2024-02-19 13:14       ` Richard Biener
@ 2024-02-19 13:25         ` Richard Sandiford
  2024-02-19 13:41           ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2024-02-19 13:25 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
>> I suppose that's better than the first version when a block has a
>> large number of dominance frontiers.  But I can't remember whether
>> that was the case in PR98863.  I have a feeling that I tried the above
>> as part of the PR, since it's the obvious way of applying liveness
>> once per block rather than once per edge.  But I think it was more
>> just sheer weight of numbers.
>
> Note at least this (see below for the actual patch for trunk) is
> in the linear part, so it shouldn't trigger any quadraticness.
> It speeds up the testcase the same as the original one.

But some of the "quadracticness" is from O(nblocks*nregisters) or
O(nedges*nregisters), rather than through iteration.

>> I wonder whether we could create a new DEF bitmap on-the-fly while
>> creating the defs and uses, restricting it to potential PHI registers.
>> The test for potential PHI registers is constant time, and we could
>> build the bitmaps as tree views before converting to list views.
>> Can give it a go if that sounds OK.
>
> Do that within DF itself?  Not sure about that.

No, in rtl-ssa.  It would be convenient to have a version of
bitmap_ior_and_into in which the final bitmap is an sbitmap though.

I.e., schematically:

      EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
				 bi.potential_phi_regs)
	    && !bitmap_empty_p (&frontiers[b2]))
	  // Propagate the (potential) new phi node definitions in B2.
	  bitmap_set_bit (worklist, b2);

with a new overload to make that possible.

> Anyway, I think the below is small enough that it would also qualify
> for backporting.  Unless there's a problem with it, of course.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?

I still think this is likely to reintroduce the problem in PR98863.
I won't object further though, so yeah, please go ahead.

Richard

> Thanks,
> Richard.
>
> From d6f57f72fbbdb832b33e4ba5ccbf60a8d90ea264 Mon Sep 17 00:00:00 2001
> From: Richard Biener <rguenther@suse.de>
> Date: Mon, 19 Feb 2024 11:10:50 +0100
> Subject: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time
>  hog
> To: gcc-patches@gcc.gnu.org
>
> The following tries to address the PHI insertion compile-time hog in
> RTL fwprop observed with the PR54052 testcase where the loop computing
> the "unfiltered" set of variables possibly needing PHI nodes for each
> block exhibits quadratic compile-time and memory-use.
>
> It does so by pruning the local DEFs with LR_OUT of the block, removing
> regs that can never be LR_IN (defined by this block) in the dominance
> frontier.
>
> 	PR rtl-optimization/54052
> 	* rtl-ssa/blocks.cc (function_info::place_phis): Filter
> 	local defs by LR_OUT.
> ---
>  gcc/rtl-ssa/blocks.cc | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> index 8996443e8d5..cf4224e61ec 100644
> --- a/gcc/rtl-ssa/blocks.cc
> +++ b/gcc/rtl-ssa/blocks.cc
> @@ -645,7 +645,12 @@ function_info::place_phis (build_info &bi)
>        if (bitmap_empty_p (&frontiers[b1]))
>  	continue;
>  
> -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
> +      // Defs in B1 that are possibly in LR_IN in the dominance frontier
> +      // blocks.
> +      auto_bitmap b1_def;
> +      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def,
> +		  DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
> +
>        bitmap_iterator bmi;
>        unsigned int b2;
>        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
  2024-02-19 13:25         ` Richard Sandiford
@ 2024-02-19 13:41           ` Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2024-02-19 13:41 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Mon, 19 Feb 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> >> I suppose that's better than the first version when a block has a
> >> large number of dominance frontiers.  But I can't remember whether
> >> that was the case in PR98863.  I have a feeling that I tried the above
> >> as part of the PR, since it's the obvious way of applying liveness
> >> once per block rather than once per edge.  But I think it was more
> >> just sheer weight of numbers.
> >
> > Note at least this (see below for the actual patch for trunk) is
> > in the linear part, so it shouldn't trigger any quadraticness.
> > It speeds up the testcase the same as the original one.
> 
> But some of the "quadracticness" is from O(nblocks*nregisters) or
> O(nedges*nregisters), rather than through iteration.

Yeah, that's of course what DF_LR already is subject to, though,
so we hardly make that worse here.

> >> I wonder whether we could create a new DEF bitmap on-the-fly while
> >> creating the defs and uses, restricting it to potential PHI registers.
> >> The test for potential PHI registers is constant time, and we could
> >> build the bitmaps as tree views before converting to list views.
> >> Can give it a go if that sounds OK.
> >
> > Do that within DF itself?  Not sure about that.
> 
> No, in rtl-ssa.  It would be convenient to have a version of
> bitmap_ior_and_into in which the final bitmap is an sbitmap though.
> 
> I.e., schematically:
> 
>       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> 	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
> 				 bi.potential_phi_regs)
> 	    && !bitmap_empty_p (&frontiers[b2]))
> 	  // Propagate the (potential) new phi node definitions in B2.
> 	  bitmap_set_bit (worklist, b2);
> 
> with a new overload to make that possible.
> 
> > Anyway, I think the below is small enough that it would also qualify
> > for backporting.  Unless there's a problem with it, of course.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > OK?
> 
> I still think this is likely to reintroduce the problem in PR98863.
> I won't object further though, so yeah, please go ahead.

The core issue there was memory-usage, that's unlikely going to be
worse with this patch.

I've pushed it to trunk and will keep an eye on our testers that
cover WRF.

Thanks,
Richard.

> Richard
> 
> > Thanks,
> > Richard.
> >
> > From d6f57f72fbbdb832b33e4ba5ccbf60a8d90ea264 Mon Sep 17 00:00:00 2001
> > From: Richard Biener <rguenther@suse.de>
> > Date: Mon, 19 Feb 2024 11:10:50 +0100
> > Subject: [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time
> >  hog
> > To: gcc-patches@gcc.gnu.org
> >
> > The following tries to address the PHI insertion compile-time hog in
> > RTL fwprop observed with the PR54052 testcase where the loop computing
> > the "unfiltered" set of variables possibly needing PHI nodes for each
> > block exhibits quadratic compile-time and memory-use.
> >
> > It does so by pruning the local DEFs with LR_OUT of the block, removing
> > regs that can never be LR_IN (defined by this block) in the dominance
> > frontier.
> >
> > 	PR rtl-optimization/54052
> > 	* rtl-ssa/blocks.cc (function_info::place_phis): Filter
> > 	local defs by LR_OUT.
> > ---
> >  gcc/rtl-ssa/blocks.cc | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
> > index 8996443e8d5..cf4224e61ec 100644
> > --- a/gcc/rtl-ssa/blocks.cc
> > +++ b/gcc/rtl-ssa/blocks.cc
> > @@ -645,7 +645,12 @@ function_info::place_phis (build_info &bi)
> >        if (bitmap_empty_p (&frontiers[b1]))
> >  	continue;
> >  
> > -      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
> > +      // Defs in B1 that are possibly in LR_IN in the dominance frontier
> > +      // blocks.
> > +      auto_bitmap b1_def;
> > +      bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def,
> > +		  DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)));
> > +
> >        bitmap_iterator bmi;
> >        unsigned int b2;
> >        EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog
@ 2024-02-19 10:47 Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2024-02-19 10:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

The following tries to address the PHI insertion compile-time hog in
RTL fwprop observed with the PR54052 testcase where the loop computing
the "unfiltered" set of variables possibly needing PHI nodes for each
block exhibits quadratic compile-time and memory-use.

Instead of only pruning the set of candidate regs by LR_IN in the
second worklist loop do this when computing "unfiltered" already.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

I'll note that in PR98863 you say in comment#39

"Just to give an update on this: I have a patch that reduces the
amount of memory consumed by fwprop so that it no longer seems
to be outlier.  However, it involves doing more bitmap operations.
In this testcase we have a larger number of registers that
seem to be live but unused across a large region of code,
so bitmap ANDs with the live in sets are expensive and hit
the worst-case O(nblocksnregisters).  I'm still trying to find
a way of reducing the effect of that."

suggesting that the very AND operation I'm introducing below
was an actual problem.  It's just not very clear what testcase
this was on (the PR hasn't one, it just talks about WRF with LTO
and then some individual TUs of it).

Indeed the patch doesn't do anything about quadraticness but
it seems to effectively reduce the size of 'unfiltered' (but
bringing in LR_IN into the workset).

For the PR54052 testcase this improves compile-time from 1420s
to 64s and slightly reduces peak memory use (I would have expected
more but didn't do any statistics on the 'unfiltered' bitmaps
themselves).

OK for trunk and branches if tests go well?

Thanks,
Richard.

	PR rtl-optimization/54052
	* rtl-ssa/blocks.cc (function_info::place_phis): Filter
	unfiltered by LR_IN.
---
 gcc/rtl-ssa/blocks.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc
index 8996443e8d5..9d1cd1b0365 100644
--- a/gcc/rtl-ssa/blocks.cc
+++ b/gcc/rtl-ssa/blocks.cc
@@ -649,7 +649,8 @@ function_info::place_phis (build_info &bi)
       bitmap_iterator bmi;
       unsigned int b2;
       EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
-	if (bitmap_ior_into (&unfiltered[b2], b1_def)
+	if (bitmap_ior_and_into (&unfiltered[b2], b1_def,
+				 DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2)))
 	    && !bitmap_empty_p (&frontiers[b2]))
 	  // Propagate the (potential) new phi node definitions in B2.
 	  bitmap_set_bit (worklist, b2);
-- 
2.35.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-02-19 13:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240219104736.25B63384F00A@sourceware.org>
2024-02-19 10:55 ` [PATCH] rtl-optimization/54052 - RTL SSA PHI insertion compile-time hog Richard Sandiford
2024-02-19 11:32   ` Richard Biener
2024-02-19 12:58     ` Richard Sandiford
2024-02-19 13:14       ` Richard Biener
2024-02-19 13:25         ` Richard Sandiford
2024-02-19 13:41           ` Richard Biener
2024-02-19 10:47 Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).