public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] middle-end/114480 - IDF compute is slow
       [not found] <96606.124032711422300395@us-mta-154.us.mimecast.lan>
@ 2024-03-27 16:02 ` Jakub Jelinek
  2024-03-27 18:44   ` Michael Matz
  0 siblings, 1 reply; 5+ messages in thread
From: Jakub Jelinek @ 2024-03-27 16:02 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Mar 27, 2024 at 04:42:21PM +0100, Richard Biener wrote:
> 	PR middle-end/114480
> 	* cfganal.cc (compute_idf): Use simpler bitmap iteration,
> 	touch work_set only when phi_insertion_points changed.
> ---
>  gcc/cfganal.cc | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/gcc/cfganal.cc b/gcc/cfganal.cc
> index 432775decf1..5ef629f677e 100644
> --- a/gcc/cfganal.cc
> +++ b/gcc/cfganal.cc
> @@ -1701,8 +1701,7 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
>           on earlier blocks first is better.
>  	 ???  Basic blocks are by no means guaranteed to be ordered in
>  	 optimal order for this iteration.  */
> -      bb_index = bitmap_first_set_bit (work_set);
> -      bitmap_clear_bit (work_set, bb_index);
> +      bb_index = bitmap_clear_first_set_bit (work_set);
>  
>        /* Since the registration of NEW -> OLD name mappings is done
>  	 separately from the call to update_ssa, when updating the SSA

The above change is obviously fine.

> @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
>        gcc_checking_assert (bb_index
>  			   < (unsigned) last_basic_block_for_fn (cfun));
>  
> -      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> -	                              0, i, bi)
> -	{
> +      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> +	if (bitmap_set_bit (phi_insertion_points, i))
>  	  bitmap_set_bit (work_set, i);
> -	  bitmap_set_bit (phi_insertion_points, i);
> -	}
>      }

I don't understand why the above is better.
Wouldn't it be best to do
      bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
				 phi_insertion_points);
      bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
?
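For concreteness, the two formulations compute the same sets.  A minimal
single-word sketch (hypothetical - real GCC bitmaps are sparse linked
lists, so this models only the set semantics, not the costs):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-word model: bit i stands for basic block i.  */

/* The patch: per-bit loop, touching work_set only when
   bitmap_set_bit (phi_insertion_points, i) reports a newly set bit.  */
static void
per_bit (uint64_t dfs, uint64_t *work_set, uint64_t *pip)
{
  for (int i = 0; i < 64; i++)
    {
      uint64_t bit = UINT64_C (1) << i;
      if ((dfs & bit) && !(*pip & bit))
	{
	  *pip |= bit;
	  *work_set |= bit;
	}
    }
}

/* The suggested whole-set variant.  */
static void
set_ops (uint64_t dfs, uint64_t *work_set, uint64_t *pip)
{
  *work_set |= dfs & ~*pip;	/* bitmap_ior_and_compl_into */
  *pip |= dfs;			/* bitmap_ior_into */
}
```

Starting from the same work_set and phi_insertion_points, both leave the
two sets identical; any difference is purely in constant factors.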

	Jakub


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] middle-end/114480 - IDF compute is slow
  2024-03-27 16:02 ` [PATCH] middle-end/114480 - IDF compute is slow Jakub Jelinek
@ 2024-03-27 18:44   ` Michael Matz
  2024-03-27 18:57     ` Jakub Jelinek
  2024-03-28  8:12     ` Richard Biener
  0 siblings, 2 replies; 5+ messages in thread
From: Michael Matz @ 2024-03-27 18:44 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches

Hey,

On Wed, 27 Mar 2024, Jakub Jelinek wrote:

> > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> >        gcc_checking_assert (bb_index
> >  			   < (unsigned) last_basic_block_for_fn (cfun));
> >  
> > -      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > -	                              0, i, bi)
> > -	{
> > +      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > +	if (bitmap_set_bit (phi_insertion_points, i))
> >  	  bitmap_set_bit (work_set, i);
> > -	  bitmap_set_bit (phi_insertion_points, i);
> > -	}
> >      }
> 
> I don't understand why the above is better.
> Wouldn't it be best to do
>       bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> 				 phi_insertion_points);
>       bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> ?

I had the same hunch, but:

1) One would have to make work_set be non-tree-view again (which with the 
current structure is a wash anyway, and that makes sense as accesses to 
work_set aren't heavily random here).

2) But doing that and using bitmap_ior.._into is still measurably slower: 
on a reduced testcase with -O0 -fno-checking, proposed structure 
(tree-view or not-tree-view workset doesn't matter):

 tree SSA rewrite                   :  14.93 ( 12%)   0.01 (  2%)  14.95 ( 12%)    27M (  8%)

with non-tree-view, and your suggestion:

 tree SSA rewrite                   :  20.68 ( 12%)   0.02 (  4%)  20.75 ( 12%)    27M (  8%)

I can only speculate that the usually extreme sparsity of the bitmaps in 
question makes the setup costs of the two bitmap_ior calls actually more 
expensive than the often-skipped second call to bitmap_set_bit in Richi's 
proposed structure.  (That or cache effects.)


Ciao,
Michael.


* Re: [PATCH] middle-end/114480 - IDF compute is slow
  2024-03-27 18:44   ` Michael Matz
@ 2024-03-27 18:57     ` Jakub Jelinek
  2024-03-28  8:12     ` Richard Biener
  1 sibling, 0 replies; 5+ messages in thread
From: Jakub Jelinek @ 2024-03-27 18:57 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Biener, gcc-patches

On Wed, Mar 27, 2024 at 07:44:28PM +0100, Michael Matz wrote:
> Hey,
> 
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
> 
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > >        gcc_checking_assert (bb_index
> > >  			   < (unsigned) last_basic_block_for_fn (cfun));
> > >  
> > > -      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > -	                              0, i, bi)
> > > -	{
> > > +      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > +	if (bitmap_set_bit (phi_insertion_points, i))
> > >  	  bitmap_set_bit (work_set, i);
> > > -	  bitmap_set_bit (phi_insertion_points, i);
> > > -	}
> > >      }
> > 
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >       bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> > 				 phi_insertion_points);
> >       bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> > ?
> 
> I had the same hunch, but:
> 
> 1) One would have to make work_set be non-tree-view again (which with the 
> current structure is a wash anyway, and that makes sense as accesses to 
> work_set aren't heavily random here).
> 
> 2) But doing that and using bitmap_ior.._into is still measurably slower: 
> on a reduced testcase with -O0 -fno-checking, proposed structure 
> (tree-view or not-tree-view workset doesn't matter):
> 
>  tree SSA rewrite                   :  14.93 ( 12%)   0.01 (  2%)  14.95 ( 12%)    27M (  8%)
> 
> with non-tree-view, and your suggestion:
> 
>  tree SSA rewrite                   :  20.68 ( 12%)   0.02 (  4%)  20.75 ( 12%)    27M (  8%)
> 
> I can only speculate that the usually extreme sparsity of the bitmaps in 
> question makes the setup costs of the two bitmap_ior calls actually more 
> expensive than the often-skipped second call to bitmap_set_bit in Richi's 
> proposed structure.  (That or cache effects.)

Ok then.

	Jakub



* Re: [PATCH] middle-end/114480 - IDF compute is slow
  2024-03-27 18:44   ` Michael Matz
  2024-03-27 18:57     ` Jakub Jelinek
@ 2024-03-28  8:12     ` Richard Biener
  1 sibling, 0 replies; 5+ messages in thread
From: Richard Biener @ 2024-03-28  8:12 UTC (permalink / raw)
  To: Michael Matz; +Cc: Jakub Jelinek, gcc-patches

On Wed, 27 Mar 2024, Michael Matz wrote:

> Hey,
> 
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
> 
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > >        gcc_checking_assert (bb_index
> > >  			   < (unsigned) last_basic_block_for_fn (cfun));
> > >  
> > > -      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > -	                              0, i, bi)
> > > -	{
> > > +      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > +	if (bitmap_set_bit (phi_insertion_points, i))
> > >  	  bitmap_set_bit (work_set, i);
> > > -	  bitmap_set_bit (phi_insertion_points, i);
> > > -	}
> > >      }
> > 
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >       bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> > 				 phi_insertion_points);
> >       bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> > ?
> 
> I had the same hunch, but:
> 
> 1) One would have to make work_set be non-tree-view again (which with the 
> current structure is a wash anyway, and that makes sense as accesses to 
> work_set aren't heavily random here).

The tree-view is a wash indeed (I tried many things).

> 2) But doing that and using bitmap_ior.._into is still measurably slower: 
> on a reduced testcase with -O0 -fno-checking, proposed structure 
> (tree-view or not-tree-view workset doesn't matter):
> 
>  tree SSA rewrite                   :  14.93 ( 12%)   0.01 (  2%)  14.95 ( 12%)    27M (  8%)
> 
> with non-tree-view, and your suggestion:
> 
>  tree SSA rewrite                   :  20.68 ( 12%)   0.02 (  4%)  20.75 ( 12%)    27M (  8%)
> 
> I can only speculate that the usually extreme sparsity of the bitmaps in 
> question makes the setup costs of the two bitmap_ior calls actually more 
> expensive than the often-skipped second call to bitmap_set_bit in Richi's 
> proposed structure.  (That or cache effects.)

So slightly "better" than Jakub's variant would be

      if (bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
                                     phi_insertion_points))
        bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);

since phi_insertion_points grows, that IOR becomes more expensive over 
time.

The above for me (today Zen2, yesterday Zen4) is

 tree SSA rewrite                   : 181.02 ( 37%) 

with unconditional ior_into:

 tree SSA rewrite                   : 180.93 ( 36%)

while my patch is

 tree SSA rewrite                   :  22.04 (  6%)

Not sure what uarch Micha tested on.  I think the testcase simply has
many variables we write into SSA (many compute_idf calls), many BBs
but very low popcount DFS[], so iterating over DFS[] only is very
beneficial here as opposed to also walking phi_insertion_points
and work_set.  I think low popcount DFS[] is quite typical
for a CFG - but for sure the popcount of DFS[] is going to be lower
than the popcount of the IDF (phi_insertion_points).

Btw, with my patch compute_idf is completely off the profile, so it's
hard to improve further (we do possibly process blocks twice, for
example, but that doesn't make a difference here).

Indeed, gathering statistics shows the maximum popcount of a dominance
frontier is 8, but 99% have just a single block.  The popcount
of the final IDF, though, is more than 10000 half of the time and more
than 1000 90% of the time.
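The loop being tuned can be sketched with a single-word bitmap model (a
hypothetical simplification of compute_idf in gcc/cfganal.cc; real GCC
bitmaps are sparse linked lists):

```c
#include <assert.h>

/* Hypothetical model: bit i of a mask stands for basic block i,
   and dfs[i] is the precomputed dominance frontier of block i.  */
#define NBLOCKS 5

static unsigned
compute_idf_model (unsigned def_blocks, const unsigned dfs[NBLOCKS])
{
  unsigned work_set = def_blocks;
  unsigned phi_insertion_points = 0;

  while (work_set)
    {
      /* bitmap_clear_first_set_bit: pop the lowest-numbered block.  */
      int bb_index = __builtin_ctz (work_set);
      work_set &= work_set - 1;

      /* Iterate over DFS[bb_index]; push a block onto the worklist
	 only when it is new to the IDF.  */
      for (int i = 0; i < NBLOCKS; i++)
	if ((dfs[bb_index] & (1u << i))
	    && !(phi_insertion_points & (1u << i)))
	  {
	    phi_insertion_points |= 1u << i;
	    work_set |= 1u << i;
	  }
    }
  return phi_insertion_points;
}
```

For a diamond CFG 0 -> {1, 2} -> 3 with DF(1) = DF(2) = {3}, a
definition in block 1 (or in both arms) yields the IDF {3}, the join
block where a PHI is needed.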

I have pushed the patch now.

Richard.


* [PATCH] middle-end/114480 - IDF compute is slow
@ 2024-03-27 15:42 Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2024-03-27 15:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek

The testcase in this PR shows very slow IDF compute:

  tree SSA rewrite                   :  76.99 ( 31%)
  24.78%        243663  cc1plus  cc1plus             [.] compute_idf

which can be mitigated to some extent by refactoring the bitmap
operations to simpler variants.  With the patch below this becomes

  tree SSA rewrite                   :  15.23 (  8%)

when not optimizing and in addition to that

  tree SSA incremental               : 181.52 ( 30%)

to

  tree SSA incremental               :  24.09 (  6%)

when optimizing.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

OK if that succeeds?

Thanks,
Richard.

	PR middle-end/114480
	* cfganal.cc (compute_idf): Use simpler bitmap iteration,
	touch work_set only when phi_insertion_points changed.
---
 gcc/cfganal.cc | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/gcc/cfganal.cc b/gcc/cfganal.cc
index 432775decf1..5ef629f677e 100644
--- a/gcc/cfganal.cc
+++ b/gcc/cfganal.cc
@@ -1701,8 +1701,7 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
          on earlier blocks first is better.
 	 ???  Basic blocks are by no means guaranteed to be ordered in
 	 optimal order for this iteration.  */
-      bb_index = bitmap_first_set_bit (work_set);
-      bitmap_clear_bit (work_set, bb_index);
+      bb_index = bitmap_clear_first_set_bit (work_set);
 
       /* Since the registration of NEW -> OLD name mappings is done
 	 separately from the call to update_ssa, when updating the SSA
@@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
       gcc_checking_assert (bb_index
 			   < (unsigned) last_basic_block_for_fn (cfun));
 
-      EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
-	                              0, i, bi)
-	{
+      EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
+	if (bitmap_set_bit (phi_insertion_points, i))
 	  bitmap_set_bit (work_set, i);
-	  bitmap_set_bit (phi_insertion_points, i);
-	}
     }
 
   return phi_insertion_points;
-- 
2.35.3

