* Re: [PATCH] middle-end/114480 - IDF compute is slow
[not found] <96606.124032711422300395@us-mta-154.us.mimecast.lan>
@ 2024-03-27 16:02 ` Jakub Jelinek
2024-03-27 18:44 ` Michael Matz
From: Jakub Jelinek @ 2024-03-27 16:02 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
On Wed, Mar 27, 2024 at 04:42:21PM +0100, Richard Biener wrote:
> PR middle-end/114480
> * cfganal.cc (compute_idf): Use simpler bitmap iteration,
> touch work_set only when phi_insertion_points changed.
> ---
> gcc/cfganal.cc | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/cfganal.cc b/gcc/cfganal.cc
> index 432775decf1..5ef629f677e 100644
> --- a/gcc/cfganal.cc
> +++ b/gcc/cfganal.cc
> @@ -1701,8 +1701,7 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> on earlier blocks first is better.
> ??? Basic blocks are by no means guaranteed to be ordered in
> optimal order for this iteration. */
> - bb_index = bitmap_first_set_bit (work_set);
> - bitmap_clear_bit (work_set, bb_index);
> + bb_index = bitmap_clear_first_set_bit (work_set);
>
> /* Since the registration of NEW -> OLD name mappings is done
> separately from the call to update_ssa, when updating the SSA
The above part is obviously fine.
> @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> gcc_checking_assert (bb_index
> < (unsigned) last_basic_block_for_fn (cfun));
>
> - EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> - 0, i, bi)
> - {
> + EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> + if (bitmap_set_bit (phi_insertion_points, i))
> bitmap_set_bit (work_set, i);
> - bitmap_set_bit (phi_insertion_points, i);
> - }
> }
I don't understand why the above is better.
Wouldn't it be best to do

  bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
			     phi_insertion_points);
  bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);

?
Jakub
* Re: [PATCH] middle-end/114480 - IDF compute is slow
2024-03-27 16:02 ` [PATCH] middle-end/114480 - IDF compute is slow Jakub Jelinek
@ 2024-03-27 18:44 ` Michael Matz
2024-03-27 18:57 ` Jakub Jelinek
2024-03-28 8:12 ` Richard Biener
From: Michael Matz @ 2024-03-27 18:44 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches
Hey,
On Wed, 27 Mar 2024, Jakub Jelinek wrote:
> > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > gcc_checking_assert (bb_index
> > < (unsigned) last_basic_block_for_fn (cfun));
> >
> > - EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > - 0, i, bi)
> > - {
> > + EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > + if (bitmap_set_bit (phi_insertion_points, i))
> > bitmap_set_bit (work_set, i);
> > - bitmap_set_bit (phi_insertion_points, i);
> > - }
> > }
>
> I don't understand why the above is better.
> Wouldn't it be best to do
> bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> phi_insertion_points);
> bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> ?
I had the same hunch, but:
1) One would have to make work_set be non-tree-view again (which with the
current structure is a wash anyway, and that makes sense as accesses to
work_set aren't heavily random here).
2) But doing that and using bitmap_ior.._into is still measurably slower:
on a reduced testcase with -O0 -fno-checking, proposed structure
(tree-view or not-tree-view workset doesn't matter):
tree SSA rewrite : 14.93 ( 12%) 0.01 ( 2%) 14.95 ( 12%) 27M ( 8%)
with non-tree-view, and your suggestion:
tree SSA rewrite : 20.68 ( 12%) 0.02 ( 4%) 20.75 ( 12%) 27M ( 8%)
I can only speculate that the usually extreme sparsity of the bitmaps in
question makes the setup costs of the two bitmap_ior calls actually more
expensive than the often skipped second call to bitmap_set_bit in Richi's
proposed structure. (That, or cache effects.)
Ciao,
Michael.
* Re: [PATCH] middle-end/114480 - IDF compute is slow
2024-03-27 18:44 ` Michael Matz
@ 2024-03-27 18:57 ` Jakub Jelinek
2024-03-28 8:12 ` Richard Biener
From: Jakub Jelinek @ 2024-03-27 18:57 UTC (permalink / raw)
To: Michael Matz; +Cc: Richard Biener, gcc-patches
On Wed, Mar 27, 2024 at 07:44:28PM +0100, Michael Matz wrote:
> Hey,
>
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
>
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > > gcc_checking_assert (bb_index
> > > < (unsigned) last_basic_block_for_fn (cfun));
> > >
> > > - EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > - 0, i, bi)
> > > - {
> > > + EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > + if (bitmap_set_bit (phi_insertion_points, i))
> > > bitmap_set_bit (work_set, i);
> > > - bitmap_set_bit (phi_insertion_points, i);
> > > - }
> > > }
> >
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >
> >   bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> > 			     phi_insertion_points);
> >   bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> >
> > ?
>
> I had the same hunch, but:
>
> 1) One would have to make work_set be non-tree-view again (which with the
> current structure is a wash anyway, and that makes sense as accesses to
> work_set aren't heavily random here).
>
> 2) But doing that and using bitmap_ior.._into is still measurably slower:
> on a reduced testcase with -O0 -fno-checking, proposed structure
> (tree-view or not-tree-view workset doesn't matter):
>
> tree SSA rewrite : 14.93 ( 12%) 0.01 ( 2%) 14.95 ( 12%) 27M ( 8%)
>
> with non-tree-view, and your suggestion:
>
> tree SSA rewrite : 20.68 ( 12%) 0.02 ( 4%) 20.75 ( 12%) 27M ( 8%)
>
> I can only speculate that the usually extreme sparsity of the bitmaps in
> question makes the setup costs of the two bitmap_ior calls actually more
> expensive than the often skipped second call to bitmap_set_bit in Richi's
> proposed structure. (That, or cache effects.)
Ok then.
Jakub
* Re: [PATCH] middle-end/114480 - IDF compute is slow
2024-03-27 18:44 ` Michael Matz
2024-03-27 18:57 ` Jakub Jelinek
@ 2024-03-28 8:12 ` Richard Biener
From: Richard Biener @ 2024-03-28 8:12 UTC (permalink / raw)
To: Michael Matz; +Cc: Jakub Jelinek, gcc-patches
On Wed, 27 Mar 2024, Michael Matz wrote:
> Hey,
>
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
>
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > > gcc_checking_assert (bb_index
> > > < (unsigned) last_basic_block_for_fn (cfun));
> > >
> > > - EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > - 0, i, bi)
> > > - {
> > > + EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > + if (bitmap_set_bit (phi_insertion_points, i))
> > > bitmap_set_bit (work_set, i);
> > > - bitmap_set_bit (phi_insertion_points, i);
> > > - }
> > > }
> >
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >
> >   bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> > 			     phi_insertion_points);
> >   bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> >
> > ?
>
> I had the same hunch, but:
>
> 1) One would have to make work_set be non-tree-view again (which with the
> current structure is a wash anyway, and that makes sense as accesses to
> work_set aren't heavily random here).
The tree-view is a wash indeed (I tried many things).
> 2) But doing that and using bitmap_ior.._into is still measurably slower:
> on a reduced testcase with -O0 -fno-checking, proposed structure
> (tree-view or not-tree-view workset doesn't matter):
>
> tree SSA rewrite : 14.93 ( 12%) 0.01 ( 2%) 14.95 ( 12%) 27M ( 8%)
>
> with non-tree-view, and your suggestion:
>
> tree SSA rewrite : 20.68 ( 12%) 0.02 ( 4%) 20.75 ( 12%) 27M ( 8%)
>
> I can only speculate that the usually extreme sparsity of the bitmaps in
> question makes the setup costs of the two bitmap_ior calls actually more
> expensive than the often skipped second call to bitmap_set_bit in Richi's
> proposed structure. (That, or cache effects.)
So slightly "better" than Jakub's variant would be

  if (bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
				 phi_insertion_points))
    bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);

since phi_insertion_points grows, that IOR becomes more expensive over
time.
The above for me (today Zen2, yesterday Zen4) is
tree SSA rewrite : 181.02 ( 37%)
with unconditional ior_into:
tree SSA rewrite : 180.93 ( 36%)
while my patch is
tree SSA rewrite : 22.04 ( 6%)
not sure what uarch Micha tested on. I think the testcase simply has
many variables we write into SSA (many compute_idf calls) and many BBs,
but very low popcount DFS[], so iterating over DFS[] only is very
beneficial here as opposed to also walking phi_insertion_points
and work_set. I think low popcount DFS[] is quite typical
for a CFG - but for sure the popcount of DFS[] is going to be lower
than the popcount of the IDF (phi_insertion_points).
Btw, with my patch compute_idf is completely off the profile, so it's
hard to improve further (we do possibly process blocks twice, for
example, but that doesn't make a difference here).
Indeed, gathering statistics shows that the maximum popcount of a
dominance frontier is 8, but 99% have just a single block. The popcount
of the final IDF, however, is more than 10000 half of the time and more
than 1000 90% of the time.
I have pushed the patch now.
Richard.
* [PATCH] middle-end/114480 - IDF compute is slow
@ 2024-03-27 15:42 Richard Biener
From: Richard Biener @ 2024-03-27 15:42 UTC (permalink / raw)
To: gcc-patches; +Cc: Jakub Jelinek
The testcase in this PR shows very slow IDF compute:
tree SSA rewrite : 76.99 ( 31%)
24.78% 243663 cc1plus cc1plus [.] compute_idf
which can be mitigated to some extent by refactoring the bitmap
operations to simpler variants. With the patch below this becomes
tree SSA rewrite : 15.23 ( 8%)
when not optimizing and in addition to that
tree SSA incremental : 181.52 ( 30%)
to
tree SSA incremental : 24.09 ( 6%)
when optimizing.
Bootstrap and regtest running on x86_64-unknown-linux-gnu.
OK if that succeeds?
Thanks,
Richard.
PR middle-end/114480
* cfganal.cc (compute_idf): Use simpler bitmap iteration,
touch work_set only when phi_insertion_points changed.
---
gcc/cfganal.cc | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/gcc/cfganal.cc b/gcc/cfganal.cc
index 432775decf1..5ef629f677e 100644
--- a/gcc/cfganal.cc
+++ b/gcc/cfganal.cc
@@ -1701,8 +1701,7 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
on earlier blocks first is better.
??? Basic blocks are by no means guaranteed to be ordered in
optimal order for this iteration. */
- bb_index = bitmap_first_set_bit (work_set);
- bitmap_clear_bit (work_set, bb_index);
+ bb_index = bitmap_clear_first_set_bit (work_set);
/* Since the registration of NEW -> OLD name mappings is done
separately from the call to update_ssa, when updating the SSA
@@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
gcc_checking_assert (bb_index
< (unsigned) last_basic_block_for_fn (cfun));
- EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
- 0, i, bi)
- {
+ EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
+ if (bitmap_set_bit (phi_insertion_points, i))
bitmap_set_bit (work_set, i);
- bitmap_set_bit (phi_insertion_points, i);
- }
}
return phi_insertion_points;
--
2.35.3