Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Thomas Schwinge <tschwinge@baylibre.com>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: jlaw@ventanamicro.com, rdapp.gcc@gmail.com,
	gcc-patches@gcc.gnu.org, Tom de Vries <tdevries@suse.de>,
	Roger Sayle <roger@nextmovesoftware.com>
Subject: Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]
Date: Mon, 01 Jul 2024 13:55:32 +0200	[thread overview]
Message-ID: <87cynxuy17.fsf@euler.schwinge.ddns.net> (raw)
In-Reply-To: <mpt8qypm2sh.fsf@arm.com>

Hi Richard!

On 2024-06-28T17:48:30+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> Richard Sandiford <richard.sandiford@arm.com> writes:
>> Thomas Schwinge <tschwinge@baylibre.com> writes:
>>> On 2024-06-27T23:20:18+0200, I wrote:
>>>> On 2024-06-27T22:27:21+0200, I wrote:
>>>>> On 2024-06-27T18:49:17+0200, I wrote:
>>>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>>>> This patch adds a combine pass that runs late in the pipeline.
>>>>>
>>>>> [After sending, I realized I replied to a previous thread of this work.]
>>>>>
>>>>>> I've beek looking a bit through recent nvptx target code generation
>>>>>> changes for GCC target libraries, and thought I'd also share here my
>>>>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>>>> 
>>>>>> First the unexpected thing:
>>>>>
>>>>> So much for "unexpected thing" -- next level of unexpected here...
>>>>> Appreciated if anyone feels like helping me find my way through this, but
>>>>> I totally understand if you've got other things to do.
>>>>
>>>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>>>
>>>>>> there are a few cases where we now see unused
>>>>>> registers get declared
>>>
>>>> But in fact, for both cases
>>>
>>> Now tested: 's%both%all'.  :-)
>>>
>>>> the unexpected difference goes away if after
>>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>>>> all not active for nvptx target given '!reload_completed', given nvptx is
>>>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>>>> passes, or is there anything in 'pass_late_combine' to change, so that we
>>>> don't run into this?  Does it inadvertently mark registers live or
>>>> something like that?
>>>
>>> Basically, is 'pass_late_combine' potentionally doing things that depend
>>> on later clean-up?  (..., or shouldn't it be doing these things in the
>>> first place?)
>>
>> It's possible that late-combine could expose dead code, but I imagine
>> it's a niche case.
>>
>> I had a look at the nvptx logs from my comparison, and the cases in
>> which I saw this seemed to be those where late-combine doesn't find
>> anything to do.  Does that match your examples?  Specifically,
>> the effect should be the same with -fdbg-cnt=late_combine:0-0
>>
>> I think what's happening is that:
>>
>> - combine exposes dead code
>>
>> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
>>   up the dead code
>>
>> - late-combine instead runs df_analyze without that flag (since late-combine
>>   itself doesn't really care whether dead code is present)
>>
>> - if late-combine doesn't do anything, ce2's df_analyze call has nothing
>>   to do, and skips even the DCE
>>
>> The easiest fix would be to add:
>>
>>   df_set_flags (DF_LR_RUN_DCE);
>>
>> before df_analyze in late-combine.cc, so that it behaves like ce2.
>> But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
>> to depend on whether df_analyze had been called since the last DCE pass
>> (whether DF_LR_RUN_DCE or a full DCE).
>
> I'm testing the attached patch to do that.  I'll submit it properly if
> testing passes, but it seems to fix the extra-register problem for me.

> Give fast DCE a separate dirty flag

Thanks, and yes, your analysis makes sense to me (to the extent that I
only superficially understand these parts of GCC) -- and I confirm that
your proposed change to "Give fast DCE a separate dirty flag" does
address the issue for nvptx target.


Grüße
 Thomas


> Thomas pointed out that we sometimes failed to eliminate some dead code
> (specifically clobbers of otherwise unused registers) on nvptx when
> late-combine is enabled.  This happens because:
>
> - combine is able to optimise the function in a way that exposes dead code.
>   This leaves the df information in a "dirty" state.
>
> - late_combine calls df_analyze without DF_LR_RUN_DCE run set.
>   This updates the df information and clears the "dirty" state.
>
> - late_combine doesn't find any extra optimisations, and so leaves
>   the df information up-to-date.
>
> - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
>   Because the df information is already up-to-date, fast DCE is
>   not run.
>
> The upshot is that running late-combine has the effect of suppressing
> a DCE opportunity that would have been noticed without late_combine.
>
> I think this shows that we should track the state of the DCE separately
> from the LR problem.  Every pass updates the latter, but not all passes
> update the former.
>
> gcc/
> 	* df.h (DF_LR_DCE): New df_problem_id.
> 	(df_lr_dce): New macro.
> 	* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
> 	* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
> 	(df_lr_dce_finalize): ...this new function.
> 	(problem_LR_DCE): New df_problem.
> 	(df_lr_add_problem): Register LR_DCE rather than LR itself.
> 	* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.
> ---
>  gcc/dce.cc         |  3 ++
>  gcc/df-core.cc     |  3 +-
>  gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++--------------
>  gcc/df.h           |  2 +
>  4 files changed, 74 insertions(+), 30 deletions(-)
>
> diff --git a/gcc/dce.cc b/gcc/dce.cc
> index be1a2a87732..04e8d98818d 100644
> --- a/gcc/dce.cc
> +++ b/gcc/dce.cc
> @@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
>    BITMAP_FREE (processed);
>    BITMAP_FREE (redo_out);
>    BITMAP_FREE (all_blocks);
> +
> +  /* Both forms of DCE should make further DCE unnecessary.  */
> +  df_lr_dce->solutions_dirty = false;
>  }
>  
>  
> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index b0e8a88d433..8fd778a8618 100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void)
>    for (i = 0; i < df->num_problems_defined; i++)
>      {
>        struct dataflow *dflow = df->problems_in_order[i];
> -      dflow->problem->free_fun ();
> +      if (dflow->problem->free_fun)
> +	dflow->problem->free_fun ();
>      }
>  
>    free (df->postorder);
> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
> index 88ee0dd67fc..bfd24bd1e86 100644
> --- a/gcc/df-problems.cc
> +++ b/gcc/df-problems.cc
> @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index)
>  }
>  
>  
> -/* Run the fast dce as a side effect of building LR.  */
> -
>  static void
> -df_lr_finalize (bitmap all_blocks)
> +df_lr_finalize (bitmap)
>  {
>    df_lr->solutions_dirty = false;
> -  if (df->changeable_flags & DF_LR_RUN_DCE)
> -    {
> -      run_fast_df_dce ();
> -
> -      /* If dce deletes some instructions, we need to recompute the lr
> -	 solution before proceeding further.  The problem is that fast
> -	 dce is a pessimestic dataflow algorithm.  In the case where
> -	 it deletes a statement S inside of a loop, the uses inside of
> -	 S may not be deleted from the dataflow solution because they
> -	 were carried around the loop.  While it is conservatively
> -	 correct to leave these extra bits, the standards of df
> -	 require that we maintain the best possible (least fixed
> -	 point) solution.  The only way to do that is to redo the
> -	 iteration from the beginning.  See PR35805 for an
> -	 example.  */
> -      if (df_lr->solutions_dirty)
> -	{
> -	  df_clear_flags (DF_LR_RUN_DCE);
> -	  df_lr_alloc (all_blocks);
> -	  df_lr_local_compute (all_blocks);
> -	  df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> -	  df_lr_finalize (all_blocks);
> -	  df_set_flags (DF_LR_RUN_DCE);
> -	}
> -    }
>  }
>  
>  
> @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR =
>    false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
>  };
>  
> +/* Run the fast DCE after building LR.  This is a separate problem so that
> +   the "dirty" flag is only cleared after a DCE pass is actually run.  */
> +
> +static void
> +df_lr_dce_finalize (bitmap all_blocks)
> +{
> +  if (!(df->changeable_flags & DF_LR_RUN_DCE))
> +    return;
> +
> +  /* Also clears df_lr_dce->solutions_dirty.  */
> +  run_fast_df_dce ();
> +
> +  /* If dce deletes some instructions, we need to recompute the lr
> +     solution before proceeding further.  The problem is that fast
> +     dce is a pessimestic dataflow algorithm.  In the case where
> +     it deletes a statement S inside of a loop, the uses inside of
> +     S may not be deleted from the dataflow solution because they
> +     were carried around the loop.  While it is conservatively
> +     correct to leave these extra bits, the standards of df
> +     require that we maintain the best possible (least fixed
> +     point) solution.  The only way to do that is to redo the
> +     iteration from the beginning.  See PR35805 for an
> +     example.  */
> +  if (df_lr->solutions_dirty)
> +    {
> +      df_clear_flags (DF_LR_RUN_DCE);
> +      df_lr_alloc (all_blocks);
> +      df_lr_local_compute (all_blocks);
> +      df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> +      df_lr_finalize (all_blocks);
> +      df_set_flags (DF_LR_RUN_DCE);
> +    }
> +}
> +
> +static const struct df_problem problem_LR_DCE
> +{
> +  DF_LR_DCE,                  /* Problem id.  */
> +  DF_BACKWARD,                /* Direction (arbitrary).  */
> +  NULL,                       /* Allocate the problem specific data.  */
> +  NULL,                       /* Reset global information.  */
> +  NULL,                       /* Free basic block info.  */
> +  NULL,                       /* Local compute function.  */
> +  NULL,                       /* Init the solution specific data.  */
> +  NULL,                       /* Worklist solver.  */
> +  NULL,                       /* Confluence operator 0.  */
> +  NULL,                       /* Confluence operator n.  */
> +  NULL,                       /* Transfer function.  */
> +  df_lr_dce_finalize,         /* Finalize function.  */
> +  NULL,                       /* Free all of the problem information.  */
> +  NULL,                       /* Remove this problem from the stack of dataflow problems.  */
> +  NULL,                       /* Debugging.  */
> +  NULL,                       /* Debugging start block.  */
> +  NULL,                       /* Debugging end block.  */
> +  NULL,                       /* Debugging start insn.  */
> +  NULL,                       /* Debugging end insn.  */
> +  NULL,                       /* Incremental solution verify start.  */
> +  NULL,                       /* Incremental solution verify end.  */
> +  &problem_LR,                /* Dependent problem.  */
> +  0,                          /* Size of entry of block_info array.  */
> +  TV_DF_LR,                   /* Timing variable.  */
> +  false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
> +};
> +
>  
>  /* Create a new DATAFLOW instance and add it to an existing instance
>     of DF.  The returned structure is what is used to get at the
> @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR =
>  void
>  df_lr_add_problem (void)
>  {
> -  df_add_problem (&problem_LR);
> +  /* Also add the fast DCE problem.  It is then conditionally enabled by
> +     the DF_LR_RUN_DCE flag.  */
> +  df_add_problem (&problem_LR_DCE);
>    /* These will be initialized when df_scan_blocks processes each
>       block.  */
>    df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack);
> diff --git a/gcc/df.h b/gcc/df.h
> index c4e690b40cf..bbb4f297b47 100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -47,6 +47,7 @@ enum df_problem_id
>    {
>      DF_SCAN,
>      DF_LR,                /* Live Registers backward. */
> +    DF_LR_DCE,            /* Dead code elimination post-pass for LR.  */
>      DF_LIVE,              /* Live Registers & Uninitialized Registers */
>      DF_RD,                /* Reaching Defs. */
>      DF_CHAIN,             /* Def-Use and/or Use-Def Chains. */
> @@ -940,6 +941,7 @@ extern class df_d *df;
>  #define df_scan    (df->problems_by_index[DF_SCAN])
>  #define df_rd      (df->problems_by_index[DF_RD])
>  #define df_lr      (df->problems_by_index[DF_LR])
> +#define df_lr_dce  (df->problems_by_index[DF_LR_DCE])
>  #define df_live    (df->problems_by_index[DF_LIVE])
>  #define df_chain   (df->problems_by_index[DF_CHAIN])
>  #define df_word_lr (df->problems_by_index[DF_WORD_LR])
> -- 
> 2.25.1

next prev parent reply	other threads:[~2024-07-01 11:55 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-24 18:49 Richard Sandiford
2023-11-30 14:10 ` Ping: " Richard Sandiford
2023-12-11 15:23 ` Richard Sandiford
2023-12-11 16:18   ` Robin Dapp
2023-12-30 15:35 ` Ping^3: " Richard Sandiford
2024-01-01  3:11   ` YunQiang Su
2024-01-05 10:10     ` YunQiang Su
2023-12-30 18:13 ` Segher Boessenkool
2024-01-02  9:47   ` Richard Sandiford
2024-06-24 19:37     ` Segher Boessenkool
2024-06-25 10:31       ` Richard Biener
2024-06-25 17:22         ` YunQiang Su
2024-01-03  4:20 ` Jeff Law
2024-01-05 17:35   ` Richard Sandiford
2024-01-08  5:03     ` Jeff Law
2024-01-08 11:52       ` Richard Sandiford
2024-01-08 16:14         ` Jeff Law
2024-01-08 16:59           ` Richard Sandiford
2024-01-08 17:10             ` Jeff Law
2024-01-08 19:11               ` Richard Sandiford
2024-01-08 21:42                 ` Jeff Law
2024-01-10 13:01     ` Richard Sandiford
2024-01-10 13:35       ` Richard Biener
2024-01-10 16:27         ` Jeff Law
2024-01-10 16:40       ` Jeff Law
2024-06-21  4:50 ` Hongtao Liu
2024-06-27 16:49 ` nvptx vs. " Thomas Schwinge
2024-06-27 20:27   ` Thomas Schwinge
2024-06-27 21:20     ` Thomas Schwinge
2024-06-27 22:41       ` Thomas Schwinge
2024-06-28 14:01         ` Richard Sandiford
2024-06-28 16:48           ` Richard Sandiford
2024-07-01 11:55             ` Thomas Schwinge [this message]
2024-07-01 11:55         ` WIP Move 'pass_fast_rtl_dce' from 'pass_postreload' into 'pass_late_compilation' (was: nvptx vs. [PATCH] Add a late-combine pass [PR106594]) Thomas Schwinge
2024-06-28  6:07       ` nvptx vs. [PATCH] Add a late-combine pass [PR106594] Roger Sayle

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87cynxuy17.fsf@euler.schwinge.ddns.net \
    --to=tschwinge@baylibre.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jlaw@ventanamicro.com \
    --cc=rdapp.gcc@gmail.com \
    --cc=richard.sandiford@arm.com \
    --cc=roger@nextmovesoftware.com \
    --cc=tdevries@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).