replacing the backwards threader and more

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* replacing the backwards threader and more
@ 2021-06-09 11:48 Aldy Hernandez
  2021-06-09 12:09 ` Richard Biener
  2021-06-13 21:59 ` Jeff Law
  0 siblings, 2 replies; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-09 11:48 UTC (permalink / raw)
  To: Jeff Law; +Cc: Andrew MacLeod, GCC Mailing List

Hi Jeff.  Hi folks.

What started as a foray into severing the old (forward) threader's 
dependency on evrp, turned into a rewrite of the backwards threader 
code.  I'd like to discuss the possibility of replacing the current 
backwards threader with a new one that gets far more threads and can 
potentially subsume all threaders in the future.

I won't include code here, as it will just detract from the high level 
discussion.  But if it helps, I could post what I have, which just needs 
some cleanups and porting to the latest trunk changes Andrew has made.

Currently the backwards threader works by traversing DEF chains through 
PHIs leading to possible paths that start in a constant.  When such a 
path is found, it is checked to see if it is profitable, and if so, the 
constant path is threaded.  The current implementation is rather limited 
since backwards paths must end in a constant.  For example, the 
backwards threader can't get any of the tests in 
gcc.dg/tree-ssa/ssa-thread-14.c:

   if (a && b)
     foo ();
   if (!b && c)
     bar ();

etc.

After my refactoring patches to the threading code, it is now possible 
to drop in an alternate implementation that shares the profitability 
code (is this path profitable?), the jump registry, and the actual jump 
threading code.  I have leveraged this to write a ranger-based threader 
that gets every single thread the current code gets, plus 90-130% more.

Here are the details from the branch, which should be very similar to 
trunk.  I'm presenting the branch numbers because they contain Andrew's 
upcoming relational query which significantly juices up the results.

New threader:
          ethread:65043    (+3.06%)
          dom:32450      (-13.3%)
          backwards threader:72482   (+89.6%)
          vrp:40532      (-30.7%)
   Total threaded:  210507 (+6.70%)

This means that the new code gets 89.6% more jump threading 
opportunities than the code I want to replace.  In doing so, it reduces 
the amount of DOM threading opportunities by 13.3% and by 30.7% from the 
VRP jump threader.  The total  improvement across the jump threading 
opportunities in the compiler is 6.70%.

However, these are pessimistic numbers...

I have noticed that some of the threading opportunities that DOM and VRP 
now get are not because they're smarter, but because they're picking up 
opportunities that the new code exposes.  I experimented with running an 
iterative threader, and then seeing what VRP and DOM could actually get. 
  This is too expensive to do in real life, but it at least shows what 
the effect of the new code is on DOM/VRP's abilities:

   Iterative threader:
     ethread:65043    (+3.06%)
     dom:31170    (-16.7%)
         thread:86717    (+127%)
         vrp:33851    (-42.2%)
   Total threaded:  216781 (+9.90%)

This means that the new code not only gets 127% more cases, but it 
reduces the DOM and VRP opportunities considerably (16.7% and 42.2% 
respectively).   The end result is that we have the possibility of 
getting almost 10% more jump threading opportunities in the entire 
compilation run.

(Note that the new code gets even more opportunities, but I'm only 
reporting the profitable ones that made it all the way through to the 
threader backend, and actually eliminated a branch.)

The overall compilation hit from this work is currently 1.38% as 
measured by callgrind.  We should be able to reduce this a bit, plus we 
could get some of that back if we can replace the DOM and VRP threaders 
(future work).

My proposed implementation should be able to get any threading 
opportunity, and will get more as range-ops and ranger improve.

I can go into the details if necessary, but the gist of it is that we 
leverage the import facility in the ranger to only look up paths that 
have a direct repercussion in the conditional being threaded, thus 
reducing the search space.  This enhanced path discovery, plus an engine 
to resolve conditionals based on knowledge from a CFG path, is all that 
is needed to register new paths.  There is no limit to how far back we 
look, though in practice, we stop looking once a path is too expensive 
to continue the search in a given direction.

The solver API is simple:

// This class is a thread path solver.  Given a set of BBs indicating
// a path through the CFG, range_in_path() will return the range
// of an SSA as if the BBs in the path would have been executed in
// order.
//
// Note that the blocks are in reverse order, thus the exit block is 
path[0].

class thread_solver : gori_compute
{

public:
   thread_solver (gimple_ranger &ranger);
   virtual ~thread_solver ();
   void set_path (const vec<basic_block> *, const bitmap_head *imports);
   void range_in_path (irange &, tree name);
   void range_in_path (irange &, gimple *);
...
};

Basically, as we're discovering paths, we ask the solver what the value 
of the final conditional in a BB is in a given path.  If it resolves, we 
register the path.

A follow-up project would be to analyze what DOM/VRP are actually 
getting that we don't, because in theory with an enhanced ranger, we 
should be able to get everything they do (minus some float stuff, and 
some CSE things DOM does).  However, IMO, this is good enough to at 
least replace the current backwards threading code.

My suggestion would be to keep both implementations, defaulting to the 
ranger based, and running the old code immediately after-- trapping if 
it can find any threading opportunities.  After a few weeks, we could 
kill the old code.

Thoughts?

Aldy

p.s. BTW, ranger-based is technically a minomer.  It's gori based.  We 
don't need the entire ranger caching ability here.  I'm only using it to 
get the imports for the interesting conditionals, since those are static.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 11:48 replacing the backwards threader and more Aldy Hernandez
@ 2021-06-09 12:09 ` Richard Biener
  2021-06-09 15:34   ` Aldy Hernandez
  2021-06-21 14:40   ` Aldy Hernandez
  2021-06-13 21:59 ` Jeff Law
  1 sibling, 2 replies; 25+ messages in thread
From: Richard Biener @ 2021-06-09 12:09 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Jeff Law, GCC Mailing List

On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi Jeff.  Hi folks.
>
> What started as a foray into severing the old (forward) threader's
> dependency on evrp, turned into a rewrite of the backwards threader
> code.  I'd like to discuss the possibility of replacing the current
> backwards threader with a new one that gets far more threads and can
> potentially subsume all threaders in the future.
>
> I won't include code here, as it will just detract from the high level
> discussion.  But if it helps, I could post what I have, which just needs
> some cleanups and porting to the latest trunk changes Andrew has made.
>
> Currently the backwards threader works by traversing DEF chains through
> PHIs leading to possible paths that start in a constant.  When such a
> path is found, it is checked to see if it is profitable, and if so, the
> constant path is threaded.  The current implementation is rather limited
> since backwards paths must end in a constant.  For example, the
> backwards threader can't get any of the tests in
> gcc.dg/tree-ssa/ssa-thread-14.c:
>
>    if (a && b)
>      foo ();
>    if (!b && c)
>      bar ();
>
> etc.
>
> After my refactoring patches to the threading code, it is now possible
> to drop in an alternate implementation that shares the profitability
> code (is this path profitable?), the jump registry, and the actual jump
> threading code.  I have leveraged this to write a ranger-based threader
> that gets every single thread the current code gets, plus 90-130% more.
>
> Here are the details from the branch, which should be very similar to
> trunk.  I'm presenting the branch numbers because they contain Andrew's
> upcoming relational query which significantly juices up the results.
>
> New threader:
>           ethread:65043    (+3.06%)
>           dom:32450      (-13.3%)
>           backwards threader:72482   (+89.6%)
>           vrp:40532      (-30.7%)
>    Total threaded:  210507 (+6.70%)
>
> This means that the new code gets 89.6% more jump threading
> opportunities than the code I want to replace.  In doing so, it reduces
> the amount of DOM threading opportunities by 13.3% and by 30.7% from the
> VRP jump threader.  The total  improvement across the jump threading
> opportunities in the compiler is 6.70%.
>
> However, these are pessimistic numbers...
>
> I have noticed that some of the threading opportunities that DOM and VRP
> now get are not because they're smarter, but because they're picking up
> opportunities that the new code exposes.  I experimented with running an
> iterative threader, and then seeing what VRP and DOM could actually get.
>   This is too expensive to do in real life, but it at least shows what
> the effect of the new code is on DOM/VRP's abilities:
>
>    Iterative threader:
>      ethread:65043    (+3.06%)
>      dom:31170    (-16.7%)
>          thread:86717    (+127%)
>          vrp:33851    (-42.2%)
>    Total threaded:  216781 (+9.90%)
>
> This means that the new code not only gets 127% more cases, but it
> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
> respectively).   The end result is that we have the possibility of
> getting almost 10% more jump threading opportunities in the entire
> compilation run.

Yeah, DOM once was iterating ...

You probably have noticed that we have very man (way too many)
'thread' passes, often in close succession with each other or
DOM or VRP.  So in the above numbers I wonder if you can break
down the numbers individually for the actual passes (in their order)?

> (Note that the new code gets even more opportunities, but I'm only
> reporting the profitable ones that made it all the way through to the
> threader backend, and actually eliminated a branch.)
>
> The overall compilation hit from this work is currently 1.38% as
> measured by callgrind.  We should be able to reduce this a bit, plus we
> could get some of that back if we can replace the DOM and VRP threaders
> (future work).
>
> My proposed implementation should be able to get any threading
> opportunity, and will get more as range-ops and ranger improve.
>
> I can go into the details if necessary, but the gist of it is that we
> leverage the import facility in the ranger to only look up paths that
> have a direct repercussion in the conditional being threaded, thus
> reducing the search space.  This enhanced path discovery, plus an engine
> to resolve conditionals based on knowledge from a CFG path, is all that
> is needed to register new paths.  There is no limit to how far back we
> look, though in practice, we stop looking once a path is too expensive
> to continue the search in a given direction.
>
> The solver API is simple:
>
> // This class is a thread path solver.  Given a set of BBs indicating
> // a path through the CFG, range_in_path() will return the range
> // of an SSA as if the BBs in the path would have been executed in
> // order.
> //
> // Note that the blocks are in reverse order, thus the exit block is
> path[0].
>
> class thread_solver : gori_compute
> {
>
> public:
>    thread_solver (gimple_ranger &ranger);
>    virtual ~thread_solver ();
>    void set_path (const vec<basic_block> *, const bitmap_head *imports);
>    void range_in_path (irange &, tree name);
>    void range_in_path (irange &, gimple *);
> ...
> };
>
> Basically, as we're discovering paths, we ask the solver what the value
> of the final conditional in a BB is in a given path.  If it resolves, we
> register the path.
>
> A follow-up project would be to analyze what DOM/VRP are actually
> getting that we don't, because in theory with an enhanced ranger, we
> should be able to get everything they do (minus some float stuff, and
> some CSE things DOM does).  However, IMO, this is good enough to at
> least replace the current backwards threading code.
>
> My suggestion would be to keep both implementations, defaulting to the
> ranger based, and running the old code immediately after-- trapping if
> it can find any threading opportunities.

But due to iteration uncovering new opportunities this will inevitably
break, no?

> After a few weeks, we could
> kill the old code.

Note that for analyzing threadings done apart from looking at overall
numbers the statistics infrastructure can be useful, likewise could
be the opt-info one where you can diff stats based on file or function
(statistics) or even location of a participating jump (opt-info).

If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway you
probably only have to amend one or two places.  I'm personally
breaking things down to file/function via statistics to spot gross
differences more locally.

IMHO removing threading from VRP (as a step to make "VRP"
into another EVRP run) should be part of the initial transition,
there's always a thread pass nearby.  Performing threading from
EVRP itself might be another option to evaluate.  Trimming down
the number of (now backwards-)threaders would be another goal.

Richard.

> Thoughts?
>
> Aldy
>
> p.s. BTW, ranger-based is technically a minomer.  It's gori based.  We
> don't need the entire ranger caching ability here.  I'm only using it to
> get the imports for the interesting conditionals, since those are static.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 12:09 ` Richard Biener
@ 2021-06-09 15:34   ` Aldy Hernandez
  2021-06-09 16:18     ` Richard Biener
  2021-06-09 19:47     ` Jeff Law
  2021-06-21 14:40   ` Aldy Hernandez
  1 sibling, 2 replies; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-09 15:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, GCC Mailing List



On 6/9/21 2:09 PM, Richard Biener wrote:
> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> Hi Jeff.  Hi folks.
>>
>> What started as a foray into severing the old (forward) threader's
>> dependency on evrp, turned into a rewrite of the backwards threader
>> code.  I'd like to discuss the possibility of replacing the current
>> backwards threader with a new one that gets far more threads and can
>> potentially subsume all threaders in the future.
>>
>> I won't include code here, as it will just detract from the high level
>> discussion.  But if it helps, I could post what I have, which just needs
>> some cleanups and porting to the latest trunk changes Andrew has made.
>>
>> Currently the backwards threader works by traversing DEF chains through
>> PHIs leading to possible paths that start in a constant.  When such a
>> path is found, it is checked to see if it is profitable, and if so, the
>> constant path is threaded.  The current implementation is rather limited
>> since backwards paths must end in a constant.  For example, the
>> backwards threader can't get any of the tests in
>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>
>>     if (a && b)
>>       foo ();
>>     if (!b && c)
>>       bar ();
>>
>> etc.
>>
>> After my refactoring patches to the threading code, it is now possible
>> to drop in an alternate implementation that shares the profitability
>> code (is this path profitable?), the jump registry, and the actual jump
>> threading code.  I have leveraged this to write a ranger-based threader
>> that gets every single thread the current code gets, plus 90-130% more.
>>
>> Here are the details from the branch, which should be very similar to
>> trunk.  I'm presenting the branch numbers because they contain Andrew's
>> upcoming relational query which significantly juices up the results.
>>
>> New threader:
>>            ethread:65043    (+3.06%)
>>            dom:32450      (-13.3%)
>>            backwards threader:72482   (+89.6%)
>>            vrp:40532      (-30.7%)
>>     Total threaded:  210507 (+6.70%)
>>
>> This means that the new code gets 89.6% more jump threading
>> opportunities than the code I want to replace.  In doing so, it reduces
>> the amount of DOM threading opportunities by 13.3% and by 30.7% from the
>> VRP jump threader.  The total  improvement across the jump threading
>> opportunities in the compiler is 6.70%.
>>
>> However, these are pessimistic numbers...
>>
>> I have noticed that some of the threading opportunities that DOM and VRP
>> now get are not because they're smarter, but because they're picking up
>> opportunities that the new code exposes.  I experimented with running an
>> iterative threader, and then seeing what VRP and DOM could actually get.
>>    This is too expensive to do in real life, but it at least shows what
>> the effect of the new code is on DOM/VRP's abilities:
>>
>>     Iterative threader:
>>       ethread:65043    (+3.06%)
>>       dom:31170    (-16.7%)
>>           thread:86717    (+127%)
>>           vrp:33851    (-42.2%)
>>     Total threaded:  216781 (+9.90%)
>>
>> This means that the new code not only gets 127% more cases, but it
>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>> respectively).   The end result is that we have the possibility of
>> getting almost 10% more jump threading opportunities in the entire
>> compilation run.
> 
> Yeah, DOM once was iterating ...
> 
> You probably have noticed that we have very man (way too many)
> 'thread' passes, often in close succession with each other or
> DOM or VRP.  So in the above numbers I wonder if you can break
> down the numbers individually for the actual passes (in their order)?

Sure, I can do that.  Let me whip up the old branch and gather some info.

> 
>> (Note that the new code gets even more opportunities, but I'm only
>> reporting the profitable ones that made it all the way through to the
>> threader backend, and actually eliminated a branch.)
>>
>> The overall compilation hit from this work is currently 1.38% as
>> measured by callgrind.  We should be able to reduce this a bit, plus we
>> could get some of that back if we can replace the DOM and VRP threaders
>> (future work).
>>
>> My proposed implementation should be able to get any threading
>> opportunity, and will get more as range-ops and ranger improve.
>>
>> I can go into the details if necessary, but the gist of it is that we
>> leverage the import facility in the ranger to only look up paths that
>> have a direct repercussion in the conditional being threaded, thus
>> reducing the search space.  This enhanced path discovery, plus an engine
>> to resolve conditionals based on knowledge from a CFG path, is all that
>> is needed to register new paths.  There is no limit to how far back we
>> look, though in practice, we stop looking once a path is too expensive
>> to continue the search in a given direction.
>>
>> The solver API is simple:
>>
>> // This class is a thread path solver.  Given a set of BBs indicating
>> // a path through the CFG, range_in_path() will return the range
>> // of an SSA as if the BBs in the path would have been executed in
>> // order.
>> //
>> // Note that the blocks are in reverse order, thus the exit block is
>> path[0].
>>
>> class thread_solver : gori_compute
>> {
>>
>> public:
>>     thread_solver (gimple_ranger &ranger);
>>     virtual ~thread_solver ();
>>     void set_path (const vec<basic_block> *, const bitmap_head *imports);
>>     void range_in_path (irange &, tree name);
>>     void range_in_path (irange &, gimple *);
>> ...
>> };
>>
>> Basically, as we're discovering paths, we ask the solver what the value
>> of the final conditional in a BB is in a given path.  If it resolves, we
>> register the path.
>>
>> A follow-up project would be to analyze what DOM/VRP are actually
>> getting that we don't, because in theory with an enhanced ranger, we
>> should be able to get everything they do (minus some float stuff, and
>> some CSE things DOM does).  However, IMO, this is good enough to at
>> least replace the current backwards threading code.
>>
>> My suggestion would be to keep both implementations, defaulting to the
>> ranger based, and running the old code immediately after-- trapping if
>> it can find any threading opportunities.
> 
> But due to iteration uncovering new opportunities this will inevitably
> break, no?

No, actually because I'm comparing current and new backwards threader 
behavior on the same IL, something I can't do with VRP because the IL is 
slightly different (VRP asserts).

So I can compare apples to apples in the backwards threader code.  What 
I do is run the new code first, and then run the old code on the same IL 
before the threader has altered the CFG, while asserting that the old 
code cannot register any "new" paths not already present in the registry.

I have tested this assertion throughout 200+ .ii files from a bootstrap, 
and there has never been a case where the old code can get something we 
can't.

> 
>> After a few weeks, we could
>> kill the old code.
> 
> Note that for analyzing threadings done apart from looking at overall
> numbers the statistics infrastructure can be useful, likewise could
> be the opt-info one where you can diff stats based on file or function
> (statistics) or even location of a participating jump (opt-info).
> 
> If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway you
> probably only have to amend one or two places.  I'm personally
> breaking things down to file/function via statistics to spot gross
> differences more locally.

I am indeed re-using all of tree-ssa-thread{edge,update}.  That is 
unchanged.  I've been using the overall statistics plus an awk script to 
collate it all.  But thanks for the opt-info tip.  I didn't know about that.

> 
> IMHO removing threading from VRP (as a step to make "VRP"
> into another EVRP run) should be part of the initial transition,
> there's always a thread pass nearby.  Performing threading from
> EVRP itself might be another option to evaluate.  Trimming down
> the number of (now backwards-)threaders would be another goal.

Agreed.  I will have to sit down and investigate what VRP is getting. 
Last I peeked it was either stuff range-ops could be taught, or 
unconditional jumps to unconditional jumps that could be easily handled. 
  However, I think we can prove that the current backwards threader code 
cannot get anything in the presence of the new code, and could be easily 
replaced before concentrating on DOM/VRP.

That being said, figuring out exactly the discrepancies with DOM/VRP is 
on my short-term radar ;-).

Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 15:34   ` Aldy Hernandez
@ 2021-06-09 16:18     ` Richard Biener
  2021-06-09 19:47     ` Jeff Law
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Biener @ 2021-06-09 16:18 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Jeff Law, GCC Mailing List

On June 9, 2021 5:34:03 PM GMT+02:00, Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc
><gcc@gcc.gnu.org> wrote:
>>>
>>> Hi Jeff.  Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp, turned into a rewrite of the backwards threader
>>> code.  I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it will just detract from the high
>level
>>> discussion.  But if it helps, I could post what I have, which just
>needs
>>> some cleanups and porting to the latest trunk changes Andrew has
>made.
>>>
>>> Currently the backwards threader works by traversing DEF chains
>through
>>> PHIs leading to possible paths that start in a constant.  When such
>a
>>> path is found, it is checked to see if it is profitable, and if so,
>the
>>> constant path is threaded.  The current implementation is rather
>limited
>>> since backwards paths must end in a constant.  For example, the
>>> backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>     if (a && b)
>>>       foo ();
>>>     if (!b && c)
>>>       bar ();
>>>
>>> etc.
>>>
>>> After my refactoring patches to the threading code, it is now
>possible
>>> to drop in an alternate implementation that shares the profitability
>>> code (is this path profitable?), the jump registry, and the actual
>jump
>>> threading code.  I have leveraged this to write a ranger-based
>threader
>>> that gets every single thread the current code gets, plus 90-130%
>more.
>>>
>>> Here are the details from the branch, which should be very similar
>to
>>> trunk.  I'm presenting the branch numbers because they contain
>Andrew's
>>> upcoming relational query which significantly juices up the results.
>>>
>>> New threader:
>>>            ethread:65043    (+3.06%)
>>>            dom:32450      (-13.3%)
>>>            backwards threader:72482   (+89.6%)
>>>            vrp:40532      (-30.7%)
>>>     Total threaded:  210507 (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace.  In doing so, it
>reduces
>>> the amount of DOM threading opportunities by 13.3% and by 30.7% from
>the
>>> VRP jump threader.  The total  improvement across the jump threading
>>> opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM and
>VRP
>>> now get are not because they're smarter, but because they're picking
>up
>>> opportunities that the new code exposes.  I experimented with
>running an
>>> iterative threader, and then seeing what VRP and DOM could actually
>get.
>>>    This is too expensive to do in real life, but it at least shows
>what
>>> the effect of the new code is on DOM/VRP's abilities:
>>>
>>>     Iterative threader:
>>>       ethread:65043    (+3.06%)
>>>       dom:31170    (-16.7%)
>>>           thread:86717    (+127%)
>>>           vrp:33851    (-42.2%)
>>>     Total threaded:  216781 (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively).   The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>> 
>> Yeah, DOM once was iterating ...
>> 
>> You probably have noticed that we have very man (way too many)
>> 'thread' passes, often in close succession with each other or
>> DOM or VRP.  So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
>Sure, I can do that.  Let me whip up the old branch and gather some
>info.
>
>> 
>>> (Note that the new code gets even more opportunities, but I'm only
>>> reporting the profitable ones that made it all the way through to
>the
>>> threader backend, and actually eliminated a branch.)
>>>
>>> The overall compilation hit from this work is currently 1.38% as
>>> measured by callgrind.  We should be able to reduce this a bit, plus
>we
>>> could get some of that back if we can replace the DOM and VRP
>threaders
>>> (future work).
>>>
>>> My proposed implementation should be able to get any threading
>>> opportunity, and will get more as range-ops and ranger improve.
>>>
>>> I can go into the details if necessary, but the gist of it is that
>we
>>> leverage the import facility in the ranger to only look up paths
>that
>>> have a direct repercussion in the conditional being threaded, thus
>>> reducing the search space.  This enhanced path discovery, plus an
>engine
>>> to resolve conditionals based on knowledge from a CFG path, is all
>that
>>> is needed to register new paths.  There is no limit to how far back
>we
>>> look, though in practice, we stop looking once a path is too
>expensive
>>> to continue the search in a given direction.
>>>
>>> The solver API is simple:
>>>
>>> // This class is a thread path solver.  Given a set of BBs
>indicating
>>> // a path through the CFG, range_in_path() will return the range
>>> // of an SSA as if the BBs in the path would have been executed in
>>> // order.
>>> //
>>> // Note that the blocks are in reverse order, thus the exit block is
>>> path[0].
>>>
>>> class thread_solver : gori_compute
>>> {
>>>
>>> public:
>>>     thread_solver (gimple_ranger &ranger);
>>>     virtual ~thread_solver ();
>>>     void set_path (const vec<basic_block> *, const bitmap_head
>*imports);
>>>     void range_in_path (irange &, tree name);
>>>     void range_in_path (irange &, gimple *);
>>> ...
>>> };
>>>
>>> Basically, as we're discovering paths, we ask the solver what the
>value
>>> of the final conditional in a BB is in a given path.  If it
>resolves, we
>>> register the path.
>>>
>>> A follow-up project would be to analyze what DOM/VRP are actually
>>> getting that we don't, because in theory with an enhanced ranger, we
>>> should be able to get everything they do (minus some float stuff,
>and
>>> some CSE things DOM does).  However, IMO, this is good enough to at
>>> least replace the current backwards threading code.
>>>
>>> My suggestion would be to keep both implementations, defaulting to
>the
>>> ranger based, and running the old code immediately after-- trapping
>if
>>> it can find any threading opportunities.
>> 
>> But due to iteration uncovering new opportunities this will
>inevitably
>> break, no?
>
>No, actually because I'm comparing current and new backwards threader 
>behavior on the same IL, something I can't do with VRP because the IL
>is 
>slightly different (VRP asserts).
>
>So I can compare apples to apples in the backwards threader code.  What
>
>I do is run the new code first, and then run the old code on the same
>IL 
>before the threader has altered the CFG, while asserting that the old 
>code cannot register any "new" paths not already present in the
>registry.
>
>I have tested this assertion throughout 200+ .ii files from a
>bootstrap, 
>and there has never been a case where the old code can get something we
>
>can't.

Ah, that's nice! 

>> 
>>> After a few weeks, we could
>>> kill the old code.
>> 
>> Note that for analyzing threadings done apart from looking at overall
>> numbers the statistics infrastructure can be useful, likewise could
>> be the opt-info one where you can diff stats based on file or
>function
>> (statistics) or even location of a participating jump (opt-info).
>> 
>> If you are re-using tree-ssa-thread{edge,update}.{c,h} anyway you
>> probably only have to amend one or two places.  I'm personally
>> breaking things down to file/function via statistics to spot gross
>> differences more locally.
>
>I am indeed re-using all of tree-ssa-thread{edge,update}.  That is 
>unchanged.  I've been using the overall statistics plus an awk script
>to 
>collate it all.  But thanks for the opt-info tip.  I didn't know about
>that.
>
>> 
>> IMHO removing threading from VRP (as a step to make "VRP"
>> into another EVRP run) should be part of the initial transition,
>> there's always a thread pass nearby.  Performing threading from
>> EVRP itself might be another option to evaluate.  Trimming down
>> the number of (now backwards-)threaders would be another goal.
>
>Agreed.  I will have to sit down and investigate what VRP is getting. 
>Last I peeked it was either stuff range-ops could be taught, or 
>unconditional jumps to unconditional jumps that could be easily
>handled. 
>However, I think we can prove that the current backwards threader code 
>cannot get anything in the presence of the new code, and could be
>easily 
>replaced before concentrating on DOM/VRP.
>
>That being said, figuring out exactly the discrepancies with DOM/VRP is
>
>on my short-term radar ;-).
>
>Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 15:34   ` Aldy Hernandez
  2021-06-09 16:18     ` Richard Biener
@ 2021-06-09 19:47     ` Jeff Law
  2021-06-09 20:39       ` Aldy Hernandez
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Law @ 2021-06-09 19:47 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener; +Cc: GCC Mailing List



On 6/9/2021 9:34 AM, Aldy Hernandez wrote:
>
>
> On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc 
>> <gcc@gcc.gnu.org> wrote:
>>>
>>> Hi Jeff.  Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp, turned into a rewrite of the backwards threader
>>> code.  I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it will just detract from the high level
>>> discussion.  But if it helps, I could post what I have, which just 
>>> needs
>>> some cleanups and porting to the latest trunk changes Andrew has made.
>>>
>>> Currently the backwards threader works by traversing DEF chains through
>>> PHIs leading to possible paths that start in a constant.  When such a
>>> path is found, it is checked to see if it is profitable, and if so, the
>>> constant path is threaded.  The current implementation is rather 
>>> limited
>>> since backwards paths must end in a constant.  For example, the
>>> backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>     if (a && b)
>>>       foo ();
>>>     if (!b && c)
>>>       bar ();
>>>
>>> etc.
>>>
>>> After my refactoring patches to the threading code, it is now possible
>>> to drop in an alternate implementation that shares the profitability
>>> code (is this path profitable?), the jump registry, and the actual jump
>>> threading code.  I have leveraged this to write a ranger-based threader
>>> that gets every single thread the current code gets, plus 90-130% more.
>>>
>>> Here are the details from the branch, which should be very similar to
>>> trunk.  I'm presenting the branch numbers because they contain Andrew's
>>> upcoming relational query which significantly juices up the results.
>>>
>>> New threader:
>>>            ethread:65043    (+3.06%)
>>>            dom:32450      (-13.3%)
>>>            backwards threader:72482   (+89.6%)
>>>            vrp:40532      (-30.7%)
>>>     Total threaded:  210507 (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace.  In doing so, it reduces
>>> the amount of DOM threading opportunities by 13.3% and by 30.7% from 
>>> the
>>> VRP jump threader.  The total  improvement across the jump threading
>>> opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM and 
>>> VRP
>>> now get are not because they're smarter, but because they're picking up
>>> opportunities that the new code exposes.  I experimented with 
>>> running an
>>> iterative threader, and then seeing what VRP and DOM could actually 
>>> get.
>>>    This is too expensive to do in real life, but it at least shows what
>>> the effect of the new code is on DOM/VRP's abilities:
>>>
>>>     Iterative threader:
>>>       ethread:65043    (+3.06%)
>>>       dom:31170    (-16.7%)
>>>           thread:86717    (+127%)
>>>           vrp:33851    (-42.2%)
>>>     Total threaded:  216781 (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively).   The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>>
>> Yeah, DOM once was iterating ...
>>
>> You probably have noticed that we have very man (way too many)
>> 'thread' passes, often in close succession with each other or
>> DOM or VRP.  So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
> Sure, I can do that.  Let me whip up the old branch and gather some info.
I'll save you some time.  The jump threaders in VRP are doing the least 
amount of lifting and the ones we want to kill first.  IIRC the one from 
vrp2 is doing nearly nothing at this point.


Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 19:47     ` Jeff Law
@ 2021-06-09 20:39       ` Aldy Hernandez
  2021-06-09 20:43         ` Jeff Law
  0 siblings, 1 reply; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-09 20:39 UTC (permalink / raw)
  To: Jeff Law, Richard Biener; +Cc: GCC Mailing List



On 6/9/21 9:47 PM, Jeff Law wrote:
> 
> 
> On 6/9/2021 9:34 AM, Aldy Hernandez wrote:
>>
>>
>> On 6/9/21 2:09 PM, Richard Biener wrote:
>>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc 
>>> <gcc@gcc.gnu.org> wrote:
>>>>
>>>> Hi Jeff.  Hi folks.
>>>>
>>>> What started as a foray into severing the old (forward) threader's
>>>> dependency on evrp, turned into a rewrite of the backwards threader
>>>> code.  I'd like to discuss the possibility of replacing the current
>>>> backwards threader with a new one that gets far more threads and can
>>>> potentially subsume all threaders in the future.
>>>>
>>>> I won't include code here, as it will just detract from the high level
>>>> discussion.  But if it helps, I could post what I have, which just 
>>>> needs
>>>> some cleanups and porting to the latest trunk changes Andrew has made.
>>>>
>>>> Currently the backwards threader works by traversing DEF chains through
>>>> PHIs leading to possible paths that start in a constant.  When such a
>>>> path is found, it is checked to see if it is profitable, and if so, the
>>>> constant path is threaded.  The current implementation is rather 
>>>> limited
>>>> since backwards paths must end in a constant.  For example, the
>>>> backwards threader can't get any of the tests in
>>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>>
>>>>     if (a && b)
>>>>       foo ();
>>>>     if (!b && c)
>>>>       bar ();
>>>>
>>>> etc.
>>>>
>>>> After my refactoring patches to the threading code, it is now possible
>>>> to drop in an alternate implementation that shares the profitability
>>>> code (is this path profitable?), the jump registry, and the actual jump
>>>> threading code.  I have leveraged this to write a ranger-based threader
>>>> that gets every single thread the current code gets, plus 90-130% more.
>>>>
>>>> Here are the details from the branch, which should be very similar to
>>>> trunk.  I'm presenting the branch numbers because they contain Andrew's
>>>> upcoming relational query which significantly juices up the results.
>>>>
>>>> New threader:
>>>>            ethread:65043    (+3.06%)
>>>>            dom:32450      (-13.3%)
>>>>            backwards threader:72482   (+89.6%)
>>>>            vrp:40532      (-30.7%)
>>>>     Total threaded:  210507 (+6.70%)
>>>>
>>>> This means that the new code gets 89.6% more jump threading
>>>> opportunities than the code I want to replace.  In doing so, it reduces
>>>> the amount of DOM threading opportunities by 13.3% and by 30.7% from 
>>>> the
>>>> VRP jump threader.  The total  improvement across the jump threading
>>>> opportunities in the compiler is 6.70%.
>>>>
>>>> However, these are pessimistic numbers...
>>>>
>>>> I have noticed that some of the threading opportunities that DOM and 
>>>> VRP
>>>> now get are not because they're smarter, but because they're picking up
>>>> opportunities that the new code exposes.  I experimented with 
>>>> running an
>>>> iterative threader, and then seeing what VRP and DOM could actually 
>>>> get.
>>>>    This is too expensive to do in real life, but it at least shows what
>>>> the effect of the new code is on DOM/VRP's abilities:
>>>>
>>>>     Iterative threader:
>>>>       ethread:65043    (+3.06%)
>>>>       dom:31170    (-16.7%)
>>>>           thread:86717    (+127%)
>>>>           vrp:33851    (-42.2%)
>>>>     Total threaded:  216781 (+9.90%)
>>>>
>>>> This means that the new code not only gets 127% more cases, but it
>>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>>> respectively).   The end result is that we have the possibility of
>>>> getting almost 10% more jump threading opportunities in the entire
>>>> compilation run.
>>>
>>> Yeah, DOM once was iterating ...
>>>
>>> You probably have noticed that we have very man (way too many)
>>> 'thread' passes, often in close succession with each other or
>>> DOM or VRP.  So in the above numbers I wonder if you can break
>>> down the numbers individually for the actual passes (in their order)?
>>
>> Sure, I can do that.  Let me whip up the old branch and gather some info.
> I'll save you some time.  The jump threaders in VRP are doing the least 
> amount of lifting and the ones we want to kill first.  IIRC the one from 
> vrp2 is doing nearly nothing at this point.

Sure, that was going to be my next target.

What are your thoughts on replacing the current backwards threader, 
though?  That's basically ready to go.

Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 20:39       ` Aldy Hernandez
@ 2021-06-09 20:43         ` Jeff Law
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2021-06-09 20:43 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener; +Cc: GCC Mailing List



On 6/9/2021 2:39 PM, Aldy Hernandez wrote:
>
>
> On 6/9/21 9:47 PM, Jeff Law wrote:
>>
>>
>> On 6/9/2021 9:34 AM, Aldy Hernandez wrote:
>>>
>>>
>>> On 6/9/21 2:09 PM, Richard Biener wrote:
>>>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc 
>>>> <gcc@gcc.gnu.org> wrote:
>>>>>
>>>>> Hi Jeff.  Hi folks.
>>>>>
>>>>> What started as a foray into severing the old (forward) threader's
>>>>> dependency on evrp, turned into a rewrite of the backwards threader
>>>>> code.  I'd like to discuss the possibility of replacing the current
>>>>> backwards threader with a new one that gets far more threads and can
>>>>> potentially subsume all threaders in the future.
>>>>>
>>>>> I won't include code here, as it will just detract from the high 
>>>>> level
>>>>> discussion.  But if it helps, I could post what I have, which just 
>>>>> needs
>>>>> some cleanups and porting to the latest trunk changes Andrew has 
>>>>> made.
>>>>>
>>>>> Currently the backwards threader works by traversing DEF chains 
>>>>> through
>>>>> PHIs leading to possible paths that start in a constant. When such a
>>>>> path is found, it is checked to see if it is profitable, and if 
>>>>> so, the
>>>>> constant path is threaded.  The current implementation is rather 
>>>>> limited
>>>>> since backwards paths must end in a constant.  For example, the
>>>>> backwards threader can't get any of the tests in
>>>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>>>
>>>>>     if (a && b)
>>>>>       foo ();
>>>>>     if (!b && c)
>>>>>       bar ();
>>>>>
>>>>> etc.
>>>>>
>>>>> After my refactoring patches to the threading code, it is now 
>>>>> possible
>>>>> to drop in an alternate implementation that shares the profitability
>>>>> code (is this path profitable?), the jump registry, and the actual 
>>>>> jump
>>>>> threading code.  I have leveraged this to write a ranger-based 
>>>>> threader
>>>>> that gets every single thread the current code gets, plus 90-130% 
>>>>> more.
>>>>>
>>>>> Here are the details from the branch, which should be very similar to
>>>>> trunk.  I'm presenting the branch numbers because they contain 
>>>>> Andrew's
>>>>> upcoming relational query which significantly juices up the results.
>>>>>
>>>>> New threader:
>>>>>            ethread:65043    (+3.06%)
>>>>>            dom:32450      (-13.3%)
>>>>>            backwards threader:72482   (+89.6%)
>>>>>            vrp:40532      (-30.7%)
>>>>>     Total threaded:  210507 (+6.70%)
>>>>>
>>>>> This means that the new code gets 89.6% more jump threading
>>>>> opportunities than the code I want to replace.  In doing so, it 
>>>>> reduces
>>>>> the amount of DOM threading opportunities by 13.3% and by 30.7% 
>>>>> from the
>>>>> VRP jump threader.  The total  improvement across the jump threading
>>>>> opportunities in the compiler is 6.70%.
>>>>>
>>>>> However, these are pessimistic numbers...
>>>>>
>>>>> I have noticed that some of the threading opportunities that DOM 
>>>>> and VRP
>>>>> now get are not because they're smarter, but because they're 
>>>>> picking up
>>>>> opportunities that the new code exposes.  I experimented with 
>>>>> running an
>>>>> iterative threader, and then seeing what VRP and DOM could 
>>>>> actually get.
>>>>>    This is too expensive to do in real life, but it at least shows 
>>>>> what
>>>>> the effect of the new code is on DOM/VRP's abilities:
>>>>>
>>>>>     Iterative threader:
>>>>>       ethread:65043    (+3.06%)
>>>>>       dom:31170    (-16.7%)
>>>>>           thread:86717    (+127%)
>>>>>           vrp:33851    (-42.2%)
>>>>>     Total threaded:  216781 (+9.90%)
>>>>>
>>>>> This means that the new code not only gets 127% more cases, but it
>>>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>>>> respectively).   The end result is that we have the possibility of
>>>>> getting almost 10% more jump threading opportunities in the entire
>>>>> compilation run.
>>>>
>>>> Yeah, DOM once was iterating ...
>>>>
>>>> You probably have noticed that we have very man (way too many)
>>>> 'thread' passes, often in close succession with each other or
>>>> DOM or VRP.  So in the above numbers I wonder if you can break
>>>> down the numbers individually for the actual passes (in their order)?
>>>
>>> Sure, I can do that.  Let me whip up the old branch and gather some 
>>> info.
>> I'll save you some time.  The jump threaders in VRP are doing the 
>> least amount of lifting and the ones we want to kill first. IIRC the 
>> one from vrp2 is doing nearly nothing at this point.
>
> Sure, that was going to be my next target.
>
> What are your thoughts on replacing the current backwards threader, 
> though?  That's basically ready to go.
Going to take a deep dive into it Saturday.

Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 12:09 ` Richard Biener
  2021-06-09 15:34   ` Aldy Hernandez
@ 2021-06-21 14:40   ` Aldy Hernandez
  2021-06-24 16:14     ` Jeff Law
  1 sibling, 1 reply; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-21 14:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, GCC Mailing List, Andrew MacLeod



On 6/9/21 2:09 PM, Richard Biener wrote:
> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> Hi Jeff.  Hi folks.
>>
>> What started as a foray into severing the old (forward) threader's
>> dependency on evrp, turned into a rewrite of the backwards threader
>> code.  I'd like to discuss the possibility of replacing the current
>> backwards threader with a new one that gets far more threads and can
>> potentially subsume all threaders in the future.
>>
>> I won't include code here, as it will just detract from the high level
>> discussion.  But if it helps, I could post what I have, which just needs
>> some cleanups and porting to the latest trunk changes Andrew has made.
>>
>> Currently the backwards threader works by traversing DEF chains through
>> PHIs leading to possible paths that start in a constant.  When such a
>> path is found, it is checked to see if it is profitable, and if so, the
>> constant path is threaded.  The current implementation is rather limited
>> since backwards paths must end in a constant.  For example, the
>> backwards threader can't get any of the tests in
>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>
>>     if (a && b)
>>       foo ();
>>     if (!b && c)
>>       bar ();
>>
>> etc.
>>
>> After my refactoring patches to the threading code, it is now possible
>> to drop in an alternate implementation that shares the profitability
>> code (is this path profitable?), the jump registry, and the actual jump
>> threading code.  I have leveraged this to write a ranger-based threader
>> that gets every single thread the current code gets, plus 90-130% more.
>>
>> Here are the details from the branch, which should be very similar to
>> trunk.  I'm presenting the branch numbers because they contain Andrew's
>> upcoming relational query which significantly juices up the results.
>>
>> New threader:
>>            ethread:65043    (+3.06%)
>>            dom:32450      (-13.3%)
>>            backwards threader:72482   (+89.6%)
>>            vrp:40532      (-30.7%)
>>     Total threaded:  210507 (+6.70%)
>>
>> This means that the new code gets 89.6% more jump threading
>> opportunities than the code I want to replace.  In doing so, it reduces
>> the amount of DOM threading opportunities by 13.3% and by 30.7% from the
>> VRP jump threader.  The total  improvement across the jump threading
>> opportunities in the compiler is 6.70%.
>>
>> However, these are pessimistic numbers...
>>
>> I have noticed that some of the threading opportunities that DOM and VRP
>> now get are not because they're smarter, but because they're picking up
>> opportunities that the new code exposes.  I experimented with running an
>> iterative threader, and then seeing what VRP and DOM could actually get.
>>    This is too expensive to do in real life, but it at least shows what
>> the effect of the new code is on DOM/VRP's abilities:
>>
>>     Iterative threader:
>>       ethread:65043    (+3.06%)
>>       dom:31170    (-16.7%)
>>           thread:86717    (+127%)
>>           vrp:33851    (-42.2%)
>>     Total threaded:  216781 (+9.90%)
>>
>> This means that the new code not only gets 127% more cases, but it
>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>> respectively).   The end result is that we have the possibility of
>> getting almost 10% more jump threading opportunities in the entire
>> compilation run.
> 
> Yeah, DOM once was iterating ...
> 
> You probably have noticed that we have very man (way too many)
> 'thread' passes, often in close succession with each other or
> DOM or VRP.  So in the above numbers I wonder if you can break
> down the numbers individually for the actual passes (in their order)?

As promised.

*** LEGACY:
ethread42:61152 30.1369% (61152 threads for 30.1% of total)
thread117:29646 14.6101%
vrp118:62088 30.5982%
thread132:2232 1.09997%
dom133:31116 15.3346%
thread197:1950 0.960998%
dom198:10661 5.25395%
thread200:587 0.289285%
vrp201:3482 1.716%
Total:  202914

The above is from current trunk with my patches applied, defaulting to 
legacy mode.  It follows the pass number nomenclature in the 
*.statistics files.

New threader code (This is what I envision current trunk to look with my 
patchset):

*** RANGER:
ethread42:64389 30.2242%
thread117:49449 23.2114%
vrp118:46118 21.6478%
thread132:8153 3.82702%
dom133:27168 12.7527%
thread197:5542 2.60141%
dom198:8191 3.84485%
thread200:1038 0.487237%
vrp201:2990 1.40351%
Total:  213038

New threader code running in iterative mode (illustrative purposes only):

*** RANGER: iterative mode
ethread42:64389 28.895%
thread117:76013 34.1113%
vrp118:31823 14.2808%
thread132:7165 3.21534%
dom133:25881 11.6143%
thread197:6226 2.79396%
dom198:7536 3.38183%
thread200:971 0.435743%
vrp201:2834 1.27178%
Total:  222838

The above run in iterative mode is only used to show what the new code 
fails to get since dom* and vrp* are still threading some things despite 
repeated runs of the new code.

As I've mentioned, I haven't dug deep into what VRP* and DOM* are 
threading over the new threader code.  Last time I checked it was some 
combination of:

a) DOM doing folding memory references into temporaries that could then 
be used (even by the ranger-threader code) to resolve paths.

b) Things we should be able to tweak range-ops to get.

c) I did see some simple jumps to jumps, that should be trivial to get.

As this is future work, I didn't bother to look too deeply.

Note that over my 395 .ii files from a bootstrap, not once was the 
legacy backward threading code able to get something the new code 
couldn't get.

It seems like any threading past the first run of DOM are slim pickings.

Aldy

p.s. The above numbers do not reflect Andrew's upcoming relational work. 
  It seems there's very little benefit in the threader for it, as most 
of the benefit was from the recalculation work.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-21 14:40   ` Aldy Hernandez
@ 2021-06-24 16:14     ` Jeff Law
  2021-06-25  7:54       ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Law @ 2021-06-24 16:14 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener; +Cc: GCC Mailing List, Andrew MacLeod



On 6/21/2021 8:40 AM, Aldy Hernandez wrote:
>
>
> On 6/9/21 2:09 PM, Richard Biener wrote:
>> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc 
>> <gcc@gcc.gnu.org> wrote:
>>>
>>> Hi Jeff.  Hi folks.
>>>
>>> What started as a foray into severing the old (forward) threader's
>>> dependency on evrp, turned into a rewrite of the backwards threader
>>> code.  I'd like to discuss the possibility of replacing the current
>>> backwards threader with a new one that gets far more threads and can
>>> potentially subsume all threaders in the future.
>>>
>>> I won't include code here, as it will just detract from the high level
>>> discussion.  But if it helps, I could post what I have, which just 
>>> needs
>>> some cleanups and porting to the latest trunk changes Andrew has made.
>>>
>>> Currently the backwards threader works by traversing DEF chains through
>>> PHIs leading to possible paths that start in a constant.  When such a
>>> path is found, it is checked to see if it is profitable, and if so, the
>>> constant path is threaded.  The current implementation is rather 
>>> limited
>>> since backwards paths must end in a constant.  For example, the
>>> backwards threader can't get any of the tests in
>>> gcc.dg/tree-ssa/ssa-thread-14.c:
>>>
>>>     if (a && b)
>>>       foo ();
>>>     if (!b && c)
>>>       bar ();
>>>
>>> etc.
>>>
>>> After my refactoring patches to the threading code, it is now possible
>>> to drop in an alternate implementation that shares the profitability
>>> code (is this path profitable?), the jump registry, and the actual jump
>>> threading code.  I have leveraged this to write a ranger-based threader
>>> that gets every single thread the current code gets, plus 90-130% more.
>>>
>>> Here are the details from the branch, which should be very similar to
>>> trunk.  I'm presenting the branch numbers because they contain Andrew's
>>> upcoming relational query which significantly juices up the results.
>>>
>>> New threader:
>>>            ethread:65043    (+3.06%)
>>>            dom:32450      (-13.3%)
>>>            backwards threader:72482   (+89.6%)
>>>            vrp:40532      (-30.7%)
>>>     Total threaded:  210507 (+6.70%)
>>>
>>> This means that the new code gets 89.6% more jump threading
>>> opportunities than the code I want to replace.  In doing so, it reduces
>>> the amount of DOM threading opportunities by 13.3% and by 30.7% from 
>>> the
>>> VRP jump threader.  The total  improvement across the jump threading
>>> opportunities in the compiler is 6.70%.
>>>
>>> However, these are pessimistic numbers...
>>>
>>> I have noticed that some of the threading opportunities that DOM and 
>>> VRP
>>> now get are not because they're smarter, but because they're picking up
>>> opportunities that the new code exposes.  I experimented with 
>>> running an
>>> iterative threader, and then seeing what VRP and DOM could actually 
>>> get.
>>>    This is too expensive to do in real life, but it at least shows what
>>> the effect of the new code is on DOM/VRP's abilities:
>>>
>>>     Iterative threader:
>>>       ethread:65043    (+3.06%)
>>>       dom:31170    (-16.7%)
>>>           thread:86717    (+127%)
>>>           vrp:33851    (-42.2%)
>>>     Total threaded:  216781 (+9.90%)
>>>
>>> This means that the new code not only gets 127% more cases, but it
>>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
>>> respectively).   The end result is that we have the possibility of
>>> getting almost 10% more jump threading opportunities in the entire
>>> compilation run.
>>
>> Yeah, DOM once was iterating ...
>>
>> You probably have noticed that we have very man (way too many)
>> 'thread' passes, often in close succession with each other or
>> DOM or VRP.  So in the above numbers I wonder if you can break
>> down the numbers individually for the actual passes (in their order)?
>
> As promised.
>
> *** LEGACY:
> ethread42:61152 30.1369% (61152 threads for 30.1% of total)
> thread117:29646 14.6101%
> vrp118:62088 30.5982%
> thread132:2232 1.09997%
> dom133:31116 15.3346%
> thread197:1950 0.960998%
> dom198:10661 5.25395%
> thread200:587 0.289285%
> vrp201:3482 1.716%
> Total:  202914

> The above is from current trunk with my patches applied, defaulting to 
> legacy mode.  It follows the pass number nomenclature in the 
> *.statistics files.
>
> New threader code (This is what I envision current trunk to look with 
> my patchset):
>
> *** RANGER:
> ethread42:64389 30.2242%
> thread117:49449 23.2114%
> vrp118:46118 21.6478%
> thread132:8153 3.82702%
> dom133:27168 12.7527%
> thread197:5542 2.60141%
> dom198:8191 3.84485%
> thread200:1038 0.487237%
> vrp201:2990 1.40351%
> Total:  213038
So this makes me think we should focus on dropping thread197, thread200, 
& vrp201 and I'd probably focus on vrp201 first since we know we want to 
get rid of it anyway and that may change the data for thread???.  Then 
I'd be looking at thread200 and thread197 in that order.  I suspect that 
at least some of the cases in thread200 and vrp201 are exposed by dom198.


Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-24 16:14     ` Jeff Law
@ 2021-06-25  7:54       ` Richard Biener
  2021-06-25  7:58         ` Richard Biener
  2021-06-25 16:20         ` Aldy Hernandez
  0 siblings, 2 replies; 25+ messages in thread
From: Richard Biener @ 2021-06-25  7:54 UTC (permalink / raw)
  To: Jeff Law; +Cc: Aldy Hernandez, GCC Mailing List, Andrew MacLeod

On Thu, Jun 24, 2021 at 6:14 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 6/21/2021 8:40 AM, Aldy Hernandez wrote:
> >
> >
> > On 6/9/21 2:09 PM, Richard Biener wrote:
> >> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc
> >> <gcc@gcc.gnu.org> wrote:
> >>>
> >>> Hi Jeff.  Hi folks.
> >>>
> >>> What started as a foray into severing the old (forward) threader's
> >>> dependency on evrp, turned into a rewrite of the backwards threader
> >>> code.  I'd like to discuss the possibility of replacing the current
> >>> backwards threader with a new one that gets far more threads and can
> >>> potentially subsume all threaders in the future.
> >>>
> >>> I won't include code here, as it will just detract from the high level
> >>> discussion.  But if it helps, I could post what I have, which just
> >>> needs
> >>> some cleanups and porting to the latest trunk changes Andrew has made.
> >>>
> >>> Currently the backwards threader works by traversing DEF chains through
> >>> PHIs leading to possible paths that start in a constant.  When such a
> >>> path is found, it is checked to see if it is profitable, and if so, the
> >>> constant path is threaded.  The current implementation is rather
> >>> limited
> >>> since backwards paths must end in a constant.  For example, the
> >>> backwards threader can't get any of the tests in
> >>> gcc.dg/tree-ssa/ssa-thread-14.c:
> >>>
> >>>     if (a && b)
> >>>       foo ();
> >>>     if (!b && c)
> >>>       bar ();
> >>>
> >>> etc.
> >>>
> >>> After my refactoring patches to the threading code, it is now possible
> >>> to drop in an alternate implementation that shares the profitability
> >>> code (is this path profitable?), the jump registry, and the actual jump
> >>> threading code.  I have leveraged this to write a ranger-based threader
> >>> that gets every single thread the current code gets, plus 90-130% more.
> >>>
> >>> Here are the details from the branch, which should be very similar to
> >>> trunk.  I'm presenting the branch numbers because they contain Andrew's
> >>> upcoming relational query which significantly juices up the results.
> >>>
> >>> New threader:
> >>>            ethread:65043    (+3.06%)
> >>>            dom:32450      (-13.3%)
> >>>            backwards threader:72482   (+89.6%)
> >>>            vrp:40532      (-30.7%)
> >>>     Total threaded:  210507 (+6.70%)
> >>>
> >>> This means that the new code gets 89.6% more jump threading
> >>> opportunities than the code I want to replace.  In doing so, it reduces
> >>> the amount of DOM threading opportunities by 13.3% and by 30.7% from
> >>> the
> >>> VRP jump threader.  The total  improvement across the jump threading
> >>> opportunities in the compiler is 6.70%.
> >>>
> >>> However, these are pessimistic numbers...
> >>>
> >>> I have noticed that some of the threading opportunities that DOM and
> >>> VRP
> >>> now get are not because they're smarter, but because they're picking up
> >>> opportunities that the new code exposes.  I experimented with
> >>> running an
> >>> iterative threader, and then seeing what VRP and DOM could actually
> >>> get.
> >>>    This is too expensive to do in real life, but it at least shows what
> >>> the effect of the new code is on DOM/VRP's abilities:
> >>>
> >>>     Iterative threader:
> >>>       ethread:65043    (+3.06%)
> >>>       dom:31170    (-16.7%)
> >>>           thread:86717    (+127%)
> >>>           vrp:33851    (-42.2%)
> >>>     Total threaded:  216781 (+9.90%)
> >>>
> >>> This means that the new code not only gets 127% more cases, but it
> >>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
> >>> respectively).   The end result is that we have the possibility of
> >>> getting almost 10% more jump threading opportunities in the entire
> >>> compilation run.
> >>
> >> Yeah, DOM once was iterating ...
> >>
> >> You probably have noticed that we have very man (way too many)
> >> 'thread' passes, often in close succession with each other or
> >> DOM or VRP.  So in the above numbers I wonder if you can break
> >> down the numbers individually for the actual passes (in their order)?
> >
> > As promised.
> >
> > *** LEGACY:
> > ethread42:61152 30.1369% (61152 threads for 30.1% of total)
> > thread117:29646 14.6101%
> > vrp118:62088 30.5982%
> > thread132:2232 1.09997%
> > dom133:31116 15.3346%
> > thread197:1950 0.960998%
> > dom198:10661 5.25395%
> > thread200:587 0.289285%
> > vrp201:3482 1.716%
> > Total:  202914
>
> > The above is from current trunk with my patches applied, defaulting to
> > legacy mode.  It follows the pass number nomenclature in the
> > *.statistics files.
> >
> > New threader code (This is what I envision current trunk to look with
> > my patchset):
> >
> > *** RANGER:
> > ethread42:64389 30.2242%
> > thread117:49449 23.2114%
> > vrp118:46118 21.6478%
> > thread132:8153 3.82702%
> > dom133:27168 12.7527%
> > thread197:5542 2.60141%
> > dom198:8191 3.84485%
> > thread200:1038 0.487237%
> > vrp201:2990 1.40351%
> > Total:  213038
> So this makes me think we should focus on dropping thread197, thread200,
> & vrp201 and I'd probably focus on vrp201 first since we know we want to
> get rid of it anyway and that may change the data for thread???.  Then
> I'd be looking at thread200 and thread197 in that order.  I suspect that
> at least some of the cases in thread200 and vrp201 are exposed by dom198.

It might be interesting to see replacing vrp201 with evrp and run
thread200 after
that or alternatively move dom198 after vrp201 and then replace vrp201 with
evrp and kill off thread200 completely.  Supposedly we will for now keep the
forward threading in DOM.  Since we have FRE right before DOM it's
likely only threading that DOM performs now.  (but of course testcases
will need adjustment)

Richard.

>
> Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-25  7:54       ` Richard Biener
@ 2021-06-25  7:58         ` Richard Biener
  2021-06-25 16:20         ` Aldy Hernandez
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Biener @ 2021-06-25  7:58 UTC (permalink / raw)
  To: Jeff Law; +Cc: Aldy Hernandez, GCC Mailing List, Andrew MacLeod

On Fri, Jun 25, 2021 at 9:54 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, Jun 24, 2021 at 6:14 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
> >
> >
> >
> > On 6/21/2021 8:40 AM, Aldy Hernandez wrote:
> > >
> > >
> > > On 6/9/21 2:09 PM, Richard Biener wrote:
> > >> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc
> > >> <gcc@gcc.gnu.org> wrote:
> > >>>
> > >>> Hi Jeff.  Hi folks.
> > >>>
> > >>> What started as a foray into severing the old (forward) threader's
> > >>> dependency on evrp, turned into a rewrite of the backwards threader
> > >>> code.  I'd like to discuss the possibility of replacing the current
> > >>> backwards threader with a new one that gets far more threads and can
> > >>> potentially subsume all threaders in the future.
> > >>>
> > >>> I won't include code here, as it will just detract from the high level
> > >>> discussion.  But if it helps, I could post what I have, which just
> > >>> needs
> > >>> some cleanups and porting to the latest trunk changes Andrew has made.
> > >>>
> > >>> Currently the backwards threader works by traversing DEF chains through
> > >>> PHIs leading to possible paths that start in a constant.  When such a
> > >>> path is found, it is checked to see if it is profitable, and if so, the
> > >>> constant path is threaded.  The current implementation is rather
> > >>> limited
> > >>> since backwards paths must end in a constant.  For example, the
> > >>> backwards threader can't get any of the tests in
> > >>> gcc.dg/tree-ssa/ssa-thread-14.c:
> > >>>
> > >>>     if (a && b)
> > >>>       foo ();
> > >>>     if (!b && c)
> > >>>       bar ();
> > >>>
> > >>> etc.
> > >>>
> > >>> After my refactoring patches to the threading code, it is now possible
> > >>> to drop in an alternate implementation that shares the profitability
> > >>> code (is this path profitable?), the jump registry, and the actual jump
> > >>> threading code.  I have leveraged this to write a ranger-based threader
> > >>> that gets every single thread the current code gets, plus 90-130% more.
> > >>>
> > >>> Here are the details from the branch, which should be very similar to
> > >>> trunk.  I'm presenting the branch numbers because they contain Andrew's
> > >>> upcoming relational query which significantly juices up the results.
> > >>>
> > >>> New threader:
> > >>>            ethread:65043    (+3.06%)
> > >>>            dom:32450      (-13.3%)
> > >>>            backwards threader:72482   (+89.6%)
> > >>>            vrp:40532      (-30.7%)
> > >>>     Total threaded:  210507 (+6.70%)
> > >>>
> > >>> This means that the new code gets 89.6% more jump threading
> > >>> opportunities than the code I want to replace.  In doing so, it reduces
> > >>> the amount of DOM threading opportunities by 13.3% and by 30.7% from
> > >>> the
> > >>> VRP jump threader.  The total  improvement across the jump threading
> > >>> opportunities in the compiler is 6.70%.
> > >>>
> > >>> However, these are pessimistic numbers...
> > >>>
> > >>> I have noticed that some of the threading opportunities that DOM and
> > >>> VRP
> > >>> now get are not because they're smarter, but because they're picking up
> > >>> opportunities that the new code exposes.  I experimented with
> > >>> running an
> > >>> iterative threader, and then seeing what VRP and DOM could actually
> > >>> get.
> > >>>    This is too expensive to do in real life, but it at least shows what
> > >>> the effect of the new code is on DOM/VRP's abilities:
> > >>>
> > >>>     Iterative threader:
> > >>>       ethread:65043    (+3.06%)
> > >>>       dom:31170    (-16.7%)
> > >>>           thread:86717    (+127%)
> > >>>           vrp:33851    (-42.2%)
> > >>>     Total threaded:  216781 (+9.90%)
> > >>>
> > >>> This means that the new code not only gets 127% more cases, but it
> > >>> reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
> > >>> respectively).   The end result is that we have the possibility of
> > >>> getting almost 10% more jump threading opportunities in the entire
> > >>> compilation run.
> > >>
> > >> Yeah, DOM once was iterating ...
> > >>
> > >> You probably have noticed that we have very man (way too many)
> > >> 'thread' passes, often in close succession with each other or
> > >> DOM or VRP.  So in the above numbers I wonder if you can break
> > >> down the numbers individually for the actual passes (in their order)?
> > >
> > > As promised.
> > >
> > > *** LEGACY:
> > > ethread42:61152 30.1369% (61152 threads for 30.1% of total)
> > > thread117:29646 14.6101%
> > > vrp118:62088 30.5982%
> > > thread132:2232 1.09997%
> > > dom133:31116 15.3346%
> > > thread197:1950 0.960998%
> > > dom198:10661 5.25395%
> > > thread200:587 0.289285%
> > > vrp201:3482 1.716%
> > > Total:  202914
> >
> > > The above is from current trunk with my patches applied, defaulting to
> > > legacy mode.  It follows the pass number nomenclature in the
> > > *.statistics files.
> > >
> > > New threader code (This is what I envision current trunk to look with
> > > my patchset):
> > >
> > > *** RANGER:
> > > ethread42:64389 30.2242%
> > > thread117:49449 23.2114%
> > > vrp118:46118 21.6478%
> > > thread132:8153 3.82702%
> > > dom133:27168 12.7527%
> > > thread197:5542 2.60141%
> > > dom198:8191 3.84485%
> > > thread200:1038 0.487237%
> > > vrp201:2990 1.40351%
> > > Total:  213038
> > So this makes me think we should focus on dropping thread197, thread200,
> > & vrp201 and I'd probably focus on vrp201 first since we know we want to
> > get rid of it anyway and that may change the data for thread???.  Then
> > I'd be looking at thread200 and thread197 in that order.  I suspect that
> > at least some of the cases in thread200 and vrp201 are exposed by dom198.
>
> It might be interesting to see replacing vrp201 with evrp and run
> thread200 after
> that or alternatively move dom198 after vrp201 and then replace vrp201 with
> evrp and kill off thread200 completely.  Supposedly we will for now keep the
> forward threading in DOM.  Since we have FRE right before DOM it's
> likely only threading that DOM performs now.  (but of course testcases
> will need adjustment)

So instead of

      NEXT_PASS (pass_fre, false /* may_iterate */);
      /* After late FRE we rewrite no longer addressed locals into SSA
         form if possible.  */
      NEXT_PASS (pass_thread_jumps);
      NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
      NEXT_PASS (pass_strlen);
      NEXT_PASS (pass_thread_jumps);
      NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
      /* Threading can leave many const/copy propagations in the IL.
         Clean them up.  Instead of just copy_prop, we use ccp to
         compute alignment and nonzero bits.  */
      NEXT_PASS (pass_ccp, true /* nonzero_p */);

do

      NEXT_PASS (pass_fre, false /* may_iterate */);
      NEXT_PASS (pass_early_vrp);
      NEXT_PASS (pass_thread_jumps);
      NEXT_PASS (pass_strlen); /// ???
      /* Threading can leave many const/copy propagations in the IL.
         Clean them up.  Instead of just copy_prop, we use ccp to
         compute alignment and nonzero bits.  */
      NEXT_PASS (pass_ccp, true /* nonzero_p */);

where I only wonder about pass_strlen placement and how much it depends on
earlier threading (I'd just run it after FRE?)

> Richard.
>
> >
> > Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-25  7:54       ` Richard Biener
  2021-06-25  7:58         ` Richard Biener
@ 2021-06-25 16:20         ` Aldy Hernandez
  2021-06-25 17:19           ` Martin Sebor
  2021-06-28  8:22           ` Richard Biener
  1 sibling, 2 replies; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-25 16:20 UTC (permalink / raw)
  To: Richard Biener, Jeff Law; +Cc: GCC Mailing List, Andrew MacLeod, Martin Sebor

Hi folks.

I'm done with benchmarking, testing and cleanups, so I'd like to post my 
patchset for review.  However, before doing so, I'd like to address a 
handful of meta-issues that may affect how I post these patches.

Trapping on differences
=======================

Originally I wanted to contribute verification code that would trap if 
the legacy code threaded any edges the new code couldn't (to be removed 
after a week).  However, after having tested on various architectures 
and only running once into a missing thread, I'm leaning towards 
omitting the verification code, since it's fragile, time consuming, and 
quite hacky.

For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
There is only one case, across bootstrap and regression tests where the 
verification code is ever tripped (discussed below).

Performance
===========

I re-ran benchmarks as per our callgrind suite, and the penalty with the 
current pipeline is 1.55% of overall compilation time.  As is being 
discussed, we should be able to mitigate this significantly by removing 
other threading passes.

Failing testcases
=================

I have yet to run into incorrect code being generated, but I have had to 
tweak a considerable number of tests.  I have verified every single 
discrepancy and documented my changes in the testsuite when it merited 
doing so.  However, there are a couple tests that trigger regressions 
and I'd like to ask for guidance on how to address them.

1. gcc.c-torture/compile/pr83510.c

I would like to XFAIL this.

What happens here is that thread1 threads a switch statement such that 
the various cases have been split into different independent blocks. 
One of these blocks exposes an arr[i_27] access which is later 
propagated by VRP to be arr[10].  This is an invalid access, but the 
array bounds code doesn't know it is an unreachable path.

However, it is not until dom2 that we "know" that the value of the 
switch index is such that the path to arr[10] is unreachable.  For that 
matter, it is not until dom3 that we remove the unreachable path.

2. -Wfree-nonheap-object

This warning is triggered while cleaning up an auto_vec.  I see that the 
va_heap::release() inline is wrapped with a pragma ignore 
"-Wfree-nonheap-object", but this is not sufficient because jump 
threading may alter uses in such a way that may_emit_free_warning() will 
warn on the *inlined* location, thus bypassing the pragma.

I worked around this with a mere:

 > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>    location_t loc = tree_inlined_location (exp);
> +  loc = EXPR_LOCATION (exp);

but this causes a ton of Wfree-nonheap* tests to fail.  I think someone 
more knowledgeable should address this (msebor??).

3. uninit-pred-9_b.c

The uninit code is getting confused with the threading and the bogus 
warning in line 24 is back.  I looked at the thread, and it is correct.

I'm afraid all these warnings are quite fragile in the presence of more 
aggressive optimizations, and I suspect it will only get worse.

4. libphobos/src/std/net/isemail.d

This is a D test where we don't actually fail, but we trigger the 
verification code.  It is the only jump threading edge that the new code 
fails to get over the old code, and it only happens on ppc64.

It triggers because a BB4 -> BB5 is too expensive to thread, but a BBn 
-> BB3 -> BB4 -> BB5 is considered safe to thread because BB3 is a latch 
and it alters the profitability equation.  The reason we don't get it, 
is that we assume that if a X->Y is unprofitable, it is not worth 
looking at W->X->Y and so forth.

Jeff had some fancy ideas on how to attack this.  Once such idea was to 
stop looking back, but only for things we were absolutely sure would 
never yield a profitable path.  I tried a subset of this, by allowing 
further looks on this latch test, but my 1.55% overall performance 
penalty turned into an 8.33% penalty.  Personally it looks way too 
expensive for this one isolated case.  Besides, the test where this 
clamping code originally came from still succeeds (commit 
eab2541b860c48203115ac6dca3284e982015d2c).

CONCLUSION
==========

That's basically it.

If we agree the above things are not big issues, or can be addressed as 
follow-ups, I'd like to start the ball rolling on the new threader. 
This would allow more extensive testing of the code, and separate it a 
bit from the other big changes coming up :).

Aldy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-25 16:20         ` Aldy Hernandez
@ 2021-06-25 17:19           ` Martin Sebor
  2021-06-28  6:17             ` Aldy Hernandez
  2021-06-28  8:22           ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Martin Sebor @ 2021-06-25 17:19 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, Jeff Law; +Cc: GCC Mailing List, Martin Sebor

On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
> Hi folks.
> 
> I'm done with benchmarking, testing and cleanups, so I'd like to post my 
> patchset for review.  However, before doing so, I'd like to address a 
> handful of meta-issues that may affect how I post these patches.
> 
> Trapping on differences
> =======================
> 
> Originally I wanted to contribute verification code that would trap if 
> the legacy code threaded any edges the new code couldn't (to be removed 
> after a week).  However, after having tested on various architectures 
> and only running once into a missing thread, I'm leaning towards 
> omitting the verification code, since it's fragile, time consuming, and 
> quite hacky.
> 
> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
> There is only one case, across bootstrap and regression tests where the 
> verification code is ever tripped (discussed below).
> 
> Performance
> ===========
> 
> I re-ran benchmarks as per our callgrind suite, and the penalty with the 
> current pipeline is 1.55% of overall compilation time.  As is being 
> discussed, we should be able to mitigate this significantly by removing 
> other threading passes.
> 
> Failing testcases
> =================
> 
> I have yet to run into incorrect code being generated, but I have had to 
> tweak a considerable number of tests.  I have verified every single 
> discrepancy and documented my changes in the testsuite when it merited 
> doing so.  However, there are a couple tests that trigger regressions 
> and I'd like to ask for guidance on how to address them.
> 
> 1. gcc.c-torture/compile/pr83510.c
> 
> I would like to XFAIL this.
> 
> What happens here is that thread1 threads a switch statement such that 
> the various cases have been split into different independent blocks. One 
> of these blocks exposes an arr[i_27] access which is later propagated by 
> VRP to be arr[10].  This is an invalid access, but the array bounds code 
> doesn't know it is an unreachable path.

The test has a bunch of loops that iterate over the 10 array elements.
There have been bug reports about loop unrolling causing false positives
-Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
the same issue.

> 
> However, it is not until dom2 that we "know" that the value of the 
> switch index is such that the path to arr[10] is unreachable.  For that 
> matter, it is not until dom3 that we remove the unreachable path.

If you do XFAIL it can you please isolate a small test case and open
a bug and make it a -Warray-bounds blocker?

> 
> 2. -Wfree-nonheap-object
> 
> This warning is triggered while cleaning up an auto_vec.  I see that the 
> va_heap::release() inline is wrapped with a pragma ignore 
> "-Wfree-nonheap-object", but this is not sufficient because jump 
> threading may alter uses in such a way that may_emit_free_warning() will 
> warn on the *inlined* location, thus bypassing the pragma.
> 
> I worked around this with a mere:
> 
>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>>    location_t loc = tree_inlined_location (exp);
>> +  loc = EXPR_LOCATION (exp);
> 
> but this causes a ton of Wfree-nonheap* tests to fail.  I think someone 
> more knowledgeable should address this (msebor??).

This sounds like the same problem as PR 98871.  Does the patch below
fix it?
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
If so, I suggest getting that patch in first to avoid testsuite
failures.  If it doesn't fix it I'll look into it before you commit
your changes.

> 
> 3. uninit-pred-9_b.c
> 
> The uninit code is getting confused with the threading and the bogus 
> warning in line 24 is back.  I looked at the thread, and it is correct.
> 
> I'm afraid all these warnings are quite fragile in the presence of more 
> aggressive optimizations, and I suspect it will only get worse.

 From my recent review of open -Wmaybe-uninitialized bugs (and
the code) it does seem to be both fragile and getting worse.  I've
only found a few simple problems so far in the code but nothing that
would make a dramatic difference so I can't say if it's possible to
do much better, but I'm not done or ready to give up.  If you XFAIL
this too please open a bug for it and make it a blocker for
-Wuninitialized?

Martin

> 
> 4. libphobos/src/std/net/isemail.d
> 
> This is a D test where we don't actually fail, but we trigger the 
> verification code.  It is the only jump threading edge that the new code 
> fails to get over the old code, and it only happens on ppc64.
> 
> It triggers because a BB4 -> BB5 is too expensive to thread, but a BBn 
> -> BB3 -> BB4 -> BB5 is considered safe to thread because BB3 is a latch 
> and it alters the profitability equation.  The reason we don't get it, 
> is that we assume that if a X->Y is unprofitable, it is not worth 
> looking at W->X->Y and so forth.
> 
> Jeff had some fancy ideas on how to attack this.  Once such idea was to 
> stop looking back, but only for things we were absolutely sure would 
> never yield a profitable path.  I tried a subset of this, by allowing 
> further looks on this latch test, but my 1.55% overall performance 
> penalty turned into an 8.33% penalty.  Personally it looks way too 
> expensive for this one isolated case.  Besides, the test where this 
> clamping code originally came from still succeeds (commit 
> eab2541b860c48203115ac6dca3284e982015d2c).
> 
> CONCLUSION
> ==========
> 
> That's basically it.
> 
> If we agree the above things are not big issues, or can be addressed as 
> follow-ups, I'd like to start the ball rolling on the new threader. This 
> would allow more extensive testing of the code, and separate it a bit 
> from the other big changes coming up :).
> 
> Aldy
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-25 17:19           ` Martin Sebor
@ 2021-06-28  6:17             ` Aldy Hernandez
  2021-06-28 22:29               ` Martin Sebor
  0 siblings, 1 reply; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-28  6:17 UTC (permalink / raw)
  To: Martin Sebor, Richard Biener, Jeff Law; +Cc: GCC Mailing List, Martin Sebor



On 6/25/21 7:19 PM, Martin Sebor wrote:
> On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
>> Hi folks.
>>
>> I'm done with benchmarking, testing and cleanups, so I'd like to post 
>> my patchset for review.  However, before doing so, I'd like to address 
>> a handful of meta-issues that may affect how I post these patches.
>>
>> Trapping on differences
>> =======================
>>
>> Originally I wanted to contribute verification code that would trap if 
>> the legacy code threaded any edges the new code couldn't (to be 
>> removed after a week).  However, after having tested on various 
>> architectures and only running once into a missing thread, I'm leaning 
>> towards omitting the verification code, since it's fragile, time 
>> consuming, and quite hacky.
>>
>> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
>> There is only one case, across bootstrap and regression tests where 
>> the verification code is ever tripped (discussed below).
>>
>> Performance
>> ===========
>>
>> I re-ran benchmarks as per our callgrind suite, and the penalty with 
>> the current pipeline is 1.55% of overall compilation time.  As is 
>> being discussed, we should be able to mitigate this significantly by 
>> removing other threading passes.
>>
>> Failing testcases
>> =================
>>
>> I have yet to run into incorrect code being generated, but I have had 
>> to tweak a considerable number of tests.  I have verified every single 
>> discrepancy and documented my changes in the testsuite when it merited 
>> doing so.  However, there are a couple tests that trigger regressions 
>> and I'd like to ask for guidance on how to address them.
>>
>> 1. gcc.c-torture/compile/pr83510.c
>>
>> I would like to XFAIL this.
>>
>> What happens here is that thread1 threads a switch statement such that 
>> the various cases have been split into different independent blocks. 
>> One of these blocks exposes an arr[i_27] access which is later 
>> propagated by VRP to be arr[10].  This is an invalid access, but the 
>> array bounds code doesn't know it is an unreachable path.
> 
> The test has a bunch of loops that iterate over the 10 array elements.
> There have been bug reports about loop unrolling causing false positives
> -Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
> the same issue.
> 
>>
>> However, it is not until dom2 that we "know" that the value of the 
>> switch index is such that the path to arr[10] is unreachable.  For 
>> that matter, it is not until dom3 that we remove the unreachable path.
> 
> If you do XFAIL it can you please isolate a small test case and open
> a bug and make it a -Warray-bounds blocker?

Will do.

> 
>>
>> 2. -Wfree-nonheap-object
>>
>> This warning is triggered while cleaning up an auto_vec.  I see that 
>> the va_heap::release() inline is wrapped with a pragma ignore 
>> "-Wfree-nonheap-object", but this is not sufficient because jump 
>> threading may alter uses in such a way that may_emit_free_warning() 
>> will warn on the *inlined* location, thus bypassing the pragma.
>>
>> I worked around this with a mere:
>>
>>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>>>    location_t loc = tree_inlined_location (exp);
>>> +  loc = EXPR_LOCATION (exp);
>>
>> but this causes a ton of Wfree-nonheap* tests to fail.  I think 
>> someone more knowledgeable should address this (msebor??).
> 
> This sounds like the same problem as PR 98871.  Does the patch below
> fix it?
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
> If so, I suggest getting that patch in first to avoid testsuite
> failures.  If it doesn't fix it I'll look into it before you commit
> your changes.

The latest patch in the above thread does not apply.  Do you have a more 
recent patch?

>> 3. uninit-pred-9_b.c
>>
>> The uninit code is getting confused with the threading and the bogus 
>> warning in line 24 is back.  I looked at the thread, and it is correct.
>>
>> I'm afraid all these warnings are quite fragile in the presence of 
>> more aggressive optimizations, and I suspect it will only get worse.
> 
>  From my recent review of open -Wmaybe-uninitialized bugs (and
> the code) it does seem to be both fragile and getting worse.  I've
> only found a few simple problems so far in the code but nothing that
> would make a dramatic difference so I can't say if it's possible to
> do much better, but I'm not done or ready to give up.  If you XFAIL
> this too please open a bug for it and make it a blocker for
> -Wuninitialized?

Will do.

Good to know these are somewhat known issues.

Thanks.
Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-28  6:17             ` Aldy Hernandez
@ 2021-06-28 22:29               ` Martin Sebor
  2021-06-29  0:32                 ` Aldy Hernandez
  0 siblings, 1 reply; 25+ messages in thread
From: Martin Sebor @ 2021-06-28 22:29 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, Jeff Law; +Cc: GCC Mailing List, Martin Sebor

On 6/28/21 12:17 AM, Aldy Hernandez wrote:
> 
> 
> On 6/25/21 7:19 PM, Martin Sebor wrote:
>> On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
>>> Hi folks.
>>>
>>> I'm done with benchmarking, testing and cleanups, so I'd like to post 
>>> my patchset for review.  However, before doing so, I'd like to 
>>> address a handful of meta-issues that may affect how I post these 
>>> patches.
>>>
>>> Trapping on differences
>>> =======================
>>>
>>> Originally I wanted to contribute verification code that would trap 
>>> if the legacy code threaded any edges the new code couldn't (to be 
>>> removed after a week).  However, after having tested on various 
>>> architectures and only running once into a missing thread, I'm 
>>> leaning towards omitting the verification code, since it's fragile, 
>>> time consuming, and quite hacky.
>>>
>>> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le. 
>>> There is only one case, across bootstrap and regression tests where 
>>> the verification code is ever tripped (discussed below).
>>>
>>> Performance
>>> ===========
>>>
>>> I re-ran benchmarks as per our callgrind suite, and the penalty with 
>>> the current pipeline is 1.55% of overall compilation time.  As is 
>>> being discussed, we should be able to mitigate this significantly by 
>>> removing other threading passes.
>>>
>>> Failing testcases
>>> =================
>>>
>>> I have yet to run into incorrect code being generated, but I have had 
>>> to tweak a considerable number of tests.  I have verified every 
>>> single discrepancy and documented my changes in the testsuite when it 
>>> merited doing so.  However, there are a couple tests that trigger 
>>> regressions and I'd like to ask for guidance on how to address them.
>>>
>>> 1. gcc.c-torture/compile/pr83510.c
>>>
>>> I would like to XFAIL this.
>>>
>>> What happens here is that thread1 threads a switch statement such 
>>> that the various cases have been split into different independent 
>>> blocks. One of these blocks exposes an arr[i_27] access which is 
>>> later propagated by VRP to be arr[10].  This is an invalid access, 
>>> but the array bounds code doesn't know it is an unreachable path.
>>
>> The test has a bunch of loops that iterate over the 10 array elements.
>> There have been bug reports about loop unrolling causing false positives
>> -Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
>> the same issue.
>>
>>>
>>> However, it is not until dom2 that we "know" that the value of the 
>>> switch index is such that the path to arr[10] is unreachable.  For 
>>> that matter, it is not until dom3 that we remove the unreachable path.
>>
>> If you do XFAIL it can you please isolate a small test case and open
>> a bug and make it a -Warray-bounds blocker?
> 
> Will do.
> 
>>
>>>
>>> 2. -Wfree-nonheap-object
>>>
>>> This warning is triggered while cleaning up an auto_vec.  I see that 
>>> the va_heap::release() inline is wrapped with a pragma ignore 
>>> "-Wfree-nonheap-object", but this is not sufficient because jump 
>>> threading may alter uses in such a way that may_emit_free_warning() 
>>> will warn on the *inlined* location, thus bypassing the pragma.
>>>
>>> I worked around this with a mere:
>>>
>>>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>>>>    location_t loc = tree_inlined_location (exp);
>>>> +  loc = EXPR_LOCATION (exp);
>>>
>>> but this causes a ton of Wfree-nonheap* tests to fail.  I think 
>>> someone more knowledgeable should address this (msebor??).
>>
>> This sounds like the same problem as PR 98871.  Does the patch below
>> fix it?
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
>> If so, I suggest getting that patch in first to avoid testsuite
>> failures.  If it doesn't fix it I'll look into it before you commit
>> your changes.
> 
> The latest patch in the above thread does not apply.  Do you have a more 
> recent patch?

The first patch in the series applies cleanly for me:
   https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572516.html
but patch 2/4 does need a few adjustments.  With those and your
latest changes applied I don't see any -Wfree-nonheap-object test
failures.

Martin

> 
>>> 3. uninit-pred-9_b.c
>>>
>>> The uninit code is getting confused with the threading and the bogus 
>>> warning in line 24 is back.  I looked at the thread, and it is correct.
>>>
>>> I'm afraid all these warnings are quite fragile in the presence of 
>>> more aggressive optimizations, and I suspect it will only get worse.
>>
>>  From my recent review of open -Wmaybe-uninitialized bugs (and
>> the code) it does seem to be both fragile and getting worse.  I've
>> only found a few simple problems so far in the code but nothing that
>> would make a dramatic difference so I can't say if it's possible to
>> do much better, but I'm not done or ready to give up.  If you XFAIL
>> this too please open a bug for it and make it a blocker for
>> -Wuninitialized?
> 
> Will do.
> 
> Good to know these are somewhat known issues.
> 
> Thanks.
> Aldy
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-28 22:29               ` Martin Sebor
@ 2021-06-29  0:32                 ` Aldy Hernandez
  2021-06-29 22:16                   ` Martin Sebor
  0 siblings, 1 reply; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-29  0:32 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Martin Sebor

I had changed my approach to passing -Wno-free-nonheap-object in the
Makefile. Can you try disabling the Makefile bit and bootstrapping, cause
that was still failing.

Thanks.
Aldy

On Tue, Jun 29, 2021, 00:29 Martin Sebor <msebor@gmail.com> wrote:

> On 6/28/21 12:17 AM, Aldy Hernandez wrote:
> >
> >
> > On 6/25/21 7:19 PM, Martin Sebor wrote:
> >> On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
> >>> Hi folks.
> >>>
> >>> I'm done with benchmarking, testing and cleanups, so I'd like to post
> >>> my patchset for review.  However, before doing so, I'd like to
> >>> address a handful of meta-issues that may affect how I post these
> >>> patches.
> >>>
> >>> Trapping on differences
> >>> =======================
> >>>
> >>> Originally I wanted to contribute verification code that would trap
> >>> if the legacy code threaded any edges the new code couldn't (to be
> >>> removed after a week).  However, after having tested on various
> >>> architectures and only running once into a missing thread, I'm
> >>> leaning towards omitting the verification code, since it's fragile,
> >>> time consuming, and quite hacky.
> >>>
> >>> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le.
> >>> There is only one case, across bootstrap and regression tests where
> >>> the verification code is ever tripped (discussed below).
> >>>
> >>> Performance
> >>> ===========
> >>>
> >>> I re-ran benchmarks as per our callgrind suite, and the penalty with
> >>> the current pipeline is 1.55% of overall compilation time.  As is
> >>> being discussed, we should be able to mitigate this significantly by
> >>> removing other threading passes.
> >>>
> >>> Failing testcases
> >>> =================
> >>>
> >>> I have yet to run into incorrect code being generated, but I have had
> >>> to tweak a considerable number of tests.  I have verified every
> >>> single discrepancy and documented my changes in the testsuite when it
> >>> merited doing so.  However, there are a couple tests that trigger
> >>> regressions and I'd like to ask for guidance on how to address them.
> >>>
> >>> 1. gcc.c-torture/compile/pr83510.c
> >>>
> >>> I would like to XFAIL this.
> >>>
> >>> What happens here is that thread1 threads a switch statement such
> >>> that the various cases have been split into different independent
> >>> blocks. One of these blocks exposes an arr[i_27] access which is
> >>> later propagated by VRP to be arr[10].  This is an invalid access,
> >>> but the array bounds code doesn't know it is an unreachable path.
> >>
> >> The test has a bunch of loops that iterate over the 10 array elements.
> >> There have been bug reports about loop unrolling causing false positives
> >> -Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
> >> the same issue.
> >>
> >>>
> >>> However, it is not until dom2 that we "know" that the value of the
> >>> switch index is such that the path to arr[10] is unreachable.  For
> >>> that matter, it is not until dom3 that we remove the unreachable path.
> >>
> >> If you do XFAIL it can you please isolate a small test case and open
> >> a bug and make it a -Warray-bounds blocker?
> >
> > Will do.
> >
> >>
> >>>
> >>> 2. -Wfree-nonheap-object
> >>>
> >>> This warning is triggered while cleaning up an auto_vec.  I see that
> >>> the va_heap::release() inline is wrapped with a pragma ignore
> >>> "-Wfree-nonheap-object", but this is not sufficient because jump
> >>> threading may alter uses in such a way that may_emit_free_warning()
> >>> will warn on the *inlined* location, thus bypassing the pragma.
> >>>
> >>> I worked around this with a mere:
> >>>
> >>>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
> >>>>    location_t loc = tree_inlined_location (exp);
> >>>> +  loc = EXPR_LOCATION (exp);
> >>>
> >>> but this causes a ton of Wfree-nonheap* tests to fail.  I think
> >>> someone more knowledgeable should address this (msebor??).
> >>
> >> This sounds like the same problem as PR 98871.  Does the patch below
> >> fix it?
> >> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
> >> If so, I suggest getting that patch in first to avoid testsuite
> >> failures.  If it doesn't fix it I'll look into it before you commit
> >> your changes.
> >
> > The latest patch in the above thread does not apply.  Do you have a more
> > recent patch?
>
> The first patch in the series applies cleanly for me:
>    https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572516.html
> but patch 2/4 does need a few adjustments.  With those and your
> latest changes applied I don't see any -Wfree-nonheap-object test
> failures.
>
> Martin
>
> >
> >>> 3. uninit-pred-9_b.c
> >>>
> >>> The uninit code is getting confused with the threading and the bogus
> >>> warning in line 24 is back.  I looked at the thread, and it is correct.
> >>>
> >>> I'm afraid all these warnings are quite fragile in the presence of
> >>> more aggressive optimizations, and I suspect it will only get worse.
> >>
> >>  From my recent review of open -Wmaybe-uninitialized bugs (and
> >> the code) it does seem to be both fragile and getting worse.  I've
> >> only found a few simple problems so far in the code but nothing that
> >> would make a dramatic difference so I can't say if it's possible to
> >> do much better, but I'm not done or ready to give up.  If you XFAIL
> >> this too please open a bug for it and make it a blocker for
> >> -Wuninitialized?
> >
> > Will do.
> >
> > Good to know these are somewhat known issues.
> >
> > Thanks.
> > Aldy
> >
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-29  0:32                 ` Aldy Hernandez
@ 2021-06-29 22:16                   ` Martin Sebor
  2021-06-30  6:07                     ` Aldy Hernandez
  0 siblings, 1 reply; 25+ messages in thread
From: Martin Sebor @ 2021-06-29 22:16 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List

On 6/28/21 6:32 PM, Aldy Hernandez wrote:
> I had changed my approach to passing -Wno-free-nonheap-object in the 
> Makefile. Can you try disabling the Makefile bit and bootstrapping, 
> cause that was still failing.

I see.  I just tested it and my patch does let the existing #pragma
suppress the warning.  I just pinged the core patch yesterday so
once it and the rest of the series gets approved there will be no
need for the Makefile workaround.

Martin

> 
> Thanks.
> Aldy
> 
> On Tue, Jun 29, 2021, 00:29 Martin Sebor <msebor@gmail.com 
> <mailto:msebor@gmail.com>> wrote:
> 
>     On 6/28/21 12:17 AM, Aldy Hernandez wrote:
>      >
>      >
>      > On 6/25/21 7:19 PM, Martin Sebor wrote:
>      >> On 6/25/21 10:20 AM, Aldy Hernandez via Gcc wrote:
>      >>> Hi folks.
>      >>>
>      >>> I'm done with benchmarking, testing and cleanups, so I'd like
>     to post
>      >>> my patchset for review.  However, before doing so, I'd like to
>      >>> address a handful of meta-issues that may affect how I post these
>      >>> patches.
>      >>>
>      >>> Trapping on differences
>      >>> =======================
>      >>>
>      >>> Originally I wanted to contribute verification code that would
>     trap
>      >>> if the legacy code threaded any edges the new code couldn't (to be
>      >>> removed after a week).  However, after having tested on various
>      >>> architectures and only running once into a missing thread, I'm
>      >>> leaning towards omitting the verification code, since it's
>     fragile,
>      >>> time consuming, and quite hacky.
>      >>>
>      >>> For the record, I have tested on x86-64, aarch64, ppc64 and
>     ppc64le.
>      >>> There is only one case, across bootstrap and regression tests
>     where
>      >>> the verification code is ever tripped (discussed below).
>      >>>
>      >>> Performance
>      >>> ===========
>      >>>
>      >>> I re-ran benchmarks as per our callgrind suite, and the penalty
>     with
>      >>> the current pipeline is 1.55% of overall compilation time.  As is
>      >>> being discussed, we should be able to mitigate this
>     significantly by
>      >>> removing other threading passes.
>      >>>
>      >>> Failing testcases
>      >>> =================
>      >>>
>      >>> I have yet to run into incorrect code being generated, but I
>     have had
>      >>> to tweak a considerable number of tests.  I have verified every
>      >>> single discrepancy and documented my changes in the testsuite
>     when it
>      >>> merited doing so.  However, there are a couple tests that trigger
>      >>> regressions and I'd like to ask for guidance on how to address
>     them.
>      >>>
>      >>> 1. gcc.c-torture/compile/pr83510.c
>      >>>
>      >>> I would like to XFAIL this.
>      >>>
>      >>> What happens here is that thread1 threads a switch statement such
>      >>> that the various cases have been split into different independent
>      >>> blocks. One of these blocks exposes an arr[i_27] access which is
>      >>> later propagated by VRP to be arr[10].  This is an invalid access,
>      >>> but the array bounds code doesn't know it is an unreachable path.
>      >>
>      >> The test has a bunch of loops that iterate over the 10 array
>     elements.
>      >> There have been bug reports about loop unrolling causing false
>     positives
>      >> -Warray-bounds (e.g., PR 92539, 92110, or 86341) so this could be
>      >> the same issue.
>      >>
>      >>>
>      >>> However, it is not until dom2 that we "know" that the value of the
>      >>> switch index is such that the path to arr[10] is unreachable.  For
>      >>> that matter, it is not until dom3 that we remove the
>     unreachable path.
>      >>
>      >> If you do XFAIL it can you please isolate a small test case and open
>      >> a bug and make it a -Warray-bounds blocker?
>      >
>      > Will do.
>      >
>      >>
>      >>>
>      >>> 2. -Wfree-nonheap-object
>      >>>
>      >>> This warning is triggered while cleaning up an auto_vec.  I see
>     that
>      >>> the va_heap::release() inline is wrapped with a pragma ignore
>      >>> "-Wfree-nonheap-object", but this is not sufficient because jump
>      >>> threading may alter uses in such a way that
>     may_emit_free_warning()
>      >>> will warn on the *inlined* location, thus bypassing the pragma.
>      >>>
>      >>> I worked around this with a mere:
>      >>>
>      >>>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>      >>>>    location_t loc = tree_inlined_location (exp);
>      >>>> +  loc = EXPR_LOCATION (exp);
>      >>>
>      >>> but this causes a ton of Wfree-nonheap* tests to fail.  I think
>      >>> someone more knowledgeable should address this (msebor??).
>      >>
>      >> This sounds like the same problem as PR 98871.  Does the patch below
>      >> fix it?
>      >> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572515.html
>      >> If so, I suggest getting that patch in first to avoid testsuite
>      >> failures.  If it doesn't fix it I'll look into it before you commit
>      >> your changes.
>      >
>      > The latest patch in the above thread does not apply.  Do you have
>     a more
>      > recent patch?
> 
>     The first patch in the series applies cleanly for me:
>     https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572516.html
>     but patch 2/4 does need a few adjustments.  With those and your
>     latest changes applied I don't see any -Wfree-nonheap-object test
>     failures.
> 
>     Martin
> 
>      >
>      >>> 3. uninit-pred-9_b.c
>      >>>
>      >>> The uninit code is getting confused with the threading and the
>     bogus
>      >>> warning in line 24 is back.  I looked at the thread, and it is
>     correct.
>      >>>
>      >>> I'm afraid all these warnings are quite fragile in the presence of
>      >>> more aggressive optimizations, and I suspect it will only get
>     worse.
>      >>
>      >>  From my recent review of open -Wmaybe-uninitialized bugs (and
>      >> the code) it does seem to be both fragile and getting worse.  I've
>      >> only found a few simple problems so far in the code but nothing that
>      >> would make a dramatic difference so I can't say if it's possible to
>      >> do much better, but I'm not done or ready to give up.  If you XFAIL
>      >> this too please open a bug for it and make it a blocker for
>      >> -Wuninitialized?
>      >
>      > Will do.
>      >
>      > Good to know these are somewhat known issues.
>      >
>      > Thanks.
>      > Aldy
>      >
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-29 22:16                   ` Martin Sebor
@ 2021-06-30  6:07                     ` Aldy Hernandez
  0 siblings, 0 replies; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-30  6:07 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Richard Biener, Jeff Law, GCC Mailing List



On 6/30/21 12:16 AM, Martin Sebor wrote:
> On 6/28/21 6:32 PM, Aldy Hernandez wrote:
>> I had changed my approach to passing -Wno-free-nonheap-object in the 
>> Makefile. Can you try disabling the Makefile bit and bootstrapping, 
>> cause that was still failing.
> 
> I see.  I just tested it and my patch does let the existing #pragma
> suppress the warning.  I just pinged the core patch yesterday so
> once it and the rest of the series gets approved there will be no
> need for the Makefile workaround.

Thanks for taking care of this.
Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-25 16:20         ` Aldy Hernandez
  2021-06-25 17:19           ` Martin Sebor
@ 2021-06-28  8:22           ` Richard Biener
  2021-06-28 12:05             ` Aldy Hernandez
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Biener @ 2021-06-28  8:22 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Jeff Law, GCC Mailing List, Andrew MacLeod, Martin Sebor

On Fri, Jun 25, 2021 at 6:20 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> Hi folks.
>
> I'm done with benchmarking, testing and cleanups, so I'd like to post my
> patchset for review.  However, before doing so, I'd like to address a
> handful of meta-issues that may affect how I post these patches.

First of all thanks for the detailed analysis and report!

> Trapping on differences
> =======================
>
> Originally I wanted to contribute verification code that would trap if
> the legacy code threaded any edges the new code couldn't (to be removed
> after a week).  However, after having tested on various architectures
> and only running once into a missing thread, I'm leaning towards
> omitting the verification code, since it's fragile, time consuming, and
> quite hacky.

Agreed.

> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le.
> There is only one case, across bootstrap and regression tests where the
> verification code is ever tripped (discussed below).
>
> Performance
> ===========
>
> I re-ran benchmarks as per our callgrind suite, and the penalty with the
> current pipeline is 1.55% of overall compilation time.  As is being
> discussed, we should be able to mitigate this significantly by removing
> other threading passes.

1.55% for the overall compilation looks quite large - is this suite particularly
demanding on VRP/ranger?  Is the figure for 'cc1files' similar? (collect
preprocessed sources of gcc/*.{c,cc} compiles, like by appending
-save-temps to CFLAGS in a non-bootstrap build)

I'm curious if you tried to look at the worst offenders and if there's
anything new, threading specific.  More threading might result in
larger code (how is cc1 size affected by the change?) and more
code to compile by later passes can easily explain >1% differences.

> Failing testcases
> =================
>
> I have yet to run into incorrect code being generated, but I have had to
> tweak a considerable number of tests.  I have verified every single
> discrepancy and documented my changes in the testsuite when it merited
> doing so.  However, there are a couple tests that trigger regressions
> and I'd like to ask for guidance on how to address them.
>
> 1. gcc.c-torture/compile/pr83510.c
>
> I would like to XFAIL this.
>
> What happens here is that thread1 threads a switch statement such that
> the various cases have been split into different independent blocks.
> One of these blocks exposes an arr[i_27] access which is later
> propagated by VRP to be arr[10].  This is an invalid access, but the
> array bounds code doesn't know it is an unreachable path.
>
> However, it is not until dom2 that we "know" that the value of the
> switch index is such that the path to arr[10] is unreachable.  For that
> matter, it is not until dom3 that we remove the unreachable path.

Sounds reasonable.

> 2. -Wfree-nonheap-object
>
> This warning is triggered while cleaning up an auto_vec.  I see that the
> va_heap::release() inline is wrapped with a pragma ignore
> "-Wfree-nonheap-object", but this is not sufficient because jump
> threading may alter uses in such a way that may_emit_free_warning() will
> warn on the *inlined* location, thus bypassing the pragma.
>
> I worked around this with a mere:
>
>  > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
> >    location_t loc = tree_inlined_location (exp);
> > +  loc = EXPR_LOCATION (exp);
>
> but this causes a ton of Wfree-nonheap* tests to fail.  I think someone
> more knowledgeable should address this (msebor??).

That looks wrong, but see msebors response for maybe some better answer.

> 3. uninit-pred-9_b.c
>
> The uninit code is getting confused with the threading and the bogus
> warning in line 24 is back.  I looked at the thread, and it is correct.
>
> I'm afraid all these warnings are quite fragile in the presence of more
> aggressive optimizations, and I suspect it will only get worse.

Yep.  Shouldn't be a blocker, just XFAIL ...

> 4. libphobos/src/std/net/isemail.d
>
> This is a D test where we don't actually fail, but we trigger the
> verification code.  It is the only jump threading edge that the new code
> fails to get over the old code, and it only happens on ppc64.
>
> It triggers because a BB4 -> BB5 is too expensive to thread, but a BBn
> -> BB3 -> BB4 -> BB5 is considered safe to thread because BB3 is a latch
> and it alters the profitability equation.  The reason we don't get it,
> is that we assume that if a X->Y is unprofitable, it is not worth
> looking at W->X->Y and so forth.
>
> Jeff had some fancy ideas on how to attack this.  Once such idea was to
> stop looking back, but only for things we were absolutely sure would
> never yield a profitable path.  I tried a subset of this, by allowing
> further looks on this latch test, but my 1.55% overall performance
> penalty turned into an 8.33% penalty.  Personally it looks way too
> expensive for this one isolated case.  Besides, the test where this
> clamping code originally came from still succeeds (commit
> eab2541b860c48203115ac6dca3284e982015d2c).

ick - I definitely expect some corner case optimization "regressions"
caused by the change if and only because it might trigger pass
ordering issues with later passes not expecting "optimized" code.
So I'd disregard a single "missed" case for this moment.  It might
be interesting to try to distill a C testcase from the case and file it
with bugzilla for reference purposes.

> CONCLUSION
> ==========
>
> That's basically it.
>
> If we agree the above things are not big issues, or can be addressed as
> follow-ups, I'd like to start the ball rolling on the new threader.
> This would allow more extensive testing of the code, and separate it a
> bit from the other big changes coming up :).

I'd say go ahead (with the bogus change somehow resolved), I'd still
like to know sth about the compile-time issue but then we can eventually
deal with this as followup as well.  If size is the reason we can see taming
down the threader via some --param adjustments for example.

I'm just trying to make sure we're not introducing some algorithmic
quadraticnesses that were not there before.

Richard.

> Aldy
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-28  8:22           ` Richard Biener
@ 2021-06-28 12:05             ` Aldy Hernandez
  0 siblings, 0 replies; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-28 12:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, GCC Mailing List, Andrew MacLeod, Martin Sebor



On 6/28/21 10:22 AM, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 6:20 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>>
>> Hi folks.
>>
>> I'm done with benchmarking, testing and cleanups, so I'd like to post my
>> patchset for review.  However, before doing so, I'd like to address a
>> handful of meta-issues that may affect how I post these patches.
> 
> First of all thanks for the detailed analysis and report!
> 
>> Trapping on differences
>> =======================
>>
>> Originally I wanted to contribute verification code that would trap if
>> the legacy code threaded any edges the new code couldn't (to be removed
>> after a week).  However, after having tested on various architectures
>> and only running once into a missing thread, I'm leaning towards
>> omitting the verification code, since it's fragile, time consuming, and
>> quite hacky.
> 
> Agreed.
> 
>> For the record, I have tested on x86-64, aarch64, ppc64 and ppc64le.
>> There is only one case, across bootstrap and regression tests where the
>> verification code is ever tripped (discussed below).
>>
>> Performance
>> ===========
>>
>> I re-ran benchmarks as per our callgrind suite, and the penalty with the
>> current pipeline is 1.55% of overall compilation time.  As is being
>> discussed, we should be able to mitigate this significantly by removing
>> other threading passes.
> 
> 1.55% for the overall compilation looks quite large - is this suite particularly
> demanding on VRP/ranger?  Is the figure for 'cc1files' similar? (collect
> preprocessed sources of gcc/*.{c,cc} compiles, like by appending
> -save-temps to CFLAGS in a non-bootstrap build)

The figures are for cc1 files.  They are .ii files we run under 
callgrind and compare "instruction counts" as per the tool.  We've found 
it's comparable to -ftime-report, though not exactly.  We've been using 
this for benchmarking, since evrp times were so small that it was hard 
to compare user times.

> 
> I'm curious if you tried to look at the worst offenders and if there's
> anything new, threading specific.  More threading might result in
> larger code (how is cc1 size affected by the change?) and more
> code to compile by later passes can easily explain >1% differences.

A while ago, I did look at worst offenders on the branch, but there was 
nothing obvious.  As it stands now, the average cost for the threading 
pass is 858%, with 95% of .ii files falling under 1800%.  I could look 
at the outliers which seem to be c-fold, gcc-ar, gcc-ranlib, gcc-nm, and 
c-opts.

cc1 on x86-64 seems to be 0.135% larger for an --enable-checking build.

> 
>> Failing testcases
>> =================
>>
>> I have yet to run into incorrect code being generated, but I have had to
>> tweak a considerable number of tests.  I have verified every single
>> discrepancy and documented my changes in the testsuite when it merited
>> doing so.  However, there are a couple tests that trigger regressions
>> and I'd like to ask for guidance on how to address them.
>>
>> 1. gcc.c-torture/compile/pr83510.c
>>
>> I would like to XFAIL this.
>>
>> What happens here is that thread1 threads a switch statement such that
>> the various cases have been split into different independent blocks.
>> One of these blocks exposes an arr[i_27] access which is later
>> propagated by VRP to be arr[10].  This is an invalid access, but the
>> array bounds code doesn't know it is an unreachable path.
>>
>> However, it is not until dom2 that we "know" that the value of the
>> switch index is such that the path to arr[10] is unreachable.  For that
>> matter, it is not until dom3 that we remove the unreachable path.
> 
> Sounds reasonable.
> 
>> 2. -Wfree-nonheap-object
>>
>> This warning is triggered while cleaning up an auto_vec.  I see that the
>> va_heap::release() inline is wrapped with a pragma ignore
>> "-Wfree-nonheap-object", but this is not sufficient because jump
>> threading may alter uses in such a way that may_emit_free_warning() will
>> warn on the *inlined* location, thus bypassing the pragma.
>>
>> I worked around this with a mere:
>>
>>   > @@ -13839,6 +13839,7 @@ maybe_emit_free_warning (tree exp)
>>>     location_t loc = tree_inlined_location (exp);
>>> +  loc = EXPR_LOCATION (exp);
>>
>> but this causes a ton of Wfree-nonheap* tests to fail.  I think someone
>> more knowledgeable should address this (msebor??).
> 
> That looks wrong, but see msebors response for maybe some better answer.

Oh, it was just a stop-gap.  I have instead opted for the following 
temporary measure in Makefile.in:

tree-ssa-loop-im.o-warn = -Wno-free-nonheap-object

I will work with Martin on a more appropriate solution.

> 
>> 3. uninit-pred-9_b.c
>>
>> The uninit code is getting confused with the threading and the bogus
>> warning in line 24 is back.  I looked at the thread, and it is correct.
>>
>> I'm afraid all these warnings are quite fragile in the presence of more
>> aggressive optimizations, and I suspect it will only get worse.
> 
> Yep.  Shouldn't be a blocker, just XFAIL ...
> 
>> 4. libphobos/src/std/net/isemail.d
>>
>> This is a D test where we don't actually fail, but we trigger the
>> verification code.  It is the only jump threading edge that the new code
>> fails to get over the old code, and it only happens on ppc64.
>>
>> It triggers because a BB4 -> BB5 is too expensive to thread, but a BBn
>> -> BB3 -> BB4 -> BB5 is considered safe to thread because BB3 is a latch
>> and it alters the profitability equation.  The reason we don't get it,
>> is that we assume that if a X->Y is unprofitable, it is not worth
>> looking at W->X->Y and so forth.
>>
>> Jeff had some fancy ideas on how to attack this.  Once such idea was to
>> stop looking back, but only for things we were absolutely sure would
>> never yield a profitable path.  I tried a subset of this, by allowing
>> further looks on this latch test, but my 1.55% overall performance
>> penalty turned into an 8.33% penalty.  Personally it looks way too
>> expensive for this one isolated case.  Besides, the test where this
>> clamping code originally came from still succeeds (commit
>> eab2541b860c48203115ac6dca3284e982015d2c).
> 
> ick - I definitely expect some corner case optimization "regressions"
> caused by the change if and only because it might trigger pass
> ordering issues with later passes not expecting "optimized" code.
> So I'd disregard a single "missed" case for this moment.  It might
> be interesting to try to distill a C testcase from the case and file it
> with bugzilla for reference purposes.

Assuming there's nothing inherently D-ish in the testcase, I can 
certainly do this.

> 
>> CONCLUSION
>> ==========
>>
>> That's basically it.
>>
>> If we agree the above things are not big issues, or can be addressed as
>> follow-ups, I'd like to start the ball rolling on the new threader.
>> This would allow more extensive testing of the code, and separate it a
>> bit from the other big changes coming up :).
> 
> I'd say go ahead (with the bogus change somehow resolved), I'd still
> like to know sth about the compile-time issue but then we can eventually
> deal with this as followup as well.  If size is the reason we can see taming
> down the threader via some --param adjustments for example.
> 
> I'm just trying to make sure we're not introducing some algorithmic
> quadraticnesses that were not there before.


Agreed.  FWIW, there are already various --param knobs in the backwards 
threader we could tweak, though in practice the truly problematic cases 
we chop right off the bat, as soon as we go down an unprofitable path.

Thanks.
Aldy


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-09 11:48 replacing the backwards threader and more Aldy Hernandez
  2021-06-09 12:09 ` Richard Biener
@ 2021-06-13 21:59 ` Jeff Law
  2021-06-14  6:40   ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Law @ 2021-06-13 21:59 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Andrew MacLeod, GCC Mailing List



On 6/9/2021 5:48 AM, Aldy Hernandez wrote:
> Hi Jeff.  Hi folks.
>
> What started as a foray into severing the old (forward) threader's 
> dependency on evrp, turned into a rewrite of the backwards threader 
> code.  I'd like to discuss the possibility of replacing the current 
> backwards threader with a new one that gets far more threads and can 
> potentially subsume all threaders in the future.
>
> I won't include code here, as it will just detract from the high level 
> discussion.  But if it helps, I could post what I have, which just 
> needs some cleanups and porting to the latest trunk changes Andrew has 
> made.
>
> Currently the backwards threader works by traversing DEF chains 
> through PHIs leading to possible paths that start in a constant. When 
> such a path is found, it is checked to see if it is profitable, and if 
> so, the constant path is threaded.  The current implementation is 
> rather limited since backwards paths must end in a constant.  For 
> example, the backwards threader can't get any of the tests in 
> gcc.dg/tree-ssa/ssa-thread-14.c:
>
>   if (a && b)
>     foo ();
>   if (!b && c)
>     bar ();
>
> etc.
Right.  And these kinds of cases are particularly interesting to capture 
-- not only do you remove the runtime test/compare, all the setup code 
usually dies as well.  I can't remember who, but someone added some bits 
to detect these cases in DOM a while back and while the number of 
additional jumps threaded wasn't great, the overall impact was much 
better than we initially realized.   Instead of allowign removal of a 
single compare/branch, it typically allowed removal of a chain of 
logicals that fed the conditional.

>
> After my refactoring patches to the threading code, it is now possible 
> to drop in an alternate implementation that shares the profitability 
> code (is this path profitable?), the jump registry, and the actual 
> jump threading code.  I have leveraged this to write a ranger-based 
> threader that gets every single thread the current code gets, plus 
> 90-130% more.
Sweet.

>
> Here are the details from the branch, which should be very similar to 
> trunk.  I'm presenting the branch numbers because they contain 
> Andrew's upcoming relational query which significantly juices up the 
> results.
Yea, I'm not surprised that the relational query helps significantly 
here.  And I'm not surprised that we can do much better with the 
backwards threader with a rewrite.

Much of the ranger design was with the idea behind using it in the 
backwards jump threader in mind.  Backwards threading is, IMHO, a much 
better way to think about the problem.  THe backwards threader also has 
a much stronger region copier -- so we don't have to live with the 
various limitations of the old jump threading approach.



>
> New threader:
>          ethread:65043    (+3.06%)
>          dom:32450      (-13.3%)
>          backwards threader:72482   (+89.6%)
>          vrp:40532      (-30.7%)
>   Total threaded:  210507 (+6.70%)
>
> This means that the new code gets 89.6% more jump threading 
> opportunities than the code I want to replace.  In doing so, it 
> reduces the amount of DOM threading opportunities by 13.3% and by 
> 30.7% from the VRP jump threader.  The total  improvement across the 
> jump threading opportunities in the compiler is 6.70%.
This looks good at first glance.  It's worth noting that the backwards 
threader runs before the others, so, yea, as it captures more stuff I 
would expect DOM/VRP to capture fewer things.    It would be interesting 
to know the breakdown of things caught by VRP1/VRP2 and how much of that 
is secondary opportunities that are only appearing because we've done a 
better job earlier.

And just to be clear, I expect that we're going to leave some of those 
secondary opportunities on the table -- we just don't want it to be too 
many :-)  When I last looked at this my sense was wiring the backwards 
threader and ranger together should be enough to subsume VRP1/VRP2 jump 
threading.

>
> However, these are pessimistic numbers...
>
> I have noticed that some of the threading opportunities that DOM and 
> VRP now get are not because they're smarter, but because they're 
> picking up opportunities that the new code exposes.  I experimented 
> with running an iterative threader, and then seeing what VRP and DOM 
> could actually get.  This is too expensive to do in real life, but it 
> at least shows what the effect of the new code is on DOM/VRP's abilities:
>
>   Iterative threader:
>     ethread:65043    (+3.06%)
>     dom:31170    (-16.7%)
>         thread:86717    (+127%)
>         vrp:33851    (-42.2%)
>   Total threaded:  216781 (+9.90%)
>
> This means that the new code not only gets 127% more cases, but it 
> reduces the DOM and VRP opportunities considerably (16.7% and 42.2% 
> respectively).   The end result is that we have the possibility of 
> getting almost 10% more jump threading opportunities in the entire 
> compilation run.
>
> (Note that the new code gets even more opportunities, but I'm only 
> reporting the profitable ones that made it all the way through to the 
> threader backend, and actually eliminated a branch.)
Thanks for clarifying that.  It was one of the questions that first 
popped into my mind.

>
> The overall compilation hit from this work is currently 1.38% as 
> measured by callgrind.  We should be able to reduce this a bit, plus 
> we could get some of that back if we can replace the DOM and VRP 
> threaders (future work).
Given how close we were to dropping the VRP threaders before, I would 
support dropping them at the same time.  That gives you a bit more 
compile-time budget.

>
> My proposed implementation should be able to get any threading 
> opportunity, and will get more as range-ops and ranger improve.
>
> I can go into the details if necessary, but the gist of it is that we 
> leverage the import facility in the ranger to only look up paths that 
> have a direct repercussion in the conditional being threaded, thus 
> reducing the search space.  This enhanced path discovery, plus an 
> engine to resolve conditionals based on knowledge from a CFG path, is 
> all that is needed to register new paths.  There is no limit to how 
> far back we look, though in practice, we stop looking once a path is 
> too expensive to continue the search in a given direction.
Right.  That's one of the great things about Ranger -- it can 
dramatically reduce the search space.

>
> The solver API is simple:
>
> // This class is a thread path solver.  Given a set of BBs indicating
> // a path through the CFG, range_in_path() will return the range
> // of an SSA as if the BBs in the path would have been executed in
> // order.
> //
> // Note that the blocks are in reverse order, thus the exit block is 
> path[0].
>
> class thread_solver : gori_compute
> {
>
> public:
>   thread_solver (gimple_ranger &ranger);
>   virtual ~thread_solver ();
>   void set_path (const vec<basic_block> *, const bitmap_head *imports);
>   void range_in_path (irange &, tree name);
>   void range_in_path (irange &, gimple *);
> ...
> };
>
> Basically, as we're discovering paths, we ask the solver what the 
> value of the final conditional in a BB is in a given path.  If it 
> resolves, we register the path.
Exactly.  Given a path, do we know enough to resolve the conditional at 
the end.  If so register the path as a potential jump threading opportunity.

>
> A follow-up project would be to analyze what DOM/VRP are actually 
> getting that we don't, because in theory with an enhanced ranger, we 
> should be able to get everything they do (minus some float stuff, and 
> some CSE things DOM does).  However, IMO, this is good enough to at 
> least replace the current backwards threading code.
I bet it's going to be tougher to remove DOM's threader.  It knows how 
to do thinks like resolve memory references using temporary equivalences 
and such.  But I bet it's enough to drop the VRP based threaders.

>
> My suggestion would be to keep both implementations, defaulting to the 
> ranger based, and running the old code immediately after-- trapping if 
> it can find any threading opportunities.  After a few weeks, we could 
> kill the old code.
I can live with that :-)

> p.s. BTW, ranger-based is technically a minomer.  It's gori based.  We 
> don't need the entire ranger caching ability here.  I'm only using it 
> to get the imports for the interesting conditionals, since those are 
> static.
That's probably sufficient to capture the first order effects.  I can 
come up with theoritical cases that it'd likely miss (involving 
temporary equivalences which could affect the set of interesting 
conditionals), but I doubt they're important in practice.

jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-13 21:59 ` Jeff Law
@ 2021-06-14  6:40   ` Richard Biener
  2021-06-15  4:03     ` Jeff Law
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2021-06-14  6:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: Aldy Hernandez, GCC Mailing List

On Mon, Jun 14, 2021 at 12:02 AM Jeff Law via Gcc <gcc@gcc.gnu.org> wrote:
>
>
>
> On 6/9/2021 5:48 AM, Aldy Hernandez wrote:
> > Hi Jeff.  Hi folks.
> >
> > What started as a foray into severing the old (forward) threader's
> > dependency on evrp, turned into a rewrite of the backwards threader
> > code.  I'd like to discuss the possibility of replacing the current
> > backwards threader with a new one that gets far more threads and can
> > potentially subsume all threaders in the future.
> >
> > I won't include code here, as it will just detract from the high level
> > discussion.  But if it helps, I could post what I have, which just
> > needs some cleanups and porting to the latest trunk changes Andrew has
> > made.
> >
> > Currently the backwards threader works by traversing DEF chains
> > through PHIs leading to possible paths that start in a constant. When
> > such a path is found, it is checked to see if it is profitable, and if
> > so, the constant path is threaded.  The current implementation is
> > rather limited since backwards paths must end in a constant.  For
> > example, the backwards threader can't get any of the tests in
> > gcc.dg/tree-ssa/ssa-thread-14.c:
> >
> >   if (a && b)
> >     foo ();
> >   if (!b && c)
> >     bar ();
> >
> > etc.
> Right.  And these kinds of cases are particularly interesting to capture
> -- not only do you remove the runtime test/compare, all the setup code
> usually dies as well.  I can't remember who, but someone added some bits
> to detect these cases in DOM a while back and while the number of
> additional jumps threaded wasn't great, the overall impact was much
> better than we initially realized.   Instead of allowign removal of a
> single compare/branch, it typically allowed removal of a chain of
> logicals that fed the conditional.
>
> >
> > After my refactoring patches to the threading code, it is now possible
> > to drop in an alternate implementation that shares the profitability
> > code (is this path profitable?), the jump registry, and the actual
> > jump threading code.  I have leveraged this to write a ranger-based
> > threader that gets every single thread the current code gets, plus
> > 90-130% more.
> Sweet.
>
> >
> > Here are the details from the branch, which should be very similar to
> > trunk.  I'm presenting the branch numbers because they contain
> > Andrew's upcoming relational query which significantly juices up the
> > results.
> Yea, I'm not surprised that the relational query helps significantly
> here.  And I'm not surprised that we can do much better with the
> backwards threader with a rewrite.
>
> Much of the ranger design was with the idea behind using it in the
> backwards jump threader in mind.  Backwards threading is, IMHO, a much
> better way to think about the problem.  THe backwards threader also has
> a much stronger region copier -- so we don't have to live with the
> various limitations of the old jump threading approach.
>
>
>
> >
> > New threader:
> >          ethread:65043    (+3.06%)
> >          dom:32450      (-13.3%)
> >          backwards threader:72482   (+89.6%)
> >          vrp:40532      (-30.7%)
> >   Total threaded:  210507 (+6.70%)
> >
> > This means that the new code gets 89.6% more jump threading
> > opportunities than the code I want to replace.  In doing so, it
> > reduces the amount of DOM threading opportunities by 13.3% and by
> > 30.7% from the VRP jump threader.  The total  improvement across the
> > jump threading opportunities in the compiler is 6.70%.
> This looks good at first glance.  It's worth noting that the backwards
> threader runs before the others, so, yea, as it captures more stuff I
> would expect DOM/VRP to capture fewer things.    It would be interesting
> to know the breakdown of things caught by VRP1/VRP2 and how much of that
> is secondary opportunities that are only appearing because we've done a
> better job earlier.
>
> And just to be clear, I expect that we're going to leave some of those
> secondary opportunities on the table -- we just don't want it to be too
> many :-)  When I last looked at this my sense was wiring the backwards
> threader and ranger together should be enough to subsume VRP1/VRP2 jump
> threading.
>
> >
> > However, these are pessimistic numbers...
> >
> > I have noticed that some of the threading opportunities that DOM and
> > VRP now get are not because they're smarter, but because they're
> > picking up opportunities that the new code exposes.  I experimented
> > with running an iterative threader, and then seeing what VRP and DOM
> > could actually get.  This is too expensive to do in real life, but it
> > at least shows what the effect of the new code is on DOM/VRP's abilities:
> >
> >   Iterative threader:
> >     ethread:65043    (+3.06%)
> >     dom:31170    (-16.7%)
> >         thread:86717    (+127%)
> >         vrp:33851    (-42.2%)
> >   Total threaded:  216781 (+9.90%)
> >
> > This means that the new code not only gets 127% more cases, but it
> > reduces the DOM and VRP opportunities considerably (16.7% and 42.2%
> > respectively).   The end result is that we have the possibility of
> > getting almost 10% more jump threading opportunities in the entire
> > compilation run.
> >
> > (Note that the new code gets even more opportunities, but I'm only
> > reporting the profitable ones that made it all the way through to the
> > threader backend, and actually eliminated a branch.)
> Thanks for clarifying that.  It was one of the questions that first
> popped into my mind.
>
> >
> > The overall compilation hit from this work is currently 1.38% as
> > measured by callgrind.  We should be able to reduce this a bit, plus
> > we could get some of that back if we can replace the DOM and VRP
> > threaders (future work).
> Given how close we were to dropping the VRP threaders before, I would
> support dropping them at the same time.  That gives you a bit more
> compile-time budget.
>
> >
> > My proposed implementation should be able to get any threading
> > opportunity, and will get more as range-ops and ranger improve.
> >
> > I can go into the details if necessary, but the gist of it is that we
> > leverage the import facility in the ranger to only look up paths that
> > have a direct repercussion in the conditional being threaded, thus
> > reducing the search space.  This enhanced path discovery, plus an
> > engine to resolve conditionals based on knowledge from a CFG path, is
> > all that is needed to register new paths.  There is no limit to how
> > far back we look, though in practice, we stop looking once a path is
> > too expensive to continue the search in a given direction.
> Right.  That's one of the great things about Ranger -- it can
> dramatically reduce the search space.
>
> >
> > The solver API is simple:
> >
> > // This class is a thread path solver.  Given a set of BBs indicating
> > // a path through the CFG, range_in_path() will return the range
> > // of an SSA as if the BBs in the path would have been executed in
> > // order.
> > //
> > // Note that the blocks are in reverse order, thus the exit block is
> > path[0].
> >
> > class thread_solver : gori_compute
> > {
> >
> > public:
> >   thread_solver (gimple_ranger &ranger);
> >   virtual ~thread_solver ();
> >   void set_path (const vec<basic_block> *, const bitmap_head *imports);
> >   void range_in_path (irange &, tree name);
> >   void range_in_path (irange &, gimple *);
> > ...
> > };
> >
> > Basically, as we're discovering paths, we ask the solver what the
> > value of the final conditional in a BB is in a given path.  If it
> > resolves, we register the path.
> Exactly.  Given a path, do we know enough to resolve the conditional at
> the end.  If so register the path as a potential jump threading opportunity.
>
> >
> > A follow-up project would be to analyze what DOM/VRP are actually
> > getting that we don't, because in theory with an enhanced ranger, we
> > should be able to get everything they do (minus some float stuff, and
> > some CSE things DOM does).  However, IMO, this is good enough to at
> > least replace the current backwards threading code.
> I bet it's going to be tougher to remove DOM's threader.  It knows how
> to do thinks like resolve memory references using temporary equivalences
> and such.  But I bet it's enough to drop the VRP based threaders.

Yes.  In fact I am wondering if adding threading to the not iterating FRE
would make it possible to drop DOM, replacing it with instances of FRE.

Richard.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-14  6:40   ` Richard Biener
@ 2021-06-15  4:03     ` Jeff Law
  2021-06-15  5:39       ` Aldy Hernandez
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Law @ 2021-06-15  4:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Aldy Hernandez, GCC Mailing List



On 6/14/2021 12:40 AM, Richard Biener wrote:
>
>> I bet it's going to be tougher to remove DOM's threader.  It knows how
>> to do thinks like resolve memory references using temporary equivalences
>> and such.  But I bet it's enough to drop the VRP based threaders.
> Yes.  In fact I am wondering if adding threading to the not iterating FRE
> would make it possible to drop DOM, replacing it with instances of FRE.
I'd think so.  I'd approach as:

1. Remove the VRP threader instances.
2. Drop cprop and redundancy elimination from DOM using FRE instead
3. What's left of DOM is just forward jump threading.  Revamp & 
simplify.  Then either make it a distinct pass or as a sub-pass of FRE.

But one could just as easily look at adding threading to FRE and just 
killing DOM and its jump threading.

Jeff


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-15  4:03     ` Jeff Law
@ 2021-06-15  5:39       ` Aldy Hernandez
  2021-06-15 16:35         ` Jeff Law
  0 siblings, 1 reply; 25+ messages in thread
From: Aldy Hernandez @ 2021-06-15  5:39 UTC (permalink / raw)
  To: Jeff Law, Richard Biener; +Cc: GCC Mailing List, Andrew MacLeod

On 6/15/21 6:03 AM, Jeff Law wrote:
> 
> 
> On 6/14/2021 12:40 AM, Richard Biener wrote:
>>
>>> I bet it's going to be tougher to remove DOM's threader.  It knows how
>>> to do thinks like resolve memory references using temporary equivalences
>>> and such.  But I bet it's enough to drop the VRP based threaders.
>> Yes.  In fact I am wondering if adding threading to the not iterating FRE
>> would make it possible to drop DOM, replacing it with instances of FRE.
> I'd think so.  I'd approach as:
> 
> 1. Remove the VRP threader instances.
> 2. Drop cprop and redundancy elimination from DOM using FRE instead
> 3. What's left of DOM is just forward jump threading.  Revamp & 
> simplify.  Then either make it a distinct pass or as a sub-pass of FRE.
> 
> But one could just as easily look at adding threading to FRE and just 
> killing DOM and its jump threading.

Andrew will hopefully be contributing the relational work this week. 
Once he does so, I will gather more granular stats for VRP1/VRP2 so we 
can assess the situation.

Also, a bunch of tests needed to be tweaked because the new code picks 
up so many threads.  I'm waiting for the relational work to avoid having 
to adjust the tests again.

I would personally prefer to add an item 0 to the above list: replace 
current backwards threader code with the rewrite.  I would hate to 
replace all threaders at once and deal with the fallout.  Perhaps it's 
better to replace the backward threaders, and once that settles move 
onto VRP[12]??

Aldy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: replacing the backwards threader and more
  2021-06-15  5:39       ` Aldy Hernandez
@ 2021-06-15 16:35         ` Jeff Law
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2021-06-15 16:35 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener; +Cc: GCC Mailing List, Andrew MacLeod



On 6/14/2021 11:39 PM, Aldy Hernandez wrote:
>
>
> On 6/15/21 6:03 AM, Jeff Law wrote:
>>
>>
>> On 6/14/2021 12:40 AM, Richard Biener wrote:
>>>
>>>> I bet it's going to be tougher to remove DOM's threader.  It knows how
>>>> to do thinks like resolve memory references using temporary 
>>>> equivalences
>>>> and such.  But I bet it's enough to drop the VRP based threaders.
>>> Yes.  In fact I am wondering if adding threading to the not 
>>> iterating FRE
>>> would make it possible to drop DOM, replacing it with instances of FRE.
>> I'd think so.  I'd approach as:
>>
>> 1. Remove the VRP threader instances.
>> 2. Drop cprop and redundancy elimination from DOM using FRE instead
>> 3. What's left of DOM is just forward jump threading.  Revamp & 
>> simplify.  Then either make it a distinct pass or as a sub-pass of FRE.
>>
>> But one could just as easily look at adding threading to FRE and just 
>> killing DOM and its jump threading.
>
> Andrew will hopefully be contributing the relational work this week. 
> Once he does so, I will gather more granular stats for VRP1/VRP2 so we 
> can assess the situation.
>
> Also, a bunch of tests needed to be tweaked because the new code picks 
> up so many threads.  I'm waiting for the relational work to avoid 
> having to adjust the tests again.
>
> I would personally prefer to add an item 0 to the above list: replace 
> current backwards threader code with the rewrite.  I would hate to 
> replace all threaders at once and deal with the fallout. Perhaps it's 
> better to replace the backward threaders, and once that settles move 
> onto VRP[12]??
Sorry, yes, there's a step #0, replace the backwards threader. Mentally 
I groups that with "Remove VRP threader instsances", but it's a distinct 
step and a prerequisite.

jeff


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-06-30  6:07 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-09 11:48 replacing the backwards threader and more Aldy Hernandez
2021-06-09 12:09 ` Richard Biener
2021-06-09 15:34   ` Aldy Hernandez
2021-06-09 16:18     ` Richard Biener
2021-06-09 19:47     ` Jeff Law
2021-06-09 20:39       ` Aldy Hernandez
2021-06-09 20:43         ` Jeff Law
2021-06-21 14:40   ` Aldy Hernandez
2021-06-24 16:14     ` Jeff Law
2021-06-25  7:54       ` Richard Biener
2021-06-25  7:58         ` Richard Biener
2021-06-25 16:20         ` Aldy Hernandez
2021-06-25 17:19           ` Martin Sebor
2021-06-28  6:17             ` Aldy Hernandez
2021-06-28 22:29               ` Martin Sebor
2021-06-29  0:32                 ` Aldy Hernandez
2021-06-29 22:16                   ` Martin Sebor
2021-06-30  6:07                     ` Aldy Hernandez
2021-06-28  8:22           ` Richard Biener
2021-06-28 12:05             ` Aldy Hernandez
2021-06-13 21:59 ` Jeff Law
2021-06-14  6:40   ` Richard Biener
2021-06-15  4:03     ` Jeff Law
2021-06-15  5:39       ` Aldy Hernandez
2021-06-15 16:35         ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).