Re: [RFC] Don't move cold code out of loop by checking bb count

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	linkw@gcc.gnu.org, GCC Patches <gcc-patches@gcc.gnu.org>,
	Jan Hubicka <hubicka@ucw.cz>, David Edelsohn <dje.gcc@gmail.com>
Subject: Re: [RFC] Don't move cold code out of loop by checking bb count
Date: Thu, 23 Sep 2021 10:16:24 +0800	[thread overview]
Message-ID: <3d4636a3-db03-4ddf-954a-52e06eef5e7c@linux.ibm.com> (raw)
In-Reply-To: <f3daa57b-ac7a-cee6-e23d-756cbd51ab3e@linux.ibm.com>



On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/9/22 17:14, Richard Biener wrote:
>> On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/8/26 19:33, Richard Biener wrote:
>>>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com> 
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2021/8/6 20:15, Richard Biener wrote:
>>>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>>
>>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>>
>>>>>>> Richard suggested to "never hoist anything from a bb with lower 
>>>>>>> execution
>>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>>> before_dom_children".
>>>>>>>
>>>>>>> This patch does this profile count check in both gimple LIM
>>>>>>> move_computations_worker and RTL loop-invariant.c 
>>>>>>> find_invariants_bb,
>>>>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>>>>> loop.
>>>>>>>
>>>>>>> Also, the profile count in loop split pass should be corrected to 
>>>>>>> avoid
>>>>>>> lim2 and lim4 mismatch behavior, currently, the new loop 
>>>>>>> preheader generated
>>>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt 
>>>>>>> pass will
>>>>>>> move statement out of loop unexpectely when lim2 didn't move it.  
>>>>>>> This
>>>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>>>>
>>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>>> on P8LE.
>>>>>>>
>>>>>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>>>>>
>>>>>> While I'm not familiar with the RTL invariant motion pass the 
>>>>>> patch there
>>>>>> looks reasonable.  Note that we should assess the profile quality
>>>>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> For the GIMPLE part the patch looks quite complicated - but note it
>>>>>> probably has to be since LIM performs kind of a "CSE" on loads
>>>>>> (and stores for store-motion), so when there are multiple stmts
>>>>>> affected by a hoisting decision the biggest block count has to be
>>>>>> accounted.  Likewise when there are dependent stmts involved
>>>>>> that might include conditional stmts (a "PHI"), but the overall
>>>>>> cost should be looked at.
>>>>>
>>>>> Currently, The gimple code check two situations with the patch:
>>>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move 
>>>>> it out
>>>>> of loop;
>>>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of 
>>>>> it's rhs
>>>>> couldn't be moved out of loop, also don't move it out of loop to 
>>>>> avoid definition
>>>>> not dominates use error.
>>>>
>>>> But part 2) is obviously already done.  What I tried to say is your 
>>>> heuristic
>>>> doesn't integrate nicely with the pass but I admitted that it might 
>>>> be a bit
>>>> difficult to find a place to add this heuristic.
>>>>
>>>> There is lim_data->cost which we could bias negatively but then this is
>>>> a cost that is independent on the hoisting distance.  But doing this 
>>>> would
>>>> work at least for the case where the immediately enclosing loop 
>>>> preheader
>>>> is hotter than the stmt and with this it would be a patch that's 
>>>> similarly
>>>> simple as the RTL one.
>>>>
>>>> Another possibility is to simply only adjust PHI processing in
>>>> compute_invariantness, capping movement according to the hotness
>>>> heuristic.  The same could be done for regular stmts there but I'm
>>>> not sure that will do good in the end since this function is supposed
>>>> to compute "correctness" (well, it also has the cost stuff), and it's
>>>> not the place to do overall cost considerations.
>>>
>>> Thanks.  I found that adding a function find_coldest_out_loop and 
>>> check it in
>>> outermost_invariant_loop to find the coldest invariant loop between 
>>> outermost
>>> loop and itself could also reach the purpose.  Then the gimple code 
>>> check is
>>> redundant and could be removed.
>>>
>>>>
>>>>> May be I could collect the number of instructions not hoisted with 
>>>>> the patch
>>>>> on regression tests and SPEC2017 to do a estimation for "multiple 
>>>>> stmts affected"
>>>>> and "overall cost" need to be considered?  But it seems 
>>>>> move_computations_worker
>>>>> couldn't rollback if we still want to hoist multiple stmts out 
>>>>> during the iterations?
>>>>>
>>>>>>
>>>>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>>>>> and it isn't set up to consider those multiple involved stmts.  Plus
>>>>>> the store-motion part does not have any cost part (but it depends
>>>>>> on previously decided invariant motions).
>>>>>>
>>>>>> I think the way you implemented the check will cause no hoisting
>>>>>> to be performed instead of, say, hoisting to a different loop level
>>>>>> only.  Possibly shown when you consider a loop nest like
>>>>>>
>>>>>>      for (;;)
>>>>>>        if (unlikely_cond)
>>>>>>          for (;;)
>>>>>>             invariant;
>>>>>>
>>>>>> we want to hoist 'invariant' but only from the inner loop even if it
>>>>>> is invariant also in the outer loop.
>>>>>
>>>>>
>>>>> For this case, theorotically I think the master GCC will optimize 
>>>>> it to:
>>>>>
>>>>>     invariant;
>>>>>     for (;;)
>>>>>       if (unlikely_cond)
>>>>>         for (;;)
>>>>>            ;
>>>>>
>>>>> 'invariant' is moved out of outer loop, but with the patch, it will 
>>>>> get:
>>>>>
>>>>>     for (;;)
>>>>>       if (unlikely_cond)
>>>>>         {
>>>>>           invariant;
>>>>>           for (;;)
>>>>>              ;
>>>>>         }
>>>>>
>>>>> 'invariant' is *cold* for outer loop, but it is still *hot* for 
>>>>> inner loop,
>>>>> so hoist it out of inner loop, this is exactly what we want, right?
>>>>
>>>> Yes.  I had doubts your patch would achieve that.
>>>>
>>>
>>>
>>> The below updated patch could achieve it:
>>>
>>>
>>> There was a patch trying to avoid move cold block out of loop:
>>>
>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>
>>> Richard suggested to "never hoist anything from a bb with lower 
>>> execution
>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>> before_dom_children".
>>>
>>> In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
>>> expected target loop, then  if profile count of the loop bb is colder
>>> than target loop preheader, it won't be hoisted out of loop.
>>>
>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>> intrate GEOMEAN and no obvious regression for others.  Especially,
>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>> on P8LE.
>>>
>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
>>
>> Can you split the RTL and GIMPLE changes and measure them separately
>> please?
> 
> I did that before and got below data, it is slightly different due to
> using ratio instead of seconds, 500.perlbench_r obviously benefits
> from the RTL part change, while gimple part only improves exchange2
> and blender, with a regression on nab which requires the fix of loop
> split,
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html
> 
> Reason is lim2 doesn't hoist code out of loop, but loop split generates
> duplicated loop with incorrect profile count on preheader and header bb,
> then later lim4 hoists code out of loop to unexpected place.  Same loop
> shows mismatch behavior in lim2 and lim4.  With that patch, the regression
> is gone.
> 
> 
>                Gimple+RTL |  Gimple lim | RTL loop-invariant
> 500.perlbench_r     8.03%    0.67%    7.69%
> 502.gcc_r           0.56%    0.37%    0.19%
> 505.mcf_r           0.19%   -0.19%    0.39%
> 520.omnetpp_r       0.83%    0.83%    0.83%
> 523.xalancbmk_r    -0.78%    0.00%   -1.04%
> 525.x264_r          0.17%    0.00%    0.00%
> 531.deepsjeng_r     0.00%    0.31%    0.00%
> 541.leela_r         0.00%   -0.31%    0.31%
> 548.exchange2_r     2.08%    1.85%    0.23%
> 557.xz_r            0.97%    0.00%    0.65%
> 503.bwaves_r       -0.12%    0.00%   -0.23%
> 507.cactuBSSN_r     0.00%    0.14%    0.00%
> 508.namd_r          0.00%    0.00%    0.00%
> 510.parest_r       -0.16%   -0.65%    0.00%
> 511.povray_r        0.30%    0.91%    0.91%
> 519.lbm_r           0.15%    0.00%    0.00%
> 521.wrf_r           0.00%    0.00%   -0.80%
> 526.blender_r       1.84%    0.26%    0.52%
> 527.cam4_r          0.28%    0.00%    0.00%
> 538.imagick_r       0.20%    0.00%    0.00%
> 544.nab_r          -1.55%   -0.78%    0.00%
> 549.fotonik3d_r    -0.25%    0.00%    0.00%
> 554.roms_r         -0.84%    0.00%   -0.63%
> INT GEAMEAN         1.16%    0.35%    0.90%
> FLOAT GEOMEAN      -0.01%   -0.01%   -0.02%
> GEOMEAN             0.50%    0.15%    0.38%


BTW, feedback from other platform:


      I do see ~8% performance improvement for 500.perlbench on aarch64.


> 
> 
> Will address other comments in later reply.  Thanks.
> 
> 

-- 
Thanks,
Xionghu

next prev parent reply	other threads:[~2021-09-23  2:16 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-02  5:05 Xiong Hu Luo
2021-08-06 12:15 ` Richard Biener
2021-08-10  2:03   ` Xionghu Luo
2021-08-10  4:25     ` Ulrich Drepper
2021-08-19  5:51       ` [PATCH v2] " Xionghu Luo
2021-08-26 11:33     ` [RFC] " Richard Biener
2021-09-09  1:55       ` Xionghu Luo
2021-09-22  9:14         ` Richard Biener
2021-09-23  2:13           ` Xionghu Luo
2021-09-23  2:16             ` Xionghu Luo [this message]
2021-09-24  6:29           ` Xionghu Luo
2021-09-28 12:09             ` Richard Biener
2021-10-09  3:44               ` Xionghu Luo
2021-10-15  8:11                 ` Richard Biener
2021-10-18  4:29                   ` Xionghu Luo
2021-10-19  1:47                     ` [PATCH v5 2/2] " Xionghu Luo
2021-10-26 13:20                     ` [RFC] " Richard Biener
2021-10-27  2:40                       ` Xionghu Luo
2021-10-29 11:48                         ` Richard Biener
2021-11-03  6:49                           ` Xionghu Luo
2021-11-03 13:29                           ` Xionghu Luo
2021-11-04 13:00                             ` Richard Biener
2021-11-10  3:08                               ` [PATCH v7 2/2] " Xionghu Luo
2021-11-24  5:15                                 ` Ping: " Xionghu Luo
2021-11-24  7:35                                   ` Richard Biener
2021-12-01 10:09                                 ` Richard Biener
2021-12-06  5:09                                   ` [PATCH v8 " Xionghu Luo
2021-12-06  5:26                                     ` Xionghu Luo
2021-12-07 12:17                                       ` Richard Biener
2021-12-08  6:32                                         ` Xionghu Luo
2021-12-20  7:29                                           ` Richard Biener
2021-12-21  3:59                                             ` Xionghu Luo
2021-10-27 12:54                 ` [RFC] " Jan Hubicka
2021-10-28  1:49                   ` Xionghu Luo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3d4636a3-db03-4ddf-954a-52e06eef5e7c@linux.ibm.com \
    --to=luoxhu@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hubicka@ucw.cz \
    --cc=linkw@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).