From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
Bill Schmidt <wschmidt@linux.ibm.com>,
linkw@gcc.gnu.org, GCC Patches <gcc-patches@gcc.gnu.org>,
Jan Hubicka <hubicka@ucw.cz>, David Edelsohn <dje.gcc@gmail.com>
Subject: Re: [RFC] Don't move cold code out of loop by checking bb count
Date: Thu, 23 Sep 2021 10:16:24 +0800 [thread overview]
Message-ID: <3d4636a3-db03-4ddf-954a-52e06eef5e7c@linux.ibm.com> (raw)
In-Reply-To: <f3daa57b-ac7a-cee6-e23d-756cbd51ab3e@linux.ibm.com>
On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote:
>
>
> On 2021/9/22 17:14, Richard Biener wrote:
>> On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/8/26 19:33, Richard Biener wrote:
>>>> On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luoxhu@linux.ibm.com>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2021/8/6 20:15, Richard Biener wrote:
>>>>>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luoxhu@linux.ibm.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> There was a patch trying to avoid move cold block out of loop:
>>>>>>>
>>>>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>>>>>
>>>>>>> Richard suggested to "never hoist anything from a bb with lower
>>>>>>> execution
>>>>>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>>>>>> before_dom_children".
>>>>>>>
>>>>>>> This patch does this profile count check in both gimple LIM
>>>>>>> move_computations_worker and RTL loop-invariant.c
>>>>>>> find_invariants_bb,
>>>>>>> if the loop bb is colder than loop preheader, don't hoist it out of
>>>>>>> loop.
>>>>>>>
>>>>>>> Also, the profile count in loop split pass should be corrected to
>>>>>>> avoid
>>>>>>> lim2 and lim4 mismatch behavior, currently, the new loop
>>>>>>> preheader generated
>>>>>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt
>>>>>>> pass will
>>>>>>> move statement out of loop unexpectely when lim2 didn't move it.
>>>>>>> This
>>>>>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
>>>>>>>
>>>>>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>>>>>> intrate GEOMEAN and no obvious regression for others. Especially,
>>>>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>>>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>>>>>> on P8LE.
>>>>>>>
>>>>>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks.
>>>>>>
>>>>>> While I'm not familiar with the RTL invariant motion pass the
>>>>>> patch there
>>>>>> looks reasonable. Note that we should assess the profile quality
>>>>>> somehow - I'm not sure how to do that, CCed Honza for that.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> For the GIMPLE part the patch looks quite complicated - but note it
>>>>>> probably has to be since LIM performs kind of a "CSE" on loads
>>>>>> (and stores for store-motion), so when there are multiple stmts
>>>>>> affected by a hoisting decision the biggest block count has to be
>>>>>> accounted. Likewise when there are dependent stmts involved
>>>>>> that might include conditional stmts (a "PHI"), but the overall
>>>>>> cost should be looked at.
>>>>>
>>>>> Currently, The gimple code check two situations with the patch:
>>>>> 1) The statement or PHI‘s BB is *colder* then preheader, don't move
>>>>> it out
>>>>> of loop;
>>>>> 2) The statement or PHI's BB is *hotter* then preheader, but any of
>>>>> it's rhs
>>>>> couldn't be moved out of loop, also don't move it out of loop to
>>>>> avoid definition
>>>>> not dominates use error.
>>>>
>>>> But part 2) is obviously already done. What I tried to say is your
>>>> heuristic
>>>> doesn't integrate nicely with the pass but I admitted that it might
>>>> be a bit
>>>> difficult to find a place to add this heuristic.
>>>>
>>>> There is lim_data->cost which we could bias negatively but then this is
>>>> a cost that is independent on the hoisting distance. But doing this
>>>> would
>>>> work at least for the case where the immediately enclosing loop
>>>> preheader
>>>> is hotter than the stmt and with this it would be a patch that's
>>>> similarly
>>>> simple as the RTL one.
>>>>
>>>> Another possibility is to simply only adjust PHI processing in
>>>> compute_invariantness, capping movement according to the hotness
>>>> heuristic. The same could be done for regular stmts there but I'm
>>>> not sure that will do good in the end since this function is supposed
>>>> to compute "correctness" (well, it also has the cost stuff), and it's
>>>> not the place to do overall cost considerations.
>>>
>>> Thanks. I found that adding a function find_coldest_out_loop and
>>> check it in
>>> outermost_invariant_loop to find the coldest invariant loop between
>>> outermost
>>> loop and itself could also reach the purpose. Then the gimple code
>>> check is
>>> redundant and could be removed.
>>>
>>>>
>>>>> May be I could collect the number of instructions not hoisted with
>>>>> the patch
>>>>> on regression tests and SPEC2017 to do a estimation for "multiple
>>>>> stmts affected"
>>>>> and "overall cost" need to be considered? But it seems
>>>>> move_computations_worker
>>>>> couldn't rollback if we still want to hoist multiple stmts out
>>>>> during the iterations?
>>>>>
>>>>>>
>>>>>> Now - GIMPLE LIM "costing" is somewhat backward right now
>>>>>> and it isn't set up to consider those multiple involved stmts. Plus
>>>>>> the store-motion part does not have any cost part (but it depends
>>>>>> on previously decided invariant motions).
>>>>>>
>>>>>> I think the way you implemented the check will cause no hoisting
>>>>>> to be performed instead of, say, hoisting to a different loop level
>>>>>> only. Possibly shown when you consider a loop nest like
>>>>>>
>>>>>> for (;;)
>>>>>> if (unlikely_cond)
>>>>>> for (;;)
>>>>>> invariant;
>>>>>>
>>>>>> we want to hoist 'invariant' but only from the inner loop even if it
>>>>>> is invariant also in the outer loop.
>>>>>
>>>>>
>>>>> For this case, theorotically I think the master GCC will optimize
>>>>> it to:
>>>>>
>>>>> invariant;
>>>>> for (;;)
>>>>> if (unlikely_cond)
>>>>> for (;;)
>>>>> ;
>>>>>
>>>>> 'invariant' is moved out of outer loop, but with the patch, it will
>>>>> get:
>>>>>
>>>>> for (;;)
>>>>> if (unlikely_cond)
>>>>> {
>>>>> invariant;
>>>>> for (;;)
>>>>> ;
>>>>> }
>>>>>
>>>>> 'invariant' is *cold* for outer loop, but it is still *hot* for
>>>>> inner loop,
>>>>> so hoist it out of inner loop, this is exactly what we want, right?
>>>>
>>>> Yes. I had doubts your patch would achieve that.
>>>>
>>>
>>>
>>> The below updated patch could achieve it:
>>>
>>>
>>> There was a patch trying to avoid move cold block out of loop:
>>>
>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>>
>>> Richard suggested to "never hoist anything from a bb with lower
>>> execution
>>> frequency to a bb with higher one in LIM invariantness_dom_walker
>>> before_dom_children".
>>>
>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
>>> expected target loop, then if profile count of the loop bb is colder
>>> than target loop preheader, it won't be hoisted out of loop.
>>>
>>> SPEC2017 performance evaluation shows 1% performance improvement for
>>> intrate GEOMEAN and no obvious regression for others. Especially,
>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
>>> on P8LE.
>>>
>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks.
>>
>> Can you split the RTL and GIMPLE changes and measure them separately
>> please?
>
> I did that before and got below data, it is slightly different due to
> using ratio instead of seconds, 500.perlbench_r obviously benefits
> from the RTL part change, while gimple part only improves exchange2
> and blender, with a regression on nab which requires the fix of loop
> split,
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576566.html
>
> Reason is lim2 doesn't hoist code out of loop, but loop split generates
> duplicated loop with incorrect profile count on preheader and header bb,
> then later lim4 hoists code out of loop to unexpected place. Same loop
> shows mismatch behavior in lim2 and lim4. With that patch, the regression
> is gone.
>
>
> Gimple+RTL | Gimple lim | RTL loop-invariant
> 500.perlbench_r 8.03% 0.67% 7.69%
> 502.gcc_r 0.56% 0.37% 0.19%
> 505.mcf_r 0.19% -0.19% 0.39%
> 520.omnetpp_r 0.83% 0.83% 0.83%
> 523.xalancbmk_r -0.78% 0.00% -1.04%
> 525.x264_r 0.17% 0.00% 0.00%
> 531.deepsjeng_r 0.00% 0.31% 0.00%
> 541.leela_r 0.00% -0.31% 0.31%
> 548.exchange2_r 2.08% 1.85% 0.23%
> 557.xz_r 0.97% 0.00% 0.65%
> 503.bwaves_r -0.12% 0.00% -0.23%
> 507.cactuBSSN_r 0.00% 0.14% 0.00%
> 508.namd_r 0.00% 0.00% 0.00%
> 510.parest_r -0.16% -0.65% 0.00%
> 511.povray_r 0.30% 0.91% 0.91%
> 519.lbm_r 0.15% 0.00% 0.00%
> 521.wrf_r 0.00% 0.00% -0.80%
> 526.blender_r 1.84% 0.26% 0.52%
> 527.cam4_r 0.28% 0.00% 0.00%
> 538.imagick_r 0.20% 0.00% 0.00%
> 544.nab_r -1.55% -0.78% 0.00%
> 549.fotonik3d_r -0.25% 0.00% 0.00%
> 554.roms_r -0.84% 0.00% -0.63%
> INT GEAMEAN 1.16% 0.35% 0.90%
> FLOAT GEOMEAN -0.01% -0.01% -0.02%
> GEOMEAN 0.50% 0.15% 0.38%
BTW, feedback from other platform:
I do see ~8% performance improvement for 500.perlbench on aarch64.
>
>
> Will address other comments in later reply. Thanks.
>
>
--
Thanks,
Xionghu
next prev parent reply other threads:[~2021-09-23 2:16 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-02 5:05 Xiong Hu Luo
2021-08-06 12:15 ` Richard Biener
2021-08-10 2:03 ` Xionghu Luo
2021-08-10 4:25 ` Ulrich Drepper
2021-08-19 5:51 ` [PATCH v2] " Xionghu Luo
2021-08-26 11:33 ` [RFC] " Richard Biener
2021-09-09 1:55 ` Xionghu Luo
2021-09-22 9:14 ` Richard Biener
2021-09-23 2:13 ` Xionghu Luo
2021-09-23 2:16 ` Xionghu Luo [this message]
2021-09-24 6:29 ` Xionghu Luo
2021-09-28 12:09 ` Richard Biener
2021-10-09 3:44 ` Xionghu Luo
2021-10-15 8:11 ` Richard Biener
2021-10-18 4:29 ` Xionghu Luo
2021-10-19 1:47 ` [PATCH v5 2/2] " Xionghu Luo
2021-10-26 13:20 ` [RFC] " Richard Biener
2021-10-27 2:40 ` Xionghu Luo
2021-10-29 11:48 ` Richard Biener
2021-11-03 6:49 ` Xionghu Luo
2021-11-03 13:29 ` Xionghu Luo
2021-11-04 13:00 ` Richard Biener
2021-11-10 3:08 ` [PATCH v7 2/2] " Xionghu Luo
2021-11-24 5:15 ` Ping: " Xionghu Luo
2021-11-24 7:35 ` Richard Biener
2021-12-01 10:09 ` Richard Biener
2021-12-06 5:09 ` [PATCH v8 " Xionghu Luo
2021-12-06 5:26 ` Xionghu Luo
2021-12-07 12:17 ` Richard Biener
2021-12-08 6:32 ` Xionghu Luo
2021-12-20 7:29 ` Richard Biener
2021-12-21 3:59 ` Xionghu Luo
2021-10-27 12:54 ` [RFC] " Jan Hubicka
2021-10-28 1:49 ` Xionghu Luo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3d4636a3-db03-4ddf-954a-52e06eef5e7c@linux.ibm.com \
--to=luoxhu@linux.ibm.com \
--cc=dje.gcc@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=hubicka@ucw.cz \
--cc=linkw@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
--cc=segher@kernel.crashing.org \
--cc=wschmidt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).