public inbox for gcc@gcc.gnu.org
* Question on tree LIM
@ 2021-07-02  3:33 Kewen.Lin
  2021-07-02  8:07 ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Kewen.Lin @ 2021-07-02  3:33 UTC (permalink / raw)
  To: GCC Development

Hi,

I am investigating a degradation in SPEC2017 exchange2_r: with loop
vectorization enabled at -O2, it degrades by 6%.  Through some
isolation, I found it isn't directly caused by vectorization itself
but is exposed by it: some computations for the vectorization
condition checks are hoisted out of the loops, and they increase the
register pressure, finally resulting in more spilling than before.
If I simply disable tree lim4, the gap becomes smaller (just 40%+ of
the original); if I further disable rtl lim, it drops to about 30% of
the original.  This seems to indicate there is some room for
improvement in both LIMs.

From a quick scan of tree LIM, it seems to have no consideration of
register pressure at all; is that intentional?  I am wondering what
the design philosophy behind it is.  Is it because it's hard to model
register pressure well here?  If so, that seems to put the burden onto
the late RA, which then needs good rematerialization support.

btw, the example loop is at line 1150 from src exchange2.fppized.f90

   1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10

The extra hoisted statements after the vectorization on this loop
(cheap cost model btw) are:

    _686 = (integer(kind=8)) rnext_679;
    _1111 = (sizetype) _19;
    _1112 = _1111 * 12;
    _1927 = _1112 + 12;
  * _1895 = _1927 - _2650;
    _1113 = (unsigned long) rnext_679;
  * niters.6220_1128 = 10 - _1113;
  * _1021 = 9 - _1113;
  * bnd.6221_940 = niters.6220_1128 >> 2;
  * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
    _144 = niters_vector_mult_vf.6222_939 + _1113;
    tmp.6223_934 = (integer(kind=8)) _144;
    S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
  * ivtmp.6410_289 = (unsigned long) S.823_1004;

PS: * marks the statements whose results have long live intervals.
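In case it helps, the arithmetic the starred statements perform can be
sketched in C (illustrative only, with names borrowed from the dump,
not the real GIMPLE; note the mask 18446744073709551612 is just ~3UL,
i.e. rounding down to a multiple of the vector factor 4):

```c
#include <assert.h>

/* Illustrative sketch of the hoisted niters bookkeeping for the
   block(rnext:9, ...) loop above, vector factor 4.  Each output
   corresponds to one starred temporary and stays live until the
   vectorized/scalar loop selection is made. */
static void vec_guards(unsigned long rnext, unsigned long *niters,
                       unsigned long *bnd, unsigned long *mult_vf)
{
    *niters  = 10 - rnext;     /* niters.6220: scalar iterations rnext..9 */
    *bnd     = *niters >> 2;   /* bnd.6221: number of vector iterations   */
    *mult_vf = *niters & ~3UL; /* niters_vector_mult_vf.6222: iterations
                                  covered by the vector loop              */
}
```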

BR,
Kewen

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Question on tree LIM
  2021-07-02  3:33 Question on tree LIM Kewen.Lin
@ 2021-07-02  8:07 ` Richard Biener
  2021-07-02  9:05   ` Kewen.Lin
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2021-07-02  8:07 UTC (permalink / raw)
  To: Kewen.Lin, Andre Vieira (lists); +Cc: GCC Development

On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> I am investigating a degradation in SPEC2017 exchange2_r: with loop
> vectorization enabled at -O2, it degrades by 6%.  Through some
> isolation, I found it isn't directly caused by vectorization itself
> but is exposed by it: some computations for the vectorization
> condition checks are hoisted out of the loops, and they increase the
> register pressure, finally resulting in more spilling than before.
> If I simply disable tree lim4, the gap becomes smaller (just 40%+ of
> the original); if I further disable rtl lim, it drops to about 30% of
> the original.  This seems to indicate there is some room for
> improvement in both LIMs.
>
> From a quick scan of tree LIM, it seems to have no consideration of
> register pressure at all; is that intentional?  I am wondering what
> the design philosophy behind it is.  Is it because it's hard to model
> register pressure well here?  If so, that seems to put the burden onto
> the late RA, which then needs good rematerialization support.

Yes, it is "intentional" in that doing any kind of prioritization based
on register pressure is hard at the GIMPLE level, since most
high-level transforms try to expose followup transforms which you'd
somehow have to anticipate.  Note that LIM's "cost model" (if you can
call it such...) is too simplistic to be a good basis for deciding which
10 of the 20 candidates you want to move (and I've repeatedly pondered
removing it completely).

As to putting the burden on RA - yes, that's one possibility.  The other
possibility is to use the register-pressure aware scheduler, though not
sure if that will ever move things into loop bodies.

> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>
>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>
> The extra hoisted statements after the vectorization on this loop
> (cheap cost model btw) are:
>
>     _686 = (integer(kind=8)) rnext_679;
>     _1111 = (sizetype) _19;
>     _1112 = _1111 * 12;
>     _1927 = _1112 + 12;
>   * _1895 = _1927 - _2650;
>     _1113 = (unsigned long) rnext_679;
>   * niters.6220_1128 = 10 - _1113;
>   * _1021 = 9 - _1113;
>   * bnd.6221_940 = niters.6220_1128 >> 2;
>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>     tmp.6223_934 = (integer(kind=8)) _144;
>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>
> PS: * marks the statements whose results have long live intervals.

Note for the vectorizer generated conditions there's quite some room for
improvements to reduce the amount of semi-redundant computations.  I've
pointed out some to Andre, in particular suggesting to maintain a single
"remaining scalar iterations" IV across all the checks to avoid keeping
'niters' live and doing all the above masking & shifting repeatedly before
the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
he got with that idea.
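(A rough C sketch of that idea, with semantics assumed rather than
taken from any actual patch: a single counter of remaining scalar
iterations is carried through all the loop copies, so none of the
shifted/masked derivations of 'niters' needs to stay live across them:)

```c
#include <assert.h>

/* Assumed sketch of the "remaining scalar iterations" IV: each loop
   copy (main vector loop, then scalar epilogue) retires iterations
   from one shared counter instead of re-deriving its own bound from
   niters with separate masking/shifting. */
static long run_loops(long niters, int vf)
{
    long remaining = niters;  /* the single IV kept live across copies */
    long done = 0;
    while (remaining >= vf) { /* main vectorized loop */
        done += vf;
        remaining -= vf;
    }
    while (remaining > 0) {   /* scalar epilogue */
        done += 1;
        remaining -= 1;
    }
    return done;
}
```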

Richard.

>
> BR,
> Kewen


* Re: Question on tree LIM
  2021-07-02  8:07 ` Richard Biener
@ 2021-07-02  9:05   ` Kewen.Lin
  2021-07-02 11:28     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Kewen.Lin @ 2021-07-02  9:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, Andre Vieira (lists)

Hi Richard,

on 2021/7/2 4:07 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> I am investigating a degradation in SPEC2017 exchange2_r: with loop
>> vectorization enabled at -O2, it degrades by 6%.  Through some
>> isolation, I found it isn't directly caused by vectorization itself
>> but is exposed by it: some computations for the vectorization
>> condition checks are hoisted out of the loops, and they increase the
>> register pressure, finally resulting in more spilling than before.
>> If I simply disable tree lim4, the gap becomes smaller (just 40%+ of
>> the original); if I further disable rtl lim, it drops to about 30% of
>> the original.  This seems to indicate there is some room for
>> improvement in both LIMs.
>>
>> From a quick scan of tree LIM, it seems to have no consideration of
>> register pressure at all; is that intentional?  I am wondering what
>> the design philosophy behind it is.  Is it because it's hard to model
>> register pressure well here?  If so, that seems to put the burden onto
>> the late RA, which then needs good rematerialization support.
> 
> Yes, it is "intentional" in that doing any kind of prioritization based
> on register pressure is hard at the GIMPLE level, since most
> high-level transforms try to expose followup transforms which you'd
> somehow have to anticipate.  Note that LIM's "cost model" (if you can
> call it such...) is too simplistic to be a good basis for deciding which
> 10 of the 20 candidates you want to move (and I've repeatedly pondered
> removing it completely).
> 

Thanks for the explanation!  Do you really want to remove it completely
rather than just improve it with a better one?  :-\

Here are some PRs (PR96825, PR98782) related to exchange2_r which
seem to suffer from high register pressure and bad spilling.  I'm not
sure whether they are also somehow related to the pressure coming
from LIM, but the trigger is commit
1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
frequency; maybe it's worth revisiting this idea of considering
BB frequency in the LIM cost model:
https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

> As to putting the burden on RA - yes, that's one possibility.  The other
> possibility is to use the register-pressure aware scheduler, though not
> sure if that will ever move things into loop bodies.
> 

A brand new idea!  IIUC it requires a global scheduler, and I'm not
sure how well GCC's global scheduler performs.  Generally speaking, a
register-pressure-aware scheduler prefers the insn with more deaths
(for the pressure-intensive regclass); for this problem the modeling
seems a bit different, since it has to care about the total number of
interferences between two "equivalent" blocks (src/dest).  I'm not
sure whether it's easier to do than rematerialization.

>> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>>
>>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>
>> The extra hoisted statements after the vectorization on this loop
>> (cheap cost model btw) are:
>>
>>     _686 = (integer(kind=8)) rnext_679;
>>     _1111 = (sizetype) _19;
>>     _1112 = _1111 * 12;
>>     _1927 = _1112 + 12;
>>   * _1895 = _1927 - _2650;
>>     _1113 = (unsigned long) rnext_679;
>>   * niters.6220_1128 = 10 - _1113;
>>   * _1021 = 9 - _1113;
>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>     tmp.6223_934 = (integer(kind=8)) _144;
>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>
>> PS: * marks the statements whose results have long live intervals.
> 
> Note for the vectorizer generated conditions there's quite some room for
> improvements to reduce the amount of semi-redundant computations.  I've
> pointed out some to Andre, in particular suggesting to maintain a single
> "remaining scalar iterations" IV across all the checks to avoid keeping
> 'niters' live and doing all the above masking & shifting repeatedly before
> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
> he got with that idea.
> 

Great, it definitely helps to mitigate this problem.  Thanks for the information.


BR,
Kewen


* Re: Question on tree LIM
  2021-07-02  9:05   ` Kewen.Lin
@ 2021-07-02 11:28     ` Richard Biener
  2021-07-05  2:29       ` Kewen.Lin
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2021-07-02 11:28 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Development, Andre Vieira (lists)

On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richard,
>
> on 2021/7/2 4:07 PM, Richard Biener wrote:
> > On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> I am investigating a degradation in SPEC2017 exchange2_r: with loop
> >> vectorization enabled at -O2, it degrades by 6%.  Through some
> >> isolation, I found it isn't directly caused by vectorization itself
> >> but is exposed by it: some computations for the vectorization
> >> condition checks are hoisted out of the loops, and they increase the
> >> register pressure, finally resulting in more spilling than before.
> >> If I simply disable tree lim4, the gap becomes smaller (just 40%+ of
> >> the original); if I further disable rtl lim, it drops to about 30% of
> >> the original.  This seems to indicate there is some room for
> >> improvement in both LIMs.
> >>
> >> From a quick scan of tree LIM, it seems to have no consideration of
> >> register pressure at all; is that intentional?  I am wondering what
> >> the design philosophy behind it is.  Is it because it's hard to model
> >> register pressure well here?  If so, that seems to put the burden onto
> >> the late RA, which then needs good rematerialization support.
> >
> > Yes, it is "intentional" in that doing any kind of prioritization based
> > on register pressure is hard at the GIMPLE level, since most
> > high-level transforms try to expose followup transforms which you'd
> > somehow have to anticipate.  Note that LIM's "cost model" (if you can
> > call it such...) is too simplistic to be a good basis for deciding which
> > 10 of the 20 candidates you want to move (and I've repeatedly pondered
> > removing it completely).
> >
>
> Thanks for the explanation!  Do you really want to remove it completely
> rather than just improve it with a better one?  :-\

;)  For example, the LIM cost model makes it not hoist an invariant
(int)x, but then PRE, which detects invariant motion opportunities as
partial redundancies, happily does (because PRE has no cost model at
all - heh).
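(As a concrete shape of that case, in illustrative code rather than
anything from an actual PR: the conversion below is loop-invariant and
cheap enough that LIM's benefit model declines to move it, yet it is a
hoisting candidate all the same:)

```c
#include <assert.h>

/* Illustrative only: (long)x is invariant in the loop.  LIM's benefit
   model deems the conversion too cheap to be worth moving; PRE,
   treating it as a partial redundancy, hoists it anyway. */
static long sum_conv(int x, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)x;  /* invariant conversion, a hoisting candidate */
    return s;
}
```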

> Here are some PRs (PR96825, PR98782) related to exchange2_r which
> seem to suffer from high register pressure and bad spilling.  I'm not
> sure whether they are also somehow related to the pressure coming
> from LIM, but the trigger is commit
> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
> frequency; maybe it's worth revisiting this idea of considering
> BB frequency in the LIM cost model:
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Note most "problems", and those which are harder to undo, stem from
LIM's store-motion, which increases register pressure inside loops by
adding loop-carried dependences.  The BB frequency might be a way
to order candidates when we have a way to set a better cap on the
number of refs to move.  Note the current "cost" model is rather a
benefit model and causes us to not move cheap things (like the above
conversion) because it seems not worth the trouble.
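(A minimal illustration of that effect, as a sketch rather than actual
LIM output: store motion turns a per-iteration store into a scalar
that is live, as a loop-carried PHI, across the whole loop:)

```c
#include <assert.h>

/* Sketch of store motion: the store to *dst is moved out of the loop
   and replaced by the scalar 'tmp'.  'tmp' becomes a loop-carried
   dependence (a PHI), live across every iteration: one fewer store
   per iteration, but one more register held for the loop's lifetime. */
static int accumulate(int *dst, const int *a, int n)
{
    int tmp = *dst;          /* load hoisted before the loop      */
    for (int i = 0; i < n; i++)
        tmp += a[i];         /* loop-carried: tmp feeds itself    */
    *dst = tmp;              /* store sunk after the loop         */
    return tmp;
}
```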

Note a very simple way would be to have a --param specifying a
maximum number of refs to move (but note there are several
LIM/store-motion passes so any such static limit would have
surprising effects).  For store-motion I considered a hard limit on
the number of loop carried dependences (PHIs) and counting both
existing and added ones (to avoid the surprise).

Note how such limits or other cost models should consider inner and
outer loop behavior remains to be determined - at least LIM works
at the level of whole loop nests and there's a rough idea of dependent
transforms but simply gathering candidates and stripping some isn't
going to work without major surgery in that area I think.

> > As to putting the burden on RA - yes, that's one possibility.  The other
> > possibility is to use the register-pressure aware scheduler, though not
> > sure if that will ever move things into loop bodies.
> >
>
> A brand new idea!  IIUC it requires a global scheduler, and I'm not
> sure how well GCC's global scheduler performs.  Generally speaking, a
> register-pressure-aware scheduler prefers the insn with more deaths
> (for the pressure-intensive regclass); for this problem the modeling
> seems a bit different, since it has to care about the total number of
> interferences between two "equivalent" blocks (src/dest).  I'm not
> sure whether it's easier to do than rematerialization.

No idea either but as said above undoing store-motion is harder than
scheduling or RA remat.

> >> btw, the example loop is at line 1150 from src exchange2.fppized.f90
> >>
> >>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
> >>
> >> The extra hoisted statements after the vectorization on this loop
> >> (cheap cost model btw) are:
> >>
> >>     _686 = (integer(kind=8)) rnext_679;
> >>     _1111 = (sizetype) _19;
> >>     _1112 = _1111 * 12;
> >>     _1927 = _1112 + 12;
> >>   * _1895 = _1927 - _2650;
> >>     _1113 = (unsigned long) rnext_679;
> >>   * niters.6220_1128 = 10 - _1113;
> >>   * _1021 = 9 - _1113;
> >>   * bnd.6221_940 = niters.6220_1128 >> 2;
> >>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
> >>     _144 = niters_vector_mult_vf.6222_939 + _1113;
> >>     tmp.6223_934 = (integer(kind=8)) _144;
> >>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
> >>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
> >>
> >> PS: * marks the statements whose results have long live intervals.
> >
> > Note for the vectorizer generated conditions there's quite some room for
> > improvements to reduce the amount of semi-redundant computations.  I've
> > pointed out some to Andre, in particular suggesting to maintain a single
> > "remaining scalar iterations" IV across all the checks to avoid keeping
> > 'niters' live and doing all the above masking & shifting repeatedly before
> > the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
> > he got with that idea.
> >
>
> Great, it definitely helps to mitigate this problem.  Thanks for the information.
>
>
> BR,
> Kewen


* Re: Question on tree LIM
  2021-07-02 11:28     ` Richard Biener
@ 2021-07-05  2:29       ` Kewen.Lin
  0 siblings, 0 replies; 5+ messages in thread
From: Kewen.Lin @ 2021-07-05  2:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, Andre Vieira (lists), Xiong Hu Luo

on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating a degradation in SPEC2017 exchange2_r: with loop
>>>> vectorization enabled at -O2, it degrades by 6%.  Through some
>>>> isolation, I found it isn't directly caused by vectorization itself
>>>> but is exposed by it: some computations for the vectorization
>>>> condition checks are hoisted out of the loops, and they increase the
>>>> register pressure, finally resulting in more spilling than before.
>>>> If I simply disable tree lim4, the gap becomes smaller (just 40%+ of
>>>> the original); if I further disable rtl lim, it drops to about 30% of
>>>> the original.  This seems to indicate there is some room for
>>>> improvement in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, it seems to have no consideration of
>>>> register pressure at all; is that intentional?  I am wondering what
>>>> the design philosophy behind it is.  Is it because it's hard to model
>>>> register pressure well here?  If so, that seems to put the burden onto
>>>> the late RA, which then needs good rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard at the GIMPLE level, since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good basis for deciding which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?  :-\
> 
> ;)  For example, the LIM cost model makes it not hoist an invariant
> (int)x, but then PRE, which detects invariant motion opportunities as
> partial redundancies, happily does (because PRE has no cost model at
> all - heh).
> 

Got it, thanks for further clarification. :)

>> Here are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling.  I'm not
>> sure whether they are also somehow related to the pressure coming
>> from LIM, but the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
>> frequency; maybe it's worth revisiting this idea of considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Note most "problems", and those which are harder to undo, stem from
> LIM's store-motion, which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
> 

Yeah, I noticed it at least excludes "cheap" ones.

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
> 
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
> 

Thanks for all the notes and thoughts.  I had better visit RA remat
first; Xionghu has some interest in investigating how to consider BB
frequency in LIM.  I will check its effect and then follow up on
these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> A brand new idea!  IIUC it requires a global scheduler, and I'm not
>> sure how well GCC's global scheduler performs.  Generally speaking, a
>> register-pressure-aware scheduler prefers the insn with more deaths
>> (for the pressure-intensive regclass); for this problem the modeling
>> seems a bit different, since it has to care about the total number of
>> interferences between two "equivalent" blocks (src/dest).  I'm not
>> sure whether it's easier to do than rematerialization.
> 
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
> 
>>>> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>>>>
>>>>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>>>
>>>> The extra hoisted statements after the vectorization on this loop
>>>> (cheap cost model btw) are:
>>>>
>>>>     _686 = (integer(kind=8)) rnext_679;
>>>>     _1111 = (sizetype) _19;
>>>>     _1112 = _1111 * 12;
>>>>     _1927 = _1112 + 12;
>>>>   * _1895 = _1927 - _2650;
>>>>     _1113 = (unsigned long) rnext_679;
>>>>   * niters.6220_1128 = 10 - _1113;
>>>>   * _1021 = 9 - _1113;
>>>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>>>     tmp.6223_934 = (integer(kind=8)) _144;
>>>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>>>
>>>> PS: * marks the statements whose results have long live intervals.
>>>
>>> Note for the vectorizer generated conditions there's quite some room for
>>> improvements to reduce the amount of semi-redundant computations.  I've
>>> pointed out some to Andre, in particular suggesting to maintain a single
>>> "remaining scalar iterations" IV across all the checks to avoid keeping
>>> 'niters' live and doing all the above masking & shifting repeatedly before
>>> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
>>> he got with that idea.
>>>
>>
>> Great, it definitely helps to mitigate this problem.  Thanks for the information.
>>
>>
>> BR,
>> Kewen



end of thread, other threads:[~2021-07-05  2:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-02  3:33 Question on tree LIM Kewen.Lin
2021-07-02  8:07 ` Richard Biener
2021-07-02  9:05   ` Kewen.Lin
2021-07-02 11:28     ` Richard Biener
2021-07-05  2:29       ` Kewen.Lin
