public inbox for gcc@gcc.gnu.org
* [RFC] avoid type conversion through versioning loop
@ 2021-03-22  6:39 guojiufu
  2021-03-22  8:22 ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-22  6:39 UTC (permalink / raw)
  To: gcc, wschmidt, segher, rguenther, law

Hi All,

As we know, type conversions are unfriendly to optimization and may
cause performance issues.

For example:
for (unsigned int i = 0; i < n; ++i)
   a[m + i] = 1;  //or a[30 + i] = xxxx

In this code, the index used to access the array is 'unsigned int' (32-bit),
while the register operations (like the index increment i++) are done in
wider 64-bit registers on 64-bit machines, so conversions between 32 bits
and 64 bits happen.

19: L19:
10: NOTE_INSN_BASIC_BLOCK 4
11: %9:DI=%5:DI<<0x2
17: %5:SI=%5:SI+0x1
18: %5:DI=zero_extend(%5:SI)  #clear high 32bits
14: [%3:DI+%9:DI]=%10:SI
40: {pc={(ctr:DI!=0x1)?L19:pc};ctr:DI=ctr:DI-0x1;clobber scratch;clobber 
scratch;}

In some cases, the narrower integer type would not overflow, and then
the type conversion could be eliminated.
So, I'm thinking of an optimization to avoid those conversions.
The idea looks like the current loop versioning. Using the above example:

for (unsigned int i = 0; i < n; ++i)
    a[30 + i] = 1;

change to:

if (n <= UINT_MAX - 30)  // ++i does not overflow, and 30 + i stays <= UINT_MAX
   for (long l_i = 0; l_i < l_n; ++l_i)
     a[30 + l_i] = 1;
else
   for (unsigned int i = 0; i < n; ++i)
     a[30 + i] = 1;

With this kind of transformation, the loop becomes friendlier to SCEV
analysis, and other optimizations, such as vectorization, could benefit.
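For illustration, a self-contained sketch of the proposed versioning might
look as below (the function name and parameter types are just assumptions
for the example):

#include <limits.h>

/* Versioned form: the guarded copy uses a 64-bit induction variable so no
   32/64-bit conversion is needed inside the loop body; the original loop
   is kept as the fallback.  */
void
set_ones (int *a, unsigned int n)
{
  if (n <= UINT_MAX - 30)       /* 30 + i cannot wrap for any i < n */
    {
      long l_n = n;             /* widened copy of the bound */
      for (long l_i = 0; l_i < l_n; ++l_i)
        a[30 + l_i] = 1;        /* all index arithmetic stays 64-bit */
    }
  else
    {
      for (unsigned int i = 0; i < n; ++i)
        a[30 + i] = 1;          /* original loop */
    }
}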

How about this idea? Thanks for any comments!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-22  6:39 [RFC] avoid type conversion through versioning loop guojiufu
@ 2021-03-22  8:22 ` Richard Biener
  2021-03-22  8:31   ` Jakub Jelinek
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-22  8:22 UTC (permalink / raw)
  To: guojiufu
  Cc: GCC Development, Bill Schmidt, Segher Boessenkool,
	Richard Guenther, Jeff Law

On Mon, Mar 22, 2021 at 7:41 AM guojiufu via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi All,
>
> As we know, type conversion is not in favor of optimization, and may
> cause performance issue.
>
> For example:
> for (unsigned int i = 0; i < n; ++i)
>    a[m + i] = 1;  //or a[30 + i] = xxxx
>
> In this code, the index to access arrays is 'unsinged int' (32bit),
> while the register operations (like index increase i++) would on longer
> bits
> on 64bit machines, and then conversion between 32bits and 64bits and
> happen.
>
> 19: L19:
> 10: NOTE_INSN_BASIC_BLOCK 4
> 11: %9:DI=%5:DI<<0x2
> 17: %5:SI=%5:SI+0x1
> 18: %5:DI=zero_extend(%5:SI)  #clear high 32bits
> 14: [%3:DI+%9:DI]=%10:SI
> 40: {pc={(ctr:DI!=0x1)?L19:pc};ctr:DI=ctr:DI-0x1;clobber scratch;clobber
> scratch;}
>
> In some cases, the shorter integer type would not overflow,
> and then the type conversion could be eliminated in some cases.
> So, I'm thinking of an optimization to avoid those conversions.
> The idea looks like current loop-versioning. Using the above example:
>
> for (unsigned int i = 0; i < n; ++i)
>     a[30 + i] = 1;
>
> change to:
>
> if (n <= UINT_MAX)  // ++i does not cause overflow, n + 30 <= UINT_MAX
>    for (long l_i = 0; l_i < l_n; ++l_i)
>      a[30 + l_i] = 1;
> else
>    for (unsigned int i = 0; i < n; ++i)
>      a[30 + i] = 1;
>
> With this kind of transformation, it would be in favor of scev, and
> some other optimizations could be beneficial: like vectorization.
>
> How about this idea? Thanks for any comments!

Better than doing loop versioning is to enhance SCEV (and thus also
dependence analysis) to track the extra conditions they need to handle
such cases, similar to how niter analysis computes its 'assumptions'
condition.  That allows the versioning to be done when there's an
actual beneficial transform (like vectorization) rather than just
upfront for the eventual chance that there'll be any.  Ideally such
transform would then choose IVs in their transformed copy that
are analyzable w/o repeating such versioning exercise for the next
transform.
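
For reference, one concrete shape of such an 'assumptions' condition, on a
made-up loop (illustration only, not taken from the niter code):

/* With a '!=' exit condition and a step of 4, the iteration count n / 4
   is only valid under an assumption like "n is a multiple of 4"; if that
   does not hold, the IV steps past n, wraps, and the loop does not
   terminate.  The idea above is for SCEV/dependence analysis to record
   analogous conditions instead of giving up, and to emit the run-time
   check only when a transform such as vectorization actually wants to
   use the result.  */
unsigned
count_steps (unsigned n)
{
  unsigned c = 0;
  for (unsigned i = 0; i != n; i += 4)
    c++;
  return c;
}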

Richard.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-22  8:22 ` Richard Biener
@ 2021-03-22  8:31   ` Jakub Jelinek
  2021-03-23  3:33     ` guojiufu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Jelinek @ 2021-03-22  8:31 UTC (permalink / raw)
  To: Richard Biener
  Cc: guojiufu, GCC Development, Richard Guenther, Segher Boessenkool,
	Jeff Law

On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
> Better than doing loop versioning is to enhance SCEV (and thus also
> dependence analysis) to track extra conditions they need to handle
> cases similar as to how niter analysis computes it's 'assumptions'
> condition.  That allows the versioning to be done when there's an
> actual beneficial transform (like vectorization) rather than just
> upfront for the eventual chance that there'll be any.  Ideally such
> transform would then choose IVs in their transformed copy that
> are analyzable w/o repeating such versioning exercise for the next
> transform.

And it might be beneficial to perform some type promotion/demotion
pass, either early during vectorization or separately before vectorization
on a loop copy guarded with the ifns e.g. ifconv uses too.
Find out what type sizes the loop uses, first try to demote computations
to narrower types in the vectorized loop candidate (e.g. if something
is computed in a wider type only to have the result demoted to narrower
type), then pick up the widest type size still in use in the loop (ok,
this assumes we don't mix multiple vector sizes in the loop, but currently
our vectorizer doesn't do that) and try to promote computations that could
be promoted to that type size.  We already do something like that partially
during vect patterns for bool types, but not for other types, I think.
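
As a made-up example of the demotion case described above (a computation
done in a wider type only to have the result demoted):

/* The additions are performed in 'int' because of integer promotion, but
   each result is immediately demoted back to 'unsigned char'; a demotion
   pass could narrow the whole computation so the vectorized loop works on
   char-sized lanes throughout.  */
void
add_bytes (unsigned char *dst, const unsigned char *a,
	   const unsigned char *b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char) (a[i] + b[i]);
}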

	Jakub


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-22  8:31   ` Jakub Jelinek
@ 2021-03-23  3:33     ` guojiufu
  2021-03-23  8:25       ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-23  3:33 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Development, Richard Guenther,
	Segher Boessenkool, jeffreyalaw

On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
>> Better than doing loop versioning is to enhance SCEV (and thus also
>> dependence analysis) to track extra conditions they need to handle
>> cases similar as to how niter analysis computes it's 'assumptions'
>> condition.  That allows the versioning to be done when there's an
>> actual beneficial transform (like vectorization) rather than just
>> upfront for the eventual chance that there'll be any.  Ideally such
>> transform would then choose IVs in their transformed copy that
>> are analyzable w/o repeating such versioning exercise for the next
>> transform.
> 
> And it might be beneficial to perform some type promotion/demotion
> pass, either early during vectorization or separately before 
> vectorization
> on a loop copy guarded with the ifns e.g. ifconv uses too.
> Find out what type sizes the loop use, first try to demote computations
> to narrower types in the vectorized loop candidate (e.g. if something
> is computed in a wider type only to have the result demoted to narrower
> type), then pick up the widest type size still in use in the loop (ok,
> this assumes we don't mix multiple vector sizes in the loop, but 
> currently
> our vectorizer doesn't do that) and try to promote computations that 
> could
> be promoted to that type size.  We do partially something like that 
> during
> vect patterns for bool types, but not other types I think.
> 
> 	Jakub

Thanks for the suggestions!

Enhancing SCEV could help other optimizations and improve performance in
some cases.
Still, one of the direct goals of using a '64-bit type' is to eliminate
conversions, even for cases which are not easy to optimize through
ifconv/vectorization, for example (a sketch of the type-versioned copy
follows the examples):

   unsigned int i = 0;
   while (a[i]>1e-3)
     i++;

   unsigned int i = 0;
   while (p1[i] == p2[i] && p1[i] != '\0')
     i++;
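
For the first of these loops, the kind of versioning on the IV type I have
in mind would look roughly like below (assuming the loop also has a
checkable upper bound n; the 'long n' signature is only for the example):

#include <limits.h>

unsigned
first_match (const double *a, long n)
{
  unsigned i = 0;
  if (n <= (long) UINT_MAX)     /* i cannot wrap: use a 64-bit IV copy */
    {
      long l_i = 0;
      while (a[l_i] > 1e-3 && l_i < n)
        l_i++;
      i = (unsigned) l_i;
    }
  else                          /* fall back to the original form */
    {
      while (a[i] > 1e-3 && i < n)
        i++;
    }
  return i;
}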

Or should we only do versioning on the IV type for this kind of loop? Any suggestions?

BR.
Jiufu Guo.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-23  3:33     ` guojiufu
@ 2021-03-23  8:25       ` Richard Biener
  2021-03-24  2:55         ` guojiufu
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-23  8:25 UTC (permalink / raw)
  To: guojiufu
  Cc: Jakub Jelinek, GCC Development, Richard Guenther,
	Segher Boessenkool, Jeff Law

On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com> wrote:
>
> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
> >> Better than doing loop versioning is to enhance SCEV (and thus also
> >> dependence analysis) to track extra conditions they need to handle
> >> cases similar as to how niter analysis computes it's 'assumptions'
> >> condition.  That allows the versioning to be done when there's an
> >> actual beneficial transform (like vectorization) rather than just
> >> upfront for the eventual chance that there'll be any.  Ideally such
> >> transform would then choose IVs in their transformed copy that
> >> are analyzable w/o repeating such versioning exercise for the next
> >> transform.
> >
> > And it might be beneficial to perform some type promotion/demotion
> > pass, either early during vectorization or separately before
> > vectorization
> > on a loop copy guarded with the ifns e.g. ifconv uses too.
> > Find out what type sizes the loop use, first try to demote computations
> > to narrower types in the vectorized loop candidate (e.g. if something
> > is computed in a wider type only to have the result demoted to narrower
> > type), then pick up the widest type size still in use in the loop (ok,
> > this assumes we don't mix multiple vector sizes in the loop, but
> > currently
> > our vectorizer doesn't do that) and try to promote computations that
> > could
> > be promoted to that type size.  We do partially something like that
> > during
> > vect patterns for bool types, but not other types I think.
> >
> >       Jakub
>
> Thanks for the suggestions!
>
> Enhancing SCEV could help other optimizations and improve performance in
> some cases.
> While one of the direct ideas of using the '64bit type' is to eliminate
> conversions,
> even for some cases which are not easy to be optimized through
> ifconv/vectorization,
> for examples:
>
>    unsigned int i = 0;
>    while (a[i]>1e-3)
>      i++;
>
>    unsigned int i = 0;
>    while (p1[i] == p2[i] && p1[i] != '\0')
>      i++;
>
> Or only do versioning on type for this kind of loop? Any suggestions?

But the "optimization" resulting from such versioning is hard to
determine upfront which means we'll pay quite a big code size cost
for unknown questionable gain.  What's the particular optimization
in the above cases?  Note that for example for

    unsigned int i = 0;
    while (a[i]>1e-3)
       i++;

you know that when 'i' wraps then the loop will not terminate.  There's
the address computation that is i * sizeof (T) which is done in a larger
type to avoid overflow so we have &a + zext (i) * 8 - is that the operation
that is 'slow' for you?

Richard.

> BR.
> Jiufu Guo.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-23  8:25       ` Richard Biener
@ 2021-03-24  2:55         ` guojiufu
  2021-03-24  7:55           ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-24  2:55 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, GCC Development, Richard Guenther, Segher Boessenkool

On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com> 
> wrote:
>> 
>> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
>> >> Better than doing loop versioning is to enhance SCEV (and thus also
>> >> dependence analysis) to track extra conditions they need to handle
>> >> cases similar as to how niter analysis computes it's 'assumptions'
>> >> condition.  That allows the versioning to be done when there's an
>> >> actual beneficial transform (like vectorization) rather than just
>> >> upfront for the eventual chance that there'll be any.  Ideally such
>> >> transform would then choose IVs in their transformed copy that
>> >> are analyzable w/o repeating such versioning exercise for the next
>> >> transform.
>> >
>> > And it might be beneficial to perform some type promotion/demotion
>> > pass, either early during vectorization or separately before
>> > vectorization
>> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> > Find out what type sizes the loop use, first try to demote computations
>> > to narrower types in the vectorized loop candidate (e.g. if something
>> > is computed in a wider type only to have the result demoted to narrower
>> > type), then pick up the widest type size still in use in the loop (ok,
>> > this assumes we don't mix multiple vector sizes in the loop, but
>> > currently
>> > our vectorizer doesn't do that) and try to promote computations that
>> > could
>> > be promoted to that type size.  We do partially something like that
>> > during
>> > vect patterns for bool types, but not other types I think.
>> >
>> >       Jakub
>> 
>> Thanks for the suggestions!
>> 
>> Enhancing SCEV could help other optimizations and improve performance 
>> in
>> some cases.
>> While one of the direct ideas of using the '64bit type' is to 
>> eliminate
>> conversions,
>> even for some cases which are not easy to be optimized through
>> ifconv/vectorization,
>> for examples:
>> 
>>    unsigned int i = 0;
>>    while (a[i]>1e-3)
>>      i++;
>> 
>>    unsigned int i = 0;
>>    while (p1[i] == p2[i] && p1[i] != '\0')
>>      i++;
>> 
>> Or only do versioning on type for this kind of loop? Any suggestions?
> 
> But the "optimization" resulting from such versioning is hard to
> determine upfront which means we'll pay quite a big code size cost
> for unknown questionable gain.  What's the particular optimization

Right.  The code size increase is a big pain for large loops. If the gain
is not significant, this optimization may not be a win.

> in the above cases?  Note that for example for
> 
>     unsigned int i = 0;
>     while (a[i]>1e-3)
>        i++;
> 
> you know that when 'i' wraps then the loop will not terminate.  There's

Thanks :) The code would be "while (a[i]>1e-3 && i < n)", where the upper
bound is checkable.  Otherwise, the optimization to avoid the zext is not
applicable.

> the address computation that is i * sizeof (T) which is done in a 
> larger
> type to avoid overflow so we have &a + zext (i) * 8 - is that the 
> operation
> that is 'slow' for you?

This is the point: "zext(i)" is the instruction that I want to eliminate;
that is the direct goal of the optimization.

Whether the gain of eliminating the 'zext' is visible, and whether the
code size increase is small enough, is a trade-off question.
It may only be acceptable if the loop is very small: then eliminating the
'zext' would help to save runtime, and the code size increase may not be
big.

Thanks again for your very helpful comments!

BR.
Jiufu Guo.

> 
> Richard.
> 
>> BR.
>> Jiufu Guo.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-24  2:55         ` guojiufu
@ 2021-03-24  7:55           ` Richard Biener
  2021-03-24 12:12             ` guojiufu
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-24  7:55 UTC (permalink / raw)
  To: guojiufu
  Cc: Jakub Jelinek, GCC Development, Richard Guenther, Segher Boessenkool

On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>
> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
> > wrote:
> >>
> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
> >> >> dependence analysis) to track extra conditions they need to handle
> >> >> cases similar as to how niter analysis computes it's 'assumptions'
> >> >> condition.  That allows the versioning to be done when there's an
> >> >> actual beneficial transform (like vectorization) rather than just
> >> >> upfront for the eventual chance that there'll be any.  Ideally such
> >> >> transform would then choose IVs in their transformed copy that
> >> >> are analyzable w/o repeating such versioning exercise for the next
> >> >> transform.
> >> >
> >> > And it might be beneficial to perform some type promotion/demotion
> >> > pass, either early during vectorization or separately before
> >> > vectorization
> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
> >> > Find out what type sizes the loop use, first try to demote computations
> >> > to narrower types in the vectorized loop candidate (e.g. if something
> >> > is computed in a wider type only to have the result demoted to narrower
> >> > type), then pick up the widest type size still in use in the loop (ok,
> >> > this assumes we don't mix multiple vector sizes in the loop, but
> >> > currently
> >> > our vectorizer doesn't do that) and try to promote computations that
> >> > could
> >> > be promoted to that type size.  We do partially something like that
> >> > during
> >> > vect patterns for bool types, but not other types I think.
> >> >
> >> >       Jakub
> >>
> >> Thanks for the suggestions!
> >>
> >> Enhancing SCEV could help other optimizations and improve performance
> >> in
> >> some cases.
> >> While one of the direct ideas of using the '64bit type' is to
> >> eliminate
> >> conversions,
> >> even for some cases which are not easy to be optimized through
> >> ifconv/vectorization,
> >> for examples:
> >>
> >>    unsigned int i = 0;
> >>    while (a[i]>1e-3)
> >>      i++;
> >>
> >>    unsigned int i = 0;
> >>    while (p1[i] == p2[i] && p1[i] != '\0')
> >>      i++;
> >>
> >> Or only do versioning on type for this kind of loop? Any suggestions?
> >
> > But the "optimization" resulting from such versioning is hard to
> > determine upfront which means we'll pay quite a big code size cost
> > for unknown questionable gain.  What's the particular optimization
>
> Right.  Code size increasing is a big pain on large loops. If the gain
> is not significant, this optimization may not positive.
>
> > in the above cases?  Note that for example for
> >
> >     unsigned int i = 0;
> >     while (a[i]>1e-3)
> >        i++;
> >
> > you know that when 'i' wraps then the loop will not terminate.  There's
>
> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
>
> > the address computation that is i * sizeof (T) which is done in a
> > larger
> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
> > operation
> > that is 'slow' for you?
>
> This is the point: "zext(i)" is the instruction that I want to
> eliminate,
> which is the direct goal of the optimization.
>
> The gain of eliminating the 'zext' is visible or not, and the code size
> increasing is small enough or not, this is a question and needs to
> trade-off.
> It may be only acceptable if the loop is very small, then eliminating
> 'zext'
> would help to save runtime, and code size increase maybe not big.

OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
make versioning a good trade-off.  The micro-architecture should better
not make that totally slow (I'd expect an extra latency comparable to
the multiply or add on the &a + zext(i) * 8 instruction chain).

OTOH making SCEV analysis not give up but instead record the constraints
under which its solution is valid is a very good and useful thing to do.

Richard.

> Thanks again for your very helpful comments!
>
> BR.
> Jiufu Guo.
>
> >
> > Richard.
> >
> >> BR.
> >> Jiufu Guo.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-24  7:55           ` Richard Biener
@ 2021-03-24 12:12             ` guojiufu
  2021-03-24 12:33               ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-24 12:12 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, GCC Development, Richard Guenther, Segher Boessenkool

On 2021-03-24 15:55, Richard Biener wrote:
> On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> 
> wrote:
>> 
>> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> > wrote:
>> >>
>> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
>> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
>> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> cases similar as to how niter analysis computes it's 'assumptions'
>> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> upfront for the eventual chance that there'll be any.  Ideally such
>> >> >> transform would then choose IVs in their transformed copy that
>> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> transform.
>> >> >
>> >> > And it might be beneficial to perform some type promotion/demotion
>> >> > pass, either early during vectorization or separately before
>> >> > vectorization
>> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> > Find out what type sizes the loop use, first try to demote computations
>> >> > to narrower types in the vectorized loop candidate (e.g. if something
>> >> > is computed in a wider type only to have the result demoted to narrower
>> >> > type), then pick up the widest type size still in use in the loop (ok,
>> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> > currently
>> >> > our vectorizer doesn't do that) and try to promote computations that
>> >> > could
>> >> > be promoted to that type size.  We do partially something like that
>> >> > during
>> >> > vect patterns for bool types, but not other types I think.
>> >> >
>> >> >       Jakub
>> >>
>> >> Thanks for the suggestions!
>> >>
>> >> Enhancing SCEV could help other optimizations and improve performance
>> >> in
>> >> some cases.
>> >> While one of the direct ideas of using the '64bit type' is to
>> >> eliminate
>> >> conversions,
>> >> even for some cases which are not easy to be optimized through
>> >> ifconv/vectorization,
>> >> for examples:
>> >>
>> >>    unsigned int i = 0;
>> >>    while (a[i]>1e-3)
>> >>      i++;
>> >>
>> >>    unsigned int i = 0;
>> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>> >>      i++;
>> >>
>> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >
>> > But the "optimization" resulting from such versioning is hard to
>> > determine upfront which means we'll pay quite a big code size cost
>> > for unknown questionable gain.  What's the particular optimization
>> 
>> Right.  Code size increasing is a big pain on large loops. If the gain
>> is not significant, this optimization may not positive.
>> 
>> > in the above cases?  Note that for example for
>> >
>> >     unsigned int i = 0;
>> >     while (a[i]>1e-3)
>> >        i++;
>> >
>> > you know that when 'i' wraps then the loop will not terminate.  There's
>> 
>> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound 
>> is
>> checkable.  Otherwise, the optimization to avoid zext is not 
>> adoptable.
>> 
>> > the address computation that is i * sizeof (T) which is done in a
>> > larger
>> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> > operation
>> > that is 'slow' for you?
>> 
>> This is the point: "zext(i)" is the instruction that I want to
>> eliminate,
>> which is the direct goal of the optimization.
>> 
>> The gain of eliminating the 'zext' is visible or not, and the code 
>> size
>> increasing is small enough or not, this is a question and needs to
>> trade-off.
>> It may be only acceptable if the loop is very small, then eliminating
>> 'zext'
>> would help to save runtime, and code size increase maybe not big.
> 
> OK, so I indeed think that the desire to micro-optimize a 'zext' 
> doesn't
> make versioning a good trade-off.  The micro-architecture should better
> not make that totally slow (I'd expect an extra latency comparable to
> the multiply or add on the &a + zext(i) * 8 instruction chain).

Agree, I understand your point.  The concern is that some
micro-architectures do not handle this very well yet.  I tested the above
example code:
   unsigned i = 0;
   while (a[i] > 1e-3 && i < n)
     i++;
there is a ~30% performance improvement when using "long i" instead of
"unsigned i" on ppc64le and x86.  It seems those instructions are not
optimized that well on some platforms.  So, I'm wondering whether we need
to do this in GCC.

> 
> OTOH making SCEV analysis not give up but instead record the 
> constraints
> under which its solution is valid is a very good and useful thing to 
> do.

Thanks! Enhancing SCEV could help a few cases, especially when other
optimizations are enabled.

Thanks again for your suggestions!

BR.
Jiufu Guo.

> 
> Richard.
> 
>> Thanks again for your very helpful comments!
>> 
>> BR.
>> Jiufu Guo.
>> 
>> >
>> > Richard.
>> >
>> >> BR.
>> >> Jiufu Guo.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-24 12:12             ` guojiufu
@ 2021-03-24 12:33               ` Richard Biener
  2021-03-24 15:17                 ` guojiufu
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2021-03-24 12:33 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, Jakub Jelinek, GCC Development, Segher Boessenkool

On Wed, 24 Mar 2021, guojiufu wrote:

> On 2021-03-24 15:55, Richard Biener wrote:
> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
> >> 
> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
> >> > wrote:
> >> >>
> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
> >> >> > wrote:
> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
> >> >> >> dependence analysis) to track extra conditions they need to handle
> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
> >> >> >> condition.  That allows the versioning to be done when there's an
> >> >> >> actual beneficial transform (like vectorization) rather than just
> >> >> >> upfront for the eventual chance that there'll be any.  Ideally such
> >> >> >> transform would then choose IVs in their transformed copy that
> >> >> >> are analyzable w/o repeating such versioning exercise for the next
> >> >> >> transform.
> >> >> >
> >> >> > And it might be beneficial to perform some type promotion/demotion
> >> >> > pass, either early during vectorization or separately before
> >> >> > vectorization
> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
> >> >> > Find out what type sizes the loop use, first try to demote
> >> >> > computations
> >> >> > to narrower types in the vectorized loop candidate (e.g. if something
> >> >> > is computed in a wider type only to have the result demoted to
> >> >> > narrower
> >> >> > type), then pick up the widest type size still in use in the loop (ok,
> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
> >> >> > currently
> >> >> > our vectorizer doesn't do that) and try to promote computations that
> >> >> > could
> >> >> > be promoted to that type size.  We do partially something like that
> >> >> > during
> >> >> > vect patterns for bool types, but not other types I think.
> >> >> >
> >> >> >       Jakub
> >> >>
> >> >> Thanks for the suggestions!
> >> >>
> >> >> Enhancing SCEV could help other optimizations and improve performance
> >> >> in
> >> >> some cases.
> >> >> While one of the direct ideas of using the '64bit type' is to
> >> >> eliminate
> >> >> conversions,
> >> >> even for some cases which are not easy to be optimized through
> >> >> ifconv/vectorization,
> >> >> for examples:
> >> >>
> >> >>    unsigned int i = 0;
> >> >>    while (a[i]>1e-3)
> >> >>      i++;
> >> >>
> >> >>    unsigned int i = 0;
> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
> >> >>      i++;
> >> >>
> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
> >> >
> >> > But the "optimization" resulting from such versioning is hard to
> >> > determine upfront which means we'll pay quite a big code size cost
> >> > for unknown questionable gain.  What's the particular optimization
> >> 
> >> Right.  Code size increasing is a big pain on large loops. If the gain
> >> is not significant, this optimization may not positive.
> >> 
> >> > in the above cases?  Note that for example for
> >> >
> >> >     unsigned int i = 0;
> >> >     while (a[i]>1e-3)
> >> >        i++;
> >> >
> >> > you know that when 'i' wraps then the loop will not terminate.  There's
> >> 
> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
> >> 
> >> > the address computation that is i * sizeof (T) which is done in a
> >> > larger
> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
> >> > operation
> >> > that is 'slow' for you?
> >> 
> >> This is the point: "zext(i)" is the instruction that I want to
> >> eliminate,
> >> which is the direct goal of the optimization.
> >> 
> >> The gain of eliminating the 'zext' is visible or not, and the code size
> >> increasing is small enough or not, this is a question and needs to
> >> trade-off.
> >> It may be only acceptable if the loop is very small, then eliminating
> >> 'zext'
> >> would help to save runtime, and code size increase maybe not big.
> > 
> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
> > make versioning a good trade-off.  The micro-architecture should better
> > not make that totally slow (I'd expect an extra latency comparable to
> > the multiply or add on the &a + zext(i) * 8 instruction chain).
> 
> Agree, I understand your point.  The concern is some micro-architectures are
> not
> very well on this yet.  I tested the above example code:
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
> there are ~30% performance improvement if using "long i" instead of"unsigned
> i"
> on ppc64le and x86.  It seems those instructions are not optimized too much on
> some platforms.  So, I'm wondering we need to do this in GCC.

On x86 I see indexed addressing modes being used which should be fine.
Compilable testcase:

unsigned foo (double *a, unsigned n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}

ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
time following its assembly ... ah, -fno-unroll-loops "helps" and
produces

.L5:
        lfd %f0,0(%r9)
        addi %r3,%r3,1
        addi %r9,%r9,8
        rldicl %r3,%r3,0,32
        fcmpu %cr0,%f0,%f12
        bnglr %cr0
        bdnz .L5

which looks pretty good to me, I suppose the rldicl is the
zero-extension but the IVs are already 64bit and the
zero-extension should be sunk to the loop exit instead.
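
At the source level, the effect of sinking that conversion out of the loop
can be seen by hand (illustration only):

/* Same loop as the testcase above, but with a 64-bit IV inside the loop
   and a single truncation at the exit; since i < n and n is 32-bit, the
   IV can never wrap, so this is equivalent to the original.  */
unsigned
foo_sunk (double *a, unsigned n)
{
  unsigned long i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return (unsigned) i;          /* the only 64->32 conversion */
}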

> > 
> > OTOH making SCEV analysis not give up but instead record the constraints
> > under which its solution is valid is a very good and useful thing to do.
> 
> Thanks! Enhance SCEV could help a few cases, especially when other
> optimizations
> are enabled.
> 
> Thanks again for your suggestions!
> 
> BR.
> Jiufu Guo.
> 
> > 
> > Richard.
> > 
> >> Thanks again for your very helpful comments!
> >> 
> >> BR.
> >> Jiufu Guo.
> >> 
> >> >
> >> > Richard.
> >> >
> >> >> BR.
> >> >> Jiufu Guo.
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-24 12:33               ` Richard Biener
@ 2021-03-24 15:17                 ` guojiufu
  2021-03-25  8:35                   ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-24 15:17 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Jakub Jelinek, GCC Development, Segher Boessenkool

On 2021-03-24 20:33, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
> 
>> On 2021-03-24 15:55, Richard Biener wrote:
>> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >>
>> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> > wrote:
>> >> >>
>> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> > wrote:
>> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
>> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
>> >> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> upfront for the eventual chance that there'll be any.  Ideally such
>> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> transform.
>> >> >> >
>> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> > pass, either early during vectorization or separately before
>> >> >> > vectorization
>> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> > Find out what type sizes the loop use, first try to demote
>> >> >> > computations
>> >> >> > to narrower types in the vectorized loop candidate (e.g. if something
>> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> > narrower
>> >> >> > type), then pick up the widest type size still in use in the loop (ok,
>> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> > currently
>> >> >> > our vectorizer doesn't do that) and try to promote computations that
>> >> >> > could
>> >> >> > be promoted to that type size.  We do partially something like that
>> >> >> > during
>> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >
>> >> >> >       Jakub
>> >> >>
>> >> >> Thanks for the suggestions!
>> >> >>
>> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> in
>> >> >> some cases.
>> >> >> While one of the direct ideas of using the '64bit type' is to
>> >> >> eliminate
>> >> >> conversions,
>> >> >> even for some cases which are not easy to be optimized through
>> >> >> ifconv/vectorization,
>> >> >> for examples:
>> >> >>
>> >> >>    unsigned int i = 0;
>> >> >>    while (a[i]>1e-3)
>> >> >>      i++;
>> >> >>
>> >> >>    unsigned int i = 0;
>> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >>      i++;
>> >> >>
>> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >> >
>> >> > But the "optimization" resulting from such versioning is hard to
>> >> > determine upfront which means we'll pay quite a big code size cost
>> >> > for unknown questionable gain.  What's the particular optimization
>> >>
>> >> Right.  Code size increasing is a big pain on large loops. If the gain
>> >> is not significant, this optimization may not positive.
>> >>
>> >> > in the above cases?  Note that for example for
>> >> >
>> >> >     unsigned int i = 0;
>> >> >     while (a[i]>1e-3)
>> >> >        i++;
>> >> >
>> >> > you know that when 'i' wraps then the loop will not terminate.  There's
>> >>
>> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
>> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
>> >>
>> >> > the address computation that is i * sizeof (T) which is done in a
>> >> > larger
>> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> > operation
>> >> > that is 'slow' for you?
>> >>
>> >> This is the point: "zext(i)" is the instruction that I want to
>> >> eliminate,
>> >> which is the direct goal of the optimization.
>> >>
>> >> The gain of eliminating the 'zext' is visible or not, and the code size
>> >> increasing is small enough or not, this is a question and needs to
>> >> trade-off.
>> >> It may be only acceptable if the loop is very small, then eliminating
>> >> 'zext'
>> >> would help to save runtime, and code size increase maybe not big.
>> >
>> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> > make versioning a good trade-off.  The micro-architecture should better
>> > not make that totally slow (I'd expect an extra latency comparable to
>> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> 
>> Agree, I understand your point.  The concern is some 
>> micro-architectures are
>> not
>> very well on this yet.  I tested the above example code:
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>> there are ~30% performance improvement if using "long i" instead 
>> of"unsigned
>> i"
>> on ppc64le and x86.  It seems those instructions are not optimized too 
>> much on
>> some platforms.  So, I'm wondering we need to do this in GCC.
> 
> On x86 I see indexed addressing modes being used which should be fine.
> Compilable testcase:
> 
> unsigned foo (double *a, unsigned n)
> {
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>   return i;
> }
> 
> ppc64le seems to do some odd unrolling/peeling or whatnot, I have a 
> hard
> time following its assembly ... ah, -fno-unroll-loops "helps" and
> produces
> 
> .L5:
>         lfd %f0,0(%r9)
>         addi %r3,%r3,1
>         addi %r9,%r9,8
>         rldicl %r3,%r3,0,32
>         fcmpu %cr0,%f0,%f12
>         bnglr %cr0
>         bdnz .L5
> 
> which looks pretty good to me, I suppose the rldicl is the
> zero-extension but the IVs are already 64bit and the
> zero-extension should be sunk to the loop exit instead.

Thanks so much for your quick reply!

Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
which also improve performance.

:) my test code is a little differentent:  argument n's type is long 
(sorry).
unsigned __attribute__ ((noinline)) foo (double *a, long n)
{
   unsigned i = 0;
   while (a[i] > 1e-3 && i < n)
     i++;
   return i;
}

   111: L111:
    39: pc={(%7:CC>=0)?simple_return:pc}
   102: %3:SI=%3:SI+0x1
   103: %9:DI=%3:DI<<0x3&0x7fffffff8
   104: %3:DI=zero_extend(%3:SI)
   105: %7:CC=cmp(%3:DI,%4:DI)
   106: %0:DF=[%10:DI+%9:DI]
   107: %0:CCFP=cmp(%0:DF,%12:DF)
   109: pc={(%0:CCFP>0)?L111:pc}

.L14:
         bgelr 7
         addi 3,3,1
         rldic 9,3,3,29
         rldicl 3,3,0,32
         cmpd 7,3,4
         lfdx 0,10,9
	fcmpu 0,0,12
         bgt 0,.L14

-------------- or for code:
unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,  
unsigned i)
{
   while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is 
sinkable
     i++;
   return i;
}

    99: L99:
    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber 
scratch;clobber scratch;}
    36: L36:
    23: %5:SI=%5:SI+0x1
    25: %9:DI=%5:DI<<0x3&0x7fffffff8
    24: %5:DI=zero_extend(%5:SI)
    29: %0:DF=[%3:DI+%9:DI]
    30: %0:CCFP=cmp(%0:DF,%12:DF)
    31: pc={(%0:CCFP>0)?L99:pc}


.L12:
         bdz .L2
.L5:
         addi 5,5,1
         rldic 9,5,3,29
         rldicl 5,5,0,32
         lfdx 0,3,9
         fcmpu 0,0,12
         bgt 0,.L12
.L2:

Any suggestions about how to get rid of the 'zext' in the above cases? My
previous idea was versioning :).

> 
>> >
>> > OTOH making SCEV analysis not give up but instead record the constraints
>> > under which its solution is valid is a very good and useful thing to do.
>> 
>> Thanks! Enhance SCEV could help a few cases, especially when other
>> optimizations
>> are enabled.
>> 
>> Thanks again for your suggestions!
>> 
>> BR.
>> Jiufu Guo.
>> 
>> >
>> > Richard.
>> >
>> >> Thanks again for your very helpful comments!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> BR.
>> >> >> Jiufu Guo.
>> 
>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-24 15:17                 ` guojiufu
@ 2021-03-25  8:35                   ` Richard Biener
  2021-03-25 13:32                     ` guojiufu
  2021-03-26  8:19                     ` guojiufu
  0 siblings, 2 replies; 14+ messages in thread
From: Richard Biener @ 2021-03-25  8:35 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, Jakub Jelinek, GCC Development,
	Segher Boessenkool, Michael Matz

On Wed, 24 Mar 2021, guojiufu wrote:

> On 2021-03-24 20:33, Richard Biener wrote:
> > On Wed, 24 Mar 2021, guojiufu wrote:
> > 
> >> On 2021-03-24 15:55, Richard Biener wrote:
> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
> >> >>
> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
> >> >> >> > wrote:
> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
> >> >> >> >> also
> >> >> >> >> dependence analysis) to track extra conditions they need to handle
> >> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
> >> >> >> >> condition.  That allows the versioning to be done when there's an
> >> >> >> >> actual beneficial transform (like vectorization) rather than just
> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally
> >> >> >> >> such
> >> >> >> >> transform would then choose IVs in their transformed copy that
> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
> >> >> >> >> transform.
> >> >> >> >
> >> >> >> > And it might be beneficial to perform some type promotion/demotion
> >> >> >> > pass, either early during vectorization or separately before
> >> >> >> > vectorization
> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
> >> >> >> > Find out what type sizes the loop use, first try to demote
> >> >> >> > computations
> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
> >> >> >> > something
> >> >> >> > is computed in a wider type only to have the result demoted to
> >> >> >> > narrower
> >> >> >> > type), then pick up the widest type size still in use in the loop
> >> >> >> > (ok,
> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
> >> >> >> > currently
> >> >> >> > our vectorizer doesn't do that) and try to promote computations
> >> >> >> > that
> >> >> >> > could
> >> >> >> > be promoted to that type size.  We do partially something like that
> >> >> >> > during
> >> >> >> > vect patterns for bool types, but not other types I think.
> >> >> >> >
> >> >> >> >       Jakub
> >> >> >>
> >> >> >> Thanks for the suggestions!
> >> >> >>
> >> >> >> Enhancing SCEV could help other optimizations and improve performance
> >> >> >> in
> >> >> >> some cases.
> >> >> >> While one of the direct ideas of using the '64bit type' is to
> >> >> >> eliminate
> >> >> >> conversions,
> >> >> >> even for some cases which are not easy to be optimized through
> >> >> >> ifconv/vectorization,
> >> >> >> for examples:
> >> >> >>
> >> >> >>    unsigned int i = 0;
> >> >> >>    while (a[i]>1e-3)
> >> >> >>      i++;
> >> >> >>
> >> >> >>    unsigned int i = 0;
> >> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
> >> >> >>      i++;
> >> >> >>
> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
> >> >> >
> >> >> > But the "optimization" resulting from such versioning is hard to
> >> >> > determine upfront which means we'll pay quite a big code size cost
> >> >> > for unknown questionable gain.  What's the particular optimization
> >> >>
> >> >> Right.  Code size increasing is a big pain on large loops. If the gain
> >> >> is not significant, this optimization may not positive.
> >> >>
> >> >> > in the above cases?  Note that for example for
> >> >> >
> >> >> >     unsigned int i = 0;
> >> >> >     while (a[i]>1e-3)
> >> >> >        i++;
> >> >> >
> >> >> > you know that when 'i' wraps then the loop will not terminate.
> >> >> > There's
> >> >>
> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
> >> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
> >> >>
> >> >> > the address computation that is i * sizeof (T) which is done in a
> >> >> > larger
> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
> >> >> > operation
> >> >> > that is 'slow' for you?
> >> >>
> >> >> This is the point: "zext(i)" is the instruction that I want to
> >> >> eliminate,
> >> >> which is the direct goal of the optimization.
> >> >>
> >> >> The gain of eliminating the 'zext' is visible or not, and the code size
> >> >> increasing is small enough or not, this is a question and needs to
> >> >> trade-off.
> >> >> It may be only acceptable if the loop is very small, then eliminating
> >> >> 'zext'
> >> >> would help to save runtime, and code size increase maybe not big.
> >> >
> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
> >> > make versioning a good trade-off.  The micro-architecture should better
> >> > not make that totally slow (I'd expect an extra latency comparable to
> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
> >> 
> >> Agree, I understand your point.  The concern is some micro-architectures
> >> are
> >> not
> >> very well on this yet.  I tested the above example code:
> >>   unsigned i = 0;
> >>   while (a[i] > 1e-3 && i < n)
> >>     i++;
> >> there are ~30% performance improvement if using "long i" instead
> >> of"unsigned
> >> i"
> >> on ppc64le and x86.  It seems those instructions are not optimized too much
> >> on
> >> some platforms.  So, I'm wondering we need to do this in GCC.
> > 
> > On x86 I see indexed addressing modes being used which should be fine.
> > Compilable testcase:
> > 
> > unsigned foo (double *a, unsigned n)
> > {
> >   unsigned i = 0;
> >   while (a[i] > 1e-3 && i < n)
> >     i++;
> >   return i;
> > }
> > 
> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
> > time following its assembly ... ah, -fno-unroll-loops "helps" and
> > produces
> > 
> > .L5:
> >         lfd %f0,0(%r9)
> >         addi %r3,%r3,1
> >         addi %r9,%r9,8
> >         rldicl %r3,%r3,0,32
> >         fcmpu %cr0,%f0,%f12
> >         bnglr %cr0
> >         bdnz .L5
> > 
> > which looks pretty good to me, I suppose the rldicl is the
> > zero-extension but the IVs are already 64bit and the
> > zero-extension should be sunk to the loop exit instead.
> 
> Thanks so much for your quick reply!
> 
> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
> which also improve performance.
> 
> :) my test code is a little differentent:  argument n's type is long 
> (sorry).
> unsigned __attribute__ ((noinline)) foo (double *a, long n)
> {
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>   return i;
> }

You can't elide the zero-extend here since n can be bigger
than UINT_MAX and thus i has to wrap.

>   111: L111:
>   39: pc={(%7:CC>=0)?simple_return:pc}
>   102: %3:SI=%3:SI+0x1
>   103: %9:DI=%3:DI<<0x3&0x7fffffff8
>   104: %3:DI=zero_extend(%3:SI)
>   105: %7:CC=cmp(%3:DI,%4:DI)
>   106: %0:DF=[%10:DI+%9:DI]
>   107: %0:CCFP=cmp(%0:DF,%12:DF)
>   109: pc={(%0:CCFP>0)?L111:pc}
> 
> .L14:
>         bgelr 7
>         addi 3,3,1
>         rldic 9,3,3,29
>         rldicl 3,3,0,32
>         cmpd 7,3,4
>         lfdx 0,10,9
> 	fcmpu 0,0,12
>         bgt 0,.L14
> 
> -------------- or for code:
> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,  unsigned i)
> {
>   while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is sinkable
>     i++;
>   return i;
> }

With i != n we cannot compute the number of iterations because i can
wrap, thus we cannot elide the zero-extension.

>    99: L99:
>    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber 
> scratch;clobber scratch;}
>    36: L36:
>    23: %5:SI=%5:SI+0x1
>    25: %9:DI=%5:DI<<0x3&0x7fffffff8
>    24: %5:DI=zero_extend(%5:SI)
>    29: %0:DF=[%3:DI+%9:DI]
>    30: %0:CCFP=cmp(%0:DF,%12:DF)
>    31: pc={(%0:CCFP>0)?L99:pc}
> 
> 
> .L12:
>         bdz .L2
> .L5:
>         addi 5,5,1
>         rldic 9,5,3,29
>         rldicl 5,5,0,32
>         lfdx 0,3,9
>         fcmpu 0,0,12
>         bgt 0,.L12
> .L2:
> 
> Any suggestions about how to relax the 'zext' in above cases? My previous idea
> was versioning :).

I don't think there's another good way.  Not sure if I'd call it
versioning; I suppose for this kind of possibly wrapping IV it
would be iteration space splitting?  So maybe it could be integrated
into the loop split pass somehow?  Let's look at

unsigned foo (double *a, unsigned n, unsigned i)
{
   while (a[i] > 1e-3 && i != n)
     i++;
   return i;
}

we'd split it as

  while (a[i] > 1e-3 && i > n)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;

but I still see zero-extends here on x86.

For the case with 'long n' the issue is that the loop might not even
terminate and thus i wraps anyway.  There one could split the
iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.

Not sure if this makes sense..
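
A rough sketch of one possible reading of that split for the 'long n'
variant (illustrative only):

#include <limits.h>

/* Split the iteration space at MIN (n, UINT_MAX): the first copy can use
   a 64-bit IV with no zero-extension, and only when the bound really
   exceeds UINT_MAX do we continue with the original wrapping unsigned
   IV.  */
unsigned
foo_split (double *a, long n)
{
  long bound = n < (long) UINT_MAX ? n : (long) UINT_MAX;
  long l_i = 0;
  while (a[l_i] > 1e-3 && l_i < bound)  /* no zext in this copy */
    l_i++;
  unsigned i = (unsigned) l_i;
  if (l_i == (long) UINT_MAX && n > (long) UINT_MAX)
    while (a[i] > 1e-3 && i < n)        /* i may wrap, as in the original */
      i++;
  return i;
}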

Richard.

> > 
> >> >
> >> > OTOH making SCEV analysis not give up but instead record the constraints
> >> > under which its solution is valid is a very good and useful thing to do.
> >> 
> >> Thanks! Enhance SCEV could help a few cases, especially when other
> >> optimizations
> >> are enabled.
> >> 
> >> Thanks again for your suggestions!
> >> 
> >> BR.
> >> Jiufu Guo.
> >> 
> >> >
> >> > Richard.
> >> >
> >> >> Thanks again for your very helpful comments!
> >> >>
> >> >> BR.
> >> >> Jiufu Guo.
> >> >>
> >> >> >
> >> >> > Richard.
> >> >> >
> >> >> >> BR.
> >> >> >> Jiufu Guo.
> >> 
> >> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-25  8:35                   ` Richard Biener
@ 2021-03-25 13:32                     ` guojiufu
  2021-03-25 13:47                       ` guojiufu
  2021-03-26  8:19                     ` guojiufu
  1 sibling, 1 reply; 14+ messages in thread
From: guojiufu @ 2021-03-25 13:32 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Jakub Jelinek, GCC Development,
	Segher Boessenkool, Michael Matz

On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
> 
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop use, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size.  We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> >       Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in
>> >> >> >> some cases.
>> >> >> >> While one of the direct ideas of using the '64bit type' is to
>> >> >> >> eliminate
>> >> >> >> conversions,
>> >> >> >> even for some cases which are not easy to be optimized through
>> >> >> >> ifconv/vectorization,
>> >> >> >> for examples:
>> >> >> >>
>> >> >> >>    unsigned int i = 0;
>> >> >> >>    while (a[i]>1e-3)
>> >> >> >>      i++;
>> >> >> >>
>> >> >> >>    unsigned int i = 0;
>> >> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >>      i++;
>> >> >> >>
>> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain.  What's the particular optimization
>> >> >>
>> >> >> Right.  Code size increasing is a big pain on large loops. If the gain
>> >> >> is not significant, this optimization may not positive.
>> >> >>
>> >> >> > in the above cases?  Note that for example for
>> >> >> >
>> >> >> >     unsigned int i = 0;
>> >> >> >     while (a[i]>1e-3)
>> >> >> >        i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
>> >> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate,
>> >> >> which is the direct goal of the optimization.
>> >> >>
>> >> >> The gain of eliminating the 'zext' is visible or not, and the code size
>> >> >> increasing is small enough or not, this is a question and needs to
>> >> >> trade-off.
>> >> >> It may be only acceptable if the loop is very small, then eliminating
>> >> >> 'zext'
>> >> >> would help to save runtime, and code size increase maybe not big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off.  The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agree, I understand your point.  The concern is some micro-architectures
>> >> are
>> >> not
>> >> very well on this yet.  I tested the above example code:
>> >>   unsigned i = 0;
>> >>   while (a[i] > 1e-3 && i < n)
>> >>     i++;
>> >> there are ~30% performance improvement if using "long i" instead
>> >> of"unsigned
>> >> i"
>> >> on ppc64le and x86.  It seems those instructions are not optimized too much
>> >> on
>> >> some platforms.  So, I'm wondering we need to do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> >   unsigned i = 0;
>> >   while (a[i] > 1e-3 && i < n)
>> >     i++;
>> >   return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> >         lfd %f0,0(%r9)
>> >         addi %r3,%r3,1
>> >         addi %r9,%r9,8
>> >         rldicl %r3,%r3,0,32
>> >         fcmpu %cr0,%f0,%f12
>> >         bnglr %cr0
>> >         bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
>> 
>> Thanks so much for your quick reply!
>> 
>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>> which also improve performance.
>> 
>> :) my test code is a little differentent:  argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>>   return i;
>> }
> 
> You can't elide the zero-extend here since n can be bigger
> than MAX_UINT and thus i has to wrap.
> 
>>   111: L111:
>>   39: pc={(%7:CC>=0)?simple_return:pc}
>>   102: %3:SI=%3:SI+0x1
>>   103: %9:DI=%3:DI<<0x3&0x7fffffff8
>>   104: %3:DI=zero_extend(%3:SI)
>>   105: %7:CC=cmp(%3:DI,%4:DI)
>>   106: %0:DF=[%10:DI+%9:DI]
>>   107: %0:CCFP=cmp(%0:DF,%12:DF)
>>   109: pc={(%0:CCFP>0)?L111:pc}
>> 
>> .L14:
>>         bgelr 7
>>         addi 3,3,1
>>         rldic 9,3,3,29
>>         rldicl 3,3,0,32
>>         cmpd 7,3,4
>>         lfdx 0,10,9
>> 	fcmpu 0,0,12
>>         bgt 0,.L14
>> 
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,  
>> unsigned i)
>> {
>>   while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is 
>> sinkable
>>     i++;
>>   return i;
>> }
> 
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.

Yes, because 'i' could wrap, the 'zext' cannot be removed from the loop
body.

> 
>>    99: L99:
>>    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>>    36: L36:
>>    23: %5:SI=%5:SI+0x1
>>    25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>    24: %5:DI=zero_extend(%5:SI)
>>    29: %0:DF=[%3:DI+%9:DI]
>>    30: %0:CCFP=cmp(%0:DF,%12:DF)
>>    31: pc={(%0:CCFP>0)?L99:pc}
>> 
>> 
>> .L12:
>>         bdz .L2
>> .L5:
>>         addi 5,5,1
>>         rldic 9,5,3,29
>>         rldicl 5,5,0,32
>>         lfdx 0,3,9
>>         fcmpu 0,0,12
>>         bgt 0,.L12
>> .L2:
>> 
>> Any suggestions about how to relax the 'zext' in above cases? My 
>> previous idea
>> was versioning :).
> 
> I don't think there's another good way.  Not sure if I'd call it
> versioning, I suppose for these kind of possibly wrapping IVs it
> would be iteration space splitting?  So maybe it could be integrated
> into the loop split pass somehow?  Let's look at
> 
> unsigned foo (double *a, unsigned n, unsigned i)
> {
>    while (a[i] > 1e-3 && i != n)
>      i++;
>    return i;
> }
> 
> we'd split it as
> 
>   while (a[i] > 1e-3 && i > n)
>     i++;
>   while (a[i] > 1e-3 && i < n)
>     i++;
Or:
if (i < n)
  while (a[i] > 1e-3 && i < n)   // i steps up to n, does not wrap
    i++;
else if (i > n)
  while (a[i] > 1e-3 && i > n)   // or "while (a[i] > 1e-3 && i <= UINT_MAX)"
    i++;

These two new loops may be able to elide the 'zext'.
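
Here is a compilable sketch of the guarded form (hand-written; the name
foo_guarded is made up and this is not an existing GCC transform).  The
i < n copy carries the IV in 64 bits with a single narrowing at the exit,
while the other copy keeps the original loop:

unsigned foo_guarded (double *a, unsigned n, unsigned i)
{
  /* Equivalent to: while (a[i] > 1e-3 && i != n) i++;  */
  if (i < n)
    {
      /* In this copy i only grows towards n, so the IV can be kept
         in a 64-bit register and narrowed once at the loop exit.  */
      long il = i, nl = n;
      while (a[il] > 1e-3 && il < nl)
        il++;
      i = (unsigned) il;
    }
  else
    /* i >= n: keep the original loop; i may still wrap here, so the
       zero-extension stays in this copy.  */
    while (a[i] > 1e-3 && i != n)
      i++;
  return i;
}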

> 
> but I still see zero-extends here on x86.
> 
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway.  There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.

Yes, if 'long n' > UINT_MAX, 'i' would wrap, so the 'zext' is kept in the
loop body.
If the compiler can analyze that 'i' will not wrap, then the 'zext' may be
moved out of the loop.

if (n <= UINT_MAX)
   the loop runs at most n times
else
   the loop may be infinite, i.e. 'unsigned i < n' is always true
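
A source-level sketch of such a check (hand-written; foo_versioned is a
made-up name, not something GCC emits today) could be:

#include <limits.h>

unsigned foo_versioned (double *a, long n)
{
  unsigned i = 0;
  if (n <= (long) UINT_MAX)
    {
      /* i cannot wrap before reaching n, so the IV can be a 64-bit
         variable and no zero-extension is needed in the loop.  */
      long il = 0;
      while (a[il] > 1e-3 && il < n)
        il++;
      i = (unsigned) il;
    }
  else
    /* n > UINT_MAX: zext (i) < n is always true, so only the fp test
       can terminate the loop; keep the original form.  */
    while (a[i] > 1e-3 && i < n)
      i++;
  return i;
}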


Thanks for your always helpful suggestions!


BR.
Jiufu Guo.


> 
> Not sure if this makes sense..


> 
> Richard.
> 
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhance SCEV could help a few cases, especially when other
>> >> optimizations
>> >> are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>> 
>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-25 13:32                     ` guojiufu
@ 2021-03-25 13:47                       ` guojiufu
  0 siblings, 0 replies; 14+ messages in thread
From: guojiufu @ 2021-03-25 13:47 UTC (permalink / raw)
  To: guojiufu
  Cc: Richard Biener, Jakub Jelinek, Segher Boessenkool, GCC Development

On 2021-03-25 21:32, guojiufu via Gcc wrote:
> On 2021-03-25 16:35, Richard Biener wrote:
>> On Wed, 24 Mar 2021, guojiufu wrote:
>> 
>>> On 2021-03-24 20:33, Richard Biener wrote:
>>> > On Wed, 24 Mar 2021, guojiufu wrote:
>>> >
>>> >> On 2021-03-24 15:55, Richard Biener wrote:
>>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>>> >> >>
>>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>>> >> >> >> > wrote:
>>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>>> >> >> >> >> also
>>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>>> >> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
>>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally
>>> >> >> >> >> such
>>> >> >> >> >> transform would then choose IVs in their transformed copy that
>>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>>> >> >> >> >> transform.
>>> >> >> >> >
>>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>>> >> >> >> > pass, either early during vectorization or separately before
>>> >> >> >> > vectorization
>>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>>> >> >> >> > Find out what type sizes the loop use, first try to demote
>>> >> >> >> > computations
>>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>>> >> >> >> > something
>>> >> >> >> > is computed in a wider type only to have the result demoted to
>>> >> >> >> > narrower
>>> >> >> >> > type), then pick up the widest type size still in use in the loop
>>> >> >> >> > (ok,
>>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>>> >> >> >> > currently
>>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>>> >> >> >> > that
>>> >> >> >> > could
>>> >> >> >> > be promoted to that type size.  We do partially something like that
>>> >> >> >> > during
>>> >> >> >> > vect patterns for bool types, but not other types I think.
>>> >> >> >> >
>>> >> >> >> >       Jakub
>>> >> >> >>
>>> >> >> >> Thanks for the suggestions!
>>> >> >> >>
>>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>>> >> >> >> in
>>> >> >> >> some cases.
>>> >> >> >> While one of the direct ideas of using the '64bit type' is to
>>> >> >> >> eliminate
>>> >> >> >> conversions,
>>> >> >> >> even for some cases which are not easy to be optimized through
>>> >> >> >> ifconv/vectorization,
>>> >> >> >> for examples:
>>> >> >> >>
>>> >> >> >>    unsigned int i = 0;
>>> >> >> >>    while (a[i]>1e-3)
>>> >> >> >>      i++;
>>> >> >> >>
>>> >> >> >>    unsigned int i = 0;
>>> >> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>>> >> >> >>      i++;
>>> >> >> >>
>>> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>>> >> >> >
>>> >> >> > But the "optimization" resulting from such versioning is hard to
>>> >> >> > determine upfront which means we'll pay quite a big code size cost
>>> >> >> > for unknown questionable gain.  What's the particular optimization
>>> >> >>
>>> >> >> Right.  Code size increasing is a big pain on large loops. If the gain
>>> >> >> is not significant, this optimization may not positive.
>>> >> >>
>>> >> >> > in the above cases?  Note that for example for
>>> >> >> >
>>> >> >> >     unsigned int i = 0;
>>> >> >> >     while (a[i]>1e-3)
>>> >> >> >        i++;
>>> >> >> >
>>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>>> >> >> > There's
>>> >> >>
>>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
>>> >> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
>>> >> >>
>>> >> >> > the address computation that is i * sizeof (T) which is done in a
>>> >> >> > larger
>>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>>> >> >> > operation
>>> >> >> > that is 'slow' for you?
>>> >> >>
>>> >> >> This is the point: "zext(i)" is the instruction that I want to
>>> >> >> eliminate,
>>> >> >> which is the direct goal of the optimization.
>>> >> >>
>>> >> >> The gain of eliminating the 'zext' is visible or not, and the code size
>>> >> >> increasing is small enough or not, this is a question and needs to
>>> >> >> trade-off.
>>> >> >> It may be only acceptable if the loop is very small, then eliminating
>>> >> >> 'zext'
>>> >> >> would help to save runtime, and code size increase maybe not big.
>>> >> >
>>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>>> >> > make versioning a good trade-off.  The micro-architecture should better
>>> >> > not make that totally slow (I'd expect an extra latency comparable to
>>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>>> >>
>>> >> Agree, I understand your point.  The concern is some micro-architectures
>>> >> are
>>> >> not
>>> >> very well on this yet.  I tested the above example code:
>>> >>   unsigned i = 0;
>>> >>   while (a[i] > 1e-3 && i < n)
>>> >>     i++;
>>> >> there are ~30% performance improvement if using "long i" instead
>>> >> of"unsigned
>>> >> i"
>>> >> on ppc64le and x86.  It seems those instructions are not optimized too much
>>> >> on
>>> >> some platforms.  So, I'm wondering we need to do this in GCC.
>>> >
>>> > On x86 I see indexed addressing modes being used which should be fine.
>>> > Compilable testcase:
>>> >
>>> > unsigned foo (double *a, unsigned n)
>>> > {
>>> >   unsigned i = 0;
>>> >   while (a[i] > 1e-3 && i < n)
>>> >     i++;
>>> >   return i;
>>> > }
>>> >
>>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>>> > produces
>>> >
>>> > .L5:
>>> >         lfd %f0,0(%r9)
>>> >         addi %r3,%r3,1
>>> >         addi %r9,%r9,8
>>> >         rldicl %r3,%r3,0,32
>>> >         fcmpu %cr0,%f0,%f12
>>> >         bnglr %cr0
>>> >         bdnz .L5
>>> >
>>> > which looks pretty good to me, I suppose the rldicl is the
>>> > zero-extension but the IVs are already 64bit and the
>>> > zero-extension should be sunk to the loop exit instead.
>>> 
>>> Thanks so much for your quick reply!
>>> 
>>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>>> which also improve performance.
>>> 
>>> :) my test code is a little differentent:  argument n's type is long
>>> (sorry).
>>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>>> {
>>>   unsigned i = 0;
>>>   while (a[i] > 1e-3 && i < n)
>>>     i++;
>>>   return i;
>>> }
>> 
>> You can't elide the zero-extend here since n can be bigger
>> than MAX_UINT and thus i has to wrap.
>> 
>>>   111: L111:
>>>   39: pc={(%7:CC>=0)?simple_return:pc}
>>>   102: %3:SI=%3:SI+0x1
>>>   103: %9:DI=%3:DI<<0x3&0x7fffffff8
>>>   104: %3:DI=zero_extend(%3:SI)
>>>   105: %7:CC=cmp(%3:DI,%4:DI)
>>>   106: %0:DF=[%10:DI+%9:DI]
>>>   107: %0:CCFP=cmp(%0:DF,%12:DF)
>>>   109: pc={(%0:CCFP>0)?L111:pc}
>>> 
>>> .L14:
>>>         bgelr 7
>>>         addi 3,3,1
>>>         rldic 9,3,3,29
>>>         rldicl 3,3,0,32
>>>         cmpd 7,3,4
>>>         lfdx 0,10,9
>>> 	fcmpu 0,0,12
>>>         bgt 0,.L14
>>> 
>>> -------------- or for code:
>>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,  
>>> unsigned i)
>>> {
>>>   while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is 
>>> sinkable
>>>     i++;
>>>   return i;
>>> }
>> 
>> With i != n we cannot compute the number of iterations because i can
>> wrap, thus we cannot elide the zero-extension.
> 
> Yes, because 'i' could wrap, so 'zext' can not be avoided from the loop 
> body.
> 
>> 
>>>    99: L99:
>>>    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>>> scratch;clobber scratch;}
>>>    36: L36:
>>>    23: %5:SI=%5:SI+0x1
>>>    25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>>    24: %5:DI=zero_extend(%5:SI)
>>>    29: %0:DF=[%3:DI+%9:DI]
>>>    30: %0:CCFP=cmp(%0:DF,%12:DF)
>>>    31: pc={(%0:CCFP>0)?L99:pc}
>>> 
>>> 
>>> .L12:
>>>         bdz .L2
>>> .L5:
>>>         addi 5,5,1
>>>         rldic 9,5,3,29
>>>         rldicl 5,5,0,32
>>>         lfdx 0,3,9
>>>         fcmpu 0,0,12
>>>         bgt 0,.L12
>>> .L2:
>>> 
>>> Any suggestions about how to relax the 'zext' in above cases? My 
>>> previous idea
>>> was versioning :).
>> 
>> I don't think there's another good way.  Not sure if I'd call it
>> versioning, I suppose for these kind of possibly wrapping IVs it
>> would be iteration space splitting?  So maybe it could be integrated
>> into the loop split pass somehow?  Let's look at
>> 
>> unsigned foo (double *a, unsigned n, unsigned i)
>> {
>>    while (a[i] > 1e-3 && i != n)
>>      i++;
>>    return i;
>> }
>> 
>> we'd split it as
>> 
>>   while (a[i] > 1e-3 && i > n)
>>     i++;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
> Or:
> if (i < n)
>   while (a[i] > 1e-3 && i < n)   // i step until n, not wrap
>     i++;
> else if (i > n)
>  while (a[i] > 1e-3 && i > n) // or "while (a[i] > 1e-3 && i <= 
> UINT_MAX)"
>     i++;
> 
There is a typo; it should be:
else if (i > n)
  while (a[i] > 1e-3 && i != n) // i may wrap
    i++;

As in your split loops:
while (a[i] > 1e-3 && i > n) // i may step to UINT_MAX and then wrap to 0
  i++;
while (a[i] > 1e-3 && i < n) // will not wrap
  i++;
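
A compilable rendering of these two split loops (hand-written; the name
foo_split is made up), which should behave like the original 'i != n' loop:

unsigned foo_split (double *a, unsigned n, unsigned i)
{
  /* Upper part: i may step up to UINT_MAX and wrap to 0, at which
     point i > n becomes false and we fall through.  */
  while (a[i] > 1e-3 && i > n)
    i++;
  /* Lower part: i only grows towards n, so it does not wrap and
     the IV could be widened without a per-iteration zext.  */
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}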

> These two new loops maybe able to elide 'zext'.
> 
>> 
>> but I still see zero-extends here on x86.
>> 
>> For the case with 'long n' the issue is that the loop might not even
>> terminate and thus i wraps anyway.  There one could split the
>> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
> 
> Yes, if 'long n' > UINT_MAX, 'i' would wrap, so 'zext' is kept in the 
> loop body.
> If the compiler can analyze that 'i' will not wrap, then 'zext' may
> move out of the loop.
> 
> if (n <= UINT_MAX)
>   loop run n times
> else
>   loop may infinite, or say 'unsigned i < n is always true'
> 
> 
> Thanks for your always helpful suggestions!
> 
> 
> BR.
> Jiufu Guo.
> 
> 
>> 
>> Not sure if this makes sense..
> 
> 
>> 
>> Richard.
>> 
>>> >
>>> >> >
>>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>>> >> > under which its solution is valid is a very good and useful thing to do.
>>> >>
>>> >> Thanks! Enhance SCEV could help a few cases, especially when other
>>> >> optimizations
>>> >> are enabled.
>>> >>
>>> >> Thanks again for your suggestions!
>>> >>
>>> >> BR.
>>> >> Jiufu Guo.
>>> >>
>>> >> >
>>> >> > Richard.
>>> >> >
>>> >> >> Thanks again for your very helpful comments!
>>> >> >>
>>> >> >> BR.
>>> >> >> Jiufu Guo.
>>> >> >>
>>> >> >> >
>>> >> >> > Richard.
>>> >> >> >
>>> >> >> >> BR.
>>> >> >> >> Jiufu Guo.
>>> >>
>>> >>
>>> 
>>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] avoid type conversion through versioning loop
  2021-03-25  8:35                   ` Richard Biener
  2021-03-25 13:32                     ` guojiufu
@ 2021-03-26  8:19                     ` guojiufu
  1 sibling, 0 replies; 14+ messages in thread
From: guojiufu @ 2021-03-26  8:19 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Biener, Jakub Jelinek, GCC Development,
	Segher Boessenkool, Michael Matz

On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
> 
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar as to how niter analysis computes it's 'assumptions'
>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop use, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size.  We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> >       Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in
>> >> >> >> some cases.
>> >> >> >> While one of the direct ideas of using the '64bit type' is to
>> >> >> >> eliminate
>> >> >> >> conversions,
>> >> >> >> even for some cases which are not easy to be optimized through
>> >> >> >> ifconv/vectorization,
>> >> >> >> for examples:
>> >> >> >>
>> >> >> >>    unsigned int i = 0;
>> >> >> >>    while (a[i]>1e-3)
>> >> >> >>      i++;
>> >> >> >>
>> >> >> >>    unsigned int i = 0;
>> >> >> >>    while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >>      i++;
>> >> >> >>
>> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain.  What's the particular optimization
>> >> >>
>> >> >> Right.  Code size increasing is a big pain on large loops. If the gain
>> >> >> is not significant, this optimization may not positive.
>> >> >>
>> >> >> > in the above cases?  Note that for example for
>> >> >> >
>> >> >> >     unsigned int i = 0;
>> >> >> >     while (a[i]>1e-3)
>> >> >> >        i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", the upbound is
>> >> >> checkable.  Otherwise, the optimization to avoid zext is not adoptable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate,
>> >> >> which is the direct goal of the optimization.
>> >> >>
>> >> >> The gain of eliminating the 'zext' is visible or not, and the code size
>> >> >> increasing is small enough or not, this is a question and needs to
>> >> >> trade-off.
>> >> >> It may be only acceptable if the loop is very small, then eliminating
>> >> >> 'zext'
>> >> >> would help to save runtime, and code size increase maybe not big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off.  The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agree, I understand your point.  The concern is some micro-architectures
>> >> are
>> >> not
>> >> very well on this yet.  I tested the above example code:
>> >>   unsigned i = 0;
>> >>   while (a[i] > 1e-3 && i < n)
>> >>     i++;
>> >> there are ~30% performance improvement if using "long i" instead
>> >> of"unsigned
>> >> i"
>> >> on ppc64le and x86.  It seems those instructions are not optimized too much
>> >> on
>> >> some platforms.  So, I'm wondering we need to do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> >   unsigned i = 0;
>> >   while (a[i] > 1e-3 && i < n)
>> >     i++;
>> >   return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> >         lfd %f0,0(%r9)
>> >         addi %r3,%r3,1
>> >         addi %r9,%r9,8
>> >         rldicl %r3,%r3,0,32
>> >         fcmpu %cr0,%f0,%f12
>> >         bnglr %cr0
>> >         bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
>> 
>> Thanks so much for your quick reply!
>> 
>> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext,
>> which also improve performance.
>> 
>> :) my test code is a little differentent:  argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>>   return i;
>> }
> 
> You can't elide the zero-extend here since n can be bigger
> than MAX_UINT and thus i has to wrap.
> 
>>   111: L111:
>>   39: pc={(%7:CC>=0)?simple_return:pc}
>>   102: %3:SI=%3:SI+0x1
>>   103: %9:DI=%3:DI<<0x3&0x7fffffff8
>>   104: %3:DI=zero_extend(%3:SI)
>>   105: %7:CC=cmp(%3:DI,%4:DI)
>>   106: %0:DF=[%10:DI+%9:DI]
>>   107: %0:CCFP=cmp(%0:DF,%12:DF)
>>   109: pc={(%0:CCFP>0)?L111:pc}
>> 
>> .L14:
>>         bgelr 7
>>         addi 3,3,1
>>         rldic 9,3,3,29
>>         rldicl 3,3,0,32
>>         cmpd 7,3,4
>>         lfdx 0,10,9
>> 	fcmpu 0,0,12
>>         bgt 0,.L14
>> 
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,  
>> unsigned i)
>> {
>>   while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is 
>> sinkable
>>     i++;
>>   return i;
>> }
> 
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
> 
>>    99: L99:
>>    69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>>    36: L36:
>>    23: %5:SI=%5:SI+0x1
>>    25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>    24: %5:DI=zero_extend(%5:SI)
>>    29: %0:DF=[%3:DI+%9:DI]
>>    30: %0:CCFP=cmp(%0:DF,%12:DF)
>>    31: pc={(%0:CCFP>0)?L99:pc}
>> 
>> 
>> .L12:
>>         bdz .L2
>> .L5:
>>         addi 5,5,1
>>         rldic 9,5,3,29
>>         rldicl 5,5,0,32
>>         lfdx 0,3,9
>>         fcmpu 0,0,12
>>         bgt 0,.L12
>> .L2:
>> 
>> Any suggestions about how to relax the 'zext' in above cases? My 
>> previous idea
>> was versioning :).
> 
> I don't think there's another good way.  Not sure if I'd call it
> versioning, I suppose for these kind of possibly wrapping IVs it
> would be iteration space splitting?  So maybe it could be integrated
> into the loop split pass somehow?  Let's look at
> 
> unsigned foo (double *a, unsigned n, unsigned i)
> {
>    while (a[i] > 1e-3 && i != n)
>      i++;
>    return i;
> }
> 
> we'd split it as
> 
>   while (a[i] > 1e-3 && i > n)
>     i++;
>   while (a[i] > 1e-3 && i < n)
>     i++;
> 
And then, we can continue the transform to:
   while (a[i] > 1e-3 && i > n && i <= UINT_MAX)
     i++;
   while (a[i] > 1e-3 && i < n)
     i++;
===>
   if (i > n)
     while (a[i] > 1e-3 && i <= UINT_MAX)
       i++;
   while (a[i] > 1e-3 && i < n)
     i++;
===> (transform to the form below, perhaps similar to sinking the zext out of the loop)
   long il = i; long nl = n;
   if (il > nl)
     {
      while (a[il] > 1e-3 && il <= UINT_MAX)
        il++;
      il = il > UINT_MAX ? 0 : il; // il = zext (il#0)
     }
   while (a[il] > 1e-3 && il < nl)
     il++;
   i = il;
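
A compilable form of the sketch above (hand-written; foo_widened is a
made-up name).  Note that the il <= UINT_MAX bound is checked before the
load here, so that a[UINT_MAX + 1] - which the original loop never
touches - is not read:

#include <limits.h>

unsigned foo_widened (double *a, unsigned n, unsigned i)
{
  long il = i, nl = n;                    /* widen the IV and bound once */
  if (il > nl)
    {
      /* il never wraps: the il <= UINT_MAX bound replaces the wrap check.  */
      while (il <= (long) UINT_MAX && a[il] > 1e-3)
        il++;
      il = il > (long) UINT_MAX ? 0 : il; /* reproduce the unsigned wrap */
    }
  while (a[il] > 1e-3 && il < nl)
    il++;
  return (unsigned) il;
}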


Thanks a lot for your great help!


BR.
Jiufu Guo.


> but I still see zero-extends here on x86.
> 
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway.  There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
> 
> Not sure if this makes sense..
> 
> Richard.
> 
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhance SCEV could help a few cases, especially when other
>> >> optimizations
>> >> are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>> 
>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-26  8:19 UTC | newest]

Thread overview: 14+ messages
2021-03-22  6:39 [RFC] avoid type conversion through versioning loop guojiufu
2021-03-22  8:22 ` Richard Biener
2021-03-22  8:31   ` Jakub Jelinek
2021-03-23  3:33     ` guojiufu
2021-03-23  8:25       ` Richard Biener
2021-03-24  2:55         ` guojiufu
2021-03-24  7:55           ` Richard Biener
2021-03-24 12:12             ` guojiufu
2021-03-24 12:33               ` Richard Biener
2021-03-24 15:17                 ` guojiufu
2021-03-25  8:35                   ` Richard Biener
2021-03-25 13:32                     ` guojiufu
2021-03-25 13:47                       ` guojiufu
2021-03-26  8:19                     ` guojiufu
