From: guojiufu <guojiufu@linux.ibm.com>
To: Richard Biener <rguenther@suse.de>
Cc: Richard Biener <richard.guenther@gmail.com>,
Jakub Jelinek <jakub@redhat.com>,
GCC Development <gcc@gcc.gnu.org>,
Segher Boessenkool <segher@kernel.crashing.org>,
Michael Matz <matz@suse.de>
Subject: Re: [RFC] avoid type conversion through versioning loop
Date: Thu, 25 Mar 2021 21:32:08 +0800 [thread overview]
Message-ID: <d43de5f4ab6db80e2c7d3bfbca8da72d@imap.linux.ibm.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2103250917590.17979@zhemvz.fhfr.qr>
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition. That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any. Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size. We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve
>> >> >> >> performance in some cases.
>> >> >> >> Meanwhile, one of the direct goals of using the '64-bit type' is to
>> >> >> >> eliminate conversions, even for some cases which are not easy to
>> >> >> >> optimize through ifconv/vectorization, for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i]>1e-3)
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain. What's the particular optimization
>> >> >>
>> >> >> Right. The code-size increase is a big pain for large loops. If the
>> >> >> gain is not significant, this optimization may not be a win.
>> >> >>
>> >> >> > in the above cases? Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i]>1e-3)
>> >> >> > i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the upper
>> >> >> bound is checkable. Otherwise, the optimization to avoid the zext is
>> >> >> not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate, which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code-size increase is small enough, is a trade-off that needs to be
>> >> >> evaluated. It may only be acceptable if the loop is very small: then
>> >> >> eliminating the 'zext' would help save runtime, and the code-size
>> >> >> increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off. The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agreed, I understand your point. The concern is that some
>> >> micro-architectures do not handle this very well yet. I tested the
>> >> above example code:
>> >> unsigned i = 0;
>> >> while (a[i] > 1e-3 && i < n)
>> >> i++;
>> >> There is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86. It seems those instructions are not
>> >> optimized very well on some platforms. So, I'm wondering whether we
>> >> need to do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> > unsigned i = 0;
>> > while (a[i] > 1e-3 && i < n)
>> > i++;
>> > return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> > lfd %f0,0(%r9)
>> > addi %r3,%r3,1
>> > addi %r9,%r9,8
>> > rldicl %r3,%r3,0,32
>> > fcmpu %cr0,%f0,%f12
>> > bnglr %cr0
>> > bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable, and avoiding this zext
>> also improves performance.
>>
>> :) My test code is a little different: argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>> unsigned i = 0;
>> while (a[i] > 1e-3 && i < n)
>> i++;
>> return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>> 39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>> bgelr 7
>> addi 3,3,1
>> rldic 9,3,3,29
>> rldicl 3,3,0,32
>> cmpd 7,3,4
>> lfdx 0,10,9
>> fcmpu 0,0,12
>> bgt 0,.L14
>>
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>> unsigned i)
>> {
>> while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is
>> sinkable
>> i++;
>> return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
Yes, because 'i' could wrap, the 'zext' cannot be eliminated from the
loop body.
>
>> 99: L99:
>> 69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>> 36: L36:
>> 23: %5:SI=%5:SI+0x1
>> 25: %9:DI=%5:DI<<0x3&0x7fffffff8
>> 24: %5:DI=zero_extend(%5:SI)
>> 29: %0:DF=[%3:DI+%9:DI]
>> 30: %0:CCFP=cmp(%0:DF,%12:DF)
>> 31: pc={(%0:CCFP>0)?L99:pc}
>>
>>
>> .L12:
>> bdz .L2
>> .L5:
>> addi 5,5,1
>> rldic 9,5,3,29
>> rldicl 5,5,0,32
>> lfdx 0,3,9
>> fcmpu 0,0,12
>> bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to avoid the 'zext' in the above cases? My
>> previous idea was versioning :).
>
> I don't think there's another good way. Not sure if I'd call it
> versioning, I suppose for these kind of possibly wrapping IVs it
> would be iteration space splitting? So maybe it could be integrated
> into the loop split pass somehow? Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
> while (a[i] > 1e-3 && i != n)
> i++;
> return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
> i++;
> while (a[i] > 1e-3 && i < n)
> i++;
Or:
if (i < n)
while (a[i] > 1e-3 && i < n) // i steps up to n, does not wrap
i++;
else if (i > n)
while (a[i] > 1e-3 && i > n) // or "while (a[i] > 1e-3 && i <= UINT_MAX)"
i++;
These two new loops may be able to elide the 'zext'.
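To make the transform concrete, here is a compilable sketch of the split
Richard suggested above (the name foo_split is hypothetical, for
illustration only, not a proposed GCC transform). In the second loop the
IV provably does not wrap; in the first loop 'i' still wraps at most once
when crossing UINT_MAX, which is presumably why zero-extends are still
seen on x86:

```c
/* Original form: 'i' may wrap when it reaches UINT_MAX, so the
   compiler must preserve the 32-bit wrapping semantics (the zext).  */
unsigned foo (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i != n)
    i++;
  return i;
}

/* Hand-split version (hypothetical): run the 'i > n' part first
   (it stops when i wraps past UINT_MAX back to 0, or on the fp
   test), then the 'i < n' part.  The behaviour is intended to
   match foo exactly.  */
unsigned foo_split (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i > n)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
```

Whether a compiler can actually drop the zext in both halves is exactly
the open question in this thread; the sketch only shows the intended
iteration-space split.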
>
> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway. There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
Yes, if 'long n' > UINT_MAX, 'i' would wrap, so the 'zext' is kept in
the loop body.
If the compiler can prove that 'i' will not wrap, then the 'zext' may be
moved out of the loop:
if (n <= UINT_MAX)
the loop runs at most n times
else
the loop may be infinite; in other words, 'unsigned i < n' is always true
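For the 'long n' variant, splitting the iteration space at
MIN (n, UINT_MAX) could look like the following sketch (foo_long_split
is a hypothetical name; this illustrates the idea under LP64
assumptions, it is not a proposed GCC transform):

```c
#include <limits.h>

/* Original form: 'i < n' is compared in 'long', so the loop needs
   zext (i) on every iteration.  */
unsigned foo_long (double *a, long n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}

/* Hypothetical split: when n fits in unsigned, compare against an
   unsigned bound so 'i' provably does not wrap; when n > UINT_MAX,
   'i < n' always holds and the loop degenerates to the fp test
   alone ('i' may then wrap, matching the original semantics).  */
unsigned foo_long_split (double *a, long n)
{
  unsigned i = 0;
  if (n <= (long) UINT_MAX)
    {
      unsigned un = n < 0 ? 0u : (unsigned) n;
      while (a[i] > 1e-3 && i < un)  /* runs at most un times */
        i++;
      return i;
    }
  while (a[i] > 1e-3)
    i++;
  return i;
}
```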
Thanks for your always helpful suggestions!
BR.
Jiufu Guo.
>
> Not sure if this makes sense..
>
> Richard.
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhancing SCEV could help in a few cases, especially when
>> >> other optimizations are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>>
>>