From: guojiufu <guojiufu@linux.ibm.com>
To: Richard Biener <rguenther@suse.de>
Cc: Richard Biener <richard.guenther@gmail.com>,
Jakub Jelinek <jakub@redhat.com>,
GCC Development <gcc@gcc.gnu.org>,
Segher Boessenkool <segher@kernel.crashing.org>,
Michael Matz <matz@suse.de>
Subject: Re: [RFC] avoid type conversion through versioning loop
Date: Thu, 25 Mar 2021 21:32:08 +0800 [thread overview]
Message-ID: <d43de5f4ab6db80e2c7d3bfbca8da72d@imap.linux.ibm.com> (raw)
In-Reply-To: <nycvar.YFH.7.76.2103250917590.17979@zhemvz.fhfr.qr>
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <guojiufu@linux.ibm.com> wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <guojiufu@imap.linux.ibm.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
>> >> >> >> > wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
>> >> >> >> >> also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition. That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any. Ideally
>> >> >> >> >> such
>> >> >> >> >> transform would then choose IVs in their transformed copy that
>> >> >> >> >> are analyzable w/o repeating such versioning exercise for the next
>> >> >> >> >> transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization
>> >> >> >> > on a loop copy guarded with the ifns e.g. ifconv uses too.
>> >> >> >> > Find out what type sizes the loop uses, first try to demote
>> >> >> >> > computations
>> >> >> >> > to narrower types in the vectorized loop candidate (e.g. if
>> >> >> >> > something
>> >> >> >> > is computed in a wider type only to have the result demoted to
>> >> >> >> > narrower
>> >> >> >> > type), then pick up the widest type size still in use in the loop
>> >> >> >> > (ok,
>> >> >> >> > this assumes we don't mix multiple vector sizes in the loop, but
>> >> >> >> > currently
>> >> >> >> > our vectorizer doesn't do that) and try to promote computations
>> >> >> >> > that
>> >> >> >> > could
>> >> >> >> > be promoted to that type size. We do partially something like that
>> >> >> >> > during
>> >> >> >> > vect patterns for bool types, but not other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve
>> >> >> >> performance in some cases.
>> >> >> >> Meanwhile, one of the direct goals of using the '64-bit type' is to
>> >> >> >> eliminate conversions, even for some cases which are not easy to
>> >> >> >> optimize through ifconv/vectorization, for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i]>1e-3)
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >> i++;
>> >> >> >>
>> >> >> >> Or only do versioning on type for this kind of loop? Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront which means we'll pay quite a big code size cost
>> >> >> > for unknown questionable gain. What's the particular optimization
>> >> >>
>> >> >> Right. The code-size increase is a big pain for large loops. If the
>> >> >> gain is not significant, this optimization may not be a win.
>> >> >>
>> >> >> > in the above cases? Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i]>1e-3)
>> >> >> > i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.
>> >> >> > There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i]>1e-3 && i < n)", so the upper
>> >> >> bound is checkable. Otherwise, the optimization to avoid the zext is
>> >> >> not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T) which is done in a
>> >> >> > larger
>> >> >> > type to avoid overflow so we have &a + zext (i) * 8 - is that the
>> >> >> > operation
>> >> >> > that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate, which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code-size increase is small enough, is a trade-off that needs to be
>> >> >> evaluated. It may only be acceptable if the loop is very small: then
>> >> >> eliminating the 'zext' would help save runtime, and the code-size
>> >> >> increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off. The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agreed, I understand your point. The concern is that some
>> >> micro-architectures do not handle this very well yet. I tested the
>> >> above example code:
>> >> unsigned i = 0;
>> >> while (a[i] > 1e-3 && i < n)
>> >> i++;
>> >> There is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86. It seems those instructions are not
>> >> optimized very well on some platforms. So, I'm wondering whether we
>> >> need to do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> > unsigned i = 0;
>> > while (a[i] > 1e-3 && i < n)
>> > i++;
>> > return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and
>> > produces
>> >
>> > .L5:
>> > lfd %f0,0(%r9)
>> > addi %r3,%r3,1
>> > addi %r9,%r9,8
>> > rldicl %r3,%r3,0,32
>> > fcmpu %cr0,%f0,%f12
>> > bnglr %cr0
>> > bdnz .L5
>> >
>> > which looks pretty good to me, I suppose the rldicl is the
>> > zero-extension but the IVs are already 64bit and the
>> > zero-extension should be sunk to the loop exit instead.
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable, and avoiding this zext
>> also improves performance.
>>
>> :) My test code is a little different: argument n's type is long
>> (sorry).
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>> unsigned i = 0;
>> while (a[i] > 1e-3 && i < n)
>> i++;
>> return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>> 39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>> bgelr 7
>> addi 3,3,1
>> rldic 9,3,3,29
>> rldicl 3,3,0,32
>> cmpd 7,3,4
>> lfdx 0,10,9
>> fcmpu 0,0,12
>> bgt 0,.L14
>>
>> -------------- or for code:
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>> unsigned i)
>> {
>> while (a[i] > 1e-3 && i != n) ///if using "&& i < n" the zext is
>> sinkable
>> i++;
>> return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
Yes, because 'i' could wrap, the 'zext' cannot be eliminated from the
loop body.
>
>> 99: L99:
>> 69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>> 36: L36:
>> 23: %5:SI=%5:SI+0x1
>> 25: %9:DI=%5:DI<<0x3&0x7fffffff8
>> 24: %5:DI=zero_extend(%5:SI)
>> 29: %0:DF=[%3:DI+%9:DI]
>> 30: %0:CCFP=cmp(%0:DF,%12:DF)
>> 31: pc={(%0:CCFP>0)?L99:pc}
>>
>>
>> .L12:
>> bdz .L2
>> .L5:
>> addi 5,5,1
>> rldic 9,5,3,29
>> rldicl 5,5,0,32
>> lfdx 0,3,9
>> fcmpu 0,0,12
>> bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to avoid the 'zext' in the above cases? My
>> previous idea was versioning :).
>
> I don't think there's another good way. Not sure if I'd call it
> versioning, I suppose for these kind of possibly wrapping IVs it
> would be iteration space splitting? So maybe it could be integrated
> into the loop split pass somehow? Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
> while (a[i] > 1e-3 && i != n)
> i++;
> return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
> i++;
> while (a[i] > 1e-3 && i < n)
> i++;
Or:
if (i < n)
while (a[i] > 1e-3 && i < n) // i steps up to n, does not wrap
i++;
else if (i > n)
while (a[i] > 1e-3 && i > n) // or "while (a[i] > 1e-3 && i <= UINT_MAX)"
i++;
These two new loops may be able to elide the 'zext'.
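To make the transform concrete, here is a compilable sketch of the split
Richard suggested above (the name foo_split is hypothetical, for
illustration only, not a proposed GCC transform). In the second loop the
IV provably does not wrap; in the first loop 'i' still wraps at most once
when crossing UINT_MAX, which is presumably why zero-extends are still
seen on x86:

```c
/* Original form: 'i' may wrap when it reaches UINT_MAX, so the
   compiler must preserve the 32-bit wrapping semantics (the zext).  */
unsigned foo (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i != n)
    i++;
  return i;
}

/* Hand-split version (hypothetical): run the 'i > n' part first
   (it stops when i wraps past UINT_MAX back to 0, or on the fp
   test), then the 'i < n' part.  The behaviour is intended to
   match foo exactly.  */
unsigned foo_split (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i > n)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
```

Whether a compiler can actually drop the zext in both halves is exactly
the open question in this thread; the sketch only shows the intended
iteration-space split.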
>
> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway. There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
Yes, if 'long n' > UINT_MAX, 'i' would wrap, so the 'zext' is kept in
the loop body.
If the compiler can prove that 'i' will not wrap, then the 'zext' may be
moved out of the loop:
if (n <= UINT_MAX)
the loop runs at most n times
else
the loop may be infinite; in other words, 'unsigned i < n' is always true
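For the 'long n' variant, splitting the iteration space at
MIN (n, UINT_MAX) could look like the following sketch (foo_long_split
is a hypothetical name; this illustrates the idea under LP64
assumptions, it is not a proposed GCC transform):

```c
#include <limits.h>

/* Original form: 'i < n' is compared in 'long', so the loop needs
   zext (i) on every iteration.  */
unsigned foo_long (double *a, long n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}

/* Hypothetical split: when n fits in unsigned, compare against an
   unsigned bound so 'i' provably does not wrap; when n > UINT_MAX,
   'i < n' always holds and the loop degenerates to the fp test
   alone ('i' may then wrap, matching the original semantics).  */
unsigned foo_long_split (double *a, long n)
{
  unsigned i = 0;
  if (n <= (long) UINT_MAX)
    {
      unsigned un = n < 0 ? 0u : (unsigned) n;
      while (a[i] > 1e-3 && i < un)  /* runs at most un times */
        i++;
      return i;
    }
  while (a[i] > 1e-3)
    i++;
  return i;
}
```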
Thanks for your always helpful suggestions!
BR.
Jiufu Guo.
>
> Not sure if this makes sense..
>
> Richard.
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up but instead record the constraints
>> >> > under which its solution is valid is a very good and useful thing to do.
>> >>
>> >> Thanks! Enhancing SCEV could help in a few cases, especially when
>> >> other optimizations are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.
>> >>
>> >>
>>
>>