Date: Thu, 25 Mar 2021 09:35:05 +0100 (CET)
From: Richard Biener <rguenther@suse.de>
To: guojiufu
cc: Richard Biener, Jakub Jelinek, GCC Development, Segher Boessenkool,
    Michael Matz
Subject: Re: [RFC] avoid type conversion through versioning loop

On Wed, 24 Mar 2021, guojiufu wrote:

> On 2021-03-24 20:33, Richard Biener wrote:
> > On Wed, 24 Mar 2021, guojiufu wrote:
> >
> >> On 2021-03-24 15:55, Richard Biener wrote:
> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu wrote:
> >> >>
> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu wrote:
> >> >> >>
> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
> >> >> >> > wrote:
> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus
> >> >> >> >> also dependence analysis) to track extra conditions they need to
> >> >> >> >> handle cases, similar to how niter analysis computes its
> >> >> >> >> 'assumptions' condition.  That allows the versioning to be done
> >> >> >> >> when there's an actual beneficial transform (like vectorization)
> >> >> >> >> rather than just upfront for the eventual chance that there'll
> >> >> >> >> be any.  Ideally such a transform would then choose IVs in its
> >> >> >> >> transformed copy that are analyzable w/o repeating such a
> >> >> >> >> versioning exercise for the next transform.
> >> >> >> >
> >> >> >> > And it might be beneficial to perform some type promotion/demotion
> >> >> >> > pass, either early during vectorization or separately before
> >> >> >> > vectorization, on a loop copy guarded with the ifns e.g. ifconv
> >> >> >> > uses too.
> >> >> >> > Find out what type sizes the loop uses, first try to demote
> >> >> >> > computations to narrower types in the vectorized loop candidate
> >> >> >> > (e.g. if something is computed in a wider type only to have the
> >> >> >> > result demoted to a narrower type), then pick the widest type size
> >> >> >> > still in use in the loop (ok, this assumes we don't mix multiple
> >> >> >> > vector sizes in the loop, but currently our vectorizer doesn't do
> >> >> >> > that) and try to promote computations that could be promoted to
> >> >> >> > that type size.  We partially do something like that during vect
> >> >> >> > patterns for bool types, but not for other types, I think.
> >> >> >> >
> >> >> >> > Jakub
> >> >> >>
> >> >> >> Thanks for the suggestions!
> >> >> >>
> >> >> >> Enhancing SCEV could help other optimizations and improve
> >> >> >> performance in some cases.
> >> >> >> One of the direct ideas behind using the '64bit type' is to
> >> >> >> eliminate conversions, even in cases that are not easy to optimize
> >> >> >> through ifconv/vectorization, for example:
> >> >> >>
> >> >> >>   unsigned int i = 0;
> >> >> >>   while (a[i] > 1e-3)
> >> >> >>     i++;
> >> >> >>
> >> >> >>   unsigned int i = 0;
> >> >> >>   while (p1[i] == p2[i] && p1[i] != '\0')
> >> >> >>     i++;
> >> >> >>
> >> >> >> Or should we only do versioning on the type for this kind of loop?
> >> >> >> Any suggestions?
> >> >> >
> >> >> > But the "optimization" resulting from such versioning is hard to
> >> >> > determine upfront, which means we'll pay quite a big code size cost
> >> >> > for unknown, questionable gain.  What's the particular optimization
> >> >>
> >> >> Right.  The code size increase is a big pain for large loops.  If the
> >> >> gain is not significant, this optimization may not be a win.
> >> >>
> >> >> > in the above cases?  Note that for example for
> >> >> >
> >> >> >   unsigned int i = 0;
> >> >> >   while (a[i] > 1e-3)
> >> >> >     i++;
> >> >> >
> >> >> > you know that when 'i' wraps then the loop will not terminate.
> >> >> > There's
> >> >>
> >> >> Thanks :)  The code would be "while (a[i] > 1e-3 && i < n)", so the
> >> >> upper bound is checkable.  Otherwise, the optimization to avoid the
> >> >> zext is not applicable.
> >> >>
> >> >> > the address computation that is i * sizeof (T) which is done in a
> >> >> > larger type to avoid overflow, so we have &a + zext (i) * 8 - is
> >> >> > that the operation that is 'slow' for you?
> >> >>
> >> >> This is the point: "zext(i)" is the instruction that I want to
> >> >> eliminate, which is the direct goal of the optimization.
> >> >>
> >> >> Whether the gain from eliminating the 'zext' is visible, and whether
> >> >> the code size increase is small enough, is a trade-off.  It may only
> >> >> be acceptable if the loop is very small; then eliminating the 'zext'
> >> >> helps save runtime, and the code size increase may not be big.
> >> >
> >> > OK, so I indeed think that the desire to micro-optimize a 'zext'
> >> > doesn't make versioning a good trade-off.  The micro-architecture had
> >> > better not make that totally slow (I'd expect an extra latency
> >> > comparable to the multiply or add on the &a + zext(i) * 8 instruction
> >> > chain).
> >>
> >> Agreed, I understand your point.  The concern is that some
> >> micro-architectures do not handle this very well yet.
> >> I tested the above example code:
> >>
> >>   unsigned i = 0;
> >>   while (a[i] > 1e-3 && i < n)
> >>     i++;
> >>
> >> There is a ~30% performance improvement when using "long i" instead of
> >> "unsigned i" on ppc64le and x86.  It seems those instructions are not
> >> optimized well on some platforms, so I'm wondering whether we need to do
> >> this in GCC.
> >
> > On x86 I see indexed addressing modes being used which should be fine.
> > Compilable testcase:
> >
> > unsigned foo (double *a, unsigned n)
> > {
> >   unsigned i = 0;
> >   while (a[i] > 1e-3 && i < n)
> >     i++;
> >   return i;
> > }
> >
> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
> > time following its assembly ... ah, -fno-unroll-loops "helps" and produces
> >
> > .L5:
> >         lfd %f0,0(%r9)
> >         addi %r3,%r3,1
> >         addi %r9,%r9,8
> >         rldicl %r3,%r3,0,32
> >         fcmpu %cr0,%f0,%f12
> >         bnglr %cr0
> >         bdnz .L5
> >
> > which looks pretty good to me, I suppose the rldicl is the zero-extension
> > but the IVs are already 64bit and the zero-extension should be sunk to
> > the loop exit instead.
>
> Thanks so much for your quick reply!
>
> Yes, "rldicl %r3,%r3,0,32" is sinkable and avoids this zext, which would
> also improve performance.
>
> :) My test code is a little different: argument n's type is long (sorry).
>
> unsigned __attribute__ ((noinline)) foo (double *a, long n)
> {
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>   return i;
> }

You can't elide the zero-extend here since n can be bigger than UINT_MAX
and thus i has to wrap.

> 111: L111:
>  39: pc={(%7:CC>=0)?simple_return:pc}
> 102: %3:SI=%3:SI+0x1
> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
> 104: %3:DI=zero_extend(%3:SI)
> 105: %7:CC=cmp(%3:DI,%4:DI)
> 106: %0:DF=[%10:DI+%9:DI]
> 107: %0:CCFP=cmp(%0:DF,%12:DF)
> 109: pc={(%0:CCFP>0)?L111:pc}
>
> .L14:
>         bgelr 7
>         addi 3,3,1
>         rldic 9,3,3,29
>         rldicl 3,3,0,32
>         cmpd 7,3,4
>         lfdx 0,10,9
>         fcmpu 0,0,12
>         bgt 0,.L14
>
> -------------- Or for this code:
>
> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n, unsigned i)
> {
>   while (a[i] > 1e-3 && i != n) /* with "&& i < n" the zext is sinkable */
>     i++;
>   return i;
> }

With i != n we cannot compute the number of iterations because i can wrap,
thus we cannot elide the zero-extension.

>  99: L99:
>  69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber scratch;clobber scratch;}
>  36: L36:
>  23: %5:SI=%5:SI+0x1
>  25: %9:DI=%5:DI<<0x3&0x7fffffff8
>  24: %5:DI=zero_extend(%5:SI)
>  29: %0:DF=[%3:DI+%9:DI]
>  30: %0:CCFP=cmp(%0:DF,%12:DF)
>  31: pc={(%0:CCFP>0)?L99:pc}
>
> .L12:
>         bdz .L2
> .L5:
>         addi 5,5,1
>         rldic 9,5,3,29
>         rldicl 5,5,0,32
>         lfdx 0,3,9
>         fcmpu 0,0,12
>         bgt 0,.L12
> .L2:
>
> Any suggestions about how to get rid of the 'zext' in the above cases?
> My previous idea was versioning :).

I don't think there's another good way.

Not sure I'd call it versioning; I suppose for this kind of possibly
wrapping IV it would be iteration space splitting?  So maybe it could be
integrated into the loop split pass somehow?  Let's look at

unsigned foo (double *a, unsigned n, unsigned i)
{
  while (a[i] > 1e-3 && i != n)
    i++;
  return i;
}

we'd split it as

  while (a[i] > 1e-3 && i > n)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;

but I still see zero-extends here on x86.  For the case with 'long n' the
issue is that the loop might not even terminate and thus i wraps anyway.
There one could split the iteration space at MIN (n, UINT_MAX), using an
unsigned upper bound.  Not sure if this makes sense..
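In source form such a split for the 'long n' variant might look like the
following (a rough hand-written sketch of the idea only, assuming an LP64
target where long is 64 bits; foo_split is a made-up name, not the output
of any existing pass):

#include <limits.h>

/* Hypothetical split of the 'long n' loop at MIN (n, UINT_MAX): the first
   loop uses a 64-bit IV that provably does not wrap, so the address
   computation needs no zero-extension; the second loop keeps the original
   wrapping unsigned IV for the rare i >= UINT_MAX remainder.  */
unsigned foo_split (double *a, long n)
{
  long bound = n < (long) UINT_MAX ? n : (long) UINT_MAX;
  long i = 0;
  while (a[i] > 1e-3 && i < bound)      /* fast path: no zext needed */
    i++;
  unsigned iu = (unsigned) i;
  while (a[iu] > 1e-3 && (long) iu < n) /* original wrapping semantics */
    iu++;
  return iu;
}

For n <= UINT_MAX the second loop exits immediately, so only the first,
zext-free loop runs; for n > UINT_MAX the second loop reproduces the
original wrap-around behaviour.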
Richard.

> >
> >> > OTOH making SCEV analysis not give up but instead record the
> >> > constraints under which its solution is valid is a very good and
> >> > useful thing to do.
> >>
> >> Thanks!  Enhancing SCEV could help a few cases, especially when other
> >> optimizations are enabled.
> >>
> >> Thanks again for your suggestions!
> >>
> >> BR.
> >> Jiufu Guo.
> >>
> >> >
> >> > Richard.
> >> >
> >> >> Thanks again for your very helpful comments!
> >> >>
> >> >> BR.
> >> >> Jiufu Guo.
> >> >>
> >> >> >
> >> >> > Richard.
> >> >> >
> >> >> >> BR.
> >> >> >> Jiufu Guo.

--
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)