Date: Fri, 26 Mar 2021 16:19:38 +0800
From: guojiufu
To: Richard Biener
Cc: Richard Biener, Jakub Jelinek, GCC Development, Segher Boessenkool, Michael Matz
Subject: Re: [RFC] avoid type conversion through versioning loop
References: <6cee0cd1c8c6023eac77d32b531e4f4b@imap.linux.ibm.com>
 <20210322083146.GB231854@tucnak>
 <1d8a918d84f4fcd5da265f1ac4e81ae1@imap.linux.ibm.com>
 <63d960f6b52344e8bef084c86a56a70e@imap.linux.ibm.com>
 <16536725ffe09efed90e662992d9de9c@imap.linux.ibm.com>
X-Sender: guojiufu@linux.ibm.com
List-Id: Gcc mailing list
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally such
>> >> >> >> >> a transform would then choose IVs in its transformed copy that
>> >> >> >> >> are analyzable w/o repeating such a versioning exercise for the
>> >> >> >> >> next transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization, on a loop copy guarded with the ifns e.g. ifconv
>> >> >> >> > uses too.  Find out what type sizes the loop uses, first try to
>> >> >> >> > demote computations to narrower types in the vectorized loop
>> >> >> >> > candidate (e.g. if something is computed in a wider type only to
>> >> >> >> > have the result demoted to a narrower type), then pick the widest
>> >> >> >> > type size still in use in the loop (ok, this assumes we don't mix
>> >> >> >> > multiple vector sizes in the loop, but currently our vectorizer
>> >> >> >> > doesn't do that) and try to promote computations that could be
>> >> >> >> > promoted to that type size.  We do partially something like that
>> >> >> >> > during vect patterns for bool types, but not for other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in some cases.  Meanwhile, one of the direct ideas of using a 64-bit
>> >> >> >> type is to eliminate conversions, even for some cases which are not
>> >> >> >> easy to optimize through ifconv/vectorization, for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i] > 1e-3)
>> >> >> >>   i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >>   i++;
>> >> >> >>
>> >> >> >> Or should we only do versioning on the type for this kind of loop?
>> >> >> >> Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront, which means we'll pay quite a big code-size cost
>> >> >> > for unknown, questionable gain.  What's the particular optimization
>> >> >>
>> >> >> Right.  The code-size increase is a big pain on large loops.  If the
>> >> >> gain is not significant, this optimization may not be a win.
>> >> >>
>> >> >> > in the above cases?
>> >> >> > Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i] > 1e-3)
>> >> >> >   i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.  There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i] > 1e-3 && i < n)", so the
>> >> >> upper bound is checkable.  Otherwise, the optimization to avoid the
>> >> >> zext is not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T), which is done in a
>> >> >> > larger type to avoid overflow, so we have &a + zext (i) * 8 - is that
>> >> >> > the operation that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate, which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code-size increase is small enough, is a trade-off.  It may only be
>> >> >> acceptable if the loop is very small; then eliminating the 'zext'
>> >> >> would help to save runtime, and the code-size increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off.  The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agreed, I understand your point.  The concern is that some
>> >> micro-architectures do not handle this very well yet.  I tested the
>> >> above example code:
>> >>
>> >>   unsigned i = 0;
>> >>   while (a[i] > 1e-3 && i < n)
>> >>     i++;
>> >>
>> >> There is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86.  It seems those instructions are not
>> >> optimized well on some platforms.  So, I'm wondering whether we need to
>> >> do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used, which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> >   unsigned i = 0;
>> >   while (a[i] > 1e-3 && i < n)
>> >     i++;
>> >   return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and produces
>> >
>> > .L5:
>> >         lfd %f0,0(%r9)
>> >         addi %r3,%r3,1
>> >         addi %r9,%r9,8
>> >         rldicl %r3,%r3,0,32
>> >         fcmpu %cr0,%f0,%f12
>> >         bnglr %cr0
>> >         bdnz .L5
>> >
>> > which looks pretty good to me; I suppose the rldicl is the
>> > zero-extension, but the IVs are already 64-bit and the
>> > zero-extension should be sunk to the loop exit instead.
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable, and avoiding this zext
>> also improves performance.
>>
>> :) My test code is a little different: argument n's type is long (sorry).
>>
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>>   return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>>  39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>>         bgelr 7
>>         addi 3,3,1
>>         rldic 9,3,3,29
>>         rldicl 3,3,0,32
>>         cmpd 7,3,4
>>         lfdx 0,10,9
>>         fcmpu 0,0,12
>>         bgt 0,.L14
>>
>> -------------- or for this code:
>>
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>>                                          unsigned i)
>> {
>>   while (a[i] > 1e-3 && i != n)  // if using "&& i < n" the zext is sinkable
>>     i++;
>>   return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
>
>> 99: L99:
>>  69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>>  36: L36:
>>  23: %5:SI=%5:SI+0x1
>>  25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>  24: %5:DI=zero_extend(%5:SI)
>>  29: %0:DF=[%3:DI+%9:DI]
>>  30: %0:CCFP=cmp(%0:DF,%12:DF)
>>  31: pc={(%0:CCFP>0)?L99:pc}
>>
>> .L12:
>>         bdz .L2
>> .L5:
>>         addi 5,5,1
>>         rldic 9,5,3,29
>>         rldicl 5,5,0,32
>>         lfdx 0,3,9
>>         fcmpu 0,0,12
>>         bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to relax the 'zext' in the above cases?  My
>> previous idea was versioning :).
>
> I don't think there's another good way.  Not sure if I'd call it
> versioning; I suppose for these kinds of possibly wrapping IVs it
> would be iteration-space splitting?  So maybe it could be integrated
> into the loop split pass somehow?  Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
>   while (a[i] > 1e-3 && i != n)
>     i++;
>   return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
>   i++;
> while (a[i] > 1e-3 && i < n)
>   i++;
>

And then, we can continue to transform it to:

  while (a[i] > 1e-3 && i > n && i <= UINT_MAX)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;

===>

  if (i > n)
    while (a[i] > 1e-3 && i <= UINT_MAX)
      i++;
  while (a[i] > 1e-3 && i < n)
    i++;

===> (transform to the below; this may be similar to sinking the zext
out of the loops)

  long il = i;
  long nl = n;
  if (il > nl)
    {
      while (a[il] > 1e-3 && il <= UINT_MAX)
        il++;
      il = il > UINT_MAX ? 0 : il;  // il = zext (il#0)
    }
  while (a[il] > 1e-3 && il < nl)
    il++;
  i = il;

Thanks a lot for your great help!

BR.
Jiufu Guo.

> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway.  There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
>
> Not sure if this makes sense..
>
> Richard.
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up, but instead record the
>> >> > constraints under which its solution is valid, is a very good and
>> >> > useful thing to do.
>> >>
>> >> Thanks!  Enhancing SCEV could help a few cases, especially when other
>> >> optimizations are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.