Date: Fri, 26 Mar 2021 16:19:38 +0800
From: guojiufu
To: Richard Biener
Cc: Richard Biener, Jakub Jelinek, GCC Development, Segher Boessenkool, Michael Matz
Subject: Re: [RFC] avoid type conversion through versioning loop
References: <6cee0cd1c8c6023eac77d32b531e4f4b@imap.linux.ibm.com>
 <20210322083146.GB231854@tucnak>
 <1d8a918d84f4fcd5da265f1ac4e81ae1@imap.linux.ibm.com>
 <63d960f6b52344e8bef084c86a56a70e@imap.linux.ibm.com>
 <16536725ffe09efed90e662992d9de9c@imap.linux.ibm.com>
X-Sender: guojiufu@linux.ibm.com
List-Id: Gcc mailing list
On 2021-03-25 16:35, Richard Biener wrote:
> On Wed, 24 Mar 2021, guojiufu wrote:
>
>> On 2021-03-24 20:33, Richard Biener wrote:
>> > On Wed, 24 Mar 2021, guojiufu wrote:
>> >
>> >> On 2021-03-24 15:55, Richard Biener wrote:
>> >> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu wrote:
>> >> >>
>> >> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
>> >> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu wrote:
>> >> >> >>
>> >> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
>> >> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc wrote:
>> >> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
>> >> >> >> >> dependence analysis) to track extra conditions they need to handle
>> >> >> >> >> cases similar to how niter analysis computes its 'assumptions'
>> >> >> >> >> condition.  That allows the versioning to be done when there's an
>> >> >> >> >> actual beneficial transform (like vectorization) rather than just
>> >> >> >> >> upfront for the eventual chance that there'll be any.  Ideally such
>> >> >> >> >> a transform would then choose IVs in its transformed copy that
>> >> >> >> >> are analyzable w/o repeating such a versioning exercise for the
>> >> >> >> >> next transform.
>> >> >> >> >
>> >> >> >> > And it might be beneficial to perform some type promotion/demotion
>> >> >> >> > pass, either early during vectorization or separately before
>> >> >> >> > vectorization, on a loop copy guarded with the ifns e.g. ifconv
>> >> >> >> > uses too.  Find out what type sizes the loop uses, first try to
>> >> >> >> > demote computations to narrower types in the vectorized loop
>> >> >> >> > candidate (e.g. if something is computed in a wider type only to
>> >> >> >> > have the result demoted to a narrower type), then pick the widest
>> >> >> >> > type size still in use in the loop (ok, this assumes we don't mix
>> >> >> >> > multiple vector sizes in the loop, but currently our vectorizer
>> >> >> >> > doesn't do that) and try to promote computations that could be
>> >> >> >> > promoted to that type size.  We do partially something like that
>> >> >> >> > during vect patterns for bool types, but not for other types I think.
>> >> >> >> >
>> >> >> >> > Jakub
>> >> >> >>
>> >> >> >> Thanks for the suggestions!
>> >> >> >>
>> >> >> >> Enhancing SCEV could help other optimizations and improve performance
>> >> >> >> in some cases.  Meanwhile, one of the direct ideas of using a 64-bit
>> >> >> >> type is to eliminate conversions, even for some cases which are not
>> >> >> >> easy to optimize through ifconv/vectorization, for example:
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (a[i] > 1e-3)
>> >> >> >>   i++;
>> >> >> >>
>> >> >> >> unsigned int i = 0;
>> >> >> >> while (p1[i] == p2[i] && p1[i] != '\0')
>> >> >> >>   i++;
>> >> >> >>
>> >> >> >> Or should we only do versioning on the type for this kind of loop?
>> >> >> >> Any suggestions?
>> >> >> >
>> >> >> > But the "optimization" resulting from such versioning is hard to
>> >> >> > determine upfront, which means we'll pay quite a big code-size cost
>> >> >> > for unknown, questionable gain.  What's the particular optimization
>> >> >>
>> >> >> Right.  The code-size increase is a big pain on large loops.  If the
>> >> >> gain is not significant, this optimization may not be a win.
>> >> >>
>> >> >> > in the above cases?
>> >> >> > Note that for example for
>> >> >> >
>> >> >> > unsigned int i = 0;
>> >> >> > while (a[i] > 1e-3)
>> >> >> >   i++;
>> >> >> >
>> >> >> > you know that when 'i' wraps then the loop will not terminate.  There's
>> >> >>
>> >> >> Thanks :) The code would be "while (a[i] > 1e-3 && i < n)", so the
>> >> >> upper bound is checkable.  Otherwise, the optimization to avoid the
>> >> >> zext is not applicable.
>> >> >>
>> >> >> > the address computation that is i * sizeof (T), which is done in a
>> >> >> > larger type to avoid overflow, so we have &a + zext (i) * 8 - is that
>> >> >> > the operation that is 'slow' for you?
>> >> >>
>> >> >> This is the point: "zext(i)" is the instruction that I want to
>> >> >> eliminate, which is the direct goal of the optimization.
>> >> >>
>> >> >> Whether the gain of eliminating the 'zext' is visible, and whether the
>> >> >> code-size increase is small enough, is a trade-off.  It may only be
>> >> >> acceptable if the loop is very small; then eliminating the 'zext'
>> >> >> would help to save runtime, and the code-size increase may not be big.
>> >> >
>> >> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
>> >> > make versioning a good trade-off.  The micro-architecture should better
>> >> > not make that totally slow (I'd expect an extra latency comparable to
>> >> > the multiply or add on the &a + zext(i) * 8 instruction chain).
>> >>
>> >> Agreed, I understand your point.  The concern is that some
>> >> micro-architectures do not handle this very well yet.  I tested the
>> >> above example code:
>> >>
>> >>   unsigned i = 0;
>> >>   while (a[i] > 1e-3 && i < n)
>> >>     i++;
>> >>
>> >> There is a ~30% performance improvement when using "long i" instead of
>> >> "unsigned i" on ppc64le and x86.  It seems those instructions are not
>> >> optimized well on some platforms.  So, I'm wondering whether we need to
>> >> do this in GCC.
>> >
>> > On x86 I see indexed addressing modes being used, which should be fine.
>> > Compilable testcase:
>> >
>> > unsigned foo (double *a, unsigned n)
>> > {
>> >   unsigned i = 0;
>> >   while (a[i] > 1e-3 && i < n)
>> >     i++;
>> >   return i;
>> > }
>> >
>> > ppc64le seems to do some odd unrolling/peeling or whatnot, I have a hard
>> > time following its assembly ... ah, -fno-unroll-loops "helps" and produces
>> >
>> > .L5:
>> >         lfd %f0,0(%r9)
>> >         addi %r3,%r3,1
>> >         addi %r9,%r9,8
>> >         rldicl %r3,%r3,0,32
>> >         fcmpu %cr0,%f0,%f12
>> >         bnglr %cr0
>> >         bdnz .L5
>> >
>> > which looks pretty good to me; I suppose the rldicl is the
>> > zero-extension, but the IVs are already 64-bit and the
>> > zero-extension should be sunk to the loop exit instead.
>>
>> Thanks so much for your quick reply!
>>
>> Yes, "rldicl %r3,%r3,0,32" is sinkable, and avoiding this zext
>> also improves performance.
>>
>> :) My test code is a little different: argument n's type is long (sorry).
>>
>> unsigned __attribute__ ((noinline)) foo (double *a, long n)
>> {
>>   unsigned i = 0;
>>   while (a[i] > 1e-3 && i < n)
>>     i++;
>>   return i;
>> }
>
> You can't elide the zero-extend here since n can be bigger
> than UINT_MAX and thus i has to wrap.
>
>> 111: L111:
>>  39: pc={(%7:CC>=0)?simple_return:pc}
>> 102: %3:SI=%3:SI+0x1
>> 103: %9:DI=%3:DI<<0x3&0x7fffffff8
>> 104: %3:DI=zero_extend(%3:SI)
>> 105: %7:CC=cmp(%3:DI,%4:DI)
>> 106: %0:DF=[%10:DI+%9:DI]
>> 107: %0:CCFP=cmp(%0:DF,%12:DF)
>> 109: pc={(%0:CCFP>0)?L111:pc}
>>
>> .L14:
>>         bgelr 7
>>         addi 3,3,1
>>         rldic 9,3,3,29
>>         rldicl 3,3,0,32
>>         cmpd 7,3,4
>>         lfdx 0,10,9
>>         fcmpu 0,0,12
>>         bgt 0,.L14
>>
>> -------------- or for this code:
>>
>> unsigned __attribute__ ((noinline)) foo (double *a, unsigned n,
>>                                          unsigned i)
>> {
>>   while (a[i] > 1e-3 && i != n)  // if using "&& i < n" the zext is sinkable
>>     i++;
>>   return i;
>> }
>
> With i != n we cannot compute the number of iterations because i can
> wrap, thus we cannot elide the zero-extension.
>
>> 99: L99:
>>  69: {pc={(ctr:DI==0x1)?L39:pc};ctr:DI=ctr:DI-0x1;clobber
>> scratch;clobber scratch;}
>>  36: L36:
>>  23: %5:SI=%5:SI+0x1
>>  25: %9:DI=%5:DI<<0x3&0x7fffffff8
>>  24: %5:DI=zero_extend(%5:SI)
>>  29: %0:DF=[%3:DI+%9:DI]
>>  30: %0:CCFP=cmp(%0:DF,%12:DF)
>>  31: pc={(%0:CCFP>0)?L99:pc}
>>
>> .L12:
>>         bdz .L2
>> .L5:
>>         addi 5,5,1
>>         rldic 9,5,3,29
>>         rldicl 5,5,0,32
>>         lfdx 0,3,9
>>         fcmpu 0,0,12
>>         bgt 0,.L12
>> .L2:
>>
>> Any suggestions about how to relax the 'zext' in the above cases?  My
>> previous idea was versioning :).
>
> I don't think there's another good way.  Not sure if I'd call it
> versioning; I suppose for these kinds of possibly wrapping IVs it
> would be iteration-space splitting?  So maybe it could be integrated
> into the loop split pass somehow?  Let's look at
>
> unsigned foo (double *a, unsigned n, unsigned i)
> {
>   while (a[i] > 1e-3 && i != n)
>     i++;
>   return i;
> }
>
> we'd split it as
>
> while (a[i] > 1e-3 && i > n)
>   i++;
> while (a[i] > 1e-3 && i < n)
>   i++;
>

And then, we can continue to transform it to:

  while (a[i] > 1e-3 && i > n && i <= UINT_MAX)
    i++;
  while (a[i] > 1e-3 && i < n)
    i++;

===>

  if (i > n)
    while (a[i] > 1e-3 && i <= UINT_MAX)
      i++;
  while (a[i] > 1e-3 && i < n)
    i++;

===> (transform to the below; this may be similar to sinking the zext
out of the loops)

  long il = i;
  long nl = n;
  if (il > nl)
    {
      while (a[il] > 1e-3 && il <= UINT_MAX)
        il++;
      il = il > UINT_MAX ? 0 : il;  // il = zext (il#0)
    }
  while (a[il] > 1e-3 && il < nl)
    il++;
  i = il;

Thanks a lot for your great help!

BR.
Jiufu Guo.

> but I still see zero-extends here on x86.
>
> For the case with 'long n' the issue is that the loop might not even
> terminate and thus i wraps anyway.  There one could split the
> iteration space at MIN (n, UINT_MAX), using an unsigned upper bound.
>
> Not sure if this makes sense..
>
> Richard.
>
>> >
>> >> >
>> >> > OTOH making SCEV analysis not give up, but instead record the
>> >> > constraints under which its solution is valid, is a very good and
>> >> > useful thing to do.
>> >>
>> >> Thanks!  Enhancing SCEV could help a few cases, especially when other
>> >> optimizations are enabled.
>> >>
>> >> Thanks again for your suggestions!
>> >>
>> >> BR.
>> >> Jiufu Guo.
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks again for your very helpful comments!
>> >> >>
>> >> >> BR.
>> >> >> Jiufu Guo.
>> >> >>
>> >> >> >
>> >> >> > Richard.
>> >> >> >
>> >> >> >> BR.
>> >> >> >> Jiufu Guo.