From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH] New hook adjust_iv_update_pos
To: Richard Biener
Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, davidxl@google.com, David Edelsohn, "H. J. Lu"
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
From: Xionghu Luo
Message-ID: <91c8d741-6b50-1bc1-5823-5ab2bc5e6922@linux.ibm.com>
Date: Fri, 25 Jun 2021 20:58:10 +0800
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
List-Id: Gcc-patches mailing list
+cc Xinliang David Li since he introduced the function adjust_iv_update_pos.
Looking forward to your input. :)

On 2021/6/25 18:02, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
>>
>> On 2021/6/25 16:54, Richard Biener wrote:
>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches wrote:
>>>>
>>>> From: Xiong Hu Luo
>>>>
>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>> on Power.  For example, it generates a mismatched address offset after
>>>> adjusting the IV update statement position:
>>>>
>>>>    [local count: 70988443]:
>>>>    _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>    ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>    _34 = ref_180 + 18446744073709551615;
>>>>    _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>    if (_84 == _86)
>>>>      goto ; [94.50%]
>>>>    else
>>>>      goto ; [5.50%]
>>>>
>>>> Disabling it produces:
>>>>
>>>>    [local count: 70988443]:
>>>>    _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>    _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>>    ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>    if (_84 == _86)
>>>>      goto ; [94.50%]
>>>>    else
>>>>      goto ; [5.50%]
>>>>
>>>> The later loop unroll pass can then benefit from the same address offset
>>>> with different base addresses, which reduces register dependency.
>>>> This patch improves performance by 10% for a typical case on Power;
>>>> no performance change was observed on X86 or AArch64, since these small
>>>> loops are not unrolled on those platforms.  Any comments?
>>>
>>> The case you quote is special in that if we hoisted the IV update before
>>> the other MEM _also_ used in the condition it would be fine again.
>>
>> Thanks.
>> I tried to hoist the IV update statement before the first MEM (Fix 2); it
>> shows even worse performance because the loop is no longer unrolled (two
>> more "base-1" statements are generated in gimple, so loop->ninsns is 11
>> and the small loop is not unrolled).  Changing the threshold from 10 to 12
>> in rs6000_loop_unroll_adjust makes it unroll 2 times as well, and the
>> performance is the SAME as with the IV update statement in the *MIDDLE*
>> (trunk).  From the ASM we can see the index register %r4 is used in two
>> iterations, which may be a bottleneck for hiding instruction latency.
>>
>> So it seems reasonable that performance is better if the IV update
>> statement is kept *LAST* (Fix 1).
>>
>> (Fix 2):
>>    [local count: 70988443]:
>>    ivtmp.30_415 = ivtmp.30_414 + 1;
>>    _34 = ip_229 + 18446744073709551615;
>>    _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>    _33 = ref_180 + 18446744073709551615;
>>    _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>>    if (_84 == _86)
>>      goto ; [94.50%]
>>    else
>>      goto ; [5.50%]
>>
>> .L67:
>>         lbzx %r12,%r24,%r4
>>         lbzx %r25,%r7,%r4
>>         cmpw %cr0,%r12,%r25
>>         bne %cr0,.L11
>>         mr %r26,%r4
>>         addi %r4,%r4,1
>>         lbzx %r12,%r24,%r4
>>         lbzx %r25,%r7,%r4
>>         mr %r6,%r26
>>         cmpw %cr0,%r12,%r25
>>         bne %cr0,.L11
>>         mr %r26,%r4
>> .L12:
>>         cmpdi %cr0,%r10,1
>>         addi %r4,%r26,1
>>         mr %r6,%r26
>>         addi %r10,%r10,-1
>>         bne %cr0,.L67
>>
>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>> condition actually uses the IV use stmt def, so it likely applies to
>>> too many cases.
>>>
>>> Unfortunately the introducing rev didn't come with a testcase,
>>> but still I think fixing up adjust_iv_update_pos is better than
>>> introducing a way to short-cut it per target decision.
>>>
>>> One "fix" might be to add a check that either the condition
>>> lhs or rhs is the def of the IV use and the other operand
>>> is invariant.  Or if it's of similar structure, hoist across the
>>> other iv-use as well.  Not that I understand the argument
>>> about the overlapping life-range.
>>>
>>> You also don't provide a complete testcase ...
>>>
>>
>> Attached is the test code; I will also add it to the patch in a future
>> version.  The issue comes from a very small hot loop:
>>
>>     do {
>>       len++;
>>     } while (len < maxlen && ip[len] == ref[len]);
>
> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> {
>   unsigned int len = 2;
>   do {
>     len++;
>   } while (len < maxlen && ip[len] == ref[len]);
>   return len;
> }
>
> I can see the effect on this loop on x86_64 as well; we end up with
>
>   .L6:
>           movzbl (%rdi,%rax), %ecx
>           addq $1, %rax
>           cmpb -1(%rsi,%rax), %cl
>           jne .L1
>   .L3:
>           movl %eax, %r8d
>           cmpl %edx, %eax
>           jb .L6
>
> but without the trick it is
>
>   .L6:
>           movzbl (%rdi,%rax), %r8d
>           movzbl (%rsi,%rax), %ecx
>           addq $1, %rax
>           cmpb %cl, %r8b
>           jne .L1
>   .L3:
>           movl %eax, %r9d
>           cmpl %edx, %eax
>           jb .L6
>
> so here you can see the missed fusion.  Of course
> in this case the IV update could have been sunk into
> the .L3 block and replicated on the exit edge as well.
>
> I'm not sure if the motivation for the change introducing this
> trick was the above kind of combination or not, but I guess
> so.  The dependence distance of the IV increment to the
> use is now shorter, so I'm not sure the combined variant is
> better.
>
> Richard.
>
>>
>> --
>> Thanks,
>> Xionghu

--
Thanks,
Xionghu