From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id 1FD3A3858039; Mon, 28 Jun 2021 08:07:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 1FD3A3858039 Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15S840jW174856; Mon, 28 Jun 2021 04:07:19 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com with ESMTP id 39f77qmrfr-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 28 Jun 2021 04:07:19 -0400 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 15S85Pq0183125; Mon, 28 Jun 2021 04:07:19 -0400 Received: from ppma05fra.de.ibm.com (6c.4a.5195.ip4.static.sl-reverse.com [149.81.74.108]) by mx0b-001b2d01.pphosted.com with ESMTP id 39f77qmrf6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 28 Jun 2021 04:07:18 -0400 Received: from pps.filterd (ppma05fra.de.ibm.com [127.0.0.1]) by ppma05fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 15S83hSc018091; Mon, 28 Jun 2021 08:07:17 GMT Received: from b06avi18626390.portsmouth.uk.ibm.com (b06avi18626390.portsmouth.uk.ibm.com [9.149.26.192]) by ppma05fra.de.ibm.com with ESMTP id 39duv8gben-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 28 Jun 2021 08:07:17 +0000 Received: from d06av26.portsmouth.uk.ibm.com (d06av26.portsmouth.uk.ibm.com [9.149.105.62]) by b06avi18626390.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 15S85i9S25952684 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 28 Jun 2021 08:05:44 GMT Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id CFA87AE071; Mon, 28 Jun 2021 08:07:14 +0000 (GMT) Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C1C8AAE0E6; Mon, 28 Jun 2021 08:07:12 +0000 (GMT) Received: from luoxhus-MacBook-Pro.local (unknown [9.200.155.117]) by d06av26.portsmouth.uk.ibm.com (Postfix) with ESMTPS; Mon, 28 Jun 2021 08:07:12 +0000 (GMT) From: Xionghu Luo Subject: Re: [PATCH] New hook adjust_iv_update_pos To: Richard Biener Cc: GCC Patches , Segher Boessenkool , Bill Schmidt , linkw@gcc.gnu.org, David Edelsohn , "H. J. Lu" References: <20210625083101.2828805-1-luoxhu@linux.ibm.com> Message-ID: <3e5723ef-0e50-ae6a-f503-1d4f1a015b16@linux.ibm.com> Date: Mon, 28 Jun 2021 16:07:10 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.0; rv:68.0) Gecko/20100101 Thunderbird/68.12.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 7-lfz1XUl3rT8RyMqAjo9Y1widdhtN7i X-Proofpoint-GUID: Ar9OyAodSThT7Kss5oZEuN9F_aFslbjb X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-06-28_05:2021-06-25, 2021-06-28 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 phishscore=0 lowpriorityscore=0 clxscore=1015 mlxscore=0 suspectscore=0 bulkscore=0 priorityscore=1501 malwarescore=0 spamscore=0 impostorscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106280055 X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_MSPIKE_H2, SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=no autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jun 2021 08:07:21 -0000 On 2021/6/25 18:02, Richard Biener wrote: > On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote: >> >> >> >> On 2021/6/25 16:54, Richard Biener wrote: >>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches >>> wrote: >>>> >>>> From: Xiong Hu Luo >>>> >>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance >>>> on Power. For example, it generates mismatched address offset after >>>> adjust iv update statement position: >>>> >>>> [local count: 70988443]: >>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1]; >>>> ivtmp.30_415 = ivtmp.30_414 + 1; >>>> _34 = ref_180 + 18446744073709551615; >>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1]; >>>> if (_84 == _86) >>>> goto ; [94.50%] >>>> else >>>> goto ; [5.50%] >>>> >>>> Disable it will produce: >>>> >>>> [local count: 70988443]: >>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1]; >>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1]; >>>> ivtmp.30_415 = ivtmp.30_414 + 1; >>>> if (_84 == _86) >>>> goto ; [94.50%] >>>> else >>>> goto ; [5.50%] >>>> >>>> Then later pass loop unroll could benefit from same address offset >>>> with different base address and reduces register dependency. >>>> This patch could improve performance by 10% for typical case on Power, >>>> no performance change observed for X86 or Aarch64 due to small loops >>>> not unrolled on these platforms. Any comments? >>> >>> The case you quote is special in that if we hoisted the IV update before >>> the other MEM _also_ used in the condition it would be fine again. >> >> Thanks. I tried to hoist the IV update statement before the first MEM (Fix 2), it >> shows even worse performance due to not unroll(two more "base-1" is generated in gimple, >> then loop->ninsns is 11 so small loops is not unrolled), change the threshold from >> 10 to 12 in rs6000_loop_unroll_adjust would make it also unroll 2 times, the >> performance is SAME to the one that IV update statement in the *MIDDLE* (trunk). >> From the ASM, we can see the index register %r4 is used in two iterations which >> maybe a bottle neck for hiding instruction latency? >> >> Then it seems reasonable the performance would be better if keep the IV update >> statement at *LAST* (Fix 1). >> >> (Fix 2): >> [local count: 70988443]: >> ivtmp.30_415 = ivtmp.30_414 + 1; >> _34 = ip_229 + 18446744073709551615; >> _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1]; >> _33 = ref_180 + 18446744073709551615; >> _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1]; >> if (_84 == _86) >> goto ; [94.50%] >> else >> goto ; [5.50%] >> >> >> .L67: >> lbzx %r12,%r24,%r4 >> lbzx %r25,%r7,%r4 >> cmpw %cr0,%r12,%r25 >> bne %cr0,.L11 >> mr %r26,%r4 >> addi %r4,%r4,1 >> lbzx %r12,%r24,%r4 >> lbzx %r25,%r7,%r4 >> mr %r6,%r26 >> cmpw %cr0,%r12,%r25 >> bne %cr0,.L11 >> mr %r26,%r4 >> .L12: >> cmpdi %cr0,%r10,1 >> addi %r4,%r26,1 >> mr %r6,%r26 >> addi %r10,%r10,-1 >> bne %cr0,.L67 >> >>> >>> Now, adjust_iv_update_pos doesn't seem to check that the >>> condition actually uses the IV use stmt def, so it likely applies to >>> too many cases. >>> >>> Unfortunately the introducing rev didn't come with a testcase, >>> but still I think fixing up adjust_iv_update_pos is better than >>> introducing a way to short-cut it per target decision. >>> >>> One "fix" might be to add a check that either the condition >>> lhs or rhs is the def of the IV use and the other operand >>> is invariant. Or if it's of similar structure hoist across the >>> other iv-use as well. Not that I understand the argument >>> about the overlapping life-range. >>> >>> You also don't provide a complete testcase ... >>> >> >> Attached the test code, will also add it it patch in future version. >> The issue comes from a very small hot loop: >> >> do { >> len++; >> } while(len < maxlen && ip[len] == ref[len]); > > unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen) > { > unsigned int len = 2; > do { > len++; > }while(len < maxlen && ip[len] == ref[len]); > return len; > } > > I can see the effect on this loop on x86_64 as well, we end up with > > .L6: > movzbl (%rdi,%rax), %ecx > addq $1, %rax > cmpb -1(%rsi,%rax), %cl > jne .L1 > .L3: > movl %eax, %r8d > cmpl %edx, %eax > jb .L6 > > but without the trick it is > > .L6: > movzbl (%rdi,%rax), %r8d > movzbl (%rsi,%rax), %ecx > addq $1, %rax > cmpb %cl, %r8b > jne .L1 > .L3: > movl %eax, %r9d > cmpl %edx, %eax > jb .L6 Verified this small piece of code on X86, there is no performance change with or without adjust_iv_update_pos (I've checked the ASM exactly same with yours): luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1 real 0m7.003s user 0m6.996s sys 0m0.000s luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/g cc/ -O2 test.c luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1 real 0m7.070s user 0m7.068s sys 0m0.000s But for AArch64, current GCC code also generates similar code with or without adjust_iv_update_pos, the runtime is 10.705s for them. L6: ldrb w4, [x6, x3] add x3, x3, 1 ldrb w5, [x1, x3] cmp w5, w4 bne .L1 .L3: mov w0, w3 cmp w2, w3 bhi .L6 No adjust_iv_update_pos: .L6: ldrb w5, [x6, x3] ldrb w4, [x1, x3] add x3, x3, 1 cmp w5, w4 bne .L1 .L3: mov w0, w3 cmp w2, w3 bhi .L6 While built with old GCC(gcc version 7.4.1 20190424), it generates worse code and runtime is 11.664s: .L6: ldrb w4, [x6, x3] add x3, x3, 1 add x5, x1, x3 ldrb w5, [x5, -1] cmp w5, w4 bne .L1 .L3: cmp w3, w2 mov w0, w3 bcc .L6 In general, only Power shows negative performance with adjust_iv_update_pos, that's why I tried to add target hook for it, is this reasonable? Or we should just remove adjust_iv_update_pos since it doesn't help performance for X86 or other targets? test.c #include __attribute__((noinline)) unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen) { unsigned int len = 2; do { len++; }while(len < maxlen && ip[len] == ref[len]); return len; } int main (int argc, char* argv[]) { unsigned char string_a [] = "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeeee"; unsigned char string_b [] = "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeene"; unsigned long ret = 0; for (long i = 0; i < atoi(argv[1]) * 100000000; i++) ret += foo (string_a, string_b, sizeof(string_a)); return ret; } > > so here you can see the missed fusion. Of course > in this case the IV update could have been sunk into > the .L3 block and replicated on the exit edge as well. > > I'm not sure if the motivation for the change introducing this > trick was the above kind of combination or not, but I guess > so. The dependence distance of the IV increment to the > use is now shorter, so I'm not sure the combined variant is > better. > > Richard. > > >> >> -- >> Thanks, >> Xionghu -- Thanks, Xionghu