From: Hongtao Liu
Date: Mon, 12 Jul 2021 17:05:18 +0800
Subject: Re: [PATCH] New hook adjust_iv_update_pos
To: Xionghu Luo, Uros Bizjak
Cc: Richard Biener, Segher Boessenkool, Bill Schmidt, GCC Patches,
 "Liu, Hongtao"
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
 <3e5723ef-0e50-ae6a-f503-1d4f1a015b16@linux.ibm.com>
 <04501d24-ffa5-aaf3-3291-fc95ce3b37d8@linux.ibm.com>
In-Reply-To: <04501d24-ffa5-aaf3-3291-fc95ce3b37d8@linux.ibm.com>

On Mon, Jul 12, 2021 at 4:14 PM Xionghu Luo via Gcc-patches wrote:
>
> On 2021/7/7 21:20, Richard Biener wrote:
> > On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo wrote:
> >>
> >> On 2021/6/28 16:25, Richard Biener wrote:
> >>> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo wrote:
> >>>>
> >>>> On 2021/6/25 18:02, Richard Biener wrote:
> >>>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
> >>>>>>
> >>>>>> On 2021/6/25 16:54, Richard Biener wrote:
> >>>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> From: Xiong Hu Luo
> >>>>>>>>
> >>>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>>>>>> on Power.
> >>>>>>>> For example, it generates a mismatched address offset after
> >>>>>>>> adjusting the IV update statement position:
> >>>>>>>>
> >>>>>>>>   [local count: 70988443]:
> >>>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>>>   _34 = ref_180 + 18446744073709551615;
> >>>>>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>>>   if (_84 == _86)
> >>>>>>>>     goto ; [94.50%]
> >>>>>>>>   else
> >>>>>>>>     goto ; [5.50%]
> >>>>>>>>
> >>>>>>>> Disabling it will produce:
> >>>>>>>>
> >>>>>>>>   [local count: 70988443]:
> >>>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>>>   if (_84 == _86)
> >>>>>>>>     goto ; [94.50%]
> >>>>>>>>   else
> >>>>>>>>     goto ; [5.50%]
> >>>>>>>>
> >>>>>>>> Then the later loop-unrolling pass can benefit from the same address
> >>>>>>>> offset with different base addresses, which reduces register
> >>>>>>>> dependencies.  This patch improves performance by 10% for a typical
> >>>>>>>> case on Power; no performance change is observed on X86 or AArch64,
> >>>>>>>> since such small loops are not unrolled on those platforms.  Any
> >>>>>>>> comments?
> >>>>>>>
> >>>>>>> The case you quote is special in that if we hoisted the IV update before
> >>>>>>> the other MEM _also_ used in the condition it would be fine again.
> >>>>>>
> >>>>>> Thanks.  I tried hoisting the IV update statement before the first MEM
> >>>>>> (Fix 2); it shows even worse performance because the loop is not
> >>>>>> unrolled (two more "base-1" statements are generated in gimple, so
> >>>>>> loop->ninsns is 11 and the small-loop heuristic does not unroll it).
> >>>>>> Changing the threshold from 10 to 12 in rs6000_loop_unroll_adjust makes
> >>>>>> it unroll 2 times as well, and the performance is then the SAME as the
> >>>>>> trunk code that puts the IV update statement in the *MIDDLE*.
> >>>>>> From the ASM, we can see that the index register %r4 is used in two
> >>>>>> iterations, which may be a bottleneck for hiding instruction latency?
> >>>>>>
> >>>>>> So it seems reasonable that the performance is better if the IV update
> >>>>>> statement is kept *LAST* (Fix 1).
> >>>>>>
> >>>>>> (Fix 2):
> >>>>>>   [local count: 70988443]:
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   _34 = ip_229 + 18446744073709551615;
> >>>>>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>   _33 = ref_180 + 18446744073709551615;
> >>>>>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> .L67:
> >>>>>>         lbzx %r12,%r24,%r4
> >>>>>>         lbzx %r25,%r7,%r4
> >>>>>>         cmpw %cr0,%r12,%r25
> >>>>>>         bne %cr0,.L11
> >>>>>>         mr %r26,%r4
> >>>>>>         addi %r4,%r4,1
> >>>>>>         lbzx %r12,%r24,%r4
> >>>>>>         lbzx %r25,%r7,%r4
> >>>>>>         mr %r6,%r26
> >>>>>>         cmpw %cr0,%r12,%r25
> >>>>>>         bne %cr0,.L11
> >>>>>>         mr %r26,%r4
> >>>>>> .L12:
> >>>>>>         cmpdi %cr0,%r10,1
> >>>>>>         addi %r4,%r26,1
> >>>>>>         mr %r6,%r26
> >>>>>>         addi %r10,%r10,-1
> >>>>>>         bne %cr0,.L67
> >>>>>>
> >>>>>>>
> >>>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>>>>>> condition actually uses the IV use stmt def, so it likely applies to
> >>>>>>> too many cases.
> >>>>>>>
> >>>>>>> Unfortunately the introducing rev didn't come with a testcase,
> >>>>>>> but still I think fixing up adjust_iv_update_pos is better than
> >>>>>>> introducing a way to short-cut it per target decision.
> >>>>>>>
> >>>>>>> One "fix" might be to add a check that either the condition
> >>>>>>> lhs or rhs is the def of the IV use and the other operand
> >>>>>>> is invariant.  Or if it's of similar structure hoist across the
> >>>>>>> other iv-use as well.  Not that I understand the argument
> >>>>>>> about the overlapping life-range.
> >>>>>>>
> >>>>>>> You also don't provide a complete testcase ...
> >>>>>>
> >>>>>> Attached the test code; I will also add it to the patch in a future
> >>>>>> version.  The issue comes from a very small hot loop:
> >>>>>>
> >>>>>>   do {
> >>>>>>     len++;
> >>>>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>>>
> >>>>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> >>>>> {
> >>>>>   unsigned int len = 2;
> >>>>>   do {
> >>>>>     len++;
> >>>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>>>   return len;
> >>>>> }
> >>>>>
> >>>>> I can see the effect on this loop on x86_64 as well; we end up with
> >>>>>
> >>>>> .L6:
> >>>>>         movzbl (%rdi,%rax), %ecx
> >>>>>         addq $1, %rax
> >>>>>         cmpb -1(%rsi,%rax), %cl
> >>>>>         jne .L1
> >>>>> .L3:
> >>>>>         movl %eax, %r8d
> >>>>>         cmpl %edx, %eax
> >>>>>         jb .L6
> >>>>>
> >>>>> but without the trick it is
> >>>>>
> >>>>> .L6:
> >>>>>         movzbl (%rdi,%rax), %r8d
> >>>>>         movzbl (%rsi,%rax), %ecx
> >>>>>         addq $1, %rax
> >>>>>         cmpb %cl, %r8b
> >>>>>         jne .L1
> >>>>> .L3:
> >>>>>         movl %eax, %r9d
> >>>>>         cmpl %edx, %eax
> >>>>>         jb .L6
> >>>>
> >>>> Verified this small piece of code on X86: there is no performance
> >>>> change with or without adjust_iv_update_pos (I've checked that the
> >>>> ASM is exactly the same as yours):
> >>>>
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>>>
> >>>> real    0m7.003s
> >>>> user    0m6.996s
> >>>> sys     0m0.000s
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>>>
> >>>> real    0m7.070s
> >>>> user    0m7.068s
> >>>> sys     0m0.000s
> >>>>
> >>>> But for AArch64, current GCC generates similar code with or without
> >>>> adjust_iv_update_pos, and the runtime is 10.705s for both:
> >>>>
> >>>> .L6:
> >>>>         ldrb w4, [x6, x3]
> >>>>         add x3, x3, 1
> >>>>         ldrb w5, [x1, x3]
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         mov w0, w3
> >>>>         cmp w2, w3
> >>>>         bhi .L6
> >>>>
> >>>> No adjust_iv_update_pos:
> >>>>
> >>>> .L6:
> >>>>         ldrb w5, [x6, x3]
> >>>>         ldrb w4, [x1, x3]
> >>>>         add x3, x3, 1
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         mov w0, w3
> >>>>         cmp w2, w3
> >>>>         bhi .L6
> >>>>
> >>>> When built with an old GCC (gcc version 7.4.1 20190424), it generates
> >>>> worse code, and the runtime is 11.664s:
> >>>>
> >>>> .L6:
> >>>>         ldrb w4, [x6, x3]
> >>>>         add x3, x3, 1
> >>>>         add x5, x1, x3
> >>>>         ldrb w5, [x5, -1]
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         cmp w3, w2
> >>>>         mov w0, w3
> >>>>         bcc .L6
> >>>>
> >>>> In general, only Power shows a performance regression with
> >>>> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> >>>> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> >>>> since it doesn't help performance on X86 or other targets?
> >>>
> >>> It was Bin who added the functionality - maybe he can remember the cases
> >>> he saw improvements for.
> >>>
> >>> I think we should instead try to change the code so that it doesn't
> >>> apply to CONDs where both operands are defined by stmts using the IV?
> >>> And this time add a testcase ;)
> >>
> >> Thanks.  I am not sure whether we should check both cond_lhs and
> >> cond_rhs, or whether checking just cond_lhs is enough if the COND
> >> expr's LHS is always processed first?
> >>
> >> For example, adjust_iv_update_pos will be called twice for this case;
> >> after the first call, the gimple will be updated to:
> >>
> >>   [local count: 1014686026]:
> >>   _1 = (sizetype) len_8;
> >>   _2 = ip_10(D) + _1;
> >>   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
> >>   _5 = ref_12(D) + _1;
> >>   _6 = *_5;
> >>   ivtmp.15_16 = ivtmp.15_15 + 1;
> >>   if (_3 == _6)
> >>     goto ; [94.50%]
> >>   else
> >>     goto ; [5.50%]
> >>
> >> At this moment, use->stmt is "_6 = *_5;", which is not using the IV
> >> directly?
> >
> > So looking around I think you want to pass down the IV group from
> > rewrite_groups to rewrite_use_address and then see whether the
> > definition of the RHS is one of the IV USE stmts that are going to be
> > expressed using the same candidate.  In fact, I guess (to avoid
> > linear searching the group), we likely might want to restrict considering
> > adjust_iv_update_pos to IV candidates with a single [address] use?
> > That would be more restrictive of course (OTOH the list shouldn't be
> > too big - hopefully).  In case the other side of the conditional is
> > a memory using a different IV the replacement should still be OK.
> >
> > Richard.
>
> Thanks; however, those methods would still break the X86 optimization
> with worse ASM (one more movzbl), though no performance change was
> observed.  CCed the X86 maintainers to see which one they prefer, and
> whether a target hook is a better solution.
>
> GCC trunk:
>
> .L6:
>         movzbl (%rdi,%rax), %ecx
>         addq $1, %rax
>         cmpb -1(%rsi,%rax), %cl
>         jne .L1
> .L3:
>         movl %eax, %r8d
>         cmpl %edx, %eax
>         jb .L6
>
> With adjust_iv_update_pos disabled:
>
> .L6:
>         movzbl (%rdi,%rax), %r8d
>         movzbl (%rsi,%rax), %ecx
>         addq $1, %rax
>         cmpb %cl, %r8b
>         jne .L1
> .L3:
>         movl %eax, %r9d
>         cmpl %edx, %eax
>         jb .L6

Latency should be the same:

  movzbl mem reg   5
  cmp reg8 reg8    1
  total            6
vs
  cmp mem8 reg8    6

Guess they're the same at the uops level.

CC Uros, any opinion?

> --
> Thanks,
> Xionghu



--
BR,
Hongtao
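
A minimal sketch of the check Richard proposes above (either the
condition lhs or rhs is the def of the IV use, and the other operand is
loop-invariant), written against GCC's gimple API; the helper name and
its integration point in adjust_iv_update_pos are assumptions, not
actual GCC code:

  /* Return true if COND compares the value defined from this IV use
     (USE_DEF) against a loop-invariant operand, so that moving the IV
     update across COND cannot change the offset seen by a second
     IV-based memory access.  */
  static bool
  cond_uses_single_iv_def_p (class loop *loop, gcond *cond, tree use_def)
  {
    tree lhs = gimple_cond_lhs (cond);
    tree rhs = gimple_cond_rhs (cond);

    if (lhs == use_def)
      return expr_invariant_in_loop_p (loop, rhs);
    if (rhs == use_def)
      return expr_invariant_in_loop_p (loop, lhs);

    /* Neither operand is this IV use's def; be conservative and keep
       the IV update where it is.  */
    return false;
  }

In the mismatched-offset gimple quoted earlier, _86 is defined by
another IV-based load rather than being loop-invariant, so a check of
this shape would make adjust_iv_update_pos bail out and leave both
loads at the same offset, which is the behavior wanted on Power.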