From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Wed, 7 Jul 2021 15:20:44 +0200
Subject: Re: [PATCH] New hook adjust_iv_update_pos
To: Xionghu Luo
Cc: "Amker.Cheng", GCC Patches, Segher Boessenkool, Bill Schmidt,
	linkw@gcc.gnu.org, David Edelsohn, "H. J. Lu"
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
	<3e5723ef-0e50-ae6a-f503-1d4f1a015b16@linux.ibm.com>

On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo wrote:
>
> On 2021/6/28 16:25, Richard Biener wrote:
> > On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo wrote:
> >>
> >> On 2021/6/25 18:02, Richard Biener wrote:
> >>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
> >>>>
> >>>> On 2021/6/25 16:54, Richard Biener wrote:
> >>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>>>> wrote:
> >>>>>>
> >>>>>> From: Xiong Hu Luo
> >>>>>>
> >>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>>>> on Power.
> >>>>>> For example, it generates a mismatched address offset after
> >>>>>> adjusting the IV update statement position:
> >>>>>>
> >>>>>>    [local count: 70988443]:
> >>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   _34 = ref_180 + 18446744073709551615;
> >>>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> Disabling it produces:
> >>>>>>
> >>>>>>    [local count: 70988443]:
> >>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> Then the later loop-unroll pass can benefit from the same address
> >>>>>> offset with different base addresses, which reduces register
> >>>>>> dependencies.  This patch improves performance by 10% for a typical
> >>>>>> case on Power; no performance change was observed on x86 or AArch64
> >>>>>> because these small loops are not unrolled on those platforms.  Any
> >>>>>> comments?
> >>>>>
> >>>>> The case you quote is special in that if we hoisted the IV update before
> >>>>> the other MEM _also_ used in the condition it would be fine again.
> >>>>
> >>>> Thanks.  I tried to hoist the IV update statement before the first MEM
> >>>> (Fix 2); it shows even worse performance because the loop is not
> >>>> unrolled (two more "base-1" instructions are generated in gimple,
> >>>> making loop->ninsns 11, so the small loop is not unrolled).  Changing
> >>>> the threshold from 10 to 12 in rs6000_loop_unroll_adjust makes it
> >>>> unroll 2 times as well, and then the performance is the SAME as trunk,
> >>>> where the IV update statement is in the *MIDDLE*.  From the ASM we can
> >>>> see the index register %r4 is used in two iterations, which may be a
> >>>> bottleneck for hiding instruction latency.
> >>>>
> >>>> So it seems reasonable that performance is better if the IV update
> >>>> statement is kept *LAST* (Fix 1).
> >>>>
> >>>> (Fix 2):
> >>>>    [local count: 70988443]:
> >>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>   _34 = ip_229 + 18446744073709551615;
> >>>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>   _33 = ref_180 + 18446744073709551615;
> >>>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>>>   if (_84 == _86)
> >>>>     goto ; [94.50%]
> >>>>   else
> >>>>     goto ; [5.50%]
> >>>>
> >>>> .L67:
> >>>>         lbzx %r12,%r24,%r4
> >>>>         lbzx %r25,%r7,%r4
> >>>>         cmpw %cr0,%r12,%r25
> >>>>         bne %cr0,.L11
> >>>>         mr %r26,%r4
> >>>>         addi %r4,%r4,1
> >>>>         lbzx %r12,%r24,%r4
> >>>>         lbzx %r25,%r7,%r4
> >>>>         mr %r6,%r26
> >>>>         cmpw %cr0,%r12,%r25
> >>>>         bne %cr0,.L11
> >>>>         mr %r26,%r4
> >>>> .L12:
> >>>>         cmpdi %cr0,%r10,1
> >>>>         addi %r4,%r26,1
> >>>>         mr %r6,%r26
> >>>>         addi %r10,%r10,-1
> >>>>         bne %cr0,.L67
> >>>>
> >>>>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>>>> condition actually uses the IV use stmt def, so it likely applies to
> >>>>> too many cases.
> >>>>>
> >>>>> Unfortunately the introducing rev didn't come with a testcase,
> >>>>> but still I think fixing up adjust_iv_update_pos is better than
> >>>>> introducing a way to short-cut it per target decision.
> >>>>>
> >>>>> One "fix" might be to add a check that either the condition
> >>>>> lhs or rhs is the def of the IV use and the other operand
> >>>>> is invariant.  Or if it's of similar structure hoist across the
> >>>>> other iv-use as well.  Not that I understand the argument
> >>>>> about the overlapping life-range.
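
A minimal sketch of the invariant check described above, for
illustration only - it is not from any posted patch.  It assumes the
enclosing loop is available inside adjust_iv_update_pos (in trunk the
function only receives cand and use, so the "loop" variable is an
assumption), and it reuses expr_invariant_in_loop_p from
tree-ssa-loop-ivopts.c:

  /* Sketch only: reorder the IV update only when one side of the
     GIMPLE_COND is the value defined by this IV use and the other side
     is loop-invariant.  "stmt" is the GIMPLE_COND and "use"/"loop" are
     assumed available at this point.  */
  gcond *cond = as_a <gcond *> (stmt);
  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);
  tree def = gimple_assign_lhs (use->stmt);

  if (!((lhs == def && expr_invariant_in_loop_p (loop, rhs))
	|| (rhs == def && expr_invariant_in_loop_p (loop, lhs))))
    /* Both operands may depend on the IV; leave the update alone.  */
    return;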
> >>>>> You also don't provide a complete testcase ...
> >>>>
> >>>> Attached is the test code; I will also add it to the patch in a future
> >>>> version.  The issue comes from a very small hot loop:
> >>>>
> >>>>   do {
> >>>>     len++;
> >>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>
> >>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> >>> {
> >>>   unsigned int len = 2;
> >>>   do {
> >>>     len++;
> >>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>   return len;
> >>> }
> >>>
> >>> I can see the effect on this loop on x86_64 as well, we end up with
> >>>
> >>> .L6:
> >>>         movzbl (%rdi,%rax), %ecx
> >>>         addq $1, %rax
> >>>         cmpb -1(%rsi,%rax), %cl
> >>>         jne .L1
> >>> .L3:
> >>>         movl %eax, %r8d
> >>>         cmpl %edx, %eax
> >>>         jb .L6
> >>>
> >>> but without the trick it is
> >>>
> >>> .L6:
> >>>         movzbl (%rdi,%rax), %r8d
> >>>         movzbl (%rsi,%rax), %ecx
> >>>         addq $1, %rax
> >>>         cmpb %cl, %r8b
> >>>         jne .L1
> >>> .L3:
> >>>         movl %eax, %r9d
> >>>         cmpl %edx, %eax
> >>>         jb .L6
> >>
> >> I verified this small piece of code on x86; there is no performance
> >> change with or without adjust_iv_update_pos (the ASM is exactly the
> >> same as yours):
> >>
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>
> >> real    0m7.003s
> >> user    0m6.996s
> >> sys     0m0.000s
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>
> >> real    0m7.070s
> >> user    0m7.068s
> >> sys     0m0.000s
> >>
> >> But for AArch64, current GCC also generates similar code with and
> >> without adjust_iv_update_pos; the runtime is 10.705s for both.
> >>
> >> .L6:
> >>         ldrb w4, [x6, x3]
> >>         add x3, x3, 1
> >>         ldrb w5, [x1, x3]
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         mov w0, w3
> >>         cmp w2, w3
> >>         bhi .L6
> >>
> >> Without adjust_iv_update_pos:
> >>
> >> .L6:
> >>         ldrb w5, [x6, x3]
> >>         ldrb w4, [x1, x3]
> >>         add x3, x3, 1
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         mov w0, w3
> >>         cmp w2, w3
> >>         bhi .L6
> >>
> >> Built with an old GCC (gcc version 7.4.1 20190424), it generates worse
> >> code, and the runtime is 11.664s:
> >>
> >> .L6:
> >>         ldrb w4, [x6, x3]
> >>         add x3, x3, 1
> >>         add x5, x1, x3
> >>         ldrb w5, [x5, -1]
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         cmp w3, w2
> >>         mov w0, w3
> >>         bcc .L6
> >>
> >> In general, only Power shows a performance regression with
> >> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> >> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> >> since it doesn't help performance on x86 or other targets?
> >
> > It was Bin who added the functionality - maybe he can remember the cases
> > he saw improvements for.
> >
> > I think we should instead try to change the code so that it doesn't apply
> > to CONDs where both operands are defined by stmts using the IV?
> > And this time add a testcase ;)
>
> Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs,
> or whether checking just cond_lhs is enough if the COND expr's LHS is
> always processed first?
>
> For example, adjust_iv_update_pos will be called twice for this case;
> after the first call, the gimple will be updated to:
>
>    [local count: 1014686026]:
>   _1 = (sizetype) len_8;
>   _2 = ip_10(D) + _1;
>   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
>   _5 = ref_12(D) + _1;
>   _6 = *_5;
>   ivtmp.15_16 = ivtmp.15_15 + 1;
>   if (_3 == _6)
>     goto ; [94.50%]
>   else
>     goto ; [5.50%]
>
> At this moment, use->stmt is "_6 = *_5;", so it is not using the IV
> directly?

So looking around I think you want to pass down the IV group from
rewrite_groups to rewrite_use_address and then see whether the
definition of the RHS is one of the IV USE stmts that are going to be
expressed using the same candidate.

In fact, I guess (to avoid linearly searching the group), we might want
to restrict considering adjust_iv_update_pos to IV candidates with a
single [address] use?  That would be more restrictive of course (OTOH
the list shouldn't be too big - hopefully).  In case the other side of
the conditional is a memory access using a different IV, the replacement
should still be OK.

Richard.
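
A hedged sketch of that suggestion, for illustration only.  struct
iv_group and its vuses vector exist in tree-ssa-loop-ivopts.c, but
threading the group down into the adjustment (the extra parameter below)
is hypothetical:

  /* Sketch only: assumes the caller (rewrite_use_address) passes its
     iv_group down.  Returns true if the side of COND not defined by
     USE->stmt is defined by another use in GROUP, i.e. it will be
     rewritten against the same candidate, so reordering the IV update
     would break the common offset.  */
  static bool
  cond_other_side_in_group_p (gcond *cond, struct iv_use *use,
			      struct iv_group *group)
  {
    tree lhs = gimple_cond_lhs (cond);
    tree rhs = gimple_cond_rhs (cond);
    tree other
      = (TREE_CODE (lhs) == SSA_NAME && SSA_NAME_DEF_STMT (lhs) == use->stmt)
	? rhs : lhs;
    if (TREE_CODE (other) != SSA_NAME)
      return false;
    gimple *def = SSA_NAME_DEF_STMT (other);
    for (unsigned i = 0; i < group->vuses.length (); i++)
      if (group->vuses[i] != use && group->vuses[i]->stmt == def)
	return true;
    return false;
  }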
>
> [PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in COND [PR101250]
>
> gcc/ChangeLog:
>
>	PR middle-end/101250
>	* tree-ssa-loop-ivopts.c (adjust_iv_update_pos): Check whether
>	both COND operands use the IV.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr101250.c | 15 ++++++++++++
>  gcc/tree-ssa-loop-ivopts.c               | 31 +++++++++++++++++++++++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
> new file mode 100644
> index 00000000000..70d3701de49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
> +
> +unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> +{
> +  unsigned int len = 2;
> +  do
> +    {
> +      len++;
> +    }
> +  while (len < maxlen && ip[len] == ref[len]);
> +  return len;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Reordering" "ivopts" } } */
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index 12a8a49a307..082097bff38 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -7352,7 +7352,7 @@ static void
>  adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>  {
>    tree var_after;
> -  gimple *iv_update, *stmt;
> +  gimple *iv_update, *stmt, *cond_stmt;
>    basic_block bb;
>    gimple_stmt_iterator gsi, gsi_iv;
>
> @@ -7370,6 +7370,7 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>    if (gimple_code (stmt) != GIMPLE_COND)
>      return;
>
> +  cond_stmt = stmt;
>    gsi_prev_nondebug (&gsi);
>    stmt = gsi_stmt (gsi);
>    if (stmt != iv_update)
> @@ -7386,6 +7387,34 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>    if (stmt != use->stmt)
>      return;
>
> +  gcond *gcond_stmt = as_a <gcond *> (cond_stmt);
> +  tree *cond_lhs = gimple_cond_lhs_ptr (gcond_stmt);
> +  tree *cond_rhs = gimple_cond_rhs_ptr (gcond_stmt);
> +  if (TREE_CODE (*cond_lhs) == SSA_NAME && TREE_CODE (*cond_rhs) == SSA_NAME)
> +    {
> +      /* If both sides are memory operations using iv var_before,
> +	 adjusting the IV update stmt before the second memory operation
> +	 causes an offset index mismatch, which may hurt performance when
> +	 unrolling, so return here to avoid the reordering.
> +
> +	   _1 = (sizetype) len_8;
> +	   _2 = ip_10(D) + _1;
> +	   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
> +	   _5 = ref_12(D) + _1;
> +	   _6 = *_5;
> +	   ivtmp.15_16 = ivtmp.15_15 + 1;
> +	   if (_3 == _6).  */
> +
> +      gimple *lhs_mem = SSA_NAME_DEF_STMT (*cond_lhs);
> +      gimple *rhs_mem = SSA_NAME_DEF_STMT (*cond_rhs);
> +      gimple *use_stmt;
> +      imm_use_iterator use_iter;
> +      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, cand->var_before)
> +	if ((use_stmt == lhs_mem || use_stmt == rhs_mem)
> +	    && use_stmt != use->stmt)
> +	  return;
> +    }
> +
>    if (TREE_CODE (gimple_assign_lhs (stmt)) != SSA_NAME)
>      return;
>
> --
> 2.27.0.90.geebb51ba8c
>
>
> --
> Thanks,
> Xionghu