From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Wed, 7 Jul 2021 15:20:44 +0200
Subject: Re: [PATCH] New hook adjust_iv_update_pos
To: Xionghu Luo
Cc: "Amker.Cheng", GCC Patches, Segher Boessenkool, Bill Schmidt,
	linkw@gcc.gnu.org, David Edelsohn, "H. J. Lu"
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
	<3e5723ef-0e50-ae6a-f503-1d4f1a015b16@linux.ibm.com>

On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo wrote:
>
> On 2021/6/28 16:25, Richard Biener wrote:
> > On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo wrote:
> >>
> >> On 2021/6/25 18:02, Richard Biener wrote:
> >>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
> >>>>
> >>>> On 2021/6/25 16:54, Richard Biener wrote:
> >>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>>>> wrote:
> >>>>>>
> >>>>>> From: Xiong Hu Luo
> >>>>>>
> >>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>>>> on Power.
> >>>>>> For example, it generates a mismatched address offset after
> >>>>>> adjusting the IV update statement position:
> >>>>>>
> >>>>>>    [local count: 70988443]:
> >>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   _34 = ref_180 + 18446744073709551615;
> >>>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> Disabling it produces:
> >>>>>>
> >>>>>>    [local count: 70988443]:
> >>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> Then the later loop-unroll pass can benefit from the same address
> >>>>>> offset with different base addresses, which reduces register
> >>>>>> dependencies.  This patch improves performance by 10% for a typical
> >>>>>> case on Power; no performance change was observed on x86 or AArch64
> >>>>>> because these small loops are not unrolled on those platforms.  Any
> >>>>>> comments?
> >>>>>
> >>>>> The case you quote is special in that if we hoisted the IV update before
> >>>>> the other MEM _also_ used in the condition it would be fine again.
> >>>>
> >>>> Thanks.  I tried to hoist the IV update statement before the first MEM
> >>>> (Fix 2); it shows even worse performance because the loop is not
> >>>> unrolled (two more "base-1" instructions are generated in gimple,
> >>>> making loop->ninsns 11, so the small loop is not unrolled).  Changing
> >>>> the threshold from 10 to 12 in rs6000_loop_unroll_adjust makes it
> >>>> unroll 2 times as well, and then the performance is the SAME as trunk,
> >>>> where the IV update statement is in the *MIDDLE*.  From the ASM we can
> >>>> see the index register %r4 is used in two iterations, which may be a
> >>>> bottleneck for hiding instruction latency.
> >>>>
> >>>> So it seems reasonable that performance is better if the IV update
> >>>> statement is kept *LAST* (Fix 1).
> >>>>
> >>>> (Fix 2):
> >>>>    [local count: 70988443]:
> >>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>   _34 = ip_229 + 18446744073709551615;
> >>>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>   _33 = ref_180 + 18446744073709551615;
> >>>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>>>   if (_84 == _86)
> >>>>     goto ; [94.50%]
> >>>>   else
> >>>>     goto ; [5.50%]
> >>>>
> >>>> .L67:
> >>>>         lbzx %r12,%r24,%r4
> >>>>         lbzx %r25,%r7,%r4
> >>>>         cmpw %cr0,%r12,%r25
> >>>>         bne %cr0,.L11
> >>>>         mr %r26,%r4
> >>>>         addi %r4,%r4,1
> >>>>         lbzx %r12,%r24,%r4
> >>>>         lbzx %r25,%r7,%r4
> >>>>         mr %r6,%r26
> >>>>         cmpw %cr0,%r12,%r25
> >>>>         bne %cr0,.L11
> >>>>         mr %r26,%r4
> >>>> .L12:
> >>>>         cmpdi %cr0,%r10,1
> >>>>         addi %r4,%r26,1
> >>>>         mr %r6,%r26
> >>>>         addi %r10,%r10,-1
> >>>>         bne %cr0,.L67
> >>>>
> >>>>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>>>> condition actually uses the IV use stmt def, so it likely applies to
> >>>>> too many cases.
> >>>>>
> >>>>> Unfortunately the introducing rev didn't come with a testcase,
> >>>>> but still I think fixing up adjust_iv_update_pos is better than
> >>>>> introducing a way to short-cut it per target decision.
> >>>>>
> >>>>> One "fix" might be to add a check that either the condition
> >>>>> lhs or rhs is the def of the IV use and the other operand
> >>>>> is invariant.  Or if it's of similar structure hoist across the
> >>>>> other iv-use as well.  Not that I understand the argument
> >>>>> about the overlapping life-range.
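
A minimal sketch of the invariant check described above, for
illustration only - it is not from any posted patch.  It assumes the
enclosing loop is available inside adjust_iv_update_pos (in trunk the
function only receives cand and use, so the "loop" variable is an
assumption), and it reuses expr_invariant_in_loop_p from
tree-ssa-loop-ivopts.c:

  /* Sketch only: reorder the IV update only when one side of the
     GIMPLE_COND is the value defined by this IV use and the other side
     is loop-invariant.  "stmt" is the GIMPLE_COND and "use"/"loop" are
     assumed available at this point.  */
  gcond *cond = as_a <gcond *> (stmt);
  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);
  tree def = gimple_assign_lhs (use->stmt);

  if (!((lhs == def && expr_invariant_in_loop_p (loop, rhs))
	|| (rhs == def && expr_invariant_in_loop_p (loop, lhs))))
    /* Both operands may depend on the IV; leave the update alone.  */
    return;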
> >>>>> You also don't provide a complete testcase ...
> >>>>
> >>>> Attached is the test code; I will also add it to the patch in a future
> >>>> version.  The issue comes from a very small hot loop:
> >>>>
> >>>>   do {
> >>>>     len++;
> >>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>
> >>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> >>> {
> >>>   unsigned int len = 2;
> >>>   do {
> >>>     len++;
> >>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>   return len;
> >>> }
> >>>
> >>> I can see the effect on this loop on x86_64 as well, we end up with
> >>>
> >>> .L6:
> >>>         movzbl (%rdi,%rax), %ecx
> >>>         addq $1, %rax
> >>>         cmpb -1(%rsi,%rax), %cl
> >>>         jne .L1
> >>> .L3:
> >>>         movl %eax, %r8d
> >>>         cmpl %edx, %eax
> >>>         jb .L6
> >>>
> >>> but without the trick it is
> >>>
> >>> .L6:
> >>>         movzbl (%rdi,%rax), %r8d
> >>>         movzbl (%rsi,%rax), %ecx
> >>>         addq $1, %rax
> >>>         cmpb %cl, %r8b
> >>>         jne .L1
> >>> .L3:
> >>>         movl %eax, %r9d
> >>>         cmpl %edx, %eax
> >>>         jb .L6
> >>
> >> I verified this small piece of code on x86; there is no performance
> >> change with or without adjust_iv_update_pos (the ASM is exactly the
> >> same as yours):
> >>
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>
> >> real    0m7.003s
> >> user    0m6.996s
> >> sys     0m0.000s
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> >> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>
> >> real    0m7.070s
> >> user    0m7.068s
> >> sys     0m0.000s
> >>
> >> But for AArch64, current GCC also generates similar code with and
> >> without adjust_iv_update_pos; the runtime is 10.705s for both.
> >>
> >> .L6:
> >>         ldrb w4, [x6, x3]
> >>         add x3, x3, 1
> >>         ldrb w5, [x1, x3]
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         mov w0, w3
> >>         cmp w2, w3
> >>         bhi .L6
> >>
> >> Without adjust_iv_update_pos:
> >>
> >> .L6:
> >>         ldrb w5, [x6, x3]
> >>         ldrb w4, [x1, x3]
> >>         add x3, x3, 1
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         mov w0, w3
> >>         cmp w2, w3
> >>         bhi .L6
> >>
> >> Built with an old GCC (gcc version 7.4.1 20190424), it generates worse
> >> code, and the runtime is 11.664s:
> >>
> >> .L6:
> >>         ldrb w4, [x6, x3]
> >>         add x3, x3, 1
> >>         add x5, x1, x3
> >>         ldrb w5, [x5, -1]
> >>         cmp w5, w4
> >>         bne .L1
> >> .L3:
> >>         cmp w3, w2
> >>         mov w0, w3
> >>         bcc .L6
> >>
> >> In general, only Power shows a performance regression with
> >> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> >> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> >> since it doesn't help performance on x86 or other targets?
> >
> > It was Bin who added the functionality - maybe he can remember the cases
> > he saw improvements for.
> >
> > I think we should instead try to change the code so that it doesn't apply
> > to CONDs where both operands are defined by stmts using the IV?
> > And this time add a testcase ;)
>
> Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs,
> or whether checking just cond_lhs is enough if the COND expr's LHS is
> always processed first?
>
> For example, adjust_iv_update_pos will be called twice for this case;
> after the first call, the gimple will be updated to:
>
>    [local count: 1014686026]:
>   _1 = (sizetype) len_8;
>   _2 = ip_10(D) + _1;
>   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
>   _5 = ref_12(D) + _1;
>   _6 = *_5;
>   ivtmp.15_16 = ivtmp.15_15 + 1;
>   if (_3 == _6)
>     goto ; [94.50%]
>   else
>     goto ; [5.50%]
>
> At this moment, use->stmt is "_6 = *_5;", so it is not using the IV
> directly?

So looking around I think you want to pass down the IV group from
rewrite_groups to rewrite_use_address and then see whether the
definition of the RHS is one of the IV USE stmts that are going to be
expressed using the same candidate.

In fact, I guess (to avoid linearly searching the group), we might want
to restrict considering adjust_iv_update_pos to IV candidates with a
single [address] use?  That would be more restrictive of course (OTOH
the list shouldn't be too big - hopefully).  In case the other side of
the conditional is a memory access using a different IV, the replacement
should still be OK.

Richard.
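
A hedged sketch of that suggestion, for illustration only.  struct
iv_group and its vuses vector exist in tree-ssa-loop-ivopts.c, but
threading the group down into the adjustment (the extra parameter below)
is hypothetical:

  /* Sketch only: assumes the caller (rewrite_use_address) passes its
     iv_group down.  Returns true if the side of COND not defined by
     USE->stmt is defined by another use in GROUP, i.e. it will be
     rewritten against the same candidate, so reordering the IV update
     would break the common offset.  */
  static bool
  cond_other_side_in_group_p (gcond *cond, struct iv_use *use,
			      struct iv_group *group)
  {
    tree lhs = gimple_cond_lhs (cond);
    tree rhs = gimple_cond_rhs (cond);
    tree other
      = (TREE_CODE (lhs) == SSA_NAME && SSA_NAME_DEF_STMT (lhs) == use->stmt)
	? rhs : lhs;
    if (TREE_CODE (other) != SSA_NAME)
      return false;
    gimple *def = SSA_NAME_DEF_STMT (other);
    for (unsigned i = 0; i < group->vuses.length (); i++)
      if (group->vuses[i] != use && group->vuses[i]->stmt == def)
	return true;
    return false;
  }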
>
> [PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in COND [PR101250]
>
> gcc/ChangeLog:
>
>	PR middle-end/101250
>	* tree-ssa-loop-ivopts.c (adjust_iv_update_pos): Check whether
>	both COND operands use the IV.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr101250.c | 15 ++++++++++++
>  gcc/tree-ssa-loop-ivopts.c               | 31 +++++++++++++++++++++++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
> new file mode 100644
> index 00000000000..70d3701de49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
> +
> +unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> +{
> +  unsigned int len = 2;
> +  do
> +    {
> +      len++;
> +    }
> +  while (len < maxlen && ip[len] == ref[len]);
> +  return len;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Reordering" "ivopts" } } */
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index 12a8a49a307..082097bff38 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -7352,7 +7352,7 @@ static void
>  adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>  {
>    tree var_after;
> -  gimple *iv_update, *stmt;
> +  gimple *iv_update, *stmt, *cond_stmt;
>    basic_block bb;
>    gimple_stmt_iterator gsi, gsi_iv;
>
> @@ -7370,6 +7370,7 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>    if (gimple_code (stmt) != GIMPLE_COND)
>      return;
>
> +  cond_stmt = stmt;
>    gsi_prev_nondebug (&gsi);
>    stmt = gsi_stmt (gsi);
>    if (stmt != iv_update)
> @@ -7386,6 +7387,34 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
>    if (stmt != use->stmt)
>      return;
>
> +  gcond *gcond_stmt = as_a <gcond *> (cond_stmt);
> +  tree *cond_lhs = gimple_cond_lhs_ptr (gcond_stmt);
> +  tree *cond_rhs = gimple_cond_rhs_ptr (gcond_stmt);
> +  if (TREE_CODE (*cond_lhs) == SSA_NAME && TREE_CODE (*cond_rhs) == SSA_NAME)
> +    {
> +      /* If both sides are memory operations using iv var_before,
> +	 adjusting the IV update stmt before the second memory operation
> +	 causes an offset index mismatch, which may hurt performance when
> +	 unrolling, so return here to avoid the reordering.
> +
> +	   _1 = (sizetype) len_8;
> +	   _2 = ip_10(D) + _1;
> +	   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
> +	   _5 = ref_12(D) + _1;
> +	   _6 = *_5;
> +	   ivtmp.15_16 = ivtmp.15_15 + 1;
> +	   if (_3 == _6).  */
> +
> +      gimple *lhs_mem = SSA_NAME_DEF_STMT (*cond_lhs);
> +      gimple *rhs_mem = SSA_NAME_DEF_STMT (*cond_rhs);
> +      gimple *use_stmt;
> +      imm_use_iterator use_iter;
> +      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, cand->var_before)
> +	if ((use_stmt == lhs_mem || use_stmt == rhs_mem)
> +	    && use_stmt != use->stmt)
> +	  return;
> +    }
> +
>    if (TREE_CODE (gimple_assign_lhs (stmt)) != SSA_NAME)
>      return;
>
> --
> 2.27.0.90.geebb51ba8c
>
>
> --
> Thanks,
> Xionghu