From: Hongtao Liu
Date: Mon, 12 Jul 2021 17:05:18 +0800
Subject: Re: [PATCH] New hook adjust_iv_update_pos
To: Xionghu Luo, Uros Bizjak
Cc: Richard Biener, Segher Boessenkool, Bill Schmidt, GCC Patches,
 "Liu, Hongtao"
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
 <3e5723ef-0e50-ae6a-f503-1d4f1a015b16@linux.ibm.com>
 <04501d24-ffa5-aaf3-3291-fc95ce3b37d8@linux.ibm.com>
In-Reply-To: <04501d24-ffa5-aaf3-3291-fc95ce3b37d8@linux.ibm.com>

On Mon, Jul 12, 2021 at 4:14 PM Xionghu Luo via Gcc-patches wrote:
>
> On 2021/7/7 21:20, Richard Biener wrote:
> > On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo wrote:
> >>
> >> On 2021/6/28 16:25, Richard Biener wrote:
> >>> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo wrote:
> >>>>
> >>>> On 2021/6/25 18:02, Richard Biener wrote:
> >>>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
> >>>>>>
> >>>>>> On 2021/6/25 16:54, Richard Biener wrote:
> >>>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> From: Xiong Hu Luo
> >>>>>>>>
> >>>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>>>>>> on Power.
> >>>>>>>> For example, it generates a mismatched address offset after
> >>>>>>>> adjusting the IV update statement position:
> >>>>>>>>
> >>>>>>>>   [local count: 70988443]:
> >>>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>>>   _34 = ref_180 + 18446744073709551615;
> >>>>>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>>>   if (_84 == _86)
> >>>>>>>>     goto ; [94.50%]
> >>>>>>>>   else
> >>>>>>>>     goto ; [5.50%]
> >>>>>>>>
> >>>>>>>> Disabling it will produce:
> >>>>>>>>
> >>>>>>>>   [local count: 70988443]:
> >>>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>>>   if (_84 == _86)
> >>>>>>>>     goto ; [94.50%]
> >>>>>>>>   else
> >>>>>>>>     goto ; [5.50%]
> >>>>>>>>
> >>>>>>>> Then the later loop-unrolling pass can benefit from the same address
> >>>>>>>> offset with different base addresses, which reduces register
> >>>>>>>> dependencies.  This patch improves performance by 10% for a typical
> >>>>>>>> case on Power; no performance change is observed on X86 or AArch64,
> >>>>>>>> since such small loops are not unrolled on those platforms.  Any
> >>>>>>>> comments?
> >>>>>>>
> >>>>>>> The case you quote is special in that if we hoisted the IV update before
> >>>>>>> the other MEM _also_ used in the condition it would be fine again.
> >>>>>>
> >>>>>> Thanks.  I tried hoisting the IV update statement before the first MEM
> >>>>>> (Fix 2); it shows even worse performance because the loop is not
> >>>>>> unrolled (two more "base-1" statements are generated in gimple, so
> >>>>>> loop->ninsns is 11 and the small-loop heuristic does not unroll it).
> >>>>>> Changing the threshold from 10 to 12 in rs6000_loop_unroll_adjust makes
> >>>>>> it unroll 2 times as well, and the performance is then the SAME as the
> >>>>>> trunk code that puts the IV update statement in the *MIDDLE*.
> >>>>>> From the ASM, we can see that the index register %r4 is used in two
> >>>>>> iterations, which may be a bottleneck for hiding instruction latency?
> >>>>>>
> >>>>>> So it seems reasonable that the performance is better if the IV update
> >>>>>> statement is kept *LAST* (Fix 1).
> >>>>>>
> >>>>>> (Fix 2):
> >>>>>>   [local count: 70988443]:
> >>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>>>   _34 = ip_229 + 18446744073709551615;
> >>>>>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>>>   _33 = ref_180 + 18446744073709551615;
> >>>>>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>>>>>   if (_84 == _86)
> >>>>>>     goto ; [94.50%]
> >>>>>>   else
> >>>>>>     goto ; [5.50%]
> >>>>>>
> >>>>>> .L67:
> >>>>>>         lbzx %r12,%r24,%r4
> >>>>>>         lbzx %r25,%r7,%r4
> >>>>>>         cmpw %cr0,%r12,%r25
> >>>>>>         bne %cr0,.L11
> >>>>>>         mr %r26,%r4
> >>>>>>         addi %r4,%r4,1
> >>>>>>         lbzx %r12,%r24,%r4
> >>>>>>         lbzx %r25,%r7,%r4
> >>>>>>         mr %r6,%r26
> >>>>>>         cmpw %cr0,%r12,%r25
> >>>>>>         bne %cr0,.L11
> >>>>>>         mr %r26,%r4
> >>>>>> .L12:
> >>>>>>         cmpdi %cr0,%r10,1
> >>>>>>         addi %r4,%r26,1
> >>>>>>         mr %r6,%r26
> >>>>>>         addi %r10,%r10,-1
> >>>>>>         bne %cr0,.L67
> >>>>>>
> >>>>>>>
> >>>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>>>>>> condition actually uses the IV use stmt def, so it likely applies to
> >>>>>>> too many cases.
> >>>>>>>
> >>>>>>> Unfortunately the introducing rev didn't come with a testcase,
> >>>>>>> but still I think fixing up adjust_iv_update_pos is better than
> >>>>>>> introducing a way to short-cut it per target decision.
> >>>>>>>
> >>>>>>> One "fix" might be to add a check that either the condition
> >>>>>>> lhs or rhs is the def of the IV use and the other operand
> >>>>>>> is invariant.  Or if it's of similar structure hoist across the
> >>>>>>> other iv-use as well.  Not that I understand the argument
> >>>>>>> about the overlapping life-range.
> >>>>>>>
> >>>>>>> You also don't provide a complete testcase ...
> >>>>>>
> >>>>>> Attached the test code; I will also add it to the patch in a future
> >>>>>> version.  The issue comes from a very small hot loop:
> >>>>>>
> >>>>>>   do {
> >>>>>>     len++;
> >>>>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>>>
> >>>>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> >>>>> {
> >>>>>   unsigned int len = 2;
> >>>>>   do {
> >>>>>     len++;
> >>>>>   } while (len < maxlen && ip[len] == ref[len]);
> >>>>>   return len;
> >>>>> }
> >>>>>
> >>>>> I can see the effect on this loop on x86_64 as well; we end up with
> >>>>>
> >>>>> .L6:
> >>>>>         movzbl (%rdi,%rax), %ecx
> >>>>>         addq $1, %rax
> >>>>>         cmpb -1(%rsi,%rax), %cl
> >>>>>         jne .L1
> >>>>> .L3:
> >>>>>         movl %eax, %r8d
> >>>>>         cmpl %edx, %eax
> >>>>>         jb .L6
> >>>>>
> >>>>> but without the trick it is
> >>>>>
> >>>>> .L6:
> >>>>>         movzbl (%rdi,%rax), %r8d
> >>>>>         movzbl (%rsi,%rax), %ecx
> >>>>>         addq $1, %rax
> >>>>>         cmpb %cl, %r8b
> >>>>>         jne .L1
> >>>>> .L3:
> >>>>>         movl %eax, %r9d
> >>>>>         cmpl %edx, %eax
> >>>>>         jb .L6
> >>>>
> >>>> Verified this small piece of code on X86: there is no performance
> >>>> change with or without adjust_iv_update_pos (I've checked that the
> >>>> ASM is exactly the same as yours):
> >>>>
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>>>
> >>>> real    0m7.003s
> >>>> user    0m6.996s
> >>>> sys     0m0.000s
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
> >>>>
> >>>> real    0m7.070s
> >>>> user    0m7.068s
> >>>> sys     0m0.000s
> >>>>
> >>>> But for AArch64, current GCC generates similar code with or without
> >>>> adjust_iv_update_pos, and the runtime is 10.705s for both:
> >>>>
> >>>> .L6:
> >>>>         ldrb w4, [x6, x3]
> >>>>         add x3, x3, 1
> >>>>         ldrb w5, [x1, x3]
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         mov w0, w3
> >>>>         cmp w2, w3
> >>>>         bhi .L6
> >>>>
> >>>> No adjust_iv_update_pos:
> >>>>
> >>>> .L6:
> >>>>         ldrb w5, [x6, x3]
> >>>>         ldrb w4, [x1, x3]
> >>>>         add x3, x3, 1
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         mov w0, w3
> >>>>         cmp w2, w3
> >>>>         bhi .L6
> >>>>
> >>>> When built with an old GCC (gcc version 7.4.1 20190424), it generates
> >>>> worse code, and the runtime is 11.664s:
> >>>>
> >>>> .L6:
> >>>>         ldrb w4, [x6, x3]
> >>>>         add x3, x3, 1
> >>>>         add x5, x1, x3
> >>>>         ldrb w5, [x5, -1]
> >>>>         cmp w5, w4
> >>>>         bne .L1
> >>>> .L3:
> >>>>         cmp w3, w2
> >>>>         mov w0, w3
> >>>>         bcc .L6
> >>>>
> >>>> In general, only Power shows a performance regression with
> >>>> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> >>>> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> >>>> since it doesn't help performance on X86 or other targets?
> >>>
> >>> It was Bin who added the functionality - maybe he can remember the cases
> >>> he saw improvements for.
> >>>
> >>> I think we should instead try to change the code so that it doesn't
> >>> apply to CONDs where both operands are defined by stmts using the IV?
> >>> And this time add a testcase ;)
> >>
> >> Thanks.  I am not sure whether we should check both cond_lhs and
> >> cond_rhs, or whether checking just cond_lhs is enough if the COND
> >> expr's LHS is always processed first?
> >>
> >> For example, adjust_iv_update_pos will be called twice for this case;
> >> after the first call, the gimple will be updated to:
> >>
> >>   [local count: 1014686026]:
> >>   _1 = (sizetype) len_8;
> >>   _2 = ip_10(D) + _1;
> >>   _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
> >>   _5 = ref_12(D) + _1;
> >>   _6 = *_5;
> >>   ivtmp.15_16 = ivtmp.15_15 + 1;
> >>   if (_3 == _6)
> >>     goto ; [94.50%]
> >>   else
> >>     goto ; [5.50%]
> >>
> >> At this moment, use->stmt is "_6 = *_5;", which is not using the IV
> >> directly?
> >
> > So looking around I think you want to pass down the IV group from
> > rewrite_groups to rewrite_use_address and then see whether the
> > definition of the RHS is one of the IV USE stmts that are going to be
> > expressed using the same candidate.  In fact, I guess (to avoid
> > linear searching the group), we likely might want to restrict considering
> > adjust_iv_update_pos to IV candidates with a single [address] use?
> > That would be more restrictive of course (OTOH the list shouldn't be
> > too big - hopefully).  In case the other side of the conditional is
> > a memory using a different IV the replacement should still be OK.
> >
> > Richard.
>
> Thanks; however, those methods would still break the X86 optimization
> with worse ASM (one more movzbl), though no performance change was
> observed.  CCed the X86 maintainers to see which one they prefer, and
> whether a target hook is a better solution.
>
> GCC trunk:
>
> .L6:
>         movzbl (%rdi,%rax), %ecx
>         addq $1, %rax
>         cmpb -1(%rsi,%rax), %cl
>         jne .L1
> .L3:
>         movl %eax, %r8d
>         cmpl %edx, %eax
>         jb .L6
>
> With adjust_iv_update_pos disabled:
>
> .L6:
>         movzbl (%rdi,%rax), %r8d
>         movzbl (%rsi,%rax), %ecx
>         addq $1, %rax
>         cmpb %cl, %r8b
>         jne .L1
> .L3:
>         movl %eax, %r9d
>         cmpl %edx, %eax
>         jb .L6

Latency should be the same:

  movzbl mem reg   5
  cmp reg8 reg8    1
  total            6
vs
  cmp mem8 reg8    6

Guess they're the same at the uops level.

CC Uros, any opinion?

> --
> Thanks,
> Xionghu



--
BR,
Hongtao
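
A minimal sketch of the check Richard proposes above (either the
condition lhs or rhs is the def of the IV use, and the other operand is
loop-invariant), written against GCC's gimple API; the helper name and
its integration point in adjust_iv_update_pos are assumptions, not
actual GCC code:

  /* Return true if COND compares the value defined from this IV use
     (USE_DEF) against a loop-invariant operand, so that moving the IV
     update across COND cannot change the offset seen by a second
     IV-based memory access.  */
  static bool
  cond_uses_single_iv_def_p (class loop *loop, gcond *cond, tree use_def)
  {
    tree lhs = gimple_cond_lhs (cond);
    tree rhs = gimple_cond_rhs (cond);

    if (lhs == use_def)
      return expr_invariant_in_loop_p (loop, rhs);
    if (rhs == use_def)
      return expr_invariant_in_loop_p (loop, lhs);

    /* Neither operand is this IV use's def; be conservative and keep
       the IV update where it is.  */
    return false;
  }

In the mismatched-offset gimple quoted earlier, _86 is defined by
another IV-based load rather than being loop-invariant, so a check of
this shape would make adjust_iv_update_pos bail out and leave both
loads at the same offset, which is the behavior wanted on Power.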