From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: "Amker.Cheng" <amker.cheng@gmail.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	hongtao.liu@intel.com, Uros Bizjak <ubizjak@gmail.com>,
	"H. J. Lu" <hjl.tools@gmail.com>
Subject: Re: [PATCH] New hook adjust_iv_update_pos
Date: Mon, 12 Jul 2021 16:13:25 +0800
Message-ID: <04501d24-ffa5-aaf3-3291-fc95ce3b37d8@linux.ibm.com>
In-Reply-To: <CAFiYyc2bxSpq=RBVmqLCGvYsPoYuRHXKFJpkspt8Mh7g-jEmGA@mail.gmail.com>



On 2021/7/7 21:20, Richard Biener wrote:
> On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/6/28 16:25, Richard Biener wrote:
>>> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2021/6/25 18:02, Richard Biener wrote:
>>>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2021/6/25 16:54, Richard Biener wrote:
>>>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>>>
>>>>>>>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>>>>>>>
>>>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>>>>>> on Power.  For example, it generates mismatched address offsets after
>>>>>>>> adjusting the position of the IV update statement:
>>>>>>>>
>>>>>>>> <bb 32> [local count: 70988443]:
>>>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>>>> _34 = ref_180 + 18446744073709551615;
>>>>>>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>>>>> if (_84 == _86)
>>>>>>>>   goto <bb 56>; [94.50%]
>>>>>>>> else
>>>>>>>>   goto <bb 87>; [5.50%]
>>>>>>>>
>>>>>>>> Disabling it produces:
>>>>>>>>
>>>>>>>> <bb 32> [local count: 70988443]:
>>>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>>>> if (_84 == _86)
>>>>>>>>   goto <bb 56>; [94.50%]
>>>>>>>> else
>>>>>>>>   goto <bb 87>; [5.50%]
>>>>>>>>
>>>>>>>> Then the later loop-unrolling pass can benefit from the same address
>>>>>>>> offset with different base addresses, which reduces register dependency.
>>>>>>>> This patch improves performance by 10% for a typical case on Power;
>>>>>>>> no performance change was observed for X86 or AArch64, since small
>>>>>>>> loops are not unrolled on those platforms.  Any comments?
>>>>>>>
>>>>>>> The case you quote is special in that if we hoisted the IV update before
>>>>>>> the other MEM _also_ used in the condition it would be fine again.
>>>>>>
>>>>>> Thanks.  I tried hoisting the IV update statement before the first MEM (Fix 2);
>>>>>> it shows even worse performance because the loop is not unrolled (two more
>>>>>> "base-1" statements are generated in gimple, so loop->ninsns is 11 and small
>>>>>> loops are not unrolled).  Changing the threshold from 10 to 12 in
>>>>>> rs6000_loop_unroll_adjust makes it unroll 2 times as well, but then the
>>>>>> performance is the SAME as with the IV update statement in the *MIDDLE* (trunk).
>>>>>> From the ASM, we can see the index register %r4 is used in two iterations,
>>>>>> which may be a bottleneck for hiding instruction latency.
>>>>>>
>>>>>> Then it seems reasonable that performance would be better if the IV update
>>>>>> statement is kept *LAST* (Fix 1).
>>>>>>
>>>>>> (Fix 2):
>>>>>>      <bb 32> [local count: 70988443]:
>>>>>>      ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>>      _34 = ip_229 + 18446744073709551615;
>>>>>>      _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>>>      _33 = ref_180 + 18446744073709551615;
>>>>>>      _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>>>>>>      if (_84 == _86)
>>>>>>        goto <bb 56>; [94.50%]
>>>>>>      else
>>>>>>        goto <bb 87>; [5.50%]
>>>>>>
>>>>>>
>>>>>> .L67:
>>>>>>            lbzx %r12,%r24,%r4
>>>>>>            lbzx %r25,%r7,%r4
>>>>>>            cmpw %cr0,%r12,%r25
>>>>>>            bne %cr0,.L11
>>>>>>            mr %r26,%r4
>>>>>>            addi %r4,%r4,1
>>>>>>            lbzx %r12,%r24,%r4
>>>>>>            lbzx %r25,%r7,%r4
>>>>>>            mr %r6,%r26
>>>>>>            cmpw %cr0,%r12,%r25
>>>>>>            bne %cr0,.L11
>>>>>>            mr %r26,%r4
>>>>>> .L12:
>>>>>>            cmpdi %cr0,%r10,1
>>>>>>            addi %r4,%r26,1
>>>>>>            mr %r6,%r26
>>>>>>            addi %r10,%r10,-1
>>>>>>            bne %cr0,.L67
>>>>>>
>>>>>>>
>>>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>>>>>> condition actually uses the IV use stmt def, so it likely applies to
>>>>>>> too many cases.
>>>>>>>
>>>>>>> Unfortunately the introducing rev didn't come with a testcase,
>>>>>>> but still I think fixing up adjust_iv_update_pos is better than
>>>>>>> introducing a way to short-cut it per target decision.
>>>>>>>
>>>>>>> One "fix" might be to add a check that either the condition
>>>>>>> lhs or rhs is the def of the IV use and the other operand
>>>>>>> is invariant.  Or if it's of similar structure hoist across the
>>>>>>> other iv-use as well.  Not that I understand the argument
>>>>>>> about the overlapping life-range.
>>>>>>>
>>>>>>> You also don't provide a complete testcase ...
>>>>>>>
>>>>>>
>>>>>> Attached the test code; I will also add it to the patch in a future version.
>>>>>> The issue comes from a very small hot loop:
>>>>>>
>>>>>>            do {
>>>>>>              len++;
>>>>>>            } while(len < maxlen && ip[len] == ref[len]);
>>>>>
>>>>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
>>>>> {
>>>>>      unsigned int len = 2;
>>>>>      do {
>>>>>          len++;
>>>>>      }while(len < maxlen && ip[len] == ref[len]);
>>>>>      return len;
>>>>> }
>>>>>
>>>>> I can see the effect on this loop on x86_64 as well; we end up with
>>>>>
>>>>> .L6:
>>>>>            movzbl  (%rdi,%rax), %ecx
>>>>>            addq    $1, %rax
>>>>>            cmpb    -1(%rsi,%rax), %cl
>>>>>            jne     .L1
>>>>> .L3:
>>>>>            movl    %eax, %r8d
>>>>>            cmpl    %edx, %eax
>>>>>            jb      .L6
>>>>>
>>>>> but without the trick it is
>>>>>
>>>>> .L6:
>>>>>            movzbl  (%rdi,%rax), %r8d
>>>>>            movzbl  (%rsi,%rax), %ecx
>>>>>            addq    $1, %rax
>>>>>            cmpb    %cl, %r8b
>>>>>            jne     .L1
>>>>> .L3:
>>>>>            movl    %eax, %r9d
>>>>>            cmpl    %edx, %eax
>>>>>            jb      .L6
>>>>
>>>> I verified this small piece of code on X86; there is no performance
>>>> change with or without adjust_iv_update_pos (I've checked that the ASM
>>>> is exactly the same as yours):
>>>>
>>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
>>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>>>>
>>>> real    0m7.003s
>>>> user    0m6.996s
>>>> sys     0m0.000s
>>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
>>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>>>>
>>>> real    0m7.070s
>>>> user    0m7.068s
>>>> sys     0m0.000s
>>>>
>>>>
>>>> But for AArch64, current GCC also generates similar code with or
>>>> without adjust_iv_update_pos; the runtime is 10.705s for both.
>>>>
>>>> With adjust_iv_update_pos:
>>>>
>>>> .L6:
>>>>           ldrb    w4, [x6, x3]
>>>>           add     x3, x3, 1
>>>>           ldrb    w5, [x1, x3]
>>>>           cmp     w5, w4
>>>>           bne     .L1
>>>> .L3:
>>>>           mov     w0, w3
>>>>           cmp     w2, w3
>>>>           bhi     .L6
>>>>
>>>> No adjust_iv_update_pos:
>>>>
>>>> .L6:
>>>>           ldrb    w5, [x6, x3]
>>>>           ldrb    w4, [x1, x3]
>>>>           add     x3, x3, 1
>>>>           cmp     w5, w4
>>>>           bne     .L1
>>>> .L3:
>>>>           mov     w0, w3
>>>>           cmp     w2, w3
>>>>           bhi     .L6
>>>>
>>>>
>>>> But when built with an old GCC (gcc version 7.4.1 20190424), it generates
>>>> worse code, and the runtime is 11.664s:
>>>>
>>>> .L6:
>>>>           ldrb    w4, [x6, x3]
>>>>           add     x3, x3, 1
>>>>           add     x5, x1, x3
>>>>           ldrb    w5, [x5, -1]
>>>>           cmp     w5, w4
>>>>           bne     .L1
>>>> .L3:
>>>>           cmp     w3, w2
>>>>           mov     w0, w3
>>>>           bcc     .L6
>>>>
>>>>
>>>> In general, only Power shows a performance regression with adjust_iv_update_pos;
>>>> that's why I tried to add a target hook for it.  Is this reasonable?  Or should
>>>> we just remove adjust_iv_update_pos, since it doesn't help performance for X86
>>>> or other targets?
>>>
>>> It was Bin who added the functionality - maybe he can remember the cases
>>> he saw improvements for.
>>>
>>> I think we should instead try to change the code so that it doesn't apply to
>>> CONDs where both operands are defined by stmts using the IV?
>>> And this time add a testcase ;)
>>
>> Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs,
>> or whether checking just cond_lhs is enough if the COND expr's LHS is
>> always processed first?
>>
>> For example, adjust_iv_update_pos will be called twice for this case,
>> after the first call, the gimple will be updated to:
>>
>> <bb 4> [local count: 1014686026]:
>> _1 = (sizetype) len_8;
>> _2 = ip_10(D) + _1;
>> _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
>> _5 = ref_12(D) + _1;
>> _6 = *_5;
>> ivtmp.15_16 = ivtmp.15_15 + 1;
>> if (_3 == _6)
>>    goto <bb 6>; [94.50%]
>> else
>>    goto <bb 10>; [5.50%]
>>
>>
>> At this moment, use->stmt is "_6 = *_5;", so it is not using the IV directly?
> 
> So looking around I think you want to pass down the IV group from
> rewrite_groups to rewrite_use_address and then see whether the
> definition of the RHS is one of the IV USE stmts that are going to be
> expressed using the same candidate.  In fact, I guess (to avoid
> linearly searching the group), we might want to restrict considering
> adjust_iv_update_pos to IV candidates with a single [address] use?
> That would be more restrictive of course (OTOH the list shouldn't be
> too big - hopefully).  In case the other side of the conditional is
> a memory using a different IV the replacement should still be OK.
> 
> Richard.

Thanks; however, those methods would still break the X86 optimization,
producing worse ASM (one more movzbl), though no performance change was
observed.  I have CCed the X86 maintainers to see which approach they
prefer, and whether a target hook is a better solution.  A rough sketch
of such a hook follows, and the ASM comparison is shown after it.
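
For concreteness, here is a minimal sketch of what such a hook could
look like in target.def.  This is only an illustration: the hook name
follows the subject line of this thread, but the signature,
documentation string, and default (hook_bool_void_true) are my
assumptions, not the actual patch text.

/* target.def -- hedged sketch, not the actual patch.  */
DEFHOOK
(adjust_iv_update_pos,
 "Return true if ivopts may move an induction variable update\n\
statement to just before the loop's exit test.  A target can return\n\
false when keeping the update last helps later passes, e.g. so that\n\
unrolled iterations see the same address offset with different base\n\
addresses.",
 bool, (void),
 hook_bool_void_true)

A target such as rs6000 would then override it with a hook returning
false, leaving the default behavior unchanged everywhere else.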


GCC trunk:

.L6:
          movzbl  (%rdi,%rax), %ecx
          addq    $1, %rax
          cmpb    -1(%rsi,%rax), %cl
          jne     .L1
.L3:
          movl    %eax, %r8d
          cmpl    %edx, %eax
          jb      .L6

Disable adjust_iv_update_pos:

.L6:
          movzbl  (%rdi,%rax), %r8d
          movzbl  (%rsi,%rax), %ecx
          addq    $1, %rax
          cmpb    %cl, %r8b
          jne     .L1
.L3:
          movl    %eax, %r9d
          cmpl    %edx, %eax
          jb      .L6
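
Coming back to the check Richard suggested upthread (only move the IV
update when the condition actually consumes the def of this IV use and
the other operand is invariant), a rough, untested sketch could look as
follows; the helper name is my own, and the exact predicates are
assumptions rather than a verified patch against tree-ssa-loop-ivopts.c:

/* Hedged sketch: allow adjust_iv_update_pos only when one operand of
   the GIMPLE_COND is the value defined by this IV use's statement and
   the other operand is invariant, so the IV update is never hoisted
   across a second memory access that also uses the IV.  */
static bool
cond_ok_for_iv_update_move_p (gcond *cond, gimple *use_stmt)
{
  tree def = gimple_get_lhs (use_stmt);
  if (!def)
    return false;

  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);
  if (lhs != def && rhs != def)
    return false;

  /* The other side must not depend on another stmt in this block.  */
  tree other = (lhs == def) ? rhs : lhs;
  return is_gimple_min_invariant (other);
}

This would reject the lzf-style loop above, where both sides of the
compare are loads indexed by the same IV, which is exactly the case
that regresses on Power.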



-- 
Thanks,
Xionghu
