From: Xionghu Luo <luoxhu@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
Segher Boessenkool <segher@kernel.crashing.org>,
Bill Schmidt <wschmidt@linux.ibm.com>,
davidxl@google.com, David Edelsohn <dje.gcc@gmail.com>,
"H. J. Lu" <hjl.tools@gmail.com>
Subject: Re: [PATCH] New hook adjust_iv_update_pos
Date: Fri, 25 Jun 2021 20:58:10 +0800 [thread overview]
Message-ID: <91c8d741-6b50-1bc1-5823-5ab2bc5e6922@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc2wyEm8=THL2u_Kot1EnkMaT6-zKNDM+nBb_k75WsTiSA@mail.gmail.com>
+cc Xinliang David Li since he introduced the function
adjust_iv_update_pos. Looking forward to your input. :)
On 2021/6/25 18:02, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/6/25 16:54, Richard Biener wrote:
>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>>>
>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>> on Power. For example, it generates mismatched address offsets after
>>>> adjusting the IV update statement position:
>>>>
>>>> <bb 32> [local count: 70988443]:
>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>> _34 = ref_180 + 18446744073709551615;
>>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>> if (_84 == _86)
>>>> goto <bb 56>; [94.50%]
>>>> else
>>>> goto <bb 87>; [5.50%]
>>>>
>>>> Disabling it produces:
>>>>
>>>> <bb 32> [local count: 70988443]:
>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>> if (_84 == _86)
>>>> goto <bb 56>; [94.50%]
>>>> else
>>>> goto <bb 87>; [5.50%]
>>>>
>>>> The later loop-unrolling pass can then benefit from the same address
>>>> offset with different base addresses, which reduces register
>>>> dependencies. This patch improves performance by 10% for a typical
>>>> case on Power; no performance change was observed on x86 or AArch64,
>>>> since the small loops are not unrolled on those platforms. Any comments?
>>>
>>> The case you quote is special in that if we hoisted the IV update before
>>> the other MEM _also_ used in the condition it would be fine again.
>>
>> Thanks. I tried hoisting the IV update statement before the first MEM
>> (Fix 2); it shows even worse performance because the loop is not
>> unrolled (two more "base-1" statements are generated in GIMPLE, so
>> loop->ninsns is 11 and the small-loop unrolling threshold is not met).
>> Changing the threshold from 10 to 12 in rs6000_loop_unroll_adjust makes
>> it unroll 2 times as well, but the performance is the SAME as with the
>> IV update statement in the *MIDDLE* (trunk). From the ASM, we can see
>> the index register %r4 is used in two iterations, which may be a
>> bottleneck for hiding instruction latency.
>>
>> It therefore seems reasonable that performance is better if the IV
>> update statement is kept *LAST* (Fix 1).
>>
>> (Fix 2):
>> <bb 32> [local count: 70988443]:
>> ivtmp.30_415 = ivtmp.30_414 + 1;
>> _34 = ip_229 + 18446744073709551615;
>> _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>> _33 = ref_180 + 18446744073709551615;
>> _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>> if (_84 == _86)
>> goto <bb 56>; [94.50%]
>> else
>> goto <bb 87>; [5.50%]
>>
>>
>> .L67:
>> lbzx %r12,%r24,%r4
>> lbzx %r25,%r7,%r4
>> cmpw %cr0,%r12,%r25
>> bne %cr0,.L11
>> mr %r26,%r4
>> addi %r4,%r4,1
>> lbzx %r12,%r24,%r4
>> lbzx %r25,%r7,%r4
>> mr %r6,%r26
>> cmpw %cr0,%r12,%r25
>> bne %cr0,.L11
>> mr %r26,%r4
>> .L12:
>> cmpdi %cr0,%r10,1
>> addi %r4,%r26,1
>> mr %r6,%r26
>> addi %r10,%r10,-1
>> bne %cr0,.L67
>>
>>>
>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>> condition actually uses the IV use stmt def, so it likely applies to
>>> too many cases.
>>>
>>> Unfortunately the introducing rev didn't come with a testcase,
>>> but still I think fixing up adjust_iv_update_pos is better than
>>> introducing a way to short-cut it per target decision.
>>>
>>> One "fix" might be to add a check that either the condition
>>> lhs or rhs is the def of the IV use and the other operand
>>> is invariant. Or if it's of similar structure hoist across the
>>> other iv-use as well. Not that I understand the argument
>>> about the overlapping life-range.
>>>
>>> You also don't provide a complete testcase ...
>>>
>>
>> Attached the test code; I will also add it to the patch in a future version.
>> The issue comes from a very small hot loop:
>>
>> do {
>> len++;
>> } while(len < maxlen && ip[len] == ref[len]);
>
> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> {
> unsigned int len = 2;
> do {
> len++;
> }while(len < maxlen && ip[len] == ref[len]);
> return len;
> }
>
> I can see the effect on this loop on x86_64 as well, we end up with
>
> .L6:
> movzbl (%rdi,%rax), %ecx
> addq $1, %rax
> cmpb -1(%rsi,%rax), %cl
> jne .L1
> .L3:
> movl %eax, %r8d
> cmpl %edx, %eax
> jb .L6
>
> but without the trick it is
>
> .L6:
> movzbl (%rdi,%rax), %r8d
> movzbl (%rsi,%rax), %ecx
> addq $1, %rax
> cmpb %cl, %r8b
> jne .L1
> .L3:
> movl %eax, %r9d
> cmpl %edx, %eax
> jb .L6
>
> so here you can see the missed fusion. Of course
> in this case the IV update could have been sunk into
> the .L3 block and replicated on the exit edge as well.
>
> I'm not sure if the motivation for the change introducing this
> trick was the above kind of combination or not, but I guess
> so. The dependence distance of the IV increment to the
> use is now shorter, so I'm not sure the combined variant is
> better.
>
> Richard.
>
>
>>
>> --
>> Thanks,
>> Xionghu
--
Thanks,
Xionghu
Thread overview: 14+ messages
2021-06-25 8:31 Xionghu Luo
2021-06-25 8:54 ` Richard Biener
2021-06-25 9:41 ` Xionghu Luo
2021-06-25 10:02 ` Richard Biener
2021-06-25 12:58 ` Xionghu Luo [this message]
2021-06-25 13:12 ` Xionghu Luo
2021-06-28 8:07 ` Xionghu Luo
2021-06-28 8:25 ` Richard Biener
2021-06-29 9:19 ` Xionghu Luo
2021-07-07 13:20 ` Richard Biener
2021-07-12 8:13 ` Xionghu Luo
2021-07-12 9:05 ` Hongtao Liu
2021-07-12 9:49 ` Richard Biener
2021-07-07 5:51 ` Bin.Cheng