On 2021/6/25 16:54, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>
>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>> on Power.  For example, it generates mismatched address offset after
>> adjust iv update statement position:
>>
>> <bb 32> [local count: 70988443]:
>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>> ivtmp.30_415 = ivtmp.30_414 + 1;
>> _34 = ref_180 + 18446744073709551615;
>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>> if (_84 == _86)
>>    goto <bb 56>; [94.50%]
>>    else
>>    goto <bb 87>; [5.50%]
>>
>> Disable it will produce:
>>
>> <bb 32> [local count: 70988443]:
>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>> ivtmp.30_415 = ivtmp.30_414 + 1;
>> if (_84 == _86)
>>    goto <bb 56>; [94.50%]
>>    else
>>    goto <bb 87>; [5.50%]
>>
>> Then later pass loop unroll could benefit from same address offset
>> with different base address and reduces register dependency.
>> This patch could improve performance by 10% for typical case on Power,
>> no performance change observed for X86 or Aarch64 due to small loops
>> not unrolled on these platforms.  Any comments?
> 
> The case you quote is special in that if we hoisted the IV update before
> the other MEM _also_ used in the condition it would be fine again.

Thanks.  I tried to hoist the IV update statement before the first MEM (Fix 2), it
shows even worse performance due to not unroll(two more "base-1" is generated in gimple,
then loop->ninsns is 11 so small loops is not unrolled), change the threshold from
10 to 12 in rs6000_loop_unroll_adjust would make it also unroll 2 times, the
performance is SAME to the one that IV update statement in the *MIDDLE* (trunk). 
From the ASM, we can see the index register %r4 is used in two iterations which
maybe a bottle neck for hiding instruction latency?

Then it seems reasonable the performance would be better if keep the IV update
statement at *LAST* (Fix 1).

(Fix 2):
  <bb 32> [local count: 70988443]:
  ivtmp.30_415 = ivtmp.30_414 + 1;
  _34 = ip_229 + 18446744073709551615;
  _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
  _33 = ref_180 + 18446744073709551615;
  _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
  if (_84 == _86)
    goto <bb 56>; [94.50%]
  else
    goto <bb 87>; [5.50%]


.L67:
        lbzx %r12,%r24,%r4
        lbzx %r25,%r7,%r4
        cmpw %cr0,%r12,%r25
        bne %cr0,.L11
        mr %r26,%r4
        addi %r4,%r4,1
        lbzx %r12,%r24,%r4
        lbzx %r25,%r7,%r4
        mr %r6,%r26
        cmpw %cr0,%r12,%r25
        bne %cr0,.L11
        mr %r26,%r4
.L12:
        cmpdi %cr0,%r10,1
        addi %r4,%r26,1
        mr %r6,%r26
        addi %r10,%r10,-1
        bne %cr0,.L67

> 
> Now, adjust_iv_update_pos doesn't seem to check that the
> condition actually uses the IV use stmt def, so it likely applies to
> too many cases.
> 
> Unfortunately the introducing rev didn't come with a testcase,
> but still I think fixing up adjust_iv_update_pos is better than
> introducing a way to short-cut it per target decision.
> 
> One "fix" might be to add a check that either the condition
> lhs or rhs is the def of the IV use and the other operand
> is invariant.  Or if it's of similar structure hoist across the
> other iv-use as well.  Not that I understand the argument
> about the overlapping life-range.
> 
> You also don't provide a complete testcase ...
> 

Attached the test code, will also add it it patch in future version.
The issue comes from a very small hot loop:

	do {
	  len++;
	} while(len < maxlen && ip[len] == ref[len]);


-- 
Thanks,
Xionghu