On 2021/6/25 16:54, Richard Biener wrote: > On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches > wrote: >> >> From: Xiong Hu Luo >> >> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance >> on Power. For example, it generates mismatched address offset after >> adjust iv update statement position: >> >> [local count: 70988443]: >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1]; >> ivtmp.30_415 = ivtmp.30_414 + 1; >> _34 = ref_180 + 18446744073709551615; >> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1]; >> if (_84 == _86) >> goto ; [94.50%] >> else >> goto ; [5.50%] >> >> Disable it will produce: >> >> [local count: 70988443]: >> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1]; >> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1]; >> ivtmp.30_415 = ivtmp.30_414 + 1; >> if (_84 == _86) >> goto ; [94.50%] >> else >> goto ; [5.50%] >> >> Then later pass loop unroll could benefit from same address offset >> with different base address and reduces register dependency. >> This patch could improve performance by 10% for typical case on Power, >> no performance change observed for X86 or Aarch64 due to small loops >> not unrolled on these platforms. Any comments? > > The case you quote is special in that if we hoisted the IV update before > the other MEM _also_ used in the condition it would be fine again. Thanks. I tried to hoist the IV update statement before the first MEM (Fix 2), it shows even worse performance due to not unroll(two more "base-1" is generated in gimple, then loop->ninsns is 11 so small loops is not unrolled), change the threshold from 10 to 12 in rs6000_loop_unroll_adjust would make it also unroll 2 times, the performance is SAME to the one that IV update statement in the *MIDDLE* (trunk). From the ASM, we can see the index register %r4 is used in two iterations which maybe a bottle neck for hiding instruction latency? Then it seems reasonable the performance would be better if keep the IV update statement at *LAST* (Fix 1). (Fix 2): [local count: 70988443]: ivtmp.30_415 = ivtmp.30_414 + 1; _34 = ip_229 + 18446744073709551615; _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1]; _33 = ref_180 + 18446744073709551615; _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1]; if (_84 == _86) goto ; [94.50%] else goto ; [5.50%] .L67: lbzx %r12,%r24,%r4 lbzx %r25,%r7,%r4 cmpw %cr0,%r12,%r25 bne %cr0,.L11 mr %r26,%r4 addi %r4,%r4,1 lbzx %r12,%r24,%r4 lbzx %r25,%r7,%r4 mr %r6,%r26 cmpw %cr0,%r12,%r25 bne %cr0,.L11 mr %r26,%r4 .L12: cmpdi %cr0,%r10,1 addi %r4,%r26,1 mr %r6,%r26 addi %r10,%r10,-1 bne %cr0,.L67 > > Now, adjust_iv_update_pos doesn't seem to check that the > condition actually uses the IV use stmt def, so it likely applies to > too many cases. > > Unfortunately the introducing rev didn't come with a testcase, > but still I think fixing up adjust_iv_update_pos is better than > introducing a way to short-cut it per target decision. > > One "fix" might be to add a check that either the condition > lhs or rhs is the def of the IV use and the other operand > is invariant. Or if it's of similar structure hoist across the > other iv-use as well. Not that I understand the argument > about the overlapping life-range. > > You also don't provide a complete testcase ... > Attached the test code, will also add it it patch in future version. The issue comes from a very small hot loop: do { len++; } while(len < maxlen && ip[len] == ref[len]); -- Thanks, Xionghu