Subject: Re: [PATCH] New hook adjust_iv_update_pos
From: Xionghu Luo
To: Richard Biener
Cc: GCC Patches, Segher Boessenkool, Bill Schmidt, davidxl@google.com, "H. J. Lu"
Date: Fri, 25 Jun 2021 21:12:29 +0800
Message-ID: <2e03a7f6-c24f-bff4-eff0-03d94b30c726@linux.ibm.com>
References: <20210625083101.2828805-1-luoxhu@linux.ibm.com>
List-Id: Gcc-patches mailing list
On 2021/6/25 18:02, Richard Biener wrote:
> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo wrote:
>>
>> On 2021/6/25 16:54, Richard Biener wrote:
>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches wrote:
>>>>
>>>> From: Xiong Hu Luo
>>>>
>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>> on Power. For example, it generates mismatched address offsets after
>>>> adjusting the iv update statement position:
>>>>
>>>>  [local count: 70988443]:
>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>   _34 = ref_180 + 18446744073709551615;
>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>   if (_84 == _86)
>>>>     goto ; [94.50%]
>>>>   else
>>>>     goto ; [5.50%]
>>>>
>>>> Disabling it produces:
>>>>
>>>>  [local count: 70988443]:
>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>   if (_84 == _86)
>>>>     goto ; [94.50%]
>>>>   else
>>>>     goto ; [5.50%]
>>>>
>>>> Then the later loop unroll pass can benefit from the same address
>>>> offset with different base addresses, reducing register dependencies.
>>>> This patch improves performance by 10% for a typical case on Power;
>>>> no performance change was observed on X86 or AArch64, since these
>>>> small loops are not unrolled on those platforms. Any comments?
>>>
>>> The case you quote is special in that if we hoisted the IV update before
>>> the other MEM _also_ used in the condition it would be fine again.
>>
>> Thanks.
>> I tried to hoist the IV update statement before the first MEM (Fix 2); it
>> shows even worse performance because the loop is not unrolled (two more
>> "base-1" statements are generated in gimple, so loop->ninsns is 11 and
>> small loops are not unrolled). Changing the threshold from 10 to 12 in
>> rs6000_loop_unroll_adjust makes it unroll 2 times as well, and then the
>> performance is the SAME as with the IV update statement in the *MIDDLE*
>> (trunk). From the ASM, we can see that the index register %r4 is used in
>> two iterations, which may be a bottleneck for hiding instruction latency?
>>
>> So it seems reasonable that performance is better if the IV update
>> statement is kept *LAST* (Fix 1).
>>
>> (Fix 2):
>>  [local count: 70988443]:
>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>   _34 = ip_229 + 18446744073709551615;
>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>   _33 = ref_180 + 18446744073709551615;
>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>>   if (_84 == _86)
>>     goto ; [94.50%]
>>   else
>>     goto ; [5.50%]
>>
>> .L67:
>>         lbzx %r12,%r24,%r4
>>         lbzx %r25,%r7,%r4
>>         cmpw %cr0,%r12,%r25
>>         bne %cr0,.L11
>>         mr %r26,%r4
>>         addi %r4,%r4,1
>>         lbzx %r12,%r24,%r4
>>         lbzx %r25,%r7,%r4
>>         mr %r6,%r26
>>         cmpw %cr0,%r12,%r25
>>         bne %cr0,.L11
>>         mr %r26,%r4
>> .L12:
>>         cmpdi %cr0,%r10,1
>>         addi %r4,%r26,1
>>         mr %r6,%r26
>>         addi %r10,%r10,-1
>>         bne %cr0,.L67
>>
>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>> condition actually uses the IV use stmt def, so it likely applies to
>>> too many cases.
>>>
>>> Unfortunately the introducing rev didn't come with a testcase,
>>> but still I think fixing up adjust_iv_update_pos is better than
>>> introducing a way to short-cut it per target decision.
>>>
>>> One "fix" might be to add a check that either the condition
>>> lhs or rhs is the def of the IV use and the other operand
>>> is invariant. Or if it's of similar structure hoist across the
>>> other iv-use as well. Not that I understand the argument
>>> about the overlapping life-range.
>>>
>>> You also don't provide a complete testcase ...
>>>
>>
>> Attached the test code; I will also add it to the patch in a future version.
>> The issue comes from a very small hot loop:
>>
>>   do {
>>     len++;
>>   } while (len < maxlen && ip[len] == ref[len]);
>
> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> {
>   unsigned int len = 2;
>   do {
>     len++;
>   } while (len < maxlen && ip[len] == ref[len]);
>   return len;
> }
>
> I can see the effect on this loop on x86_64 as well, we end up with
>
> .L6:
>         movzbl (%rdi,%rax), %ecx
>         addq $1, %rax
>         cmpb -1(%rsi,%rax), %cl
>         jne .L1
> .L3:
>         movl %eax, %r8d
>         cmpl %edx, %eax
>         jb .L6
>
> but without the trick it is
>
> .L6:
>         movzbl (%rdi,%rax), %r8d
>         movzbl (%rsi,%rax), %ecx
>         addq $1, %rax
>         cmpb %cl, %r8b
>         jne .L1
> .L3:
>         movl %eax, %r9d
>         cmpl %edx, %eax
>         jb .L6
>
> so here you can see the missed fusion. Of course
> in this case the IV update could have been sunk into
> the .L3 block and replicated on the exit edge as well.

It is not sunk by the recently added gimple sink2 pass; the reason is that
the latch block is an empty block, and removing this check would cause some
SPEC performance regressions.

tree-ssa-sink.c:

static bool
statement_sink_location (gimple *stmt, basic_block frombb,
                         gimple_stmt_iterator *togsi, bool *zero_uses_p)
{
  ...
  /* If the latch block is empty, don't make it non-empty by sinking
     something into it.  */
  if (sinkbb == frombb->loop_father->latch
      && empty_block_p (sinkbb))
    return false;
  ...
}

 [local count: 75120046]:
  # ivtmp.30_414 = PHI
  _418 = (unsigned int) ivtmp.30_414;
  if (maxlen_184 > _418)
    goto ; [94.50%]
  else
    goto ; [5.50%]

 [local count: 70988443]:
  _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
  _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
  ivtmp.30_415 = ivtmp.30_414 + 1;
  if (_84 == _86)
    goto ; [94.50%]
  else
    goto ; [5.50%]

 [local count: 3904364]:
  goto ; [100.00%]

 [local count: 67084079]:
  goto ; [100.00%]

> I'm not sure if the motivation for the change introducing this
> trick was the above kind of combination or not, but I guess
> so. The dependence distance of the IV increment to the
> use is now shorter, so I'm not sure the combined variant is
> better.
>
> Richard.
>
>>
>> --
>> Thanks,
>> Xionghu

--
Thanks,
Xionghu