From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=tvmz=FE=linux.ibm.com=linkw@sourceware.org>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1])
	by sourceware.org (Postfix) with ESMTPS id F28FF3858D20
	for <gcc-patches@gcc.gnu.org>; Wed, 20 Sep 2023 02:40:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F28FF3858D20
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com
Received: from pps.filterd (m0353729.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 38K2cKXb026233;
	Wed, 20 Sep 2023 02:40:37 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=message-id : date :
 mime-version : subject : to : references : cc : from : in-reply-to :
 content-type : content-transfer-encoding; s=pp1;
 bh=QMCK6nzzINfaHMZMQSqL7STjHIedV4CiTSMbaWkbnpo=;
 b=FtkZ7IcHJxI08CLolErrkAro03dJWXZv2vXVQCr7q0yIl7yJ6N9duopXhRH4rW5gveyF
 ACBTp6I54L0+quXCgBy/TMk8eeP93TJ8jMyqJUIABgx07ORfvAtjHeVCeLl+1Uicr5xv
 IJaF4EdT0zy/hfOyGmNJ0EEyXCi3FYNcr1hD4sWzd9IgS1EW+Fsb6MSfn6t0nBh7S2uX
 LfEAgoRvvQ+ZyzISvlijOhDjVSNfQbQpHcaP8FW9XHAbW7dbnah2jp54FV0Ndgnxqwdl
 OicolxOhFRFVtrtUrbRLn+xzh/MlEEvChTy8rm08t5POH2nJ/+cxp/MydH9v+AQhhUAG sg== 
Received: from pps.reinject (localhost [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3t7qwt06k7-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Wed, 20 Sep 2023 02:40:36 +0000
Received: from m0353729.ppops.net (m0353729.ppops.net [127.0.0.1])
	by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 38K2d5bX028551;
	Wed, 20 Sep 2023 02:40:36 GMT
Received: from ppma21.wdc07v.mail.ibm.com (5b.69.3da9.ip4.static.sl-reverse.com [169.61.105.91])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3t7qwt06jv-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Wed, 20 Sep 2023 02:40:36 +0000
Received: from pps.filterd (ppma21.wdc07v.mail.ibm.com [127.0.0.1])
	by ppma21.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 38K1L5qe011656;
	Wed, 20 Sep 2023 02:40:35 GMT
Received: from smtprelay07.fra02v.mail.ibm.com ([9.218.2.229])
	by ppma21.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3t5qpnhy1b-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Wed, 20 Sep 2023 02:40:35 +0000
Received: from smtpav03.fra02v.mail.ibm.com (smtpav03.fra02v.mail.ibm.com [10.20.54.102])
	by smtprelay07.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 38K2eXES60096868
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Wed, 20 Sep 2023 02:40:33 GMT
Received: from smtpav03.fra02v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 0854F2004D;
	Wed, 20 Sep 2023 02:40:33 +0000 (GMT)
Received: from smtpav03.fra02v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id AAB9120040;
	Wed, 20 Sep 2023 02:40:31 +0000 (GMT)
Received: from [9.197.231.47] (unknown [9.197.231.47])
	by smtpav03.fra02v.mail.ibm.com (Postfix) with ESMTP;
	Wed, 20 Sep 2023 02:40:31 +0000 (GMT)
Message-ID: <a429abb5-8acb-cf40-1327-0a8ab68de820@linux.ibm.com>
Date: Wed, 20 Sep 2023 10:40:30 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.6.1
Subject: Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in
 aarch64_stp_sequence_cost
Content-Language: en-US
To: Richard Biener <richard.guenther@gmail.com>, richard.sandiford@arm.com
References: <cover.1694657494.git.linkw@linux.ibm.com>
 <bc85799abb2616dcac511424a1b50b57e48c2556.1694657494.git.linkw@linux.ibm.com>
 <mpt4jjrkh4p.fsf@arm.com>
 <CAFiYyc0B=R4SGA1Hs1D9pF4TGQbFeUFxNBre8c4mgin1kFoTwg@mail.gmail.com>
Cc: gcc-patches@gcc.gnu.org
From: "Kewen.Lin" <linkw@linux.ibm.com>
In-Reply-To: <CAFiYyc0B=R4SGA1Hs1D9pF4TGQbFeUFxNBre8c4mgin1kFoTwg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: 3TIGbUi6ZUHB1v-_oc3D7msMT1td-qoz
X-Proofpoint-ORIG-GUID: _iyqchJeSCG2CA7fp_kAS00YYEcMyQVP
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.267,Aquarius:18.0.980,Hydra:6.0.601,FMLib:17.11.176.26
 definitions=2023-09-20_02,2023-09-19_01,2023-05-22_02
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 mlxlogscore=892
 bulkscore=0 phishscore=0 impostorscore=0 suspectscore=0 clxscore=1015
 mlxscore=0 lowpriorityscore=0 adultscore=0 malwarescore=0
 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2308100000 definitions=main-2309200019
X-Spam-Status: No, score=-12.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,GIT_PATCH_0,KAM_SHORT,NICE_REPLY_A,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hi,

on 2023/9/18 16:53, Richard Biener wrote:
> On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Kewen Lin <linkw@linux.ibm.com> writes:
>>> This costing adjustment patch series exposes one issue in
>>> aarch64 specific costing adjustment for STP sequence.  It
>>> causes the below test cases to fail:
>>>
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>>>
>>> Take the below function extracted from ldp_stp_15.c as
>>> example:
>>>
>>> void
>>> dup_8_int32_t (int32_t *x, int32_t val)
>>> {
>>>     for (int i = 0; i < 8; ++i)
>>>           x[i] = val;
>>> }
>>>
>>> Without my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>>>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>>>
>>> then the final vector cost is 3.
>>>
>>> With my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>>>
>>> but the final vector cost is 17.  The unaligned_store count is
>>> actually unchanged, but the final vector costs become different,
>>> it's because the below aarch64 special handling makes the
>>> different costs:
>>>
>>>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>>>   if (m_stp_sequence_cost != ~0U)
>>>     {
>>>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>>>                                                stmt_info, vectype);
>>>       m_stp_sequence_cost = MIN (m._stp_sequence_cost + cost, ~0U);
>>>     }
>>>
>>> For the former, since the count is 2, function
>>> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
>>> While for the latter, it's separated into twice calls with
>>> count 1, aarch64_stp_sequence_cost returns 2 for each time,
>>> so it returns 4 in total.
>>>
>>> For this case, the stmt with scalar_to_vec also contributes
>>> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
>>> are 6 (2+4) vs. 8 (4+4).
>>>
>>> Considering scalar_costs->m_stp_sequence_cost is 8 and below
>>> checking and re-assigning:
>>>
>>>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>>>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>>>
>>> For the former, the body cost of vector isn't changed; but
>>> for the latter, the body cost of vector is double of scalar
>>> cost which is 8 for this case, then it becomes 16 which is
>>> bigger than what we expect.
>>>
>>> I'm not sure why it adopts CEIL for the return value for
>>> case unaligned_store in function aarch64_stp_sequence_cost,
>>> but I tried to modify it with "return count;" (as it can
>>> get back to previous cost), there is no failures exposed
>>> in regression testing.  I expected that if the previous
>>> unaligned_store count is even, this adjustment doesn't
>>> change anything, if it's odd, the adjustment may reduce
>>> it by one, but I'd guess it would be few.  Besides, as
>>> the comments for m_stp_sequence_cost, the current
>>> handlings seems temporary, maybe a tweak like this can be
>>> accepted, so I posted this RFC/PATCH to request comments.
>>> this one line change is considered.
>>
>> It's unfortunate that doing this didn't show up a regression.
>> I guess it's not a change we explicitly added tests to guard against.
>>
>> But the point of the condition is to estimate how many single stores
>> (STRs) and how many paired stores (STPs) would be generated.  As far
>> as this heuristic goes, STP (storing two values) is as cheap as STR
>> (storing only one value).  So the point of the CEIL is to count 1 store
>> as having equal cost to 2, 3 as having equal cost to 4, etc.
>>

Thanks for the explanation and ...

>> For a heuristic like that, costing a vector stmt once with count 2
>> is different from costing 2 vector stmts with count 1.  The former
>> makes it obvious that the 2 vector stmts are associated with the
>> same scalar stmt, and are highly likely to be consecutive.  The latter
>> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>>
>> ISTM that costing once with count N provides strictly more information
>> to targets than costing N time with count 1.  Is there no way we can
>> keep the current behaviour?  E.g. rather than costing a stmt immediately
>> within a loop, could we just increment a counter and cost once at the end?
> 
> I suppose we can.  But isn't the logic currently (or before the series) cheated
> for variable-strided stores with ncopies > 1?  That is, while it sounds like
> reasonable heuristics you can't really rely on this as the vectorizer doesn't
> currently provide the info whether two vector loads/stores are adjacent?
> 
> Making sure we only pass count > 1 for adjacent load/store would be possible
> though.  We should document this with comments though.

... the suggestion!

This sounds applied for both load and store, if the others in this series look
fine, I guess we can file a bug for the exposed test cases first then fix it
with a follow-up patch for both load and store.

BR,
Kewen

> 
> Richard.
> 
>>
>> Thanks,
>> Richard
>>
>>> gcc/ChangeLog:
>>>
>>>       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>>>       count directly instead of the adjusted value computed with CEIL.
>>> ---
>>>  gcc/config/aarch64/aarch64.cc | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 37d414021ca..9fb4fbd883d 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>>>         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>>>           return count * 2;
>>>       }
>>> -      return CEIL (count, 2) * 2;
>>> +      return count;
>>>
>>>      case scalar_store:
>>>        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info)