From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id AF7193858C60
 for <gcc-patches@gcc.gnu.org>; Tue, 21 Sep 2021 03:24:19 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AF7193858C60
Received: from pps.filterd (m0098404.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 18L3FeqX018806; 
 Mon, 20 Sep 2021 23:24:18 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3b77b683s2-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 20 Sep 2021 23:24:18 -0400
Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 18L3G80F021976;
 Mon, 20 Sep 2021 23:24:17 -0400
Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.102])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3b77b683rj-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 20 Sep 2021 23:24:17 -0400
Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1])
 by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 18L3DH3I006943;
 Tue, 21 Sep 2021 03:24:15 GMT
Received: from b06cxnps4075.portsmouth.uk.ibm.com
 (d06relay12.portsmouth.uk.ibm.com [9.149.109.197])
 by ppma06ams.nl.ibm.com with ESMTP id 3b57cjf1h6-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 21 Sep 2021 03:24:14 +0000
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com
 [9.149.105.61])
 by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 18L3OCwW35914180
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Tue, 21 Sep 2021 03:24:12 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 6C8DF11C04C;
 Tue, 21 Sep 2021 03:24:12 +0000 (GMT)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 9A11B11C050;
 Tue, 21 Sep 2021 03:24:10 +0000 (GMT)
Received: from KewenLins-MacBook-Pro.local (unknown [9.200.63.17])
 by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Tue, 21 Sep 2021 03:24:10 +0000 (GMT)
Subject: Re: [PATCH] rs6000: Modify the way for extra penalized cost
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
 Bill Schmidt <wschmidt@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>
References: <d26e0a72-a029-c765-75ab-9b31de44f114@linux.ibm.com>
 <20210917220158.GW1583@gate.crashing.org>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <5a152e8d-908b-4971-46c6-721418405199@linux.ibm.com>
Date: Tue, 21 Sep 2021 11:24:08 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
MIME-Version: 1.0
In-Reply-To: <20210917220158.GW1583@gate.crashing.org>
Content-Type: text/plain; charset=gbk
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: IrAJEt3T5_2yl3P5EwrXBhFV2gWoqBD-
X-Proofpoint-GUID: McF0carVt3C7pqRxvidBo8dyKK_Bvsbu
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.182.1,Aquarius:18.0.790,Hydra:6.0.391,FMLib:17.0.607.475
 definitions=2021-09-20_11,2021-09-20_01,2020-04-07_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 mlxlogscore=999
 malwarescore=0 clxscore=1015 priorityscore=1501 phishscore=0
 lowpriorityscore=0 spamscore=0 mlxscore=0 impostorscore=0 bulkscore=0
 adultscore=0 suspectscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2109030001 definitions=main-2109210017
X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, MIME_CHARSET_FARAWAY, NICE_REPLY_A,
 RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 21 Sep 2021 03:24:21 -0000

Hi Segher,

Thanks for the review!

on 2021/9/18 上午6:01, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stablize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>>     stmt_cost / (log2(nunits) * log2(nunits))
> 
> So for  V16QI it gives *16/(4*4) so *1
>         V8HI  it gives *8/(3*3)  so *8/9
>         V4SI  it gives *4/(2*2)  so *1
>         V2DI  it gives *2/(1*1)  so *2
> and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
> just gives wildly wrong answers; the div returns 0 on recent systems).
> 

I don't expected we will have V1TI for strided/elementwise load,
if it's one unit vector, it's the whole vector itself.
Besides, the below assertion should exclude it already.

>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
> 
> Does it make sense to treat V2DI (and V2DF) as twice more expensive than
> other vectors, which are all pretty much equal cost (except those that
> end up with cost 0)?  If so, there are simpler ways to do that.
> 

Yeah, from the SPEC2017 evaluation, it's good with this.  The costing
framework of vectorization doesn't consider the dependent insn chain
and available #unit etc. like local scheduling (it can't either), so
we have to use some heuristics to handle some special cases.  For more
units vector construction, the used instructions are more.  It has more
chances to schedule them better (even run in parallelly when enough
available units at the time), so we don't need to penalize more for them.
For V2DI, the load result is fed into construction directly, the current
stmt_cost is to consider merging and only 2, penalizing it with one is
not enough from the bwaves experiment.

>> +	  int nunits_log2 = exact_log2 (nunits);
>> +	  gcc_assert (nunits_log2 > 0);
>> +	  unsigned int nunits_sq = nunits_log2 * nunits_log2;
> 
>> = 0
> 
> This of course is assuming nunits will always be a power of 2, but I'm
> sure that we have many other places in the compiler assuming that
> already, so that is fine.  And if one day this stops being true we will
> get a nice ICE, pretty much the best we could hope for.
> 

Yeah, exact_log2 returns -1 for non power of 2 input, for example:

input     output
0     ->    -1
1     ->    0
2     ->    1
3     ->    -1

>> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
> 
> But this can divide by 0.  Or are we somehow guaranteed that nunits
> will never be 1?  Yes the log2 check above, sure, but that ICEs if this
> is violated; is there anything that actually guarantees it is true?
> 

As I mentioned above, I don't expect we can have nunits 1 strided/ew load,
and the ICE should check this and ensure dividing by zero never happens.  :)

>> +	  gcc_assert (adjusted_cost > 0);
> 
> I don't see how you guarantee this, either.
> 

It's mainly to prevent that one day we tweak the cost for construction
in rs6000_builtin_vectorization_cost then make some unexpected values
generated here.  But now these expected values are guaranteed as the
current costs and the formula.

> 
> A magic crazy formula like this is no good.  If you want to make the
> cost of everything but V2D* be the same, and that of V2D* be twice that,
> that is a weird heuristic, but we can live with that perhaps.  But that
> beats completely unexplained (and unexplainable) magic!
> 
> Sorry.
> 

That's all right, thanks for the comments!  let's improve it.  :)
How about just assigning 2 for V2DI and 1 for the others for the
penalized_cost_per_load with some detailed commentary, it should have
the same effect with this "magic crazy formula", but I guess it can
be more clear.

BR,
Kewen