From: Bill Schmidt
Reply-To: wschmidt@linux.ibm.com
To: Segher Boessenkool, "Kewen.Lin"
Cc: David Edelsohn, will schmidt, GCC Patches
Subject: Re: [PATCH v4] rs6000: Add load density heuristic
Date: Thu, 9 Sep 2021 12:19:28 -0500
Message-ID: <894f01c3-6481-0757-751f-b4239a4f0232@linux.ibm.com>
In-Reply-To: <20210909161152.GR1583@gate.crashing.org>

On 9/9/21 11:11 AM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>> +  /* If we have strided or elementwise loads into a vector, it's
>>>> +     possible to be bounded by latency and execution resources for
>>>> +     many scalar loads.  Try to account for this by scaling the
>>>> +     construction cost by the number of elements involved, when
>>>> +     handling each matching statement we record the possible extra
>>>> +     penalized cost into target cost, in the end of costing for
>>>> +     the whole loop, we do the actual penalization once some load
>>>> +     density heuristics are satisfied.  */
>>> The above comment is quite hard to read.  Can you please break up the last
>>> sentence into at least two sentences?
>> How about the below:
>>
>> +  /* If we have strided or elementwise loads into a vector, it's
> "strided" is not a word: it properly is "stridden", which does not read
> very well either.  "Have loads by stride, or by element, ..."?  Is that
> good English, and easier to understand?

No, this is OK.  "Strided loads" is a term of art used by the
vectorizer; whether or not it was the Queen's English, it's what we
have...  (And I think you might only find "bestridden" in some 18th or
19th century English poetry... :-)
>
>> +     possible to be bounded by latency and execution resources for
>> +     many scalar loads.  Try to account for this by scaling the
>> +     construction cost by the number of elements involved.  For
>> +     each matching statement, we record the possible extra
>> +     penalized cost into the relevant field in target cost.
>> +     When we want to finalize the whole loop costing, we will check if
>> +     those related load density heuristics are satisfied, and add
>> +     this accumulated penalized cost if yes.  */
>>
>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>> that clarified.
> Does that text look good to you now Bill?  It is still kinda complex,
> maybe you see a way to make it simpler.

I think it's OK now.  The complexity at least matches the code now
instead of exceeding it. :-P  j/k...
>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
> out this whole phrase part :-)
>
> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
> bit boring text that way, but that is the purpose: it makes it simpler
> to read (and read quickly, even merely scan).
>
>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
> [ Btw, you can get rid of the typedef now, just have a struct with the
> non-underscore name, we have C++ now.  Such a mechanical change (as
> separate patch!) is pre-approved. ]
>
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
> Bill, is that clear enough?  I'm sure something nicer would help here,
> but it's hard for me to write anything :-)

Perhaps:  "Check whether we need to penalize the body cost to account
for excess strided or elementwise loads."
>
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> Threshold for what?
>
> 45% is awfully exact.  Can you make this a param?
>
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> Same.

We have similar magic constants in here already.  Parameterizing is
possible, but I'm more interested in making sure the numbers are
appropriate for each processor.  Given that Kewen reports they work well
for both P9 and P10, I'm pretty happy with what we have here.  (Kewen,
thanks for running the P10 experiments!)

Perhaps a follow-up patch to add params for the magic constants would be
reasonable, but I'd personally consider it pretty low priority.
>
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> No parens around the last thing please.  The other pair of parens is
> unneeded as well, but perhaps it is easier to read like that.
>
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
> That line does not fit.  Make it more lines?
>
> It is a pity that using these interfaces at all takes up 45 chars
> of noise already.
>
>> +/* Helper function for add_stmt_cost.  Check each statement cost
>> +   entry, gather information and update the target_cost fields
>> +   accordingly.  */
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
> Please put those last two on separate lines as well?
>
>> +  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +     priced much on V16QI/V8HI vector construction as their units,
>> +     if we penalize them with nunits * stmt_cost, it can result in
>> +     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> +     is 20 and nunits is 16, the extra cost is 320 which looks
>> +     much exaggerated.
>> +     So let's use one maximum bound for the extra penalized cost
>> +     for vector construction here.  */
>> +  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> That is a pretty gross hack.  Can you think of any saner way to not have
> those out of scale costs in the first place?

In Kewen's defense, the whole business of "finish_cost" for these
vectorized loops is to tweak things that don't work quite right with the
hooks currently provided to the vectorizer to add costs on a per-stmt
basis without looking at the overall set of statements.  It gives the
back end a chance to massage things and exercise veto power over
otherwise bad decisions.  By nature, that's going to be very much a
heuristic exercise.  Personally I think the heuristics used here are
pretty reasonable, and importantly they are designed to only be employed
in pretty rare circumstances.  It doesn't look easy to me to avoid the
need for a cap here without making the rest of the heuristics harder to
understand.  But sure, he can try! :)

Kewen, thanks for the updates!

Bill

>
> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
> for the wordsmithing :-) )
>
>
> Segher
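
For readers following the thread in the archive, the heuristic under
discussion can be sketched roughly as below.  This is a standalone
illustration and not the patch itself: the struct cost_data_sketch, its
body_cost field and both function names are hypothetical stand-ins, and
the strict ">" comparisons are an assumption, since the quoted hunks do
not show the actual condition.  Only the three constants and the
nunits * stmt_cost formula come from the quoted code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the fields the patch adds to rs6000_cost_data.
struct cost_data_sketch
{
  unsigned int nstmts = 0;          // vectorized statements in the loop body
  unsigned int nloads = 0;          // how many of those are loads
  unsigned int extra_ctor_cost = 0; // accumulated vector construction penalty
  unsigned int body_cost = 0;       // assumed name for the loop body cost
};

// Constants as quoted in the thread.
const unsigned int DENSITY_LOAD_PCT_THRESHOLD = 45;
const unsigned int DENSITY_LOAD_NUM_THRESHOLD = 20;
const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;

// Per-statement step: the raw penalty is nunits * stmt_cost, capped so
// that e.g. V16QI on Power8 (16 * 20 = 320) does not swamp the body cost.
unsigned int
capped_ctor_penalty (unsigned int nunits, unsigned int stmt_cost)
{
  return std::min (nunits * stmt_cost, MAX_PENALIZED_COST_FOR_CTOR);
}

// Finalization step: add the accumulated penalty only when the loop is
// load-dense.  Returns true if the penalty was applied.
bool
apply_load_density_penalty (cost_data_sketch &data)
{
  if (data.extra_ctor_cost == 0 || data.nstmts == 0)
    return false;

  assert (data.nloads <= data.nstmts); // loads are a subset of all stmts

  unsigned int load_pct = data.nloads * 100 / data.nstmts;

  // Assumption: both thresholds must be strictly exceeded.
  if (data.nloads > DENSITY_LOAD_NUM_THRESHOLD
      && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
    {
      data.body_cost += data.extra_ctor_cost;
      return true;
    }
  return false;
}
```

With these numbers, a loop of 40 vectorized statements with 24 loads
(60%) would take the penalty, while one of 100 statements with 30 loads
(30%) would not, which matches the intent that the adjustment only
fires in fairly rare, load-heavy loops.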