From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 111E13892023 for ; Wed, 28 Jul 2021 03:00:00 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 111E13892023 Received: from pps.filterd (m0098393.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 16S2hddS084094; Tue, 27 Jul 2021 22:59:59 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 3a2xpy893m-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 27 Jul 2021 22:59:59 -0400 Received: from m0098393.ppops.net (m0098393.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16S2iEi7085133; Tue, 27 Jul 2021 22:59:59 -0400 Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com [169.51.49.102]) by mx0a-001b2d01.pphosted.com with ESMTP id 3a2xpy8934-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 27 Jul 2021 22:59:59 -0400 Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1]) by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16S2wt7l031311; Wed, 28 Jul 2021 02:59:57 GMT Received: from b06cxnps4076.portsmouth.uk.ibm.com (d06relay13.portsmouth.uk.ibm.com [9.149.109.198]) by ppma06ams.nl.ibm.com with ESMTP id 3a235kgryq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 28 Jul 2021 02:59:56 +0000 Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60]) by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 16S2xsGp29229546 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 28 Jul 2021 02:59:54 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3297B4203F; Wed, 28 Jul 2021 02:59:54 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 4E98342041; Wed, 28 Jul 2021 02:59:52 +0000 (GMT) Received: from kewenlins-mbp.cn.ibm.com (unknown [9.200.147.34]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Wed, 28 Jul 2021 02:59:51 +0000 (GMT) Subject: Re: [PATCH v2] rs6000: Add load density heuristic To: will schmidt Cc: Bill Schmidt , Segher Boessenkool , David Edelsohn , GCC Patches References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com> <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com> <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com> From: "Kewen.Lin" Message-ID: <91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com> Date: Wed, 28 Jul 2021 10:59:50 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.10.0 In-Reply-To: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 7Rdmyviy6Hp3yLCQtHm-C-nOrlEM9du1 X-Proofpoint-ORIG-GUID: wztXBDsdXwrD4sdhglfkf_T_Tr-W5Z0b Content-Transfer-Encoding: 8bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-28_01:2021-07-27, 2021-07-27 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 priorityscore=1501 phishscore=0 impostorscore=0 bulkscore=0 mlxscore=0 lowpriorityscore=0 mlxlogscore=999 suspectscore=0 clxscore=1015 spamscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2107140000 definitions=main-2107280012 X-Spam-Status: No, score=-11.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, NICE_REPLY_A, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 28 Jul 2021 03:00:03 -0000 Hi William, Thanks for the review comments! on 2021/7/28 上午6:25, will schmidt wrote: > On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote: >> Hi, >> > > > Hi, > > >> This is the updated version of patch to deal with the bwaves_r >> degradation due to vector construction fed by strided loads. >> >> As Richi's comments [1], this follows the similar idea to over >> price the vector construction fed by VMAT_ELEMENTWISE or >> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector >> construction costing immediately, it firstly records how many >> loads and vectorized statements in the given loop, later in >> rs6000_density_test (called by finish_cost) it computes the >> load density ratio against all vectorized stmts, and check >> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD >> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing >> if both thresholds are exceeded. > > ok > >> >> Note that this new load density heuristic check is based on >> some fields in target cost which are updated as needed when >> scanning each add_stmt_cost entry, it's independent of the >> current function rs6000_density_test which requires to scan >> non_vect stmts. Since it's checking the load stmts count >> vs. all vectorized stmts, it's kind of density, so I put >> it in function rs6000_density_test. With the same reason to >> keep it independent, I didn't put it as an else arm of the >> current existing density threshold check hunk or before this >> hunk. > > ok > >> >> In the investigation of -1.04% degradation from 526.blender_r >> on Power8, I noticed that the extra penalized cost 320 on one >> single vector construction with type V16QI is much exaggerated, >> which makes the final body cost unreliable, so this patch adds >> one maximum bound for the extra penalized cost for each vector >> construction statement. > > ok > >> >> Bootstrapped/regtested on powerpc64le-linux-gnu P9. >> >> Full SPEC2017 performance evaluation on Power8/Power9 with >> option combinations: >> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} >> * {-O3, -Ofast} {,-funroll-loops} >> >> bwaves_r degradations on P8/P9 have been fixed, nothing else >> remarkable was observed. > > So, this fixes the "-1.04% degradation from 526.blender_r on Power8" > degredation with no additional regressions. that sounds good. > Sorry for the confusion, the original intention of this patch is to fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r. (From the last evaluation based on r12-1442, P8 is about -10% while P9 is about -9%). The mentioned -1.04% degradation from 526.blender_r on P8 was expected to be a reason why we need the bound for the extra penalized cost adjustment. >> >> Is it ok for trunk? >> >> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html >> >> BR, >> Kewen >> ----- >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members >> nstmts, nloads and extra_ctor_cost. >> (rs6000_density_test): Add load density related heuristics and the >> checks, do extra costing on vector construction statements if need. >> (rs6000_init_cost): Init new members. >> (rs6000_update_target_cost_per_stmt): New function. >> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function >> rs6000_update_target_cost_per_stmt and call it. >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index 83d29cbfac1..806c3335cbc 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> >> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data >> { >> struct loop *loop_info; >> unsigned cost[3]; >> + /* Total number of vectorized stmts (loop only). */ >> + unsigned nstmts; >> + /* Total number of loads (loop only). */ >> + unsigned nloads; >> + /* Possible extra penalized cost on vector construction (loop only). */ >> + unsigned extra_ctor_cost; >> >> /* For each vectorized loop, this var holds TRUE iff a non-memory vector >> instruction is needed by the vectorization. */ >> bool vect_nonmem; >> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) >> if (dump_enabled_p ()) >> dump_printf_loc (MSG_NOTE, vect_location, >> "density %d%%, cost %d exceeds threshold, penalizing " >> - "loop body cost by %d%%", density_pct, >> + "loop body cost by %d%%\n", density_pct, >> vec_cost + not_vec_cost, DENSITY_PENALTY); >> } >> + >> + /* Check if we need to penalize the body cost for latency and >> + execution resources bound from strided or elementwise loads >> + into a vector. */ >> + if (data->extra_ctor_cost > 0) >> + { >> + /* Threshold for load stmts percentage in all vectorized stmts. */ >> + const int DENSITY_LOAD_PCT_THRESHOLD = 45; >> + /* Threshold for total number of load stmts. */ >> + const int DENSITY_LOAD_NUM_THRESHOLD = 20; >> + >> + gcc_assert (data->nloads <= data->nstmts); >> + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); >> + >> + /* It's likely to be bounded by latency and execution resources >> + from many scalar loads which are strided or elementwise loads >> + into a vector if both conditions below are found: >> + 1. there are many loads, it's easy to result in a long wait >> + for load units; >> + 2. load has a big proportion of all vectorized statements, >> + it's not easy to schedule other statements to spread among >> + the loads. >> + One typical case is the innermost loop of the hotspot of SPEC2017 >> + 503.bwaves_r without loop interchange. */ >> + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD >> + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) >> + { >> + data->cost[vect_body] += data->extra_ctor_cost; >> + if (dump_enabled_p ()) >> + dump_printf_loc (MSG_NOTE, vect_location, >> + "Found %u loads and load pct. %u%% exceed " >> + "the threshold, penalizing loop body " >> + "cost by extra cost %u for ctor.\n", >> + data->nloads, load_pct, data->extra_ctor_cost); >> + } >> + } >> >> } >> >> /* Implement targetm.vectorize.init_cost. */ >> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar) >> data->cost[vect_body] = 0; >> data->cost[vect_epilogue] = 0; >> data->vect_nonmem = false; >> + data->nstmts = 0; >> + data->nloads = 0; >> + data->extra_ctor_cost = 0; >> >> data->costing_for_scalar = costing_for_scalar; >> return data; >> } >> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind, >> return 0; >> } >> >> +/* As a visitor function for each statement cost entry handled in >> + function add_stmt_cost, gather some information and update its >> + relevant fields in target cost accordingly. */ > > I got lost trying to parse that.. (could be just me :-) > > Possibly instead something like > /* Helper function for add_stmt_cost ; gather information and update > the target_cost fields accordingly. */ > > OK, will update. I was thinking for each entry handled in function add_stmt_cost, this helper acts like a visitor, trying to visit each entry and take some actions if some conditions are satisifed. > >> +static void >> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, >> + enum vect_cost_for_stmt kind, >> + struct _stmt_vec_info *stmt_info, >> + enum vect_cost_model_location where, >> + int stmt_cost, unsigned int orig_count) >> +{ >> + >> + /* Check whether we're doing something other than just a copy loop. >> + Not all such loops may be profitably vectorized; see >> + rs6000_finish_cost. */ >> + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote >> + || kind == vec_construct || kind == scalar_to_vec) >> + || (where == vect_body && kind == vector_stmt)) > > I thought I saw an alignment issue, then noticed the "(" that was > hiding on me.. :-) > > Maybe clearer to read if you rearrange slightly and flatten it ? I > defer to others on that.. > > if ((kind == vec_to_scalar > || kind == vec_perm > || kind == vec_promote_demote > || kind == > vec_construct > || kind == scalar_to_vec) > || (kind == vector_stmt && where == vect_body) > > This hunk is factored out from function rs6000_add_stmt_cost, maybe I can keep the original formatting? The formatting tool isn't so smart, and sometimes rearrange things to become unexpected (although it meets the basic rule, not so elegant), sigh. >> + data->vect_nonmem = true; >> + >> + /* Gather some information when we are costing the vector version for >> + the statements locate in a loop body. */ > s/version/instruction? operation?/ ? ? > s/locate/located/ > Will fix. >> + if (!data->costing_for_scalar && data->loop_info && where == vect_body) >> + { >> + data->nstmts += orig_count; >> + >> + if (kind == scalar_load || kind == vector_load || kind == unaligned_load >> + || kind == vector_gather_load) > > Cosmetically, I'd move the second "||" to the next line to balance > those two lines a bit more. > Will fix. >> + data->nloads += orig_count; >> + >> + /* If we have strided or elementwise loads into a vector, it's >> + possible to be bounded by latency and execution resources for >> + many scalar loads. Try to account for this by scaling the >> + construction cost by the number of elements involved, when >> + handling each matching statement we record the possible extra >> + penalized cost into target cost, in the end of costing for >> + the whole loop, we do the actual penalization once some load >> + density heuristics are satisfied. */ >> + if (kind == vec_construct && stmt_info >> + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type >> + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE >> + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) >> + { >> + tree vectype = STMT_VINFO_VECTYPE (stmt_info); >> + unsigned int nunits = vect_nunits_for_cost (vectype); >> + unsigned int extra_cost = nunits * stmt_cost; >> + /* As function rs6000_builtin_vectorization_cost shows, we have >> + priced much on V16QI/V8HI vector construction as their units, >> + if we penalize them with nunits * stmt_cost, it can result in >> + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is > s/a/an/ Will fix. >> + 20 and nunits is 16, the extra cost is 320 which looks much >> + exaggerated. So let's use one maximum bound for the extra >> + penalized cost for vector construction here. */ >> + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; >> + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) >> + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; >> + data->extra_ctor_cost += extra_cost; >> + } >> + } >> +} > ok > >> + >> >> /* Implement targetm.vectorize.add_stmt_cost. */ >> >> static unsigned >> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, >> /* Statements in an inner loop relative to the loop being >> vectorized are weighted more heavily. The value here is >> arbitrary and could potentially be improved with analysis. */ >> + unsigned int orig_count = count; > > I don't see any other assignments to orig_count. Does 'count' get > updated elsewhere in rs6000_add_stmt_cost() that the new orig_count > variable is necessary? > Yeah, the 'count' could get updated but the default "git diff" doesn't show it, the codes omitted look as below: if (where == vect_body && stmt_info && stmt_in_inner_loop_p (vinfo, stmt_info)) { loop_vec_info loop_vinfo = dyn_cast (vinfo); gcc_assert (loop_vinfo); count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME. */ } BR, Kewen >> >> if (where == vect_body && stmt_info >> && stmt_in_inner_loop_p (vinfo, stmt_info)) >> { >> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, >> retval = (unsigned) (count * stmt_cost); >> cost_data->cost[where] += retval; >> >> - /* Check whether we're doing something other than just a copy loop. >> - Not all such loops may be profitably vectorized; see >> - rs6000_finish_cost. */ >> - if ((kind == vec_to_scalar || kind == vec_perm >> - || kind == vec_promote_demote || kind == vec_construct >> - || kind == scalar_to_vec) >> - || (where == vect_body && kind == vector_stmt)) >> - cost_data->vect_nonmem = true; >> + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, >> + stmt_cost, orig_count); >> >> } >> >> return retval; >> > >