From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 4F652385040A for ; Tue, 27 Jul 2021 22:25:12 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4F652385040A Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 16RM3kXW061328; Tue, 27 Jul 2021 18:25:11 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 3a2tassbmt-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 27 Jul 2021 18:25:10 -0400 Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16RM56tl067268; Tue, 27 Jul 2021 18:25:10 -0400 Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com with ESMTP id 3a2tassbmb-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 27 Jul 2021 18:25:10 -0400 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16RMMEVq006975; Tue, 27 Jul 2021 22:25:09 GMT Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com [9.57.198.24]) by ppma04wdc.us.ibm.com with ESMTP id 3a235n5btf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 27 Jul 2021 22:25:09 +0000 Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com [9.57.199.108]) by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 16RMP8TE37290332 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 27 Jul 2021 22:25:08 GMT Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 82342B2077; Tue, 27 Jul 2021 22:25:08 +0000 (GMT) Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 13B47B2073; Tue, 27 Jul 2021 22:25:07 +0000 (GMT) Received: from lexx (unknown [9.171.17.235]) by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP; Tue, 27 Jul 2021 22:25:06 +0000 (GMT) Message-ID: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com> Subject: Re: [PATCH v2] rs6000: Add load density heuristic From: will schmidt To: "Kewen.Lin" , GCC Patches Cc: Bill Schmidt , Segher Boessenkool , David Edelsohn Date: Tue, 27 Jul 2021 17:25:05 -0500 In-Reply-To: <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com> References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com> <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.28.5 (3.28.5-10.el7) X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: hMvLoMbq-Grc4SO54KXE9QhurVHhvNxq X-Proofpoint-GUID: UMPLINQREewCacK8DfL6jz6WHE_sQVId Content-Transfer-Encoding: 7bit X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-27_14:2021-07-27, 2021-07-27 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 mlxlogscore=999 malwarescore=0 adultscore=0 phishscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 bulkscore=0 priorityscore=1501 impostorscore=0 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2107140000 definitions=main-2107270128 X-Spam-Status: No, score=-12.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 27 Jul 2021 22:25:14 -0000 On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote: > Hi, > Hi, > This is the updated version of patch to deal with the bwaves_r > degradation due to vector construction fed by strided loads. > > As Richi's comments [1], this follows the similar idea to over > price the vector construction fed by VMAT_ELEMENTWISE or > VMAT_STRIDED_SLP. Instead of adding the extra cost on vector > construction costing immediately, it firstly records how many > loads and vectorized statements in the given loop, later in > rs6000_density_test (called by finish_cost) it computes the > load density ratio against all vectorized stmts, and check > with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD > and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing > if both thresholds are exceeded. ok > > Note that this new load density heuristic check is based on > some fields in target cost which are updated as needed when > scanning each add_stmt_cost entry, it's independent of the > current function rs6000_density_test which requires to scan > non_vect stmts. Since it's checking the load stmts count > vs. all vectorized stmts, it's kind of density, so I put > it in function rs6000_density_test. With the same reason to > keep it independent, I didn't put it as an else arm of the > current existing density threshold check hunk or before this > hunk. ok > > In the investigation of -1.04% degradation from 526.blender_r > on Power8, I noticed that the extra penalized cost 320 on one > single vector construction with type V16QI is much exaggerated, > which makes the final body cost unreliable, so this patch adds > one maximum bound for the extra penalized cost for each vector > construction statement. ok > > Bootstrapped/regtested on powerpc64le-linux-gnu P9. > > Full SPEC2017 performance evaluation on Power8/Power9 with > option combinations: > * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} > * {-O3, -Ofast} {,-funroll-loops} > > bwaves_r degradations on P8/P9 have been fixed, nothing else > remarkable was observed. So, this fixes the "-1.04% degradation from 526.blender_r on Power8" degredation with no additional regressions. that sounds good. > > Is it ok for trunk? > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html > > BR, > Kewen > ----- > gcc/ChangeLog: > > * config/rs6000/rs6000.c (struct rs6000_cost_data): New members > nstmts, nloads and extra_ctor_cost. > (rs6000_density_test): Add load density related heuristics and the > checks, do extra costing on vector construction statements if need. > (rs6000_init_cost): Init new members. > (rs6000_update_target_cost_per_stmt): New function. > (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function > rs6000_update_target_cost_per_stmt and call it. > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index 83d29cbfac1..806c3335cbc 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > > @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data > { > struct loop *loop_info; > unsigned cost[3]; > + /* Total number of vectorized stmts (loop only). */ > + unsigned nstmts; > + /* Total number of loads (loop only). */ > + unsigned nloads; > + /* Possible extra penalized cost on vector construction (loop only). */ > + unsigned extra_ctor_cost; > > /* For each vectorized loop, this var holds TRUE iff a non-memory vector > instruction is needed by the vectorization. */ > bool vect_nonmem; > @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "density %d%%, cost %d exceeds threshold, penalizing " > - "loop body cost by %d%%", density_pct, > + "loop body cost by %d%%\n", density_pct, > vec_cost + not_vec_cost, DENSITY_PENALTY); > } > + > + /* Check if we need to penalize the body cost for latency and > + execution resources bound from strided or elementwise loads > + into a vector. */ > + if (data->extra_ctor_cost > 0) > + { > + /* Threshold for load stmts percentage in all vectorized stmts. */ > + const int DENSITY_LOAD_PCT_THRESHOLD = 45; > + /* Threshold for total number of load stmts. */ > + const int DENSITY_LOAD_NUM_THRESHOLD = 20; > + > + gcc_assert (data->nloads <= data->nstmts); > + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); > + > + /* It's likely to be bounded by latency and execution resources > + from many scalar loads which are strided or elementwise loads > + into a vector if both conditions below are found: > + 1. there are many loads, it's easy to result in a long wait > + for load units; > + 2. load has a big proportion of all vectorized statements, > + it's not easy to schedule other statements to spread among > + the loads. > + One typical case is the innermost loop of the hotspot of SPEC2017 > + 503.bwaves_r without loop interchange. */ > + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD > + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) > + { > + data->cost[vect_body] += data->extra_ctor_cost; > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Found %u loads and load pct. %u%% exceed " > + "the threshold, penalizing loop body " > + "cost by extra cost %u for ctor.\n", > + data->nloads, load_pct, data->extra_ctor_cost); > + } > + } > > } > > /* Implement targetm.vectorize.init_cost. */ > @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar) > data->cost[vect_body] = 0; > data->cost[vect_epilogue] = 0; > data->vect_nonmem = false; > + data->nstmts = 0; > + data->nloads = 0; > + data->extra_ctor_cost = 0; > > data->costing_for_scalar = costing_for_scalar; > return data; > } > @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind, > return 0; > } > > +/* As a visitor function for each statement cost entry handled in > + function add_stmt_cost, gather some information and update its > + relevant fields in target cost accordingly. */ I got lost trying to parse that.. (could be just me :-) Possibly instead something like /* Helper function for add_stmt_cost ; gather information and update the target_cost fields accordingly. */ > +static void > +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, > + enum vect_cost_for_stmt kind, > + struct _stmt_vec_info *stmt_info, > + enum vect_cost_model_location where, > + int stmt_cost, unsigned int orig_count) > +{ > + > + /* Check whether we're doing something other than just a copy loop. > + Not all such loops may be profitably vectorized; see > + rs6000_finish_cost. */ > + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote > + || kind == vec_construct || kind == scalar_to_vec) > + || (where == vect_body && kind == vector_stmt)) I thought I saw an alignment issue, then noticed the "(" that was hiding on me.. :-) Maybe clearer to read if you rearrange slightly and flatten it ? I defer to others on that.. if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote || kind == vec_construct || kind == scalar_to_vec) || (kind == vector_stmt && where == vect_body) > + data->vect_nonmem = true; > + > + /* Gather some information when we are costing the vector version for > + the statements locate in a loop body. */ s/version/instruction? operation?/ ? ? s/locate/located/ > + if (!data->costing_for_scalar && data->loop_info && where == vect_body) > + { > + data->nstmts += orig_count; > + > + if (kind == scalar_load || kind == vector_load || kind == unaligned_load > + || kind == vector_gather_load) Cosmetically, I'd move the second "||" to the next line to balance those two lines a bit more. > + data->nloads += orig_count; > + > + /* If we have strided or elementwise loads into a vector, it's > + possible to be bounded by latency and execution resources for > + many scalar loads. Try to account for this by scaling the > + construction cost by the number of elements involved, when > + handling each matching statement we record the possible extra > + penalized cost into target cost, in the end of costing for > + the whole loop, we do the actual penalization once some load > + density heuristics are satisfied. */ > + if (kind == vec_construct && stmt_info > + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type > + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE > + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) > + { > + tree vectype = STMT_VINFO_VECTYPE (stmt_info); > + unsigned int nunits = vect_nunits_for_cost (vectype); > + unsigned int extra_cost = nunits * stmt_cost; > + /* As function rs6000_builtin_vectorization_cost shows, we have > + priced much on V16QI/V8HI vector construction as their units, > + if we penalize them with nunits * stmt_cost, it can result in > + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is s/a/an/ > + 20 and nunits is 16, the extra cost is 320 which looks much > + exaggerated. So let's use one maximum bound for the extra > + penalized cost for vector construction here. */ > + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; > + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) > + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; > + data->extra_ctor_cost += extra_cost; > + } > + } > +} ok > + > > /* Implement targetm.vectorize.add_stmt_cost. */ > > static unsigned > @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, > /* Statements in an inner loop relative to the loop being > vectorized are weighted more heavily. The value here is > arbitrary and could potentially be improved with analysis. */ > + unsigned int orig_count = count; I don't see any other assignments to orig_count. Does 'count' get updated elsewhere in rs6000_add_stmt_cost() that the new orig_count variable is necessary? > > if (where == vect_body && stmt_info > && stmt_in_inner_loop_p (vinfo, stmt_info)) > { > @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, > retval = (unsigned) (count * stmt_cost); > cost_data->cost[where] += retval; > > - /* Check whether we're doing something other than just a copy loop. > - Not all such loops may be profitably vectorized; see > - rs6000_finish_cost. */ > - if ((kind == vec_to_scalar || kind == vec_perm > - || kind == vec_promote_demote || kind == vec_construct > - || kind == scalar_to_vec) > - || (where == vect_body && kind == vector_stmt)) > - cost_data->vect_nonmem = true; > + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, > + stmt_cost, orig_count); > > } > > return retval; >