Message-ID: <88d36e92-cf50-7ae3-f975-c273741c022c@linux.ibm.com>
Date: Mon, 15 Aug 2022 16:05:36 +0800
From: "Kewen.Lin"
To: GCC Patches
Cc: Richard Sandiford, Peter Bergner, David Edelsohn, Segher Boessenkool, Richard Biener
Subject: PING^1 [PATCH] rs6000: Suggest unroll factor for loop vectorization
List-Id: Gcc-patches mailing list

Hi,

Gentle ping:

https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html

BR,
Kewen

on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
> Hi,
>
> Commit r12-6679-g7ca1582ca60dc8 made the vectorizer accept one
> unroll factor to be applied to the vectorization factor when
> vectorizing the main loop; it is suggested by the target
> when doing costing.
>
> This patch introduces function determine_suggested_unroll_factor
> for the rs6000 port, so that it can suggest the unroll factor
> for a given loop being vectorized.  Referring to the aarch64 port
> and based on the analysis of SPEC2017 performance evaluation
> results, it mainly considers these aspects:
>   1) unroll option and pragma which can disable unrolling for the
>      given loop;
>   2) simple hardware resource model with issued non memory access
>      vector insn per cycle;
>   3) aggressive heuristics when iteration count is unknown:
>      - reduction case to break cross iteration dependency;
>      - emulated gather load;
>   4) estimated iteration count when iteration count is unknown;
>
> With this patch, SPEC2017 performance evaluation results on
> Power8/9/10 are listed below (speedup pct.):
>
>   * Power10
>     - O2: all are neutral (excluding some noise);
>     - Ofast: 510.parest_r +6.67%, the others are neutral
>              (written as ... in the entries below);
>     - Ofast + unroll: 510.parest_r +5.91%, ...
>     - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
>     - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
>     - Ofast + very-cheap vect cost: all are neutral;
>
>   * Power9
>     - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
>              (likely noise), 500.perlbench_r +1.84%, ...
>
>   * Power8
>     - Ofast: 510.parest_r +5.43%, ...;
>
> This patch also introduces one documented parameter
> rs6000-vect-unroll-limit= similar to what aarch64 proposes;
> by evaluating on P8/P9/P10, the default value 4 is slightly
> better than the other choices like 2 and 8.
>
> It also parameterizes two other values as undocumented
> parameters for future tweaking.  One parameter is
> rs6000-vect-unroll-issue, which simply models the hardware
> resource for non memory access vector instructions to avoid
> excessive unrolling.  Initially I tried to use the value in
> the hook rs6000_issue_rate, but the evaluation showed it's
> bad, so I evaluated different values 2/4/6/8 on P8/P9/P10 at
> Ofast; the results showed the default value 4 is good enough
> on these different architectures.  For the record, choice 8
> could make 510.parest_r's gain become smaller or gone on
> P8/P9/P10; choice 6 could make 503.bwaves_r degrade by more
> than 1% on P8/P10; and choice 2 could make 538.imagick_r
> degrade by 3.8%.  The other parameter is
> rs6000-vect-unroll-reduc-threshold.  It's mainly inspired by
> 510.parest_r and tuned for it; evaluating with different
> values 0/1/2/3 for the threshold showed value 1 is the
> best choice.  For the record, choice 0 could make 525.x264_r
> degrade by 2% and 527.cam4_r degrade by 2.95% on P10,
> 548.exchange2_r degrade by 1.41% and 527.cam4_r degrade by
> 2.54% on P8; choice 2 and bigger values could make
> 510.parest_r's gain become smaller.
>
> Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8,
> and powerpc64le-linux-gnu P9.  Bootstrapped on
> powerpc64le-linux-gnu P10, but one failure was exposed during
> regression testing there; it's identified as a missed
> optimization that can be reproduced without this support, and
> PR106365 was opened for further tracking.
>
> Is it ok for trunk?
>
> BR,
> Kewen
> ------
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
> 	m_nstores, m_reduc_factor, m_gather_load and member function
> 	determine_suggested_unroll_factor.
> 	(rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
> 	m_reduc_factor and m_gather_load.
> 	(rs6000_cost_data::determine_suggested_unroll_factor): New function.
> 	(rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
> 	* config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
> 	(rs6000-vect-unroll-issue): Likewise.
> 	(rs6000-vect-unroll-reduc-threshold): Likewise.
> 	* doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.
>
> ---
>  gcc/config/rs6000/rs6000.cc  | 125 ++++++++++++++++++++++++++++++++++-
>  gcc/config/rs6000/rs6000.opt |  18 +++++
>  gcc/doc/invoke.texi          |   7 ++
>  3 files changed, 147 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 3ff16b8ae04..d0f107d70a8 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -5208,16 +5208,23 @@ protected:
>                                     vect_cost_model_location, unsigned int);
>    void density_test (loop_vec_info);
>    void adjust_vect_cost_per_loop (loop_vec_info);
> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
>
>    /* Total number of vectorized stmts (loop only).  */
>    unsigned m_nstmts = 0;
>    /* Total number of loads (loop only).  */
>    unsigned m_nloads = 0;
> +  /* Total number of stores (loop only).  */
> +  unsigned m_nstores = 0;
> +  /* Reduction factor for suggesting unroll factor (loop only).  */
> +  unsigned m_reduc_factor = 0;
>    /* Possible extra penalized cost on vector construction (loop only).  */
>    unsigned m_extra_ctor_cost = 0;
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool m_vect_nonmem = false;
> +  /* If this loop gets vectorized with emulated gather load.  */
> +  bool m_gather_load = false;
>  };
>
>  /* Test for likely overcommitment of vector hardware resources.  If a
> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
>      {
>        m_nstmts += orig_count;
>
> -      if (kind == scalar_load || kind == vector_load
> -          || kind == unaligned_load || kind == vector_gather_load)
> -        m_nloads += orig_count;
> +      if (kind == scalar_load
> +          || kind == vector_load
> +          || kind == unaligned_load
> +          || kind == vector_gather_load)
> +        {
> +          m_nloads += orig_count;
> +          if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +            m_gather_load = true;
> +        }
> +      else if (kind == scalar_store
> +               || kind == vector_store
> +               || kind == unaligned_store
> +               || kind == vector_scatter_store)
> +        m_nstores += orig_count;
> +      else if ((kind == scalar_stmt
> +                || kind == vector_stmt
> +                || kind == vec_to_scalar)
> +               && stmt_info
> +               && vect_is_reduction (stmt_info))
> +        {
> +          /* Loop body contains normal int or fp operations and epilogue
> +             contains vector reduction.  For simplicity, we assume int
> +             operation takes one cycle and fp operation takes one more.  */
> +          tree lhs = gimple_get_lhs (stmt_info->stmt);
> +          bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
> +          unsigned int basic_cost = is_float ? 2 : 1;
> +          m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
> +        }
>
>        /* Power processors do not currently have instructions for strided
>           and elementwise loads, and instead we must generate multiple
> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
>      }
>  }
>
> +/* Determine suggested unroll factor by considering some below factors:
> +
> +    - unroll option/pragma which can disable unrolling for this loop;
> +    - simple hardware resource model for non memory vector insns;
> +    - aggressive heuristics when iteration count is unknown:
> +      - reduction case to break cross iteration dependency;
> +      - emulated gather load;
> +    - estimated iteration count when iteration count is unknown;
> +*/
> +
> +
> +unsigned int
> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> +  if (loop->unroll == 1
> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> +    return 1;
> +
> +  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
> +  /* Don't unroll if no vector instructions except for memory access.  */
> +  if (nstmts_nonldst == 0)
> +    return 1;
> +
> +  /* Consider breaking cross iteration dependency for reduction.  */
> +  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
> +
> +  /* Use this simple hardware resource model that how many non ld/st
> +     vector instructions can be issued per cycle.  */
> +  unsigned int issue_width = rs6000_vect_unroll_issue;
> +  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
> +  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
> +  /* Make sure it is power of 2.  */
> +  uf = 1 << ceil_log2 (uf);
> +
> +  /* If the iteration count is known, the costing would be exact enough,
> +     don't worry it could be worse.  */
> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    return uf;
> +
> +  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
> +     loop if either condition is satisfied:
> +       - reduction factor exceeds the threshold;
> +       - emulated gather load adopted.  */
> +  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
> +      || m_gather_load)
> +    return uf;
> +
> +  /* Check if we can conclude it's good to unroll from the estimated
> +     iteration count.  */
> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> +  unsigned int unrolled_vf = vf * uf;
> +  if (est_niter == -1 || est_niter < unrolled_vf)
> +    /* When the estimated iteration of this loop is unknown, it's possible
> +       that we are able to vectorize this loop with the original VF but fail
> +       to vectorize it with the unrolled VF any more if the actual iteration
> +       count is in between.  */
> +    return 1;
> +  else
> +    {
> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> +      unsigned int epil_niter = est_niter % vf;
> +      /* Even if we have partial vector support, it can be still inefficient
> +         to calculate the length when the iteration count is unknown, so
> +         only expect it's good to unroll when the epilogue iteration count
> +         is not bigger than VF (only one time length calculation).  */
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +          && epil_niter_unr <= vf)
> +        return uf;
> +      /* Without partial vector support, conservatively unroll this when
> +         the epilogue iteration count is less than the original one
> +         (epilogue execution time wouldn't be longer than before).  */
> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +               && epil_niter_unr <= epil_niter)
> +        return uf;
> +    }
> +
> +  return 1;
> +}
> +
>  void
>  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>  {
> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>            && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>            && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>          m_costs[vect_body] += 10000;
> +
> +      m_suggested_unroll_factor
> +        = determine_suggested_unroll_factor (loop_vinfo);
>      }
>
>    vector_costs::finish_cost (scalar_costs);
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index 4931d781c4e..80c2c61a9de 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -624,6 +624,14 @@ mieee128-constant
>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>  Generate (do not generate) code that uses the LXVKQ instruction.
>
> +; Documented parameters
> +
> +-param=rs6000-vect-unroll-limit=
> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
> +Used to limit unroll factor which indicates how much the autovectorizer may
> +unroll a loop.  The default value is 4.
> +
> +; Undocumented parameters
>  -param=rs6000-density-pct-threshold=
>  Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
>  When costing for loop vectorization, we probably need to penalize the loop body
> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we also check if the total
>  number of load statements exceeds the threshold specified by this parameter,
>  and penalize only if it's satisfied.  The default value is 20.
>
> +-param=rs6000-vect-unroll-issue=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
> +Indicate how many non memory access vector instructions can be issued per
> +cycle, it's used in unroll factor determination for autovectorizer.  The
> +default value is 4.
> +
> +-param=rs6000-vect-unroll-reduc-threshold=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
> +When reduction factor computed for a loop exceeds the threshold specified by
> +this parameter, prefer to unroll this loop.  The default value is 1.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 84d6f0f9860..097ab1d5563 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in privileged state.
>  @opindex no-block-ops-unaligned-vsx
>  Generate (do not generate) unaligned vsx loads and stores for
>  inline expansion of @code{memcpy} and @code{memmove}.
> +
> +@item --param rs6000-vect-unroll-limit=
> +The vectorizer will check with target information to determine whether it
> +would be beneficial to unroll the main vectorized loop and by how much.  This
> +parameter sets the upper bound of how much the vectorizer will unroll the main
> +loop.  The default value is four.
> +
>  @end table
>
>  @node RX Options
> --
> 2.27.0
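
For readers who want to try the arithmetic outside of GCC, here is a minimal
standalone sketch of the two numeric steps in the quoted
determine_suggested_unroll_factor: the resource-model unroll factor and the
estimated-iteration epilogue check.  It is not part of the patch.  The struct
UnrollInputs and the functions round_up_pow2, base_unroll_factor and
epilogue_allows_unroll are hypothetical names made up for illustration, and
the option/pragma check and the reduction/gather-load shortcut are left out.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the counters and --param values the patch reads.
struct UnrollInputs {
  unsigned nstmts = 0;        // all vectorized statements in the loop body
  unsigned nloads = 0;        // loads among them
  unsigned nstores = 0;       // stores among them
  unsigned reduc_factor = 0;  // 0 when the loop has no reduction
  unsigned issue_width = 4;   // models rs6000-vect-unroll-issue (default 4)
  unsigned unroll_limit = 4;  // models rs6000-vect-unroll-limit (default 4)
};

// Smallest power of two that is >= x, for x >= 1.
unsigned round_up_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Resource-model step: ceil(reduc * issue_width / non-memory statements),
// clamped to the limit, then rounded up to a power of two.  As in the quoted
// patch, the rounding happens after the clamp.
unsigned base_unroll_factor(const UnrollInputs &in) {
  assert(in.issue_width >= 1 && in.unroll_limit >= 1);
  assert(in.nloads + in.nstores <= in.nstmts);
  unsigned nonldst = in.nstmts - in.nloads - in.nstores;
  if (nonldst == 0)
    return 1;  // only memory accesses: do not unroll
  unsigned reduc = std::max(in.reduc_factor, 1u);
  unsigned uf = (reduc * in.issue_width + nonldst - 1) / nonldst;
  uf = std::min(in.unroll_limit, uf);
  return round_up_pow2(uf);
}

// Epilogue step for an unknown iteration count.  est_niter < 0 stands for
// "no estimate available" (the patch tests for -1).
bool epilogue_allows_unroll(long long est_niter, unsigned vf, unsigned uf,
                            bool partial_vectors) {
  assert(vf >= 1 && uf >= 1);
  unsigned long long unrolled_vf = (unsigned long long) vf * uf;
  if (est_niter < 0 || (unsigned long long) est_niter < unrolled_vf)
    return false;
  unsigned long long epil_unr = (unsigned long long) est_niter % unrolled_vf;
  unsigned long long epil = (unsigned long long) est_niter % vf;
  // With partial vectors, accept an epilogue of at most one VF; otherwise
  // require that the epilogue does not grow compared to the original VF.
  return partial_vectors ? epil_unr <= vf : epil_unr <= epil;
}
```

With the default values, a loop with three non-memory vector statements and no
reduction gets CEIL (1 * 4, 3) = 2, while a loop with a single non-memory
statement and a reduction factor of 2 gets CEIL (2 * 4, 1) = 8, which the
limit clamps to 4.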