From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
Peter Bergner <bergner@linux.ibm.com>,
Segher Boessenkool <segher@kernel.crashing.org>,
David Edelsohn <dje.gcc@gmail.com>,
Richard Biener <richard.guenther@gmail.com>
Subject: PING^2 [PATCH] rs6000: Suggest unroll factor for loop vectorization
Date: Mon, 29 Aug 2022 14:22:27 +0800 [thread overview]
Message-ID: <38ad15c0-75bd-9ca1-6efa-aa655f05f00a@linux.ibm.com> (raw)
In-Reply-To: <88d36e92-cf50-7ae3-f975-c273741c022c@linux.ibm.com>
Hi,
Gentle ping: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html
BR,
Kewen
>
> on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Commit r12-6679-g7ca1582ca60dc8 made the vectorizer accept an
>> unroll factor to be applied to the vectorization factor when
>> vectorizing the main loop; the factor is suggested by the target
>> during costing.
>>
>> This patch introduces the function determine_suggested_unroll_factor
>> for the rs6000 port, enabling it to suggest an unroll factor for a
>> given loop being vectorized.  Referring to the aarch64 port and
>> based on analysis of SPEC2017 performance evaluation results, it
>> mainly considers these aspects:
>> 1) unroll options and pragmas which can disable unrolling for the
>> given loop;
>> 2) a simple hardware resource model based on how many non-memory-access
>> vector insns can be issued per cycle;
>> 3) aggressive heuristics when the iteration count is unknown:
>> - reduction cases, to break cross-iteration dependencies;
>> - emulated gather loads;
>> 4) the estimated iteration count when the iteration count is unknown.
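To make the resource-model heuristic in 2) concrete, here is a minimal C sketch of the computation this patch performs (the function and helper names are illustrative, not the patch's actual identifiers; it assumes the parameter defaults discussed later, issue width 4 and unroll limit 4, are passed in):

```c
/* Illustrative sketch of the suggested-unroll-factor formula: scale
   the per-cycle issue width of non-memory-access vector insns by the
   reduction factor, divide by the number of non-load/store vector
   stmts, clamp to the limit, and round up to a power of two.  */

static unsigned int
ceil_log2_u (unsigned int x)
{
  unsigned int l = 0;
  while ((1u << l) < x)
    l++;
  return l;
}

static unsigned int
suggested_uf (unsigned int nstmts_nonldst, unsigned int reduc_factor,
              unsigned int issue_width,  /* rs6000-vect-unroll-issue */
              unsigned int limit)        /* rs6000-vect-unroll-limit */
{
  /* No non-memory vector work: unrolling cannot hide anything.  */
  if (nstmts_nonldst == 0)
    return 1;
  if (reduc_factor < 1)
    reduc_factor = 1;
  /* CEIL (a, b), as GCC's macro computes it.  */
  unsigned int uf = (reduc_factor * issue_width + nstmts_nonldst - 1)
                    / nstmts_nonldst;
  if (uf > limit)
    uf = limit;
  return 1u << ceil_log2_u (uf);
}
```

For example, a loop body with three non-load/store vector statements and a floating-point reduction (reduction factor 2) gets CEIL (2*4, 3) = 3, rounded up to 4; a body with ten such statements and no reduction gets 1, i.e. no unrolling.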
>>
>> With this patch, SPEC2017 performance evaluation results on
>> Power8/9/10 are listed below (speedup pct.):
>>
>> * Power10
>> - O2: all are neutral (excluding some noises);
>> - Ofast: 510.parest_r +6.67%, the others are neutral
>> ("..." stands for neutral results below);
>> - Ofast + unroll: 510.parest_r +5.91%, ...
>> - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
>> - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
>> - Ofast + very-cheap vect cost: all are neutral;
>>
>> * Power9
>> - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
>> (likely noise), 500.perlbench_r +1.84%, ...
>>
>> * Power8
>> - Ofast: 510.parest_r +5.43%, ...;
>>
>> This patch also introduces one documented parameter,
>> rs6000-vect-unroll-limit=, similar to what the aarch64 port
>> provides.  Evaluating on P8/P9/P10, the default value 4 is slightly
>> better than other choices like 2 and 8.
>>
>> It also parameterizes two other values as undocumented parameters
>> for future tweaking.  One is rs6000-vect-unroll-issue, which simply
>> models the hardware resources for non-memory-access vector
>> instructions to avoid excessive unrolling.  Initially I tried to use
>> the value from the hook rs6000_issue_rate, but the evaluation showed
>> it was bad, so I evaluated the values 2/4/6/8 on P8/P9/P10 at Ofast;
>> the results showed the default value 4 is good enough on these
>> different architectures.  For the record, choice 8 could make
>> 510.parest_r's gain become smaller or disappear on P8/P9/P10;
>> choice 6 could make 503.bwaves_r degrade by more than 1% on P8/P10;
>> and choice 2 could make 538.imagick_r degrade by 3.8%.  The other
>> parameter is rs6000-vect-unroll-reduc-threshold, mainly inspired by
>> 510.parest_r and tuned for it; evaluating threshold values 0/1/2/3
>> showed that 1 is the best choice.  For the record, choice 0 could
>> make 525.x264_r degrade by 2% and 527.cam4_r by 2.95% on P10, and
>> 548.exchange2_r by 1.41% and 527.cam4_r by 2.54% on P8; choice 2
>> and bigger values could make 510.parest_r's gain become smaller.
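The gating for loops with unknown iteration counts described in heuristics 3) and 4) can likewise be sketched in plain C (a hand-written illustration with made-up names, assuming est_niter < 0 encodes "no estimate available"):

```c
/* Illustrative sketch: when do we keep the computed unroll factor UF
   rather than fall back to 1?  EST_NITER < 0 means no estimate.  */

static unsigned int
gate_unroll (unsigned int uf, unsigned int vf, long est_niter,
             int niters_known, int can_use_partial_vectors,
             unsigned int reduc_factor, int gather_load,
             unsigned int reduc_threshold) /* default 1 */
{
  /* Known iteration count: the costing is exact enough, keep UF.  */
  if (niters_known)
    return uf;
  /* Aggressive cases inspired by 510.parest_r: a reduction above the
     threshold, or an emulated gather load.  */
  if (reduc_factor > reduc_threshold || gather_load)
    return uf;
  unsigned long unrolled_vf = (unsigned long) vf * uf;
  /* Without an estimate, or with one below the unrolled VF, the loop
     might no longer be vectorizable at the unrolled VF.  */
  if (est_niter < 0 || (unsigned long) est_niter < unrolled_vf)
    return 1;
  unsigned long n = (unsigned long) est_niter;
  unsigned long epil_niter_unr = n % unrolled_vf;
  unsigned long epil_niter = n % vf;
  if (can_use_partial_vectors)
    /* Accept only one extra length computation in the epilogue.  */
    return epil_niter_unr <= vf ? uf : 1;
  /* The epilogue must not run longer than before unrolling.  */
  return epil_niter_unr <= epil_niter ? uf : 1;
}
```

For instance, with uf 4 and vf 2, an estimated count of 11 without partial vector support yields an unrolled epilogue of 3 iterations versus 1 before, so the sketch returns 1 and declines to unroll.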
>>
>> Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8, and
>> powerpc64le-linux-gnu P9.  Bootstrapped on powerpc64le-linux-gnu
>> P10, but one failure was exposed during regression testing there;
>> it has been identified as a missed optimization that can be
>> reproduced without this support, and PR106365 was opened for
>> further tracking.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> ------
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
>> m_nstores, m_reduc_factor, m_gather_load and member function
>> determine_suggested_unroll_factor.
>> (rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
>> m_reduc_factor and m_gather_load.
>> (rs6000_cost_data::determine_suggested_unroll_factor): New function.
>> (rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
>> * config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
>> (rs6000-vect-unroll-issue): Likewise.
>> (rs6000-vect-unroll-reduc-threshold): Likewise.
>> * doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.
>>
>> ---
>> gcc/config/rs6000/rs6000.cc | 125 ++++++++++++++++++++++++++++++++++-
>> gcc/config/rs6000/rs6000.opt | 18 +++++
>> gcc/doc/invoke.texi | 7 ++
>> 3 files changed, 147 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 3ff16b8ae04..d0f107d70a8 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -5208,16 +5208,23 @@ protected:
>> vect_cost_model_location, unsigned int);
>> void density_test (loop_vec_info);
>> void adjust_vect_cost_per_loop (loop_vec_info);
>> + unsigned int determine_suggested_unroll_factor (loop_vec_info);
>>
>> /* Total number of vectorized stmts (loop only). */
>> unsigned m_nstmts = 0;
>> /* Total number of loads (loop only). */
>> unsigned m_nloads = 0;
>> + /* Total number of stores (loop only). */
>> + unsigned m_nstores = 0;
>> + /* Reduction factor for suggesting unroll factor (loop only). */
>> + unsigned m_reduc_factor = 0;
>> /* Possible extra penalized cost on vector construction (loop only). */
>> unsigned m_extra_ctor_cost = 0;
>> /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>> instruction is needed by the vectorization. */
>> bool m_vect_nonmem = false;
>> + /* If this loop gets vectorized with emulated gather load. */
>> + bool m_gather_load = false;
>> };
>>
>> /* Test for likely overcommitment of vector hardware resources. If a
>> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
>> {
>> m_nstmts += orig_count;
>>
>> - if (kind == scalar_load || kind == vector_load
>> - || kind == unaligned_load || kind == vector_gather_load)
>> - m_nloads += orig_count;
>> + if (kind == scalar_load
>> + || kind == vector_load
>> + || kind == unaligned_load
>> + || kind == vector_gather_load)
>> + {
>> + m_nloads += orig_count;
>> + if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> + m_gather_load = true;
>> + }
>> + else if (kind == scalar_store
>> + || kind == vector_store
>> + || kind == unaligned_store
>> + || kind == vector_scatter_store)
>> + m_nstores += orig_count;
>> + else if ((kind == scalar_stmt
>> + || kind == vector_stmt
>> + || kind == vec_to_scalar)
>> + && stmt_info
>> + && vect_is_reduction (stmt_info))
>> + {
>> + /* Loop body contains normal int or fp operations and epilogue
>> + contains vector reduction. For simplicity, we assume int
>> + operation takes one cycle and fp operation takes one more. */
>> + tree lhs = gimple_get_lhs (stmt_info->stmt);
>> + bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
>> + unsigned int basic_cost = is_float ? 2 : 1;
>> + m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
>> + }
>>
>> /* Power processors do not currently have instructions for strided
>> and elementwise loads, and instead we must generate multiple
>> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
>> }
>> }
>>
>> +/* Determine the suggested unroll factor by considering the factors below:
>> +
>> + - unroll option/pragma which can disable unrolling for this loop;
>> + - simple hardware resource model for non memory vector insns;
>> + - aggressive heuristics when iteration count is unknown:
>> + - reduction case to break cross iteration dependency;
>> + - emulated gather load;
>> + - estimated iteration count when iteration count is unknown;
>> +*/
>> +
>> +
>> +unsigned int
>> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
>> +{
>> + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +
>> + /* Don't unroll if it's specified explicitly not to be unrolled. */
>> + if (loop->unroll == 1
>> + || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
>> + || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
>> + return 1;
>> +
>> + unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
>> + /* Don't unroll if there are no vector instructions except memory accesses. */
>> + if (nstmts_nonldst == 0)
>> + return 1;
>> +
>> + /* Consider breaking cross iteration dependency for reduction. */
>> + unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
>> +
>> + /* Use a simple hardware resource model based on how many non-ld/st
>> + vector instructions can be issued per cycle. */
>> + unsigned int issue_width = rs6000_vect_unroll_issue;
>> + unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
>> + uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
>> + /* Make sure it is a power of two. */
>> + uf = 1 << ceil_log2 (uf);
>> +
>> + /* If the iteration count is known, the costing is exact enough;
>> + don't worry that it could be worse. */
>> + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> + return uf;
>> +
>> + /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
>> + loop if either condition is satisfied:
>> + - reduction factor exceeds the threshold;
>> + - emulated gather load adopted. */
>> + if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
>> + || m_gather_load)
>> + return uf;
>> +
>> + /* Check if we can conclude it's good to unroll from the estimated
>> + iteration count. */
>> + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
>> + unsigned int vf = vect_vf_for_cost (loop_vinfo);
>> + unsigned int unrolled_vf = vf * uf;
>> + if (est_niter == -1 || est_niter < unrolled_vf)
>> + /* When the estimated iteration of this loop is unknown, it's possible
>> + that we are able to vectorize this loop with the original VF but fail
>> + to vectorize it with the unrolled VF any more if the actual iteration
>> + count is in between. */
>> + return 1;
>> + else
>> + {
>> + unsigned int epil_niter_unr = est_niter % unrolled_vf;
>> + unsigned int epil_niter = est_niter % vf;
>> + /* Even if we have partial vector support, it can still be inefficient
>> + to calculate the length when the iteration count is unknown, so
>> + only expect it's good to unroll when the epilogue iteration count
>> + is not bigger than VF (only one time length calculation). */
>> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> + && epil_niter_unr <= vf)
>> + return uf;
>> + /* Without partial vector support, conservatively unroll this when
>> + the epilogue iteration count is less than the original one
>> + (epilogue execution time wouldn't be longer than before). */
>> + else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> + && epil_niter_unr <= epil_niter)
>> + return uf;
>> + }
>> +
>> + return 1;
>> +}
>> +
>> void
>> rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>> {
>> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>> && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>> && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>> m_costs[vect_body] += 10000;
>> +
>> + m_suggested_unroll_factor
>> + = determine_suggested_unroll_factor (loop_vinfo);
>> }
>>
>> vector_costs::finish_cost (scalar_costs);
>> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
>> index 4931d781c4e..80c2c61a9de 100644
>> --- a/gcc/config/rs6000/rs6000.opt
>> +++ b/gcc/config/rs6000/rs6000.opt
>> @@ -624,6 +624,14 @@ mieee128-constant
>> Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>> Generate (do not generate) code that uses the LXVKQ instruction.
>>
>> +; Documented parameters
>> +
>> +-param=rs6000-vect-unroll-limit=
>> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
>> +Used to limit the unroll factor, which indicates how much the autovectorizer
>> +may unroll a loop. The default value is 4.
>> +
>> +; Undocumented parameters
>> -param=rs6000-density-pct-threshold=
>> Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
>> When costing for loop vectorization, we probably need to penalize the loop body
>> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we also check if the total
>> number of load statements exceeds the threshold specified by this parameter,
>> and penalize only if it's satisfied. The default value is 20.
>>
>> +-param=rs6000-vect-unroll-issue=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
>> +Indicates how many non-memory-access vector instructions can be issued per
>> +cycle; it is used when determining the unroll factor for the autovectorizer.
>> +The default value is 4.
>> +
>> +-param=rs6000-vect-unroll-reduc-threshold=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
>> +When the reduction factor computed for a loop exceeds the threshold specified
>> +by this parameter, prefer to unroll the loop. The default value is 1.
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 84d6f0f9860..097ab1d5563 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in privileged state.
>> @opindex no-block-ops-unaligned-vsx
>> Generate (do not generate) unaligned vsx loads and stores for
>> inline expansion of @code{memcpy} and @code{memmove}.
>> +
>> +@item --param rs6000-vect-unroll-limit=
>> +The vectorizer will check with target information to determine whether it
>> +would be beneficial to unroll the main vectorized loop and by how much. This
>> +parameter sets the upper bound of how much the vectorizer will unroll the main
>> +loop. The default value is four.
>> +
>> @end table
>>
>> @node RX Options
>> --
>> 2.27.0
2022-07-20 9:30 Kewen.Lin
2022-08-15 8:05 ` PING^1 " Kewen.Lin
2022-08-29 6:22 ` Kewen.Lin [this message]