From: "Kewen.Lin" <linkw@linux.ibm.com>
To: will schmidt <will_schmidt@vnet.ibm.com>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>,
Segher Boessenkool <segher@kernel.crashing.org>,
David Edelsohn <dje.gcc@gmail.com>,
GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH v2] rs6000: Add load density heuristic
Date: Wed, 28 Jul 2021 10:59:50 +0800 [thread overview]
Message-ID: <91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com> (raw)
In-Reply-To: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com>
Hi William,
Thanks for the review comments!
on 2021/7/28 上午6:25, will schmidt wrote:
> On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>
>
> Hi,
>
>
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>
> ok
>
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts. Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test. With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>
> ok
>
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
>
> ok
>
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>> * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>
> So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
> degredation with no additional regressions. that sounds good.
>
Sorry for the confusion, the original intention of this patch is to
fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r.
(From the last evaluation based on r12-1442, P8 is about -10% while
P9 is about -9%). The mentioned -1.04% degradation from 526.blender_r
on P8 was expected to be a reason why we need the bound for the extra
penalized cost adjustment.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> nstmts, nloads and extra_ctor_cost.
>> (rs6000_density_test): Add load density related heuristics and the
>> checks, do extra costing on vector construction statements if need.
>> (rs6000_init_cost): Init new members.
>> (rs6000_update_target_cost_per_stmt): New function.
>> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> rs6000_update_target_cost_per_stmt and call it.
>>
>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 83d29cbfac1..806c3335cbc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>>
>> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>> {
>> struct loop *loop_info;
>> unsigned cost[3];
>> + /* Total number of vectorized stmts (loop only). */
>> + unsigned nstmts;
>> + /* Total number of loads (loop only). */
>> + unsigned nloads;
>> + /* Possible extra penalized cost on vector construction (loop only). */
>> + unsigned extra_ctor_cost;
>>
>> /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>> instruction is needed by the vectorization. */
>> bool vect_nonmem;
>> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>> if (dump_enabled_p ())
>> dump_printf_loc (MSG_NOTE, vect_location,
>> "density %d%%, cost %d exceeds threshold, penalizing "
>> - "loop body cost by %d%%", density_pct,
>> + "loop body cost by %d%%\n", density_pct,
>> vec_cost + not_vec_cost, DENSITY_PENALTY);
>> }
>> +
>> + /* Check if we need to penalize the body cost for latency and
>> + execution resources bound from strided or elementwise loads
>> + into a vector. */
>> + if (data->extra_ctor_cost > 0)
>> + {
>> + /* Threshold for load stmts percentage in all vectorized stmts. */
>> + const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> + /* Threshold for total number of load stmts. */
>> + const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> +
>> + gcc_assert (data->nloads <= data->nstmts);
>> + unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> +
>> + /* It's likely to be bounded by latency and execution resources
>> + from many scalar loads which are strided or elementwise loads
>> + into a vector if both conditions below are found:
>> + 1. there are many loads, it's easy to result in a long wait
>> + for load units;
>> + 2. load has a big proportion of all vectorized statements,
>> + it's not easy to schedule other statements to spread among
>> + the loads.
>> + One typical case is the innermost loop of the hotspot of SPEC2017
>> + 503.bwaves_r without loop interchange. */
>> + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> + && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> + {
>> + data->cost[vect_body] += data->extra_ctor_cost;
>> + if (dump_enabled_p ())
>> + dump_printf_loc (MSG_NOTE, vect_location,
>> + "Found %u loads and load pct. %u%% exceed "
>> + "the threshold, penalizing loop body "
>> + "cost by extra cost %u for ctor.\n",
>> + data->nloads, load_pct, data->extra_ctor_cost);
>> + }
>> + }
>>
>> }
>>
>> /* Implement targetm.vectorize.init_cost. */
>> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>> data->cost[vect_body] = 0;
>> data->cost[vect_epilogue] = 0;
>> data->vect_nonmem = false;
>> + data->nstmts = 0;
>> + data->nloads = 0;
>> + data->extra_ctor_cost = 0;
>>
>> data->costing_for_scalar = costing_for_scalar;
>> return data;
>> }
>> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>> return 0;
>> }
>>
>> +/* As a visitor function for each statement cost entry handled in
>> + function add_stmt_cost, gather some information and update its
>> + relevant fields in target cost accordingly. */
>
> I got lost trying to parse that.. (could be just me :-)
>
> Possibly instead something like
> /* Helper function for add_stmt_cost ; gather information and update
> the target_cost fields accordingly. */
>
>
OK, will update. I was thinking for each entry handled in function
add_stmt_cost, this helper acts like a visitor, trying to visit each
entry and take some actions if some conditions are satisifed.
>
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> + enum vect_cost_for_stmt kind,
>> + struct _stmt_vec_info *stmt_info,
>> + enum vect_cost_model_location where,
>> + int stmt_cost, unsigned int orig_count)
>> +{
>> +
>> + /* Check whether we're doing something other than just a copy loop.
>> + Not all such loops may be profitably vectorized; see
>> + rs6000_finish_cost. */
>> + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
>> + || kind == vec_construct || kind == scalar_to_vec)
>> + || (where == vect_body && kind == vector_stmt))
>
> I thought I saw an alignment issue, then noticed the "(" that was
> hiding on me.. :-)
>
> Maybe clearer to read if you rearrange slightly and flatten it ? I
> defer to others on that..
>
> if ((kind == vec_to_scalar
> || kind == vec_perm
> || kind == vec_promote_demote
> || kind ==
> vec_construct
> || kind == scalar_to_vec)
> || (kind == vector_stmt && where == vect_body)
>
>
This hunk is factored out from function rs6000_add_stmt_cost, maybe I
can keep the original formatting? The formatting tool isn't so smart,
and sometimes rearrange things to become unexpected (although it meets
the basic rule, not so elegant), sigh.
>> + data->vect_nonmem = true;
>> +
>> + /* Gather some information when we are costing the vector version for
>> + the statements locate in a loop body. */
> s/version/instruction? operation?/ ? ?
> s/locate/located/
>
Will fix.
>> + if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>> + {
>> + data->nstmts += orig_count;
>> +
>> + if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>> + || kind == vector_gather_load)
>
> Cosmetically, I'd move the second "||" to the next line to balance
> those two lines a bit more.
>
Will fix.
>> + data->nloads += orig_count;
>> +
>> + /* If we have strided or elementwise loads into a vector, it's
>> + possible to be bounded by latency and execution resources for
>> + many scalar loads. Try to account for this by scaling the
>> + construction cost by the number of elements involved, when
>> + handling each matching statement we record the possible extra
>> + penalized cost into target cost, in the end of costing for
>> + the whole loop, we do the actual penalization once some load
>> + density heuristics are satisfied. */
>> + if (kind == vec_construct && stmt_info
>> + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>> + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>> + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>> + {
>> + tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> + unsigned int nunits = vect_nunits_for_cost (vectype);
>> + unsigned int extra_cost = nunits * stmt_cost;
>> + /* As function rs6000_builtin_vectorization_cost shows, we have
>> + priced much on V16QI/V8HI vector construction as their units,
>> + if we penalize them with nunits * stmt_cost, it can result in
>> + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
> s/a/an/
Will fix.
>> + 20 and nunits is 16, the extra cost is 320 which looks much
>> + exaggerated. So let's use one maximum bound for the extra
>> + penalized cost for vector construction here. */
>> + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> + extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> + data->extra_ctor_cost += extra_cost;
>> + }
>> + }
>> +}
> ok
>
>> +
>>
>> /* Implement targetm.vectorize.add_stmt_cost. */
>>
>> static unsigned
>> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>> /* Statements in an inner loop relative to the loop being
>> vectorized are weighted more heavily. The value here is
>> arbitrary and could potentially be improved with analysis. */
>> + unsigned int orig_count = count;
>
> I don't see any other assignments to orig_count. Does 'count' get
> updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
> variable is necessary?
>
Yeah, the 'count' could get updated but the default "git diff" doesn't
show it, the codes omitted look as below:
if (where == vect_body && stmt_info
&& stmt_in_inner_loop_p (vinfo, stmt_info))
{
loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
gcc_assert (loop_vinfo);
count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME. */
}
BR,
Kewen
>>
>> if (where == vect_body && stmt_info
>> && stmt_in_inner_loop_p (vinfo, stmt_info))
>> {
>> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>> retval = (unsigned) (count * stmt_cost);
>> cost_data->cost[where] += retval;
>>
>> - /* Check whether we're doing something other than just a copy loop.
>> - Not all such loops may be profitably vectorized; see
>> - rs6000_finish_cost. */
>> - if ((kind == vec_to_scalar || kind == vec_perm
>> - || kind == vec_promote_demote || kind == vec_construct
>> - || kind == scalar_to_vec)
>> - || (where == vect_body && kind == vector_stmt))
>> - cost_data->vect_nonmem = true;
>> + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> + stmt_cost, orig_count);
>>
>> }
>>
>> return retval;
>>
>
>
next prev parent reply other threads:[~2021-07-28 3:00 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-07 2:29 [PATCH] rs6000: Adjust rs6000_density_test for strided_load Kewen.Lin
2021-05-26 2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
2021-06-09 2:26 ` PING^1 " Kewen.Lin
2021-06-28 7:01 ` PING^2 " Kewen.Lin
2021-07-15 1:59 ` PING^3 " Kewen.Lin
2021-07-27 22:25 ` will schmidt
2021-07-28 2:59 ` Kewen.Lin [this message]
2021-09-06 23:43 ` Segher Boessenkool
2021-09-08 7:01 ` Kewen.Lin
2021-07-28 5:22 ` [PATCH v3] " Kewen.Lin
2021-09-03 15:57 ` Bill Schmidt
2021-09-08 6:57 ` [PATCH v4] " Kewen.Lin
2021-09-08 8:28 ` Kewen.Lin
2021-09-09 16:11 ` Segher Boessenkool
2021-09-09 17:19 ` Bill Schmidt
2021-09-09 17:39 ` Bill Schmidt
2021-09-09 18:24 ` Segher Boessenkool
2021-09-10 3:22 ` Kewen.Lin
2021-09-10 3:46 ` Kewen.Lin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com \
--to=linkw@linux.ibm.com \
--cc=dje.gcc@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=segher@kernel.crashing.org \
--cc=will_schmidt@vnet.ibm.com \
--cc=wschmidt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).