[PATCH v2] rs6000: Modify the way for extra penalized cost

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH v2] rs6000: Modify the way for extra penalized cost
@ 2021-09-28  8:16 Kewen.Lin
  2021-10-13  2:30 ` PING^1 " Kewen.Lin
  2021-11-29 22:06 ` Segher Boessenkool
  0 siblings, 2 replies; 8+ messages in thread
From: Kewen.Lin @ 2021-09-28  8:16 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, David Edelsohn, Bill Schmidt

Hi,

This patch follows the discussions here[1][2], where Segher
pointed out the existing way to guard the extra penalized
cost for strided/elementwise loads with a magic bound does
not scale.

The way with nunits * stmt_cost can get one much
exaggerated penalized cost, such as: for V16QI on P8, it's
16 * 20 = 320, that's why we need one bound.  To make it
better and more readable, the penalized cost is simplified
as:

    unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
    unsigned extra_cost = nunits * adjusted_cost;

For V2DI/V2DF, it uses 2 penalized cost for each scalar load
while for the other modes, it uses 1.  It's mainly concluded
from the performance evaluations.  One thing might be
related is that: More units vector gets constructed, more
instructions are used.  It has more chances to schedule them
better (even run in parallelly when enough available units
at that time), so it seems reasonable not to penalize more
for them.

The SPEC2017 evaluations on Power8/Power9/Power10 at option
sets O2-vect and Ofast-unroll show this change is neutral.

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
	the way to compute extra penalized cost.  Remove useless parameter.
	(rs6000_add_stmt_cost): Adjust the call to function
	rs6000_update_target_cost_per_stmt.


---
 gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index dd42b0964f1..8200e1152c2 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
 				    enum vect_cost_for_stmt kind,
 				    struct _stmt_vec_info *stmt_info,
 				    enum vect_cost_model_location where,
-				    int stmt_cost,
 				    unsigned int orig_count)
 {

@@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
 	{
 	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
 	  unsigned int nunits = vect_nunits_for_cost (vectype);
-	  unsigned int extra_cost = nunits * stmt_cost;
-	  /* As function rs6000_builtin_vectorization_cost shows, we have
-	     priced much on V16QI/V8HI vector construction as their units,
-	     if we penalize them with nunits * stmt_cost, it can result in
-	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-	     is 20 and nunits is 16, the extra cost is 320 which looks
-	     much exaggerated.  So let's use one maximum bound for the
-	     extra penalized cost for vector construction here.  */
-	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
-	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
+	  gcc_assert (nunits > 1);
+	  /* i386 port adopts nunits * stmt_cost as the penalized cost
+	     for this kind of penalization, we used to follow it but
+	     found it could result in an unreliable body cost especially
+	     for V16QI/V8HI modes.  To make it better, we choose this
+	     new heuristic: for each scalar load, we use 2 as penalized
+	     cost for the case with 2 nunits and use 1 for the other
+	     cases.  It's without much supporting theory, mainly
+	     concluded from the broad performance evaluations on Power8,
+	     Power9 and Power10.  One possibly related point is that:
+	     vector construction for more units would use more insns,
+	     it has more chances to schedule them better (even run in
+	     parallelly when enough available units at that time), so
+	     it seems reasonable not to penalize that much for them.  */
+	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
+	  unsigned int extra_cost = nunits * adjusted_cost;
 	  data->extra_ctor_cost += extra_cost;
 	}
     }
@@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       cost_data->cost[where] += retval;

       rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
-					  stmt_cost, orig_count);
+					  orig_count);
     }

   return retval;
--
2.27.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* PING^1 [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-09-28  8:16 [PATCH v2] rs6000: Modify the way for extra penalized cost Kewen.Lin
@ 2021-10-13  2:30 ` Kewen.Lin
  2021-10-20  9:29   ` PING^2 " Kewen.Lin
  2021-11-29 22:06 ` Segher Boessenkool
  1 sibling, 1 reply; 8+ messages in thread
From: Kewen.Lin @ 2021-10-13  2:30 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, David Edelsohn, Segher Boessenkool

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

on 2021/9/28 下午4:16, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> This patch follows the discussions here[1][2], where Segher
> pointed out the existing way to guard the extra penalized
> cost for strided/elementwise loads with a magic bound does
> not scale.
> 
> The way with nunits * stmt_cost can get one much
> exaggerated penalized cost, such as: for V16QI on P8, it's
> 16 * 20 = 320, that's why we need one bound.  To make it
> better and more readable, the penalized cost is simplified
> as:
> 
>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>     unsigned extra_cost = nunits * adjusted_cost;
> 
> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> while for the other modes, it uses 1.  It's mainly concluded
> from the performance evaluations.  One thing might be
> related is that: More units vector gets constructed, more
> instructions are used.  It has more chances to schedule them
> better (even run in parallelly when enough available units
> at that time), so it seems reasonable not to penalize more
> for them.
> 
> The SPEC2017 evaluations on Power8/Power9/Power10 at option
> sets O2-vect and Ofast-unroll show this change is neutral.
> 
> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
> 	the way to compute extra penalized cost.  Remove useless parameter.
> 	(rs6000_add_stmt_cost): Adjust the call to function
> 	rs6000_update_target_cost_per_stmt.
> 
> 
> ---
>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>  1 file changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index dd42b0964f1..8200e1152c2 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>  				    enum vect_cost_for_stmt kind,
>  				    struct _stmt_vec_info *stmt_info,
>  				    enum vect_cost_model_location where,
> -				    int stmt_cost,
>  				    unsigned int orig_count)
>  {
> 
> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>  	{
>  	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>  	  unsigned int nunits = vect_nunits_for_cost (vectype);
> -	  unsigned int extra_cost = nunits * stmt_cost;
> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
> -	     priced much on V16QI/V8HI vector construction as their units,
> -	     if we penalize them with nunits * stmt_cost, it can result in
> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> -	     is 20 and nunits is 16, the extra cost is 320 which looks
> -	     much exaggerated.  So let's use one maximum bound for the
> -	     extra penalized cost for vector construction here.  */
> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
> +	  gcc_assert (nunits > 1);
> +	  /* i386 port adopts nunits * stmt_cost as the penalized cost
> +	     for this kind of penalization, we used to follow it but
> +	     found it could result in an unreliable body cost especially
> +	     for V16QI/V8HI modes.  To make it better, we choose this
> +	     new heuristic: for each scalar load, we use 2 as penalized
> +	     cost for the case with 2 nunits and use 1 for the other
> +	     cases.  It's without much supporting theory, mainly
> +	     concluded from the broad performance evaluations on Power8,
> +	     Power9 and Power10.  One possibly related point is that:
> +	     vector construction for more units would use more insns,
> +	     it has more chances to schedule them better (even run in
> +	     parallelly when enough available units at that time), so
> +	     it seems reasonable not to penalize that much for them.  */
> +	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
> +	  unsigned int extra_cost = nunits * adjusted_cost;
>  	  data->extra_ctor_cost += extra_cost;
>  	}
>      }
> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        cost_data->cost[where] += retval;
> 
>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> -					  stmt_cost, orig_count);
> +					  orig_count);
>      }
> 
>    return retval;
> --
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* PING^2 [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-10-13  2:30 ` PING^1 " Kewen.Lin
@ 2021-10-20  9:29   ` Kewen.Lin
  2021-11-04 10:56     ` PING^3 " Kewen.Lin
  0 siblings, 1 reply; 8+ messages in thread
From: Kewen.Lin @ 2021-10-20  9:29 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

> on 2021/9/28 下午4:16, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> This patch follows the discussions here[1][2], where Segher
>> pointed out the existing way to guard the extra penalized
>> cost for strided/elementwise loads with a magic bound does
>> not scale.
>>
>> The way with nunits * stmt_cost can get one much
>> exaggerated penalized cost, such as: for V16QI on P8, it's
>> 16 * 20 = 320, that's why we need one bound.  To make it
>> better and more readable, the penalized cost is simplified
>> as:
>>
>>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>     unsigned extra_cost = nunits * adjusted_cost;
>>
>> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
>> while for the other modes, it uses 1.  It's mainly concluded
>> from the performance evaluations.  One thing might be
>> related is that: More units vector gets constructed, more
>> instructions are used.  It has more chances to schedule them
>> better (even run in parallelly when enough available units
>> at that time), so it seems reasonable not to penalize more
>> for them.
>>
>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>> sets O2-vect and Ofast-unroll show this change is neutral.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>> 	the way to compute extra penalized cost.  Remove useless parameter.
>> 	(rs6000_add_stmt_cost): Adjust the call to function
>> 	rs6000_update_target_cost_per_stmt.
>>
>>
>> ---
>>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index dd42b0964f1..8200e1152c2 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>  				    enum vect_cost_for_stmt kind,
>>  				    struct _stmt_vec_info *stmt_info,
>>  				    enum vect_cost_model_location where,
>> -				    int stmt_cost,
>>  				    unsigned int orig_count)
>>  {
>>
>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>  	{
>>  	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>  	  unsigned int nunits = vect_nunits_for_cost (vectype);
>> -	  unsigned int extra_cost = nunits * stmt_cost;
>> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> -	     priced much on V16QI/V8HI vector construction as their units,
>> -	     if we penalize them with nunits * stmt_cost, it can result in
>> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> -	     is 20 and nunits is 16, the extra cost is 320 which looks
>> -	     much exaggerated.  So let's use one maximum bound for the
>> -	     extra penalized cost for vector construction here.  */
>> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
>> +	  gcc_assert (nunits > 1);
>> +	  /* i386 port adopts nunits * stmt_cost as the penalized cost
>> +	     for this kind of penalization, we used to follow it but
>> +	     found it could result in an unreliable body cost especially
>> +	     for V16QI/V8HI modes.  To make it better, we choose this
>> +	     new heuristic: for each scalar load, we use 2 as penalized
>> +	     cost for the case with 2 nunits and use 1 for the other
>> +	     cases.  It's without much supporting theory, mainly
>> +	     concluded from the broad performance evaluations on Power8,
>> +	     Power9 and Power10.  One possibly related point is that:
>> +	     vector construction for more units would use more insns,
>> +	     it has more chances to schedule them better (even run in
>> +	     parallelly when enough available units at that time), so
>> +	     it seems reasonable not to penalize that much for them.  */
>> +	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>> +	  unsigned int extra_cost = nunits * adjusted_cost;
>>  	  data->extra_ctor_cost += extra_cost;
>>  	}
>>      }
>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        cost_data->cost[where] += retval;
>>
>>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> -					  stmt_cost, orig_count);
>> +					  orig_count);
>>      }
>>
>>    return retval;
>> --
>> 2.27.0
>>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* PING^3 [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-10-20  9:29   ` PING^2 " Kewen.Lin
@ 2021-11-04 10:56     ` Kewen.Lin
  2021-11-22  2:23       ` PING^4 " Kewen.Lin
  0 siblings, 1 reply; 8+ messages in thread
From: Kewen.Lin @ 2021-11-04 10:56 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, David Edelsohn, Segher Boessenkool

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>> on 2021/9/28 下午4:16, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> This patch follows the discussions here[1][2], where Segher
>>> pointed out the existing way to guard the extra penalized
>>> cost for strided/elementwise loads with a magic bound does
>>> not scale.
>>>
>>> The way with nunits * stmt_cost can get one much
>>> exaggerated penalized cost, such as: for V16QI on P8, it's
>>> 16 * 20 = 320, that's why we need one bound.  To make it
>>> better and more readable, the penalized cost is simplified
>>> as:
>>>
>>>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>>     unsigned extra_cost = nunits * adjusted_cost;
>>>
>>> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
>>> while for the other modes, it uses 1.  It's mainly concluded
>>> from the performance evaluations.  One thing might be
>>> related is that: More units vector gets constructed, more
>>> instructions are used.  It has more chances to schedule them
>>> better (even run in parallelly when enough available units
>>> at that time), so it seems reasonable not to penalize more
>>> for them.
>>>
>>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>>> sets O2-vect and Ofast-unroll show this change is neutral.
>>>
>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>
>>> Is it ok for trunk?
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>
>>> BR,
>>> Kewen
>>> -----
>>> gcc/ChangeLog:
>>>
>>> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>> 	the way to compute extra penalized cost.  Remove useless parameter.
>>> 	(rs6000_add_stmt_cost): Adjust the call to function
>>> 	rs6000_update_target_cost_per_stmt.
>>>
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>> index dd42b0964f1..8200e1152c2 100644
>>> --- a/gcc/config/rs6000/rs6000.c
>>> +++ b/gcc/config/rs6000/rs6000.c
>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>  				    enum vect_cost_for_stmt kind,
>>>  				    struct _stmt_vec_info *stmt_info,
>>>  				    enum vect_cost_model_location where,
>>> -				    int stmt_cost,
>>>  				    unsigned int orig_count)
>>>  {
>>>
>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>  	{
>>>  	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>  	  unsigned int nunits = vect_nunits_for_cost (vectype);
>>> -	  unsigned int extra_cost = nunits * stmt_cost;
>>> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
>>> -	     priced much on V16QI/V8HI vector construction as their units,
>>> -	     if we penalize them with nunits * stmt_cost, it can result in
>>> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> -	     is 20 and nunits is 16, the extra cost is 320 which looks
>>> -	     much exaggerated.  So let's use one maximum bound for the
>>> -	     extra penalized cost for vector construction here.  */
>>> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
>>> +	  gcc_assert (nunits > 1);
>>> +	  /* i386 port adopts nunits * stmt_cost as the penalized cost
>>> +	     for this kind of penalization, we used to follow it but
>>> +	     found it could result in an unreliable body cost especially
>>> +	     for V16QI/V8HI modes.  To make it better, we choose this
>>> +	     new heuristic: for each scalar load, we use 2 as penalized
>>> +	     cost for the case with 2 nunits and use 1 for the other
>>> +	     cases.  It's without much supporting theory, mainly
>>> +	     concluded from the broad performance evaluations on Power8,
>>> +	     Power9 and Power10.  One possibly related point is that:
>>> +	     vector construction for more units would use more insns,
>>> +	     it has more chances to schedule them better (even run in
>>> +	     parallelly when enough available units at that time), so
>>> +	     it seems reasonable not to penalize that much for them.  */
>>> +	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>>> +	  unsigned int extra_cost = nunits * adjusted_cost;
>>>  	  data->extra_ctor_cost += extra_cost;
>>>  	}
>>>      }
>>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>>        cost_data->cost[where] += retval;
>>>
>>>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>>> -					  stmt_cost, orig_count);
>>> +					  orig_count);
>>>      }
>>>
>>>    return retval;
>>> --
>>> 2.27.0
>>>
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* PING^4 [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-11-04 10:56     ` PING^3 " Kewen.Lin
@ 2021-11-22  2:23       ` Kewen.Lin
  0 siblings, 0 replies; 8+ messages in thread
From: Kewen.Lin @ 2021-11-22  2:23 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bill Schmidt, David Edelsohn, GCC Patches

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>>> on 2021/9/28 下午4:16, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> This patch follows the discussions here[1][2], where Segher
>>>> pointed out the existing way to guard the extra penalized
>>>> cost for strided/elementwise loads with a magic bound does
>>>> not scale.
>>>>
>>>> The way with nunits * stmt_cost can get one much
>>>> exaggerated penalized cost, such as: for V16QI on P8, it's
>>>> 16 * 20 = 320, that's why we need one bound.  To make it
>>>> better and more readable, the penalized cost is simplified
>>>> as:
>>>>
>>>>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>>>     unsigned extra_cost = nunits * adjusted_cost;
>>>>
>>>> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
>>>> while for the other modes, it uses 1.  It's mainly concluded
>>>> from the performance evaluations.  One thing might be
>>>> related is that: More units vector gets constructed, more
>>>> instructions are used.  It has more chances to schedule them
>>>> better (even run in parallelly when enough available units
>>>> at that time), so it seems reasonable not to penalize more
>>>> for them.
>>>>
>>>> The SPEC2017 evaluations on Power8/Power9/Power10 at option
>>>> sets O2-vect and Ofast-unroll show this change is neutral.
>>>>
>>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>>
>>>> Is it ok for trunk?
>>>>
>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>>
>>>> BR,
>>>> Kewen
>>>> -----
>>>> gcc/ChangeLog:
>>>>
>>>> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>>> 	the way to compute extra penalized cost.  Remove useless parameter.
>>>> 	(rs6000_add_stmt_cost): Adjust the call to function
>>>> 	rs6000_update_target_cost_per_stmt.
>>>>
>>>>
>>>> ---
>>>>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>>> index dd42b0964f1..8200e1152c2 100644
>>>> --- a/gcc/config/rs6000/rs6000.c
>>>> +++ b/gcc/config/rs6000/rs6000.c
>>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>  				    enum vect_cost_for_stmt kind,
>>>>  				    struct _stmt_vec_info *stmt_info,
>>>>  				    enum vect_cost_model_location where,
>>>> -				    int stmt_cost,
>>>>  				    unsigned int orig_count)
>>>>  {
>>>>
>>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>  	{
>>>>  	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>>  	  unsigned int nunits = vect_nunits_for_cost (vectype);
>>>> -	  unsigned int extra_cost = nunits * stmt_cost;
>>>> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
>>>> -	     priced much on V16QI/V8HI vector construction as their units,
>>>> -	     if we penalize them with nunits * stmt_cost, it can result in
>>>> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>>> -	     is 20 and nunits is 16, the extra cost is 320 which looks
>>>> -	     much exaggerated.  So let's use one maximum bound for the
>>>> -	     extra penalized cost for vector construction here.  */
>>>> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>>> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>>> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>>> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
>>>> +	  gcc_assert (nunits > 1);
>>>> +	  /* i386 port adopts nunits * stmt_cost as the penalized cost
>>>> +	     for this kind of penalization, we used to follow it but
>>>> +	     found it could result in an unreliable body cost especially
>>>> +	     for V16QI/V8HI modes.  To make it better, we choose this
>>>> +	     new heuristic: for each scalar load, we use 2 as penalized
>>>> +	     cost for the case with 2 nunits and use 1 for the other
>>>> +	     cases.  It's without much supporting theory, mainly
>>>> +	     concluded from the broad performance evaluations on Power8,
>>>> +	     Power9 and Power10.  One possibly related point is that:
>>>> +	     vector construction for more units would use more insns,
>>>> +	     it has more chances to schedule them better (even run in
>>>> +	     parallelly when enough available units at that time), so
>>>> +	     it seems reasonable not to penalize that much for them.  */
>>>> +	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>>>> +	  unsigned int extra_cost = nunits * adjusted_cost;
>>>>  	  data->extra_ctor_cost += extra_cost;
>>>>  	}
>>>>      }
>>>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>>>        cost_data->cost[where] += retval;
>>>>
>>>>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>>>> -					  stmt_cost, orig_count);
>>>> +					  orig_count);
>>>>      }
>>>>
>>>>    return retval;
>>>> --
>>>> 2.27.0
>>>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-09-28  8:16 [PATCH v2] rs6000: Modify the way for extra penalized cost Kewen.Lin
  2021-10-13  2:30 ` PING^1 " Kewen.Lin
@ 2021-11-29 22:06 ` Segher Boessenkool
  2021-11-30  5:05   ` Kewen.Lin
  1 sibling, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2021-11-29 22:06 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, David Edelsohn, Bill Schmidt

Hi!

On Tue, Sep 28, 2021 at 04:16:04PM +0800, Kewen.Lin wrote:
> This patch follows the discussions here[1][2], where Segher
> pointed out the existing way to guard the extra penalized
> cost for strided/elementwise loads with a magic bound does
> not scale.
> 
> The way with nunits * stmt_cost can get one much
> exaggerated penalized cost, such as: for V16QI on P8, it's
> 16 * 20 = 320, that's why we need one bound.  To make it
> better and more readable, the penalized cost is simplified
> as:
> 
>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>     unsigned extra_cost = nunits * adjusted_cost;

> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> while for the other modes, it uses 1.

So for V2D[IF] we get 4, for V4S[IF] we get 4, for V8HI it's 8, and
for V16QI it is 16?  Pretty terrible as well, heh (I would expect all
vector ops to be similar cost).

> It's mainly concluded
> from the performance evaluations.  One thing might be
> related is that: More units vector gets constructed, more
> instructions are used.

Yes, but how often does that happen, compared to actual vector ops?

This also suggests we should cost vector construction separately, which
would pretty obviously be a good thing anyway (it happens often, it has
a quite different cost structure).

> It has more chances to schedule them
> better (even run in parallelly when enough available units
> at that time), so it seems reasonable not to penalize more
> for them.

Yes.

> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */

"We don't expect" etc.

Okay for trunk.  Thanks!  This probably isn't the last word in this
story, but it is an improvement in any case :-)


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-11-29 22:06 ` Segher Boessenkool
@ 2021-11-30  5:05   ` Kewen.Lin
  2021-11-30 23:05     ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Kewen.Lin @ 2021-11-30  5:05 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, David Edelsohn, Bill Schmidt

Hi Segher,

on 2021/11/30 上午6:06, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Sep 28, 2021 at 04:16:04PM +0800, Kewen.Lin wrote:
>> This patch follows the discussions here[1][2], where Segher
>> pointed out the existing way to guard the extra penalized
>> cost for strided/elementwise loads with a magic bound does
>> not scale.
>>
>> The way with nunits * stmt_cost can get one much
>> exaggerated penalized cost, such as: for V16QI on P8, it's
>> 16 * 20 = 320, that's why we need one bound.  To make it
>> better and more readable, the penalized cost is simplified
>> as:
>>
>>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>     unsigned extra_cost = nunits * adjusted_cost;
> 
>> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
>> while for the other modes, it uses 1.
> 
> So for V2D[IF] we get 4, for V4S[IF] we get 4, for V8HI it's 8, and
> for V16QI it is 16?  Pretty terrible as well, heh (I would expect all
> vector ops to be similar cost).
> 

But for different vector units it has different number of loads, it seems
reasonable to have more costs when it has more loads to be fed into those
limited number of load/store units.

>> It's mainly concluded
>> from the performance evaluations.  One thing might be
>> related is that: More units vector gets constructed, more
>> instructions are used.
> 
> Yes, but how often does that happen, compared to actual vector ops?
> 
> This also suggests we should cost vector construction separately, which
> would pretty obviously be a good thing anyway (it happens often, it has
> a quite different cost structure).
> 

vectorizer does model vector construction separately, there is an enum
vect_cost_for_stmt *vec_construct*, normally it works well.  But for this
bwaves hotspot, it requires us to do some more penalization as evaluated,
so we put the penalized cost onto this special vector construction when
some heuristic thresholds are met.

>> It has more chances to schedule them
>> better (even run in parallelly when enough available units
>> at that time), so it seems reasonable not to penalize more
>> for them.
> 
> Yes.
> 
>> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
> 
> "We don't expect" etc.
> 

Fixed.

> Okay for trunk.  Thanks!  This probably isn't the last word in this
> story, but it is an improvement in any case :-)
> 
> 

Thanks for the review, rebased/retested and committed as r12-5589.
BR,
Kewen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] rs6000: Modify the way for extra penalized cost
  2021-11-30  5:05   ` Kewen.Lin
@ 2021-11-30 23:05     ` Segher Boessenkool
  0 siblings, 0 replies; 8+ messages in thread
From: Segher Boessenkool @ 2021-11-30 23:05 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, David Edelsohn, Bill Schmidt

Hi!

On Tue, Nov 30, 2021 at 01:05:48PM +0800, Kewen.Lin wrote:
> on 2021/11/30 上午6:06, Segher Boessenkool wrote:
> > On Tue, Sep 28, 2021 at 04:16:04PM +0800, Kewen.Lin wrote:
> >>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
> >>     unsigned extra_cost = nunits * adjusted_cost;
> > 
> >> For V2DI/V2DF, it uses 2 penalized cost for each scalar load
> >> while for the other modes, it uses 1.
> > 
> > So for V2D[IF] we get 4, for V4S[IF] we get 4, for V8HI it's 8, and
> > for V16QI it is 16?  Pretty terrible as well, heh (I would expect all
> > vector ops to be similar cost).
> 
> But for different vector units it has different number of loads, it seems
> reasonable to have more costs when it has more loads to be fed into those
> limited number of load/store units.

More expensive, yes.  This expensive?  That doesn't look optimal :-)

> > This also suggests we should cost vector construction separately, which
> > would pretty obviously be a good thing anyway (it happens often, it has
> > a quite different cost structure).
> 
> vectorizer does model vector construction separately, there is an enum
> vect_cost_for_stmt *vec_construct*, normally it works well.  But for this
> bwaves hotspot, it requires us to do some more penalization as evaluated,
> so we put the penalized cost onto this special vector construction when
> some heuristic thresholds are met.

Ah, heuristics.  We can adjust them forever :-)


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-11-30 23:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-28  8:16 [PATCH v2] rs6000: Modify the way for extra penalized cost Kewen.Lin
2021-10-13  2:30 ` PING^1 " Kewen.Lin
2021-10-20  9:29   ` PING^2 " Kewen.Lin
2021-11-04 10:56     ` PING^3 " Kewen.Lin
2021-11-22  2:23       ` PING^4 " Kewen.Lin
2021-11-29 22:06 ` Segher Boessenkool
2021-11-30  5:05   ` Kewen.Lin
2021-11-30 23:05     ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).