public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] optimize costly division in rtx_cost
@ 2020-01-05 18:39 Alexander Monakov
  2020-01-10  9:47 ` [PING][PATCH] " Alexander Monakov
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-05 18:39 UTC (permalink / raw)
  To: gcc-patches

Hi,

I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
It's also evident that rtx_cost does redundant work for a SET rtx argument.

Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
cases it can be computed with a well-predictable branch rather than a division.

This patch makes rtx_cost do the division only in case mode is wider than
UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
No functional change.

Bootstrapped on x86_64, ok for trunk?

To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
    
    before:
           73887675319      instructions:u
    
           72438432200      cycles:u
             924298569      idq.ms_uops:u
          102603799255      uops_dispatched.thread:u
    
    after:
           73888371724      instructions:u
    
           72386986612      cycles:u
             802744775      idq.ms_uops:u
          102096987220      uops_dispatched.thread:u

(this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
so the unneeded division is responsible for a good fraction of them)

	* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
	mode is not wider than UNITS_PER_WORD.

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 9a7afccefb8..c7ab86e228b 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
   const char *fmt;
   int total;
   int factor;
+  unsigned mode_size;
 
   if (x == 0)
     return 0;
 
-  if (GET_MODE (x) != VOIDmode)
+  if (GET_CODE (x) == SET)
+    /* A SET doesn't have a mode, so let's look at the SET_DEST to get
+       the mode for the factor.  */
+    mode = GET_MODE (SET_DEST (x));
+  else if (GET_MODE (x) != VOIDmode)
     mode = GET_MODE (x);
 
+  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
+
   /* A size N times larger than UNITS_PER_WORD likely needs N times as
      many insns, taking N times as long.  */
-  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
-  if (factor == 0)
-    factor = 1;
+  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
 
   /* Compute the default costs of certain things.
      Note that targetm.rtx_costs can override the defaults.  */
@@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
       /* Used in combine.c as a marker.  */
       total = 0;
       break;
-    case SET:
-      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
-	 the mode for the factor.  */
-      mode = GET_MODE (SET_DEST (x));
-      factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
-      if (factor == 0)
-	factor = 1;
-      /* FALLTHRU */
     default:
       total = factor * COSTS_N_INSNS (1);
     }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PING][PATCH] optimize costly division in rtx_cost
  2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
@ 2020-01-10  9:47 ` Alexander Monakov
  2020-01-20  9:37 ` [PATCH] " Alexander Monakov
  2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
  2 siblings, 0 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-10  9:47 UTC (permalink / raw)
  To: gcc-patches

Ping.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>     
>     before:
>            73887675319      instructions:u
>     
>            72438432200      cycles:u
>              924298569      idq.ms_uops:u
>           102603799255      uops_dispatched.thread:u
>     
>     after:
>            73888371724      instructions:u
>     
>            72386986612      cycles:u
>              802744775      idq.ms_uops:u
>           102096987220      uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
> 	* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> 	mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>    const char *fmt;
>    int total;
>    int factor;
> +  unsigned mode_size;
>  
>    if (x == 0)
>      return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +    /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +       the mode for the factor.  */
> +    mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>      mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>    /* A size N times larger than UNITS_PER_WORD likely needs N times as
>       many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -    factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>    /* Compute the default costs of certain things.
>       Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>        /* Used in combine.c as a marker.  */
>        total = 0;
>        break;
> -    case SET:
> -      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -	 the mode for the factor.  */
> -      mode = GET_MODE (SET_DEST (x));
> -      factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -      if (factor == 0)
> -	factor = 1;
> -      /* FALLTHRU */
>      default:
>        total = factor * COSTS_N_INSNS (1);
>      }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] optimize costly division in rtx_cost
  2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
  2020-01-10  9:47 ` [PING][PATCH] " Alexander Monakov
@ 2020-01-20  9:37 ` Alexander Monakov
  2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
  2 siblings, 0 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-20  9:37 UTC (permalink / raw)
  To: gcc-patches

Ping.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>     
>     before:
>            73887675319      instructions:u
>     
>            72438432200      cycles:u
>              924298569      idq.ms_uops:u
>           102603799255      uops_dispatched.thread:u
>     
>     after:
>            73888371724      instructions:u
>     
>            72386986612      cycles:u
>              802744775      idq.ms_uops:u
>           102096987220      uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
> 	* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> 	mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>    const char *fmt;
>    int total;
>    int factor;
> +  unsigned mode_size;
>  
>    if (x == 0)
>      return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +    /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +       the mode for the factor.  */
> +    mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>      mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>    /* A size N times larger than UNITS_PER_WORD likely needs N times as
>       many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -    factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>    /* Compute the default costs of certain things.
>       Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>        /* Used in combine.c as a marker.  */
>        total = 0;
>        break;
> -    case SET:
> -      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -	 the mode for the factor.  */
> -      mode = GET_MODE (SET_DEST (x));
> -      factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -      if (factor == 0)
> -	factor = 1;
> -      /* FALLTHRU */
>      default:
>        total = factor * COSTS_N_INSNS (1);
>      }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PING^3][PATCH] optimize costly division in rtx_cost
  2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
  2020-01-10  9:47 ` [PING][PATCH] " Alexander Monakov
  2020-01-20  9:37 ` [PATCH] " Alexander Monakov
@ 2020-02-13 13:46 ` Alexander Monakov
  2020-02-13 13:55   ` Richard Biener
  2 siblings, 1 reply; 5+ messages in thread
From: Alexander Monakov @ 2020-02-13 13:46 UTC (permalink / raw)
  To: gcc-patches

Ping^3.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>     
>     before:
>            73887675319      instructions:u
>     
>            72438432200      cycles:u
>              924298569      idq.ms_uops:u
>           102603799255      uops_dispatched.thread:u
>     
>     after:
>            73888371724      instructions:u
>     
>            72386986612      cycles:u
>              802744775      idq.ms_uops:u
>           102096987220      uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
> 	* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> 	mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>    const char *fmt;
>    int total;
>    int factor;
> +  unsigned mode_size;
>  
>    if (x == 0)
>      return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +    /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +       the mode for the factor.  */
> +    mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>      mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>    /* A size N times larger than UNITS_PER_WORD likely needs N times as
>       many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -    factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>    /* Compute the default costs of certain things.
>       Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
>        /* Used in combine.c as a marker.  */
>        total = 0;
>        break;
> -    case SET:
> -      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -	 the mode for the factor.  */
> -      mode = GET_MODE (SET_DEST (x));
> -      factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -      if (factor == 0)
> -	factor = 1;
> -      /* FALLTHRU */
>      default:
>        total = factor * COSTS_N_INSNS (1);
>      }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PING^3][PATCH] optimize costly division in rtx_cost
  2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
@ 2020-02-13 13:55   ` Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2020-02-13 13:55 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: GCC Patches

On Thu, Feb 13, 2020 at 2:46 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> Ping^3.

OK.

> On Sun, 5 Jan 2020, Alexander Monakov wrote:
>
> > Hi,
> >
> > I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> > any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> > It's also evident that rtx_cost does redundant work for a SET rtx argument.
> >
> > Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> > cases it can be computed with a well-predictable branch rather than a division.
> >
> > This patch makes rtx_cost do the division only in case mode is wider than
> > UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> > No functional change.
> >
> > Bootstrapped on x86_64, ok for trunk?
> >
> > To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
> >
> >     before:
> >            73887675319      instructions:u
> >
> >            72438432200      cycles:u
> >              924298569      idq.ms_uops:u
> >           102603799255      uops_dispatched.thread:u
> >
> >     after:
> >            73888371724      instructions:u
> >
> >            72386986612      cycles:u
> >              802744775      idq.ms_uops:u
> >           102096987220      uops_dispatched.thread:u
> >
> > (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> > so the unneeded division is responsible for a good fraction of them)
> >
> >       * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> >       mode is not wider than UNITS_PER_WORD.
> >
> > diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> > index 9a7afccefb8..c7ab86e228b 100644
> > --- a/gcc/rtlanal.c
> > +++ b/gcc/rtlanal.c
> > @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> >    const char *fmt;
> >    int total;
> >    int factor;
> > +  unsigned mode_size;
> >
> >    if (x == 0)
> >      return 0;
> >
> > -  if (GET_MODE (x) != VOIDmode)
> > +  if (GET_CODE (x) == SET)
> > +    /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> > +       the mode for the factor.  */
> > +    mode = GET_MODE (SET_DEST (x));
> > +  else if (GET_MODE (x) != VOIDmode)
> >      mode = GET_MODE (x);
> >
> > +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> > +
> >    /* A size N times larger than UNITS_PER_WORD likely needs N times as
> >       many insns, taking N times as long.  */
> > -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> > -  if (factor == 0)
> > -    factor = 1;
> > +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
> >
> >    /* Compute the default costs of certain things.
> >       Note that targetm.rtx_costs can override the defaults.  */
> > @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> >        /* Used in combine.c as a marker.  */
> >        total = 0;
> >        break;
> > -    case SET:
> > -      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> > -      the mode for the factor.  */
> > -      mode = GET_MODE (SET_DEST (x));
> > -      factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> > -      if (factor == 0)
> > -     factor = 1;
> > -      /* FALLTHRU */
> >      default:
> >        total = factor * COSTS_N_INSNS (1);
> >      }
> >

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-02-13 13:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
2020-01-10  9:47 ` [PING][PATCH] " Alexander Monakov
2020-01-20  9:37 ` [PATCH] " Alexander Monakov
2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
2020-02-13 13:55   ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).