* [PATCH] optimize costly division in rtx_cost
@ 2020-01-05 18:39 Alexander Monakov
2020-01-10 9:47 ` [PING][PATCH] " Alexander Monakov
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-05 18:39 UTC (permalink / raw)
To: gcc-patches
Hi,
I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
It's also evident that rtx_cost does redundant work for a SET rtx argument.
Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
cases it can be computed with a well-predictable branch rather than a division.
This patch makes rtx_cost do the division only in case mode is wider than
UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
No functional change.
Bootstrapped on x86_64, ok for trunk?
To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
before:
73887675319 instructions:u
72438432200 cycles:u
924298569 idq.ms_uops:u
102603799255 uops_dispatched.thread:u
after:
73888371724 instructions:u
72386986612 cycles:u
802744775 idq.ms_uops:u
102096987220 uops_dispatched.thread:u
(this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
so the unneeded division is responsible for a good fraction of them)
* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
mode is not wider than UNITS_PER_WORD.
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 9a7afccefb8..c7ab86e228b 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
const char *fmt;
int total;
int factor;
+ unsigned mode_size;
if (x == 0)
return 0;
- if (GET_MODE (x) != VOIDmode)
+ if (GET_CODE (x) == SET)
+ /* A SET doesn't have a mode, so let's look at the SET_DEST to get
+ the mode for the factor. */
+ mode = GET_MODE (SET_DEST (x));
+ else if (GET_MODE (x) != VOIDmode)
mode = GET_MODE (x);
+ mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
+
/* A size N times larger than UNITS_PER_WORD likely needs N times as
many insns, taking N times as long. */
- factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
- if (factor == 0)
- factor = 1;
+ factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
/* Compute the default costs of certain things.
Note that targetm.rtx_costs can override the defaults. */
@@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
/* Used in combine.c as a marker. */
total = 0;
break;
- case SET:
- /* A SET doesn't have a mode, so let's look at the SET_DEST to get
- the mode for the factor. */
- mode = GET_MODE (SET_DEST (x));
- factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
- if (factor == 0)
- factor = 1;
- /* FALLTHRU */
default:
total = factor * COSTS_N_INSNS (1);
}
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PING][PATCH] optimize costly division in rtx_cost
2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
@ 2020-01-10 9:47 ` Alexander Monakov
2020-01-20 9:37 ` [PATCH] " Alexander Monakov
2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
2 siblings, 0 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-10 9:47 UTC (permalink / raw)
To: gcc-patches
Ping.
On Sun, 5 Jan 2020, Alexander Monakov wrote:
> Hi,
>
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
>
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
>
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
>
> Bootstrapped on x86_64, ok for trunk?
>
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>
> before:
> 73887675319 instructions:u
>
> 72438432200 cycles:u
> 924298569 idq.ms_uops:u
> 102603799255 uops_dispatched.thread:u
>
> after:
> 73888371724 instructions:u
>
> 72386986612 cycles:u
> 802744775 idq.ms_uops:u
> 102096987220 uops_dispatched.thread:u
>
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
>
> * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> mode is not wider than UNITS_PER_WORD.
>
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> const char *fmt;
> int total;
> int factor;
> + unsigned mode_size;
>
> if (x == 0)
> return 0;
>
> - if (GET_MODE (x) != VOIDmode)
> + if (GET_CODE (x) == SET)
> + /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> + the mode for the factor. */
> + mode = GET_MODE (SET_DEST (x));
> + else if (GET_MODE (x) != VOIDmode)
> mode = GET_MODE (x);
>
> + mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
> /* A size N times larger than UNITS_PER_WORD likely needs N times as
> many insns, taking N times as long. */
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> + factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>
> /* Compute the default costs of certain things.
> Note that targetm.rtx_costs can override the defaults. */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> /* Used in combine.c as a marker. */
> total = 0;
> break;
> - case SET:
> - /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> - the mode for the factor. */
> - mode = GET_MODE (SET_DEST (x));
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> - /* FALLTHRU */
> default:
> total = factor * COSTS_N_INSNS (1);
> }
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] optimize costly division in rtx_cost
2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
2020-01-10 9:47 ` [PING][PATCH] " Alexander Monakov
@ 2020-01-20 9:37 ` Alexander Monakov
2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
2 siblings, 0 replies; 5+ messages in thread
From: Alexander Monakov @ 2020-01-20 9:37 UTC (permalink / raw)
To: gcc-patches
Ping.
On Sun, 5 Jan 2020, Alexander Monakov wrote:
> Hi,
>
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
>
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
>
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
>
> Bootstrapped on x86_64, ok for trunk?
>
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>
> before:
> 73887675319 instructions:u
>
> 72438432200 cycles:u
> 924298569 idq.ms_uops:u
> 102603799255 uops_dispatched.thread:u
>
> after:
> 73888371724 instructions:u
>
> 72386986612 cycles:u
> 802744775 idq.ms_uops:u
> 102096987220 uops_dispatched.thread:u
>
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
>
> * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> mode is not wider than UNITS_PER_WORD.
>
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> const char *fmt;
> int total;
> int factor;
> + unsigned mode_size;
>
> if (x == 0)
> return 0;
>
> - if (GET_MODE (x) != VOIDmode)
> + if (GET_CODE (x) == SET)
> + /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> + the mode for the factor. */
> + mode = GET_MODE (SET_DEST (x));
> + else if (GET_MODE (x) != VOIDmode)
> mode = GET_MODE (x);
>
> + mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
> /* A size N times larger than UNITS_PER_WORD likely needs N times as
> many insns, taking N times as long. */
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> + factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>
> /* Compute the default costs of certain things.
> Note that targetm.rtx_costs can override the defaults. */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> /* Used in combine.c as a marker. */
> total = 0;
> break;
> - case SET:
> - /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> - the mode for the factor. */
> - mode = GET_MODE (SET_DEST (x));
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> - /* FALLTHRU */
> default:
> total = factor * COSTS_N_INSNS (1);
> }
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PING^3][PATCH] optimize costly division in rtx_cost
2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
2020-01-10 9:47 ` [PING][PATCH] " Alexander Monakov
2020-01-20 9:37 ` [PATCH] " Alexander Monakov
@ 2020-02-13 13:46 ` Alexander Monakov
2020-02-13 13:55 ` Richard Biener
2 siblings, 1 reply; 5+ messages in thread
From: Alexander Monakov @ 2020-02-13 13:46 UTC (permalink / raw)
To: gcc-patches
Ping^3.
On Sun, 5 Jan 2020, Alexander Monakov wrote:
> Hi,
>
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
>
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a division.
>
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
>
> Bootstrapped on x86_64, ok for trunk?
>
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
>
> before:
> 73887675319 instructions:u
>
> 72438432200 cycles:u
> 924298569 idq.ms_uops:u
> 102603799255 uops_dispatched.thread:u
>
> after:
> 73888371724 instructions:u
>
> 72386986612 cycles:u
> 802744775 idq.ms_uops:u
> 102096987220 uops_dispatched.thread:u
>
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> so the unneeded division is responsible for a good fraction of them)
>
> * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> mode is not wider than UNITS_PER_WORD.
>
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> const char *fmt;
> int total;
> int factor;
> + unsigned mode_size;
>
> if (x == 0)
> return 0;
>
> - if (GET_MODE (x) != VOIDmode)
> + if (GET_CODE (x) == SET)
> + /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> + the mode for the factor. */
> + mode = GET_MODE (SET_DEST (x));
> + else if (GET_MODE (x) != VOIDmode)
> mode = GET_MODE (x);
>
> + mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
> /* A size N times larger than UNITS_PER_WORD likely needs N times as
> many insns, taking N times as long. */
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> + factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>
> /* Compute the default costs of certain things.
> Note that targetm.rtx_costs can override the defaults. */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> /* Used in combine.c as a marker. */
> total = 0;
> break;
> - case SET:
> - /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> - the mode for the factor. */
> - mode = GET_MODE (SET_DEST (x));
> - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> - if (factor == 0)
> - factor = 1;
> - /* FALLTHRU */
> default:
> total = factor * COSTS_N_INSNS (1);
> }
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PING^3][PATCH] optimize costly division in rtx_cost
2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
@ 2020-02-13 13:55 ` Richard Biener
0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2020-02-13 13:55 UTC (permalink / raw)
To: Alexander Monakov; +Cc: GCC Patches
On Thu, Feb 13, 2020 at 2:46 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> Ping^3.
OK.
> On Sun, 5 Jan 2020, Alexander Monakov wrote:
>
> > Hi,
> >
> > I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
> > any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
> > It's also evident that rtx_cost does redundant work for a SET rtx argument.
> >
> > Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> > cases it can be computed with a well-predictable branch rather than a division.
> >
> > This patch makes rtx_cost do the division only in case mode is wider than
> > UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> > No functional change.
> >
> > Bootstrapped on x86_64, ok for trunk?
> >
> > To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
> >
> > before:
> > 73887675319 instructions:u
> >
> > 72438432200 cycles:u
> > 924298569 idq.ms_uops:u
> > 102603799255 uops_dispatched.thread:u
> >
> > after:
> > 73888371724 instructions:u
> >
> > 72386986612 cycles:u
> > 802744775 idq.ms_uops:u
> > 102096987220 uops_dispatched.thread:u
> >
> > (this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
> > so the unneeded division is responsible for a good fraction of them)
> >
> > * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
> > mode is not wider than UNITS_PER_WORD.
> >
> > diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> > index 9a7afccefb8..c7ab86e228b 100644
> > --- a/gcc/rtlanal.c
> > +++ b/gcc/rtlanal.c
> > @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> > const char *fmt;
> > int total;
> > int factor;
> > + unsigned mode_size;
> >
> > if (x == 0)
> > return 0;
> >
> > - if (GET_MODE (x) != VOIDmode)
> > + if (GET_CODE (x) == SET)
> > + /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> > + the mode for the factor. */
> > + mode = GET_MODE (SET_DEST (x));
> > + else if (GET_MODE (x) != VOIDmode)
> > mode = GET_MODE (x);
> >
> > + mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> > +
> > /* A size N times larger than UNITS_PER_WORD likely needs N times as
> > many insns, taking N times as long. */
> > - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> > - if (factor == 0)
> > - factor = 1;
> > + factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
> >
> > /* Compute the default costs of certain things.
> > Note that targetm.rtx_costs can override the defaults. */
> > @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code outer_code,
> > /* Used in combine.c as a marker. */
> > total = 0;
> > break;
> > - case SET:
> > - /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> > - the mode for the factor. */
> > - mode = GET_MODE (SET_DEST (x));
> > - factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> > - if (factor == 0)
> > - factor = 1;
> > - /* FALLTHRU */
> > default:
> > total = factor * COSTS_N_INSNS (1);
> > }
> >
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-02-13 13:55 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-05 18:39 [PATCH] optimize costly division in rtx_cost Alexander Monakov
2020-01-10 9:47 ` [PING][PATCH] " Alexander Monakov
2020-01-20 9:37 ` [PATCH] " Alexander Monakov
2020-02-13 13:46 ` [PING^3][PATCH] " Alexander Monakov
2020-02-13 13:55 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).