public inbox for gcc-patches@gcc.gnu.org
* [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
@ 2021-08-19 15:59 Roger Sayle
  0 siblings, 0 replies; 7+ messages in thread
From: Roger Sayle @ 2021-08-19 15:59 UTC (permalink / raw)
  To: 'GCC Patches'


Back in June I briefly mentioned in one of my gcc-patches posts that
a change that should always have reduced code size would mysteriously,
on occasion, result in slightly larger code (according to CSiBE):
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html

Investigating further, the cause turns out to be that x86_64's
scalar-to-vector (stv) pass is relying on poor estimates of the size
costs/benefits.  This patch tweaks the backend's compute_convert_gain
method to provide slightly more accurate values when compiling with
-Os. Compilation without -Os is (should be) unaffected.  And for
completeness, I'll mention that the stv pass is a net win for code
size so it's much better to improve its heuristics than simply gate
the pass on !optimize_for_size.

The net effect of this change is to save 1399 bytes on the CSiBE
code size benchmark when compiling with -Os.

This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
and "make -k check" with no new failures.

Ok for mainline?


2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/i386-features.c (compute_convert_gain): Provide
	more accurate values for CONST_INT, when optimizing for size.
	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
	* config/i386/i386.h (COSTS_N_BYTES): to here.

Roger
--



^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
  2021-08-23 13:47     ` Richard Biener
@ 2021-08-24  2:08       ` Roger Sayle
  0 siblings, 0 replies; 7+ messages in thread
From: Roger Sayle @ 2021-08-24  2:08 UTC (permalink / raw)
  To: 'Richard Biener'; +Cc: 'GCC Patches'

[-- Attachment #1: Type: text/plain, Size: 2545 bytes --]


Many thanks.  Here’s the version that I've committed, with a ??? comment as
requested (and even a no-op else clause to make the logic easier to understand).

2021-08-24  Roger Sayle  <roger@nextmovesoftware.com>
	    Richard Biener  <rguenther@suse.de>

gcc/ChangeLog
	* config/i386/i386-features.c (compute_convert_gain): Provide
	more accurate values for CONST_INT, when optimizing for size.
	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
	* config/i386/i386.h (COSTS_N_BYTES): to here.

Cheers,
Roger
--

-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com> 
Sent: 23 August 2021 14:47
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Fri, Aug 20, 2021 at 9:55 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> Hi Richard,
>
> Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32 saves 2432 bytes.
> Of the 893 tests, 34 have size differences, 30 are improvements, 4 are regressions (of a few bytes).
>
> > Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?
> > For SSE it would be a constant pool load.
>
> The code size regression  I primarily wanted to tackle was the zero 
> vs. non-zero case when dealing with immediate operands, which was the 
> piece affected by my and Jakub's xor improvements.
>
> Alas my first attempt to specify a non-zero gain in the default 
> (doesn't fit in SImode) case, increased the code size slightly.  The 
> use of the constant pool complicates things, as the number of times 
> the same value is used becomes an issue.  If the constant being loaded 
> is unique, then clearly the increase in constant pool size should 
> (ideally) be taken into account.  But if the same constant is used 
> multiple times in a chain (or is already in the constant pool), the 
> observed cost is much cheaper.  Empirically, a value of zero isn't a 
> poor choice, so the decision on whether to use vector instructions is shifted to the gains from operations being performed, rather than the loading of integer constants.  No doubt, like rtx_costs, these are free parameters that future generations will continue to tweak and refine.
>
> Given that this patch reduces code size with -Os, both with and without -m32, ok for mainline?

OK if you add a comment for the missing 'else'.

Thanks,
Richard.

> Thanks in advance,
> Roger
> --


[-- Attachment #2: patchz2.txt --]
[-- Type: text/plain, Size: 2848 bytes --]

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index d9c6652..5a99ea7 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -610,12 +610,40 @@ general_scalar_chain::compute_convert_gain ()
 
 	  case CONST_INT:
 	    if (REG_P (dst))
-	      /* DImode can be immediate for TARGET_64BIT and SImode always.  */
-	      igain += m * COSTS_N_INSNS (1);
+	      {
+		if (optimize_insn_for_size_p ())
+		  {
+		    /* xor (2 bytes) vs. xorps (3 bytes).  */
+		    if (src == const0_rtx)
+		      igain -= COSTS_N_BYTES (1);
+		    /* movdi_internal vs. movv2di_internal.  */
+		    /* => mov (5 bytes) vs. movaps (7 bytes).  */
+		    else if (x86_64_immediate_operand (src, SImode))
+		      igain -= COSTS_N_BYTES (2);
+		    else
+		      /* ??? Larger immediate constants are placed in the
+			 constant pool, where the size benefit/impact of
+			 STV conversion is affected by whether and how
+			 often each constant pool entry is shared/reused.
+			 The value below is empirically derived from the
+			 CSiBE benchmark (and the optimal value may drift
+			 over time).  */
+		      igain += COSTS_N_BYTES (0);
+		  }
+		else
+		  {
+		    /* DImode can be immediate for TARGET_64BIT
+		       and SImode always.  */
+		    igain += m * COSTS_N_INSNS (1);
+		    igain -= vector_const_cost (src);
+		  }
+	      }
 	    else if (MEM_P (dst))
-	      igain += (m * ix86_cost->int_store[2]
-			- ix86_cost->sse_store[sse_cost_idx]);
-	    igain -= vector_const_cost (src);
+	      {
+		igain += (m * ix86_cost->int_store[2]
+			  - ix86_cost->sse_store[sse_cost_idx]);
+		igain -= vector_const_cost (src);
+	      }
 	    break;
 
 	  default:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4d4ab6a..5abf2a6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -19982,8 +19982,6 @@ ix86_division_cost (const struct processor_costs *cost,
     return cost->divide[MODE_INDEX (mode)];
 }
 
-#define COSTS_N_BYTES(N) ((N) * 2)
-
 /* Return cost of shift in MODE.
    If CONSTANT_OP1 is true, the op1 value is known and set in OP1_VAL.
    AND_IN_OP1 specify in op1 is result of and and SHIFT_AND_TRUNCATE
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 21fe51b..edbfcaf 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -88,6 +88,11 @@ struct stringop_algs
   } size [MAX_STRINGOP_ALGS];
 };
 
+/* Analog of COSTS_N_INSNS when optimizing for size.  */
+#ifndef COSTS_N_BYTES
+#define COSTS_N_BYTES(N) ((N) * 2)
+#endif
+
 /* Define the specific costs for a given cpu.  NB: hard_register is used
    by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
    hard register move costs by register allocator.  Relative costs of

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
  2021-08-20 19:55   ` Roger Sayle
@ 2021-08-23 13:47     ` Richard Biener
  2021-08-24  2:08       ` Roger Sayle
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2021-08-23 13:47 UTC (permalink / raw)
  To: Roger Sayle; +Cc: GCC Patches

On Fri, Aug 20, 2021 at 9:55 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard,
>
> Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32 saves 2432 bytes.
> Of the 893 tests, 34 have size differences, 30 are improvements, 4 are regressions (of a few bytes).
>
> > Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?
> > For SSE it would be a constant pool load.
>
> The code size regression  I primarily wanted to tackle was the zero vs. non-zero case when
> dealing with immediate operands, which was the piece affected by my and Jakub's xor
> improvements.
>
> Alas my first attempt to specify a non-zero gain in the default (doesn't fit in SImode) case,
> increased the code size slightly.  The use of the constant pool complicates things, as the number
> of times the same value is used becomes an issue.  If the constant being loaded is unique, then
> clearly the increase in constant pool size should (ideally) be taken into account.  But if the same
> constant is used multiple times in a chain (or is already in the constant pool), the observed cost
> is much cheaper.  Empirically, a value of zero isn't a poor choice, so the decision on whether to
> use vector instructions is shifted to the gains from operations being performed, rather than the
> loading of integer constants.  No doubt, like rtx_costs, these are free parameters that future
> generations will continue to tweak and refine.
>
> Given that this patch reduces code size with -Os, both with and without -m32, ok for mainline?

OK if you add a comment for the missing 'else'.

Thanks,
Richard.

> Thanks in advance,
> Roger
> --
>
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: 20 August 2021 08:29
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
> On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > Doh!  ENOPATCH.
> >
> > -----Original Message-----
> > From: Roger Sayle <roger@nextmovesoftware.com>
> > Sent: 19 August 2021 16:59
> > To: 'GCC Patches' <gcc-patches@gcc.gnu.org>
> > Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
> >
> >
> > Back in June I briefly mentioned in one of my gcc-patches posts that a
> > change that should have always reduced code size, would mysteriously
> > occasionally result in slightly larger code (according to CSiBE):
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
> >
> > Investigating further, the cause turns out to be that x86_64's
> > scalar-to-vector (stv) pass is relying on poor estimates of the size
> > costs/benefits.  This patch tweaks the backend's compute_convert_gain
> > method to provide slightly more accurate values when compiling with -Os.
> > Compilation without -Os is (should be) unaffected.  And for
> > completeness, I'll mention that the stv pass is a net win for code
> > size so it's much better to improve its heuristics than simply gate
> > the pass on !optimize_for_size.
> >
> > The net effect of this change is to save 1399 bytes on the CSiBE code
> > size benchmark when compiling with -Os.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> > and "make -k check" with no new failures.
> >
> > Ok for mainline?
>
> +                   /* xor (2 bytes) vs. xorps (3 bytes).  */
> +                   if (src == const0_rtx)
> +                     igain -= COSTS_N_BYTES (1);
> +                   /* movdi_internal vs. movv2di_internal.  */
> +                   /* => mov (5 bytes) vs. movaps (7 bytes).  */
> +                   else if (x86_64_immediate_operand (src, SImode))
> +                     igain -= COSTS_N_BYTES (2);
>
> doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the non-SSE cost should be times 'm'?  For const0_rtx we may eventually re-use the zero reg for the high part so that is eventually correct.
>
> Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?  For SSE it would be a constant pool load.
>
> I also wonder, since I now see COSTS_N_BYTES for the first time (heh), whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for the size part as well.
>
> That said, it looks like we're eventually mixing apples and oranges now or even previously?
>
> Thanks,
> Richard.
>
> >
> >
> > 2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-features.c (compute_convert_gain): Provide
> >         more accurate values for CONST_INT, when optimizing for size.
> >         * config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
> >         * config/i386/i386.h (COSTS_N_BYTES): to here.
> >
> > Roger
> > --
> >
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
  2021-08-20  7:28 ` Richard Biener
  2021-08-20 10:20   ` Roger Sayle
@ 2021-08-20 19:55   ` Roger Sayle
  2021-08-23 13:47     ` Richard Biener
  1 sibling, 1 reply; 7+ messages in thread
From: Roger Sayle @ 2021-08-20 19:55 UTC (permalink / raw)
  To: 'Richard Biener'; +Cc: 'GCC Patches'


Hi Richard,

Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32 saves 2432 bytes.
Of the 893 tests, 34 have size differences, 30 are improvements, 4 are regressions (of a few bytes).

> Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?
> For SSE it would be a constant pool load.

The code size regression I primarily wanted to tackle was the zero vs. non-zero case when
dealing with immediate operands, which was the piece affected by my and Jakub's xor
improvements.

Alas, my first attempt to specify a non-zero gain in the default (doesn't fit in SImode) case
increased the code size slightly.  The use of the constant pool complicates things, as the number
of times the same value is used becomes an issue.  If the constant being loaded is unique, then
clearly the increase in constant pool size should (ideally) be taken into account.  But if the same
constant is used multiple times in a chain (or is already in the constant pool), the observed cost
is much cheaper.  Empirically, a value of zero isn't a poor choice, so the decision on whether to
use vector instructions is shifted to the gains from operations being performed, rather than the
loading of integer constants.  No doubt, like rtx_costs, these are free parameters that future
generations will continue to tweak and refine.

Given that this patch reduces code size with -Os, both with and without -m32, ok for mainline?

Thanks in advance,
Roger
--

-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com> 
Sent: 20 August 2021 08:29
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle <roger@nextmovesoftware.com>
> Sent: 19 August 2021 16:59
> To: 'GCC Patches' <gcc-patches@gcc.gnu.org>
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a 
> change that should have always reduced code size, would mysteriously 
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's 
> scalar-to-vector (stv) pass is relying on poor estimates of the size 
> costs/benefits.  This patch tweaks the backend's compute_convert_gain 
> method to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for 
> completeness, I'll mention that the stv pass is a net win for code 
> size so it's much better to improve its heuristics than simply gate 
> the pass on !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code 
> size benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+                   /* xor (2 bytes) vs. xorps (3 bytes).  */
+                   if (src == const0_rtx)
+                     igain -= COSTS_N_BYTES (1);
+                   /* movdi_internal vs. movv2di_internal.  */
+                   /* => mov (5 bytes) vs. movaps (7 bytes).  */
+                   else if (x86_64_immediate_operand (src, SImode))
+                     igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the non-SSE cost should be times 'm'?  For const0_rtx we may eventually re-use the zero reg for the high part so that is eventually correct.

Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh), whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for the size part as well.

That said, it looks like we're eventually mixing apples and oranges now or even previously?

Thanks,
Richard.

>
>
> 2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.c (compute_convert_gain): Provide
>         more accurate values for CONST_INT, when optimizing for size.
>         * config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
>         * config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
  2021-08-20  7:28 ` Richard Biener
@ 2021-08-20 10:20   ` Roger Sayle
  2021-08-20 19:55   ` Roger Sayle
  1 sibling, 0 replies; 7+ messages in thread
From: Roger Sayle @ 2021-08-20 10:20 UTC (permalink / raw)
  To: 'Richard Biener'; +Cc: 'GCC Patches'


Hi Richard,
Fortunately, it's (moderately) safe to mix COSTS_N_INSNS and
COSTS_N_BYTES in the i386 backend.  The average length of an
x86_64 instruction in typical code is between 2 and 3 bytes, so
the definitions of N*4 for COSTS_N_INSNS(N) and N*2 for
COSTS_N_BYTES(N) allow these to be mixed, with no more approximation
error than the more common problem that rtx_costs usually
encodes cycle counts (but converted to units of COSTS_N_INSNS).

The thing that I like about STV's use of a "gain calculator" is that
it allows much more accurate fine-tuning.  Many passes, particularly
combine, base their decisions on computing total cost estimates (with
large approximate values) and then comparing those two totals.
As any physicist will confirm, it's much better to parameterize on
the observed delta between those two large approximate values.

But you're also right that I need to run CSiBE with -m32 and get back
to you with the results.

Roger
--

-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com> 
Sent: 20 August 2021 08:29
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle <roger@nextmovesoftware.com>
> Sent: 19 August 2021 16:59
> To: 'GCC Patches' <gcc-patches@gcc.gnu.org>
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a 
> change that should have always reduced code size, would mysteriously 
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's 
> scalar-to-vector (stv) pass is relying on poor estimates of the size 
> costs/benefits.  This patch tweaks the backend's compute_convert_gain 
> method to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for 
> completeness, I'll mention that the stv pass is a net win for code 
> size so it's much better to improve its heuristics than simply gate 
> the pass on !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code 
> size benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+                   /* xor (2 bytes) vs. xorps (3 bytes).  */
+                   if (src == const0_rtx)
+                     igain -= COSTS_N_BYTES (1);
+                   /* movdi_internal vs. movv2di_internal.  */
+                   /* => mov (5 bytes) vs. movaps (7 bytes).  */
+                   else if (x86_64_immediate_operand (src, SImode))
+                     igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the non-SSE cost should be times 'm'?  For const0_rtx we may eventually re-use the zero reg for the high part so that is eventually correct.

Also I'm missing a 'else' - in the default case there's no cost/benefit of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh), whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for the size part as well.

That said, it looks like we're eventually mixing apples and oranges now or even previously?

Thanks,
Richard.

>
>
> 2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.c (compute_convert_gain): Provide
>         more accurate values for CONST_INT, when optimizing for size.
>         * config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
>         * config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
  2021-08-19 16:00 Roger Sayle
@ 2021-08-20  7:28 ` Richard Biener
  2021-08-20 10:20   ` Roger Sayle
  2021-08-20 19:55   ` Roger Sayle
  0 siblings, 2 replies; 7+ messages in thread
From: Richard Biener @ 2021-08-20  7:28 UTC (permalink / raw)
  To: Roger Sayle; +Cc: GCC Patches

On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle <roger@nextmovesoftware.com>
> Sent: 19 August 2021 16:59
> To: 'GCC Patches' <gcc-patches@gcc.gnu.org>
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a
> change that should have always reduced code size, would mysteriously
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's
> scalar-to-vector (stv) pass is relying on poor estimates of the size
> costs/benefits.  This patch tweaks the backend's compute_convert_gain method
> to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for completeness,
> I'll mention that the stv pass is a net win for code size so it's much
> better to improve its heuristics than simply gate the pass on
> !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code size
> benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+                   /* xor (2 bytes) vs. xorps (3 bytes).  */
+                   if (src == const0_rtx)
+                     igain -= COSTS_N_BYTES (1);
+                   /* movdi_internal vs. movv2di_internal.  */
+                   /* => mov (5 bytes) vs. movaps (7 bytes).  */
+                   else if (x86_64_immediate_operand (src, SImode))
+                     igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus
the non-SSE cost should be times 'm'?  For const0_rtx we may
eventually re-use the zero reg for the high part so that is eventually
correct.

Also I'm missing a 'else' - in the default case there's no cost/benefit
of using SSE vs. GPR regs?  For SSE it would be a constant pool
load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh),
whether with -Os we'd need to replace all COSTS_N_INSNS (1)
scaling with COSTS_N_BYTES scaling?  OTOH costs_add_n_insns
uses COSTS_N_INSNS for the size part as well.

That said, it looks like we're eventually mixing apples and oranges
now or even previously?

Thanks,
Richard.

>
>
> 2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.c (compute_convert_gain): Provide
>         more accurate values for CONST_INT, when optimizing for size.
>         * config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
>         * config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
@ 2021-08-19 16:00 Roger Sayle
  2021-08-20  7:28 ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Sayle @ 2021-08-19 16:00 UTC (permalink / raw)
  To: 'GCC Patches'

[-- Attachment #1: Type: text/plain, Size: 1574 bytes --]


Doh!  ENOPATCH.

-----Original Message-----
From: Roger Sayle <roger@nextmovesoftware.com> 
Sent: 19 August 2021 16:59
To: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.


Back in June I briefly mentioned in one of my gcc-patches posts that a
change that should have always reduced code size, would mysteriously
occasionally result in slightly larger code (according to CSiBE):
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html

Investigating further, the cause turns out to be that x86_64's
scalar-to-vector (stv) pass is relying on poor estimates of the size
costs/benefits.  This patch tweaks the backend's compute_convert_gain method
to provide slightly more accurate values when compiling with -Os.
Compilation without -Os is (should be) unaffected.  And for completeness,
I'll mention that the stv pass is a net win for code size so it's much
better to improve its heuristics than simply gate the pass on
!optimize_for_size.

The net effect of this change is to save 1399 bytes on the CSiBE code size
benchmark when compiling with -Os.

This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
and "make -k check" with no new failures.

Ok for mainline?


2021-08-19  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/i386-features.c (compute_convert_gain): Provide
	more accurate values for CONST_INT, when optimizing for size.
	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
	* config/i386/i386.h (COSTS_N_BYTES): to here.

Roger
--


[-- Attachment #2: patchz.txt --]
[-- Type: text/plain, Size: 2443 bytes --]

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index d9c6652..cdae3dc 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -610,12 +610,31 @@ general_scalar_chain::compute_convert_gain ()
 
 	  case CONST_INT:
 	    if (REG_P (dst))
-	      /* DImode can be immediate for TARGET_64BIT and SImode always.  */
-	      igain += m * COSTS_N_INSNS (1);
+	      {
+		if (optimize_insn_for_size_p ())
+		  {
+		    /* xor (2 bytes) vs. xorps (3 bytes).  */
+		    if (src == const0_rtx)
+		      igain -= COSTS_N_BYTES (1);
+		    /* movdi_internal vs. movv2di_internal.  */
+		    /* => mov (5 bytes) vs. movaps (7 bytes).  */
+		    else if (x86_64_immediate_operand (src, SImode))
+		      igain -= COSTS_N_BYTES (2);
+		  }
+		else
+		  {
+		    /* DImode can be immediate for TARGET_64BIT
+		       and SImode always.  */
+		    igain += m * COSTS_N_INSNS (1);
+		    igain -= vector_const_cost (src);
+		  }
+	      }
 	    else if (MEM_P (dst))
-	      igain += (m * ix86_cost->int_store[2]
-			- ix86_cost->sse_store[sse_cost_idx]);
-	    igain -= vector_const_cost (src);
+	      {
+		igain += (m * ix86_cost->int_store[2]
+			  - ix86_cost->sse_store[sse_cost_idx]);
+		igain -= vector_const_cost (src);
+	      }
 	    break;
 
 	  default:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4d4ab6a..5abf2a6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -19982,8 +19982,6 @@ ix86_division_cost (const struct processor_costs *cost,
     return cost->divide[MODE_INDEX (mode)];
 }
 
-#define COSTS_N_BYTES(N) ((N) * 2)
-
 /* Return cost of shift in MODE.
    If CONSTANT_OP1 is true, the op1 value is known and set in OP1_VAL.
    AND_IN_OP1 specify in op1 is result of and and SHIFT_AND_TRUNCATE
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 21fe51b..edbfcaf 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -88,6 +88,11 @@ struct stringop_algs
   } size [MAX_STRINGOP_ALGS];
 };
 
+/* Analog of COSTS_N_INSNS when optimizing for size.  */
+#ifndef COSTS_N_BYTES
+#define COSTS_N_BYTES(N) ((N) * 2)
+#endif
+
 /* Define the specific costs for a given cpu.  NB: hard_register is used
    by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
    hard register move costs by register allocator.  Relative costs of

^ permalink raw reply	[flat|nested] 7+ messages in thread

Thread overview: 7+ messages
2021-08-19 15:59 [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass Roger Sayle
2021-08-19 16:00 Roger Sayle
2021-08-20  7:28 ` Richard Biener
2021-08-20 10:20   ` Roger Sayle
2021-08-20 19:55   ` Roger Sayle
2021-08-23 13:47     ` Richard Biener
2021-08-24  2:08       ` Roger Sayle
