public inbox for gcc-patches@gcc.gnu.org
* [x86 PATCH] Fine tune STV register conversion costs for -Os.
@ 2023-10-23 14:47 Roger Sayle
  2023-10-24 11:39 ` Uros Bizjak
  0 siblings, 1 reply; 2+ messages in thread
From: Roger Sayle @ 2023-10-23 14:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Uros Bizjak'



The eagle-eyed may have spotted that my recent testcases for DImode shifts
on x86_64 included -mno-stv in the dg-options.  This is because the
Scalar-To-Vector (STV) pass currently transforms these shifts to use
SSE vector operations, producing larger code even with -Os.  The issue
is that compute_convert_gain currently underestimates the size of the
instructions required for inter-unit moves, which is corrected by the
patch below.

For the simple test case:

unsigned long long shl1(unsigned long long x) { return x << 1; }

without this patch, GCC -m32 -Os -mavx2 currently generates:

shl1:   push   %ebp              // 1 byte
        mov    %esp,%ebp         // 2 bytes
        vmovq  0x8(%ebp),%xmm0   // 5 bytes
        pop    %ebp              // 1 byte
        vpaddq %xmm0,%xmm0,%xmm0 // 4 bytes
        vmovd  %xmm0,%eax        // 4 bytes
        vpextrd $0x1,%xmm0,%edx  // 6 bytes
        ret                      // 1 byte  = 24 bytes total

with this patch, we now generate the shorter:

shl1:   push   %ebp             // 1 byte
        mov    %esp,%ebp        // 2 bytes
        mov    0x8(%ebp),%eax   // 3 bytes
        mov    0xc(%ebp),%edx   // 3 bytes
        pop    %ebp             // 1 byte
        add    %eax,%eax        // 2 bytes
        adc    %edx,%edx        // 2 bytes
        ret                     // 1 byte  = 15 bytes total

Benchmarking using CSiBE shows that this patch saves 1361 bytes
when compiling with -m32 -Os, and 172 bytes when compiling
with -Os.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-23  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-features.cc (compute_convert_gain): Provide
        more accurate values (sizes) for inter-unit moves with -Os.


Thanks in advance,
Roger
--


[-- Attachment #2: patchoz.txt --]

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index cead397..6fac67e 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -752,11 +752,33 @@ general_scalar_chain::compute_convert_gain ()
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
   /* Cost the integer to sse and sse to integer moves.  */
-  cost += n_sse_to_integer * ix86_cost->sse_to_integer;
-  /* ???  integer_to_sse but we only have that in the RA cost table.
-     Assume sse_to_integer/integer_to_sse are the same which they
-     are at the moment.  */
-  cost += n_integer_to_sse * ix86_cost->sse_to_integer;
+  if (!optimize_function_for_size_p (cfun))
+    {
+      cost += n_sse_to_integer * ix86_cost->sse_to_integer;
+      /* ???  integer_to_sse but we only have that in the RA cost table.
+              Assume sse_to_integer/integer_to_sse are the same which they
+	      are at the moment.  */
+      cost += n_integer_to_sse * ix86_cost->sse_to_integer;
+    }
+  else if (TARGET_64BIT || smode == SImode)
+    {
+      cost += n_sse_to_integer * COSTS_N_BYTES (4);
+      cost += n_integer_to_sse * COSTS_N_BYTES (4);
+    }
+  else if (TARGET_SSE4_1)
+    {
+      /* vmovd (4 bytes) + vpextrd (6 bytes).  */
+      cost += n_sse_to_integer * COSTS_N_BYTES (10);
+      /* vmovd (4 bytes) + vpinsrd (6 bytes).  */
+      cost += n_integer_to_sse * COSTS_N_BYTES (10);
+    }
+  else
+    {
+      /* movd (4 bytes) + psrlq (5 bytes) + movd (4 bytes).  */
+      cost += n_sse_to_integer * COSTS_N_BYTES (13);
+      /* movd (4 bytes) + movd (4 bytes) + unpckldq (4 bytes).  */
+      cost += n_integer_to_sse * COSTS_N_BYTES (12);
+    }
 
   if (dump_file)
     fprintf (dump_file, "  Registers conversion cost: %d\n", cost);
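For anyone checking the arithmetic: the i386 backend's COSTS_N_BYTES
macro scales raw byte counts (it is defined as N * 2 at the moment), so
the instruction sizes in the comments above translate directly into -Os
costs.  An illustrative sketch using the shl1 byte counts from the mail
(this is not code from GCC, just the arithmetic):

```python
def costs_n_bytes(n):
    # Mirrors the i386 COSTS_N_BYTES macro, assumed here to be N * 2.
    return n * 2

# Byte sizes taken from the shl1 disassembly quoted above.
vector_bytes = 5 + 4 + 4 + 6   # vmovq + vpaddq + vmovd + vpextrd = 19
scalar_bytes = 3 + 3 + 2 + 2   # mov + mov + add + adc = 10

# With -Os the patch now charges the DImode SSE->integer move
# (vmovd + vpextrd) at its real 10-byte size, so the conversion
# gain for this chain comes out negative and STV leaves it alone.
gain = costs_n_bytes(scalar_bytes) - costs_n_bytes(vector_bytes)
print(gain)
```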


* Re: [x86 PATCH] Fine tune STV register conversion costs for -Os.
  2023-10-23 14:47 [x86 PATCH] Fine tune STV register conversion costs for -Os Roger Sayle
@ 2023-10-24 11:39 ` Uros Bizjak
  0 siblings, 0 replies; 2+ messages in thread
From: Uros Bizjak @ 2023-10-24 11:39 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches

On Mon, Oct 23, 2023 at 4:47 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [...]
>
> 2023-10-23  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.cc (compute_convert_gain): Provide
>         more accurate values (sizes) for inter-unit moves with -Os.

LGTM.

Thanks,
Uros.


