public inbox for gcc-patches@gcc.gnu.org
* [PATCH, i386]: Fix PR target/13958: Conversion from unsigned to double is painfully slow on P4
@ 2008-03-21 20:52 Uros Bizjak
  2008-03-22 14:51 ` Richard Guenther
  0 siblings, 1 reply; 2+ messages in thread
From: Uros Bizjak @ 2008-03-21 20:52 UTC (permalink / raw)
  To: GCC Patches; +Cc: H. J. Lu

[-- Attachment #1: Type: text/plain, Size: 2552 bytes --]

Hello!

Due to the store-forwarding penalty (this is what a partial memory access
stall is called nowadays), the code from the PR runs "painfully slow":

--cut here--
unsigned a[2]={1,2};

inline unsigned foo1(int i) { return a[i]; }

int main()
{
    double x=0;
    int    i;

    for ( i=0; i<100000000; ++i )
        x+=foo1(i%2);

    return (int)x;
}
--cut here--

The inner loop is compiled (-O2 -march=pentium4 -malign-double) to:

.L4:
        movl    %ecx, %eax
        andl    $1, %eax
        movl    a(,%eax,4), %eax
        xorl    %edx, %edx
(*)    pushl   %edx
(*)    pushl   %eax
(*)    fildll  (%esp)
        addl    $8, %esp
        faddp   %st, %st(1)
        addl    $1, %ecx
        cmpl    $100000000, %ecx
        jne     .L4

Instructions marked with (*) form partial memory access.

Runtime:

time ./a.out

real    0m0.794s
user    0m0.724s
sys     0m0.000s

Patched gcc creates:

.L4:
        movl    %edx, %eax
        andl    $1, %eax
        movd    a(,%eax,4), %xmm0
        movq    %xmm0, -16(%ebp)
        fildll  -16(%ebp)
        faddp   %st, %st(1)
        addl    $1, %edx
        cmpl    $100000000, %edx
        jne     .L4

time ./a.out

real    0m0.123s
user    0m0.124s
sys     0m0.000s

This is more than a 5.8x speedup (user time) for a case the PR describes as:

--quote--

Btw, such conversions are quite common in numerical codes that deal
with uniform grids: the array index can be used as a coordinate (usually
after some trivial scaling). Given that the indices used in libstdc++
are usually of the type size_t the slow conversion can have quite a
negative performance impact.

--unquote--

I guess such a speedup comes in quite handy. This code prefers DImode 
values aligned to 8, since we are dealing with real DImode values. H.J., 
should we align DImode values to 8 for TARGET_MMX/TARGET_SSE?

2008-03-21  Uros Bizjak  <ubizjak@gmail.com>

        PR target/13958
        * config/i386/i386.md ("*floatunssi<mode>2_1"): New pattern with
        corresponding post-reload splitters.
        ("floatunssi<mode>2"): Expand to unsigned_float x87 insn pattern
        when x87 FP math is selected.
        * config/i386/i386-protos.h (ix86_expand_convert_uns_sixf_sse):
        New function prototype.
        * config/i386/i386.c (ix86_expand_convert_uns_sixf_sse): New
        unreachable function to ease macroization of insn patterns.

The patch was bootstrapped and regression tested on x86_64-pc-linux-gnu 
{,-m32} and is committed to SVN.

RMs, do we want this patch in 4.3.1, even though it isn't strictly a 
regression fix?

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 4043 bytes --]

Index: i386.md
===================================================================
--- i386.md	(revision 133430)
+++ i386.md	(working copy)
@@ -5313,13 +5313,76 @@
   DONE;
 })
 
+;; Avoid store forwarding (partial memory) stall penalty by extending
+;; SImode value to DImode through XMM register instead of pushing two
+;; SImode values to stack. Note that even !TARGET_INTER_UNIT_MOVES
+;; targets benefit from this optimization. Also note that fild
+;; loads from memory only.
+
+(define_insn "*floatunssi<mode>2_1"
+  [(set (match_operand:X87MODEF 0 "register_operand" "=f,f")
+	(unsigned_float:X87MODEF
+	  (match_operand:SI 1 "nonimmediate_operand" "x,m")))
+   (clobber (match_operand:DI 2 "memory_operand" "=m,m"))
+   (clobber (match_scratch:SI 3 "=X,x"))]
+  "!TARGET_64BIT
+   && TARGET_80387 && TARGET_SSE"
+  "#"
+  [(set_attr "type" "multi")
+   (set_attr "mode" "<MODE>")])
+
+(define_split
+  [(set (match_operand:X87MODEF 0 "register_operand" "")
+	(unsigned_float:X87MODEF
+	  (match_operand:SI 1 "register_operand" "")))
+   (clobber (match_operand:DI 2 "memory_operand" ""))
+   (clobber (match_scratch:SI 3 ""))]
+  "!TARGET_64BIT
+   && TARGET_80387 && TARGET_SSE
+   && reload_completed"
+  [(set (match_dup 2) (match_dup 1))
+   (set (match_dup 0)
+	(float:X87MODEF (match_dup 2)))]
+  "operands[1] = simplify_gen_subreg (DImode, operands[1], SImode, 0);")
+
+(define_split
+  [(set (match_operand:X87MODEF 0 "register_operand" "")
+	(unsigned_float:X87MODEF
+	  (match_operand:SI 1 "memory_operand" "")))
+   (clobber (match_operand:DI 2 "memory_operand" ""))
+   (clobber (match_scratch:SI 3 ""))]
+  "!TARGET_64BIT
+   && TARGET_80387 && TARGET_SSE
+   && reload_completed"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(float:X87MODEF (match_dup 2)))]
+{
+  emit_move_insn (operands[3], operands[1]);
+  operands[3] = simplify_gen_subreg (DImode, operands[3], SImode, 0);
+})
+
 (define_expand "floatunssi<mode>2"
-  [(use (match_operand:MODEF 0 "register_operand" ""))
-   (use (match_operand:SI 1 "nonimmediate_operand" ""))]
-  "!TARGET_64BIT && SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH"
+  [(parallel
+     [(set (match_operand:X87MODEF 0 "register_operand" "")
+	   (unsigned_float:X87MODEF
+	     (match_operand:SI 1 "nonimmediate_operand" "")))
+      (clobber (match_dup 2))
+      (clobber (match_scratch:SI 3 ""))])]
+  "!TARGET_64BIT
+   && ((TARGET_80387 && TARGET_SSE)
+       || (SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH))"
 {
-  ix86_expand_convert_uns_si<mode>_sse (operands[0], operands[1]);
-  DONE;
+  if (SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH)
+    {
+      ix86_expand_convert_uns_si<mode>_sse (operands[0], operands[1]);
+      DONE;
+    }
+  else
+    {
+      int slot = virtuals_instantiated ? SLOT_TEMP : SLOT_VIRTUAL;
+      operands[2] = assign_386_stack_local (DImode, slot);
+    }
 })
 
 (define_expand "floatunsdisf2"
Index: i386-protos.h
===================================================================
--- i386-protos.h	(revision 133430)
+++ i386-protos.h	(working copy)
@@ -91,6 +91,7 @@
 extern rtx ix86_build_const_vector (enum machine_mode, bool, rtx);
 extern void ix86_split_convert_uns_si_sse (rtx[]);
 extern void ix86_expand_convert_uns_didf_sse (rtx, rtx);
+extern void ix86_expand_convert_uns_sixf_sse (rtx, rtx);
 extern void ix86_expand_convert_uns_sidf_sse (rtx, rtx);
 extern void ix86_expand_convert_uns_sisf_sse (rtx, rtx);
 extern void ix86_expand_convert_sign_didf_sse (rtx, rtx);
Index: i386.c
===================================================================
--- i386.c	(revision 133430)
+++ i386.c	(working copy)
@@ -10903,6 +10903,14 @@
   ix86_expand_vector_extract (false, target, fp_xmm, 0);
 }
 
+/* Not used, but eases macroization of patterns.  */
+void
+ix86_expand_convert_uns_sixf_sse (rtx target ATTRIBUTE_UNUSED,
+				  rtx input ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Convert an unsigned SImode value into a DFmode.  Only currently used
    for SSE, but applicable anywhere.  */
 


* Re: [PATCH, i386]: Fix PR target/13958: Conversion from unsigned to double is painfully slow on P4
  2008-03-21 20:52 [PATCH, i386]: Fix PR target/13958: Conversion from unsigned to double is painfully slow on P4 Uros Bizjak
@ 2008-03-22 14:51 ` Richard Guenther
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Guenther @ 2008-03-22 14:51 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: GCC Patches, H. J. Lu

On Fri, Mar 21, 2008 at 8:26 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> [original mail quoted in full; trimmed]
>
>  RMs, Do we want this patch in 4.3.1, although it isn't strictly a
>  regression?

Does this only affect P4 as the PR states?  Does this have a measurable
positive impact on SPEC?

Otherwise in general no, not without overwhelming benefit.

Thanks,
Richard.

