public inbox for gcc-patches@gcc.gnu.org
* [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-10-13  9:33 UTC
  To: gcc-patches; +Cc: ubizjak

[-- Attachment #1: Type: TEXT/PLAIN, Size: 573 bytes --]

Hello,

this patch provides an alternate pattern to let combine recognize scalar 
operations that preserve the high part of a vector. If the strategy is all 
right, I could do the same for more operations (mul, div, ...). Something 
similar is also possible for V4SF (different pattern though), but probably 
not as useful.

bootstrap+testsuite ok.

2012-10-13  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855

gcc/
 	* config/i386/sse.md (*sse2_vm<plusminus_insn>v2df3): New define_insn.

gcc/testsuite/
 	* gcc.target/i386/pr54855.c: New testcase.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2294 bytes --]

Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 192420)
+++ config/i386/sse.md	(working copy)
@@ -812,20 +812,38 @@
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
 
+(define_insn "*sse2_vm<plusminus_insn>v2df3"
+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
+	(vec_concat:V2DF
+	  (plusminus:DF
+	    (vec_select:DF 
+	      (match_operand:V2DF 1 "register_operand" "0,x")
+	      (parallel [(const_int 0)]))
+	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
+  "TARGET_SSE2"
+  "@
+   <plusminus_mnemonic>sd\t{%2, %0|%0, %2}
+   v<plusminus_mnemonic>sd\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseadd")
+   (set_attr "prefix" "orig,vex")
+   (set_attr "mode" "DF")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VF 0 "register_operand")
 	(mult:VF
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE"
   "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")
 
 (define_insn "*mul<mode>3"
   [(set (match_operand:VF 0 "register_operand" "=x,x")
Index: testsuite/gcc.target/i386/pr54855.c
===================================================================
--- testsuite/gcc.target/i386/pr54855.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native


* Re: [i386] scalar ops that preserve the high part of a vector
From: Uros Bizjak @ 2012-10-14  9:54 UTC
  To: Marc Glisse; +Cc: gcc-patches

On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Hello,
>
> this patch provides an alternate pattern to let combine recognize scalar
> operations that preserve the high part of a vector. If the strategy is all
> right, I could do the same for more operations (mul, div, ...). Something
> similar is also possible for V4SF (different pattern though), but probably
> not as useful.

But we _do_ have a vec_merge pattern that describes the operation.
Adding another one to each operation just to satisfy combine is IMO
not the correct approach. I'd rather see a generic RTX simplification
that simplifies your proposed pattern to the vec_merge pattern. Also, as
you mention in PR54855, Comment #5, the approach is too fragile...

Uros.

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-10-14 12:52 UTC
  To: Uros Bizjak; +Cc: gcc-patches

On Sun, 14 Oct 2012, Uros Bizjak wrote:

> On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> Hello,
>>
>> this patch provides an alternate pattern to let combine recognize scalar
>> operations that preserve the high part of a vector. If the strategy is all
>> right, I could do the same for more operations (mul, div, ...). Something
>> similar is also possible for V4SF (different pattern though), but probably
>> not as useful.
>
> But we _do_ have a vec_merge pattern that describes the operation.
> Adding another one to each operation just to satisfy combine is IMO
> not the correct approach.

At some point I wondered about _replacing_ the existing pattern, so there 
would only be one ;-)

The vec_merge pattern takes as argument 2 vectors instead of a vector and 
a scalar, and describes the operation as a vector operation where we drop 
half of the result, instead of a scalar operation where we re-add the top 
half of the vector. I don't know if that's the most convenient choice. 
Adding code in simplify-rtx to replace vec_merge with vec_concat / 
vec_select might be easier than the other way around.
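As a concrete, illustration-only C rendering of the two descriptions
(my sketch, using GCC's vector_size extension, not part of the patch):

/* vec_merge view: a full vector operation, keeping lane 0 of the
   result and lane 1 of the first operand.  */
typedef double v2df __attribute__((vector_size(16)));

v2df addsd_vec_merge (v2df a, v2df b)
{
  v2df full = a + b;                 /* (plus:V2DF a b) */
  return (v2df){ full[0], a[1] };    /* (vec_merge full a (const_int 1)) */
}

/* vec_concat view: a scalar operation on lane 0, with the untouched
   high lane re-attached.  */
v2df addsd_vec_concat (v2df a, double b0)
{
  return (v2df){ a[0] + b0, a[1] };  /* (vec_concat (plus a[0] b0) a[1]) */
}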


If the middle-end somehow gave us:
(plus X (vec_concat Y 0))
it would seem a bit strange to add an optimization that turns it into:
(vec_merge (plus X (subreg:V2DF Y)) X 1)
but then producing:
(vec_concat (plus (vec_select X 0) Y) (vec_select X 1))
would be strange as well.
(ignoring the signed zero issues here)

> I'd rather see a generic RTX simplification that
> simplifies your proposed pattern to the vec_merge pattern.

Ok, I'll see what I can do.

> Also, as you mention in PR54855, Comment #5, the approach is too 
> fragile...

I am not sure I can make the RTX simplification much less fragile... 
Whenever I see (vec_concat X (vec_select Y 1)), I would have to check 
whether X is some (possibly large) tree of scalar computations involving 
Y[0], move it all to vec_merge computations, and fix other users of some 
of those scalars to now use S[0]. That seems too hard, so I would stop at 
a single-operation X that is used only once. Besides, the gain is larger 
in proportion when there is a single operation :-)
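To make the single-operation restriction concrete (illustration only,
the second function and the "sink" variable are mine):

typedef double vec __attribute__((vector_size(16)));

/* A single scalar operation on lane 0, used once: the case worth
   catching, and the one the proposed pattern handles.  */
vec one_op (vec x)
{
  x[0] += 2;
  return x;
}

/* A tree of scalar operations on lane 0: rewriting every step (and
   every other user of the intermediate t) into vec_merge form is the
   part that seems too hard.  */
double sink;
vec many_ops (vec x, double a, double b)
{
  double t = x[0] * a;
  sink = t;            /* another user of the intermediate */
  x[0] = t + b;
  return x;
}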

Thank you for your comments,

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-11-30 12:36 UTC
  To: Uros Bizjak; +Cc: gcc-patches

On Sun, 14 Oct 2012, Marc Glisse wrote:

> On Sun, 14 Oct 2012, Uros Bizjak wrote:
>
>> On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>>> Hello,
>>> 
>>> this patch provides an alternate pattern to let combine recognize scalar
>>> operations that preserve the high part of a vector. If the strategy is all
>>> right, I could do the same for more operations (mul, div, ...). Something
>>> similar is also possible for V4SF (different pattern though), but probably
>>> not as useful.
>> 
>> But we _do_ have a vec_merge pattern that describes the operation.
>> Adding another one to each operation just to satisfy combine is IMO
>> not the correct approach.
>
> At some point I wondered about _replacing_ the existing pattern, so there 
> would only be one ;-)
>
> The vec_merge pattern takes as argument 2 vectors instead of a vector and a 
> scalar, and describes the operation as a vector operation where we drop half 
> of the result, instead of a scalar operation where we re-add the top half of 
> the vector. I don't know if that's the most convenient choice. Adding code in 
> simplify-rtx to replace vec_merge with vec_concat / vec_select might be 
> easier than the other way around.
>
>
> If the middle-end somehow gave us:
> (plus X (vec_concat Y 0))
> it would seem a bit strange to add an optimization that turns it into:
> (vec_merge (plus X (subreg:V2DF Y)) X 1)
> but then producing:
> (vec_concat (plus (vec_select X 0) Y) (vec_select X 1))
> would be strange as well.
> (ignoring the signed zero issues here)
>
>> I'd rather see a generic RTX simplification that
>> simplifies your proposed pattern to the vec_merge pattern.
>
> Ok, I'll see what I can do.
>
>> Also, as you mention in PR54855, Comment #5, the approach is too fragile...
>
> I am not sure I can make the RTX simplification much less fragile... Whenever
> I see (vec_concat X (vec_select Y 1)), I would have to check whether X is
> some (possibly large) tree of scalar computations involving Y[0], move it all
> to vec_merge computations, and fix other users of some of those scalars to
> now use S[0]. That seems too hard, so I would stop at a single-operation X
> that is used only once. Besides, the gain is larger in proportion when there
> is a single operation :-)
>
> Thank you for your comments,

Hello,

I experimented with the simplify-rtx transformation you suggested, see:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855

It works when the argument is a register, but not for memory (which is 
where the constant is in the testcase). And the description of the 
operation in sse.md does seem problematic. It says the second argument is:

             (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))

but Intel's documentation says "The source operand can be an XMM register 
or a 64-bit memory location", not quite the same.

Do you think the .md description should really stay this way, or could we 
change it to something that better reflects "64-bit memory location"?
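To illustrate what is at stake (a made-up example, not from the
attached testcase): with a scalar DF memory operand the add could fold
a plain 64-bit load, which the VF_128 description cannot express.

/* Compiled with -O -msse2 on x86-64, the hoped-for code is a single
   "addsd c(%rip), %xmm0", the 64-bit load folded into the add.  */
typedef double vec __attribute__((vector_size(16)));

static const double c = 2.0;

vec f (vec x)
{
  x[0] += c;
  return x;
}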

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Uros Bizjak @ 2012-11-30 13:55 UTC
  To: Marc Glisse; +Cc: gcc-patches

On Fri, Nov 30, 2012 at 1:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

> Hello,
>
> I experimented with the simplify-rtx transformation you suggested, see:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855
>
> It works when the argument is a register, but not for memory (which is where
> the constant is in the testcase). And the description of the operation in
> sse.md does seem problematic. It says the second argument is:
>
>             (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
>
> but Intel's documentation says "The source operand can be an XMM register or
> a 64-bit memory location", not quite the same.
>
> Do you think the .md description should really stay this way, or could we
> change it to something that better reflects "64-bit memory location"?

For reference, we are talking about:

(define_insn "<sse>_vm<plusminus_insn><mode>3"
  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
	(vec_merge:VF_128
	  (plusminus:VF_128
	    (match_operand:VF_128 1 "register_operand" "0,x")
	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
	  (match_dup 1)
	  (const_int 1)))]
  "TARGET_SSE"
  "@
   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
  [(set_attr "isa" "noavx,avx")
   (set_attr "type" "sseadd")
   (set_attr "prefix" "orig,vex")
   (set_attr "mode" "<ssescalarmode>")])

No, looking at your description, operand 2 should be a scalar
operand (we use the _s{s,d} scalar instruction here), and for doubles it
should refer to a 64-bit memory location. I don't remember all the
details about vec_merge scalar instructions, but it looks to me that
the canonical representation should be more like your proposal:

+(define_insn "*sse2_vm<plusminus_insn>v2df3"
+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
+    (vec_concat:V2DF
+      (plusminus:DF
+        (vec_select:DF
+          (match_operand:V2DF 1 "register_operand" "0,x")
+          (parallel [(const_int 0)]))
+        (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
+      (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
+  "TARGET_SSE2"

Uros.

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-11-30 22:36 UTC
  To: Uros Bizjak; +Cc: gcc-patches

On Fri, 30 Nov 2012, Uros Bizjak wrote:

> For reference, we are talking about:
>
> (define_insn "<sse>_vm<plusminus_insn><mode>3"
>  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
> 	(vec_merge:VF_128
> 	  (plusminus:VF_128
> 	    (match_operand:VF_128 1 "register_operand" "0,x")
> 	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> 	  (match_dup 1)
> 	  (const_int 1)))]
>  "TARGET_SSE"
>  "@
>   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
>   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
>  [(set_attr "isa" "noavx,avx")
>   (set_attr "type" "sseadd")
>   (set_attr "prefix" "orig,vex")
>   (set_attr "mode" "<ssescalarmode>")])
>
> No, looking at your description, operand 2 should be a scalar
> operand (we use the _s{s,d} scalar instruction here), and for doubles it
> should refer to a 64-bit memory location. I don't remember all the
> details about vec_merge scalar instructions, but it looks to me that
> the canonical representation should be more like your proposal:
>
> +(define_insn "*sse2_vm<plusminus_insn>v2df3"
> +  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
> +    (vec_concat:V2DF
> +      (plusminus:DF
> +        (vec_select:DF
> +          (match_operand:V2DF 1 "register_operand" "0,x")
> +          (parallel [(const_int 0)]))
> +        (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
> +      (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
> +  "TARGET_SSE2"

Thank you.

Among the following possible patterns, my choice (if nobody objects) is to 
use 4) for V2DF and 3) (rewritten without iterators) for V4SF. The 
question is then what should be done about the builtins and intrinsics. 
_mm_add_sd takes two __m128d. If I change the signature of 
__builtin_ia32_addsd, I can make _mm_add_sd pass __B[0] as the second 
argument, but I don't know if I am allowed to change that signature. 
Otherwise I guess I'll need to keep a separate expander for it (I'd rather 
not). And then there are several operations other than +/- to handle.
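Concretely, the adjusted emmintrin.h wrapper would be something like
the following sketch (assuming the builtin's second argument becomes a
plain double):

extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  /* Pass lane 0 of __B, so the builtin takes (v2df, double).  */
  return (__m128d) __builtin_ia32_addsd ((__v2df)__A, __B[0]);
}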


1) Current pattern:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (plusminus:VF_128
 	    (match_operand:VF_128 1 "register_operand" "0,x")
 	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
 	  (match_dup 1)
 	  (const_int 1)))]

2) Minimal fix:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (plusminus:VF_128
 	    (match_operand:VF_128 1 "register_operand" "0,x")
 	    (vec_duplicate:VF_128
 	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]

3) With the operation in scalar mode:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (vec_duplicate:VF_128
 	    (plusminus:<ssescalarmode>
 	      (vec_select:<ssescalarmode>
 		(match_operand:VF_128 1 "register_operand" "0,x")
 		(parallel [(const_int 0)]))
 	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))))
 	  (match_dup 1)
 	  (const_int 1)))]

4) Special version which only makes sense for vectors of 2 elements:

   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
 	(vec_concat:V2DF
 	  (plusminus:DF
 	    (vec_select:DF
 	      (match_operand:V2DF 1 "register_operand" "0,x")
 	      (parallel [(const_int 0)]))
 	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
 	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-12-01 17:27 UTC
  To: Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 895 bytes --]

Hello,

here is a patch. If it is accepted, I'll extend it to other vm patterns 
(mul, div, min, max are likely candidates, but I need to check the doc). 
It passed bootstrap+testsuite on x86_64-linux.


2012-12-01  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855
gcc/
 	* config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
 	pattern.
 	* config/i386/i386-builtin-types.def: New function types.
 	* config/i386/i386.c (ix86_expand_args_builtin): Likewise.
 	(bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
 	__builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
 	* config/i386/xmmintrin.h: Adapt to new builtin prototype.
 	* config/i386/emmintrin.h: Likewise.
 	* doc/extend.texi (X86 Built-in Functions): Document changed prototype.

testsuite/
 	* gcc.target/i386/pr54855-1.c: New testcase.
 	* gcc.target/i386/pr54855-2.c: New testcase.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 18937 bytes --]

Index: gcc/testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 194017)
+++ gcc/config/i386/i386.c	(working copy)
@@ -27059,22 +27059,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27163,22 +27163,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30790,34 +30790,36 @@ ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 194017)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@ _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 194017)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 194017)
+++ gcc/config/i386/sse.md	(working copy)
@@ -855,36 +855,57 @@
 	  (match_operand:VF 2 "nonimmediate_operand" "xm,xm")))]
   "TARGET_SSE && ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
   "@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
-(define_insn "<sse>_vm<plusminus_insn><mode>3"
-  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
-	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+(define_insn "sse_vm<plusminus_insn>v4sf3"
+  [(set (match_operand:V4SF 0 "register_operand" "=x,x")
+	(vec_merge:V4SF
+	  (vec_duplicate:V4SF
+	    (plusminus:SF
+	      (vec_select:SF
+		(match_operand:V4SF 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:SF 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
-   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
-   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
+   <plusminus_mnemonic>ss\t{%2, %0|%0, %2}
+   v<plusminus_mnemonic>ss\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
-   (set_attr "mode" "<ssescalarmode>")])
+   (set_attr "mode" "SF")])
+
+(define_insn "sse2_vm<plusminus_insn>v2df3"
+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
+	(vec_concat:V2DF
+	  (plusminus:DF
+	    (vec_select:DF 
+	      (match_operand:V2DF 1 "register_operand" "0,x")
+	      (parallel [(const_int 0)]))
+	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
+  "TARGET_SSE2"
+  "@
+   <plusminus_mnemonic>sd\t{%2, %0|%0, %2}
+   v<plusminus_mnemonic>sd\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseadd")
+   (set_attr "prefix" "orig,vex")
+   (set_attr "mode" "DF")])
 
 (define_expand "mul<mode>3"
   [(set (match_operand:VF 0 "register_operand")
 	(mult:VF
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE"
   "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")
 
 (define_insn "*mul<mode>3"
Index: gcc/config/i386/i386-builtin-types.def
===================================================================
--- gcc/config/i386/i386-builtin-types.def	(revision 194017)
+++ gcc/config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 194017)
+++ gcc/doc/extend.texi	(working copy)
@@ -9821,22 +9821,22 @@ int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9942,22 +9942,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)

* Re: [i386] scalar ops that preserve the high part of a vector
From: Uros Bizjak @ 2012-12-02 10:51 UTC
  To: Marc Glisse; +Cc: gcc-patches, H.J. Lu

On Sat, Dec 1, 2012 at 6:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

> here is a patch. If it is accepted, I'll extend it to other vm patterns
> (mul, div, min, max are likely candidates, but I need to check the doc). It
> passed bootstrap+testsuite on x86_64-linux.
>
>
> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR target/54855
> gcc/
>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>         pattern.
>         * config/i386/i386-builtin-types.def: New function types.
>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>         * config/i386/emmintrin.h: Likewise.
>         * doc/extend.texi (X86 Built-in Functions): Document changed
> prototype.
>
> testsuite/
>         * gcc.target/i386/pr54855-1.c: New testcase.
>         * gcc.target/i386/pr54855-2.c: New testcase.

Yes, the approach looks correct to me, but I wonder why we have
different representations for the v4sf and v2df cases. I'd say that we
should canonicalize patterns somewhere in the middle end (probably to
the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
variant); otherwise we will have a pattern explosion.

However, the patch is too late for 4.8, but it is definitely a wanted
generalization and a fix of a (partially) wrong representation.

I have also CCd HJ for his opinion, since the patch touches published headers.

Thanks,
Uros.

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-12-02 12:30 UTC
  To: Uros Bizjak; +Cc: gcc-patches, H.J. Lu

On Sun, 2 Dec 2012, Uros Bizjak wrote:

> On Sat, Dec 1, 2012 at 6:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>> (mul, div, min, max are likely candidates, but I need to check the doc). It
>> passed bootstrap+testsuite on x86_64-linux.
>>
>>
>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>
>>         PR target/54855
>> gcc/
>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>         pattern.
>>         * config/i386/i386-builtin-types.def: New function types.
>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>         * config/i386/emmintrin.h: Likewise.
>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>> prototype.
>>
>> testsuite/
>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>         * gcc.target/i386/pr54855-2.c: New testcase.
>
> Yes, the approach looks correct to me, but I wonder why we have
> different representations for the v4sf and v2df cases. I'd say that we
> should canonicalize patterns somewhere in the middle end (probably to
> the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
> variant); otherwise we will have a pattern explosion.

(I assume s/vec_dup/vec_concat/ above)

Note that this comes from ix86_expand_vector_set, which purposely uses 
VEC_CONCAT for V2DF and VEC_MERGE for V4SF. It is true that we could use 
the VEC_MERGE version more widely, but this code that selects the most 
appropriate pattern depending on the mode seems good to me. And I wouldn't 
call the few extra entries in sse.md an explosion quite yet...

(also, using VEC_DUPLICATE is quite artificial; in the special case where 
we set the first element of the vector, a subreg should work as well)
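To fix ideas, the element set that ix86_expand_vector_set lowers here
is simply the following (illustration only):

/* ix86_expand_vector_set expresses x[0] = s with VEC_CONCAT for V2DF
   and with VEC_MERGE for V4SF.  */
typedef double v2df __attribute__((vector_size(16)));
typedef float  v4sf __attribute__((vector_size(16)));

v2df set0_v2df (v2df x, double s) { x[0] = s; return x; }
v4sf set0_v4sf (v4sf x, float s)  { x[0] = s; return x; }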


> However, the patch is too late for 4.8,

That's fine, I can hold it for 4.9. I'd like to finalize the patch now 
while it is fresh though (I would still redo a quick bootstrap+testsuite 
before commit when trunk re-opens).

Thanks,

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Uros Bizjak @ 2012-12-03  8:53 UTC
  To: Marc Glisse; +Cc: gcc-patches, H.J. Lu

On Sun, Dec 2, 2012 at 1:30 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

>>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>>> (mul, div, min, max are likely candidates, but I need to check the doc).
>>> It
>>> passed bootstrap+testsuite on x86_64-linux.
>>>
>>>
>>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>>
>>>         PR target/54855
>>> gcc/
>>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>>         pattern.
>>>         * config/i386/i386-builtin-types.def: New function types.
>>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>>         * config/i386/emmintrin.h: Likewise.
>>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>>> prototype.
>>>
>>> testsuite/
>>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>>         * gcc.target/i386/pr54855-2.c: New testcase.
>>
>>
>> Yes, the approach looks correct to me, but I wonder why we have
>> different representations for the v4sf and v2df cases. I'd say that we
>> should canonicalize patterns somewhere in the middle end (probably to
>> the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
>> variant); otherwise we will have a pattern explosion.
>
>
> (I assume s/vec_dup/vec_concat/ above)

Ah, yes.

However, looking a bit more into the use cases for these patterns,
they are only used through intrinsics with __m128 operands. While your
proposed patch makes these patterns more general (they can use 64-bit
aligned memory), this is not their usual usage, and for their intended
usage, your proposed improvement complicates these patterns
unnecessarily. Given these facts, I'd say that we should leave these
special patterns alone (since they serve their purpose well) and
instead introduce new patterns for "other" uses.
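For example (my sketch, illustration only), the first function below is
the intrinsic-style use the existing patterns were written for; the
second is the more general use your patch also covers:

#include <emmintrin.h>

typedef double vec __attribute__((vector_size(16)));

/* Intended use of the existing pattern: both operands are __m128d.  */
__m128d intrinsic_use (__m128d a, __m128d b)
{
  return _mm_add_sd (a, b);
}

/* More general use the proposed pattern also covers: the scalar
   addend comes straight from a 64-bit memory location.  */
vec general_use (vec x, const double *p)
{
  x[0] += *p;
  return x;
}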

Uros.

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-12-03 15:34 UTC
  To: Uros Bizjak; +Cc: gcc-patches, H.J. Lu

On Mon, 3 Dec 2012, Uros Bizjak wrote:

> On Sun, Dec 2, 2012 at 1:30 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>>>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>>>> (mul, div, min, max are likely candidates, but I need to check the doc).
>>>> It
>>>> passed bootstrap+testsuite on x86_64-linux.
>>>>
>>>>
>>>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>>>
>>>>         PR target/54855
>>>> gcc/
>>>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>>>         pattern.
>>>>         * config/i386/i386-builtin-types.def: New function types.
>>>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>>>         * config/i386/emmintrin.h: Likewise.
>>>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>>>> prototype.
>>>>
>>>> testsuite/
>>>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>>>         * gcc.target/i386/pr54855-2.c: New testcase.
>>>
>>>
>>> Yes, the approach looks correct to me, but I wonder why we have
>>> different representations for v4sf and v2df cases? I'd say that we
>>> should canonicalize patterns somewhere in the middle end (probably to
>>> vec_merge variant, as IMO vec_dup looks like degenerated vec_merge
>>> variant), otherwise we will have pattern explosion.
>>
>>
>> (I assume s/vec_dup/vec_concat/ above)
>
> Ah, yes.
>
> However, looking a bit more into the use cases for these patterns,
> they are only used through intrinsics with __m128 operands. While your
> proposed patch makes these patterns more general (they can use 64-bit
> aligned memory), this is not their usual usage, and for their intended
> usage, your proposed improvement complicates these patterns
> unnecessarily. Given these facts, I'd say that we should leave these
> special patterns alone (since they serve their purpose well) and
> instead introduce new patterns for "other" uses.

You mean like in the original patch?
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html

(it only had the V2DF version, not the V4SF one)

Funny how we switched sides: now I am the one who would rather have a 
single pattern instead of one for the builtin and one for recog. It 
seems that once we add the new pattern, keeping the old one is a waste of 
maintenance time, and the few extra rtx from the slightly longer pattern 
for these seldom-used builtins should be negligible.

But I don't mind, if that's the version you prefer, I'll update the patch.

Thanks,

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Uros Bizjak @ 2012-12-03 17:55 UTC
  To: Marc Glisse; +Cc: gcc-patches, H.J. Lu

On Mon, Dec 3, 2012 at 4:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

>> However, looking a bit more into the use cases for these patterns,
>> they are only used through intrinsics with __m128 operands. While your
>> proposed patch makes these patterns more general (they can use 64-bit
>> aligned memory), this is not their usual usage, and for their intended
>> usage, your proposed improvement complicates these patterns
>> unnecessarily. Given these facts, I'd say that we should leave these
>> special patterns alone (since they serve their purpose well) and
>> instead introduce new patterns for "other" uses.
>
>
> You mean like in the original patch?
> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html
>
> (it only had the V2DF version, not the V4SF one)
>
> Funny how we switched sides: now I am the one who would rather have a
> single pattern instead of one for the builtin and one for recog. It
> seems that once we add the new pattern, keeping the old one is a waste
> of maintenance time, and the few extra rtx from the slightly longer
> pattern for these seldom-used builtins should be negligible.

Yes, I didn't notice at the time that the intention of the existing
patterns was to implement intrinsics that exclusively use __m128
operands.

> But I don't mind, if that's the version you prefer, I'll update the patch.

Actually, both approaches have their benefits and drawbacks.
Specialized vec_merge patterns can be efficiently macroized, and
support builtins with __m128 operands in a simple and efficient way.
You are proposing patterns that do not macroize well (this is what was
learned from your last patch) and require breaking up existing
macroized patterns.

So, we are actually adding new functionality: operations on an array
of values. IMO, this warrants new patterns, but please find a way for
V2DF and V4SF to macroize in the same way.

Uros.

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-12-04 14:05 UTC
  To: Uros Bizjak; +Cc: gcc-patches, H.J. Lu

On Mon, 3 Dec 2012, Uros Bizjak wrote:

> On Mon, Dec 3, 2012 at 4:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>>> However, looking a bit more into the use cases for these patterns,
>>> they are only used through intrinsics with __m128 operands. While your
>>> proposed patch makes these patterns more general (they can use 64-bit
>>> aligned memory), this is not their usual usage, and for their intended
>>> usage, your proposed improvement complicates these patterns
>>> unnecessarily. Given these facts, I'd say that we should leave these
>>> special patterns alone (since they serve their purpose well) and
>>> instead introduce new patterns for "other" uses.
>>
>>
>> You mean like in the original patch?
>> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html
>>
>> (it only had the V2DF version, not the V4SF one)
>>
>> Funny how we switched sides: now I am the one who would rather have a
>> single pattern instead of one for the builtin and one for recog. It
>> seems that once we add the new pattern, keeping the old one is a waste
>> of maintenance time, and the few extra rtx from the slightly longer
>> pattern for these seldom-used builtins should be negligible.
>
> Yes, I didn't notice at the time that the intention of the existing
> patterns was to implement intrinsics that exclusively use __m128
> operands.
>
>> But I don't mind, if that's the version you prefer, I'll update the patch.
>
> Actually, both approaches have their benefits and drawbacks.
> Specialized vec_merge patterns can be efficiently macroized, and
> support builtins with __m128 operands in a simple and efficient way.
> You are proposing patterns that do not macroize well (this is what was
> learned from your last patch) and require breaking up existing
> macroized patterns.
>
> So, we are actually adding new functionality: operations on an array
> of values. IMO, this warrants new patterns, but please find a way for
> V2DF and V4SF to macroize in the same way.

I am still confused as to what is wanted. If the quantity to minimize is
the number of entries in sse.md, we should replace the existing
vec_merge pattern with this one: it macroizes just as well, it directly
matches for V4SF, and the piece of code needed in simplify-rtx for V2DF
isn't too absurd. (then we need to adjust the builtins as in one of the
previous patches)

[(set (match_operand:VF_128 0 "register_operand" "=x,x")
       (vec_merge:VF_128
 	(vec_duplicate:VF_128
 	  (plusminus:<ssescalarmode>
 	    (vec_select:<ssescalarmode>
 	      (match_operand:VF_128 1 "register_operand" "0,x")
 	      (parallel [(const_int 0)]))
 	    (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))))
       (match_dup 1)
       (const_int 1)))]

Then there is the question (i) of possibly introducing a specialized
version for V2DF (different pattern) instead of adding code to
simplify-rtx.

And finally there is the question (ii) of keeping the old define_insn in
addition to the new one(s), just for the builtins.

My preference is:
(i) specialized pattern for V2DF
(ii) remove

It seems like you might be ok with:
(i) simplify-rtx
(ii) remove

Do you agree?

-- 
Marc Glisse

* Re: [i386] scalar ops that preserve the high part of a vector
From: Marc Glisse @ 2012-12-04 16:28 UTC
  To: Uros Bizjak; +Cc: gcc-patches, H.J. Lu

[-- Attachment #1: Type: TEXT/PLAIN, Size: 931 bytes --]

On Tue, 4 Dec 2012, Marc Glisse wrote:

> Do you agree?

Like this? (only tested on the new testcases; I'd also need to ask Eric
for his opinion)

2012-12-04  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855
gcc/
 	* simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
 	with VEC_MERGE.
 	* config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
 	pattern.
 	* config/i386/i386-builtin-types.def: New function types.
 	* config/i386/i386.c (ix86_expand_args_builtin): Likewise.
 	(bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
 	__builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
 	* config/i386/xmmintrin.h: Adapt to new builtin prototype.
 	* config/i386/emmintrin.h: Likewise.
 	* doc/extend.texi (X86 Built-in Functions): Document changed prototype.


testsuite/
 	* gcc.target/i386/pr54855-1.c: New testcase.
 	* gcc.target/i386/pr54855-2.c: New testcase.


-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 18755 bytes --]

Index: doc/extend.texi
===================================================================
--- doc/extend.texi	(revision 194150)
+++ doc/extend.texi	(working copy)
@@ -9843,22 +9843,22 @@ int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9964,22 +9964,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)
Index: testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: config/i386/xmmintrin.h
===================================================================
--- config/i386/xmmintrin.h	(revision 194150)
+++ config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@ _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: config/i386/emmintrin.h
===================================================================
--- config/i386/emmintrin.h	(revision 194150)
+++ config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 194150)
+++ config/i386/sse.md	(working copy)
@@ -858,23 +858,26 @@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse>_vm<plusminus_insn><mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_duplicate:VF_128
+	    (plusminus:<ssescalarmode>
+	      (vec_select:<ssescalarmode>
+		(match_operand:VF_128 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
Index: config/i386/i386-builtin-types.def
===================================================================
--- config/i386/i386-builtin-types.def	(revision 194150)
+++ config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 194150)
+++ config/i386/i386.c	(working copy)
@@ -27059,22 +27059,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27163,22 +27163,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30790,34 +30790,36 @@ ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: simplify-rtx.c
===================================================================
--- simplify-rtx.c	(revision 194150)
+++ simplify-rtx.c	(working copy)
@@ -3588,20 +3588,32 @@ simplify_binary_operation_1 (enum rtx_co
 	    int len0 = XVECLEN (par0, 0);
 	    int len1 = XVECLEN (par1, 0);
 	    rtvec vec = rtvec_alloc (len0 + len1);
 	    for (int i = 0; i < len0; i++)
 	      RTVEC_ELT (vec, i) = XVECEXP (par0, 0, i);
 	    for (int i = 0; i < len1; i++)
 	      RTVEC_ELT (vec, len0 + i) = XVECEXP (par1, 0, i);
 	    return simplify_gen_binary (VEC_SELECT, mode, XEXP (trueop0, 0),
 					gen_rtx_PARALLEL (VOIDmode, vec));
 	  }
+
+	/* Recognize a simple form of VEC_MERGE.  */
+	if (GET_CODE (trueop1) == VEC_SELECT
+	    && GET_MODE (XEXP (trueop1, 0)) == mode
+	    && XVECLEN (XEXP (trueop1, 1), 0) == 1
+	    && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1)
+	  {
+	    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
+	    rtx newop1 = XEXP (trueop1, 0);
+	    return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
+					 newop0, newop1, GEN_INT (1));
+	  }
       }
       return 0;
 
     default:
       gcc_unreachable ();
     }
 
   return 0;
 }
 


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-04 16:28                         ` Marc Glisse
@ 2012-12-04 18:06                           ` Uros Bizjak
  2012-12-04 18:12                             ` H.J. Lu
  2012-12-05 14:22                             ` Marc Glisse
  0 siblings, 2 replies; 36+ messages in thread
From: Uros Bizjak @ 2012-12-04 18:06 UTC (permalink / raw)
  To: Marc Glisse; +Cc: gcc-patches, H.J. Lu

On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Tue, 4 Dec 2012, Marc Glisse wrote:
>
>> Do you agree?
>
>
> Like this ? (only tested on the new testcases, and then I'd need to ask Eric
> his opinion)
>
> 2012-12-04  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR target/54855
> gcc/
>         * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
>         with VEC_MERGE.
>
>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>         pattern.
>         * config/i386/i386-builtin-types.def: New function types.
>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>         * config/i386/emmintrin.h: Likewise.
>         * doc/extend.texi (X86 Built-in Functions): Document changed
> prototype.
>
>
> testsuite/
>         * gcc.target/i386/pr54855-1.c: New testcase.
>         * gcc.target/i386/pr54855-2.c: New testcase.

Yes, the approach taken in this patch looks really good to me. There
should be no differences in the generated code with your patch, but
let's ask HJ for his opinion on the intrinsics header changes.

A little nit below:

> +           rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
> +           rtx newop1 = XEXP (trueop1, 0);
> +           return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
> +                                        newop0, newop1, GEN_INT (1));

You can use const1_rtx here.
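
I.e.:

	    return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
					 newop0, newop1, const1_rtx);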

Thanks,
Uros.


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-04 18:06                           ` Uros Bizjak
@ 2012-12-04 18:12                             ` H.J. Lu
  2012-12-06 13:42                               ` Kirill Yukhin
  2012-12-05 14:22                             ` Marc Glisse
  1 sibling, 1 reply; 36+ messages in thread
From: H.J. Lu @ 2012-12-04 18:12 UTC (permalink / raw)
  To: Uros Bizjak, Kirill Yukhin; +Cc: Marc Glisse, gcc-patches

On Tue, Dec 4, 2012 at 10:06 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> On Tue, 4 Dec 2012, Marc Glisse wrote:
>>
>>> Do you agree?
>>
>>
>> Like this ? (only tested on the new testcases, and then I'd need to ask Eric
>> his opinion)
>>
>> 2012-12-04  Marc Glisse  <marc.glisse@inria.fr>
>>
>>         PR target/54855
>> gcc/
>>         * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
>>         with VEC_MERGE.
>>
>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>         pattern.
>>         * config/i386/i386-builtin-types.def: New function types.
>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>         * config/i386/emmintrin.h: Likewise.
>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>> prototype.
>>
>>
>> testsuite/
>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>         * gcc.target/i386/pr54855-2.c: New testcase.
>
> Yes, the approach taken in this patch looks really good to me. There
> should be no code differences with your patch, but let's ask HJ for
> his opinion on intrinsics header changes.

Hi Kirill,

Can you take a look?  Thanks.

> A little nit below:
>
>> +           rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
>> +           rtx newop1 = XEXP (trueop1, 0);
>> +           return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
>> +                                        newop0, newop1, GEN_INT (1));
>
> You can use const1_rtx here.
>
> Thanks,
> Uros.



-- 
H.J.


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-04 18:06                           ` Uros Bizjak
  2012-12-04 18:12                             ` H.J. Lu
@ 2012-12-05 14:22                             ` Marc Glisse
  2012-12-05 17:07                               ` Paolo Bonzini
  2012-12-05 21:05                               ` Eric Botcazou
  1 sibling, 2 replies; 36+ messages in thread
From: Marc Glisse @ 2012-12-05 14:22 UTC (permalink / raw)
  To: ebotcazou; +Cc: gcc-patches, Uros Bizjak, H.J. Lu

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1454 bytes --]

>> 2012-12-04  Marc Glisse  <marc.glisse@inria.fr>
>>
>>         PR target/54855
>> gcc/
>>         * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
>>         with VEC_MERGE.
>>
>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>         pattern.
>>         * config/i386/i386-builtin-types.def: New function types.
>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>         * config/i386/emmintrin.h: Likewise.
>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>> prototype.
>>
>>
>> testsuite/
>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>         * gcc.target/i386/pr54855-2.c: New testcase.

Hello Eric,

could you take a look at the small simplify-rtx bit of this patch to see 
if the general approach makes sense to you?

(this targets 4.9 and passes bootstrap+testsuite on x86_64-linux)

The point of this transformation is to avoid writing a second define_insn 
in config/i386/sse.md as in the older patch:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html

(similar patches for multiplication, division, etc will follow, and this 
will avoid an extra entry in sse.md for each of these operations)

Thanks,

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 18966 bytes --]

Index: gcc/simplify-rtx.c
===================================================================
--- gcc/simplify-rtx.c	(revision 194199)
+++ gcc/simplify-rtx.c	(working copy)
@@ -3588,20 +3588,34 @@ simplify_binary_operation_1 (enum rtx_co
 	    int len0 = XVECLEN (par0, 0);
 	    int len1 = XVECLEN (par1, 0);
 	    rtvec vec = rtvec_alloc (len0 + len1);
 	    for (int i = 0; i < len0; i++)
 	      RTVEC_ELT (vec, i) = XVECEXP (par0, 0, i);
 	    for (int i = 0; i < len1; i++)
 	      RTVEC_ELT (vec, len0 + i) = XVECEXP (par1, 0, i);
 	    return simplify_gen_binary (VEC_SELECT, mode, XEXP (trueop0, 0),
 					gen_rtx_PARALLEL (VOIDmode, vec));
 	  }
+
+	/* The x86 back-end uses VEC_CONCAT to set an element in a V2DF, but
+	   VEC_MERGE for scalar operations that preserve the other elements
+	   of a vector.  */
+	if (GET_CODE (trueop1) == VEC_SELECT
+	    && GET_MODE (XEXP (trueop1, 0)) == mode
+	    && XVECLEN (XEXP (trueop1, 1), 0) == 1
+	    && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1)
+	  {
+	    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
+	    rtx newop1 = XEXP (trueop1, 0);
+	    return gen_rtx_fmt_eee (VEC_MERGE, mode, newop0, newop1,
+				    const1_rtx);
+	  }
       }
       return 0;
 
     default:
       gcc_unreachable ();
     }
 
   return 0;
 }
 
Index: gcc/testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 194199)
+++ gcc/doc/extend.texi	(working copy)
@@ -9843,22 +9843,22 @@ int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9964,22 +9964,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 194199)
+++ gcc/config/i386/sse.md	(working copy)
@@ -858,23 +858,26 @@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse>_vm<plusminus_insn><mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_duplicate:VF_128
+	    (plusminus:<ssescalarmode>
+	      (vec_select:<ssescalarmode>
+		(match_operand:VF_128 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
Index: gcc/config/i386/i386-builtin-types.def
===================================================================
--- gcc/config/i386/i386-builtin-types.def	(revision 194199)
+++ gcc/config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 194199)
+++ gcc/config/i386/i386.c	(working copy)
@@ -27059,22 +27059,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27163,22 +27163,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30790,34 +30790,36 @@ ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 194199)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@ _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 194199)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-05 14:22                             ` Marc Glisse
@ 2012-12-05 17:07                               ` Paolo Bonzini
  2012-12-05 20:22                                 ` Marc Glisse
  2012-12-05 21:05                               ` Eric Botcazou
  1 sibling, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2012-12-05 17:07 UTC (permalink / raw)
  To: Marc Glisse; +Cc: ebotcazou, gcc-patches, Uros Bizjak, H.J. Lu

Il 05/12/2012 15:22, Marc Glisse ha scritto:
> +
> +	/* The x86 back-end uses VEC_CONCAT to set an element in a V2DF, but
> +	   VEC_MERGE for scalar operations that preserve the other elements
> +	   of a vector.  */
> +	if (GET_CODE (trueop1) == VEC_SELECT
> +	    && GET_MODE (XEXP (trueop1, 0)) == mode
> +	    && XVECLEN (XEXP (trueop1, 1), 0) == 1
> +	    && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1)
> +	  {
> +	    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
> +	    rtx newop1 = XEXP (trueop1, 0);
> +	    return gen_rtx_fmt_eee (VEC_MERGE, mode, newop0, newop1,
> +				    const1_rtx);
> +	  }

So this changes this:

   (vec_concat:M R1:N (vec_select:N V2:M [1]))

to this:

   (vec_merge:M (vec_duplicate:M R1:N) V2:M [1])

I wonder if more patterns in i386.md should be canonicalized.
Basically, the occurrences of gen_rtx_VEC_CONCAT should be changed to
simplify_gen_binary, and the fallout fixed.

Otherwise you have patterns that will not match if someone does generate
the vec_concat via simplify_gen_binary.
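
Concretely, for each occurrence a change along these lines (just a
sketch; "tmp", "lo" and "hi" stand for whatever rtxes the expander
already has at hand):

-  tmp = gen_rtx_VEC_CONCAT (V2DFmode, lo, hi);
+  tmp = simplify_gen_binary (VEC_CONCAT, V2DFmode, lo, hi);

so that the new VEC_CONCAT -> VEC_MERGE rule also fires on rtl built by
the expanders, and the remaining define_insns that match vec_concat are
adjusted to the vec_merge form.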

Paolo


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-05 17:07                               ` Paolo Bonzini
@ 2012-12-05 20:22                                 ` Marc Glisse
  0 siblings, 0 replies; 36+ messages in thread
From: Marc Glisse @ 2012-12-05 20:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: ebotcazou, gcc-patches, Uros Bizjak, H.J. Lu

On Wed, 5 Dec 2012, Paolo Bonzini wrote:

> Il 05/12/2012 15:22, Marc Glisse ha scritto:
>> +
>> +	/* The x86 back-end uses VEC_CONCAT to set an element in a V2DF, but
>> +	   VEC_MERGE for scalar operations that preserve the other elements
>> +	   of a vector.  */
>> +	if (GET_CODE (trueop1) == VEC_SELECT
>> +	    && GET_MODE (XEXP (trueop1, 0)) == mode
>> +	    && XVECLEN (XEXP (trueop1, 1), 0) == 1
>> +	    && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1)
>> +	  {
>> +	    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
>> +	    rtx newop1 = XEXP (trueop1, 0);
>> +	    return gen_rtx_fmt_eee (VEC_MERGE, mode, newop0, newop1,
>> +				    const1_rtx);
>> +	  }
>
> So this changes this:
>
>   (vec_concat:M R1:N (vec_select:N V2:M [1]))
>
> to this
>
>   (vec_merge:M (vec_duplicate:M R1:N) V2:M [1])

Yes.

> I wonder if more patterns in i386.md should be canonicalized.
> Basically, the occurrences of gen_rtx_VEC_CONCAT should be changed to
> simplify_gen_binary, and the fallout fixed.
>
> Otherwise you have patterns that will not match if someone does generate
> the vec_concat via simplify_gen_binary.

I wondered about that but underestimated the issue. If we decide that the 
vec_merge pattern is the canonical one, we should probably start by making 
ix86_expand_vector_set and others generate it (instead of the vec_concat 
one), and the simplify-rtx patch actually becomes less useful (but not 
useless).

I don't know Uros' position, but re-reading this message:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00069.html
it seems like he was indeed suggesting this.

Note that my first choice was to have the vec_concat pattern in sse.md (I 
like the vec_concat pattern better, and since ix86_expand_vector_set has a 
special case to generate it instead of vec_merge for V2DF, someone must 
have agreed at some point), but Uros wants a single entry (using macros) 
for V4SF+V2DF, and hence a similar pattern.

The only simplification we currently have with VEC_MERGE is constant 
propagation. If we are going to produce it more often, we need to add a 
few optimizations, like looking through a vec_select of a vec_merge, or 
doing nothing when merging a vector with itself, etc.
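
For the merge-with-itself case I am thinking of something as simple as
this in the VEC_MERGE case of simplify_ternary_operation (an untested
sketch):

	/* (vec_merge x x m) is x, whatever the mask m.  */
	if (rtx_equal_p (op0, op1) && !side_effects_p (op2))
	  return op0;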

-- 
Marc Glisse


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-05 14:22                             ` Marc Glisse
  2012-12-05 17:07                               ` Paolo Bonzini
@ 2012-12-05 21:05                               ` Eric Botcazou
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Botcazou @ 2012-12-05 21:05 UTC (permalink / raw)
  To: Marc Glisse; +Cc: gcc-patches, Uros Bizjak, H.J. Lu, Richard Henderson

> could you take a look at the small simplify-rtx bit of this patch to see
> if the general approach makes sense to you?
> 
> (this targets 4.9 and passes bootstrap+testsuite on x86_64-linux)
> 
> The point of this transformation is to avoid writing a second define_insn
> in config/i386/sse.md as in the older patch:
> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html
> 
> (similar patches for multiplication, division, etc will follow, and this
> will avoid an extra entry in sse.md for each of these operations)

I'm not a specialist in vector support, though.  You should instead ask RTH
(CCed), who has a far more comprehensive view of this stuff than I do.

-- 
Eric Botcazou


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-04 18:12                             ` H.J. Lu
@ 2012-12-06 13:42                               ` Kirill Yukhin
  2012-12-07  6:50                                 ` Michael Zolotukhin
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Yukhin @ 2012-12-06 13:42 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, Marc Glisse, gcc-patches List

>> Yes, the approach taken in this patch looks really good to me. There
>> should be no code differences with your patch, but let's ask HJ for
>> his opinion on intrinsics header changes.
>
> Hi Kirill,
>
> Can you take  a look?  Thanks.

Hi guys, I like the changes in the intrinsics header.

Thanks, K


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-06 13:42                               ` Kirill Yukhin
@ 2012-12-07  6:50                                 ` Michael Zolotukhin
  2012-12-07  8:46                                   ` Uros Bizjak
  2012-12-07  8:49                                   ` Marc Glisse
  0 siblings, 2 replies; 36+ messages in thread
From: Michael Zolotukhin @ 2012-12-07  6:50 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: H.J. Lu, Uros Bizjak, Marc Glisse, gcc-patches List

Hi guys,
Could I ask several questions just to clear things up?

1) Does the root problem lie in the fact that even for scalar
additions we perform the addition on the whole vector and only then
drop the higher parts of the vector? I.e., to fix the test from the PR,
do we need to replace the plus on the vector mode with a plus on the
scalar mode?

2) Is one of the main requirements having the same pattern for the
V4SF and V2DF versions?

3) I don't see vec_concat in the patterns from your patches; is it
explicitly generated by some x86 expander?

Anyway, I really like the idea of having some uniformity in describing
patterns for scalar instructions, so thank you for the work!

On 6 December 2012 17:42, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
>>> Yes, the approach taken in this patch looks really good to me. There
>>> should be no code differences with your patch, but let's ask HJ for
>>> his opinion on intrinsics header changes.
>>
>> Hi Kirill,
>>
>> Can you take  a look?  Thanks.
>
> Hi guys, I like changes in intrinsics header.
>
> Thanks, K



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07  6:50                                 ` Michael Zolotukhin
@ 2012-12-07  8:46                                   ` Uros Bizjak
  2012-12-07  8:49                                   ` Marc Glisse
  1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2012-12-07  8:46 UTC (permalink / raw)
  To: Michael Zolotukhin; +Cc: Kirill Yukhin, H.J. Lu, Marc Glisse, gcc-patches List

On Fri, Dec 7, 2012 at 7:49 AM, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Hi guys,
> Could I ask several questions just to clarify the things up?
>
> 1) Does the root problem lay in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

Yes, the existing pattern is used to implement the intrinsics, and it is
modelled with a vector operand 2. But we in fact emit a scalar operation,
so we would like to model the pattern with a scalar operand 2. This
way, the same pattern can be used to emit the intrinsics _and_ to
optimize the code from the testcase at the same time. Also, please
note that the alignment requirements for a vector memory operand and a
scalar memory operand are different.
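
To illustrate, a minimal sketch in the style of the testcases from the
patch (assuming -O -msse2):

  typedef double v2df __attribute__ ((vector_size (16)));

  v2df add_lo (v2df x, const double *p)
  {
    /* With a scalar operand 2, the 8-byte load from p can be folded
       directly into addsd, with no 16-byte alignment requirement.  */
    x[0] += *p;
    return x;
  }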

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

It is not required, but having a macroized pattern avoids pattern
explosion and eases maintenance (it is easier to understand similar
functionality if it is described in some uniform way), and in some
cases macroization opportunities force the author to rethink the RTL
description, making the patterns more "universal".

Uros.


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07  6:50                                 ` Michael Zolotukhin
  2012-12-07  8:46                                   ` Uros Bizjak
@ 2012-12-07  8:49                                   ` Marc Glisse
  2012-12-07 10:52                                     ` Michael Zolotukhin
  2012-12-07 14:43                                     ` Richard Henderson
  1 sibling, 2 replies; 36+ messages in thread
From: Marc Glisse @ 2012-12-07  8:49 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Kirill Yukhin, H.J. Lu, Uros Bizjak, gcc-patches List, rth

On Fri, 7 Dec 2012, Michael Zolotukhin wrote:

> 1) Does the root problem lay in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

The root problem is that we model the subs[sd] instructions as taking a 
128-bit second operand, when Intel's documentation says they take a 
32/64-bit operand, which is an important difference for memory operands 
(and constants). Writing a pattern that reconstructs the result from a 
scalar operation also seems more natural than pretending we are doing a 
parallel operation and dropping most of it (easier for recog and friends).

(note: I think the insn was written to support the intrinsic, which does 
take a 128-bit argument, so it did a good job for that)
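
To make the semantics concrete, an illustration (not part of the
patch):

  #include <emmintrin.h>

  __m128d sub_lo (__m128d a, __m128d b)
  {
    /* Only the low double of b is read; the high double of a is
       passed through, so the result is { a[0] - b[0], a[1] }.  */
    return _mm_sub_sd (a, b);
  }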

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

Uros seems to think that would be best.

> 3) I don't see vec_concat in patterns from your patches, is it
> explicitly generated by some x86-expander?

It is generated by ix86_expand_vector_set.

> Anyway, I really like the idea of having some uniformity in describing
> patterns for scalar instructions, so thank you for the work!

For 2-element vectors, vec_concat does seem more natural than vec_merge.
If we choose vec_merge as the canonical representation, we should choose
it for setting an element in a vector (ix86_expand_vector_set) everywhere,
not just for those scalarish operations.

So it would be good to have rth's opinion on this (svn blame seems to 
indicate he is the one who chose to use vec_concat specifically for V2DF 
instead of vec_merge).

-- 
Marc Glisse


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07  8:49                                   ` Marc Glisse
@ 2012-12-07 10:52                                     ` Michael Zolotukhin
  2012-12-07 14:02                                       ` Marc Glisse
  2012-12-07 14:43                                     ` Richard Henderson
  1 sibling, 1 reply; 36+ messages in thread
From: Michael Zolotukhin @ 2012-12-07 10:52 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Kirill Yukhin, H.J. Lu, Uros Bizjak, gcc-patches List, rth

Thanks for the explanation!

By the way, if we decide to have one pattern for V4SF instructions and
another for V2DF, we could try to use the recently introduced define_subst
here. It won't reduce the number of actual patterns (I mean the number
of patterns after iterator and subst expansion), but it could help to
make sse.md more compact.

On 7 December 2012 12:49, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Fri, 7 Dec 2012, Michael Zolotukhin wrote:
>
>> 1) Does the root problem lay in the fact that even for scalar
>> additions we perform the addition on the whole vector and only then
>> drop the higher parts of the vector? I.e. to fix the test from the PR
>> we need to replace plus on vector mode with plus on scalar mode?
>
>
> The root problem is that we model the subs[sd] instructions as taking a
> 128-bit second operand, when Intel's documentation says they take a
> 32/64-bit operand, which is an important difference for memory operands (and
> constants). Writing a pattern that reconstructs the result from a scalar
> operation also seems more natural than pretending we are doing a parallel
> operation and dropping most of it (easier for recog and friends).
>
> (note: I think the insn was written to support the intrinsic, which does
> take a 128-bit argument, so it did a good job for that)
>
>
>> 2) Is one of the main requirements having the same pattern for V4SF
>> and V2DF version?
>
>
> Uros seems to think that would be best.
>
>
>> 3) I don't see vec_concat in patterns from your patches, is it
>> explicitly generated by some x86-expander?
>
>
> It is generated by ix86_expand_vector_set.
>
>
>> Anyway, I really like the idea of having some uniformity in describing
>> patterns for scalar instructions, so thank you for the work!
>
>
> For 2-element vectors, vec_concat does seem more natural than vec_merge. If
> we chose vec_merge as the canonical representation, we should chose it for
> setting an element in a vector (ix86_expand_vector_set) everywhere, not just
> those scalarish operations.
>
> So it would be good to have rth's opinion on this (svn blame seems to
> indicate he is the one who chose to use vec_concat specifically for V2DF
> instead of vec_merge).
>
> --
> Marc Glisse



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 10:52                                     ` Michael Zolotukhin
@ 2012-12-07 14:02                                       ` Marc Glisse
  0 siblings, 0 replies; 36+ messages in thread
From: Marc Glisse @ 2012-12-07 14:02 UTC (permalink / raw)
  To: Michael Zolotukhin
  Cc: Kirill Yukhin, H.J. Lu, Uros Bizjak, gcc-patches List, rth

[-- Attachment #1: Type: TEXT/PLAIN, Size: 774 bytes --]

On Fri, 7 Dec 2012, Michael Zolotukhin wrote:

> By the way, if we decide to have one pattern for V4SF instructions and
> another for V2DF, we could try to use recently introduced define_subst
> here. It won't reduce number of actual patterns (I mean number of
> patterns after iterators and subst expanding), but it could help to
> make sse.md more compact.

Here is a version of the patch with define_subst. This does make sse.md 
more compact (the define_subst itself takes space, but it will already be 
there for mult, div, etc.). One side effect is that in the expanded .md 
file, we have both variants of the V2DF operation (I switched the builtins 
to use the _vconcat version).

(not tested beyond "make dumpmd" and a quick look at that dump)

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 18904 bytes --]

Index: testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 194301)
+++ config/i386/i386.c	(working copy)
@@ -27070,22 +27070,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27174,22 +27174,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3_vconcat,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3_vconcat,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30801,34 +30801,36 @@ ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: config/i386/xmmintrin.h
===================================================================
--- config/i386/xmmintrin.h	(revision 194301)
+++ config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@ _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: config/i386/emmintrin.h
===================================================================
--- config/i386/emmintrin.h	(revision 194301)
+++ config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 194301)
+++ config/i386/sse.md	(working copy)
@@ -404,20 +404,37 @@
 
 ;; Mix-n-match
 (define_mode_iterator AVX256MODE2P [V8SI V8SF V4DF])
 
 ;; Mapping of immediate bits for blend instructions
 (define_mode_attr blendbits
   [(V8SF "255") (V4SF "15") (V4DF "15") (V2DF "3")])
 
 ;; Patterns whose name begins with "sse{,2,3}_" are invoked by intrinsics.
 
+;; Substitutions
+
+(define_subst "replace_vec_merge_with_vec_concat"
+  [(set (match_operand:V2DF 0 "" "")
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 2 "" ""))
+	  (match_operand:V2DF 1 "" "")
+	  (const_int 1)))]
+  "TARGET_SSE2"
+  [(set (match_dup 0)
+	(vec_concat:V2DF
+	  (match_dup 2)
+	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))])
+
+(define_subst_attr "vec_merge_or_concat" "replace_vec_merge_with_vec_concat"
+		   "" "_vconcat")
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
 ;; Move patterns
 ;;
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
 ;; All of these patterns are enabled for SSE1 as well as SSE2.
 ;; This is essential for maintaining stable calling conventions.
 
 (define_expand "mov<mode>"
@@ -855,26 +872,29 @@
 	  (match_operand:VF 2 "nonimmediate_operand" "xm,xm")))]
   "TARGET_SSE && ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
   "@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
-(define_insn "<sse>_vm<plusminus_insn><mode>3"
+(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"
   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_duplicate:VF_128
+	    (plusminus:<ssescalarmode>
+	      (vec_select:<ssescalarmode>
+		(match_operand:VF_128 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
Index: config/i386/i386-builtin-types.def
===================================================================
--- config/i386/i386-builtin-types.def	(revision 194301)
+++ config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: doc/extend.texi
===================================================================
--- doc/extend.texi	(revision 194301)
+++ doc/extend.texi	(working copy)
@@ -9843,22 +9843,22 @@ int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9964,22 +9964,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07  8:49                                   ` Marc Glisse
  2012-12-07 10:52                                     ` Michael Zolotukhin
@ 2012-12-07 14:43                                     ` Richard Henderson
  2012-12-07 14:47                                       ` Jakub Jelinek
  2012-12-07 15:00                                       ` Marc Glisse
  1 sibling, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2012-12-07 14:43 UTC (permalink / raw)
  To: Marc Glisse
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On 2012-12-07 02:49, Marc Glisse wrote:
> The root problem is that we model the subs[sd] instructions as taking
> a 128-bit second operand, when Intel's documentation says they take a
> 32/64-bit operand, which is an important difference for memory
> operands (and constants). Writing a pattern that reconstructs the
> result from a scalar operation also seems more natural than
> pretending we are doing a parallel operation and dropping most of it
> (easier for recog and friends).

I agree this is a problem with the current representation.
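
For instance (a hypothetical snippet, not from the patch, just to make
the difference concrete):

#include <emmintrin.h>

/* subsd only reads 64 bits of its second operand, so with a DF-mode
   operand in the pattern the load of *p can fold directly into subsd;
   with the 128-bit-operand model it tends to go through a separate
   movsd first.  */
__m128d
sub_from_mem (__m128d x, const double *p)
{
  return _mm_sub_sd (x, _mm_set_sd (*p));
}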

> For 2-element vectors, vec_concat does seem more natural than
> vec_merge. If we chose vec_merge as the canonical representation, we
> should chose it for setting an element in a vector
> (ix86_expand_vector_set) everywhere, not just those scalarish
> operations.

I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
and to the detriment of e.g. mips.  There are plenty of embedded simd
implementations that are V2xx only.

If we simply pull the various x86 patterns into one common form, set
and extract included, does that buy us most of what we'd get for
playing games in combine?


As for your xmmintrin.h changes, I'd like to see a test case that verifies
that _mm_add_ss(a, b) does not add extra insns to extract __B[0].
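
Something along these lines, perhaps (only a sketch; the options and the
final scan are guesses):

/* { dg-do compile } */
/* { dg-options "-O -msse" } */

#include <xmmintrin.h>

__m128 add (__m128 a, __m128 b)
{
  /* Expect a single addss; extracting __B[0] must not introduce
     extra moves or shuffles.  */
  return _mm_add_ss (a, b);
}

/* { dg-final { scan-assembler-not "mov" } } */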

> +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"
>    [(set (match_operand:VF_128 0 "register_operand" "=x,x")
>  	(vec_merge:VF_128
> -	  (plusminus:VF_128
> -	    (match_operand:VF_128 1 "register_operand" "0,x")
> -	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> +	  (vec_duplicate:VF_128
> +	    (plusminus:<ssescalarmode>
> +	      (vec_select:<ssescalarmode>
> +		(match_operand:VF_128 1 "register_operand" "0,x")
> +		(parallel [(const_int 0)]))
> +	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
>  	  (match_dup 1)
>  	  (const_int 1)))]
>    "TARGET_SSE"
>    "@
>     <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
>     v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
>    [(set_attr "isa" "noavx,avx")
>     (set_attr "type" "sseadd")
>     (set_attr "prefix" "orig,vex")
>     (set_attr "mode" "<ssescalarmode>")])

Did this really trigger as a substitution?  It's not supposed to have, since
you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")...


r~

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 14:43                                     ` Richard Henderson
@ 2012-12-07 14:47                                       ` Jakub Jelinek
  2012-12-07 14:53                                         ` Richard Henderson
  2012-12-07 15:00                                       ` Marc Glisse
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Jelinek @ 2012-12-07 14:47 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Marc Glisse, Michael Zolotukhin, Kirill Yukhin, H.J. Lu,
	Uros Bizjak, gcc-patches List

On Fri, Dec 07, 2012 at 08:43:05AM -0600, Richard Henderson wrote:
> > +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"
> >    [(set (match_operand:VF_128 0 "register_operand" "=x,x")
> >  	(vec_merge:VF_128
> > -	  (plusminus:VF_128
> > -	    (match_operand:VF_128 1 "register_operand" "0,x")
> > -	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> > +	  (vec_duplicate:VF_128
> > +	    (plusminus:<ssescalarmode>
> > +	      (vec_select:<ssescalarmode>
> > +		(match_operand:VF_128 1 "register_operand" "0,x")
> > +		(parallel [(const_int 0)]))
> > +	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
> >  	  (match_dup 1)
> >  	  (const_int 1)))]
> >    "TARGET_SSE"
> >    "@
> >     <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
> >     v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
> >    [(set_attr "isa" "noavx,avx")
> >     (set_attr "type" "sseadd")
> >     (set_attr "prefix" "orig,vex")
> >     (set_attr "mode" "<ssescalarmode>")])
> 
> Did this really trigger as a substitution?  It's not supposed to have, since
> you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")...

That was the older proposal; the current way to trigger it is to use
the substitution attr somewhere, typically in the pattern name:
<vec_merge_or_concat> in the above case.

	Jakub

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 14:47                                       ` Jakub Jelinek
@ 2012-12-07 14:53                                         ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2012-12-07 14:53 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Marc Glisse, Michael Zolotukhin, Kirill Yukhin, H.J. Lu,
	Uros Bizjak, gcc-patches List

On 2012-12-07 08:47, Jakub Jelinek wrote:
> That was the older proposal; the current way to trigger it is to use
> the substitution attr somewhere, typically in the pattern name:
> <vec_merge_or_concat> in the above case.

Ah right.

("My mind is going, Dave.  I can feel it.")


r~

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 14:43                                     ` Richard Henderson
  2012-12-07 14:47                                       ` Jakub Jelinek
@ 2012-12-07 15:00                                       ` Marc Glisse
  2012-12-07 15:06                                         ` Richard Henderson
  1 sibling, 1 reply; 36+ messages in thread
From: Marc Glisse @ 2012-12-07 15:00 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 02:49, Marc Glisse wrote:
>> For 2-element vectors, vec_concat does seem more natural than
>> vec_merge. If we chose vec_merge as the canonical representation, we
>> should choose it for setting an element in a vector
>> (ix86_expand_vector_set) everywhere, not just for those scalarish
>> operations.
>
> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
> and to the detriment of e.g. mips.  There are plenty of embedded simd
> implementations that are V2xx only.
>
> If we simply pull the various x86 patterns into one common form, set
> and extract included, does that buy us most of what we'd get for
> playing games in combine?

I'm sorry, could you be more precise? I don't see clearly what you are 
suggesting.

> As for your xmmintrin.h changes, I'd like to see a test case that verifies
> that _mm_add_ss(a, b) does not add extra insns to extract __B[0].

Yes, good idea, thanks.

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 15:00                                       ` Marc Glisse
@ 2012-12-07 15:06                                         ` Richard Henderson
  2012-12-07 15:12                                           ` Marc Glisse
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2012-12-07 15:06 UTC (permalink / raw)
  To: Marc Glisse
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On 2012-12-07 09:00, Marc Glisse wrote:
> On Fri, 7 Dec 2012, Richard Henderson wrote:
> 
>> On 2012-12-07 02:49, Marc Glisse wrote:
>>> For 2-element vectors, vec_concat does seem more natural than
>>> vec_merge. If we chose vec_merge as the canonical representation, we
>>> should choose it for setting an element in a vector
>>> (ix86_expand_vector_set) everywhere, not just for those scalarish
>>> operations.
>>
>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
>> and to the detriment of e.g. mips.  There are plenty of embedded simd
>> implementations that are V2xx only.
>>
>> If we simply pull the various x86 patterns into one common form, set
>> and extract included, does that buy us most of what we'd get for
>> playing games in combine?
> 
> I'm sorry, could you be more precise? I don't see clearly what you are suggesting.

Don't change combine?

Have I lost the plot somewhere here?


r~

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 15:06                                         ` Richard Henderson
@ 2012-12-07 15:12                                           ` Marc Glisse
  2012-12-07 16:24                                             ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Glisse @ 2012-12-07 15:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 09:00, Marc Glisse wrote:
>> On Fri, 7 Dec 2012, Richard Henderson wrote:
>>
>>> On 2012-12-07 02:49, Marc Glisse wrote:
>>>> For 2-element vectors, vec_concat does seem more natural than
>>>> vec_merge. If we chose vec_merge as the canonical representation, we
>>>> should choose it for setting an element in a vector
>>>> (ix86_expand_vector_set) everywhere, not just for those scalarish
>>>> operations.
>>>
>>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
>>> and to the detriment of e.g. mips.  There are plenty of embedded simd
>>> implementations that are V2xx only.
>>>
>>> If we simply pull the various x86 patterns into one common form, set
>>> and extract included, does that buy us most of what we'd get for
>>> playing games in combine?
>>
>> I'm sorry, could you be more precise? I don't see clearly what you are suggesting.
>
> Don't change combine?

but change ix86_expand_vector_set and others to generate vec_merge and 
have only the vec_merge define_insn in sse.md? I guess it would buy a 
large part of it. That's a pretty invasive change, I'll have to try...

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 15:12                                           ` Marc Glisse
@ 2012-12-07 16:24                                             ` Richard Henderson
  2012-12-07 17:23                                               ` Marc Glisse
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2012-12-07 16:24 UTC (permalink / raw)
  To: Marc Glisse
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On 2012-12-07 09:12, Marc Glisse wrote:
> but change ix86_expand_vector_set and others to generate vec_merge
> and have only the vec_merge define_insn in sse.md? I guess it would
> buy a large part of it. That's a pretty invasive change, I'll have to
> try...

Is it really that invasive?  Anyway, it's something worth trying for 4.9...


r~

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 16:24                                             ` Richard Henderson
@ 2012-12-07 17:23                                               ` Marc Glisse
  2012-12-08  5:47                                                 ` Marc Glisse
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Glisse @ 2012-12-07 17:23 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 09:12, Marc Glisse wrote:
>> but change ix86_expand_vector_set and others to generate vec_merge
>> and have only the vec_merge define_insn in sse.md? I guess it would
>> buy a large part of it. That's a pretty invasive change, I'll have to
>> try...
>
> Is it really that invasive?

No; changing only V2DF, I seem to have the basic pieces in place after 
changing just 6 patterns in sse.md and a couple of functions in i386.c. 
Now I need to test it and see how much it affects the generated code...

> Anyway, it's something worth trying for 4.9...

Should I take it that if it ends up looking manageable, you prefer 
changing all the V2DF operations to vec_merge, rather than just adapting 
the addsd pattern? Won't it complicate things for generic optimizations if 
different platforms have different canonical patterns for the same 
operation on V2DF? It seems to me that it would be good if we agreed on a 
pattern common to all platforms so we could canonicalize as in the last 
part of:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00243.html
(or possibly the reverse transformation depending on the pattern we settle 
on).

(those 2 patches didn't touch the generic simplify-rtx code either, 
keeping just the "don't touch combine" part of your suggestion ;-)
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00492.html )

-- 
Marc Glisse

PS: I'll ping you about this other patch when trunk re-opens for 4.9, if 
you don't mind:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00079.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-07 17:23                                               ` Marc Glisse
@ 2012-12-08  5:47                                                 ` Marc Glisse
  2012-12-12 15:48                                                   ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Marc Glisse @ 2012-12-08  5:47 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2030 bytes --]

On Fri, 7 Dec 2012, Marc Glisse wrote:

> On Fri, 7 Dec 2012, Richard Henderson wrote:
>
>> On 2012-12-07 09:12, Marc Glisse wrote:
>>> but change ix86_expand_vector_set and others to generate vec_merge
>>> and have only the vec_merge define_insn in sse.md? I guess it would
>>> buy a large part of it. That's a pretty invasive change, I'll have to
>>> try...
>> 
>> Is it really that invasive?
>
> No; changing only V2DF, I seem to have the basic pieces in place after 
> changing just 6 patterns in sse.md and a couple of functions in i386.c. 
> Now I need to test it and see how much it affects the generated code...

Here is a patch that passes bootstrap+testsuite. I didn't notice anything 
unusual about the generated code. Sure, something like 
_mm_add_sd(x,y)[1] simplifies with any of the vec_concat patches but not 
with this vec_merge patch, but that's just a trivial missing piece of code 
in simplify-rtx.c that I'll want to write for 4.9 anyway. My personal 
taste is still for vec_concat, but I'm ok with this one.
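
Concretely, the case I mean is (a minimal sketch):

#include <emmintrin.h>

double f (__m128d x, __m128d y)
{
  /* With the vec_concat patterns this folds to "return x[1];", since
     addsd preserves the high part of x; with vec_merge, simplify-rtx.c
     lacks the corresponding rule for now.  */
  return _mm_add_sd (x, y)[1];
}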

(Off topic remark: if I do
v2df x;
x[0]+=1;
y=(v2df){x[1],x[0]};
the compiler sees {x[1],x[0]+1} and never guesses that it should do addsd 
and then shuffle, whereas if I use _mm_add_sd I get the nice 2-line asm)
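
Written out in full (a sketch using the GNU vector extension):

typedef double v2df __attribute__((vector_size(16)));

v2df f (v2df x)
{
  x[0] += 1;
  /* The compiler sees {x[1], x[0] + 1} here and does not synthesize
     addsd followed by a shuffle.  */
  return (v2df){ x[1], x[0] };
}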


2012-12-08  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855
gcc/
 	* config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
 	pattern.
 	(sse2_loadlpd, sse2_loadhpd): Use vec_merge.
 	* config/i386/i386-builtin-types.def: New function types.
 	* config/i386/i386.c (ix86_expand_args_builtin): Likewise.
 	(bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
 	__builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
 	(ix86_expand_vector_set): Use vec_merge for V2DF.
 	* config/i386/xmmintrin.h: Adapt to new builtin prototype.
 	* config/i386/emmintrin.h: Likewise.
 	* doc/extend.texi (X86 Built-in Functions): Document changed prototype.

testsuite/
 	* gcc.target/i386/pr54855-1.c: New testcase.
 	* gcc.target/i386/pr54855-2.c: New testcase.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/PLAIN, Size: 24933 bytes --]

Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 194309)
+++ gcc/doc/extend.texi	(working copy)
@@ -9843,22 +9843,22 @@ int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9964,22 +9964,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)
Index: gcc/testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+#include <emmintrin.h>
+
+__m128d f (__m128d x)
+{
+    __m128d y = { 2, 0 };
+      return _mm_add_sd (x, y);
+}
+
+__m128d g (__m128d x)
+{
+    __m128d y = { 1, 0 };
+      return _mm_sub_sd (x, y);
+}
+
+__m128d h (__m128d x, __m128d y)
+{
+    return _mm_add_sd (x, y);
+}
+
+__m128d i (__m128d x, __m128d y)
+{
+    return _mm_sub_sd (x, y);
+}
+
+__m128d j (__m128d x)
+{
+  x[0] += 2;
+  return x;
+}
+
+__m128d k (__m128d x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+#include <xmmintrin.h>
+
+__m128 f (__m128 x)
+{
+    __m128 y = { 2, 0, 0, 0 };
+      return _mm_add_ss (x, y);
+}
+
+__m128 g (__m128 x)
+{
+    __m128 y = { 1, 0, 0, 0 };
+      return _mm_sub_ss (x, y);
+}
+
+__m128 h (__m128 x, __m128 y)
+{
+    return _mm_add_ss (x, y);
+}
+
+__m128 i (__m128 x, __m128 y)
+{
+    return _mm_sub_ss (x, y);
+}
+
+__m128 j (__m128 x)
+{
+  x[0] += 2;
+  return x;
+}
+
+__m128 k (__m128 x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 194309)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@ _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 194309)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 194309)
+++ gcc/config/i386/sse.md	(working copy)
@@ -858,23 +858,26 @@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse>_vm<plusminus_insn><mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_duplicate:VF_128
+	    (plusminus:<ssescalarmode>
+	      (vec_select:<ssescalarmode>
+		(match_operand:VF_128 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
@@ -5006,106 +5009,103 @@
    && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
   "@
    movlps\t{%1, %0|%0, %1}
    movaps\t{%1, %0|%0, %1}
    movlps\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
    (set_attr "mode" "V2SF,V4SF,V2SF")])
 
 (define_expand "sse2_loadhpd_exp"
   [(set (match_operand:V2DF 0 "nonimmediate_operand")
-	(vec_concat:V2DF
-	  (vec_select:DF
-	    (match_operand:V2DF 1 "nonimmediate_operand")
-	    (parallel [(const_int 0)]))
-	  (match_operand:DF 2 "nonimmediate_operand")))]
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 2 "nonimmediate_operand"))
+	  (match_operand:V2DF 1 "nonimmediate_operand")
+	  (const_int 2)))]
   "TARGET_SSE2"
 {
   rtx dst = ix86_fixup_binary_operands (UNKNOWN, V2DFmode, operands);
 
   emit_insn (gen_sse2_loadhpd (dst, operands[1], operands[2]));
 
   /* Fix up the destination if needed.  */
   if (dst != operands[0])
     emit_move_insn (operands[0], dst);
 
   DONE;
 })
 
 ;; Avoid combining registers from different units in a single alternative,
 ;; see comment above inline_secondary_memory_needed function in i386.c
 (define_insn "sse2_loadhpd"
   [(set (match_operand:V2DF 0 "nonimmediate_operand"
 	  "=x,x,x,x,o,o ,o")
-	(vec_concat:V2DF
-	  (vec_select:DF
-	    (match_operand:V2DF 1 "nonimmediate_operand"
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 2 "nonimmediate_operand"
+	  " m,m,x,x,x,*f,r"))
+	  (match_operand:V2DF 1 "nonimmediate_operand"
 	  " 0,x,0,x,0,0 ,0")
-	    (parallel [(const_int 0)]))
-	  (match_operand:DF 2 "nonimmediate_operand"
-	  " m,m,x,x,x,*f,r")))]
+	  (const_int 2)))]
   "TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "@
    movhpd\t{%2, %0|%0, %2}
    vmovhpd\t{%2, %1, %0|%0, %1, %2}
    unpcklpd\t{%2, %0|%0, %2}
    vunpcklpd\t{%2, %1, %0|%0, %1, %2}
    #
    #
    #"
   [(set_attr "isa" "noavx,avx,noavx,avx,*,*,*")
    (set_attr "type" "ssemov,ssemov,sselog,sselog,ssemov,fmov,imov")
    (set_attr "prefix_data16" "1,*,*,*,*,*,*")
    (set_attr "prefix" "orig,vex,orig,vex,*,*,*")
    (set_attr "mode" "V1DF,V1DF,V2DF,V2DF,DF,DF,DF")])
 
 (define_split
   [(set (match_operand:V2DF 0 "memory_operand")
-	(vec_concat:V2DF
-	  (vec_select:DF (match_dup 0) (parallel [(const_int 0)]))
-	  (match_operand:DF 1 "register_operand")))]
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 1 "register_operand"))
+	  (match_dup 0)
+	  (const_int 2)))]
   "TARGET_SSE2 && reload_completed"
   [(set (match_dup 0) (match_dup 1))]
   "operands[0] = adjust_address (operands[0], DFmode, 8);")
 
 (define_expand "sse2_loadlpd_exp"
   [(set (match_operand:V2DF 0 "nonimmediate_operand")
-	(vec_concat:V2DF
-	  (match_operand:DF 2 "nonimmediate_operand")
-	  (vec_select:DF
-	    (match_operand:V2DF 1 "nonimmediate_operand")
-	    (parallel [(const_int 1)]))))]
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 2 "nonimmediate_operand"))
+	  (match_operand:V2DF 1 "nonimmediate_operand")
+	  (const_int 1)))]
   "TARGET_SSE2"
 {
   rtx dst = ix86_fixup_binary_operands (UNKNOWN, V2DFmode, operands);
 
   emit_insn (gen_sse2_loadlpd (dst, operands[1], operands[2]));
 
   /* Fix up the destination if needed.  */
   if (dst != operands[0])
     emit_move_insn (operands[0], dst);
 
   DONE;
 })
 
 ;; Avoid combining registers from different units in a single alternative,
 ;; see comment above inline_secondary_memory_needed function in i386.c
 (define_insn "sse2_loadlpd"
   [(set (match_operand:V2DF 0 "nonimmediate_operand"
 	  "=x,x,x,x,x,x,x,x,m,m ,m")
-	(vec_concat:V2DF
-	  (match_operand:DF 2 "nonimmediate_operand"
-	  " m,m,m,x,x,0,0,x,x,*f,r")
-	  (vec_select:DF
-	    (match_operand:V2DF 1 "vector_move_operand"
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 2 "nonimmediate_operand"
+	  " m,m,m,x,x,0,0,x,x,*f,r"))
+	  (match_operand:V2DF 1 "vector_move_operand"
 	  " C,0,x,0,x,x,o,o,0,0 ,0")
-	    (parallel [(const_int 1)]))))]
+	  (const_int 1)))]
   "TARGET_SSE2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "@
    %vmovsd\t{%2, %0|%0, %2}
    movlpd\t{%2, %0|%0, %2}
    vmovlpd\t{%2, %1, %0|%0, %1, %2}
    movsd\t{%2, %0|%0, %2}
    vmovsd\t{%2, %1, %0|%0, %1, %2}
    shufpd\t{$2, %1, %0|%0, %1, 2}
    movhpd\t{%H1, %0|%0, %H1}
    vmovhpd\t{%H1, %2, %0|%0, %2, %H1}
@@ -5122,23 +5122,24 @@
 	      (const_string "imov")
 	   ]
 	   (const_string "ssemov")))
    (set_attr "prefix_data16" "*,1,*,*,*,*,1,*,*,*,*")
    (set_attr "length_immediate" "*,*,*,*,*,1,*,*,*,*,*")
    (set_attr "prefix" "maybe_vex,orig,vex,orig,vex,orig,orig,vex,*,*,*")
    (set_attr "mode" "DF,V1DF,V1DF,V1DF,V1DF,V2DF,V1DF,V1DF,DF,DF,DF")])
 
 (define_split
   [(set (match_operand:V2DF 0 "memory_operand")
-	(vec_concat:V2DF
-	  (match_operand:DF 1 "register_operand")
-	  (vec_select:DF (match_dup 0) (parallel [(const_int 1)]))))]
+	(vec_merge:V2DF
+	  (vec_duplicate:V2DF (match_operand:DF 1 "register_operand"))
+	  (match_dup 0)
+	  (const_int 1)))]
   "TARGET_SSE2 && reload_completed"
   [(set (match_dup 0) (match_dup 1))]
   "operands[0] = adjust_address (operands[0], DFmode, 0);")
 
 (define_insn "sse2_movsd"
   [(set (match_operand:V2DF 0 "nonimmediate_operand"   "=x,x,x,x,m,x,x,x,o")
 	(vec_merge:V2DF
 	  (match_operand:V2DF 2 "nonimmediate_operand" " x,x,m,m,x,0,0,x,0")
 	  (match_operand:V2DF 1 "nonimmediate_operand" " 0,x,0,x,0,x,o,o,x")
 	  (const_int 1)))]
Index: gcc/config/i386/i386-builtin-types.def
===================================================================
--- gcc/config/i386/i386-builtin-types.def	(revision 194309)
+++ gcc/config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 194309)
+++ gcc/config/i386/i386.c	(working copy)
@@ -27070,22 +27070,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27174,22 +27174,22 @@ static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30801,34 +30801,36 @@ ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
@@ -36442,38 +36444,22 @@ ix86_expand_vector_set (bool mmx_ok, rtx
       tmp = gen_reg_rtx (GET_MODE_INNER (mode));
       ix86_expand_vector_extract (false, tmp, target, 1 - elt);
       if (elt == 0)
 	tmp = gen_rtx_VEC_CONCAT (mode, val, tmp);
       else
 	tmp = gen_rtx_VEC_CONCAT (mode, tmp, val);
       emit_insn (gen_rtx_SET (VOIDmode, target, tmp));
       return;
 
     case V2DFmode:
-      {
-	rtx op0, op1;
-
-	/* For the two element vectors, we implement a VEC_CONCAT with
-	   the extraction of the other element.  */
-
-	tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, GEN_INT (1 - elt)));
-	tmp = gen_rtx_VEC_SELECT (inner_mode, target, tmp);
-
-	if (elt == 0)
-	  op0 = val, op1 = tmp;
-	else
-	  op0 = tmp, op1 = val;
-
-	tmp = gen_rtx_VEC_CONCAT (mode, op0, op1);
-	emit_insn (gen_rtx_SET (VOIDmode, target, tmp));
-      }
-      return;
+      use_vec_merge = TARGET_SSE2;
+      break;
 
     case V4SFmode:
       use_vec_merge = TARGET_SSE4_1;
       if (use_vec_merge)
 	break;
 
       switch (elt)
 	{
 	case 0:
 	  use_vec_merge = true;

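For illustration, here is a minimal sketch of how an intrinsic wrapper would
adapt to the new V2DF_FTYPE_V2DF_DOUBLE prototype above, where the builtin now
takes the scalar operand as a plain double rather than as a whole vector.
This is hypothetical code: the actual xmmintrin.h/emmintrin.h hunks are not
part of this excerpt, and the wrapper shown assumes the patched compiler.

/* Type definitions as in emmintrin.h.  */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

extern __inline __m128d
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  /* Extract the low element of __B and pass it as a double, matching
     the new V2DF_FTYPE_V2DF_DOUBLE builtin signature.  */
  return (__m128d) __builtin_ia32_addsd ((__v2df)__A, ((__v2df)__B)[0]);
}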

* Re: [i386] scalar ops that preserve the high part of a vector
  2012-12-08  5:47                                                 ` Marc Glisse
@ 2012-12-12 15:48                                                   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2012-12-12 15:48 UTC (permalink / raw)
  To: Marc Glisse
  Cc: Michael Zolotukhin, Kirill Yukhin, H.J. Lu, Uros Bizjak,
	gcc-patches List

On 12/07/2012 04:02 PM, Marc Glisse wrote:
> 2012-12-08  Marc Glisse  <marc.glisse@inria.fr>
> 
>     PR target/54855
> gcc/
>     * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>     pattern.
>     (sse2_loadlpd, sse2_loadhpd): Use vec_merge.
>     * config/i386/i386-builtin-types.def: New function types.
>     * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>     (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>     __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>     (ix86_expand_vector_set): Use vec_merge for V2DF.
>     * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>     * config/i386/emmintrin.h: Likewise.
>     * doc/extend.texi (X86 Built-in Functions): Document changed prototype.
> 
> testsuite/
>     * gcc.target/i386/pr54855-1.c: New testcase.
>     * gcc.target/i386/pr54855-2.c: New testcase.

This looks like the right approach.

I won't approve this for 4.8 because (1) this isn't a regression,
and (2) all of the other operations want to be handled similarly.

But I'll approve this for 4.9 as part 1 of a series.  (The series
need not be developed and committed all at once, but it should all
be done before the end of the next stage1.)


r~
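
For reference, a self-contained example of the kind of source the series is
meant to improve (illustrative only; the pr54855-1.c and pr54855-2.c
testcases themselves are not reproduced in this excerpt):

#include <emmintrin.h>

/* With the vec_merge-based patterns, combine should collapse this into a
   single addsd: the low element of x is added to y while the high element
   of x passes through unchanged, with no extra register moves.  */
__m128d
add_low (__m128d x, double y)
{
  return _mm_add_sd (x, _mm_set_sd (y));
}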


Thread overview: 36+ messages
2012-10-13  9:33 [i386] scalar ops that preserve the high part of a vector Marc Glisse
2012-10-14  9:54 ` Uros Bizjak
2012-10-14 12:52   ` Marc Glisse
2012-11-30 12:36     ` Marc Glisse
2012-11-30 13:55       ` Uros Bizjak
2012-11-30 22:36         ` Marc Glisse
2012-12-01 17:27           ` Marc Glisse
2012-12-02 10:51             ` Uros Bizjak
2012-12-02 12:30               ` Marc Glisse
2012-12-03  8:53                 ` Uros Bizjak
2012-12-03 15:34                   ` Marc Glisse
2012-12-03 17:55                     ` Uros Bizjak
2012-12-04 14:05                       ` Marc Glisse
2012-12-04 16:28                         ` Marc Glisse
2012-12-04 18:06                           ` Uros Bizjak
2012-12-04 18:12                             ` H.J. Lu
2012-12-06 13:42                               ` Kirill Yukhin
2012-12-07  6:50                                 ` Michael Zolotukhin
2012-12-07  8:46                                   ` Uros Bizjak
2012-12-07  8:49                                   ` Marc Glisse
2012-12-07 10:52                                     ` Michael Zolotukhin
2012-12-07 14:02                                       ` Marc Glisse
2012-12-07 14:43                                     ` Richard Henderson
2012-12-07 14:47                                       ` Jakub Jelinek
2012-12-07 14:53                                         ` Richard Henderson
2012-12-07 15:00                                       ` Marc Glisse
2012-12-07 15:06                                         ` Richard Henderson
2012-12-07 15:12                                           ` Marc Glisse
2012-12-07 16:24                                             ` Richard Henderson
2012-12-07 17:23                                               ` Marc Glisse
2012-12-08  5:47                                                 ` Marc Glisse
2012-12-12 15:48                                                   ` Richard Henderson
2012-12-05 14:22                             ` Marc Glisse
2012-12-05 17:07                               ` Paolo Bonzini
2012-12-05 20:22                                 ` Marc Glisse
2012-12-05 21:05                               ` Eric Botcazou
