public inbox for gcc-patches@gcc.gnu.org
* [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
@ 2023-07-30 20:12 Uros Bizjak
  2023-07-31  9:40 ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Uros Bizjak @ 2023-07-30 20:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener, Jan Hubicka, Hongtao Liu

[-- Attachment #1: Type: text/plain, Size: 1666 bytes --]

Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
named patterns in order to avoid generation of partial vector V4SFmode
trapping instructions.

The new option is enabled by default, because even with sanitization,
a small but consistent speed up of 2 to 3% with Polyhedron capacita
benchmark can be achieved vs. scalar code.

Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
vs. scalar code.  This is what clang does by default, as it defaults
to -fno-trapping-math.
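
For illustration, the kind of code affected is a V2SF operation such as
the following minimal sketch (along the lines of the gcc.target/i386
tests added later in this thread):

  typedef float __attribute__((vector_size (8))) v2sf;

  v2sf
  test (v2sf a, v2sf b)
  {
    /* Compiled to a full 128-bit addps.  With trapping math the unused
       upper halves of the inputs are zeroed first so they cannot raise
       spurious FP exceptions; with -fno-trapping-math this patch omits
       the zeroing.  */
    return a + b;
  }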

    PR target/110832

gcc/ChangeLog:

    * config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
    * config/i386/i386.opt (mmmxfp-with-sse): New option.
    * config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
    upper part of V2SFmode register with -fno-trapping-math.
    (<plusminusmult:insn>v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
    (divv2sf3): Ditto.
    (<smaxmin:code>v2sf3): Ditto.
    (sqrtv2sf2): Ditto.
    (*mmx_haddv2sf3_low): Ditto.
    (*mmx_hsubv2sf3_low): Ditto.
    (vec_addsubv2sf3): Ditto.
    (vec_cmpv2sfv2si): Ditto.
    (vcond<V2FI:mode>v2sf): Ditto.
    (fmav2sf4): Ditto.
    (fmsv2sf4): Ditto.
    (fnmav2sf4): Ditto.
    (fnmsv2sf4): Ditto.
    (fix_truncv2sfv2si2): Ditto.
    (fixuns_truncv2sfv2si2): Ditto.
    (floatv2siv2sf2): Ditto.
    (floatunsv2siv2sf2): Ditto.
    (nearbyintv2sf2): Ditto.
    (rintv2sf2): Ditto.
    (lrintv2sfv2si2): Ditto.
    (ceilv2sf2): Ditto.
    (lceilv2sfv2si2): Ditto.
    (floorv2sf2): Ditto.
    (lfloorv2sfv2si2): Ditto.
    (btruncv2sf2): Ditto.
    (roundv2sf2): Ditto.
    (lroundv2sfv2si2): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.

[-- Attachment #2: r.diff.txt --]
[-- Type: text/plain, Size: 10777 bytes --]

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index ef342fcee9b..af72b6c48a9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -50,6 +50,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define TARGET_16BIT_P(x)	TARGET_CODE16_P(x)
 
 #define TARGET_MMX_WITH_SSE	(TARGET_64BIT && TARGET_SSE2)
+#define TARGET_MMXFP_WITH_SSE	(TARGET_MMX_WITH_SSE && ix86_mmxfp_with_sse)
 
 #include "config/vxworks-dummy.h"
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..1b65fed5daf 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -670,6 +670,10 @@ m3dnowa
 Target Mask(ISA_3DNOW_A) Var(ix86_isa_flags) Save
 Support Athlon 3Dnow! built-in functions.
 
+mmmxfp-with-sse
+Target Var(ix86_mmxfp_with_sse) Init(1)
+Enable MMX floating point vectors in SSE registers
+
 msse
 Target Mask(ISA_SSE) Var(ix86_isa_flags) Save
 Support MMX and SSE built-in functions and code generation.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 896af76a33f..0555da9022b 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -597,7 +597,18 @@ (define_expand "movq_<mode>_to_sse"
 	  (match_operand:V2FI 1 "nonimmediate_operand")
 	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
@@ -650,7 +661,7 @@ (define_expand "<insn>v2sf3"
 	(plusminusmult:V2SF
 	  (match_operand:V2SF 1 "nonimmediate_operand")
 	  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -728,7 +739,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
 	(div:V2SF (match_operand:V2SF 1 "register_operand")
 		  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -750,7 +761,7 @@ (define_expand "<code>v2sf3"
         (smaxmin:V2SF
 	  (match_operand:V2SF 1 "register_operand")
 	  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -852,7 +863,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -933,7 +944,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(match_operand:SI 3 "const_0_to_1_operand")]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE
    && INTVAL (operands[2]) != INTVAL (operands[3])
    && ix86_pre_reload_split ()"
   "#"
@@ -979,7 +990,7 @@ (define_insn_and_split "*mmx_hsubv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(const_int 1)]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -1041,7 +1052,7 @@ (define_expand "vec_addsubv2sf3"
 	    (match_operand:V2SF 2 "nonimmediate_operand"))
 	  (plus:V2SF (match_dup 1) (match_dup 2))
 	  (const_int 1)))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -1104,7 +1115,7 @@ (define_expand "vec_cmpv2sfv2si"
 	(match_operator:V2SI 1 ""
 	  [(match_operand:V2SF 2 "nonimmediate_operand")
 	   (match_operand:V2SF 3 "nonimmediate_operand")]))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx ops[4];
   ops[3] = gen_reg_rtx (V4SFmode);
@@ -1130,7 +1141,7 @@ (define_expand "vcond<mode>v2sf"
 	     (match_operand:V2SF 5 "nonimmediate_operand")])
 	  (match_operand:V2FI 1 "general_operand")
 	  (match_operand:V2FI 2 "general_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx ops[6];
   ops[5] = gen_reg_rtx (V4SFmode);
@@ -1320,7 +1331,7 @@ (define_expand "fmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1345,7 +1356,7 @@ (define_expand "fmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1370,7 +1381,7 @@ (define_expand "fnmav2sf4"
 	  (match_operand:V2SF   2 "nonimmediate_operand")
 	  (match_operand:V2SF   3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1396,7 +1407,7 @@ (define_expand "fnmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1422,7 +1433,7 @@ (define_expand "fnmsv2sf4"
 (define_expand "fix_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1438,7 +1449,7 @@ (define_expand "fix_truncv2sfv2si2"
 (define_expand "fixuns_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(unsigned_fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1463,7 +1474,7 @@ (define_insn "mmx_fix_truncv2sfv2si2"
 (define_expand "floatv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1479,7 +1490,7 @@ (define_expand "floatv2siv2sf2"
 (define_expand "floatunsv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(unsigned_float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1756,7 +1767,7 @@ (define_expand "vec_initv2sfsf"
 (define_expand "nearbyintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1772,7 +1783,7 @@ (define_expand "nearbyintv2sf2"
 (define_expand "rintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1788,8 +1799,8 @@ (define_expand "rintv2sf2"
 (define_expand "lrintv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1806,7 +1817,7 @@ (define_expand "ceilv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1822,8 +1833,8 @@ (define_expand "ceilv2sf2"
 (define_expand "lceilv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1840,7 +1851,7 @@ (define_expand "floorv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1856,8 +1867,8 @@ (define_expand "floorv2sf2"
 (define_expand "lfloorv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1874,7 +1885,7 @@ (define_expand "btruncv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1891,7 +1902,7 @@ (define_expand "roundv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1907,8 +1918,8 @@ (define_expand "roundv2sf2"
 (define_expand "lroundv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-07-30 20:12 [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832] Uros Bizjak
@ 2023-07-31  9:40 ` Richard Biener
  2023-07-31 10:13   ` Uros Bizjak
  2023-08-07 15:59   ` Uros Bizjak
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Biener @ 2023-07-31  9:40 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

On Sun, 30 Jul 2023, Uros Bizjak wrote:

> Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> named patterns in order to avoid generation of partial vector V4SFmode
> trapping instructions.
> 
> The new option is enabled by default, because even with sanitization,
> a small but consistent speed up of 2 to 3% with Polyhedron capacita
> benchmark can be achieved vs. scalar code.
> 
> Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> vs. scalar code.  This is what clang does by default, as it defaults
> to -fno-trapping-math.

I like the new option, note you lack invoke.texi documentation where
I'd also elaborate a bit on the interaction with -fno-trapping-math
and the possible performance impact when NaNs or denormals leak
into the upper halves and cross-reference -mdaz-ftz.
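
For illustration only, not part of the patch: assuming GNU C vector
extensions (the helper name is hypothetical), manually sanitizing the
upper half before a 128-bit operation could look like this:

  typedef float __attribute__((vector_size (8)))  v2sf;
  typedef float __attribute__((vector_size (16))) v4sf;

  /* Widen a v2sf to a full v4sf with the unused upper lanes explicitly
     zeroed, so that stale NaNs or denormals cannot take part in the
     128-bit operation.  */
  static inline v4sf
  sanitize_upper (v2sf x)
  {
    return (v4sf) { x[0], x[1], 0.0f, 0.0f };
  }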

Thanks,
Richard.

>     PR target/110832
> 
> gcc/ChangeLog:
> 
>     * config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
>     * config/i386/i386.opt (mmmxfp-with-sse): New option.
>     * config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
>     upper part of V2SFmode register with -fno-trapping-math.
>     (<plusminusmult:insn>v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
>     (divv2sf3): Ditto.
>     (<smaxmin:code>v2sf3): Ditto.
>     (sqrtv2sf2): Ditto.
>     (*mmx_haddv2sf3_low): Ditto.
>     (*mmx_hsubv2sf3_low): Ditto.
>     (vec_addsubv2sf3): Ditto.
>     (vec_cmpv2sfv2si): Ditto.
>     (vcond<V2FI:mode>v2sf): Ditto.
>     (fmav2sf4): Ditto.
>     (fmsv2sf4): Ditto.
>     (fnmav2sf4): Ditto.
>     (fnmsv2sf4): Ditto.
>     (fix_truncv2sfv2si2): Ditto.
>     (fixuns_truncv2sfv2si2): Ditto.
>     (floatv2siv2sf2): Ditto.
>     (floatunsv2siv2sf2): Ditto.
>     (nearbyintv2sf2): Ditto.
>     (rintv2sf2): Ditto.
>     (lrintv2sfv2si2): Ditto.
>     (ceilv2sf2): Ditto.
>     (lceilv2sfv2si2): Ditto.
>     (floorv2sf2): Ditto.
>     (lfloorv2sfv2si2): Ditto.
>     (btruncv2sf2): Ditto.
>     (roundv2sf2): Ditto.
>     (lroundv2sfv2si2): Ditto.
> 
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> 
> Uros.
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-07-31  9:40 ` Richard Biener
@ 2023-07-31 10:13   ` Uros Bizjak
  2023-08-07 15:59   ` Uros Bizjak
  1 sibling, 0 replies; 8+ messages in thread
From: Uros Bizjak @ 2023-07-31 10:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option, note you lack invoke.texi documentation where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact when NaNs or denormals leak
> into the upper halves and cross-reference -mdaz-ftz.

Yes, this is my plan (the lack of documentation is due to the RFC status
of the patch). OTOH, Hongtao has some other ideas in the PR, so I'll hold
off on the patch a bit.

Thanks,
Uros.

> Thanks,
> Richard.
>
> >     PR target/110832
> >
> > gcc/ChangeLog:
> >
> >     * config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
> >     * config/i386/i386.opt (mmmxfp-with-sse): New option.
> >     * config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
> >     upper part of V2SFmode register with -fno-trapping-math.
> >     (<plusminusmult:insn>v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
> >     (divv2sf3): Ditto.
> >     (<smaxmin:code>v2sf3): Ditto.
> >     (sqrtv2sf2): Ditto.
> >     (*mmx_haddv2sf3_low): Ditto.
> >     (*mmx_hsubv2sf3_low): Ditto.
> >     (vec_addsubv2sf3): Ditto.
> >     (vec_cmpv2sfv2si): Ditto.
> >     (vcond<V2FI:mode>v2sf): Ditto.
> >     (fmav2sf4): Ditto.
> >     (fmsv2sf4): Ditto.
> >     (fnmav2sf4): Ditto.
> >     (fnmsv2sf4): Ditto.
> >     (fix_truncv2sfv2si2): Ditto.
> >     (fixuns_truncv2sfv2si2): Ditto.
> >     (floatv2siv2sf2): Ditto.
> >     (floatunsv2siv2sf2): Ditto.
> >     (nearbyintv2sf2): Ditto.
> >     (rintv2sf2): Ditto.
> >     (lrintv2sfv2si2): Ditto.
> >     (ceilv2sf2): Ditto.
> >     (lceilv2sfv2si2): Ditto.
> >     (floorv2sf2): Ditto.
> >     (lfloorv2sfv2si2): Ditto.
> >     (btruncv2sf2): Ditto.
> >     (roundv2sf2): Ditto.
> >     (lroundv2sfv2si2): Ditto.
> >
> > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >
> > Uros.
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-07-31  9:40 ` Richard Biener
  2023-07-31 10:13   ` Uros Bizjak
@ 2023-08-07 15:59   ` Uros Bizjak
  2023-08-08  8:07     ` Richard Biener
  1 sibling, 1 reply; 8+ messages in thread
From: Uros Bizjak @ 2023-08-07 15:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

[-- Attachment #1: Type: text/plain, Size: 1106 bytes --]

On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option, note you lack invoke.texi documentation where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact when NaNs or denormals leak
> into the upper halves and cross-reference -mdaz-ftz.

The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
option. It is written in a way to also cover half-float vectors. WDYT?

Uros.

[-- Attachment #2: d.diff.txt --]
[-- Type: text/plain, Size: 1788 bytes --]

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fa765d5a0dd..99093172abe 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1417,6 +1417,7 @@ See RS/6000 and PowerPC Options.
 -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait
 -mrecip  -mrecip=@var{opt}
 -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt}
+-mmmxfp-with-sse
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx
 -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl
@@ -33708,6 +33709,22 @@ This option instructs GCC to use 128-bit AVX instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex -mmmxfp-with-sse
+@item -mmmxfp-with-sse
+This option enables GCC to generate trapping floating-point operations on
+partial vectors, where vector elements reside in the low part of the 128-bit
+SSE register.  Unless @option{-fno-trapping-math} is specified, the compiler
+guarantees correct trapping behavior by sanitizing all input operands to
+have zeroes in the upper part of the vector register.  Note that by using
+built-in functions or inline assembly with partial vector arguments, NaNs,
+denormal or invalid values can leak into the upper part of the vector,
+causing possible performance issues when @option{-fno-trapping-math} is in
+effect.  These issues can be mitigated by manually sanitizing the upper part
+of the partial vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be
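
As a side note to the -mdaz-ftz cross-reference in the text above: a
runtime equivalent of what that option arranges at program startup is
setting the DAZ and FTZ bits in the MXCSR register through the standard
SSE intrinsics.  A minimal sketch, not part of the patch (the function
name is hypothetical):

  #include <immintrin.h>

  void
  enable_daz_ftz (void)
  {
    /* Denormals-are-zero: treat denormal inputs as zero.  */
    _MM_SET_DENORMALS_ZERO_MODE (_MM_DENORMALS_ZERO_ON);
    /* Flush-to-zero: flush denormal results to zero.  */
    _MM_SET_FLUSH_ZERO_MODE (_MM_FLUSH_ZERO_ON);
  }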


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-08-07 15:59   ` Uros Bizjak
@ 2023-08-08  8:07     ` Richard Biener
  2023-08-08  9:06       ` Uros Bizjak
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2023-08-08  8:07 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

On Mon, 7 Aug 2023, Uros Bizjak wrote:

> On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> >
> > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > named patterns in order to avoid generation of partial vector V4SFmode
> > > trapping instructions.
> > >
> > > The new option is enabled by default, because even with sanitization,
> > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > benchmark can be achieved vs. scalar code.
> > >
> > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > vs. scalar code.  This is what clang does by default, as it defaults
> > > to -fno-trapping-math.
> >
> > I like the new option, note you lack invoke.texi documentation where
> > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > and the possible performance impact when NaNs or denormals leak
> > into the upper halves and cross-reference -mdaz-ftz.
> 
> The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> option. It is written in a way to also cover half-float vectors. WDYT?

"generate trapping floating-point operations"

I'd say "generate floating-point operations that might affect the
set of floating point status flags", the word "trapping" is IMHO 
misleading.
Not sure if "set of floating point status flags" is the correct term,
but it's what the C standard seems to refer to when talking about
things you get with fegetexceptflag.  feraiseexcept refers to
"floating-point exceptions".  Unfortunately the -fno-trapping-math
documentation is similarly confusing (and maybe even wrong, I read
it to conform to 'non-stop' IEEE arithmetic).
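
For reference, the two C99 <fenv.h> interfaces in question, as a minimal
sketch (the function name is just a placeholder):

  #include <fenv.h>

  void
  example (void)
  {
    fexcept_t saved;
    /* Query the floating-point status flags ...  */
    fegetexceptflag (&saved, FE_ALL_EXCEPT);
    /* ... as opposed to raising a floating-point exception.  */
    feraiseexcept (FE_INEXACT);
  }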

I'd maybe give an example of a FP operation that's _not_ affected
by the flag (copysign?).

Otherwise it looks OK to me.

Thanks,
Richard.


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-08-08  8:07     ` Richard Biener
@ 2023-08-08  9:06       ` Uros Bizjak
  2023-08-08 10:08         ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Uros Bizjak @ 2023-08-08  9:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

[-- Attachment #1: Type: text/plain, Size: 3841 bytes --]

On Tue, Aug 8, 2023 at 10:07 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 7 Aug 2023, Uros Bizjak wrote:
>
> > On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > >
> > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > trapping instructions.
> > > >
> > > > The new option is enabled by default, because even with sanitization,
> > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > benchmark can be achieved vs. scalar code.
> > > >
> > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > to -fno-trapping-math.
> > >
> > > I like the new option, note you lack invoke.texi documentation where
> > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > and the possible performance impact when NaNs or denormals leak
> > > into the upper halves and cross-reference -mdaz-ftz.
> >
> > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > option. It is written in a way to also cover half-float vectors. WDYT?
>
> "generate trapping floating-point operations"
>
> I'd say "generate floating-point operations that might affect the
> set of floating point status flags", the word "trapping" is IMHO
> misleading.
> Not sure if "set of floating point status flags" is the correct term,
> but it's what the C standard seems to refer to when talking about
> things you get with fegetexceptflag.  feraiseexcept refers to
> "floating-point exceptions".  Unfortunately the -fno-trapping-math
> documentation is similarly confusing (and maybe even wrong, I read
> it to conform to 'non-stop' IEEE arithmetic).

Thanks for suggesting the right terminology. I think that:

+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register.  Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register.  Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect.  These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.

now explains in adequate detail what the option does. IMO, the
"floating-point operations that might affect the set of floating point
status flags" correctly identifies affected operations, so an example,
as suggested below, is not necessary.

> I'd maybe give an example of a FP operation that's _not_ affected
> by the flag (copysign?).

Please note that I have renamed the option to "-mpartial-vector-math"
with a short target-specific description:

+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on
partial vectors

which I think summarises the option (without the word "trapping"). The
same approach will be taken for Float16 operations, so it is not
specific to MMX vectors.
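
For instance, the Float16 analogue would be a partial vector like the
one below (a hypothetical sketch, assuming AVX512FP16 support for the
actual arithmetic):

  typedef _Float16 __attribute__((vector_size (8))) v4hf;

  v4hf
  test (v4hf a, v4hf b)
  {
    /* Four half floats in the low 64 bits of an SSE register; the same
       upper-half sanitization question applies as for V2SF.  */
    return a + b;
  }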

> Otherwise it looks OK to me.

Thanks, I have attached the RFC V2 patch; I plan to submit a formal
patch later today.

Uros.

[-- Attachment #2: pr110832-v2.diff.txt --]
[-- Type: text/plain, Size: 14654 bytes --]

diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..8d9a1ae93f3 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -632,6 +632,10 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial vectors
+
 mmove-max=
 Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
 Maximum number of bits that can be moved from memory to memory efficiently.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index b49554e9b8f..95f7a0113e7 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -595,7 +595,18 @@ (define_expand "movq_<mode>_to_sse"
 	  (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
 	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
@@ -648,7 +659,7 @@ (define_expand "<insn>v2sf3"
 	(plusminusmult:V2SF
 	  (match_operand:V2SF 1 "nonimmediate_operand")
 	  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -726,7 +737,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
 	(div:V2SF (match_operand:V2SF 1 "register_operand")
 		  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -748,7 +759,7 @@ (define_expand "<code>v2sf3"
         (smaxmin:V2SF
 	  (match_operand:V2SF 1 "register_operand")
 	  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -850,7 +861,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -931,7 +942,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(match_operand:SI 3 "const_0_to_1_operand")]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && INTVAL (operands[2]) != INTVAL (operands[3])
    && ix86_pre_reload_split ()"
   "#"
@@ -977,7 +988,7 @@ (define_insn_and_split "*mmx_hsubv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(const_int 1)]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -1039,7 +1050,7 @@ (define_expand "vec_addsubv2sf3"
 	    (match_operand:V2SF 2 "nonimmediate_operand"))
 	  (plus:V2SF (match_dup 1) (match_dup 2))
 	  (const_int 1)))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -1102,7 +1113,7 @@ (define_expand "vec_cmpv2sfv2si"
 	(match_operator:V2SI 1 ""
 	  [(match_operand:V2SF 2 "nonimmediate_operand")
 	   (match_operand:V2SF 3 "nonimmediate_operand")]))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[4];
   ops[3] = gen_reg_rtx (V4SFmode);
@@ -1128,7 +1139,7 @@ (define_expand "vcond<mode>v2sf"
 	     (match_operand:V2SF 5 "nonimmediate_operand")])
 	  (match_operand:V2FI 1 "general_operand")
 	  (match_operand:V2FI 2 "general_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[6];
   ops[5] = gen_reg_rtx (V4SFmode);
@@ -1318,7 +1329,7 @@ (define_expand "fmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1343,7 +1354,7 @@ (define_expand "fmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1368,7 +1379,7 @@ (define_expand "fnmav2sf4"
 	  (match_operand:V2SF   2 "nonimmediate_operand")
 	  (match_operand:V2SF   3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1394,7 +1405,7 @@ (define_expand "fnmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1420,7 +1431,7 @@ (define_expand "fnmsv2sf4"
 (define_expand "fix_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1436,7 +1447,7 @@ (define_expand "fix_truncv2sfv2si2"
 (define_expand "fixuns_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(unsigned_fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1461,7 +1472,7 @@ (define_insn "mmx_fix_truncv2sfv2si2"
 (define_expand "floatv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1477,7 +1488,7 @@ (define_expand "floatv2siv2sf2"
 (define_expand "floatunsv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(unsigned_float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1754,7 +1765,7 @@ (define_expand "vec_initv2sfsf"
 (define_expand "nearbyintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1770,7 +1781,7 @@ (define_expand "nearbyintv2sf2"
 (define_expand "rintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1786,8 +1797,8 @@ (define_expand "rintv2sf2"
 (define_expand "lrintv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1804,7 +1815,7 @@ (define_expand "ceilv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1820,8 +1831,8 @@ (define_expand "ceilv2sf2"
 (define_expand "lceilv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1838,7 +1849,7 @@ (define_expand "floorv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1854,8 +1865,8 @@ (define_expand "floorv2sf2"
 (define_expand "lfloorv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1872,7 +1883,7 @@ (define_expand "btruncv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1889,7 +1900,7 @@ (define_expand "roundv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1905,8 +1916,8 @@ (define_expand "roundv2sf2"
 (define_expand "lroundv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
- "TARGET_SSE4_1 && !flag_trapping_math
-  && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 674f956f4b8..f5081c0cfb9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1419,6 +1419,7 @@ See RS/6000 and PowerPC Options.
 -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait
 -mrecip  -mrecip=@var{opt}
 -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt}
+-mpartial-vector-math
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx
 -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl
@@ -33754,6 +33755,23 @@ This option instructs GCC to use 128-bit AVX instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register.  Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register.  Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect.  These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-1.c b/gcc/testsuite/gcc.target/i386/pr110832-1.c
new file mode 100644
index 00000000000..3df22e3b5a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-1.c
@@ -0,0 +1,12 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mno-partial-vector-math" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler-not "addps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-2.c b/gcc/testsuite/gcc.target/i386/pr110832-2.c
new file mode 100644
index 00000000000..4d16488b4fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-2.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -ftrapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-times "\\*vec_concatv4sf_0" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-3.c b/gcc/testsuite/gcc.target/i386/pr110832-3.c
new file mode 100644
index 00000000000..02cb4fc8100
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-3.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -fno-trapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-not "\\*vec_concatv4sf_0" } } */


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-08-08  9:06       ` Uros Bizjak
@ 2023-08-08 10:08         ` Richard Biener
  2023-08-08 11:03           ` Uros Bizjak
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2023-08-08 10:08 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

On Tue, 8 Aug 2023, Uros Bizjak wrote:

> On Tue, Aug 8, 2023 at 10:07 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Mon, 7 Aug 2023, Uros Bizjak wrote:
> >
> > > On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > > >
> > > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > > trapping instructions.
> > > > >
> > > > > The new option is enabled by default, because even with sanitization,
> > > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > > benchmark can be achieved vs. scalar code.
> > > > >
> > > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > > to -fno-trapping-math.
> > > >
> > > > I like the new option, note you lack invoke.texi documentation where
> > > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > > and the possible performance impact when NaNs or denormals leak
> > > > into the upper halves and cross-reference -mdaz-ftz.
> > >
> > > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > > option. It is written in a way to also cover half-float vectors. WDYT?
> >
> > "generate trapping floating-point operations"
> >
> > I'd say "generate floating-point operations that might affect the
> > set of floating point status flags", the word "trapping" is IMHO
> > misleading.
> > Not sure if "set of floating point status flags" is the correct term,
> > but it's what the C standard seems to refer to when talking about
> > things you get with fegetexceptflag.  feraiseexcept refers to
> > "floating-point exceptions".  Unfortunately the -fno-trapping-math
> > documentation is similarly confusing (and maybe even wrong, I read
> > it to conform to 'non-stop' IEEE arithmetic).
> 
> Thanks for suggesting the right terminology. I think that:
> 
> +@opindex mpartial-vector-math
> +@item -mpartial-vector-math
> +This option enables GCC to generate floating-point operations that might
> +affect the set of floating point status flags on partial vectors, where
> +vector elements reside in the low part of the 128-bit SSE register.  Unless
> +@option{-fno-trapping-math} is specified, the compiler guarantees correct
> +behavior by sanitizing all input operands to have zeroes in the unused
> +upper part of the vector register.  Note that by using built-in functions
> +or inline assembly with partial vector arguments, NaNs, denormal or invalid
> +values can leak into the upper part of the vector, causing possible
> +performance issues when @option{-fno-trapping-math} is in effect.  These
> +issues can be mitigated by manually sanitizing the upper part of the partial
> +vector argument register or by using @option{-mdaz-ftz} to set
> +denormals-are-zero (DAZ) flag in the MXCSR register.
> 
> now explains in adequate detail what the option does. IMO, the
> "floating-point operations that might affect the set of floating point
> status flags" correctly identifies affected operations, so an example,
> as suggested below, is not necessary.
> 
> > I'd maybe give an example of a FP operation that's _not_ affected
> > by the flag (copysign?).
> 
> Please note that I have renamed the option to "-mpartial-vector-math"
> with a short target-specific description:

Ah yes, that's a less confusing name but then it might suggest
that -mno-partial-vector-math would disable all of that, including
integer ops, not only the patterns possibly affecting the exception
flags?  Note I don't have a better suggestion and this is clearly
better than the one mentioning mmx.

> +partial-vector-math
> +Target Var(ix86_partial_vec_math) Init(1)
> +Enable floating-point status flags setting SSE vector operations on
> partial vectors
> 
> which I think summarises the option (without the word "trapping"). The
> same approach will be taken for Float16 operations, so the approach is
> not specific to MMX vectors.
> 
> > Otherwise it looks OK to me.
> 
> Thanks, I have attached the RFC V2 patch; I plan to submit a formal
> patch later today.

Thanks.  With AVX512VL there might also be the option to use
a mask (with the penalty of a very much larger instruction encoding).
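
For illustration, the masked variant could be written with AVX512VL
intrinsics roughly as follows (a sketch only, function name hypothetical):

  #include <immintrin.h>

  __m128
  masked_v2sf_add (__m128 a, __m128 b)
  {
    /* Only the low two lanes participate; masked-off lanes are zeroed
       and raise no FP exceptions, at the cost of the longer EVEX
       encoding.  */
    return _mm_maskz_add_ps (0x3, a, b);
  }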

Richard.


* Re: [RFC PATCH] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
  2023-08-08 10:08         ` Richard Biener
@ 2023-08-08 11:03           ` Uros Bizjak
  0 siblings, 0 replies; 8+ messages in thread
From: Uros Bizjak @ 2023-08-08 11:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jan Hubicka, Hongtao Liu

On Tue, Aug 8, 2023 at 12:08 PM Richard Biener <rguenther@suse.de> wrote:

> > > > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > > > trapping instructions.
> > > > > >
> > > > > > The new option is enabled by default, because even with sanitization,
> > > > > > a small but consistent speed up of 2 to 3% with Polyhedron capacita
> > > > > > benchmark can be achieved vs. scalar code.
> > > > > >
> > > > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > > > to -fno-trapping-math.
> > > > >
> > > > > I like the new option, note you lack invoke.texi documentation where
> > > > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > > > and the possible performance impact when NaNs or denormals leak
> > > > > into the upper halves and cross-reference -mdaz-ftz.
> > > >
> > > > The attached doc patch is invoke.texi entry for -mmmxfp-with-sse
> > > > option. It is written in a way to also cover half-float vectors. WDYT?
> > >
> > > "generate trapping floating-point operations"
> > >
> > > I'd say "generate floating-point operations that might affect the
> > > set of floating point status flags", the word "trapping" is IMHO
> > > misleading.
> > > Not sure if "set of floating point status flags" is the correct term,
> > > but it's what the C standard seems to refer to when talking about
> > > things you get with fegetexceptflag.  feraiseexcept refers to
> > > "floating-point exceptions".  Unfortunately the -fno-trapping-math
> > > documentation is similarly confusing (and maybe even wrong, I read
> > > it to conform to 'non-stop' IEEE arithmetic).
> >
> > Thanks for suggesting the right terminology. I think that:
> >
> > +@opindex mpartial-vector-math
> > +@item -mpartial-vector-math
> > +This option enables GCC to generate floating-point operations that might
> > +affect the set of floating point status flags on partial vectors, where
> > +vector elements reside in the low part of the 128-bit SSE register.  Unless
> > +@option{-fno-trapping-math} is specified, the compiler guarantees correct
> > +behavior by sanitizing all input operands to have zeroes in the unused
> > +upper part of the vector register.  Note that by using built-in functions
> > +or inline assembly with partial vector arguments, NaNs, denormal or invalid
> > +values can leak into the upper part of the vector, causing possible
> > +performance issues when @option{-fno-trapping-math} is in effect.  These
> > +issues can be mitigated by manually sanitizing the upper part of the partial
> > +vector argument register or by using @option{-mdaz-ftz} to set
> > +denormals-are-zero (DAZ) flag in the MXCSR register.
> >
> > Now explain in adequate detail what the option does. IMO, the
> > "floating-point operations that might affect the set of floating point
> > status flags" correctly identifies affected operations, so an example,
> > as suggested below, is not necessary.
> >
> > > I'd maybe give an example of a FP operation that's _not_ affected
> > > by the flag (copysign?).
> >
> > Please note that I have renamed the option to "-mpartial-vector-math"
> > with a short target-specific description:
>
> Ah yes, that's a less confusing name but then it might suggest
> that -mno-partial-vector-math would disable all of that, including
> integer ops, not only the patterns possibly affecting the exception
> flags?  Note I don't have a better suggestion and this is clearly
> better than the one mentioning mmx.

You are right, I think I'll rename the option to -mpartial-vector-fp-math.

Thanks,
Uros.

