public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 0/3][x86] Enable pass_late_combine for x86.
@ 2024-06-28  5:27 liuhongt
  2024-06-28  5:27 ` [PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables liuhongt
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: liuhongt @ 2024-06-28  5:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak

Because of the issue described in PR115610, late_combine is disabled by
default.The series try to solve the regressions and enable late_combine.
There're 4 regressions observed.

1. The first one is related to pass_stv2, because late_combine will restore
transformation did in the pass. Move the pass after pass_late_combine can
solve the issue.

2. The second one is related to pass_rpad, both pre_reload and post_reload
late_combine would restore the transformation. So besides moving pass_rpad
after pre_reload late_combine, target_insn_cost is defined to prevent
post_reload pass_late_combine to revert the optimziation did in pass_rpad.

3. The third one is related to avx512 kmask, lshirt + zero_extend are combined
into *<insn>si3_zext which doesn't support k alternative, and an extra move
between GPR and KMASK and regressed
gcc.target/i386/zero_extendkmask.c scan-assembler-not (?n)shr[bwl],
the solution is extending the pattern to ?k alternative just like what we did
before for other patterns.

4. The fourth one is fake, it's because pass_late_combine generates better code but
break scan assembly.
.i.e
Under 32-bit target, gcc used to generate broadcast from stack and
then do the real operation.
After enabling flate_combine, they're combined into embeded broadcast
operations.

Tested with SPEC2017, flate_combine reduces codesize by ~0.6%, which means
there're lots of small improvements.
Bootstrapped and regtested on x86_64-pc-linu-gnu{-m32,}.
Ok for trunk?


liuhongt (3):
  [avx512 testsuite] Define mask as extern instead of uninitialized
    local variables.
  Extend lshifrtsi3_1_zext to ?k alternative.
  [x86] Enable flate-combine.

 gcc/config/i386/i386-features.cc              | 16 +++++++----
 gcc/config/i386/i386-options.cc               |  4 ---
 gcc/config/i386/i386-passes.def               |  4 +--
 gcc/config/i386/i386-protos.h                 |  1 +
 gcc/config/i386/i386.cc                       | 18 ++++++++++++
 gcc/config/i386/i386.md                       | 19 +++++++++----
 gcc/config/i386/sse.md                        | 28 +++++++++++++++++++
 .../gcc.target/i386/avx512bitalg-vpopcntb.c   |  3 +-
 .../gcc.target/i386/avx512bitalg-vpopcntbvl.c |  4 +--
 .../gcc.target/i386/avx512bitalg-vpopcntw.c   |  2 +-
 .../gcc.target/i386/avx512bitalg-vpopcntwvl.c |  4 +--
 .../i386/avx512f-broadcast-pr87767-1.c        |  4 +--
 .../i386/avx512f-broadcast-pr87767-5.c        |  1 -
 .../gcc.target/i386/avx512f-fmadd-sf-zmm-7.c  |  2 +-
 .../gcc.target/i386/avx512f-fmsub-sf-zmm-7.c  |  2 +-
 .../gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c |  2 +-
 .../gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c |  2 +-
 .../i386/avx512vl-broadcast-pr87767-1.c       |  4 +--
 .../i386/avx512vl-broadcast-pr87767-5.c       |  2 --
 .../i386/avx512vpopcntdq-vpopcntd.c           |  5 ++--
 .../i386/avx512vpopcntdq-vpopcntq.c           |  2 +-
 gcc/testsuite/gcc.target/i386/pr91333.c       |  2 +-
 .../gcc.target/i386/vect-strided-4.c          |  2 +-
 23 files changed, 93 insertions(+), 40 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables.
  2024-06-28  5:27 [PATCH 0/3][x86] Enable pass_late_combine for x86 liuhongt
@ 2024-06-28  5:27 ` liuhongt
  2024-06-28  5:27 ` [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative liuhongt
  2024-06-28  5:27 ` [PATCH 3/3] [x86] Enable flate-combine liuhongt
  2 siblings, 0 replies; 6+ messages in thread
From: liuhongt @ 2024-06-28  5:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak

The testcases are supposed to scan for vpopcnt{b,w,d,q} operations
with k mask, but mask is defined as uninitialized local variable which
will be set as 0 at rtl expand phase.
And it's further simplified off by late_combine which caused scan assembly failure.
Move the definition of mask outside to make the testcases more stable.

gcc/testsuite/ChangeLog:

	PR target/115610
	* gcc.target/i386/avx512bitalg-vpopcntb.c: Define mask as
	extern instead of uninitialized local variables.
	* gcc.target/i386/avx512bitalg-vpopcntbvl.c: Ditto.
	* gcc.target/i386/avx512bitalg-vpopcntw.c: Ditto.
	* gcc.target/i386/avx512bitalg-vpopcntwvl.c: Ditto.
	* gcc.target/i386/avx512vpopcntdq-vpopcntd.c: Ditto.
	* gcc.target/i386/avx512vpopcntdq-vpopcntq.c: Ditto.
---
 gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c    | 3 +--
 gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c  | 4 ++--
 gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c    | 2 +-
 gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c  | 4 ++--
 gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c | 5 +++--
 gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c | 2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
index 44b82c0519d..66d24107c26 100644
--- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
+++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
@@ -7,10 +7,9 @@
 #include <x86intrin.h>
 
 extern __m512i z, z1;
-
+extern __mmask16 msk;
 int foo ()
 {
-  __mmask16 msk;
   __m512i c = _mm512_popcnt_epi8 (z);
   asm volatile ("" : "+v" (c));
   c = _mm512_mask_popcnt_epi8 (z1, msk, z);
diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
index 8c2dfaba9c6..8ab05653f7c 100644
--- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
+++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
@@ -11,11 +11,11 @@
 
 extern __m256i y, y_1;
 extern __m128i x, x_1;
+extern __mmask32 msk32;
+extern __mmask16 msk16;
 
 int foo ()
 {
-  __mmask32 msk32;
-  __mmask16 msk16;
   __m256i c256 = _mm256_popcnt_epi8 (y);
   asm volatile ("" : "+v" (c256));
   c256 = _mm256_mask_popcnt_epi8 (y_1, msk32, y);
diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
index 2ef8589f6c1..c741bf48a51 100644
--- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
+++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
@@ -7,10 +7,10 @@
 #include <x86intrin.h>
 
 extern __m512i z, z1;
+extern __mmask16 msk;
 
 int foo ()
 {
-  __mmask16 msk;
   __m512i c = _mm512_popcnt_epi16 (z);
   asm volatile ("" : "+v" (c));
   c = _mm512_mask_popcnt_epi16 (z1, msk, z);
diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
index c976461b12e..79bb3c31e85 100644
--- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
+++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
@@ -11,11 +11,11 @@
 
 extern __m256i y, y_1;
 extern __m128i x, x_1;
+extern __mmask16 msk16;
+extern __mmask8 msk8;
 
 int foo ()
 {
-  __mmask16 msk16;
-  __mmask8 msk8;
   __m256i c256 = _mm256_popcnt_epi16 (y);
   asm volatile ("" : "+v" (c256));
   c256 = _mm256_mask_popcnt_epi16 (y_1, msk16, y);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
index b4d82f97032..776a4753d8e 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
@@ -15,11 +15,12 @@
 extern __m128i x, x_1;
 extern __m256i y, y_1;
 extern __m512i z, z_1;
+extern  __mmask16 msk;
+extern  __mmask8 msk8;
+
 
 int foo ()
 {
-  __mmask16 msk;
-  __mmask8 msk8;
   __m128i a = _mm_popcnt_epi32 (x);
   asm volatile ("" : "+v" (a));
   a = _mm_mask_popcnt_epi32 (x_1, msk8, x);
diff --git a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
index e87d6c999b6..c6314ac5deb 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
@@ -15,10 +15,10 @@
 extern __m128i x, x_1;
 extern __m256i y, y_1;
 extern __m512i z, z_1;
+extern __mmask8 msk; 
 
 int foo ()
 {
-  __mmask8 msk; 
   __m128i a = _mm_popcnt_epi64 (x);
   asm volatile ("" : "+v" (a));
   a = _mm_mask_popcnt_epi64 (x_1, msk, x);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative.
  2024-06-28  5:27 [PATCH 0/3][x86] Enable pass_late_combine for x86 liuhongt
  2024-06-28  5:27 ` [PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables liuhongt
@ 2024-06-28  5:27 ` liuhongt
  2024-06-28  5:59   ` Uros Bizjak
  2024-06-28  5:27 ` [PATCH 3/3] [x86] Enable flate-combine liuhongt
  2 siblings, 1 reply; 6+ messages in thread
From: liuhongt @ 2024-06-28  5:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak

late_combine will combine lshift + zero into *lshifrtsi3_1_zext which
cause extra mov between gpr and kmask, add ?k to the pattern.

gcc/ChangeLog:

	PR target/115610
	* config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
	enable it only for lshiftrt and under avx512bw.
	* config/i386/sse.md (*klshrsi3_1_zext): New define_insn, and
	add corresponding define_split after it.
---
 gcc/config/i386/i386.md | 19 +++++++++++++------
 gcc/config/i386/sse.md  | 28 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index fd48e764469..57a10c1af48 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16836,10 +16836,10 @@ (define_insn "*bmi2_<insn>si3_1_zext"
    (set_attr "mode" "SI")])
 
 (define_insn "*<insn>si3_1_zext"
-  [(set (match_operand:DI 0 "register_operand" "=r,r,r")
+  [(set (match_operand:DI 0 "register_operand" "=r,r,r,?k")
 	(zero_extend:DI
-	  (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm")
-			  (match_operand:QI 2 "nonmemory_operand" "cI,r,cI"))))
+	  (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm,k")
+			  (match_operand:QI 2 "nonmemory_operand" "cI,r,cI,I"))))
    (clobber (reg:CC FLAGS_REG))]
   "TARGET_64BIT
    && ix86_binary_operator_ok (<CODE>, SImode, operands, TARGET_APX_NDD)"
@@ -16850,6 +16850,8 @@ (define_insn "*<insn>si3_1_zext"
     case TYPE_ISHIFTX:
       return "#";
 
+    case TYPE_MSKLOG:
+      return "#";
     default:
       if (operands[2] == const1_rtx
 	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun))
@@ -16860,8 +16862,8 @@ (define_insn "*<insn>si3_1_zext"
 		       : "<shift>{l}\t{%2, %k0|%k0, %2}";
     }
 }
-  [(set_attr "isa" "*,bmi2,apx_ndd")
-   (set_attr "type" "ishift,ishiftx,ishift")
+  [(set_attr "isa" "*,bmi2,apx_ndd,avx512bw")
+   (set_attr "type" "ishift,ishiftx,ishift,msklog")
    (set (attr "length_immediate")
      (if_then_else
        (and (match_operand 2 "const1_operand")
@@ -16869,7 +16871,12 @@ (define_insn "*<insn>si3_1_zext"
 		 (match_test "optimize_function_for_size_p (cfun)")))
        (const_string "0")
        (const_string "*")))
-   (set_attr "mode" "SI")])
+   (set_attr "mode" "SI")
+   (set (attr "enabled")
+   	(if_then_else
+	  (eq_attr "alternative" "3")
+	  (symbol_ref "<CODE> == LSHIFTRT && TARGET_AVX512BW")
+	  (const_string "*")))])
 
 ;; Convert shift to the shiftx pattern to avoid flags dependency.
 (define_split
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 0be2dcd8891..20665a6f097 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2179,6 +2179,34 @@ (define_split
 	     (match_dup 2)))
       (unspec [(const_int 0)] UNSPEC_MASKOP)])])
 
+(define_insn "*klshrsi3_1_zext"
+  [(set (match_operand:DI 0 "register_operand" "=k")
+	(zero_extend:DI
+	  (lshiftrt:SI (match_operand:SI 1 "register_operand" "k")
+		       (match_operand 2 "const_0_to_31_operand" "I"))))
+      (unspec [(const_int 0)] UNSPEC_MASKOP)]
+  "TARGET_AVX512BW"
+  "kshiftrd\t{%2, %1, %0|%0, %1, %2}"
+    [(set_attr "type" "msklog")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "SI")])
+
+(define_split
+  [(set (match_operand:DI 0 "mask_reg_operand")
+	(zero_extend:DI
+	  (lshiftrt:SI
+	    (match_operand:SI 1 "mask_reg_operand")
+	    (match_operand 2 "const_0_to_31_operand"))))
+    (clobber (reg:CC FLAGS_REG))]
+  "TARGET_AVX512BW && reload_completed"
+  [(parallel
+     [(set (match_dup 0)
+	   (zero_extend:DI
+	     (lshiftrt:SI
+	       (match_dup 1)
+	       (match_dup 2))))
+      (unspec [(const_int 0)] UNSPEC_MASKOP)])])
+
 (define_insn "ktest<mode>"
   [(set (reg:CC FLAGS_REG)
 	(unspec:CC
-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 3/3] [x86] Enable flate-combine.
  2024-06-28  5:27 [PATCH 0/3][x86] Enable pass_late_combine for x86 liuhongt
  2024-06-28  5:27 ` [PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables liuhongt
  2024-06-28  5:27 ` [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative liuhongt
@ 2024-06-28  5:27 ` liuhongt
  2024-06-28  6:03   ` Uros Bizjak
  2 siblings, 1 reply; 6+ messages in thread
From: liuhongt @ 2024-06-28  5:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: ubizjak

Move pass_stv2 and pass_rpad after pre_reload pass_late_combine, also
define target_insn_cost to prevent post_reload pass_late_combine to
revert the optimziation did in pass_rpad.

Adjust testcases since pass_late_combine generates better code but
break scan assembly.

.i.e
Under 32-bit target, gcc used to generate broadcast from stack and
then do the real operation.
After flate_combine, they're combined into embeded broadcast
operations.

gcc/ChangeLog:

	* config/i386/i386-features.cc (ix86_rpad_gate): New function.
	* config/i386/i386-options.cc (ix86_override_options_after_change):
	Don't disable flate_combine.
	* config/i386/i386-passes.def: Move pass_stv2 and pass_rpad
	after pre_reload pas_late_combine.
	* config/i386/i386-protos.h (ix86_rpad_gate): New declare.
	* config/i386/i386.cc (ix86_insn_cost): New function.
	(TARGET_INSN_COST): Define.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: Adjus
	testcase.
	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: Ditto.
	* gcc.target/i386/avx512f-fmadd-sf-zmm-7.c: Ditto.
	* gcc.target/i386/avx512f-fmsub-sf-zmm-7.c: Ditto.
	* gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c: Ditto.
	* gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c: Ditto.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Ditto.
	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Ditto.
	* gcc.target/i386/pr91333.c: Ditto.
	* gcc.target/i386/vect-strided-4.c: Ditto.
---
 gcc/config/i386/i386-features.cc               | 16 +++++++++++-----
 gcc/config/i386/i386-options.cc                |  4 ----
 gcc/config/i386/i386-passes.def                |  4 ++--
 gcc/config/i386/i386-protos.h                  |  1 +
 gcc/config/i386/i386.cc                        | 18 ++++++++++++++++++
 .../i386/avx512f-broadcast-pr87767-1.c         |  4 ++--
 .../i386/avx512f-broadcast-pr87767-5.c         |  1 -
 .../gcc.target/i386/avx512f-fmadd-sf-zmm-7.c   |  2 +-
 .../gcc.target/i386/avx512f-fmsub-sf-zmm-7.c   |  2 +-
 .../gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c  |  2 +-
 .../gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c  |  2 +-
 .../i386/avx512vl-broadcast-pr87767-1.c        |  4 ++--
 .../i386/avx512vl-broadcast-pr87767-5.c        |  2 --
 gcc/testsuite/gcc.target/i386/pr91333.c        |  2 +-
 gcc/testsuite/gcc.target/i386/vect-strided-4.c |  2 +-
 15 files changed, 42 insertions(+), 24 deletions(-)

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 607d1991460..fc224ed06b0 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -2995,6 +2995,16 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt)
   return new pass_insert_endbr_and_patchable_area (ctxt);
 }
 
+bool
+ix86_rpad_gate ()
+{
+  return (TARGET_AVX
+	  && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+	  && TARGET_SSE_MATH
+	  && optimize
+	  && optimize_function_for_speed_p (cfun));
+}
+
 /* At entry of the nearest common dominator for basic blocks with
    conversions/rcp/sqrt/rsqrt/round, generate a single
 	vxorps %xmmN, %xmmN, %xmmN
@@ -3232,11 +3242,7 @@ public:
   /* opt_pass methods: */
   bool gate (function *) final override
     {
-      return (TARGET_AVX
-	      && TARGET_SSE_PARTIAL_REG_DEPENDENCY
-	      && TARGET_SSE_MATH
-	      && optimize
-	      && optimize_function_for_speed_p (cfun));
+      return ix86_rpad_gate ();
     }
 
   unsigned int execute (function *) final override
diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 9c12d498928..1ef2c71a7a2 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -1944,10 +1944,6 @@ ix86_override_options_after_change (void)
 	flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
     }
 
-  /* Late combine tends to undo some of the effects of STV and RPAD,
-     by combining instructions back to their original form.  */
-  if (!OPTION_SET_P (flag_late_combine_instructions))
-    flag_late_combine_instructions = 0;
 }
 
 /* Clear stack slot assignments remembered from previous functions.
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index 7d96766f7b9..2d29f65da88 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -25,11 +25,11 @@ along with GCC; see the file COPYING3.  If not see
  */
 
   INSERT_PASS_AFTER (pass_postreload_cse, 1, pass_insert_vzeroupper);
-  INSERT_PASS_AFTER (pass_combine, 1, pass_stv, false /* timode_p */);
+  INSERT_PASS_AFTER (pass_late_combine, 1, pass_stv, false /* timode_p */);
   /* Run the 64-bit STV pass before the CSE pass so that CONST0_RTX and
      CONSTM1_RTX generated by the STV pass can be CSEed.  */
   INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
 
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
-  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_late_combine, 1, pass_remove_partial_avx_dependency);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 4f48dc0bf75..3dbd18dc70b 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -422,6 +422,7 @@ extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
 
 extern bool ix86_has_no_direct_extern_access;
+extern bool ix86_rpad_gate ();
 
 /* In i386-expand.cc.  */
 bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1f71ed04be6..9d2b7d1f174 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -21371,6 +21371,22 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
     }
 }
 
+static int
+ix86_insn_cost (rtx_insn *insn, bool speed)
+{
+  int insn_cost = 0;
+  /* Add extra cost to avoid post_reload late_combine revert
+     the optimization did in pass_rpad.  */
+  if (reload_completed
+      && ix86_rpad_gate ()
+      && recog_memoized (insn) >= 0
+      && get_attr_avx_partial_xmm_update (insn)
+      == AVX_PARTIAL_XMM_UPDATE_TRUE)
+    insn_cost += COSTS_N_INSNS (3);
+
+  return insn_cost + pattern_cost (PATTERN (insn), speed);
+}
+
 /* Compute a (partial) cost for rtx X.  Return true if the complete
    cost has been computed, and false if subexpressions should be
    scanned.  In either case, *TOTAL contains the cost result.  */
@@ -26514,6 +26530,8 @@ static const scoped_attribute_specs *const ix86_attribute_table[] =
 #define TARGET_MEMORY_MOVE_COST ix86_memory_move_cost
 #undef TARGET_RTX_COSTS
 #define TARGET_RTX_COSTS ix86_rtx_costs
+#undef TARGET_INSN_COST
+#define TARGET_INSN_COST ix86_insn_cost
 #undef TARGET_ADDRESS_COST
 #define TARGET_ADDRESS_COST ix86_address_cost
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
index 138dbb4c973..3a50749e610 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -3,8 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } } */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}"  2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
index d22251bc2a3..ea2f64861d0 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -3,7 +3,6 @@
 /* { dg-options "-O2 -mavx512f" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to8\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
index 8c117207efa..bbcc5ed0bec 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-mavx512f -O2" } */
-/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vfmadd...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
 
 #define type __m512
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
index cc705af8ea5..fc72dd6e557 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-mavx512f -O2" } */
-/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vfmsub...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
 
 #define type __m512
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
index db5c34678c0..342de482da8 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-mavx512f -O2" } */
-/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vfnmadd...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
 
 #define type __m512
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
index 7815251b82d..f56a3f8acc4 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-mavx512f -O2" } */
-/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vfnmsub...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
 
 #define type __m512
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
index e6df4d25f36..08898445be5 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -3,8 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
 /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
index ebdc3619d8e..c57a2e29767 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -3,8 +3,6 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to2\\\}" { target ia32 } } } */
-/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to4\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91333.c b/gcc/testsuite/gcc.target/i386/pr91333.c
index 2bdff871024..b4940b5c9ec 100644
--- a/gcc/testsuite/gcc.target/i386/pr91333.c
+++ b/gcc/testsuite/gcc.target/i386/pr91333.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { ! ia32 } } } */
 /* { dg-options "-O2 -mavx" } */
-/* { dg-final { scan-assembler-times "vmovapd|vmovsd" 3 } } */
+/* { dg-final { scan-assembler-times "vmovapd|vmovsd" 2 } } */
 
 static inline double g (double x){
   asm volatile ("" : "+x" (x));
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
index dd922926a2a..3fb9f07886e 100644
--- a/gcc/testsuite/gcc.target/i386/vect-strided-4.c
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
@@ -15,6 +15,6 @@ void foo (int * __restrict a, int * __restrict b, int *c, int s)
 
 /* Vectorization factor two, two two-element stores to a using movq
    and two two-element stores to b via pextrq/movhps of the high part.  */
-/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "movq\[\t ]+%xmm\[0-9]" 2 } } */
 /* { dg-final { scan-assembler-times "pextrq" 2 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "movhps" 2 { target { ia32 } } } } */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative.
  2024-06-28  5:27 ` [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative liuhongt
@ 2024-06-28  5:59   ` Uros Bizjak
  0 siblings, 0 replies; 6+ messages in thread
From: Uros Bizjak @ 2024-06-28  5:59 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches

On Fri, Jun 28, 2024 at 7:29 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> late_combine will combine lshift + zero into *lshifrtsi3_1_zext which
> cause extra mov between gpr and kmask, add ?k to the pattern.
>
> gcc/ChangeLog:
>
>         PR target/115610
>         * config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
>         enable it only for lshiftrt and under avx512bw.
>         * config/i386/sse.md (*klshrsi3_1_zext): New define_insn, and
>         add corresponding define_split after it.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.md | 19 +++++++++++++------
>  gcc/config/i386/sse.md  | 28 ++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index fd48e764469..57a10c1af48 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -16836,10 +16836,10 @@ (define_insn "*bmi2_<insn>si3_1_zext"
>     (set_attr "mode" "SI")])
>
>  (define_insn "*<insn>si3_1_zext"
> -  [(set (match_operand:DI 0 "register_operand" "=r,r,r")
> +  [(set (match_operand:DI 0 "register_operand" "=r,r,r,?k")
>         (zero_extend:DI
> -         (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm")
> -                         (match_operand:QI 2 "nonmemory_operand" "cI,r,cI"))))
> +         (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm,k")
> +                         (match_operand:QI 2 "nonmemory_operand" "cI,r,cI,I"))))
>     (clobber (reg:CC FLAGS_REG))]
>    "TARGET_64BIT
>     && ix86_binary_operator_ok (<CODE>, SImode, operands, TARGET_APX_NDD)"
> @@ -16850,6 +16850,8 @@ (define_insn "*<insn>si3_1_zext"
>      case TYPE_ISHIFTX:
>        return "#";
>
> +    case TYPE_MSKLOG:
> +      return "#";
>      default:
>        if (operands[2] == const1_rtx
>           && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun))
> @@ -16860,8 +16862,8 @@ (define_insn "*<insn>si3_1_zext"
>                        : "<shift>{l}\t{%2, %k0|%k0, %2}";
>      }
>  }
> -  [(set_attr "isa" "*,bmi2,apx_ndd")
> -   (set_attr "type" "ishift,ishiftx,ishift")
> +  [(set_attr "isa" "*,bmi2,apx_ndd,avx512bw")
> +   (set_attr "type" "ishift,ishiftx,ishift,msklog")
>     (set (attr "length_immediate")
>       (if_then_else
>         (and (match_operand 2 "const1_operand")
> @@ -16869,7 +16871,12 @@ (define_insn "*<insn>si3_1_zext"
>                  (match_test "optimize_function_for_size_p (cfun)")))
>         (const_string "0")
>         (const_string "*")))
> -   (set_attr "mode" "SI")])
> +   (set_attr "mode" "SI")
> +   (set (attr "enabled")
> +       (if_then_else
> +         (eq_attr "alternative" "3")
> +         (symbol_ref "<CODE> == LSHIFTRT && TARGET_AVX512BW")
> +         (const_string "*")))])
>
>  ;; Convert shift to the shiftx pattern to avoid flags dependency.
>  (define_split
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 0be2dcd8891..20665a6f097 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2179,6 +2179,34 @@ (define_split
>              (match_dup 2)))
>        (unspec [(const_int 0)] UNSPEC_MASKOP)])])
>
> +(define_insn "*klshrsi3_1_zext"
> +  [(set (match_operand:DI 0 "register_operand" "=k")
> +       (zero_extend:DI
> +         (lshiftrt:SI (match_operand:SI 1 "register_operand" "k")
> +                      (match_operand 2 "const_0_to_31_operand" "I"))))
> +      (unspec [(const_int 0)] UNSPEC_MASKOP)]
> +  "TARGET_AVX512BW"
> +  "kshiftrd\t{%2, %1, %0|%0, %1, %2}"
> +    [(set_attr "type" "msklog")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "SI")])
> +
> +(define_split
> +  [(set (match_operand:DI 0 "mask_reg_operand")
> +       (zero_extend:DI
> +         (lshiftrt:SI
> +           (match_operand:SI 1 "mask_reg_operand")
> +           (match_operand 2 "const_0_to_31_operand"))))
> +    (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_AVX512BW && reload_completed"
> +  [(parallel
> +     [(set (match_dup 0)
> +          (zero_extend:DI
> +            (lshiftrt:SI
> +              (match_dup 1)
> +              (match_dup 2))))
> +      (unspec [(const_int 0)] UNSPEC_MASKOP)])])
> +
>  (define_insn "ktest<mode>"
>    [(set (reg:CC FLAGS_REG)
>         (unspec:CC
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 3/3] [x86] Enable flate-combine.
  2024-06-28  5:27 ` [PATCH 3/3] [x86] Enable flate-combine liuhongt
@ 2024-06-28  6:03   ` Uros Bizjak
  0 siblings, 0 replies; 6+ messages in thread
From: Uros Bizjak @ 2024-06-28  6:03 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches

On Fri, Jun 28, 2024 at 7:29 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> Move pass_stv2 and pass_rpad after pre_reload pass_late_combine, also
> define target_insn_cost to prevent post_reload pass_late_combine to
> revert the optimziation did in pass_rpad.
>
> Adjust testcases since pass_late_combine generates better code but
> break scan assembly.
>
> .i.e
> Under 32-bit target, gcc used to generate broadcast from stack and
> then do the real operation.
> After flate_combine, they're combined into embeded broadcast
> operations.
>
> gcc/ChangeLog:
>
>         * config/i386/i386-features.cc (ix86_rpad_gate): New function.
>         * config/i386/i386-options.cc (ix86_override_options_after_change):
>         Don't disable flate_combine.
>         * config/i386/i386-passes.def: Move pass_stv2 and pass_rpad
>         after pre_reload pas_late_combine.
>         * config/i386/i386-protos.h (ix86_rpad_gate): New declare.
>         * config/i386/i386.cc (ix86_insn_cost): New function.
>         (TARGET_INSN_COST): Define.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Adjus
>         testcase.
>         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Ditto.
>         * gcc.target/i386/avx512f-fmadd-sf-zmm-7.c: Ditto.
>         * gcc.target/i386/avx512f-fmsub-sf-zmm-7.c: Ditto.
>         * gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c: Ditto.
>         * gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c: Ditto.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Ditto.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Ditto.
>         * gcc.target/i386/pr91333.c: Ditto.
>         * gcc.target/i386/vect-strided-4.c: Ditto.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-features.cc               | 16 +++++++++++-----
>  gcc/config/i386/i386-options.cc                |  4 ----
>  gcc/config/i386/i386-passes.def                |  4 ++--
>  gcc/config/i386/i386-protos.h                  |  1 +
>  gcc/config/i386/i386.cc                        | 18 ++++++++++++++++++
>  .../i386/avx512f-broadcast-pr87767-1.c         |  4 ++--
>  .../i386/avx512f-broadcast-pr87767-5.c         |  1 -
>  .../gcc.target/i386/avx512f-fmadd-sf-zmm-7.c   |  2 +-
>  .../gcc.target/i386/avx512f-fmsub-sf-zmm-7.c   |  2 +-
>  .../gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c  |  2 +-
>  .../gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c  |  2 +-
>  .../i386/avx512vl-broadcast-pr87767-1.c        |  4 ++--
>  .../i386/avx512vl-broadcast-pr87767-5.c        |  2 --
>  gcc/testsuite/gcc.target/i386/pr91333.c        |  2 +-
>  gcc/testsuite/gcc.target/i386/vect-strided-4.c |  2 +-
>  15 files changed, 42 insertions(+), 24 deletions(-)
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index 607d1991460..fc224ed06b0 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -2995,6 +2995,16 @@ make_pass_insert_endbr_and_patchable_area (gcc::context *ctxt)
>    return new pass_insert_endbr_and_patchable_area (ctxt);
>  }
>
> +bool
> +ix86_rpad_gate ()
> +{
> +  return (TARGET_AVX
> +         && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +         && TARGET_SSE_MATH
> +         && optimize
> +         && optimize_function_for_speed_p (cfun));
> +}
> +
>  /* At entry of the nearest common dominator for basic blocks with
>     conversions/rcp/sqrt/rsqrt/round, generate a single
>         vxorps %xmmN, %xmmN, %xmmN
> @@ -3232,11 +3242,7 @@ public:
>    /* opt_pass methods: */
>    bool gate (function *) final override
>      {
> -      return (TARGET_AVX
> -             && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> -             && TARGET_SSE_MATH
> -             && optimize
> -             && optimize_function_for_speed_p (cfun));
> +      return ix86_rpad_gate ();
>      }
>
>    unsigned int execute (function *) final override
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 9c12d498928..1ef2c71a7a2 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1944,10 +1944,6 @@ ix86_override_options_after_change (void)
>         flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>      }
>
> -  /* Late combine tends to undo some of the effects of STV and RPAD,
> -     by combining instructions back to their original form.  */
> -  if (!OPTION_SET_P (flag_late_combine_instructions))
> -    flag_late_combine_instructions = 0;
>  }
>
>  /* Clear stack slot assignments remembered from previous functions.
> diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> index 7d96766f7b9..2d29f65da88 100644
> --- a/gcc/config/i386/i386-passes.def
> +++ b/gcc/config/i386/i386-passes.def
> @@ -25,11 +25,11 @@ along with GCC; see the file COPYING3.  If not see
>   */
>
>    INSERT_PASS_AFTER (pass_postreload_cse, 1, pass_insert_vzeroupper);
> -  INSERT_PASS_AFTER (pass_combine, 1, pass_stv, false /* timode_p */);
> +  INSERT_PASS_AFTER (pass_late_combine, 1, pass_stv, false /* timode_p */);
>    /* Run the 64-bit STV pass before the CSE pass so that CONST0_RTX and
>       CONSTM1_RTX generated by the STV pass can be CSEed.  */
>    INSERT_PASS_BEFORE (pass_cse2, 1, pass_stv, true /* timode_p */);
>
>    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
>
> -  INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> +  INSERT_PASS_AFTER (pass_late_combine, 1, pass_remove_partial_avx_dependency);
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 4f48dc0bf75..3dbd18dc70b 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -422,6 +422,7 @@ extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
>    (gcc::context *);
>
>  extern bool ix86_has_no_direct_extern_access;
> +extern bool ix86_rpad_gate ();
>
>  /* In i386-expand.cc.  */
>  bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 1f71ed04be6..9d2b7d1f174 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -21371,6 +21371,22 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
>      }
>  }
>
> +static int
> +ix86_insn_cost (rtx_insn *insn, bool speed)
> +{
> +  int insn_cost = 0;
> +  /* Add extra cost to avoid post_reload late_combine revert
> +     the optimization did in pass_rpad.  */
> +  if (reload_completed
> +      && ix86_rpad_gate ()
> +      && recog_memoized (insn) >= 0
> +      && get_attr_avx_partial_xmm_update (insn)
> +      == AVX_PARTIAL_XMM_UPDATE_TRUE)
> +    insn_cost += COSTS_N_INSNS (3);
> +
> +  return insn_cost + pattern_cost (PATTERN (insn), speed);
> +}
> +
>  /* Compute a (partial) cost for rtx X.  Return true if the complete
>     cost has been computed, and false if subexpressions should be
>     scanned.  In either case, *TOTAL contains the cost result.  */
> @@ -26514,6 +26530,8 @@ static const scoped_attribute_specs *const ix86_attribute_table[] =
>  #define TARGET_MEMORY_MOVE_COST ix86_memory_move_cost
>  #undef TARGET_RTX_COSTS
>  #define TARGET_RTX_COSTS ix86_rtx_costs
> +#undef TARGET_INSN_COST
> +#define TARGET_INSN_COST ix86_insn_cost
>  #undef TARGET_ADDRESS_COST
>  #define TARGET_ADDRESS_COST ix86_address_cost
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> index 138dbb4c973..3a50749e610 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> @@ -3,8 +3,8 @@
>  /* { dg-options "-O2 -mavx512f -mavx512dq" } */
>  /* { dg-additional-options "-fno-PIE" { target ia32 } } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } } */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}"  2 } }  */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> index d22251bc2a3..ea2f64861d0 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
> @@ -3,7 +3,6 @@
>  /* { dg-options "-O2 -mavx512f" } */
>  /* { dg-additional-options "-fno-PIE" { target ia32 } } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to8\\\}" { target ia32 } } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
> index 8c117207efa..bbcc5ed0bec 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-fmadd-sf-zmm-7.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx512f -O2" } */
> -/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
>  /* { dg-final { scan-assembler-times "vfmadd...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
>
>  #define type __m512
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
> index cc705af8ea5..fc72dd6e557 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-fmsub-sf-zmm-7.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx512f -O2" } */
> -/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
>  /* { dg-final { scan-assembler-times "vfmsub...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
>
>  #define type __m512
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
> index db5c34678c0..342de482da8 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx512f -O2" } */
> -/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
>  /* { dg-final { scan-assembler-times "vfnmadd...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
>
>  #define type __m512
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c b/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
> index 7815251b82d..f56a3f8acc4 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx512f -O2" } */
> -/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vbroadcastss\[^\n\]*%zmm\[0-9\]+" 1 { target { ! ia32 } } } } */
>  /* { dg-final { scan-assembler-times "vfnmsub...ps\[^\n\]*%zmm\[0-9\]+" 1 } } */
>
>  #define type __m512
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> index e6df4d25f36..08898445be5 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> @@ -3,8 +3,8 @@
>  /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
>  /* { dg-additional-options "-fno-PIE" { target ia32 } } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 } }  */
> -/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
> +/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
>  /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> index ebdc3619d8e..c57a2e29767 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
> @@ -3,8 +3,6 @@
>  /* { dg-options "-O2 -mavx512f -mavx512vl" } */
>  /* { dg-additional-options "-fno-PIE" { target ia32 } } */
>  /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
> -/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to2\\\}" { target ia32 } } } */
> -/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to4\\\}" { target ia32 } } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
>  /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr91333.c b/gcc/testsuite/gcc.target/i386/pr91333.c
> index 2bdff871024..b4940b5c9ec 100644
> --- a/gcc/testsuite/gcc.target/i386/pr91333.c
> +++ b/gcc/testsuite/gcc.target/i386/pr91333.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-O2 -mavx" } */
> -/* { dg-final { scan-assembler-times "vmovapd|vmovsd" 3 } } */
> +/* { dg-final { scan-assembler-times "vmovapd|vmovsd" 2 } } */
>
>  static inline double g (double x){
>    asm volatile ("" : "+x" (x));
> diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
> index dd922926a2a..3fb9f07886e 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-strided-4.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
> @@ -15,6 +15,6 @@ void foo (int * __restrict a, int * __restrict b, int *c, int s)
>
>  /* Vectorization factor two, two two-element stores to a using movq
>     and two two-element stores to b via pextrq/movhps of the high part.  */
> -/* { dg-final { scan-assembler-times "movq" 2 } } */
> +/* { dg-final { scan-assembler-times "movq\[\t ]+%xmm\[0-9]" 2 } } */
>  /* { dg-final { scan-assembler-times "pextrq" 2 { target { ! ia32 } } } } */
>  /* { dg-final { scan-assembler-times "movhps" 2 { target { ia32 } } } } */
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-06-28  6:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-28  5:27 [PATCH 0/3][x86] Enable pass_late_combine for x86 liuhongt
2024-06-28  5:27 ` [PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables liuhongt
2024-06-28  5:27 ` [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative liuhongt
2024-06-28  5:59   ` Uros Bizjak
2024-06-28  5:27 ` [PATCH 3/3] [x86] Enable flate-combine liuhongt
2024-06-28  6:03   ` Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).