public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
@ 2023-12-22 10:25 Roger Sayle
  2024-01-02  5:39 ` Hongtao Liu
  0 siblings, 1 reply; 4+ messages in thread
From: Roger Sayle @ 2023-12-22 10:25 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Uros Bizjak', 'Hongtao Liu'

[-- Attachment #1: Type: text/plain, Size: 5281 bytes --]


This patch resolves the second part of PR target/112992, building upon
Hongtao Liu's solution to the first part.

The issue addressed by this patch is that when initializing vectors by
broadcasting integer constants, the compiler has the flexibility to
select the most appropriate vector mode to perform the broadcast, as
long as the resulting vector has an identical bit pattern.  For
example, the following constants are all equivalent:
V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 }
V8HImode {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 }
So instruction sequences that construct any of these can be used to
construct the others (with a suitable cast/SUBREG).

On x86_64, it turns out that broadcasts of SImode constants are preferred,
as DImode constants often require a longer movabs instruction, and
HImode and QImode broadcasts require multiple uops on some architectures.
Hence, SImode is always the equal shortest/fastest implementation.

Examples of this improvement, can be seen in the testsuite.

gcc.target/i386/pr102021.c
Before:
   0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
   7:   00 0c 00
   a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
  10:   c3                      retq

After:
   0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
   5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
   b:   c3                      retq

and
gcc.target/i386/pr90773-17.c:
Before:
   0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
   7:   b8 0c 00 00 00          mov    $0xc,%eax
   c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
  12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
  18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
  1f:   c3                      retq

After:
   0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
   7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
   c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
  12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
  18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
  1f:   c3                      retq

where according to Agner Fog's instruction tables broadcastd is slightly
faster on some microarchitectures, for example Knight's Landing.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-12-21  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/112992
        * config/i386/i386-expand.cc
        (ix86_convert_const_wide_int_to_broadcast): Allow call to
        ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
        (ix86_broadcast_from_constant): Revert recent change; Return a
        suitable MEMREF independently of mode/target combinations.
        (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
        to decide whether expansion is possible/preferrable.  Only try
        forcing DImode constants to memory (and trying again) if calling
        ix86_expand_vector_init_duplicate fails with an DImode immediate
        constant.
        (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
        V4SImode for suitable immediate constants.
        <case E_V4DImode>: Try using V8SImode for suitable constants.
        <case E_V4SImode>: Use constant pool for AVX without AVX2.
        <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
        <case E_V2HImode>: Likewise.
        <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
        <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
        <label widen>: Handle CONT_INTs via simplify_binary_operation.
        Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
        <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
        <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
        (ix86_expand_vector_init): Move try using a broadcast for all_same
        with ix86_expand_vector_init_duplicate before using constant pool.

gcc/testsuite/ChangeLog
        * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
        * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
        * gcc.target/i386/avx512fp16-13.c: Likewise.
        * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
        * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
        * gcc.target/i386/pr100865-10a.c: Likewise.
        * gcc.target/i386/pr100865-10b.c: Likewise.
        * gcc.target/i386/pr100865-11c.c: Likewise.
        * gcc.target/i386/pr100865-12c.c: Likewise.
        * gcc.target/i386/pr100865-2.c: Likewise.
        * gcc.target/i386/pr100865-3.c: Likewise.
        * gcc.target/i386/pr100865-4a.c: Likewise.
        * gcc.target/i386/pr100865-4b.c: Likewise.
        * gcc.target/i386/pr100865-5a.c: Likewise.
        * gcc.target/i386/pr100865-5b.c: Likewise.
        * gcc.target/i386/pr100865-9a.c: Likewise.
        * gcc.target/i386/pr100865-9b.c: Likewise.
        * gcc.target/i386/pr102021.c: Likewise.
        * gcc.target/i386/pr90773-17.c: Likewise.


Thanks in advance,
Roger
--


[-- Attachment #2: patchwb4a.txt --]
[-- Type: text/plain, Size: 8250 bytes --]

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 57a108a..b0113b5 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -352,7 +352,8 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
   bool ok = ix86_expand_vector_init_duplicate (false, vector_mode,
 					       target,
 					       GEN_INT (val_broadcast));
-  gcc_assert (ok);
+  if (!ok)
+    return nullptr;
   target = lowpart_subreg (mode, target, vector_mode);
   return target;
 }
@@ -599,19 +600,11 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
       && INTEGRAL_MODE_P (mode))
     return nullptr;
 
-  unsigned int msize = GET_MODE_SIZE (mode);
-  unsigned int inner_size = GET_MODE_SIZE (GET_MODE_INNER ((mode)));
-
   /* Convert CONST_VECTOR to a non-standard SSE constant integer
      broadcast only if vector broadcast is available.  */
   if (standard_sse_constant_p (op, mode))
     return nullptr;
 
-  /* vpbroadcast[b,w] is available under TARGET_AVX2.
-     or TARGET_AVX512BW for zmm.  */
-  if (inner_size < 4 && !(msize == 64 ? TARGET_AVX512BW : TARGET_AVX2))
-    return nullptr;
-
   if (GET_MODE_INNER (mode) == TImode)
     return nullptr;
 
@@ -705,22 +698,22 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
 	{
 	  /* Broadcast to XMM/YMM/ZMM register from an integer
 	     constant or scalar mem.  */
-	  op1 = gen_reg_rtx (mode);
-	  if (FLOAT_MODE_P (mode)
-	      || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)
-	      /* vbroadcastss/vbroadcastsd only supports memory operand
-		 w/o AVX2, force them into memory to avoid spill to
-		 memory.  */
-	      || (GET_MODE_SIZE (mode) == 32
-		  && (GET_MODE_INNER (mode) == DImode
-		      || GET_MODE_INNER (mode) == SImode)
-		  && !TARGET_AVX2))
+	  rtx tmp = gen_reg_rtx (mode);
+	  if (FLOAT_MODE_P (mode))
 	    first = force_const_mem (GET_MODE_INNER (mode), first);
 	  bool ok = ix86_expand_vector_init_duplicate (false, mode,
-						       op1, first);
-	  gcc_assert (ok);
-	  emit_move_insn (op0, op1);
-	  return;
+						       tmp, first);
+	  if (!ok && !TARGET_64BIT && GET_MODE_INNER (mode) == DImode)
+	    {
+	      first = force_const_mem (GET_MODE_INNER (mode), first);
+	      ok = ix86_expand_vector_init_duplicate (false, mode,
+						      tmp, first);
+	    }
+	  if (ok)
+	    {
+	      emit_move_insn (op0, tmp);
+	      return;
+	    }
 	}
     }
 
@@ -15697,6 +15690,56 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
   switch (mode)
     {
+    case E_V2DImode:
+      if (CONST_INT_P (val))
+	{
+	  int tmp = (int)INTVAL (val);
+	  if (tmp == (int)(INTVAL (val) >> 32))
+	    {
+	      rtx reg = gen_reg_rtx (V4SImode);
+	      ok = ix86_vector_duplicate_value (V4SImode, reg,
+						GEN_INT (tmp));
+	      if (ok)
+		{
+		  emit_move_insn (target, gen_lowpart (V2DImode, reg));
+		  return true;
+		}
+	    }
+	  if (!TARGET_AVX)
+	    return false;
+	  if (!TARGET_AVX2)
+	    val = force_const_mem (DImode, val);
+	}
+      return ix86_vector_duplicate_value (mode, target, val);
+
+    case E_V4DImode:
+      if (CONST_INT_P (val))
+	{
+	  int tmp = (int)INTVAL (val);
+	  if (tmp == (int)(INTVAL (val) >> 32))
+	    {
+	      rtx reg = gen_reg_rtx (V8SImode);
+	      ok = ix86_vector_duplicate_value (V8SImode, reg,
+						GEN_INT (tmp));
+	      if (ok)
+		{
+		  emit_move_insn (target, gen_lowpart (V4DImode, reg));
+		  return true;
+		}
+	    }
+	}
+      return ix86_vector_duplicate_value (mode, target, val);
+
+    case E_V4SImode:
+      if (CONST_INT_P (val))
+	{
+	  if (!TARGET_AVX)
+	    return false;
+	  if (!TARGET_AVX2)
+	    val = force_const_mem (SImode, val);
+	}
+      return ix86_vector_duplicate_value (mode, target, val);
+
     case E_V2SImode:
     case E_V2SFmode:
       if (!mmx_ok)
@@ -15704,13 +15747,10 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       /* FALLTHRU */
 
     case E_V4DFmode:
-    case E_V4DImode:
     case E_V8SFmode:
     case E_V8SImode:
     case E_V2DFmode:
-    case E_V2DImode:
     case E_V4SFmode:
-    case E_V4SImode:
     case E_V16SImode:
     case E_V8DImode:
     case E_V16SFmode:
@@ -15725,6 +15765,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x;
 
 	  val = gen_lowpart (SImode, val);
+	  if (CONST_INT_P (val))
+	    return false;
 	  x = gen_rtx_TRUNCATE (HImode, val);
 	  x = gen_rtx_VEC_DUPLICATE (mode, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15749,6 +15791,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x;
 
 	  val = gen_lowpart (SImode, val);
+	  if (CONST_INT_P (val))
+	    return false;
 	  x = gen_rtx_TRUNCATE (HImode, val);
 	  x = gen_rtx_VEC_DUPLICATE (mode, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15774,6 +15818,10 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       goto widen;
 
     case E_V8HImode:
+      if (CONST_INT_P (val))
+	goto widen;
+      /* FALLTHRU */
+
     case E_V8HFmode:
     case E_V8BFmode:
       if (TARGET_AVX2)
@@ -15821,6 +15869,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       goto widen;
 
     case E_V16QImode:
+      if (CONST_INT_P (val))
+	goto widen;
       if (TARGET_AVX2)
 	return ix86_vector_duplicate_value (mode, target, val);
 
@@ -15840,7 +15890,13 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
 	val = convert_modes (wsmode, smode, val, true);
 
-	if (smode == QImode && !TARGET_PARTIAL_REG_STALL)
+	if (CONST_INT_P (val))
+	  {
+	    x = simplify_binary_operation (ASHIFT, wsmode, val,
+					   GEN_INT (GET_MODE_BITSIZE (smode)));
+	    val = simplify_binary_operation (IOR, wsmode, val, x);
+	  }
+	else if (smode == QImode && !TARGET_PARTIAL_REG_STALL)
 	  emit_insn (gen_insv_1 (wsmode, val, val));
 	else
 	  {
@@ -15853,15 +15909,20 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
 	x = gen_reg_rtx (wvmode);
 	ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
-	gcc_assert (ok);
+	if (!ok)
+	  return false;
 	emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
-	return ok;
+	return true;
       }
 
     case E_V16HImode:
+    case E_V32QImode:
+      if (CONST_INT_P (val))
+	goto widen;
+      /* FALLTHRU */
+
     case E_V16HFmode:
     case E_V16BFmode:
-    case E_V32QImode:
       if (TARGET_AVX2)
 	return ix86_vector_duplicate_value (mode, target, val);
       else
@@ -15887,7 +15948,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x = gen_reg_rtx (hvmode);
 
 	  ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
-	  gcc_assert (ok);
+	  if (!ok)
+	    return false;
 
 	  x = gen_rtx_VEC_CONCAT (mode, x, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15924,7 +15986,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x = gen_reg_rtx (hvmode);
 
 	  ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
-	  gcc_assert (ok);
+	  if (!ok)
+	    return false;
 
 	  x = gen_rtx_VEC_CONCAT (mode, x, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -16896,6 +16959,12 @@ ix86_expand_vector_init (bool mmx_ok, rtx target, rtx vals)
 	all_same = false;
     }
 
+  /* If all values are identical, broadcast the value.  */
+  if (all_same
+      && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
+					    XVECEXP (vals, 0, 0)))
+    return;
+
   /* Constants are best loaded from the constant pool.  */
   if (n_var == 0)
     {
@@ -16903,12 +16972,6 @@ ix86_expand_vector_init (bool mmx_ok, rtx target, rtx vals)
       return;
     }
 
-  /* If all values are identical, broadcast the value.  */
-  if (all_same
-      && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
-					    XVECEXP (vals, 0, 0)))
-    return;
-
   /* Values where only one field is non-constant are best loaded from
      the pool and overwritten via move later.  */
   if (n_var == 1)

[-- Attachment #3: patchwb4b.txt --]
[-- Type: text/plain, Size: 14940 bytes --]

diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
index 0fa93e0..138dbb4 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -3,8 +3,7 @@
 /* { dg-options "-O2 -mavx512f -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } } */
 /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
index f1b672a..d22251b 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -3,7 +3,7 @@
 /* { dg-options "-O2 -mavx512f" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to8\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-13.c b/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
index b73a8f4..f431b8a 100644
--- a/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
@@ -126,7 +126,7 @@ abs256_ph (__m256h a)
   return _mm256_abs_ph (a);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[^\n\]*%ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpand\[^\n\]*%ymm\[0-9\]+" 1 } } */
 
 __m128h
@@ -136,5 +136,5 @@ abs_ph (__m128h a)
   return _mm_abs_ph (a);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[^\n\]*%xmm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%xmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpand\[^\n\]*%xmm\[0-9\]+" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
index 0304b9d..e6df4d2 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -3,10 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 7 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 } }  */
 /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
index 0ba0cd9..ebdc361 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -3,8 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to2\\\}" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to4\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10a.c b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
index 1d849a3..3bc0f1a 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-10a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
@@ -29,5 +29,5 @@ foo (void)
     array[i] = MK_CONST128_BROADCAST (0x1f);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
index e5616d8..f60d1bf 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-10b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-10a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-11c.c b/gcc/testsuite/gcc.target/i386/pr100865-11c.c
index de56c84..94a2c43 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-11c.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-11c.c
@@ -3,6 +3,5 @@
 
 #include "pr100865-11a.c"
 
-/* { dg-final { scan-assembler-times "movabsq" 1 } } */
-/* { dg-final { scan-assembler-times "vpunpcklqdq\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovddup\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-12c.c b/gcc/testsuite/gcc.target/i386/pr100865-12c.c
index 77415f2..424c916 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-12c.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-12c.c
@@ -3,6 +3,5 @@
 
 #include "pr100865-12a.c"
 
-/* { dg-final { scan-assembler-times "movabsq" 1 } } */
-/* { dg-final { scan-assembler-times "vpunpcklqdq\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vmovddup\[\\t \]+\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-2.c b/gcc/testsuite/gcc.target/i386/pr100865-2.c
index 090a010..e0265d2 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-2.c
@@ -10,6 +10,6 @@ foo (void)
   __builtin_memset (dst, 3, 16);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
index cde4b1c..433fd81 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
@@ -10,7 +10,7 @@ foo (void)
   __builtin_memset (dst, 3, 16);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4a.c b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
index bd99945..8009e5c 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-4a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
@@ -12,6 +12,6 @@ foo (void)
     array[i] = -45;
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 2 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
index 1814306..6fd703e 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-4b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
@@ -4,8 +4,8 @@
 
 #include "pr100865-4a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 2 } } */
 /* { dg-final { scan-assembler-times "vzeroupper" 1 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5a.c b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
index b023fca..d6fb79e 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-5a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
@@ -12,6 +12,6 @@ foo (void)
     array[i] = -45;
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 4 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5b.c b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
index 5bccfd0..6c2b33d 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-5b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
@@ -4,7 +4,7 @@
 
 #include "pr100865-5a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu16\[\\t \]%ymm\[0-9\]+, " 4 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9a.c b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
index 45d0e0d..f2ac1bd 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-9a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
@@ -21,5 +21,5 @@ foo (void)
     array[i] = MK_CONST128_BROADCAST (0x1fff);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9b.c b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
index 1469624..e2a3f92 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-9b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-9a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr102021.c b/gcc/testsuite/gcc.target/i386/pr102021.c
index a5012a4..8ff898d 100644
--- a/gcc/testsuite/gcc.target/i386/pr102021.c
+++ b/gcc/testsuite/gcc.target/i386/pr102021.c
@@ -10,7 +10,7 @@ foo ()
   return _mm256_set1_epi16 (12);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
 /* { dg-final { scan-assembler-not "vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-17.c b/gcc/testsuite/gcc.target/i386/pr90773-17.c
index 3036085..61b2bfd 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-17.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-17.c
@@ -10,6 +10,6 @@ foo (void)
   __builtin_memset (dst, 12, 19);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
 /* { dg-final { scan-assembler-times "vmovd\[\\t \]+%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 { xfail *-*-* } } } */

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
  2023-12-22 10:25 [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants Roger Sayle
@ 2024-01-02  5:39 ` Hongtao Liu
  2024-01-06 22:53   ` Roger Sayle
  0 siblings, 1 reply; 4+ messages in thread
From: Hongtao Liu @ 2024-01-02  5:39 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Uros Bizjak

On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch resolves the second part of PR target/112992, building upon
> Hongtao Liu's solution to the first part.
>
> The issue addressed by this patch is that when initializing vectors by
> broadcasting integer constants, the compiler has the flexibility to
> select the most appropriate vector mode to perform the broadcast, as
> long as the resulting vector has an identical bit pattern.  For
> example, the following constants are all equivalent:
> V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 }
> V8HImode {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 }
> So instruction sequences that construct any of these can be used to
> construct the others (with a suitable cast/SUBREG).
>
> On x86_64, it turns out that broadcasts of SImode constants are preferred,
> as DImode constants often require a longer movabs instruction, and
> HImode and QImode broadcasts require multiple uops on some architectures.
> Hence, SImode is always the equal shortest/fastest implementation.
>
> Examples of this improvement, can be seen in the testsuite.
>
> gcc.target/i386/pr102021.c
> Before:
>    0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
>    7:   00 0c 00
>    a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
>   10:   c3                      retq
>
> After:
>    0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
>    5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
>    b:   c3                      retq
>
> and
> gcc.target/i386/pr90773-17.c:
> Before:
>    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
>    7:   b8 0c 00 00 00          mov    $0xc,%eax
>    c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
>   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
>   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
>   1f:   c3                      retq
>
> After:
>    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
>    7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
>    c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
>   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
>   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
>   1f:   c3                      retq
>
> where according to Agner Fog's instruction tables broadcastd is slightly
> faster on some microarchitectures, for example Knight's Landing.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-12-21  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/112992
>         * config/i386/i386-expand.cc
>         (ix86_convert_const_wide_int_to_broadcast): Allow call to
>         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
>         (ix86_broadcast_from_constant): Revert recent change; Return a
>         suitable MEMREF independently of mode/target combinations.
>         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
>         to decide whether expansion is possible/preferrable.  Only try
>         forcing DImode constants to memory (and trying again) if calling
>         ix86_expand_vector_init_duplicate fails with an DImode immediate
>         constant.
>         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
>         V4SImode for suitable immediate constants.
>         <case E_V4DImode>: Try using V8SImode for suitable constants.
>         <case E_V4SImode>: Use constant pool for AVX without AVX2.
>         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
>         <case E_V2HImode>: Likewise.
>         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
>         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
>         <label widen>: Handle CONT_INTs via simplify_binary_operation.
>         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
>         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
>         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
>         (ix86_expand_vector_init): Move try using a broadcast for all_same
>         with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
>         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/avx512fp16-13.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/pr100865-10a.c: Likewise.
>         * gcc.target/i386/pr100865-10b.c: Likewise.
>         * gcc.target/i386/pr100865-11c.c: Likewise.
>         * gcc.target/i386/pr100865-12c.c: Likewise.
>         * gcc.target/i386/pr100865-2.c: Likewise.
>         * gcc.target/i386/pr100865-3.c: Likewise.
>         * gcc.target/i386/pr100865-4a.c: Likewise.
>         * gcc.target/i386/pr100865-4b.c: Likewise.
>         * gcc.target/i386/pr100865-5a.c: Likewise.
>         * gcc.target/i386/pr100865-5b.c: Likewise.
>         * gcc.target/i386/pr100865-9a.c: Likewise.
>         * gcc.target/i386/pr100865-9b.c: Likewise.
>         * gcc.target/i386/pr102021.c: Likewise.
>         * gcc.target/i386/pr90773-17.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>

>+    case E_V2DImode:
>+      if (CONST_INT_P (val))
>+        {
>+          int tmp = (int)INTVAL (val);
>+          if (tmp == (int)(INTVAL (val) >> 32))
>+            {
>+              rtx reg = gen_reg_rtx (V4SImode);
>+              ok = ix86_vector_duplicate_value (V4SImode, reg,
>+                                                GEN_INT (tmp));
>+              if (ok)
>+                {
>+                  emit_move_insn (target, gen_lowpart (V2DImode, reg));
>+                  return true;
>+                }
>+            }
>+          if (!TARGET_AVX)
>+            return false;
>+          if (!TARGET_AVX2)
>+            val = force_const_mem (DImode, val);

We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for
v2di, so I think there is no need to return false or force_const_mem
here.

>+    case E_V4SImode:
>+      if (CONST_INT_P (val))
>+        {
>+          if (!TARGET_AVX)
>+            return false;
>+          if (!TARGET_AVX2)
>+            val = force_const_mem (SImode, val);

Ditto.

>+        }
>+      return ix86_vector_duplicate_value (mode, target, val);



>         x = gen_reg_rtx (wvmode);
>         ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
>-        gcc_assert (ok);
>+        if (!ok)
>+          return false;
Then we can still gcc_assert (ok) here.
>         emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
>-        return ok;
>+        return true;


>+    case E_V32QImode:
>+      if (CONST_INT_P (val))
>+        goto widen;
>+      /* FALLTHRU */
>+
>     case E_V16HFmode:
>    case E_V16BFmode:
>-    case E_V32QImode:
>       if (TARGET_AVX2)
>         return ix86_vector_duplicate_value (mode, target, val);
>       else
>@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
>           rtx x = gen_reg_rtx (hvmode);
>
>           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
>-          gcc_assert (ok);
>+          if (!ok)
>+            return false;
>
>           x = gen_rtx_VEC_CONCAT (mode, x, x);
>           emit_insn (gen_rtx_SET (target, x));
>@@ -15941,7 +16003,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
>           rtx x = gen_reg_rtx (hvmode);
>
>           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
>-          gcc_assert (ok);
>+          if (!ok)
Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to
vec_dupv16hi widen to vec_dupv8si.
>+            return false;

-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
  2024-01-02  5:39 ` Hongtao Liu
@ 2024-01-06 22:53   ` Roger Sayle
  2024-01-08  1:58     ` Hongtao Liu
  0 siblings, 1 reply; 4+ messages in thread
From: Roger Sayle @ 2024-01-06 22:53 UTC (permalink / raw)
  To: 'Hongtao Liu'; +Cc: gcc-patches, 'Uros Bizjak'

[-- Attachment #1: Type: text/plain, Size: 13544 bytes --]

Hi Hongtao,

Many thanks for the review.  This revised patch implements several
of your suggestions, specifically to use pshufd for V4SImode and
punpcklqdq for V2DImode.  These changes are demonstrated by the
examples below:

typedef unsigned int v4si __attribute((vector_size(16)));
typedef unsigned long long v2di __attribute((vector_size(16)));

v4si foo() { return (v4si){1,1,1,1}; }
v2di bar() { return (v2di){1,1}; }

The previous version of my patch generated:

foo:    movdqa  .LC0(%rip), %xmm0
        ret
bar:    movdqa  .LC1(%rip), %xmm0
        ret

with this revised version, -O2 generates:

foo:    movl    $1, %eax
        movd    %eax, %xmm0
        pshufd  $0, %xmm0, %xmm0
        ret
bar:    movl    $1, %eax
        movq    %rax, %xmm0
        punpcklqdq      %xmm0, %xmm0
        ret

However, if it's OK with you, I'd prefer to allow this function to
return false, safely falling back to emitting a vector load from
the constant bool rather than ICEing from a gcc_assert.  For one
thing this isn't a unrecoverable correctness issue, but at worst
a missed optimization.  The deeper reason is that this usefully
provides a handle for tuning on different microarchitectures.
On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC,
the first form above may be preferable to the second.  Currently
the start of ix86_convert_const_wide_int_to_broadcast disables
broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an
implementation doesn't reuire an inter unit move, such as a
broadcast from memory.  I plan follow-up patches that benefit
from this flexibility.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

gcc/ChangeLog
        PR target/112992
        * config/i386/i386-expand.cc
        (ix86_convert_const_wide_int_to_broadcast): Allow call to
        ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
        (ix86_broadcast_from_constant): Revert recent change; Return a
        suitable MEMREF independently of mode/target combinations.
        (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
        to decide whether expansion is possible/preferrable.  Only try
        forcing DImode constants to memory (and trying again) if calling
        ix86_expand_vector_init_duplicate fails with an DImode immediate
        constant.
        (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
        V4SImode for suitable immediate constants.
        <case E_V4DImode>: Try using V8SImode for suitable constants.
        <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
        <case E_V2HImode>: Likewise.
        <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
        <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
        <label widen>: Handle CONT_INTs via simplify_binary_operation.
        Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
        <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
        <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
        (ix86_expand_vector_init): Move try using a broadcast for all_same
        with ix86_expand_vector_init_duplicate before using constant pool.

gcc/testsuite/ChangeLog
        * gcc.target/i386/auto-init-8.c: Update test case.
        * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise.
        * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
        * gcc.target/i386/avx512fp16-13.c: Likewise.
        * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
        * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
        * gcc.target/i386/pr100865-1.c: Likewise.
        * gcc.target/i386/pr100865-10a.c: Likewise.
        * gcc.target/i386/pr100865-10b.c: Likewise.
        * gcc.target/i386/pr100865-2.c: Likewise.
        * gcc.target/i386/pr100865-3.c: Likewise.
        * gcc.target/i386/pr100865-4a.c: Likewise.
        * gcc.target/i386/pr100865-4b.c: Likewise.
        * gcc.target/i386/pr100865-5a.c: Likewise.
        * gcc.target/i386/pr100865-5b.c: Likewise.
        * gcc.target/i386/pr100865-9a.c: Likewise.
        * gcc.target/i386/pr100865-9b.c: Likewise.
        * gcc.target/i386/pr102021.c: Likewise.
        * gcc.target/i386/pr90773-17.c: Likewise.

Thanks in advance.
Roger
--

> -----Original Message-----
> From: Hongtao Liu <crazylht@gmail.com>
> Sent: 02 January 2024 05:40
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak <ubizjak@gmail.com>
> Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of
> constants.
> 
> On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > This patch resolves the second part of PR target/112992, building upon
> > Hongtao Liu's solution to the first part.
> >
> > The issue addressed by this patch is that when initializing vectors by
> > broadcasting integer constants, the compiler has the flexibility to
> > select the most appropriate vector mode to perform the broadcast, as
> > long as the resulting vector has an identical bit pattern.  For
> > example, the following constants are all equivalent:
> > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } V8HImode
> > {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ...
> > 0x01 } So instruction sequences that construct any of these can be
> > used to construct the others (with a suitable cast/SUBREG).
> >
> > On x86_64, it turns out that broadcasts of SImode constants are
> > preferred, as DImode constants often require a longer movabs
> > instruction, and HImode and QImode broadcasts require multiple uops on some
> architectures.
> > Hence, SImode is always the equal shortest/fastest implementation.
> >
> > Examples of this improvement, can be seen in the testsuite.
> >
> > gcc.target/i386/pr102021.c
> > Before:
> >    0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
> >    7:   00 0c 00
> >    a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
> >   10:   c3                      retq
> >
> > After:
> >    0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
> >    5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
> >    b:   c3                      retq
> >
> > and
> > gcc.target/i386/pr90773-17.c:
> > Before:
> >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> >    7:   b8 0c 00 00 00          mov    $0xc,%eax
> >    c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
> >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> >   1f:   c3                      retq
> >
> > After:
> >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> >    7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
> >    c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
> >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> >   1f:   c3                      retq
> >
> > where according to Agner Fog's instruction tables broadcastd is
> > slightly faster on some microarchitectures, for example Knight's Landing.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-12-21  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/112992
> >         * config/i386/i386-expand.cc
> >         (ix86_convert_const_wide_int_to_broadcast): Allow call to
> >         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> >         (ix86_broadcast_from_constant): Revert recent change; Return a
> >         suitable MEMREF independently of mode/target combinations.
> >         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> >         to decide whether expansion is possible/preferrable.  Only try
> >         forcing DImode constants to memory (and trying again) if calling
> >         ix86_expand_vector_init_duplicate fails with an DImode immediate
> >         constant.
> >         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
> >         V4SImode for suitable immediate constants.
> >         <case E_V4DImode>: Try using V8SImode for suitable constants.
> >         <case E_V4SImode>: Use constant pool for AVX without AVX2.
> >         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
> >         <case E_V2HImode>: Likewise.
> >         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
> >         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
> >         <label widen>: Handle CONT_INTs via simplify_binary_operation.
> >         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> >         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
> >         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
> >         (ix86_expand_vector_init): Move try using a broadcast for all_same
> >         with ix86_expand_vector_init_duplicate before using constant pool.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> >         * gcc.target/i386/avx512fp16-13.c: Likewise.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> >         * gcc.target/i386/pr100865-10a.c: Likewise.
> >         * gcc.target/i386/pr100865-10b.c: Likewise.
> >         * gcc.target/i386/pr100865-11c.c: Likewise.
> >         * gcc.target/i386/pr100865-12c.c: Likewise.
> >         * gcc.target/i386/pr100865-2.c: Likewise.
> >         * gcc.target/i386/pr100865-3.c: Likewise.
> >         * gcc.target/i386/pr100865-4a.c: Likewise.
> >         * gcc.target/i386/pr100865-4b.c: Likewise.
> >         * gcc.target/i386/pr100865-5a.c: Likewise.
> >         * gcc.target/i386/pr100865-5b.c: Likewise.
> >         * gcc.target/i386/pr100865-9a.c: Likewise.
> >         * gcc.target/i386/pr100865-9b.c: Likewise.
> >         * gcc.target/i386/pr102021.c: Likewise.
> >         * gcc.target/i386/pr90773-17.c: Likewise.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
> 
> >+    case E_V2DImode:
> >+      if (CONST_INT_P (val))
> >+        {
> >+          int tmp = (int)INTVAL (val);
> >+          if (tmp == (int)(INTVAL (val) >> 32))
> >+            {
> >+              rtx reg = gen_reg_rtx (V4SImode);
> >+              ok = ix86_vector_duplicate_value (V4SImode, reg,
> >+                                                GEN_INT (tmp));
> >+              if (ok)
> >+                {
> >+                  emit_move_insn (target, gen_lowpart (V2DImode, reg));
> >+                  return true;
> >+                }
> >+            }
> >+          if (!TARGET_AVX)
> >+            return false;
> >+          if (!TARGET_AVX2)
> >+            val = force_const_mem (DImode, val);
> 
> We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so I
> think there is no need to return false or force_const_mem here.
> 
> >+    case E_V4SImode:
> >+      if (CONST_INT_P (val))
> >+        {
> >+          if (!TARGET_AVX)
> >+            return false;
> >+          if (!TARGET_AVX2)
> >+            val = force_const_mem (SImode, val);
> 
> Ditto.
> 
> >+        }
> >+      return ix86_vector_duplicate_value (mode, target, val);
> 
> 
> 
> >         x = gen_reg_rtx (wvmode);
> >         ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
> >-        gcc_assert (ok);
> >+        if (!ok)
> >+          return false;
> Then we can still gcc_assert (ok) here.
> >         emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
> >-        return ok;
> >+        return true;
> 
> 
> >+    case E_V32QImode:
> >+      if (CONST_INT_P (val))
> >+        goto widen;
> >+      /* FALLTHRU */
> >+
> >     case E_V16HFmode:
> >    case E_V16BFmode:
> >-    case E_V32QImode:
> >       if (TARGET_AVX2)
> >         return ix86_vector_duplicate_value (mode, target, val);
> >       else
> >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok,
> machine_mode mode,
> >           rtx x = gen_reg_rtx (hvmode);
> >
> >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> >-          gcc_assert (ok);
> >+          if (!ok)
> >+            return false;
> >
> >           x = gen_rtx_VEC_CONCAT (mode, x, x);
> >           emit_insn (gen_rtx_SET (target, x)); @@ -15941,7 +16003,8 @@
> >ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> >           rtx x = gen_reg_rtx (hvmode);
> >
> >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> >-          gcc_assert (ok);
> >+          if (!ok)
> Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi
> widen to vec_dupv8si.
> >+            return false;
> 
> --
> BR,
> Hongtao

[-- Attachment #2: patchwb5c.txt --]
[-- Type: text/plain, Size: 22806 bytes --]

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index fd1b2a9..ca66620 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -352,7 +352,8 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
   bool ok = ix86_expand_vector_init_duplicate (false, vector_mode,
 					       target,
 					       GEN_INT (val_broadcast));
-  gcc_assert (ok);
+  if (!ok)
+    return nullptr;
   target = lowpart_subreg (mode, target, vector_mode);
   return target;
 }
@@ -599,19 +600,11 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
       && INTEGRAL_MODE_P (mode))
     return nullptr;
 
-  unsigned int msize = GET_MODE_SIZE (mode);
-  unsigned int inner_size = GET_MODE_SIZE (GET_MODE_INNER ((mode)));
-
   /* Convert CONST_VECTOR to a non-standard SSE constant integer
      broadcast only if vector broadcast is available.  */
   if (standard_sse_constant_p (op, mode))
     return nullptr;
 
-  /* vpbroadcast[b,w] is available under TARGET_AVX2.
-     or TARGET_AVX512BW for zmm.  */
-  if (inner_size < 4 && !(msize == 64 ? TARGET_AVX512BW : TARGET_AVX2))
-    return nullptr;
-
   if (GET_MODE_INNER (mode) == TImode)
     return nullptr;
 
@@ -705,22 +698,22 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
 	{
 	  /* Broadcast to XMM/YMM/ZMM register from an integer
 	     constant or scalar mem.  */
-	  op1 = gen_reg_rtx (mode);
-	  if (FLOAT_MODE_P (mode)
-	      || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)
-	      /* vbroadcastss/vbroadcastsd only supports memory operand
-		 w/o AVX2, force them into memory to avoid spill to
-		 memory.  */
-	      || (GET_MODE_SIZE (mode) == 32
-		  && (GET_MODE_INNER (mode) == DImode
-		      || GET_MODE_INNER (mode) == SImode)
-		  && !TARGET_AVX2))
+	  rtx tmp = gen_reg_rtx (mode);
+	  if (FLOAT_MODE_P (mode))
 	    first = force_const_mem (GET_MODE_INNER (mode), first);
 	  bool ok = ix86_expand_vector_init_duplicate (false, mode,
-						       op1, first);
-	  gcc_assert (ok);
-	  emit_move_insn (op0, op1);
-	  return;
+						       tmp, first);
+	  if (!ok && !TARGET_64BIT && GET_MODE_INNER (mode) == DImode)
+	    {
+	      first = force_const_mem (GET_MODE_INNER (mode), first);
+	      ok = ix86_expand_vector_init_duplicate (false, mode,
+						      tmp, first);
+	    }
+	  if (ok)
+	    {
+	      emit_move_insn (op0, tmp);
+	      return;
+	    }
 	}
     }
 
@@ -15714,6 +15707,42 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
   switch (mode)
     {
+    case E_V2DImode:
+      if (CONST_INT_P (val))
+	{
+	  int tmp = (int)INTVAL (val);
+	  if (tmp == (int)(INTVAL (val) >> 32))
+	    {
+	      rtx reg = gen_reg_rtx (V4SImode);
+	      ok = ix86_vector_duplicate_value (V4SImode, reg,
+						GEN_INT (tmp));
+	      if (ok)
+		{
+		  emit_move_insn (target, gen_lowpart (V2DImode, reg));
+		  return true;
+		}
+	    }
+	}
+      return ix86_vector_duplicate_value (mode, target, val);
+
+    case E_V4DImode:
+      if (CONST_INT_P (val))
+	{
+	  int tmp = (int)INTVAL (val);
+	  if (tmp == (int)(INTVAL (val) >> 32))
+	    {
+	      rtx reg = gen_reg_rtx (V8SImode);
+	      ok = ix86_vector_duplicate_value (V8SImode, reg,
+						GEN_INT (tmp));
+	      if (ok)
+		{
+		  emit_move_insn (target, gen_lowpart (V4DImode, reg));
+		  return true;
+		}
+	    }
+	}
+      return ix86_vector_duplicate_value (mode, target, val);
+
     case E_V2SImode:
     case E_V2SFmode:
       if (!mmx_ok)
@@ -15721,11 +15750,9 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       /* FALLTHRU */
 
     case E_V4DFmode:
-    case E_V4DImode:
     case E_V8SFmode:
     case E_V8SImode:
     case E_V2DFmode:
-    case E_V2DImode:
     case E_V4SFmode:
     case E_V4SImode:
     case E_V16SImode:
@@ -15742,6 +15769,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x;
 
 	  val = gen_lowpart (SImode, val);
+	  if (CONST_INT_P (val))
+	    return false;
 	  x = gen_rtx_TRUNCATE (HImode, val);
 	  x = gen_rtx_VEC_DUPLICATE (mode, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15766,6 +15795,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x;
 
 	  val = gen_lowpart (SImode, val);
+	  if (CONST_INT_P (val))
+	    return false;
 	  x = gen_rtx_TRUNCATE (HImode, val);
 	  x = gen_rtx_VEC_DUPLICATE (mode, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15791,6 +15822,10 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       goto widen;
 
     case E_V8HImode:
+      if (CONST_INT_P (val))
+	goto widen;
+      /* FALLTHRU */
+
     case E_V8HFmode:
     case E_V8BFmode:
       if (TARGET_AVX2)
@@ -15838,6 +15873,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
       goto widen;
 
     case E_V16QImode:
+      if (CONST_INT_P (val))
+	goto widen;
       if (TARGET_AVX2)
 	return ix86_vector_duplicate_value (mode, target, val);
 
@@ -15857,7 +15894,13 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
 	val = convert_modes (wsmode, smode, val, true);
 
-	if (smode == QImode && !TARGET_PARTIAL_REG_STALL)
+	if (CONST_INT_P (val))
+	  {
+	    x = simplify_binary_operation (ASHIFT, wsmode, val,
+					   GEN_INT (GET_MODE_BITSIZE (smode)));
+	    val = simplify_binary_operation (IOR, wsmode, val, x);
+	  }
+	else if (smode == QImode && !TARGET_PARTIAL_REG_STALL)
 	  emit_insn (gen_insv_1 (wsmode, val, val));
 	else
 	  {
@@ -15870,15 +15913,20 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 
 	x = gen_reg_rtx (wvmode);
 	ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
-	gcc_assert (ok);
+	if (!ok)
+	  return false;
 	emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
-	return ok;
+	return true;
       }
 
     case E_V16HImode:
+    case E_V32QImode:
+      if (CONST_INT_P (val))
+	goto widen;
+      /* FALLTHRU */
+
     case E_V16HFmode:
     case E_V16BFmode:
-    case E_V32QImode:
       if (TARGET_AVX2)
 	return ix86_vector_duplicate_value (mode, target, val);
       else
@@ -15904,7 +15952,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x = gen_reg_rtx (hvmode);
 
 	  ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
-	  gcc_assert (ok);
+	  if (!ok)
+	    return false;
 
 	  x = gen_rtx_VEC_CONCAT (mode, x, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -15941,7 +15990,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
 	  rtx x = gen_reg_rtx (hvmode);
 
 	  ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
-	  gcc_assert (ok);
+	  if (!ok)
+	    return false;
 
 	  x = gen_rtx_VEC_CONCAT (mode, x, x);
 	  emit_insn (gen_rtx_SET (target, x));
@@ -16913,6 +16963,12 @@ ix86_expand_vector_init (bool mmx_ok, rtx target, rtx vals)
 	all_same = false;
     }
 
+  /* If all values are identical, broadcast the value.  */
+  if (all_same
+      && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
+					    XVECEXP (vals, 0, 0)))
+    return;
+
   /* Constants are best loaded from the constant pool.  */
   if (n_var == 0)
     {
@@ -16920,12 +16976,6 @@ ix86_expand_vector_init (bool mmx_ok, rtx target, rtx vals)
       return;
     }
 
-  /* If all values are identical, broadcast the value.  */
-  if (all_same
-      && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
-					    XVECEXP (vals, 0, 0)))
-    return;
-
   /* Values where only one field is non-constant are best loaded from
      the pool and overwritten via move later.  */
   if (n_var == 1)
diff --git a/gcc/testsuite/gcc.target/i386/auto-init-8.c b/gcc/testsuite/gcc.target/i386/auto-init-8.c
index 666ee14..7023d72 100644
--- a/gcc/testsuite/gcc.target/i386/auto-init-8.c
+++ b/gcc/testsuite/gcc.target/i386/auto-init-8.c
@@ -29,7 +29,7 @@ double foo()
   return result;
 }
 
-/* { dg-final { scan-rtl-dump-times "0xfffffffffefefefe" 1 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "0xfffffffffefefefe" 3 "expand" } } */
 /* { dg-final { scan-rtl-dump-times "\\\[0xfefefefefefefefe\\\]" 2 "expand" } } */
 /* { dg-final { scan-rtl-dump-times "0xfffffffffffffffe\\\]\\\) repeated x16" 2 "expand" } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
index 0fa93e0..138dbb4 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -3,8 +3,7 @@
 /* { dg-options "-O2 -mavx512f -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 5 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } } */
 /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to16\\\}" 2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 3 { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
index f1b672a..d22251b 100644
--- a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-5.c
@@ -3,7 +3,7 @@
 /* { dg-options "-O2 -mavx512f" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to8\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %zmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %zmm\[0-9\]+" 4 { target { ! ia32 } } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-13.c b/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
index b73a8f4..f431b8a 100644
--- a/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-13.c
@@ -126,7 +126,7 @@ abs256_ph (__m256h a)
   return _mm256_abs_ph (a);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[^\n\]*%ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpand\[^\n\]*%ymm\[0-9\]+" 1 } } */
 
 __m128h
@@ -136,5 +136,5 @@ abs_ph (__m128h a)
   return _mm_abs_ph (a);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[^\n\]*%xmm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%xmm\[0-9\]+" 1 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpand\[^\n\]*%xmm\[0-9\]+" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
index 0304b9d..e6df4d2 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -3,10 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target { ! ia32 } } } }  */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 5 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 7 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 } }  */
 /* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to8\\\}" 2 } }  */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 3 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
index 0ba0cd9..ebdc361 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-5.c
@@ -3,8 +3,8 @@
 /* { dg-options "-O2 -mavx512f -mavx512vl" } */
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 /* { dg-additional-options "-mdynamic-no-pic" { target { *-*-darwin* && ia32 } } }
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to2\\\}" 4 { target ia32 } } } */
-/* { dg-final { scan-assembler-times "\[^\n\]*\\\{1to4\\\}" 4 { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to2\\\}" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "\[^\n\]*\\\{1to4\\\}" { target ia32 } } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 4 } } */
 /* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %xmm\[0-9\]+" 4 { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-1.c b/gcc/testsuite/gcc.target/i386/pr100865-1.c
index 949dd5c..75cd463 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-1.c
@@ -9,5 +9,6 @@ foo (void)
   __builtin_memset (dst, 3, 16);
 }
 
-/* { dg-final { scan-assembler-times "movdqa\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movd\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "pshufd" 1 } } */
 /* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10a.c b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
index 1d849a3..3bc0f1a 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-10a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10a.c
@@ -29,5 +29,5 @@ foo (void)
     array[i] = MK_CONST128_BROADCAST (0x1f);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-10b.c b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
index e5616d8..f60d1bf 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-10b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-10b.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-10a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 8 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-2.c b/gcc/testsuite/gcc.target/i386/pr100865-2.c
index 090a010..e0265d2 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-2.c
@@ -10,6 +10,6 @@ foo (void)
   __builtin_memset (dst, 3, 16);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-3.c b/gcc/testsuite/gcc.target/i386/pr100865-3.c
index cde4b1c..433fd81 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-3.c
@@ -10,7 +10,7 @@ foo (void)
   __builtin_memset (dst, 3, 16);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4a.c b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
index bd99945..8009e5c 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-4a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4a.c
@@ -12,6 +12,6 @@ foo (void)
     array[i] = -45;
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 2 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-4b.c b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
index 1814306..6fd703e 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-4b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-4b.c
@@ -4,8 +4,8 @@
 
 #include "pr100865-4a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastb\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]%ymm\[0-9\]+, " 2 } } */
 /* { dg-final { scan-assembler-times "vzeroupper" 1 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastb\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5a.c b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
index b023fca..d6fb79e 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-5a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5a.c
@@ -12,6 +12,6 @@ foo (void)
     array[i] = -45;
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, " 4 } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-5b.c b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
index 5bccfd0..6c2b33d 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-5b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-5b.c
@@ -4,7 +4,7 @@
 
 #include "pr100865-5a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %ymm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu16\[\\t \]%ymm\[0-9\]+, " 4 } } */
-/* { dg-final { scan-assembler-not "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %ymm\[0-9\]+" } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9a.c b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
index 45d0e0d..f2ac1bd 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-9a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9a.c
@@ -21,5 +21,5 @@ foo (void)
     array[i] = MK_CONST128_BROADCAST (0x1fff);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%xmm\[0-9\]+, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-9b.c b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
index 1469624..e2a3f92 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-9b.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-9b.c
@@ -3,5 +3,5 @@
 
 #include "pr100865-9a.c"
 
-/* { dg-final { scan-assembler-times "vpbroadcastw\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%(?:r|e)\[^\n\]*, %xmm\[0-9\]+" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqa\[\\t \]%xmm\[0-9\]+, " 16 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr102021.c b/gcc/testsuite/gcc.target/i386/pr102021.c
index a5012a4..8ff898d 100644
--- a/gcc/testsuite/gcc.target/i386/pr102021.c
+++ b/gcc/testsuite/gcc.target/i386/pr102021.c
@@ -10,7 +10,7 @@ foo ()
   return _mm256_set1_epi16 (12);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+%r\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "vpbroadcastq\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+%\[^\n\]*, %ymm\[0-9\]+" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd\[\\t \]+\[^\n\]*, %ymm\[0-9\]+" 1 { target ia32 } } } */
 /* { dg-final { scan-assembler-not "vmovdqa" } } */
 /* { dg-final { scan-assembler-not "vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr90773-17.c b/gcc/testsuite/gcc.target/i386/pr90773-17.c
index 3036085..61b2bfd 100644
--- a/gcc/testsuite/gcc.target/i386/pr90773-17.c
+++ b/gcc/testsuite/gcc.target/i386/pr90773-17.c
@@ -10,6 +10,6 @@ foo (void)
   __builtin_memset (dst, 12, 19);
 }
 
-/* { dg-final { scan-assembler-times "vpbroadcastb" 1 } } */
+/* { dg-final { scan-assembler-times "vpbroadcastd" 1 } } */
 /* { dg-final { scan-assembler-times "vmovdqu8\[\\t \]+%xmm\[0-9\]+, \\(%\[\^,\]+\\)" 1 } } */
 /* { dg-final { scan-assembler-times "vmovd\[\\t \]+%xmm\[0-9\]+, 16\\(%\[\^,\]+\\)" 1 { xfail *-*-* } } } */

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.
  2024-01-06 22:53   ` Roger Sayle
@ 2024-01-08  1:58     ` Hongtao Liu
  0 siblings, 0 replies; 4+ messages in thread
From: Hongtao Liu @ 2024-01-08  1:58 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Uros Bizjak

On Sun, Jan 7, 2024 at 6:53 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> Hi Hongtao,
>
> Many thanks for the review.  This revised patch implements several
> of your suggestions, specifically to use pshufd for V4SImode and
> punpcklqdq for V2DImode.  These changes are demonstrated by the
> examples below:
>
> typedef unsigned int v4si __attribute((vector_size(16)));
> typedef unsigned long long v2di __attribute((vector_size(16)));
>
> v4si foo() { return (v4si){1,1,1,1}; }
> v2di bar() { return (v2di){1,1}; }
>
> The previous version of my patch generated:
>
> foo:    movdqa  .LC0(%rip), %xmm0
>         ret
> bar:    movdqa  .LC1(%rip), %xmm0
>         ret
>
> with this revised version, -O2 generates:
>
> foo:    movl    $1, %eax
>         movd    %eax, %xmm0
>         pshufd  $0, %xmm0, %xmm0
>         ret
> bar:    movl    $1, %eax
>         movq    %rax, %xmm0
>         punpcklqdq      %xmm0, %xmm0
>         ret
>
> However, if it's OK with you, I'd prefer to allow this function to
> return false, safely falling back to emitting a vector load from
> the constant bool rather than ICEing from a gcc_assert.  For one
Sure, that makes sense.
> thing this isn't a unrecoverable correctness issue, but at worst
> a missed optimization.  The deeper reason is that this usefully
> provides a handle for tuning on different microarchitectures.
> On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC,
> the first form above may be preferable to the second.  Currently
> the start of ix86_convert_const_wide_int_to_broadcast disables
> broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an
> implementation doesn't reuire an inter unit move, such as a
> broadcast from memory.  I plan follow-up patches that benefit
> from this flexibility.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok.
>
> gcc/ChangeLog
>         PR target/112992
>         * config/i386/i386-expand.cc
>         (ix86_convert_const_wide_int_to_broadcast): Allow call to
>         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
>         (ix86_broadcast_from_constant): Revert recent change; Return a
>         suitable MEMREF independently of mode/target combinations.
>         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
>         to decide whether expansion is possible/preferrable.  Only try
>         forcing DImode constants to memory (and trying again) if calling
>         ix86_expand_vector_init_duplicate fails with an DImode immediate
>         constant.
>         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
>         V4SImode for suitable immediate constants.
>         <case E_V4DImode>: Try using V8SImode for suitable constants.
>         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
>         <case E_V2HImode>: Likewise.
>         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
>         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
>         <label widen>: Handle CONT_INTs via simplify_binary_operation.
>         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
>         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
>         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
>         (ix86_expand_vector_init): Move try using a broadcast for all_same
>         with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/auto-init-8.c: Update test case.
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/avx512fp16-13.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
>         * gcc.target/i386/pr100865-1.c: Likewise.
>         * gcc.target/i386/pr100865-10a.c: Likewise.
>         * gcc.target/i386/pr100865-10b.c: Likewise.
>         * gcc.target/i386/pr100865-2.c: Likewise.
>         * gcc.target/i386/pr100865-3.c: Likewise.
>         * gcc.target/i386/pr100865-4a.c: Likewise.
>         * gcc.target/i386/pr100865-4b.c: Likewise.
>         * gcc.target/i386/pr100865-5a.c: Likewise.
>         * gcc.target/i386/pr100865-5b.c: Likewise.
>         * gcc.target/i386/pr100865-9a.c: Likewise.
>         * gcc.target/i386/pr100865-9b.c: Likewise.
>         * gcc.target/i386/pr102021.c: Likewise.
>         * gcc.target/i386/pr90773-17.c: Likewise.
>
> Thanks in advance.
> Roger
> --
>
> > -----Original Message-----
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: 02 January 2024 05:40
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak <ubizjak@gmail.com>
> > Subject: Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of
> > constants.
> >
> > On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > This patch resolves the second part of PR target/112992, building upon
> > > Hongtao Liu's solution to the first part.
> > >
> > > The issue addressed by this patch is that when initializing vectors by
> > > broadcasting integer constants, the compiler has the flexibility to
> > > select the most appropriate vector mode to perform the broadcast, as
> > > long as the resulting vector has an identical bit pattern.  For
> > > example, the following constants are all equivalent:
> > > V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 } V8HImode
> > > {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> > > V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ...
> > > 0x01 } So instruction sequences that construct any of these can be
> > > used to construct the others (with a suitable cast/SUBREG).
> > >
> > > On x86_64, it turns out that broadcasts of SImode constants are
> > > preferred, as DImode constants often require a longer movabs
> > > instruction, and HImode and QImode broadcasts require multiple uops on some
> > architectures.
> > > Hence, SImode is always the equal shortest/fastest implementation.
> > >
> > > Examples of this improvement, can be seen in the testsuite.
> > >
> > > gcc.target/i386/pr102021.c
> > > Before:
> > >    0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
> > >    7:   00 0c 00
> > >    a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
> > >   10:   c3                      retq
> > >
> > > After:
> > >    0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
> > >    5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
> > >    b:   c3                      retq
> > >
> > > and
> > > gcc.target/i386/pr90773-17.c:
> > > Before:
> > >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> > >    7:   b8 0c 00 00 00          mov    $0xc,%eax
> > >    c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
> > >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> > >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> > >   1f:   c3                      retq
> > >
> > > After:
> > >    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7 <foo+0x7>
> > >    7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
> > >    c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
> > >   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
> > >   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
> > >   1f:   c3                      retq
> > >
> > > where according to Agner Fog's instruction tables broadcastd is
> > > slightly faster on some microarchitectures, for example Knight's Landing.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-12-21  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/112992
> > >         * config/i386/i386-expand.cc
> > >         (ix86_convert_const_wide_int_to_broadcast): Allow call to
> > >         ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> > >         (ix86_broadcast_from_constant): Revert recent change; Return a
> > >         suitable MEMREF independently of mode/target combinations.
> > >         (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> > >         to decide whether expansion is possible/preferrable.  Only try
> > >         forcing DImode constants to memory (and trying again) if calling
> > >         ix86_expand_vector_init_duplicate fails with an DImode immediate
> > >         constant.
> > >         (ix86_expand_vector_init_duplicate) <case E_V2DImode>: Try using
> > >         V4SImode for suitable immediate constants.
> > >         <case E_V4DImode>: Try using V8SImode for suitable constants.
> > >         <case E_V4SImode>: Use constant pool for AVX without AVX2.
> > >         <case E_V4HImode>: Fail for CONST_INT_P, i.e. use constant pool.
> > >         <case E_V2HImode>: Likewise.
> > >         <case E_V8HImode>: For CONST_INT_P try using V4SImode via widen.
> > >         <case E_V16QImode>: For CONT_INT_P try using V8HImode via widen.
> > >         <label widen>: Handle CONT_INTs via simplify_binary_operation.
> > >         Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> > >         <case E_V16HImode>: For CONST_INT_P try V8SImode via widen.
> > >         <case E_V32QImode>: For CONST_INT_P try V16HImode via widen.
> > >         (ix86_expand_vector_init): Move try using a broadcast for all_same
> > >         with ix86_expand_vector_init_duplicate before using constant pool.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
> > >         * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> > >         * gcc.target/i386/avx512fp16-13.c: Likewise.
> > >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> > >         * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> > >         * gcc.target/i386/pr100865-10a.c: Likewise.
> > >         * gcc.target/i386/pr100865-10b.c: Likewise.
> > >         * gcc.target/i386/pr100865-11c.c: Likewise.
> > >         * gcc.target/i386/pr100865-12c.c: Likewise.
> > >         * gcc.target/i386/pr100865-2.c: Likewise.
> > >         * gcc.target/i386/pr100865-3.c: Likewise.
> > >         * gcc.target/i386/pr100865-4a.c: Likewise.
> > >         * gcc.target/i386/pr100865-4b.c: Likewise.
> > >         * gcc.target/i386/pr100865-5a.c: Likewise.
> > >         * gcc.target/i386/pr100865-5b.c: Likewise.
> > >         * gcc.target/i386/pr100865-9a.c: Likewise.
> > >         * gcc.target/i386/pr100865-9b.c: Likewise.
> > >         * gcc.target/i386/pr102021.c: Likewise.
> > >         * gcc.target/i386/pr90773-17.c: Likewise.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > >+    case E_V2DImode:
> > >+      if (CONST_INT_P (val))
> > >+        {
> > >+          int tmp = (int)INTVAL (val);
> > >+          if (tmp == (int)(INTVAL (val) >> 32))
> > >+            {
> > >+              rtx reg = gen_reg_rtx (V4SImode);
> > >+              ok = ix86_vector_duplicate_value (V4SImode, reg,
> > >+                                                GEN_INT (tmp));
> > >+              if (ok)
> > >+                {
> > >+                  emit_move_insn (target, gen_lowpart (V2DImode, reg));
> > >+                  return true;
> > >+                }
> > >+            }
> > >+          if (!TARGET_AVX)
> > >+            return false;
> > >+          if (!TARGET_AVX2)
> > >+            val = force_const_mem (DImode, val);
> >
> > We can use pshufd/pshufss to broadcast for v4si and punpcklqdq for v2di, so I
> > think there is no need to return false or force_const_mem here.
> >
> > >+    case E_V4SImode:
> > >+      if (CONST_INT_P (val))
> > >+        {
> > >+          if (!TARGET_AVX)
> > >+            return false;
> > >+          if (!TARGET_AVX2)
> > >+            val = force_const_mem (SImode, val);
> >
> > Ditto.
> >
> > >+        }
> > >+      return ix86_vector_duplicate_value (mode, target, val);
> >
> >
> >
> > >         x = gen_reg_rtx (wvmode);
> > >         ok = ix86_expand_vector_init_duplicate (mmx_ok, wvmode, x, val);
> > >-        gcc_assert (ok);
> > >+        if (!ok)
> > >+          return false;
> > Then we can still gcc_assert (ok) here.
> > >         emit_move_insn (target, gen_lowpart (GET_MODE (target), x));
> > >-        return ok;
> > >+        return true;
> >
> >
> > >+    case E_V32QImode:
> > >+      if (CONST_INT_P (val))
> > >+        goto widen;
> > >+      /* FALLTHRU */
> > >+
> > >     case E_V16HFmode:
> > >    case E_V16BFmode:
> > >-    case E_V32QImode:
> > >       if (TARGET_AVX2)
> > >         return ix86_vector_duplicate_value (mode, target, val);
> > >       else
> > >@@ -15904,7 +15965,8 @@ ix86_expand_vector_init_duplicate (bool mmx_ok,
> > machine_mode mode,
> > >           rtx x = gen_reg_rtx (hvmode);
> > >
> > >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >-          gcc_assert (ok);
> > >+          if (!ok)
> > >+            return false;
> > >
> > >           x = gen_rtx_VEC_CONCAT (mode, x, x);
> > >           emit_insn (gen_rtx_SET (target, x)); @@ -15941,7 +16003,8 @@
> > >ix86_expand_vector_init_duplicate (bool mmx_ok, machine_mode mode,
> > >           rtx x = gen_reg_rtx (hvmode);
> > >
> > >           ok = ix86_expand_vector_init_duplicate (false, hvmode, x, val);
> > >-          gcc_assert (ok);
> > >+          if (!ok)
> > Assume it's always true for vec_dupv8si?(vec_dupv32qi widen to vec_dupv16hi
> > widen to vec_dupv8si.
> > >+            return false;
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-01-08  1:50 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-22 10:25 [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants Roger Sayle
2024-01-02  5:39 ` Hongtao Liu
2024-01-06 22:53   ` Roger Sayle
2024-01-08  1:58     ` Hongtao Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).