public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/4] Update mtune=tremont
@ 2021-09-15  8:09 lili.cui
  2021-09-15  8:09 ` [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont lili.cui
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: lili.cui @ 2021-09-15  8:09 UTC (permalink / raw)
  To: ubizjak; +Cc: gcc-patches, hongtao.liu, hjl.tools, Cui,Lili

From: "Cui,Lili" <lili.cui@intel.com>

Hi,

I have four patches for Tremont tuning.  With all patches applied, the
performance impacts on SPEC CPU 2017 are:

    500.perlbench_r         1.81%
    502.gcc_r               0.57%
    505.mcf_r               1.16%
    520.omnetpp_r           0.00%
    523.xalancbmk_r         0.00%
    525.x264_r              4.55%
    531.deepsjeng_r         0.00%
    541.leela_r             0.39%
    548.exchange2_r         1.13%
    557.xz_r                0.00%
    geomean for intrate     0.95%
    503.bwaves_r            0.00%
    507.cactuBSSN_r         6.94%
    508.namd_r              12.37%
    510.parest_r            1.01%
    511.povray_r            3.70%
    519.lbm_r               36.61%
    521.wrf_r               8.79%
    526.blender_r           2.91%
    527.cam4_r              6.23%
    538.imagick_r           0.28%
    544.nab_r               21.99%
    549.fotonik3d_r         3.63%
    554.roms_r              -1.20%
    geomean for fprate      7.50%
	
Bootstrapped and regtested on x86_64-linux-gnu{-m32,-m64}.
Ok for master?

  x86: Update -mtune=tremont
  x86: Update memcpy/memset inline strategies for -mtune=tremont
  x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY

 gcc/common/config/i386/i386-common.c          |   2 +-
 gcc/config/i386/i386-features.c               |  23 +++-
 gcc/config/i386/i386-options.c                |   2 +-
 gcc/config/i386/i386.c                        |   1 +
 gcc/config/i386/i386.h                        |   4 +
 gcc/config/i386/i386.md                       |   9 +-
 gcc/config/i386/x86-tune-costs.h              | 124 ++++++++++++++++++
 gcc/config/i386/x86-tune-sched.c              |   2 +
 gcc/config/i386/x86-tune.def                  |  52 +++++---
 gcc/testsuite/gcc.target/i386/avx-covert-1.c  |  19 +++
 .../gcc.target/i386/avx-fp-covert-1.c         |  15 +++
 .../gcc.target/i386/avx-int-covert-1.c        |  14 ++
 gcc/testsuite/gcc.target/i386/pr101900-1.c    |  18 +++
 gcc/testsuite/gcc.target/i386/pr101900-2.c    |  18 +++
 gcc/testsuite/gcc.target/i386/pr101900-3.c    |  19 +++
 gcc/testsuite/gcc.target/i386/sse-covert-1.c  |  19 +++
 .../gcc.target/i386/sse-fp-covert-1.c         |  15 +++
 .../gcc.target/i386/sse-int-covert-1.c        |  14 ++
 18 files changed, 344 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c

-- 
2.17.1

Thanks,
Lili.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont
  2021-09-15  8:09 [PATCH 0/4] Update mtune=tremont lili.cui
@ 2021-09-15  8:09 ` lili.cui
  2021-09-16  6:35   ` Uros Bizjak
  2021-09-15  8:09 ` [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont lili.cui
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: lili.cui @ 2021-09-15  8:09 UTC (permalink / raw)
  To: ubizjak; +Cc: gcc-patches, hongtao.liu, hjl.tools

From: "H.J. Lu" <hjl.tools@gmail.com>

Initial -mtune=tremont update

1. Use Haswell scheduling model.
2. Assume that the stack engine allows push and pop instructions to
execute in parallel.
3. Prepare for the scheduling pass in the same way as -mtune=generic.
4. Use the same issue rate as -mtune=generic.
5. Enable sse_partial_reg_dependency.
6. Disable accumulate_outgoing_args.
7. Enable use_leave.
8. Enable push_memory.
9. Disable four_jump_limit.
10. Disable opt_agu.
11. Disable avoid_lea_for_addr.
12. Disable avoid_mem_opnd_for_cmove.
13. Enable misaligned_move_string_pro_epilogues.
14. Enable use_cltd.
15. Enable avoid_false_dep_for_bmi.
16. Enable avoid_mfence.
17. Disable expand_abs.
18. Enable sse_typeless_stores.
19. Enable sse_load0_by_pxor.
20. Disable split_mem_opnd_for_fp_converts.
21. Disable slow_pshufb.
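
As background for the enable/disable items above: each DEF_TUNE entry in
x86-tune.def carries a bitmask of processors, so enabling a tuning for
Tremont means OR-ing m_TREMONT into the mask and disabling it means
dropping that bit.  The sketch below is a simplified model of that idea,
not GCC code.  The enum, the M macro, the mask contents, and
tune_enabled are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical processor indices for illustration only; GCC's real
   enumeration is much longer.  */
enum proc
{
  PROC_HASWELL,
  PROC_GOLDMONT_PLUS,
  PROC_TREMONT,
  PROC_GENERIC
};

/* One bit per processor, in the spirit of the m_* masks.  */
#define M(p) (UINT64_C (1) << (p))

/* Simplified "use_leave" mask before and after adding Tremont.  */
const uint64_t use_leave_before = M (PROC_HASWELL) | M (PROC_GENERIC);
const uint64_t use_leave_after
  = M (PROC_HASWELL) | M (PROC_TREMONT) | M (PROC_GENERIC);

/* Simplified "opt_agu" mask before and after dropping Tremont.  */
const uint64_t opt_agu_before = M (PROC_GOLDMONT_PLUS) | M (PROC_TREMONT);
const uint64_t opt_agu_after = M (PROC_GOLDMONT_PLUS);

/* A tuning is active when the selected processor's bit is set.  */
bool
tune_enabled (uint64_t mask, enum proc p)
{
  return (mask & M (p)) != 0;
}
```

Entries written as ~(...) in x86-tune.def, such as use_cltd, work the
other way round: removing m_TREMONT from the negated list turns the
tuning on.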

This is the first patch to tune for Tremont.  With all patches applied,
performance impacts on SPEC CPU 2017 are:

500.perlbench_r         1.81%
502.gcc_r               0.57%
505.mcf_r               1.16%
520.omnetpp_r           0.00%
523.xalancbmk_r         0.00%
525.x264_r              4.55%
531.deepsjeng_r         0.00%
541.leela_r             0.39%
548.exchange2_r         1.13%
557.xz_r                0.00%
geomean for intrate     0.95%
503.bwaves_r            0.00%
507.cactuBSSN_r         6.94%
508.namd_r              12.37%
510.parest_r            1.01%
511.povray_r            3.70%
519.lbm_r               36.61%
521.wrf_r               8.79%
526.blender_r           2.91%
527.cam4_r              6.23%
538.imagick_r           0.28%
544.nab_r               21.99%
549.fotonik3d_r         3.63%
554.roms_r              -1.20%
geomean for fprate      7.50%

gcc/ChangeLog

	* common/config/i386/i386-common.c: Use Haswell scheduling model
	for Tremont.
	* config/i386/i386.c (ix86_sched_init_global): Prepare for Tremont
	scheduling pass.
	* config/i386/x86-tune-sched.c (ix86_issue_rate): Change Tremont
	issue rate to 4.
	(ix86_adjust_cost): Handle Tremont.
	* config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY):
	Enable for Tremont.
	(X86_TUNE_USE_LEAVE): Likewise.
	(X86_TUNE_PUSH_MEMORY): Likewise.
	(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
	(X86_TUNE_USE_CLTD): Likewise.
	(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
	(X86_TUNE_AVOID_MFENCE): Likewise.
	(X86_TUNE_SSE_TYPELESS_STORES): Likewise.
	(X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Disable for Tremont.
	(X86_TUNE_FOUR_JUMP_LIMIT): Likewise.
	(X86_TUNE_OPT_AGU): Likewise.
	(X86_TUNE_AVOID_LEA_FOR_ADDR): Likewise.
	(X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE): Likewise.
	(X86_TUNE_EXPAND_ABS): Likewise.
	(X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS): Likewise.
	(X86_TUNE_SLOW_PSHUFB): Likewise.
---
 gcc/common/config/i386/i386-common.c |  2 +-
 gcc/config/i386/i386.c               |  1 +
 gcc/config/i386/x86-tune-sched.c     |  2 ++
 gcc/config/i386/x86-tune.def         | 37 ++++++++++++++--------------
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
index 00c65ba15ab..2c9e1ccbc6e 100644
--- a/gcc/common/config/i386/i386-common.c
+++ b/gcc/common/config/i386/i386-common.c
@@ -1935,7 +1935,7 @@ const pta processor_alias_table[] =
     M_CPU_TYPE (INTEL_GOLDMONT), P_PROC_SSE4_2},
   {"goldmont-plus", PROCESSOR_GOLDMONT_PLUS, CPU_GLM, PTA_GOLDMONT_PLUS,
     M_CPU_TYPE (INTEL_GOLDMONT_PLUS), P_PROC_SSE4_2},
-  {"tremont", PROCESSOR_TREMONT, CPU_GLM, PTA_TREMONT,
+  {"tremont", PROCESSOR_TREMONT, CPU_HASWELL, PTA_TREMONT,
     M_CPU_TYPE (INTEL_TREMONT), P_PROC_SSE4_2},
   {"knl", PROCESSOR_KNL, CPU_SLM, PTA_KNL,
     M_CPU_TYPE (INTEL_KNL), P_PROC_AVX512F},
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b173bc0beb..2927e2884c9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16976,6 +16976,7 @@ ix86_sched_init_global (FILE *, int, int)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Do not perform multipass scheduling for pre-reload schedule
          to save compile time.  */
diff --git a/gcc/config/i386/x86-tune-sched.c b/gcc/config/i386/x86-tune-sched.c
index 2e5ee4e4444..56ada99a450 100644
--- a/gcc/config/i386/x86-tune-sched.c
+++ b/gcc/config/i386/x86-tune-sched.c
@@ -71,6 +71,7 @@ ix86_issue_rate (void)
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       return 4;
 
@@ -429,6 +430,7 @@ ix86_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
     case PROCESSOR_NEHALEM:
     case PROCESSOR_SANDYBRIDGE:
     case PROCESSOR_HASWELL:
+    case PROCESSOR_TREMONT:
     case PROCESSOR_GENERIC:
       /* Stack engine allows to execute push&pop instructions in parall.  */
       if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 2f221b1f8c9..385e275bbd9 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -62,7 +62,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
    that can be partly masked by careful scheduling of moves.  */
 DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
           m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
-	  | m_BDVER | m_ZNVER | m_GENERIC)
+	  | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
@@ -136,7 +136,7 @@ DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
 
 DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
 	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_ATHLON_K8)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_ATHLON_K8)
 
 /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
    considered on critical path.  */
@@ -150,14 +150,15 @@ DEF_TUNE (X86_TUNE_EPILOGUE_USING_MOVE, "epilogue_using_move",
 
 /* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
 DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
-	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)
 
 /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
    Some chips, like 486 and Pentium works faster with separate load
    and push instructions.  */
 DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
           m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
-          | m_GENERIC)
+          | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_SINGLE_PUSH: Enable if single push insn is preferred
    over esp subtraction.  */
@@ -198,8 +199,7 @@ DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns",
    than 4 branch instructions in the 16 byte window.  */
 DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
           m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM
-	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL | m_ATHLON_K8
-	  | m_AMDFAM10)
+	  | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL | m_ATHLON_K8 | m_AMDFAM10)
 
 /*****************************************************************************/
 /* Integer instruction selection tuning                                      */
@@ -240,11 +240,11 @@ DEF_TUNE (X86_TUNE_INTEGER_DFMODE_MOVES, "integer_dfmode_moves",
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT | m_KNL
-	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	 | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL)
 
 /* X86_TUNE_AVOID_LEA_FOR_ADDR: Avoid lea for address computation.  */
 DEF_TUNE (X86_TUNE_AVOID_LEA_FOR_ADDR, "avoid_lea_for_addr",
-	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT
+	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS
 	  | m_KNL | m_KNM)
 
 /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
@@ -263,7 +263,7 @@ DEF_TUNE (X86_TUNE_SLOW_IMUL_IMM8, "slow_imul_imm8",
    a conditional move.  */
 DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
 	  m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_KNL
-	  | m_KNM | m_TREMONT | m_INTEL)
+	  | m_KNM | m_INTEL)
 
 /* X86_TUNE_SINGLE_STRINGOP: Enable use of single string operations, such
    as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
@@ -282,7 +282,8 @@ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
    FIXME: This may actualy be a win on more targets than listed here.  */
 DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES,
 	  "misaligned_move_string_pro_epilogues",
-	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_GENERIC)
+	  m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_TREMONT
+	  | m_GENERIC)
 
 /* X86_TUNE_USE_SAHF: Controls use of SAHF.  */
 DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
@@ -294,7 +295,7 @@ DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
 /* X86_TUNE_USE_CLTD: Controls use of CLTD and CTQO instructions.  */
 DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd",
 	  ~(m_PENT | m_LAKEMONT | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
-	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT))
+	    | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS))
 
 /* X86_TUNE_USE_BT: Enable use of BT (bit test) instructions.  */
 DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
@@ -305,7 +306,7 @@ DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
 /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency
    for bit-manipulation instructions.  */
 DEF_TUNE (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI, "avoid_false_dep_for_bmi",
-	  m_SANDYBRIDGE | m_CORE_AVX2 | m_GENERIC)
+	  m_SANDYBRIDGE | m_CORE_AVX2 | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
    on hardware capabilities. Bdver3 hardware has a loop buffer which makes
@@ -321,14 +322,14 @@ DEF_TUNE (X86_TUNE_ONE_IF_CONV_INSN, "one_if_conv_insn",
 
 /* X86_TUNE_AVOID_MFENCE: Use lock prefixed instructions instead of mfence.  */
 DEF_TUNE (X86_TUNE_AVOID_MFENCE, "avoid_mfence",
-	 m_CORE_ALL | m_BDVER | m_ZNVER | m_GENERIC)
+	 m_CORE_ALL | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_EXPAND_ABS: This enables a new abs pattern by
    generating instructions for abs (x) = (((signed) x >> (W-1) ^ x) -
    (signed) x >> (W-1)) instead of cmove or SSE max/abs instructions.  */
 DEF_TUNE (X86_TUNE_EXPAND_ABS, "expand_abs",
 	  m_CORE_ALL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT )
+	  | m_GOLDMONT_PLUS)
 
 /*****************************************************************************/
 /* 387 instruction selection tuning                                          */
@@ -386,13 +387,13 @@ DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optim
 
 /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.   */
 DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
-	  m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
+	  m_AMD_MULTIPLE | m_CORE_ALL | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_SSE_LOAD0_BY_PXOR: Always use pxor to load0 as opposed to
    xorps/xorpd and other variants.  */
 DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
 	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BDVER | m_BTVER | m_ZNVER
-	  | m_GENERIC)
+	  | m_TREMONT | m_GENERIC)
 
 /* X86_TUNE_INTER_UNIT_MOVES_TO_VEC: Enable moves in from integer
    to SSE registers.  If disabled, the moves will be done by storing
@@ -419,7 +420,7 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
 	  m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS
-	  | m_TREMONT | m_INTEL)
+	  | m_INTEL)
 
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP.  This form of instructions avoids partial write to the
@@ -434,7 +435,7 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
 /* X86_TUNE_SLOW_SHUFB: Indicates tunings with slow pshufb instruction.  */
 DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
 	  m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
-	  | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
+	  | m_GOLDMONT_PLUS | m_INTEL)
 
 /* X86_TUNE_AVOID_4BYTE_PREFIXES: Avoid instructions requiring 4+ bytes of prefixes.  */
 DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont
  2021-09-15  8:09 [PATCH 0/4] Update mtune=tremont lili.cui
  2021-09-15  8:09 ` [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont lili.cui
@ 2021-09-15  8:09 ` lili.cui
  2021-09-16  6:36   ` Uros Bizjak
  2021-09-15  8:09 ` [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS lili.cui
  2021-09-15  8:09 ` [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY lili.cui
  3 siblings, 1 reply; 16+ messages in thread
From: lili.cui @ 2021-09-15  8:09 UTC (permalink / raw)
  To: ubizjak; +Cc: gcc-patches, hongtao.liu, hjl.tools

From: "H.J. Lu" <hjl.tools@gmail.com>

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=tremont:

1. Create Tremont cost model from generic cost model.
2. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
3. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
4. Use memcpy/memset library function if data size is unknown or > 256.
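
The policy in items 3 and 4 can be sketched as a small decision function.
This models the description only; it is not GCC's strategy selection
code, and it leaves out the by-pieces expansion from item 2.  The enum,
the limit macro, and pick_strategy are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three outcomes described above.  */
enum strategy
{
  STRAT_REP_BYTE,	/* "rep movsb/stosb" */
  STRAT_LOOP,		/* inline loop */
  STRAT_LIBCALL		/* call memcpy/memset */
};

/* Inline limit taken from the description (<= 256 bytes).  */
#define TREMONT_INLINE_LIMIT 256

/* bound_known: the compiler knows the size is at most max_size.
   size_is_constant: the size is a compile-time constant.  */
enum strategy
pick_strategy (bool size_is_constant, bool bound_known, size_t max_size)
{
  if (!bound_known || max_size > TREMONT_INLINE_LIMIT)
    return STRAT_LIBCALL;
  return size_is_constant ? STRAT_REP_BYTE : STRAT_LOOP;
}
```

For example, a constant 64-byte copy maps to STRAT_REP_BYTE, a copy
bounded by 200 bytes maps to STRAT_LOOP, and anything unbounded or
larger than 256 bytes maps to STRAT_LIBCALL.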

	* config/i386/i386-options.c (processor_cost_table): Use
	tremont_cost for Tremont.
	* config/i386/x86-tune-costs.h (tremont_memcpy): New.
	(tremont_memset): Likewise.
	(tremont_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Enable for Tremont.
---
 gcc/config/i386/i386-options.c   |   2 +-
 gcc/config/i386/x86-tune-costs.h | 124 +++++++++++++++++++++++++++++++
 gcc/config/i386/x86-tune.def     |   2 +-
 3 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index c0006b3674b..e7a3bd4aaea 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -724,7 +724,7 @@ static const struct processor_costs *processor_cost_table[] =
   &slm_cost,
   &slm_cost,
   &slm_cost,
-  &slm_cost,
+  &tremont_cost,
   &slm_cost,
   &slm_cost,
   &skylake_cost,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..93644be9cb3 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2734,6 +2734,130 @@ struct processor_costs slm_cost = {
   "16",					/* Func alignment.  */
 };
 
+static stringop_algs tremont_memcpy[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static stringop_algs tremont_memset[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+static const
+struct processor_costs tremont_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
+  6,				     /* cost for loading QImode using movzbl */
+  {6, 6, 6},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {6, 6, 6},				/* cost of storing integer registers */
+  4,					/* cost of reg,reg fld/fst */
+  {6, 6, 12},				/* cost of loading fp registers
+					   in SFmode, DFmode and XFmode */
+  {6, 6, 12},				/* cost of storing fp registers
+					   in SFmode, DFmode and XFmode */
+  2,					/* cost of moving MMX register */
+  {6, 6},				/* cost of loading MMX registers
+					   in SImode and DImode */
+  {6, 6},				/* cost of storing MMX registers
+					   in SImode and DImode */
+  2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
+  {6, 6, 6, 10, 15},			/* cost of loading SSE registers
+					   in 32,64,128,256 and 512-bit */
+  {6, 6, 6, 10, 15},			/* cost of storing SSE registers
+					   in 32,64,128,256 and 512-bit */
+  6, 6,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* mask->integer and integer->mask moves */
+  {6, 6, 6},				/* cost of loading mask register
+					   in QImode, HImode, SImode.  */
+  {6, 6, 6},			/* cost if storing mask register
+					   in QImode, HImode, SImode.  */
+  2,					/* cost of moving mask register.  */
+  /* End of register allocator costs.  */
+  },
+
+  COSTS_N_INSNS (1),			/* cost of an add instruction */
+  /* Setting cost to 2 makes our current implementation of synth_mult result in
+     use of unnecessary temporary registers causing regression on several
+     SPECfp benchmarks.  */
+  COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
+  COSTS_N_INSNS (1),			/* variable shift costs */
+  COSTS_N_INSNS (1),			/* constant shift costs */
+  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
+   COSTS_N_INSNS (4),			/*				 HI */
+   COSTS_N_INSNS (3),			/*				 SI */
+   COSTS_N_INSNS (4),			/*				 DI */
+   COSTS_N_INSNS (4)},			/*			      other */
+  0,					/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (16),			/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (22),			/*			    HI */
+   COSTS_N_INSNS (30),			/*			    SI */
+   COSTS_N_INSNS (74),			/*			    DI */
+   COSTS_N_INSNS (74)},			/*			    other */
+  COSTS_N_INSNS (1),			/* cost of movsx */
+  COSTS_N_INSNS (1),			/* cost of movzx */
+  8,					/* "large" insn */
+  17,					/* MOVE_RATIO */
+  17,					/* CLEAR_RATIO */
+  {6, 6, 6},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {6, 6, 6},				/* cost of storing integer registers */
+  {6, 6, 6, 10, 15},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
+  {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
+  6,					/* cost of moving SSE register to integer.  */
+  18, 6,				/* Gather load static, per_elt.  */
+  18, 6,				/* Gather store static, per_elt.  */
+  32,					/* size of l1 cache.  */
+  512,					/* size of l2 cache.  */
+  64,					/* size of prefetch block */
+  6,					/* number of parallel prefetches */
+  /* Benchmarks shows large regressions on K8 sixtrack benchmark when this
+     value is increased to perhaps more appropriate value of 5.  */
+  3,					/* Branch cost */
+  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (17),			/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (14),			/* cost of FSQRT instruction.  */
+
+  COSTS_N_INSNS (1),			/* cost of cheap SSE instruction.  */
+  COSTS_N_INSNS (3),			/* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (4),			/* cost of MULSS instruction.  */
+  COSTS_N_INSNS (5),			/* cost of MULSD instruction.  */
+  COSTS_N_INSNS (5),			/* cost of FMA SS instruction.  */
+  COSTS_N_INSNS (5),			/* cost of FMA SD instruction.  */
+  COSTS_N_INSNS (13),			/* cost of DIVSS instruction.  */
+  COSTS_N_INSNS (17),			/* cost of DIVSD instruction.  */
+  COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
+  COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
+  1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  tremont_memcpy,
+  tremont_memset,
+  COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
+  COSTS_N_INSNS (2),			/* cond_not_taken_branch_cost.  */
+  "16:11:8",				/* Loop alignment.  */
+  "16:11:8",				/* Jump alignment.  */
+  "0:0:8",				/* Label alignment.  */
+  "16",					/* Func alignment.  */
+};
+
 static stringop_algs intel_memcpy[2] = {
   {libcall, {{11, loop, false}, {-1, rep_prefix_4_byte, false}}},
   {libcall, {{32, loop, false}, {64, rep_prefix_4_byte, false},
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 385e275bbd9..088edb6c4ca 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
 	  "prefer_known_rep_movsb_stosb",
-	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+	  m_SKYLAKE | m_ALDERLAKE | m_TREMONT | m_CORE_AVX512)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-15  8:09 [PATCH 0/4] Update mtune=tremont lili.cui
  2021-09-15  8:09 ` [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont lili.cui
  2021-09-15  8:09 ` [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont lili.cui
@ 2021-09-15  8:09 ` lili.cui
  2021-09-16  6:27   ` Uros Bizjak
  2021-09-15  8:09 ` [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY lili.cui
  3 siblings, 1 reply; 16+ messages in thread
From: lili.cui @ 2021-09-15  8:09 UTC (permalink / raw)
  To: ubizjak; +Cc: gcc-patches, hongtao.liu, hjl.tools

From: "H.J. Lu" <hjl.tools@gmail.com>

Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
handling avx_partial_xmm_update attribute.  Don't convert AVX partial
XMM register update if vector packed SSE conversion should be used.
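
A minimal stand-alone sketch of the check described above, mirroring the
two early "continue" tests this patch adds to
remove_partial_avx_dependency.  The enum and function name are
hypothetical; the real code inspects RTL machine modes.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the source operand's machine mode.  */
enum src_mode
{
  MODE_SF,	/* float */
  MODE_DF,	/* double */
  MODE_SI,	/* 32-bit integer */
  MODE_DI	/* 64-bit integer */
};

/* Return true when the partial XMM update should be left alone
   because a vector packed conversion will be used instead.  */
bool
skip_partial_xmm_rewrite (bool use_vector_fp_converts,
			  bool use_vector_converts,
			  enum src_mode mode)
{
  if (use_vector_fp_converts && (mode == MODE_SF || mode == MODE_DF))
    return true;
  if (use_vector_converts && (mode == MODE_SI || mode == MODE_DI))
    return true;
  return false;
}
```

With only use_vector_fp_converts set, the float source in "d = f" is
skipped while the int source in "f = i" is still rewritten, which is
consistent with pr101900-1.c expecting exactly one vxorps.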

gcc/

	PR target/101900
	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
	before generating vxorps.

gcc/

	PR target/101900
	* testsuite/gcc.target/i386/pr101900-1.c: New test.
	* testsuite/gcc.target/i386/pr101900-2.c: Likewise.
	* testsuite/gcc.target/i386/pr101900-3.c: Likewise.
---
 gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
 gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 5a99ea7c046..ae5ea02a002 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
 	      != AVX_PARTIAL_XMM_UPDATE_TRUE)
 	    continue;
 
-	  if (!v4sf_const0)
-	    v4sf_const0 = gen_reg_rtx (V4SFmode);
-
 	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
 	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
 	     vec_merge with subreg.  */
 	  rtx src = SET_SRC (set);
 	  rtx dest = SET_DEST (set);
 	  machine_mode dest_mode = GET_MODE (dest);
+	  machine_mode src_mode;
+
+	  if (TARGET_USE_VECTOR_FP_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SFmode || src_mode == E_DFmode)
+		continue;
+	    }
+
+	  if (TARGET_USE_VECTOR_CONVERTS)
+	    {
+	      src_mode = GET_MODE (XEXP (src, 0));
+	      if (src_mode == E_SImode || src_mode == E_DImode)
+		continue;
+	    }
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
 
 	  rtx zero;
 	  machine_mode dest_vecmode;
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
new file mode 100644
index 00000000000..0a45f8e340a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
new file mode 100644
index 00000000000..c8b2d1da5ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
new file mode 100644
index 00000000000..6ee565b5bd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
  2021-09-15  8:09 [PATCH 0/4] Update mtune=tremont lili.cui
                   ` (2 preceding siblings ...)
  2021-09-15  8:09 ` [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS lili.cui
@ 2021-09-15  8:09 ` lili.cui
  2021-09-15 14:13   ` H.J. Lu
  2021-09-16  6:34   ` Uros Bizjak
  3 siblings, 2 replies; 16+ messages in thread
From: lili.cui @ 2021-09-15  8:09 UTC (permalink / raw)
  To: ubizjak; +Cc: gcc-patches, hongtao.liu, hjl.tools

From: "H.J. Lu" <hjl.tools@gmail.com>

1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling the
avx_partial_xmm_update attribute.  Don't convert an AVX partial XMM
register update if there is no partial SSE register dependency for the
SSE conversion.

gcc/

	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
	TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
	vxorps.
	* config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
	New.
	(TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
	* config/i386/i386.md (SSE FP to FP splitters): Replace
	TARGET_SSE_PARTIAL_REG_DEPENDENCY with
	TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
	(SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
	with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
	* config/i386/x86-tune.def
	(X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
	(X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.

gcc/testsuite/

	* gcc.target/i386/avx-covert-1.c: New file.
	* gcc.target/i386/avx-fp-covert-1.c: Likewise.
	* gcc.target/i386/avx-int-covert-1.c: Likewise.
	* gcc.target/i386/sse-covert-1.c: Likewise.
	* gcc.target/i386/sse-fp-covert-1.c: Likewise.
	* gcc.target/i386/sse-int-covert-1.c: Likewise.
---
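Note (not part of the commit): a minimal standalone sketch of the skip
condition that the i386-features.c hunk below adds to
remove_partial_avx_dependency.  The struct, enum, and function names are
hypothetical stand-ins for the TARGET_* macros and machine modes; only the
boolean logic is taken from the hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the tuning flags.  In GCC these are the
   TARGET_* macros, which index ix86_tune_features[].  */
struct tune_flags
{
  bool use_vector_fp_converts;
  bool use_vector_converts;
  bool sse_partial_reg_fp_converts_dependency;
  bool sse_partial_reg_converts_dependency;
};

/* Simplified source modes of a scalar conversion (assumption: the real
   pass inspects machine_mode values such as E_SFmode).  */
enum src_mode { SRC_SF, SRC_DF, SRC_SI, SRC_DI, SRC_OTHER };

/* Return true if the pass should leave the conversion alone, i.e. not
   emit a vxorps to break the partial register dependency.  */
bool
skip_conversion (const struct tune_flags *t, enum src_mode m)
{
  /* FP to FP: skip if vector converts are used, or if the tuning says
     there is no partial register dependency for FP conversions.  */
  if ((t->use_vector_fp_converts
       || !t->sse_partial_reg_fp_converts_dependency)
      && (m == SRC_SF || m == SRC_DF))
    return true;

  /* Integer to FP: same test with the integer knobs.  */
  if ((t->use_vector_converts
       || !t->sse_partial_reg_converts_dependency)
      && (m == SRC_SI || m == SRC_DI))
    return true;

  return false;
}
```

With sse_partial_reg_fp_converts_dependency cleared, the model skips SF/DF
sources, which is what avx-fp-covert-1.c expects when it scans for the
absence of vxorps.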
 gcc/config/i386/i386-features.c               |  6 ++++--
 gcc/config/i386/i386.h                        |  4 ++++
 gcc/config/i386/i386.md                       |  9 ++++++---
 gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
 gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
 .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
 .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
 gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
 .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
 .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
 10 files changed, 125 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index ae5ea02a002..91bfa06d4bf 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
 	  machine_mode dest_mode = GET_MODE (dest);
 	  machine_mode src_mode;
 
-	  if (TARGET_USE_VECTOR_FP_CONVERTS)
+	  if (TARGET_USE_VECTOR_FP_CONVERTS
+	      || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
 	    {
 	      src_mode = GET_MODE (XEXP (src, 0));
 	      if (src_mode == E_SFmode || src_mode == E_DFmode)
 		continue;
 	    }
 
-	  if (TARGET_USE_VECTOR_CONVERTS)
+	  if (TARGET_USE_VECTOR_CONVERTS
+	      || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
 	    {
 	      src_mode = GET_MODE (XEXP (src, 0));
 	      if (src_mode == E_SImode || src_mode == E_DImode)
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index e76bb55c080..ec60b89753e 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
 #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
 	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
+#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
+	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
+#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
+	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
 #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
 	ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
 #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 13f6f57cdcc..c82a9dc1f67 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4535,7 +4535,8 @@
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -4708,7 +4709,8 @@
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -5243,7 +5245,8 @@
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 088edb6c4ca..58e8ead56b4 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
           m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
 	  | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
 
+/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
+   partial write to the destination in scalar SSE conversion from FP
+   to FP.  */
+DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
+	  "sse_partial_reg_fp_converts_dependency",
+	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
+	  | m_BDVER | m_ZNVER | m_GENERIC)
+
+/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
+   write to the destination in scalar SSE conversion from integer to FP.  */
+DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
+	  "sse_partial_reg_converts_dependency",
+	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
+	  | m_BDVER | m_ZNVER | m_GENERIC)
+
 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
    maintain just lower part of scalar values in proper format leaving the
diff --git a/gcc/testsuite/gcc.target/i386/avx-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
new file mode 100644
index 00000000000..b6c794ecbb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
+/* { dg-final { scan-assembler-not "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
new file mode 100644
index 00000000000..c40c48b1b2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
+
+extern float f;
+extern double d;
+
+void
+foo (void)
+{
+  d = f;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
new file mode 100644
index 00000000000..01bb64e66cc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern int i;
+
+void
+foo (void)
+{
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
new file mode 100644
index 00000000000..c30af694505
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "cvtss2sd" } } */
+/* { dg-final { scan-assembler "cvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "cvtps2pd" } } */
+/* { dg-final { scan-assembler-not "cvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
new file mode 100644
index 00000000000..b6567e60e3e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
+
+extern float f;
+extern double d;
+
+void
+foo (void)
+{
+  d = f;
+}
+
+/* { dg-final { scan-assembler "cvtss2sd" } } */
+/* { dg-final { scan-assembler-not "cvtps2pd" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
new file mode 100644
index 00000000000..107f7241def
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern int i;
+
+void
+foo (void)
+{
+  f = i;
+}
+
+/* { dg-final { scan-assembler "cvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
  2021-09-15  8:09 ` [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY lili.cui
@ 2021-09-15 14:13   ` H.J. Lu
  2021-09-15 23:54     ` Cui, Lili
  2021-09-16  6:34   ` Uros Bizjak
  1 sibling, 1 reply; 16+ messages in thread
From: H.J. Lu @ 2021-09-15 14:13 UTC (permalink / raw)
  To: Lili Cui; +Cc: Uros Bizjak, GCC Patches, Hongtao Liu

There is no need to add [PATCH N/4] in the first line of the git
commit message.  "git format-patch" or "git send-email" will
add them automatically.

On Wed, Sep 15, 2021 at 1:10 AM <lili.cui@intel.com> wrote:
>
> From: "H.J. Lu" <hjl.tools@gmail.com>
>
> 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
> 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
> 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling the
> avx_partial_xmm_update attribute.  Don't convert an AVX partial XMM
> register update if there is no partial SSE register dependency for the
> SSE conversion.
>
> gcc/
>
>         * config/i386/i386-features.c (remove_partial_avx_dependency):
>         Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
>         TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
>         vxorps.
>         * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
>         New.
>         (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
>         * config/i386/i386.md (SSE FP to FP splitters): Replace
>         TARGET_SSE_PARTIAL_REG_DEPENDENCY with
>         TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
>         (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
>         with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
>         * config/i386/x86-tune.def
>         (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
>         (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
>
> gcc/testsuite/
>
>         * gcc.target/i386/avx-covert-1.c: New file.
>         * gcc.target/i386/avx-fp-covert-1.c: Likewise.
>         * gcc.target/i386/avx-int-covert-1.c: Likewise.
>         * gcc.target/i386/sse-covert-1.c: Likewise.
>         * gcc.target/i386/sse-fp-covert-1.c: Likewise.
>         * gcc.target/i386/sse-int-covert-1.c: Likewise.
> ---
>  gcc/config/i386/i386-features.c               |  6 ++++--
>  gcc/config/i386/i386.h                        |  4 ++++
>  gcc/config/i386/i386.md                       |  9 ++++++---
>  gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
>  gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
>  .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
>  .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
>  gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
>  .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
>  .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
>  10 files changed, 125 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
>
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index ae5ea02a002..91bfa06d4bf 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
>           machine_mode dest_mode = GET_MODE (dest);
>           machine_mode src_mode;
>
> -         if (TARGET_USE_VECTOR_FP_CONVERTS)
> +         if (TARGET_USE_VECTOR_FP_CONVERTS
> +             || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
>             {
>               src_mode = GET_MODE (XEXP (src, 0));
>               if (src_mode == E_SFmode || src_mode == E_DFmode)
>                 continue;
>             }
>
> -         if (TARGET_USE_VECTOR_CONVERTS)
> +         if (TARGET_USE_VECTOR_CONVERTS
> +             || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
>             {
>               src_mode = GET_MODE (XEXP (src, 0));
>               if (src_mode == E_SImode || src_mode == E_DImode)
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index e76bb55c080..ec60b89753e 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>         ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
>  #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
>         ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
> +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
> +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
> +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
> +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
>  #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>         ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
>  #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 13f6f57cdcc..c82a9dc1f67 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -4535,7 +4535,8 @@
>          (float_extend:DF
>            (match_operand:SF 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> @@ -4708,7 +4709,8 @@
>          (float_truncate:SF
>           (match_operand:DF 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> @@ -5243,7 +5245,8 @@
>    [(set (match_operand:MODEF 0 "sse_reg_operand")
>         (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!EXT_REX_SSE_REG_P (operands[0])
>         || TARGET_AVX512VL)"
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 088edb6c4ca..58e8ead56b4 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
>            m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
>           | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
>
> +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
> +   partial write to the destination in scalar SSE conversion from FP
> +   to FP.  */
> +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
> +         "sse_partial_reg_fp_converts_dependency",
> +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> +         | m_BDVER | m_ZNVER | m_GENERIC)

I thought we wanted to enable this for Tremont.

> +
> +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
> +   write to the destination in scalar SSE conversion from integer to FP.  */
> +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
> +         "sse_partial_reg_converts_dependency",
> +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> +         | m_BDVER | m_ZNVER | m_GENERIC)
> +
>  /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>     are resolved on SSE register parts instead of whole registers, so we may
>     maintain just lower part of scalar values in proper format leaving the

-- 
H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
  2021-09-15 14:13   ` H.J. Lu
@ 2021-09-15 23:54     ` Cui, Lili
  2021-09-16  2:21       ` H.J. Lu
  0 siblings, 1 reply; 16+ messages in thread
From: Cui, Lili @ 2021-09-15 23:54 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, GCC Patches, Liu, Hongtao



> -----Original Message-----
> From: H.J. Lu <hjl.tools@gmail.com>
> Sent: Wednesday, September 15, 2021 10:14 PM
> To: Cui, Lili <lili.cui@intel.com>
> Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches <gcc-patches@gcc.gnu.org>;
> Liu, Hongtao <hongtao.liu@intel.com>
> Subject: Re: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
> 
> There is no need to add [PATCH N/4] in the first line of the git commit
> message.  "git format-patch" or "git send-email" will add them automatically.
> 
Thanks for the reminder, I didn't notice it before.

> On Wed, Sep 15, 2021 at 1:10 AM <lili.cui@intel.com> wrote:
> >
> > From: "H.J. Lu" <hjl.tools@gmail.com>
> >
> > 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
> > 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
> > 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling the
> > avx_partial_xmm_update attribute.  Don't convert an AVX partial XMM
> > register update if there is no partial SSE register dependency for the
> > SSE conversion.
> >
> > gcc/
> >
> >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> >         Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> >         TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
> >         vxorps.
> >         * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
> >         New.
> >         (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> >         * config/i386/i386.md (SSE FP to FP splitters): Replace
> >         TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> >         TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
> >         (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
> >         with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
> >         * config/i386/x86-tune.def
> >         (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
> >         (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> >
> > gcc/testsuite/
> >
> >         * gcc.target/i386/avx-covert-1.c: New file.
> >         * gcc.target/i386/avx-fp-covert-1.c: Likewise.
> >         * gcc.target/i386/avx-int-covert-1.c: Likewise.
> >         * gcc.target/i386/sse-covert-1.c: Likewise.
> >         * gcc.target/i386/sse-fp-covert-1.c: Likewise.
> >         * gcc.target/i386/sse-int-covert-1.c: Likewise.
> > ---
> >  gcc/config/i386/i386-features.c               |  6 ++++--
> >  gcc/config/i386/i386.h                        |  4 ++++
> >  gcc/config/i386/i386.md                       |  9 ++++++---
> >  gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
> >  gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
> >  .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
> >  .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
> >  gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
> >  .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
> >  .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
> >  10 files changed, 125 insertions(+), 5 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
> >
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index ae5ea02a002..91bfa06d4bf 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
> >           machine_mode dest_mode = GET_MODE (dest);
> >           machine_mode src_mode;
> >
> > -         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > +         if (TARGET_USE_VECTOR_FP_CONVERTS
> > +             || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
> >             {
> >               src_mode = GET_MODE (XEXP (src, 0));
> >               if (src_mode == E_SFmode || src_mode == E_DFmode)
> >                 continue;
> >             }
> >
> > -         if (TARGET_USE_VECTOR_CONVERTS)
> > +         if (TARGET_USE_VECTOR_CONVERTS
> > +             || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
> >             {
> >               src_mode = GET_MODE (XEXP (src, 0));
> >               if (src_mode == E_SImode || src_mode == E_DImode)
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index e76bb55c080..ec60b89753e 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> >         ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
> >  #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
> >         ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
> > +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
> > +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
> > +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
> > +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
> >  #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> >         ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> >  #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index 13f6f57cdcc..c82a9dc1f67 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -4535,7 +4535,8 @@
> >          (float_extend:DF
> >            (match_operand:SF 1 "nonimmediate_operand")))]
> >    "!TARGET_AVX
> > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> > +   && epilogue_completed
> >     && optimize_function_for_speed_p (cfun)
> >     && (!REG_P (operands[1])
> >         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> > @@ -4708,7 +4709,8 @@
> >          (float_truncate:SF
> >           (match_operand:DF 1 "nonimmediate_operand")))]
> >    "!TARGET_AVX
> > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> > +   && epilogue_completed
> >     && optimize_function_for_speed_p (cfun)
> >     && (!REG_P (operands[1])
> >         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> > @@ -5243,7 +5245,8 @@
> >    [(set (match_operand:MODEF 0 "sse_reg_operand")
> >         (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
> >    "!TARGET_AVX
> > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > +   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
> > +   && epilogue_completed
> >     && optimize_function_for_speed_p (cfun)
> >     && (!EXT_REX_SSE_REG_P (operands[0])
> >         || TARGET_AVX512VL)"
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 088edb6c4ca..58e8ead56b4 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
> >            m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> >           | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
> >
> > +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
> > +   partial write to the destination in scalar SSE conversion from FP
> > +   to FP.  */
> > +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
> > +         "sse_partial_reg_fp_converts_dependency",
> > +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> > +         | m_BDVER | m_ZNVER | m_GENERIC)
> 
> I thought we wanted to enable this for Tremont.
> 
In our latest testing, enabling this knob for Tremont causes a 2.8% regression in 538.imagick_r.
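
For reference, leaving m_TREMONT out of the selector is what keeps the
knob off under -mtune=tremont.  A toy sketch, assuming one bit per
processor; the bit values and the M_*/SEL_* names below are made up for
illustration and are not GCC's real masks:

```c
#include <stdbool.h>

/* Hypothetical processor bits (assumption: GCC's real m_* masks are
   built differently and cover many more processors).  */
enum
{
  M_CORE_ALL = 1u << 0,
  M_ZNVER    = 1u << 1,
  M_TREMONT  = 1u << 2,
  M_GENERIC  = 1u << 3
};

/* Selector masks following the patch: the existing knob lists Tremont,
   the new FP-converts knob does not.  */
#define SEL_SSE_PARTIAL_REG_DEPENDENCY \
  (M_CORE_ALL | M_ZNVER | M_TREMONT | M_GENERIC)
#define SEL_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
  (M_CORE_ALL | M_ZNVER | M_GENERIC)

/* A knob is on when the selector contains the bit of the processor
   being tuned for.  */
bool
tune_enabled (unsigned selector, unsigned tune_bit)
{
  return (selector & tune_bit) != 0;
}
```

So tune_enabled (SEL_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY, M_TREMONT) is
false in this model, while the older sse_partial_reg_dependency knob stays
on for Tremont.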

Thanks,
Lili.

> > +
> > +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
> > +   write to the destination in scalar SSE conversion from integer to FP.  */
> > +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
> > +         "sse_partial_reg_converts_dependency",
> > +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> > +         | m_BDVER | m_ZNVER | m_GENERIC)
> > +
> >  /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
> >     are resolved on SSE register parts instead of whole registers, so we may
> >     maintain just lower part of scalar values in proper format leaving the
> 
> --
> H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
  2021-09-15 23:54     ` Cui, Lili
@ 2021-09-16  2:21       ` H.J. Lu
  0 siblings, 0 replies; 16+ messages in thread
From: H.J. Lu @ 2021-09-16  2:21 UTC (permalink / raw)
  To: Cui, Lili; +Cc: Uros Bizjak, GCC Patches, Liu, Hongtao

On Wed, Sep 15, 2021 at 4:54 PM Cui, Lili <lili.cui@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: H.J. Lu <hjl.tools@gmail.com>
> > Sent: Wednesday, September 15, 2021 10:14 PM
> > To: Cui, Lili <lili.cui@intel.com>
> > Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches <gcc-patches@gcc.gnu.org>;
> > Liu, Hongtao <hongtao.liu@intel.com>
> > Subject: Re: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
> >
> > There is no need to add [PATCH N/4] in the first line of the git commit
> > message.  "git format-patch" or "git send-email" will add them automatically.
> >
> Thanks for the reminder, I didn't notice it before.
>
> > On Wed, Sep 15, 2021 at 1:10 AM <lili.cui@intel.com> wrote:
> > >
> > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > >
> > > 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > > TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
> > > 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
> > > 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> > > TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling the
> > > avx_partial_xmm_update attribute.  Don't convert an AVX partial XMM
> > > register update if there is no partial SSE register dependency for the
> > > SSE conversion.
> > >
> > > gcc/
> > >
> > >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> > >         Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> > >         TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
> > >         vxorps.
> > >         * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
> > >         New.
> > >         (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> > >         * config/i386/i386.md (SSE FP to FP splitters): Replace
> > >         TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> > >         TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
> > >         (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
> > >         with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
> > >         * config/i386/x86-tune.def
> > >         (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
> > >         (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> > >
> > > gcc/testsuite/
> > >
> > >         * gcc.target/i386/avx-covert-1.c: New file.
> > >         * gcc.target/i386/avx-fp-covert-1.c: Likewise.
> > >         * gcc.target/i386/avx-int-covert-1.c: Likewise.
> > >         * gcc.target/i386/sse-covert-1.c: Likewise.
> > >         * gcc.target/i386/sse-fp-covert-1.c: Likewise.
> > >         * gcc.target/i386/sse-int-covert-1.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386-features.c               |  6 ++++--
> > >  gcc/config/i386/i386.h                        |  4 ++++
> > >  gcc/config/i386/i386.md                       |  9 ++++++---
> > >  gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
> > >  .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
> > >  .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
> > >  .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
> > >  .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
> > >  10 files changed, 125 insertions(+), 5 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
> > >
> > > diff --git a/gcc/config/i386/i386-features.c
> > > b/gcc/config/i386/i386-features.c index ae5ea02a002..91bfa06d4bf
> > > 100644
> > > --- a/gcc/config/i386/i386-features.c
> > > +++ b/gcc/config/i386/i386-features.c
> > > @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
> > >           machine_mode dest_mode = GET_MODE (dest);
> > >           machine_mode src_mode;
> > >
> > > -         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > > +         if (TARGET_USE_VECTOR_FP_CONVERTS
> > > +             || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
> > >             {
> > >               src_mode = GET_MODE (XEXP (src, 0));
> > >               if (src_mode == E_SFmode || src_mode == E_DFmode)
> > >                 continue;
> > >             }
> > >
> > > -         if (TARGET_USE_VECTOR_CONVERTS)
> > > +         if (TARGET_USE_VECTOR_CONVERTS
> > > +             || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
> > >             {
> > >               src_mode = GET_MODE (XEXP (src, 0));
> > >               if (src_mode == E_SImode || src_mode == E_DImode)
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index e76bb55c080..ec60b89753e 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > >         ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
> > >  #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
> > >         ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
> > > +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
> > > +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
> > > +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
> > > +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
> > >  #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> > >         ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> > >  #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
> > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > > index 13f6f57cdcc..c82a9dc1f67 100644
> > > --- a/gcc/config/i386/i386.md
> > > +++ b/gcc/config/i386/i386.md
> > > @@ -4535,7 +4535,8 @@
> > >          (float_extend:DF
> > >            (match_operand:SF 1 "nonimmediate_operand")))]
> > >    "!TARGET_AVX
> > > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > > +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> > > +   && epilogue_completed
> > >     && optimize_function_for_speed_p (cfun)
> > >     && (!REG_P (operands[1])
> > >         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> > > @@ -4708,7 +4709,8 @@
> > >          (float_truncate:SF
> > >           (match_operand:DF 1 "nonimmediate_operand")))]
> > >    "!TARGET_AVX
> > > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > > +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> > > +   && epilogue_completed
> > >     && optimize_function_for_speed_p (cfun)
> > >     && (!REG_P (operands[1])
> > >         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> > > @@ -5243,7 +5245,8 @@
> > >    [(set (match_operand:MODEF 0 "sse_reg_operand")
> > >         (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
> > >    "!TARGET_AVX
> > > -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> > > +   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
> > > +   && epilogue_completed
> > >     && optimize_function_for_speed_p (cfun)
> > >     && (!EXT_REX_SSE_REG_P (operands[0])
> > >         || TARGET_AVX512VL)"
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 088edb6c4ca..58e8ead56b4 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
> > >            m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> > >           | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
> > >
> > > +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
> > > +   partial write to the destination in scalar SSE conversion from FP
> > > +   to FP.  */
> > > +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
> > > +         "sse_partial_reg_fp_converts_dependency",
> > > +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> > > +         | m_BDVER | m_ZNVER | m_GENERIC)
> >
> > I thought we wanted to enable this for Tremont.
> >
> From the latest test, enabling Tremont here will cause a 2.8% regression to 538.imagick_r.

I see.

Thanks.

> Thanks,
> Lili.
>
> > > +
> > > +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
> > > +   write to the destination in scalar SSE conversion from integer to FP.  */
> > > +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
> > > +         "sse_partial_reg_converts_dependency",
> > > +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> > > +         | m_BDVER | m_ZNVER | m_GENERIC)
> > > +
> > >  /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
> > >     are resolved on SSE register parts instead of whole registers, so we may
> > >     maintain just lower part of scalar values in proper format leaving the
> >
> > --
> > H.J.



-- 
H.J.
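
The exchange above turns on how a DEF_TUNE mask decides whether a knob is
active for a given -mtune value: a knob is on only when the selected
processor's bit appears in the knob's mask. The C sketch below models that
idea in isolation. It is an illustration under stated assumptions, not
GCC's code: the processor enum, the M() helper, the trimmed masks and
tune_enabled() are all hypothetical stand-ins.

```c
#include <stdbool.h>

/* Hypothetical processor indices; GCC's real list is much longer.  */
enum processor { PROC_HASWELL, PROC_TREMONT, PROC_GENERIC, PROC_LAST };

/* Hypothetical helper: one mask bit per processor.  */
#define M(p) (1u << (p))

/* Knob table in the spirit of x86-tune.def.  The masks are trimmed for
   illustration; only the presence or absence of Tremont follows the
   patch under discussion.  */
#define TUNE_LIST \
  DEF_TUNE (TUNE_SSE_PARTIAL_REG_DEPENDENCY, \
            M (PROC_HASWELL) | M (PROC_TREMONT) | M (PROC_GENERIC)) \
  DEF_TUNE (TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY, \
            M (PROC_HASWELL) | M (PROC_GENERIC))

/* First expansion: one enumerator per knob.  */
enum tune_index
{
#define DEF_TUNE(id, mask) id,
  TUNE_LIST
#undef DEF_TUNE
  TUNE_LAST
};

/* Second expansion: the processor mask for each knob.  */
static const unsigned int tune_masks[TUNE_LAST] =
{
#define DEF_TUNE(id, mask) (mask),
  TUNE_LIST
#undef DEF_TUNE
};

/* A knob is enabled when the tuned-for processor's bit is in its mask.  */
static bool
tune_enabled (enum tune_index knob, enum processor tune)
{
  return (tune_masks[knob] & M (tune)) != 0;
}
```

In this model, leaving the Tremont bit out of the FP-converts mask keeps
that knob off when tuning for Tremont while the general SSE partial
register knob stays on, which is the split the patch relies on.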

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-15  8:09 ` [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS lili.cui
@ 2021-09-16  6:27   ` Uros Bizjak
  2021-09-17  3:15     ` Cui, Lili
  0 siblings, 1 reply; 16+ messages in thread
From: Uros Bizjak @ 2021-09-16  6:27 UTC (permalink / raw)
  To: Lili Cui; +Cc: gcc-patches, Hongtao Liu, H. J. Lu

On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
>
> From: "H.J. Lu" <hjl.tools@gmail.com>
>
> Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
> handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> XMM register update if vector packed SSE conversion should be used.
>
> gcc/
>
>         PR target/101900
>         * config/i386/i386-features.c (remove_partial_avx_dependency):
>         Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
>         before generating vxorps.
>
> gcc/
>
>         PR target/101900
>         * testsuite/gcc.target/i386/pr101900-1.c: New test.
>         * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
>         * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> ---
>  gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
>  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
>  4 files changed, 73 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
>
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index 5a99ea7c046..ae5ea02a002 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
>               != AVX_PARTIAL_XMM_UPDATE_TRUE)
>             continue;
>
> -         if (!v4sf_const0)
> -           v4sf_const0 = gen_reg_rtx (V4SFmode);
> -
>           /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
>              SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
>              vec_merge with subreg.  */
>           rtx src = SET_SRC (set);
>           rtx dest = SET_DEST (set);
>           machine_mode dest_mode = GET_MODE (dest);
> +         machine_mode src_mode;
> +
> +         if (TARGET_USE_VECTOR_FP_CONVERTS)
> +           {
> +             src_mode = GET_MODE (XEXP (src, 0));
> +             if (src_mode == E_SFmode || src_mode == E_DFmode)
> +               continue;
> +           }
> +
> +         if (TARGET_USE_VECTOR_CONVERTS)
> +           {
> +             src_mode = GET_MODE (XEXP (src, 0));
> +             if (src_mode == E_SImode || src_mode == E_DImode)
> +               continue;
> +           }
> +
> +         if (!v4sf_const0)
> +           v4sf_const0 = gen_reg_rtx (V4SFmode);

It would be better to move the initialization of src_mode to the top of the new hunk, like:

machine_mode src_mode = GET_MODE (XEXP (src, 0));
switch (src_mode)
{
  case E_SFmode:
  case E_DFmode:
    if (TARGET_USE_VECTOR_FP_CONVERTS)
      continue;
    break;
  case E_SImode:
  case E_DImode:
    if (TARGET_USE_VECTOR_CONVERTS)
      continue;
    break;
  default:
    break;
}

or something like the above.

Uros.

>
>           rtx zero;
>           machine_mode dest_vecmode;
> diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> new file mode 100644
> index 00000000000..0a45f8e340a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> new file mode 100644
> index 00000000000..c8b2d1da5ae
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtss2sd" } } */
> +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> new file mode 100644
> index 00000000000..6ee565b5bd4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "vxorps" } } */
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY
  2021-09-15  8:09 ` [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY lili.cui
  2021-09-15 14:13   ` H.J. Lu
@ 2021-09-16  6:34   ` Uros Bizjak
  1 sibling, 0 replies; 16+ messages in thread
From: Uros Bizjak @ 2021-09-16  6:34 UTC (permalink / raw)
  To: Lili Cui; +Cc: gcc-patches, Hongtao Liu, H. J. Lu

On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
>
> From: "H.J. Lu" <hjl.tools@gmail.com>
>
> 1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
> 2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
> TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
> 3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
> TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling
> avx_partial_xmm_update attribute.  Don't convert AVX partial XMM register
> update if there is no partial SSE register dependency for SSE conversion.
>
> gcc/
>
>         * config/i386/i386-features.c (remove_partial_avx_dependency):
>         Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
>         and TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
>         vxorps.
>         * config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
>         New.
>         (TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
>         * config/i386/i386.md (SSE FP to FP splitters): Replace
>         TARGET_SSE_PARTIAL_REG_DEPENDENCY with
>         TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
>         (SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
>         with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
>         * config/i386/x86-tune.def
>         (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
>         (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
>
> gcc/testsuite/
>
>         * gcc.target/i386/avx-covert-1.c: New file.
>         * gcc.target/i386/avx-fp-covert-1.c: Likewise.
>         * gcc.target/i386/avx-int-covert-1.c: Likewise.
>         * gcc.target/i386/sse-covert-1.c: Likewise.
>         * gcc.target/i386/sse-fp-covert-1.c: Likewise.
>         * gcc.target/i386/sse-int-covert-1.c: Likewise.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-features.c               |  6 ++++--
>  gcc/config/i386/i386.h                        |  4 ++++
>  gcc/config/i386/i386.md                       |  9 ++++++---
>  gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
>  gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
>  .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
>  .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
>  gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
>  .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
>  .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
>  10 files changed, 125 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
>
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index ae5ea02a002..91bfa06d4bf 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -2218,14 +2218,16 @@ remove_partial_avx_dependency (void)
>           machine_mode dest_mode = GET_MODE (dest);
>           machine_mode src_mode;
>
> -         if (TARGET_USE_VECTOR_FP_CONVERTS)
> +         if (TARGET_USE_VECTOR_FP_CONVERTS
> +             || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
>             {
>               src_mode = GET_MODE (XEXP (src, 0));
>               if (src_mode == E_SFmode || src_mode == E_DFmode)
>                 continue;
>             }
>
> -         if (TARGET_USE_VECTOR_CONVERTS)
> +         if (TARGET_USE_VECTOR_CONVERTS
> +             || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
>             {
>               src_mode = GET_MODE (XEXP (src, 0));
>               if (src_mode == E_SImode || src_mode == E_DImode)
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index e76bb55c080..ec60b89753e 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>         ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
>  #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
>         ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
> +#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
> +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
> +#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
> +       ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
>  #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>         ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
>  #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 13f6f57cdcc..c82a9dc1f67 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -4535,7 +4535,8 @@
>          (float_extend:DF
>            (match_operand:SF 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> @@ -4708,7 +4709,8 @@
>          (float_truncate:SF
>           (match_operand:DF 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!REG_P (operands[1])
>         || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
> @@ -5243,7 +5245,8 @@
>    [(set (match_operand:MODEF 0 "sse_reg_operand")
>         (float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
>    "!TARGET_AVX
> -   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
> +   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
> +   && epilogue_completed
>     && optimize_function_for_speed_p (cfun)
>     && (!EXT_REX_SSE_REG_P (operands[0])
>         || TARGET_AVX512VL)"
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 088edb6c4ca..58e8ead56b4 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
>            m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
>           | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
>
> +/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
> +   partial write to the destination in scalar SSE conversion from FP
> +   to FP.  */
> +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
> +         "sse_partial_reg_fp_converts_dependency",
> +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> +         | m_BDVER | m_ZNVER | m_GENERIC)
> +
> +/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
> +   write to the destination in scalar SSE conversion from integer to FP.  */
> +DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
> +         "sse_partial_reg_converts_dependency",
> +         m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> +         | m_BDVER | m_ZNVER | m_GENERIC)
> +
>  /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>     are resolved on SSE register parts instead of whole registers, so we may
>     maintain just lower part of scalar values in proper format leaving the
> diff --git a/gcc/testsuite/gcc.target/i386/avx-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
> new file mode 100644
> index 00000000000..b6c794ecbb8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtss2sd" } } */
> +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
> +/* { dg-final { scan-assembler-not "vcvtdq2ps" } } */
> +/* { dg-final { scan-assembler-not "vxorps" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
> new file mode 100644
> index 00000000000..c40c48b1b2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
> +
> +extern float f;
> +extern double d;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtss2sd" } } */
> +/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
> +/* { dg-final { scan-assembler-not "vxorps" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
> new file mode 100644
> index 00000000000..01bb64e66cc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
> +
> +extern float f;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "vxorps" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/sse-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
> new file mode 100644
> index 00000000000..c30af694505
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
> +
> +extern float f;
> +extern double d;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "cvtss2sd" } } */
> +/* { dg-final { scan-assembler "cvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "cvtps2pd" } } */
> +/* { dg-final { scan-assembler-not "cvtdq2ps" } } */
> +/* { dg-final { scan-assembler-not "pxor" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
> new file mode 100644
> index 00000000000..b6567e60e3e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
> +
> +extern float f;
> +extern double d;
> +
> +void
> +foo (void)
> +{
> +  d = f;
> +}
> +
> +/* { dg-final { scan-assembler "cvtss2sd" } } */
> +/* { dg-final { scan-assembler-not "cvtps2pd" } } */
> +/* { dg-final { scan-assembler-not "pxor" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
> new file mode 100644
> index 00000000000..107f7241def
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
> +
> +extern float f;
> +extern int i;
> +
> +void
> +foo (void)
> +{
> +  f = i;
> +}
> +
> +/* { dg-final { scan-assembler "cvtsi2ssl" } } */
> +/* { dg-final { scan-assembler-not "pxor" } } */
> --
> 2.17.1
>
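
With this patch applied, the decision to leave a scalar conversion alone
in remove_partial_avx_dependency depends on two knobs per source kind. The
C sketch below restates that combined condition in the switch form
suggested in the review of patch 3/4. It is a standalone model under
stated assumptions: enum src_kind, struct tune_flags and
skip_dependency_break() are hypothetical names, and the booleans stand in
for the TARGET_* macros.

```c
#include <stdbool.h>

/* Hypothetical classification of the conversion source:
   SF/DF map to SRC_FP, SI/DI map to SRC_INT.  */
enum src_kind { SRC_FP, SRC_INT, SRC_OTHER };

/* Hypothetical stand-ins for the four TARGET_* tuning macros.  */
struct tune_flags
{
  bool use_vector_fp_converts;
  bool use_vector_converts;
  bool sse_partial_reg_fp_converts_dependency;
  bool sse_partial_reg_converts_dependency;
};

/* Return true when the insn should be skipped, i.e. no zeroing vxorps
   is generated for it.  */
static bool
skip_dependency_break (enum src_kind kind, const struct tune_flags *t)
{
  switch (kind)
    {
    case SRC_FP:
      /* FP -> FP: skip if packed converts are preferred or the target
         has no partial register dependency for these converts.  */
      return (t->use_vector_fp_converts
              || !t->sse_partial_reg_fp_converts_dependency);
    case SRC_INT:
      /* INT -> FP: the same test with the integer-convert knobs.  */
      return (t->use_vector_converts
              || !t->sse_partial_reg_converts_dependency);
    default:
      return false;
    }
}
```

With only the FP-converts dependency knob cleared, the model skips FP
sources and still processes integer sources, which lines up with
avx-fp-covert-1.c expecting no vxorps for d = f.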

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont
  2021-09-15  8:09 ` [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont lili.cui
@ 2021-09-16  6:35   ` Uros Bizjak
  0 siblings, 0 replies; 16+ messages in thread
From: Uros Bizjak @ 2021-09-16  6:35 UTC (permalink / raw)
  To: Lili Cui; +Cc: gcc-patches, Hongtao Liu, H. J. Lu

On Wed, Sep 15, 2021 at 10:09 AM <lili.cui@intel.com> wrote:
>
> From: "H.J. Lu" <hjl.tools@gmail.com>
>
> Initial -mtune=tremont update
>
> 1. Use Haswell scheduling model.
> 2. Assume that the stack engine allows push and pop instructions to
> execute in parallel.
> 3. Prepare for scheduling pass as -mtune=generic.
> 4. Use the same issue rate as -mtune=generic.
> 5. Enable partial_reg_dependency.
> 6. Disable accumulate_outgoing_args
> 7. Enable use_leave
> 8. Enable push_memory
> 9. Disable four_jump_limit
> 10. Disable opt_agu
> 11. Disable avoid_lea_for_addr
> 12. Disable avoid_mem_opnd_for_cmove
> 13. Enable misaligned_move_string_pro_epilogues
> 14. Enable use_cltd
> 16. Enable avoid_false_dep_for_bmi
> 17. Enable avoid_mfence
> 18. Disable expand_abs
> 19. Enable sse_typeless_stores
> 20. Enable sse_load0_by_pxor
> 21. Disable split_mem_opnd_for_fp_converts
> 22. Disable slow_pshufb
> 23. Enable partial_reg_dependency
>
> This is the first patch to tune for Tremont.  With all patches applied,
> performance impacts on SPEC CPU 2017 are:
>
> 500.perlbench_r         1.81%
> 502.gcc_r               0.57%
> 505.mcf_r               1.16%
> 520.omnetpp_r           0.00%
> 523.xalancbmk_r         0.00%
> 525.x264_r              4.55%
> 531.deepsjeng_r         0.00%
> 541.leela_r             0.39%
> 548.exchange2_r         1.13%
> 557.xz_r                0.00%
> geomean for intrate     0.95%
> 503.bwaves_r            0.00%
> 507.cactuBSSN_r         6.94%
> 508.namd_r              12.37%
> 510.parest_r            1.01%
> 511.povray_r            3.70%
> 519.lbm_r               36.61%
> 521.wrf_r               8.79%
> 526.blender_r           2.91%
> 527.cam4_r              6.23%
> 538.imagick_r           0.28%
> 544.nab_r               21.99%
> 549.fotonik3d_r         3.63%
> 554.roms_r              -1.20%
> geomean for fprate      7.50%
>
> gcc/ChangeLog
>
>         * common/config/i386/i386-common.c: Use Haswell scheduling model
>         for Tremont.
>         * config/i386/i386.c (ix86_sched_init_global): Prepare for Tremont
>         scheduling pass.
>         * config/i386/x86-tune-sched.c (ix86_issue_rate): Change Tremont
>         issue rate to 4.
>         (ix86_adjust_cost): Handle Tremont.
>         * config/i386/x86-tune.def (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY):
>         Enable for Tremont.
>         (X86_TUNE_USE_LEAVE): Likewise.
>         (X86_TUNE_PUSH_MEMORY): Likewise.
>         (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
>         (X86_TUNE_USE_CLTD): Likewise.
>         (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
>         (X86_TUNE_AVOID_MFENCE): Likewise.
>         (X86_TUNE_SSE_TYPELESS_STORES): Likewise.
>         (X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
>         (X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Disable for Tremont.
>         (X86_TUNE_FOUR_JUMP_LIMIT): Likewise.
>         (X86_TUNE_OPT_AGU): Likewise.
>         (X86_TUNE_AVOID_LEA_FOR_ADDR): Likewise.
>         (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE): Likewise.
>         (X86_TUNE_EXPAND_ABS): Likewise.
>         (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS): Likewise.
>         (X86_TUNE_SLOW_PSHUFB): Likewise.

OK. (Tuning patches are kind of obvious).

Thanks,
Uros.

> ---
>  gcc/common/config/i386/i386-common.c |  2 +-
>  gcc/config/i386/i386.c               |  1 +
>  gcc/config/i386/x86-tune-sched.c     |  2 ++
>  gcc/config/i386/x86-tune.def         | 37 ++++++++++++++--------------
>  4 files changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/i386-common.c
> index 00c65ba15ab..2c9e1ccbc6e 100644
> --- a/gcc/common/config/i386/i386-common.c
> +++ b/gcc/common/config/i386/i386-common.c
> @@ -1935,7 +1935,7 @@ const pta processor_alias_table[] =
>      M_CPU_TYPE (INTEL_GOLDMONT), P_PROC_SSE4_2},
>    {"goldmont-plus", PROCESSOR_GOLDMONT_PLUS, CPU_GLM, PTA_GOLDMONT_PLUS,
>      M_CPU_TYPE (INTEL_GOLDMONT_PLUS), P_PROC_SSE4_2},
> -  {"tremont", PROCESSOR_TREMONT, CPU_GLM, PTA_TREMONT,
> +  {"tremont", PROCESSOR_TREMONT, CPU_HASWELL, PTA_TREMONT,
>      M_CPU_TYPE (INTEL_TREMONT), P_PROC_SSE4_2},
>    {"knl", PROCESSOR_KNL, CPU_SLM, PTA_KNL,
>      M_CPU_TYPE (INTEL_KNL), P_PROC_AVX512F},
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 7b173bc0beb..2927e2884c9 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -16976,6 +16976,7 @@ ix86_sched_init_global (FILE *, int, int)
>      case PROCESSOR_NEHALEM:
>      case PROCESSOR_SANDYBRIDGE:
>      case PROCESSOR_HASWELL:
> +    case PROCESSOR_TREMONT:
>      case PROCESSOR_GENERIC:
>        /* Do not perform multipass scheduling for pre-reload schedule
>           to save compile time.  */
> diff --git a/gcc/config/i386/x86-tune-sched.c b/gcc/config/i386/x86-tune-sched.c
> index 2e5ee4e4444..56ada99a450 100644
> --- a/gcc/config/i386/x86-tune-sched.c
> +++ b/gcc/config/i386/x86-tune-sched.c
> @@ -71,6 +71,7 @@ ix86_issue_rate (void)
>      case PROCESSOR_NEHALEM:
>      case PROCESSOR_SANDYBRIDGE:
>      case PROCESSOR_HASWELL:
> +    case PROCESSOR_TREMONT:
>      case PROCESSOR_GENERIC:
>        return 4;
>
> @@ -429,6 +430,7 @@ ix86_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
>      case PROCESSOR_NEHALEM:
>      case PROCESSOR_SANDYBRIDGE:
>      case PROCESSOR_HASWELL:
> +    case PROCESSOR_TREMONT:
>      case PROCESSOR_GENERIC:
>        /* Stack engine allows to execute push&pop instructions in parall.  */
>        if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 2f221b1f8c9..385e275bbd9 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -62,7 +62,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
>     that can be partly masked by careful scheduling of moves.  */
>  DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
>            m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
> -         | m_BDVER | m_ZNVER | m_GENERIC)
> +         | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
>     are resolved on SSE register parts instead of whole registers, so we may
> @@ -136,7 +136,7 @@ DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
>
>  DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
>           m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
> -         | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_ATHLON_K8)
> +         | m_GOLDMONT | m_GOLDMONT_PLUS | m_ATHLON_K8)
>
>  /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
>     considered on critical path.  */
> @@ -150,14 +150,15 @@ DEF_TUNE (X86_TUNE_EPILOGUE_USING_MOVE, "epilogue_using_move",
>
>  /* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
>  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> -         m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
> +         m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_TREMONT
> +         | m_GENERIC)
>
>  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
>     Some chips, like 486 and Pentium works faster with separate load
>     and push instructions.  */
>  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
>            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
> -          | m_GENERIC)
> +          | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_SINGLE_PUSH: Enable if single push insn is preferred
>     over esp subtraction.  */
> @@ -198,8 +199,7 @@ DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns",
>     than 4 branch instructions in the 16 byte window.  */
>  DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
>            m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM
> -         | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL | m_ATHLON_K8
> -         | m_AMDFAM10)
> +         | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL | m_ATHLON_K8 | m_AMDFAM10)
>
>  /*****************************************************************************/
>  /* Integer instruction selection tuning                                      */
> @@ -240,11 +240,11 @@ DEF_TUNE (X86_TUNE_INTEGER_DFMODE_MOVES, "integer_dfmode_moves",
>  /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>     will impact LEA instruction selection. */
>  DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT | m_KNL
> -        | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
> +        | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL)
>
>  /* X86_TUNE_AVOID_LEA_FOR_ADDR: Avoid lea for address computation.  */
>  DEF_TUNE (X86_TUNE_AVOID_LEA_FOR_ADDR, "avoid_lea_for_addr",
> -         m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT
> +         m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS
>           | m_KNL | m_KNM)
>
>  /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
> @@ -263,7 +263,7 @@ DEF_TUNE (X86_TUNE_SLOW_IMUL_IMM8, "slow_imul_imm8",
>     a conditional move.  */
>  DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
>           m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_KNL
> -         | m_KNM | m_TREMONT | m_INTEL)
> +         | m_KNM | m_INTEL)
>
>  /* X86_TUNE_SINGLE_STRINGOP: Enable use of single string operations, such
>     as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
> @@ -282,7 +282,8 @@ DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
>     FIXME: This may actualy be a win on more targets than listed here.  */
>  DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES,
>           "misaligned_move_string_pro_epilogues",
> -         m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_GENERIC)
> +         m_386 | m_486 | m_CORE_ALL | m_AMD_MULTIPLE | m_TREMONT
> +         | m_GENERIC)
>
>  /* X86_TUNE_USE_SAHF: Controls use of SAHF.  */
>  DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
> @@ -294,7 +295,7 @@ DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
>  /* X86_TUNE_USE_CLTD: Controls use of CLTD and CTQO instructions.  */
>  DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd",
>           ~(m_PENT | m_LAKEMONT | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
> -           | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS | m_TREMONT))
> +           | m_K6 | m_GOLDMONT | m_GOLDMONT_PLUS))
>
>  /* X86_TUNE_USE_BT: Enable use of BT (bit test) instructions.  */
>  DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
> @@ -305,7 +306,7 @@ DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
>  /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency
>     for bit-manipulation instructions.  */
>  DEF_TUNE (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI, "avoid_false_dep_for_bmi",
> -         m_SANDYBRIDGE | m_CORE_AVX2 | m_GENERIC)
> +         m_SANDYBRIDGE | m_CORE_AVX2 | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
>     on hardware capabilities. Bdver3 hardware has a loop buffer which makes
> @@ -321,14 +322,14 @@ DEF_TUNE (X86_TUNE_ONE_IF_CONV_INSN, "one_if_conv_insn",
>
>  /* X86_TUNE_AVOID_MFENCE: Use lock prefixed instructions instead of mfence.  */
>  DEF_TUNE (X86_TUNE_AVOID_MFENCE, "avoid_mfence",
> -        m_CORE_ALL | m_BDVER | m_ZNVER | m_GENERIC)
> +        m_CORE_ALL | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_EXPAND_ABS: This enables a new abs pattern by
>     generating instructions for abs (x) = (((signed) x >> (W-1) ^ x) -
>     (signed) x >> (W-1)) instead of cmove or SSE max/abs instructions.  */
>  DEF_TUNE (X86_TUNE_EXPAND_ABS, "expand_abs",
>           m_CORE_ALL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
> -         | m_GOLDMONT_PLUS | m_TREMONT )
> +         | m_GOLDMONT_PLUS)
>
>  /*****************************************************************************/
>  /* 387 instruction selection tuning                                          */
> @@ -386,13 +387,13 @@ DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optim
>
>  /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.   */
>  DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> -         m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
> +         m_AMD_MULTIPLE | m_CORE_ALL | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_SSE_LOAD0_BY_PXOR: Always use pxor to load0 as opposed to
>     xorps/xorpd and other variants.  */
>  DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
>           m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BDVER | m_BTVER | m_ZNVER
> -         | m_GENERIC)
> +         | m_TREMONT | m_GENERIC)
>
>  /* X86_TUNE_INTER_UNIT_MOVES_TO_VEC: Enable moves in from integer
>     to SSE registers.  If disabled, the moves will be done by storing
> @@ -419,7 +420,7 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
>     fp converts to destination register.  */
>  DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
>           m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT | m_GOLDMONT_PLUS
> -         | m_TREMONT | m_INTEL)
> +         | m_INTEL)
>
>  /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>     from FP to FP.  This form of instructions avoids partial write to the
> @@ -434,7 +435,7 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
>  /* X86_TUNE_SLOW_SHUFB: Indicates tunings with slow pshufb instruction.  */
>  DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
>           m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_GOLDMONT
> -         | m_GOLDMONT_PLUS | m_TREMONT | m_INTEL)
> +         | m_GOLDMONT_PLUS | m_INTEL)
>
>  /* X86_TUNE_AVOID_4BYTE_PREFIXES: Avoid instructions requiring 4+ bytes of prefixes.  */
>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
> --
> 2.17.1
>
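
For readers following the x86-tune.def hunks above: each DEF_TUNE entry pairs a tuning flag with a bitmask of processors, and the flag is on when the bit of the processor selected by -mtune is in that mask. The sketch below is a simplified model under that assumption, not GCC's code; the enum values, the abridged masks and the helper name tune_enabled are made up for illustration. It shows why adding m_TREMONT to a mask turns a flag on, and why removing it from a negated mask such as the one for use_cltd also turns the flag on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical processor bits; GCC's real m_* masks are defined elsewhere
   and have different values.  */
enum
{
  M_GOLDMONT_PLUS = 1u << 0,
  M_TREMONT       = 1u << 1,
  M_CORE_ALL      = 1u << 2,
  M_GENERIC       = 1u << 3
};

/* A tuning flag is enabled when the processor's bit is set in the mask.  */
static bool
tune_enabled (uint32_t feature_mask, uint32_t processor_bit)
{
  /* Exactly one processor bit must be passed in.  */
  assert (processor_bit != 0 && (processor_bit & (processor_bit - 1)) == 0);
  return (feature_mask & processor_bit) != 0;
}

/* use_leave before and after the patch (abridged masks).  */
static const uint32_t use_leave_old = M_CORE_ALL | M_GENERIC;
static const uint32_t use_leave_new = M_CORE_ALL | M_TREMONT | M_GENERIC;

/* use_cltd is written as a negated mask, so dropping M_TREMONT from the
   excluded set enables the flag for Tremont.  */
static const uint32_t use_cltd_old = ~(uint32_t) (M_GOLDMONT_PLUS | M_TREMONT);
static const uint32_t use_cltd_new = ~(uint32_t) M_GOLDMONT_PLUS;
```

With these abridged masks, tune_enabled (use_leave_new, M_TREMONT) and tune_enabled (use_cltd_new, M_TREMONT) are both true, while the *_old masks give false for Tremont.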


* Re: [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont
  2021-09-15  8:09 ` [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont lili.cui
@ 2021-09-16  6:36   ` Uros Bizjak
  0 siblings, 0 replies; 16+ messages in thread
From: Uros Bizjak @ 2021-09-16  6:36 UTC (permalink / raw)
  To: Lili Cui; +Cc: gcc-patches, Hongtao Liu, H. J. Lu

On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
>
> From: "H.J. Lu" <hjl.tools@gmail.com>
>
> Simplify memcpy and memset inline strategies to avoid branches for
> -mtune=tremont:
>
> 1. Create Tremont cost model from generic cost model.
> 2. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 3. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
> 4. Use memcpy/memset library function if data size is unknown or > 256.
>
>         * config/i386/i386-options.c (processor_cost_table): Use
>         tremont_cost for Tremont.
>         * config/i386/x86-tune-costs.h (tremont_memcpy): New.
>         (tremont_memset): Likewise.
>         (tremont_cost): Likewise.
>         * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
>         Enable for Tremont.

OK, and also obvious as a tuning patch.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-options.c   |   2 +-
>  gcc/config/i386/x86-tune-costs.h | 124 +++++++++++++++++++++++++++++++
>  gcc/config/i386/x86-tune.def     |   2 +-
>  3 files changed, 126 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index c0006b3674b..e7a3bd4aaea 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -724,7 +724,7 @@ static const struct processor_costs *processor_cost_table[] =
>    &slm_cost,
>    &slm_cost,
>    &slm_cost,
> -  &slm_cost,
> +  &tremont_cost,
>    &slm_cost,
>    &slm_cost,
>    &skylake_cost,
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index ffe810f2bcb..93644be9cb3 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -2734,6 +2734,130 @@ struct processor_costs slm_cost = {
>    "16",                                        /* Func alignment.  */
>  };
>
> +static stringop_algs tremont_memcpy[2] = {
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
> +static stringop_algs tremont_memset[2] = {
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
> +static const
> +struct processor_costs tremont_cost = {
> +  {
> +  /* Start of register allocator costs.  integer->integer move cost is 2. */
> +  6,                                /* cost for loading QImode using movzbl */
> +  {6, 6, 6},                           /* cost of loading integer registers
> +                                          in QImode, HImode and SImode.
> +                                          Relative to reg-reg move (2).  */
> +  {6, 6, 6},                           /* cost of storing integer registers */
> +  4,                                   /* cost of reg,reg fld/fst */
> +  {6, 6, 12},                          /* cost of loading fp registers
> +                                          in SFmode, DFmode and XFmode */
> +  {6, 6, 12},                          /* cost of storing fp registers
> +                                          in SFmode, DFmode and XFmode */
> +  2,                                   /* cost of moving MMX register */
> +  {6, 6},                              /* cost of loading MMX registers
> +                                          in SImode and DImode */
> +  {6, 6},                              /* cost of storing MMX registers
> +                                          in SImode and DImode */
> +  2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
> +  {6, 6, 6, 10, 15},                   /* cost of loading SSE registers
> +                                          in 32,64,128,256 and 512-bit */
> +  {6, 6, 6, 10, 15},                   /* cost of storing SSE registers
> +                                          in 32,64,128,256 and 512-bit */
> +  6, 6,                                /* SSE->integer and integer->SSE moves */
> +  6, 6,                                /* mask->integer and integer->mask moves */
> +  {6, 6, 6},                           /* cost of loading mask register
> +                                          in QImode, HImode, SImode.  */
> +  {6, 6, 6},                   /* cost of storing mask register
> +                                          in QImode, HImode, SImode.  */
> +  2,                                   /* cost of moving mask register.  */
> +  /* End of register allocator costs.  */
> +  },
> +
> +  COSTS_N_INSNS (1),                   /* cost of an add instruction */
> +  /* Setting cost to 2 makes our current implementation of synth_mult result in
> +     use of unnecessary temporary registers causing regression on several
> +     SPECfp benchmarks.  */
> +  COSTS_N_INSNS (1) + 1,               /* cost of a lea instruction */
> +  COSTS_N_INSNS (1),                   /* variable shift costs */
> +  COSTS_N_INSNS (1),                   /* constant shift costs */
> +  {COSTS_N_INSNS (3),                  /* cost of starting multiply for QI */
> +   COSTS_N_INSNS (4),                  /*                               HI */
> +   COSTS_N_INSNS (3),                  /*                               SI */
> +   COSTS_N_INSNS (4),                  /*                               DI */
> +   COSTS_N_INSNS (4)},                 /*                            other */
> +  0,                                   /* cost of multiply per each bit set */
> +  {COSTS_N_INSNS (16),                 /* cost of a divide/mod for QI */
> +   COSTS_N_INSNS (22),                 /*                          HI */
> +   COSTS_N_INSNS (30),                 /*                          SI */
> +   COSTS_N_INSNS (74),                 /*                          DI */
> +   COSTS_N_INSNS (74)},                        /*                          other */
> +  COSTS_N_INSNS (1),                   /* cost of movsx */
> +  COSTS_N_INSNS (1),                   /* cost of movzx */
> +  8,                                   /* "large" insn */
> +  17,                                  /* MOVE_RATIO */
> +  17,                                  /* CLEAR_RATIO */
> +  {6, 6, 6},                           /* cost of loading integer registers
> +                                          in QImode, HImode and SImode.
> +                                          Relative to reg-reg move (2).  */
> +  {6, 6, 6},                           /* cost of storing integer registers */
> +  {6, 6, 6, 10, 15},                   /* cost of loading SSE register
> +                                          in 32bit, 64bit, 128bit, 256bit and 512bit */
> +  {6, 6, 6, 10, 15},                   /* cost of storing SSE register
> +                                          in 32bit, 64bit, 128bit, 256bit and 512bit */
> +  {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
> +  {6, 6, 6, 10, 15},                   /* cost of unaligned stores.  */
> +  2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
> +  6,                                   /* cost of moving SSE register to integer.  */
> +  18, 6,                               /* Gather load static, per_elt.  */
> +  18, 6,                               /* Gather store static, per_elt.  */
> +  32,                                  /* size of l1 cache.  */
> +  512,                                 /* size of l2 cache.  */
> +  64,                                  /* size of prefetch block */
> +  6,                                   /* number of parallel prefetches */
> +  /* Benchmarks show large regressions on K8 sixtrack benchmark when this
> +     value is increased to perhaps more appropriate value of 5.  */
> +  3,                                   /* Branch cost */
> +  COSTS_N_INSNS (3),                   /* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (17),                  /* cost of FDIV instruction.  */
> +  COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
> +  COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
> +  COSTS_N_INSNS (14),                  /* cost of FSQRT instruction.  */
> +
> +  COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction.  */
> +  COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
> +  COSTS_N_INSNS (4),                   /* cost of MULSS instruction.  */
> +  COSTS_N_INSNS (5),                   /* cost of MULSD instruction.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMA SS instruction.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMA SD instruction.  */
> +  COSTS_N_INSNS (13),                  /* cost of DIVSS instruction.  */
> +  COSTS_N_INSNS (17),                  /* cost of DIVSD instruction.  */
> +  COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction.  */
> +  COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
> +  1, 4, 3, 3,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  tremont_memcpy,
> +  tremont_memset,
> +  COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> +  COSTS_N_INSNS (2),                   /* cond_not_taken_branch_cost.  */
> +  "16:11:8",                           /* Loop alignment.  */
> +  "16:11:8",                           /* Jump alignment.  */
> +  "0:0:8",                             /* Label alignment.  */
> +  "16",                                        /* Func alignment.  */
> +};
> +
>  static stringop_algs intel_memcpy[2] = {
>    {libcall, {{11, loop, false}, {-1, rep_prefix_4_byte, false}}},
>    {libcall, {{32, loop, false}, {64, rep_prefix_4_byte, false},
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 385e275bbd9..088edb6c4ca 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
>     move/set sequences of bytes with known size.  */
>  DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
>           "prefer_known_rep_movsb_stosb",
> -         m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
> +         m_SKYLAKE | m_ALDERLAKE | m_TREMONT | m_CORE_AVX512)
>
>  /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
>     compact prologues and epilogues by issuing a misaligned moves.  This
> --
> 2.17.1
>
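
As a reading aid for items 2 to 4 of the commit message, here is a small model of the described policy. It is a sketch of the description only, not of GCC's actual algorithm selection; the enum, the macro names and the helpers max_bytes_by_pieces and choose_strategy are hypothetical, and the 16-byte move width is taken from the "16 * 16" in the message.

```c
#include <assert.h>
#include <stdbool.h>

enum strategy { REP_MOVSB_STOSB, LOOP, LIBCALL };

/* Inline limit used in the tremont_memcpy/tremont_memset tables.  */
#define TREMONT_INLINE_MAX 256

/* MOVE_RATIO and CLEAR_RATIO are 17, which per the commit message allows
   up to 16 moves of 16 bytes each.  */
#define TREMONT_MOVE_RATIO 17
#define VECTOR_MOVE_BYTES 16

/* Largest fixed size covered by plain loads and stores (item 2).  */
static long
max_bytes_by_pieces (void)
{
  return (long) (TREMONT_MOVE_RATIO - 1) * VECTOR_MOVE_BYTES;
}

/* Items 3 and 4.  size_is_constant: the byte count is a compile-time
   constant.  max_size: known upper bound on the byte count, or -1 if
   nothing is known about the size.  */
static enum strategy
choose_strategy (bool size_is_constant, long max_size)
{
  assert (max_size >= -1);
  if (max_size == -1 || max_size > TREMONT_INLINE_MAX)
    return LIBCALL;
  return size_is_constant ? REP_MOVSB_STOSB : LOOP;
}
```

In this model a copy of unknown size, or one that may exceed 256 bytes, always goes to the library call, which is the branch-avoiding behaviour the message asks for.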


* RE: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-16  6:27   ` Uros Bizjak
@ 2021-09-17  3:15     ` Cui, Lili
  2021-09-17  6:35       ` Uros Bizjak
  0 siblings, 1 reply; 16+ messages in thread
From: Cui, Lili @ 2021-09-17  3:15 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Liu, Hongtao, H. J. Lu

[-- Attachment #1: Type: text/plain, Size: 6976 bytes --]


> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: Thursday, September 16, 2021 2:28 PM
> To: Cui, Lili <lili.cui@intel.com>
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao.liu@intel.com>; H. J. Lu
> <hjl.tools@gmail.com>
> Subject: Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle
> USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
> 
> On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
> >
> > From: "H.J. Lu" <hjl.tools@gmail.com>
> >
> > Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
> > handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> > XMM register update if vector packed SSE conversion should be used.
> >
> > gcc/
> >
> >         PR target/101900
> >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> >         Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
> >         before generating vxorps.
> >
> > gcc/
> >
> >         PR target/101900
> >         * testsuite/gcc.target/i386/pr101900-1.c: New test.
> >         * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
> >         * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> > ---
> >  gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
> >  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
> >  4 files changed, 73 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
> >
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index 5a99ea7c046..ae5ea02a002 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
> >               != AVX_PARTIAL_XMM_UPDATE_TRUE)
> >             continue;
> >
> > -         if (!v4sf_const0)
> > -           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > -
> >           /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
> >              SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
> >              vec_merge with subreg.  */
> >           rtx src = SET_SRC (set);
> >           rtx dest = SET_DEST (set);
> >           machine_mode dest_mode = GET_MODE (dest);
> > +         machine_mode src_mode;
> > +
> > +         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > +           {
> > +             src_mode = GET_MODE (XEXP (src, 0));
> > +             if (src_mode == E_SFmode || src_mode == E_DFmode)
> > +               continue;
> > +           }
> > +
> > +         if (TARGET_USE_VECTOR_CONVERTS)
> > +           {
> > +             src_mode = GET_MODE (XEXP (src, 0));
> > +             if (src_mode == E_SImode || src_mode == E_DImode)
> > +               continue;
> > +           }
> > +
> > +         if (!v4sf_const0)
> > +           v4sf_const0 = gen_reg_rtx (V4SFmode);
> 
> Please move the initialization of src_mode to the top of the new hunk, like:
> 
> machine_mode src_mode = GET_MODE (XEXP (src, 0));
> switch (src_mode) {
>   case E_SFmode:
>   case E_DFmode:
>     if (TARGET_USE_VECTOR_FP_CONVERTS)
>       continue;
>     break;
>   case E_SImode:
>   case E_DImode:
>     if (TARGET_USE_VECTOR_CONVERTS)
>       continue;
>     break;
>   default:
>     break;
> }
> 
> or something like the above.

Done, thanks for the advice. I also rebased patch 4/4, since it is based on patch 3/4.

Changed it to:

+         machine_mode src_mode = GET_MODE (XEXP (src, 0));
+
+         switch (src_mode)
+           {
+           case E_SFmode:
+           case E_DFmode:
+             if (TARGET_USE_VECTOR_FP_CONVERTS)
+               continue;
+             break;
+           case E_SImode:
+           case E_DImode:
+             if (TARGET_USE_VECTOR_CONVERTS)
+               continue;
+             break;
+           default:
+             break;
+           }
+         if (!v4sf_const0)
+           v4sf_const0 = gen_reg_rtx (V4SFmode);

Thanks,
Lili.

> 
> Uros.
> 
> >
> >           rtx zero;
> >           machine_mode dest_vecmode;
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > new file mode 100644
> > index 00000000000..0a45f8e340a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
> > +
> > +extern float f;
> > +extern double d;
> > +extern int i;
> > +
> > +void
> > +foo (void)
> > +{
> > +  d = f;
> > +  f = i;
> > +}
> > +
> > +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> > +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> > +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> > new file mode 100644
> > index 00000000000..c8b2d1da5ae
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
> > +
> > +extern float f;
> > +extern double d;
> > +extern int i;
> > +
> > +void
> > +foo (void)
> > +{
> > +  d = f;
> > +  f = i;
> > +}
> > +
> > +/* { dg-final { scan-assembler "vcvtss2sd" } } */
> > +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> > +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> > new file mode 100644
> > index 00000000000..6ee565b5bd4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> > @@ -0,0 +1,19 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
> > +
> > +extern float f;
> > +extern double d;
> > +extern int i;
> > +
> > +void
> > +foo (void)
> > +{
> > +  d = f;
> > +  f = i;
> > +}
> > +
> > +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> > +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> > +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> > +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> > +/* { dg-final { scan-assembler-not "vxorps" } } */
> > --
> > 2.17.1
> >

[-- Attachment #2: 0003-PATCH-3-4-x86-Properly-handle-USE_VECTOR_FP_CONVERTS.patch --]
[-- Type: application/octet-stream, Size: 4561 bytes --]

From 4503c7960a42078be20b175ab1e9bbbabe475bcb Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 15 Sep 2021 14:17:58 +0800
Subject: [PATCH 3/4] [PATCH 3/4] x86: Properly handle
 USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS

Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
handling avx_partial_xmm_update attribute.  Don't convert AVX partial
XMM register update if vector packed SSE conversion should be used.

gcc/

	PR target/101900
	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
	before generating vxorps.

gcc/

	PR target/101900
	* testsuite/gcc.target/i386/pr101900-1.c: New test.
	* testsuite/gcc.target/i386/pr101900-2.c: Likewise.
	* testsuite/gcc.target/i386/pr101900-3.c: Likewise.
---
 gcc/config/i386/i386-features.c            | 23 +++++++++++++++++++---
 gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 +++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 +++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 ++++++++++++++++++
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 5a99ea7c046..a65f60122a5 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2210,15 +2210,32 @@ remove_partial_avx_dependency (void)
 	      != AVX_PARTIAL_XMM_UPDATE_TRUE)
 	    continue;
 
-	  if (!v4sf_const0)
-	    v4sf_const0 = gen_reg_rtx (V4SFmode);
-
 	  /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
 	     SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
 	     vec_merge with subreg.  */
 	  rtx src = SET_SRC (set);
 	  rtx dest = SET_DEST (set);
 	  machine_mode dest_mode = GET_MODE (dest);
+	  machine_mode src_mode = GET_MODE (XEXP (src, 0));
+
+	  switch (src_mode)
+	    {
+	    case E_SFmode:
+	    case E_DFmode:
+	      if (TARGET_USE_VECTOR_FP_CONVERTS)
+		continue;
+	      break;
+	    case E_SImode:
+	    case E_DImode:
+	      if (TARGET_USE_VECTOR_CONVERTS)
+		continue;
+	      break;
+	    default:
+	      break;
+	    }
+
+	  if (!v4sf_const0)
+	    v4sf_const0 = gen_reg_rtx (V4SFmode);
 
 	  rtx zero;
 	  machine_mode dest_vecmode;
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c b/gcc/testsuite/gcc.target/i386/pr101900-1.c
new file mode 100644
index 00000000000..0a45f8e340a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c b/gcc/testsuite/gcc.target/i386/pr101900-2.c
new file mode 100644
index 00000000000..c8b2d1da5ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c b/gcc/testsuite/gcc.target/i386/pr101900-3.c
new file mode 100644
index 00000000000..6ee565b5bd4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtps2pd" } } */
+/* { dg-final { scan-assembler "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
-- 
2.17.1


[-- Attachment #3: 0004-PATCH-4-4-x86-Add-TARGET_SSE_PARTIAL_REG_-FP_-CONVER.patch --]
[-- Type: application/octet-stream, Size: 11008 bytes --]

From 78989688a2548fb5a6411ffbfc464ca7e6a4f291 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 15 Sep 2021 14:18:21 +0800
Subject: [PATCH 4/4] [PATCH 4/4] x86: Add
 TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY

1. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY in SSE FP to FP splitters.
2. Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY with
TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY in SSE INT to FP splitters.
3. Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY when handling avx_partial_xmm_update
attribute.  Don't convert AVX partial XMM register update if there is no
partial SSE register dependency for SSE conversion.
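
The combined condition from item 3 can be sketched as a single predicate. This is a hedged illustration, not the code in i386-features.c: struct tune_flags and skip_partial_xmm_rewrite are hypothetical names, and the four booleans stand in for the corresponding TARGET_* macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bundle of the four tuning flags involved.  */
struct tune_flags
{
  bool use_vector_fp_converts;
  bool use_vector_converts;
  bool sse_partial_reg_fp_converts_dependency;
  bool sse_partial_reg_converts_dependency;
};

/* is_fp_source: true for SF/DF sources, false for SI/DI sources.
   Returns true when the partial XMM update should not be rewritten:
   either a packed conversion is preferred, or the tuning says there is
   no partial register dependency to break for this kind of conversion.  */
static bool
skip_partial_xmm_rewrite (bool is_fp_source, const struct tune_flags *t)
{
  assert (t != NULL);
  if (is_fp_source)
    return (t->use_vector_fp_converts
            || !t->sse_partial_reg_fp_converts_dependency);
  return (t->use_vector_converts
          || !t->sse_partial_reg_converts_dependency);
}
```

Splitting the dependency flag in two lets a tuning keep the rewrite for integer-to-FP conversions while dropping it for FP-to-FP conversions, or the other way round.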

gcc/

	* config/i386/i386-features.c (remove_partial_avx_dependency):
	Also check TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY and
	TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY before generating
	vxorps.
	* config/i386/i386.h (TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
	New.
	(TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
	* config/i386/i386.md (SSE FP to FP splitters): Replace
	TARGET_SSE_PARTIAL_REG_DEPENDENCY with
	TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY.
	(SSE INT to FP splitter): Replace TARGET_SSE_PARTIAL_REG_DEPENDENCY
	with TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY.
	* config/i386/x86-tune.def
	(X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): New.
	(X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.

gcc/testsuite/

	* gcc.target/i386/avx-covert-1.c: New file.
	* gcc.target/i386/avx-fp-covert-1.c: Likewise.
	* gcc.target/i386/avx-int-covert-1.c: Likewise.
	* gcc.target/i386/sse-covert-1.c: Likewise.
	* gcc.target/i386/sse-fp-covert-1.c: Likewise.
	* gcc.target/i386/sse-int-covert-1.c: Likewise.
---
 gcc/config/i386/i386-features.c               |  6 ++++--
 gcc/config/i386/i386.h                        |  4 ++++
 gcc/config/i386/i386.md                       |  9 ++++++---
 gcc/config/i386/x86-tune.def                  | 15 +++++++++++++++
 gcc/testsuite/gcc.target/i386/avx-covert-1.c  | 19 +++++++++++++++++++
 .../gcc.target/i386/avx-fp-covert-1.c         | 15 +++++++++++++++
 .../gcc.target/i386/avx-int-covert-1.c        | 14 ++++++++++++++
 gcc/testsuite/gcc.target/i386/sse-covert-1.c  | 19 +++++++++++++++++++
 .../gcc.target/i386/sse-fp-covert-1.c         | 15 +++++++++++++++
 .../gcc.target/i386/sse-int-covert-1.c        | 14 ++++++++++++++
 10 files changed, 125 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse-int-covert-1.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index a65f60122a5..a525a83afd3 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2222,12 +2222,14 @@ remove_partial_avx_dependency (void)
 	    {
 	    case E_SFmode:
 	    case E_DFmode:
-	      if (TARGET_USE_VECTOR_FP_CONVERTS)
+	      if (TARGET_USE_VECTOR_FP_CONVERTS
+		  || !TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY)
 		continue;
 	      break;
 	    case E_SImode:
 	    case E_DImode:
-	      if (TARGET_USE_VECTOR_CONVERTS)
+	      if (TARGET_USE_VECTOR_CONVERTS
+		  || !TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY)
 		continue;
 	      break;
 	    default:
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index e76bb55c080..ec60b89753e 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -334,6 +334,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_PARTIAL_REG_DEPENDENCY]
 #define TARGET_SSE_PARTIAL_REG_DEPENDENCY \
 	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY]
+#define TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY \
+	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY]
+#define TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY \
+	ix86_tune_features[X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY]
 #define TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
 	ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
 #define TARGET_SSE_UNALIGNED_STORE_OPTIMAL \
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 13f6f57cdcc..c82a9dc1f67 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4535,7 +4535,8 @@
         (float_extend:DF
           (match_operand:SF 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -4708,7 +4709,8 @@
         (float_truncate:SF
 	  (match_operand:DF 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!REG_P (operands[1])
        || (!TARGET_AVX && REGNO (operands[0]) != REGNO (operands[1])))
@@ -5243,7 +5245,8 @@
   [(set (match_operand:MODEF 0 "sse_reg_operand")
 	(float:MODEF (match_operand:SWI48 1 "nonimmediate_operand")))]
   "!TARGET_AVX
-   && TARGET_SSE_PARTIAL_REG_DEPENDENCY && epilogue_completed
+   && TARGET_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY
+   && epilogue_completed
    && optimize_function_for_speed_p (cfun)
    && (!EXT_REX_SSE_REG_P (operands[0])
        || TARGET_AVX512VL)"
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 088edb6c4ca..58e8ead56b4 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -64,6 +64,21 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
           m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
 	  | m_BDVER | m_ZNVER | m_TREMONT | m_GENERIC)
 
+/* X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY: This knob avoids
+   partial write to the destination in scalar SSE conversion from FP
+   to FP.  */
+DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY,
+	  "sse_partial_reg_fp_converts_dependency",
+	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
+	  | m_BDVER | m_ZNVER | m_GENERIC)
+
+/* X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY: This knob avoids partial
+   write to the destination in scalar SSE conversion from integer to FP.  */
+DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY,
+	  "sse_partial_reg_converts_dependency",
+	  m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_AMDFAM10
+	  | m_BDVER | m_ZNVER | m_GENERIC)
+
 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
    maintain just lower part of scalar values in proper format leaving the
diff --git a/gcc/testsuite/gcc.target/i386/avx-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
new file mode 100644
index 00000000000..b6c794ecbb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-covert-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
+/* { dg-final { scan-assembler-not "vcvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
new file mode 100644
index 00000000000..c40c48b1b2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-fp-covert-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
+
+extern float f;
+extern double d;
+
+void
+foo (void)
+{
+  d = f;
+}
+
+/* { dg-final { scan-assembler "vcvtss2sd" } } */
+/* { dg-final { scan-assembler-not "vcvtps2pd" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
new file mode 100644
index 00000000000..01bb64e66cc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-int-covert-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern int i;
+
+void
+foo (void)
+{
+  f = i;
+}
+
+/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "vxorps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
new file mode 100644
index 00000000000..c30af694505
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-covert-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency,^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern double d;
+extern int i;
+
+void
+foo (void)
+{
+  d = f;
+  f = i;
+}
+
+/* { dg-final { scan-assembler "cvtss2sd" } } */
+/* { dg-final { scan-assembler "cvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "cvtps2pd" } } */
+/* { dg-final { scan-assembler-not "cvtdq2ps" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
new file mode 100644
index 00000000000..b6567e60e3e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-fp-covert-1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_fp_converts_dependency" } */
+
+extern float f;
+extern double d;
+
+void
+foo (void)
+{
+  d = f;
+}
+
+/* { dg-final { scan-assembler "cvtss2sd" } } */
+/* { dg-final { scan-assembler-not "cvtps2pd" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
new file mode 100644
index 00000000000..107f7241def
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse-int-covert-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mfpmath=sse -mtune-ctrl=^sse_partial_reg_converts_dependency" } */
+
+extern float f;
+extern int i;
+
+void
+foo (void)
+{
+  f = i;
+}
+
+/* { dg-final { scan-assembler "cvtsi2ssl" } } */
+/* { dg-final { scan-assembler-not "pxor" } } */
-- 
2.17.1



* Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-17  3:15     ` Cui, Lili
@ 2021-09-17  6:35       ` Uros Bizjak
  2021-09-17 23:50         ` Jakub Jelinek
  0 siblings, 1 reply; 16+ messages in thread
From: Uros Bizjak @ 2021-09-17  6:35 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches, Liu, Hongtao, H. J. Lu

On Fri, Sep 17, 2021 at 5:15 AM Cui, Lili <lili.cui@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Uros Bizjak <ubizjak@gmail.com>
> > Sent: Thursday, September 16, 2021 2:28 PM
> > To: Cui, Lili <lili.cui@intel.com>
> > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao.liu@intel.com>; H. J. Lu
> > <hjl.tools@gmail.com>
> > Subject: Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle
> > USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
> >
> > On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
> > >
> > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > >
> > > Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
> > > handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> > > XMM register update if vector packed SSE conversion should be used.
> > >
> > > gcc/
> > >
> > >         PR target/101900
> > >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> > >         Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
> > >         before generating vxorps.
> > >
> > > gcc/
> > >
> > >         PR target/101900
> > >         * testsuite/gcc.target/i386/pr101900-1.c: New test.
> > >         * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
> > >         * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
> > >  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
> > >  4 files changed, 73 insertions(+), 3 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
> > >
> > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > index 5a99ea7c046..ae5ea02a002 100644
> > > --- a/gcc/config/i386/i386-features.c
> > > +++ b/gcc/config/i386/i386-features.c
> > > @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
> > >               != AVX_PARTIAL_XMM_UPDATE_TRUE)
> > >             continue;
> > >
> > > -         if (!v4sf_const0)
> > > -           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > > -
> > >           /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
> > >              SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
> > >              vec_merge with subreg.  */
> > >           rtx src = SET_SRC (set);
> > >           rtx dest = SET_DEST (set);
> > >           machine_mode dest_mode = GET_MODE (dest);
> > > +         machine_mode src_mode;
> > > +
> > > +         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > > +           {
> > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > +             if (src_mode == E_SFmode || src_mode == E_DFmode)
> > > +               continue;
> > > +           }
> > > +
> > > +         if (TARGET_USE_VECTOR_CONVERTS)
> > > +           {
> > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > +             if (src_mode == E_SImode || src_mode == E_DImode)
> > > +               continue;
> > > +           }
> > > +
> > > +         if (!v4sf_const0)
> > > +           v4sf_const0 = gen_reg_rtx (V4SFmode);
> >
> > Please better move initialization of src_mode to the top of the new hunk, like:
> >
> > machine_mode src_mode = GET_MODE (XEXP (src, 0));
> > switch (src_mode) {
> >   case E_SFmode:
> >   case E_DFmode:
> >     if (TARGET_USE_VECTOR_FP_CONVERTS)
> >       continue;
> >     break;
> >   case E_SImode:
> >   case E_DImode:
> >     if (TARGET_USE_VECTOR_CONVERTS)
> >       continue;
> >     break;
> >   default:
> >     break;
> > }
> >
> > or something like the above.
>
> Done, thanks for your good advice, I also rebased patch 4/4, since it is based on patch 3/4.

OK.

Thanks,
Uros.

>
> Changed it to:
>
> +         machine_mode src_mode = GET_MODE (XEXP (src, 0));
> +
> +         switch (src_mode)
> +           {
> +           case E_SFmode:
> +           case E_DFmode:
> +             if (TARGET_USE_VECTOR_FP_CONVERTS)
> +               continue;
> +             break;
> +           case E_SImode:
> +           case E_DImode:
> +             if (TARGET_USE_VECTOR_CONVERTS)
> +               continue;
> +             break;
> +           default:
> +             break;
> +           }
> +         if (!v4sf_const0)
> +           v4sf_const0 = gen_reg_rtx (V4SFmode);
>
> Thanks,
> Lili.
>
> >
> > Uros.
> >
> > >
> > >           rtx zero;
> > >           machine_mode dest_vecmode;
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > > b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > > new file mode 100644
> > > index 00000000000..0a45f8e340a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr101900-1.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts" } */
> > > +
> > > +extern float f;
> > > +extern double d;
> > > +extern int i;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  d = f;
> > > +  f = i;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> > > +/* { dg-final { scan-assembler "vcvtsi2ssl" } } */
> > > +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> > > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-2.c
> > > b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> > > new file mode 100644
> > > index 00000000000..c8b2d1da5ae
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr101900-2.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_converts" } */
> > > +
> > > +extern float f;
> > > +extern double d;
> > > +extern int i;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  d = f;
> > > +  f = i;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "vcvtss2sd" } } */
> > > +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> > > +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> > > +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr101900-3.c
> > > b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> > > new file mode 100644
> > > index 00000000000..6ee565b5bd4
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr101900-3.c
> > > @@ -0,0 +1,19 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake -mfpmath=sse -mtune-ctrl=use_vector_fp_converts,use_vector_converts" } */
> > > +
> > > +extern float f;
> > > +extern double d;
> > > +extern int i;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  d = f;
> > > +  f = i;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "vcvtps2pd" } } */
> > > +/* { dg-final { scan-assembler "vcvtdq2ps" } } */
> > > +/* { dg-final { scan-assembler-not "vcvtss2sd" } } */
> > > +/* { dg-final { scan-assembler-not "vcvtsi2ssl" } } */
> > > +/* { dg-final { scan-assembler-not "vxorps" } } */
> > > --
> > > 2.17.1
> > >


* Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-17  6:35       ` Uros Bizjak
@ 2021-09-17 23:50         ` Jakub Jelinek
  2021-09-18  2:25           ` Hongtao Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2021-09-17 23:50 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Cui, Lili, Liu, Hongtao, gcc-patches

On Fri, Sep 17, 2021 at 08:35:57AM +0200, Uros Bizjak via Gcc-patches wrote:
> > > On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
> > > >
> > > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > >
> > > > Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
> > > > handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> > > > XMM register update if vector packed SSE conversion should be used.
> > > >
> > > > gcc/
> > > >
> > > >         PR target/101900
> > > >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> > > >         Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
> > > >         before generating vxorps.
> > > >
> > > > gcc/
> > > >
> > > >         PR target/101900
> > > >         * testsuite/gcc.target/i386/pr101900-1.c: New test.
> > > >         * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
> > > >         * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> > > > ---
> > > >  gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
> > > >  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
> > > >  4 files changed, 73 insertions(+), 3 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
> > > >
> > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > index 5a99ea7c046..ae5ea02a002 100644
> > > > --- a/gcc/config/i386/i386-features.c
> > > > +++ b/gcc/config/i386/i386-features.c
> > > > @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
> > > >               != AVX_PARTIAL_XMM_UPDATE_TRUE)
> > > >             continue;
> > > >
> > > > -         if (!v4sf_const0)
> > > > -           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > > > -
> > > >           /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
> > > >              SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
> > > >              vec_merge with subreg.  */
> > > >           rtx src = SET_SRC (set);
> > > >           rtx dest = SET_DEST (set);
> > > >           machine_mode dest_mode = GET_MODE (dest);
> > > > +         machine_mode src_mode;
> > > > +
> > > > +         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > > > +           {
> > > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > > +             if (src_mode == E_SFmode || src_mode == E_DFmode)
> > > > +               continue;
> > > > +           }
> > > > +
> > > > +         if (TARGET_USE_VECTOR_CONVERTS)
> > > > +           {
> > > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > > +             if (src_mode == E_SImode || src_mode == E_DImode)
> > > > +               continue;
> > > > +           }
> > > > +
> > > > +         if (!v4sf_const0)
> > > > +           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > >
> > > Please better move initialization of src_mode to the top of the new hunk, like:
> > >
> > > machine_mode src_mode = GET_MODE (XEXP (src, 0));
> > > switch (src_mode) {
> > >   case E_SFmode:
> > >   case E_DFmode:
> > >     if (TARGET_USE_VECTOR_FP_CONVERTS)
> > >       continue;
> > >     break;
> > >   case E_SImode:
> > >   case E_DImode:
> > >     if (TARGET_USE_VECTOR_CONVERTS)
> > >       continue;
> > >     break;
> > >   default:
> > >     break;
> > > }
> > >
> > > or something like the above.
> >
> > Done, thanks for your good advice, I also rebased patch 4/4, since it is based on patch 3/4.

The above change broke
+FAIL: gcc.target/i386/avx512f-vscalefpd-2.c (internal compiler error)
+FAIL: gcc.target/i386/avx512f-vscalefpd-2.c (test for excess errors)
+UNRESOLVED: gcc.target/i386/avx512f-vscalefpd-2.c compilation failed to produce executable
+FAIL: gcc.target/i386/avx512f-vscalefps-2.c (internal compiler error)
+FAIL: gcc.target/i386/avx512f-vscalefps-2.c (test for excess errors)
+UNRESOLVED: gcc.target/i386/avx512f-vscalefps-2.c compilation failed to produce executable
+FAIL: gcc.target/i386/avx512f-vscalefss-2.c (internal compiler error)
+FAIL: gcc.target/i386/avx512f-vscalefss-2.c (test for excess errors)
+UNRESOLVED: gcc.target/i386/avx512f-vscalefss-2.c compilation failed to produce executable
+FAIL: gcc.target/i386/avx512vl-vscalefpd-2.c (internal compiler error)
+FAIL: gcc.target/i386/avx512vl-vscalefpd-2.c (test for excess errors)
+UNRESOLVED: gcc.target/i386/avx512vl-vscalefpd-2.c compilation failed to produce executable
+FAIL: gcc.target/i386/avx512vl-vscalefps-2.c (internal compiler error)
+FAIL: gcc.target/i386/avx512vl-vscalefps-2.c (test for excess errors)
+UNRESOLVED: gcc.target/i386/avx512vl-vscalefps-2.c compilation failed to produce executable
when configured with --enable-checking=yes,rtl,extra, the error is:
during RTL pass: rpad
/home/jakub/src/gcc/gcc/testsuite/gcc.target/i386/avx512f-vscalefpd-2.c:57:1: internal compiler error: RTL check: expected elt 0 type 'e' or 'u', have 'E' (rtx unspec) in remove_partial_avx_dependency, at config/i386/i386-features.c:2219
0x77541d rtl_check_failed_type2(rtx_def const*, int, int, int, char const*, int, char const*)
        ../../gcc/rtl.c:898
0x84e731 remove_partial_avx_dependency
        ../../gcc/config/i386/i386-features.c:2219
0x84e731 execute
        ../../gcc/config/i386/i386-features.c:2389
This is on
2219		  machine_mode src_mode = GET_MODE (XEXP (src, 0));
and src is:
(unspec:DF [
        (mem:DF (plus:DI (reg/f:DI 149)
                (reg:DI 103 [ ivtmp.65 ])) [3 MEM[(double *)&src2 + ivtmp.65_65 * 1]+0 S8 A64])
        (const_int 9 [0x9])
    ] UNSPEC_ROUND)
- whole insn
(insn 55 54 56 5 (set (reg:DF 99 [ _50 ])
        (unspec:DF [
                (mem:DF (plus:DI (reg/f:DI 149)
                        (reg:DI 103 [ ivtmp.65 ])) [3 MEM[(double *)&src2 + ivtmp.65_65 * 1]+0 S8 A64])
                (const_int 9 [0x9])
            ] UNSPEC_ROUND)) "/home/jakub/src/gcc/gcc/testsuite/gcc.target/i386/avx512f-vscalefpd-2.c":19:28 1076 {sse4_1_rounddf2}
     (nil))
so XEXP (src, 0) can't be used in that case.
Looking at insns with avx_partial_xmm_update attribute, it seems
src is either FLOAT_EXTEND/FLOAT_TRUNCATE/FLOAT/UNSIGNED_FLOAT and
in that case it looks like a conversion and has different modes,
or it is UNSPEC (UNSPEC_{RCP,RSQRT,ROUND}) or SQRT and in that case it doesn't.
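
One possible shape for a guard, sketched on a toy model rather than on
real GCC RTL, is to look at the code of src first and only read the
operand's mode for the conversion codes.  All toy_* names below are
hypothetical, and this is not necessarily the fix that ends up being
committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for GCC's rtx codes and modes.  */
enum toy_code
{
  TOY_REG, TOY_FLOAT_EXTEND, TOY_FLOAT_TRUNCATE, TOY_FLOAT,
  TOY_UNSIGNED_FLOAT, TOY_SQRT, TOY_UNSPEC
};
enum toy_mode { TOY_VOIDmode, TOY_SFmode, TOY_DFmode, TOY_SImode, TOY_DImode };

struct toy_rtx
{
  enum toy_code code;
  enum toy_mode mode;
  /* A single expression for conversions and SQRT.  For TOY_UNSPEC the
     real operand 0 is a vector, so it is modelled here as NULL.  */
  const struct toy_rtx *operand0;
};

/* True for the codes whose operand 0 is an expression with a mode that
   differs from the result mode.  */
static bool
toy_is_conversion (enum toy_code code)
{
  switch (code)
    {
    case TOY_FLOAT_EXTEND:
    case TOY_FLOAT_TRUNCATE:
    case TOY_FLOAT:
    case TOY_UNSIGNED_FLOAT:
      return true;
    default:
      return false;
    }
}

/* Return the mode being converted from, or TOY_VOIDmode when SRC is
   not a conversion, without touching operand 0 in that case.  */
static enum toy_mode
toy_conversion_src_mode (const struct toy_rtx *src)
{
  if (!toy_is_conversion (src->code))
    return TOY_VOIDmode;
  assert (src->operand0 != NULL);
  return src->operand0->mode;
}
```

With a guard like this, the SFmode/DFmode/SImode/DImode switch would
see a void mode for the UNSPEC and SQRT cases and fall through to its
default arm.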

	Jakub



* Re: [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS
  2021-09-17 23:50         ` Jakub Jelinek
@ 2021-09-18  2:25           ` Hongtao Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Hongtao Liu @ 2021-09-18  2:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches, Liu, Hongtao

On Sat, Sep 18, 2021 at 7:50 AM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, Sep 17, 2021 at 08:35:57AM +0200, Uros Bizjak via Gcc-patches wrote:
> > > > On Wed, Sep 15, 2021 at 10:10 AM <lili.cui@intel.com> wrote:
> > > > >
> > > > > From: "H.J. Lu" <hjl.tools@gmail.com>
> > > > >
> > > > > Check TARGET_USE_VECTOR_FP_CONVERTS or TARGET_USE_VECTOR_CONVERTS when
> > > > > handling avx_partial_xmm_update attribute.  Don't convert AVX partial
> > > > > XMM register update if vector packed SSE conversion should be used.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR target/101900
> > > > >         * config/i386/i386-features.c (remove_partial_avx_dependency):
> > > > >         Check TARGET_USE_VECTOR_FP_CONVERTS and TARGET_USE_VECTOR_CONVERTS
> > > > >         before generating vxorps.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR target/101900
> > > > >         * testsuite/gcc.target/i386/pr101900-1.c: New test.
> > > > >         * testsuite/gcc.target/i386/pr101900-2.c: Likewise.
> > > > >         * testsuite/gcc.target/i386/pr101900-3.c: Likewise.
> > > > > ---
> > > > >  gcc/config/i386/i386-features.c            | 21 ++++++++++++++++++---
> > > > >  gcc/testsuite/gcc.target/i386/pr101900-1.c | 18 ++++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr101900-2.c | 18 ++++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr101900-3.c | 19 +++++++++++++++++++
> > > > >  4 files changed, 73 insertions(+), 3 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101900-3.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > > index 5a99ea7c046..ae5ea02a002 100644
> > > > > --- a/gcc/config/i386/i386-features.c
> > > > > +++ b/gcc/config/i386/i386-features.c
> > > > > @@ -2210,15 +2210,30 @@ remove_partial_avx_dependency (void)
> > > > >               != AVX_PARTIAL_XMM_UPDATE_TRUE)
> > > > >             continue;
> > > > >
> > > > > -         if (!v4sf_const0)
> > > > > -           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > > > > -
> > > > >           /* Convert PARTIAL_XMM_UPDATE_TRUE insns, DF -> SF, SF -> DF,
> > > > >              SI -> SF, SI -> DF, DI -> SF, DI -> DF, to vec_dup and
> > > > >              vec_merge with subreg.  */
> > > > >           rtx src = SET_SRC (set);
> > > > >           rtx dest = SET_DEST (set);
> > > > >           machine_mode dest_mode = GET_MODE (dest);
> > > > > +         machine_mode src_mode;
> > > > > +
> > > > > +         if (TARGET_USE_VECTOR_FP_CONVERTS)
> > > > > +           {
> > > > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > > > +             if (src_mode == E_SFmode || src_mode == E_DFmode)
> > > > > +               continue;
> > > > > +           }
> > > > > +
> > > > > +         if (TARGET_USE_VECTOR_CONVERTS)
> > > > > +           {
> > > > > +             src_mode = GET_MODE (XEXP (src, 0));
> > > > > +             if (src_mode == E_SImode || src_mode == E_DImode)
> > > > > +               continue;
> > > > > +           }
> > > > > +
> > > > > +         if (!v4sf_const0)
> > > > > +           v4sf_const0 = gen_reg_rtx (V4SFmode);
> > > >
> > > > Please better move initialization of src_mode to the top of the new hunk, like:
> > > >
> > > > machine_mode src_mode = GET_MODE (XEXP (src, 0));
> > > > switch (src_mode) {
> > > >   case E_SFmode:
> > > >   case E_DFmode:
> > > >     if (TARGET_USE_VECTOR_FP_CONVERTS)
> > > >       continue;
> > > >     break;
> > > >   case E_SImode:
> > > >   case E_DImode:
> > > >     if (TARGET_USE_VECTOR_CONVERTS)
> > > >       continue;
> > > >     break;
> > > >   default:
> > > >     break;
> > > > }
> > > >
> > > > or something like the above.
> > >
> > > Done, thanks for your good advice, I also rebased patch 4/4, since it is based on patch 3/4.
>
> The above change broke
> +FAIL: gcc.target/i386/avx512f-vscalefpd-2.c (internal compiler error)
> +FAIL: gcc.target/i386/avx512f-vscalefpd-2.c (test for excess errors)
> +UNRESOLVED: gcc.target/i386/avx512f-vscalefpd-2.c compilation failed to produce executable
> +FAIL: gcc.target/i386/avx512f-vscalefps-2.c (internal compiler error)
> +FAIL: gcc.target/i386/avx512f-vscalefps-2.c (test for excess errors)
> +UNRESOLVED: gcc.target/i386/avx512f-vscalefps-2.c compilation failed to produce executable
> +FAIL: gcc.target/i386/avx512f-vscalefss-2.c (internal compiler error)
> +FAIL: gcc.target/i386/avx512f-vscalefss-2.c (test for excess errors)
> +UNRESOLVED: gcc.target/i386/avx512f-vscalefss-2.c compilation failed to produce executable
> +FAIL: gcc.target/i386/avx512vl-vscalefpd-2.c (internal compiler error)
> +FAIL: gcc.target/i386/avx512vl-vscalefpd-2.c (test for excess errors)
> +UNRESOLVED: gcc.target/i386/avx512vl-vscalefpd-2.c compilation failed to produce executable
> +FAIL: gcc.target/i386/avx512vl-vscalefps-2.c (internal compiler error)
> +FAIL: gcc.target/i386/avx512vl-vscalefps-2.c (test for excess errors)
> +UNRESOLVED: gcc.target/i386/avx512vl-vscalefps-2.c compilation failed to produce executable
> when configured with --enable-checking=yes,rtl,extra, the error is:
> during RTL pass: rpad
> /home/jakub/src/gcc/gcc/testsuite/gcc.target/i386/avx512f-vscalefpd-2.c:57:1: internal compiler error: RTL check: expected elt 0 type 'e' or 'u', have 'E' (rtx unspec) in remove_partial_avx_dependency, at config/i386/i386-features.c:2219
> 0x77541d rtl_check_failed_type2(rtx_def const*, int, int, int, char const*, int, char const*)
>         ../../gcc/rtl.c:898
> 0x84e731 remove_partial_avx_dependency
>         ../../gcc/config/i386/i386-features.c:2219
> 0x84e731 execute
>         ../../gcc/config/i386/i386-features.c:2389
> This is on
> 2219              machine_mode src_mode = GET_MODE (XEXP (src, 0));
> and src is:
> (unspec:DF [
>         (mem:DF (plus:DI (reg/f:DI 149)
>                 (reg:DI 103 [ ivtmp.65 ])) [3 MEM[(double *)&src2 + ivtmp.65_65 * 1]+0 S8 A64])
>         (const_int 9 [0x9])
>     ] UNSPEC_ROUND)
> - whole insn
> (insn 55 54 56 5 (set (reg:DF 99 [ _50 ])
>         (unspec:DF [
>                 (mem:DF (plus:DI (reg/f:DI 149)
>                         (reg:DI 103 [ ivtmp.65 ])) [3 MEM[(double *)&src2 + ivtmp.65_65 * 1]+0 S8 A64])
>                 (const_int 9 [0x9])
>             ] UNSPEC_ROUND)) "/home/jakub/src/gcc/gcc/testsuite/gcc.target/i386/avx512f-vscalefpd-2.c":19:28 1076 {sse4_1_rounddf2}
>      (nil))
> so XEXP (src, 0) can't be used in that case.
Let me fix this.
> Looking at insns with avx_partial_xmm_update attribute, it seems
> src is either FLOAT_EXTEND/FLOAT_TRUNCATE/FLOAT/UNSIGNED_FLOAT and
> in that case it looks like a conversion and has different modes,
> or it is UNSPEC (UNSPEC_{RCP,RSQRT,ROUND}) or SQRT and in that case it doesn't.
Yes, I'll add comments to mention that pass_rpad also handles rcp/round/sqrt/rsqrt.
>
>         Jakub
>


-- 
BR,
Hongtao


end of thread, other threads:[~2021-09-18  2:19 UTC | newest]

Thread overview: 16+ messages
2021-09-15  8:09 [PATCH 0/4] Update mtune=tremont lili.cui
2021-09-15  8:09 ` [PATCH 1/4] [PATCH 1/4] x86: Update -mtune=tremont lili.cui
2021-09-16  6:35   ` Uros Bizjak
2021-09-15  8:09 ` [PATCH 2/4] [PATCH 2/4] x86: Update memcpy/memset inline strategies for -mtune=tremont lili.cui
2021-09-16  6:36   ` Uros Bizjak
2021-09-15  8:09 ` [PATCH 3/4] [PATCH 3/4] x86: Properly handle USE_VECTOR_FP_CONVERTS/USE_VECTOR_CONVERTS lili.cui
2021-09-16  6:27   ` Uros Bizjak
2021-09-17  3:15     ` Cui, Lili
2021-09-17  6:35       ` Uros Bizjak
2021-09-17 23:50         ` Jakub Jelinek
2021-09-18  2:25           ` Hongtao Liu
2021-09-15  8:09 ` [PATCH 4/4] [PATCH 4/4] x86: Add TARGET_SSE_PARTIAL_REG_[FP_]CONVERTS_DEPENDENCY lili.cui
2021-09-15 14:13   ` H.J. Lu
2021-09-15 23:54     ` Cui, Lili
2021-09-16  2:21       ` H.J. Lu
2021-09-16  6:34   ` Uros Bizjak
