public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/3] x86: Update memcpy/memset inline strategies
@ 2021-03-22 13:16 H.J. Lu
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256;
   see the sketch below.
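
For illustration only (not part of the patch), a rough sketch of how the
three cases are expected to expand with -O2 -mtune=generic once the series
is applied; the size cut-offs follow the new tests, and the function names
are made up:

void
copy_const_small (char *dest, char *src)
{
  /* Known constant size <= 256: expanded inline; the new
     memcpy-strategy-12.c test expects a "rep movsb" sequence here.  */
  __builtin_memcpy (dest, src, 249);
}

void
copy_const_large (char *dest, char *src)
{
  /* Known constant size > 256: becomes a plain call to the memcpy
     library function (compare memcpy-strategy-13.c).  */
  __builtin_memcpy (dest, src, 257);
}

void
copy_unknown (char *dest, char *src, unsigned long n)
{
  /* Size not known at compile time: left to the library call, or to a
     small inline loop when the size is known to be <= 256.  */
  __builtin_memcpy (dest, src, n);
}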

There are no significant performance impacts on SPEC CPU 2017.  There
are visible performance improvements on eembc benchmarks with one
regression.

H.J. Lu (3):
  x86: Update memcpy/memset inline strategies for Ice Lake
  x86: Update memcpy/memset inline strategies for Skylake family CPUs
  x86: Update memcpy/memset inline strategies for -mtune=generic

 gcc/config/i386/i386-expand.c                 |  11 +-
 gcc/config/i386/i386-options.c                |  12 +-
 gcc/config/i386/i386.h                        |   2 +
 gcc/config/i386/x86-tune-costs.h              | 185 ++++++++++++++++--
 gcc/config/i386/x86-tune.def                  |   6 +
 .../gcc.target/i386/memcpy-strategy-10.c      |  11 ++
 .../gcc.target/i386/memcpy-strategy-11.c      |  18 ++
 .../gcc.target/i386/memcpy-strategy-12.c      |   9 +
 .../gcc.target/i386/memcpy-strategy-13.c      |  11 ++
 .../gcc.target/i386/memcpy-strategy-5.c       |  11 ++
 .../gcc.target/i386/memcpy-strategy-6.c       |  18 ++
 .../gcc.target/i386/memcpy-strategy-7.c       |   9 +
 .../gcc.target/i386/memcpy-strategy-8.c       |  18 ++
 .../gcc.target/i386/memcpy-strategy-9.c       |   9 +
 .../gcc.target/i386/memset-strategy-10.c      |  11 ++
 .../gcc.target/i386/memset-strategy-11.c      |   9 +
 .../gcc.target/i386/memset-strategy-3.c       |  17 ++
 .../gcc.target/i386/memset-strategy-4.c       |  17 ++
 .../gcc.target/i386/memset-strategy-5.c       |  11 ++
 .../gcc.target/i386/memset-strategy-6.c       |   9 +
 .../gcc.target/i386/memset-strategy-7.c       |  11 ++
 .../gcc.target/i386/memset-strategy-8.c       |   9 +
 .../gcc.target/i386/memset-strategy-9.c       |  17 ++
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |   2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |   2 +-
 25 files changed, 413 insertions(+), 32 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c

-- 
2.30.2



* [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
@ 2021-03-22 13:16 ` H.J. Lu
  2021-03-22 14:10   ` Jan Hubicka
  2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
  2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=icelake:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256.

On an Ice Lake processor with -march=native -Ofast -flto,

1. Performance impacts on SPEC CPU 2017 rate are:

500.perlbench_r -0.93%
502.gcc_r        0.36%
505.mcf_r        0.31%
520.omnetpp_r   -0.07%
523.xalancbmk_r -0.53%
525.x264_r      -0.09%
531.deepsjeng_r -0.19%
541.leela_r      0.16%
548.exchange2_r  0.22%
557.xz_r        -1.64%
Geomean         -0.24%

503.bwaves_r    -0.01%
507.cactuBSSN_r  0.00%
508.namd_r       0.12%
510.parest_r     0.07%
511.povray_r     0.29%
519.lbm_r        0.00%
521.wrf_r       -0.38%
526.blender_r    0.16%
527.cam4_r       0.18%
538.imagick_r    0.76%
544.nab_r       -0.84%
549.fotonik3d_r -0.07%
554.roms_r      -0.01%
Geomean          0.02%

2. Significant impacts on eembc benchmarks are:

eembc/nnet_test      9.90%
eembc/mp2decoddata2  16.42%
eembc/textv2data3   -4.86%
eembc/qos            12.90%

gcc/

	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
	to SImode.
	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
	"rep movsb/stosb" only for known sizes.
	* config/i386/i386-options.c (processor_cost_table): Use Ice
	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
	Rapids and Alder Lake.
	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
	(icelake_memset): Likewise.
	(icelake_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	New.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-5.c: New test.
	* gcc.target/i386/memcpy-strategy-6.c: Likewise.
	* gcc.target/i386/memcpy-strategy-7.c: Likewise.
	* gcc.target/i386/memcpy-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-3.c: Likewise.
	* gcc.target/i386/memset-strategy-4.c: Likewise.
	* gcc.target/i386/memset-strategy-5.c: Likewise.
	* gcc.target/i386/memset-strategy-6.c: Likewise.
---
 gcc/config/i386/i386-expand.c                 |  11 +-
 gcc/config/i386/i386-options.c                |  12 +-
 gcc/config/i386/i386.h                        |   2 +
 gcc/config/i386/x86-tune-costs.h              | 127 ++++++++++++++++++
 gcc/config/i386/x86-tune.def                  |   7 +
 .../gcc.target/i386/memcpy-strategy-5.c       |  11 ++
 .../gcc.target/i386/memcpy-strategy-6.c       |  18 +++
 .../gcc.target/i386/memcpy-strategy-7.c       |   9 ++
 .../gcc.target/i386/memcpy-strategy-8.c       |  18 +++
 .../gcc.target/i386/memset-strategy-3.c       |  17 +++
 .../gcc.target/i386/memset-strategy-4.c       |  17 +++
 .../gcc.target/i386/memset-strategy-5.c       |  11 ++
 .../gcc.target/i386/memset-strategy-6.c       |   9 ++
 13 files changed, 260 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-6.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index ac69eed4d32..00efe090d97 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -5976,6 +5976,7 @@ expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem,
   /* If possible, it is shorter to use rep movs.
      TODO: Maybe it is better to move this logic to decide_alg.  */
   if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3)
+      && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB
       && (!issetmem || orig_value == const0_rtx))
     mode = SImode;
 
@@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
   const struct processor_costs *cost;
   int i;
   bool any_alg_usable_p = false;
+  bool known_size_p = expected_size != -1;
 
   *noalign = false;
   *dynamic_check = -1;
@@ -6899,7 +6901,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
   if (optimize_function_for_size_p (cfun)
       || (optimize_insn_for_size_p ()
  	  && (max_size < 256
-              || (expected_size != -1 && expected_size < 256))))
+	      || (known_size_p && expected_size < 256))))
     optimize_for_speed = false;
   else
     optimize_for_speed = true;
@@ -6925,7 +6927,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
      so inline version is a win, set expected size into
      the range.  */
   if (((max > 1 && (unsigned HOST_WIDE_INT) max >= max_size) || max == -1)
-      && expected_size == -1)
+      && !known_size_p)
     expected_size = min_size / 2 + max_size / 2;
 
   /* If user specified the algorithm, honor it if possible.  */
@@ -6984,7 +6986,10 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
 		  else if (!any_alg_usable_p)
 		    break;
 		}
-	      else if (alg_usable_p (candidate, memset, have_as))
+	      else if (alg_usable_p (candidate, memset, have_as)
+		       && !(TARGET_PREFER_KNOWN_REP_MOVSB_STOSB
+			    && candidate == rep_prefix_1_byte
+			    && !known_size_p))
 		{
 		  *noalign = algs->size[i].noalign;
 		  return candidate;
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index b653527d266..bd52ce6ffec 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -721,14 +721,14 @@ static const struct processor_costs *processor_cost_table[] =
   &slm_cost,
   &skylake_cost,
   &skylake_cost,
+  &icelake_cost,
+  &icelake_cost,
+  &icelake_cost,
   &skylake_cost,
+  &icelake_cost,
   &skylake_cost,
-  &skylake_cost,
-  &skylake_cost,
-  &skylake_cost,
-  &skylake_cost,
-  &skylake_cost,
-  &skylake_cost,
+  &icelake_cost,
+  &icelake_cost,
   &intel_cost,
   &geode_cost,
   &k6_cost,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 058c1cc25b2..b4001d21b70 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -523,6 +523,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_PROMOTE_QImode	ix86_tune_features[X86_TUNE_PROMOTE_QIMODE]
 #define TARGET_FAST_PREFIX	ix86_tune_features[X86_TUNE_FAST_PREFIX]
 #define TARGET_SINGLE_STRINGOP	ix86_tune_features[X86_TUNE_SINGLE_STRINGOP]
+#define TARGET_PREFER_KNOWN_REP_MOVSB_STOSB \
+  ix86_tune_features[X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB]
 #define TARGET_MISALIGNED_MOVE_STRING_PRO_EPILOGUES \
 	ix86_tune_features[X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES]
 #define TARGET_QIMODE_MATH	ix86_tune_features[X86_TUNE_QIMODE_MATH]
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 58b3b81985b..0e00ff99df3 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1936,6 +1936,133 @@ struct processor_costs skylake_cost = {
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
 };
+
+/* icelake_cost should produce code tuned for Icelake family of CPUs.
+   NB: rep_prefix_1_byte is used only for known size. */
+
+static stringop_algs icelake_memcpy[2] =   {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+
+static stringop_algs icelake_memset[2] = {
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
+
+static const
+struct processor_costs icelake_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
+  6,				     /* cost for loading QImode using movzbl */
+  {4, 4, 4},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {6, 6, 6},				/* cost of storing integer registers */
+  2,					/* cost of reg,reg fld/fst */
+  {6, 6, 8},				/* cost of loading fp registers
+					   in SFmode, DFmode and XFmode */
+  {6, 6, 10},				/* cost of storing fp registers
+					   in SFmode, DFmode and XFmode */
+  2,					/* cost of moving MMX register */
+  {6, 6},				/* cost of loading MMX registers
+					   in SImode and DImode */
+  {6, 6},				/* cost of storing MMX registers
+					   in SImode and DImode */
+  2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE registers
+					   in 32,64,128,256 and 512-bit */
+  {8, 8, 8, 12, 24},			/* cost of storing SSE registers
+					   in 32,64,128,256 and 512-bit */
+  6, 6,				/* SSE->integer and integer->SSE moves */
+  5, 5,				/* mask->integer and integer->mask moves */
+  {8, 8, 8},				/* cost of loading mask register
+					   in QImode, HImode, SImode.  */
+  {6, 6, 6},				/* cost if storing mask register
+					   in QImode, HImode, SImode.  */
+  3,					/* cost of moving mask register.  */
+  /* End of register allocator costs.  */
+  },
+
+  COSTS_N_INSNS (1),			/* cost of an add instruction */
+  COSTS_N_INSNS (1)+1,		/* cost of a lea instruction */
+  COSTS_N_INSNS (1),			/* variable shift costs */
+  COSTS_N_INSNS (1),			/* constant shift costs */
+  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
+   COSTS_N_INSNS (4),			/*				 HI */
+   COSTS_N_INSNS (3),			/*				 SI */
+   COSTS_N_INSNS (3),			/*				 DI */
+   COSTS_N_INSNS (3)},			/*			      other */
+  0,					/* cost of multiply per each bit set */
+  /* Expanding div/mod currently doesn't consider parallelism. So the cost
+     model is not realistic. We compensate by increasing the latencies a bit.  */
+  {COSTS_N_INSNS (11),			/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (11),			/*			    HI */
+   COSTS_N_INSNS (14),			/*			    SI */
+   COSTS_N_INSNS (76),			/*			    DI */
+   COSTS_N_INSNS (76)},			/*			    other */
+  COSTS_N_INSNS (1),			/* cost of movsx */
+  COSTS_N_INSNS (0),			/* cost of movzx */
+  8,					/* "large" insn */
+  17,					/* MOVE_RATIO */
+  17,					/* CLEAR_RATIO */
+  {4, 4, 4},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {6, 6, 6},				/* cost of storing integer registers */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 12, 24},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
+  6,					/* cost of moving SSE register to integer.  */
+  20, 8,				/* Gather load static, per_elt.  */
+  22, 10,				/* Gather store static, per_elt.  */
+  64,					/* size of l1 cache.  */
+  512,					/* size of l2 cache.  */
+  64,					/* size of prefetch block */
+  6,					/* number of parallel prefetches */
+  3,					/* Branch cost */
+  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (4),			/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (20),			/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (20),			/* cost of FSQRT instruction.  */
+
+  COSTS_N_INSNS (1),			/* cost of cheap SSE instruction.  */
+  COSTS_N_INSNS (4),			/* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (4),			/* cost of MULSS instruction.  */
+  COSTS_N_INSNS (4),			/* cost of MULSD instruction.  */
+  COSTS_N_INSNS (4),			/* cost of FMA SS instruction.  */
+  COSTS_N_INSNS (4),			/* cost of FMA SD instruction.  */
+  COSTS_N_INSNS (11),			/* cost of DIVSS instruction.  */
+  COSTS_N_INSNS (14),			/* cost of DIVSD instruction.  */
+  COSTS_N_INSNS (12),			/* cost of SQRTSS instruction.  */
+  COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
+  1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  icelake_memcpy,
+  icelake_memset,
+  COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
+  COSTS_N_INSNS (1),			/* cond_not_taken_branch_cost.  */
+  "16:11:8",				/* Loop alignment.  */
+  "16:11:8",				/* Jump alignment.  */
+  "0:0:8",				/* Label alignment.  */
+  "16",					/* Func alignment.  */
+};
+
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
      very small blocks it is better to use loop. For large blocks, libcall can
      do nontemporary accesses and beat inline considerably.  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index caebf76736e..134916cc972 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -269,6 +269,13 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
    as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
 DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
 
+/* X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB: Enable use of REP MOVSB/STOSB to
+   move/set sequences of bytes with known size.  */
+DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
+	  "prefer_known_rep_movsb_stosb",
+	  m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE
+	  | m_ALDERLAKE | m_SAPPHIRERAPIDS)
+
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
    requires target to handle misaligned moves and partial memory stalls
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c
new file mode 100644
index 00000000000..83c333b551d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c
new file mode 100644
index 00000000000..ed963dec853
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  e_u8 b[4][MAXBC];
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c
new file mode 100644
index 00000000000..be66d6b8426
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake -mno-sse" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 256);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c
new file mode 100644
index 00000000000..e8fe0a66c98
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  e_u8 b[4][MAXBC];
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-3.c b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c
new file mode 100644
index 00000000000..9ea1e1ae7c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = 1;
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-4.c b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c
new file mode 100644
index 00000000000..00d82f13ff8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = 1;
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-5.c b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c
new file mode 100644
index 00000000000..dc1de8e79c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-6.c b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c
new file mode 100644
index 00000000000..e51af3b730f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=tigerlake -mno-sse" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 256);
+}
-- 
2.30.2



* [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
@ 2021-03-22 13:16 ` H.J. Lu
  2021-04-05 13:45   ` H.J. Lu
  2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches for
Skylake family CPUs:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256.

On a Cascadelake processor with -march=native -Ofast -flto,

1. Performance impacts on SPEC CPU 2017 rate are:

500.perlbench_r  0.17%
502.gcc_r       -0.36%
505.mcf_r        0.00%
520.omnetpp_r    0.08%
523.xalancbmk_r -0.62%
525.x264_r       1.04%
531.deepsjeng_r  0.11%
541.leela_r     -1.09%
548.exchange2_r -0.25%
557.xz_r         0.17%
Geomean         -0.08%

503.bwaves_r     0.00%
507.cactuBSSN_r  0.69%
508.namd_r      -0.07%
510.parest_r     1.12%
511.povray_r     1.82%
519.lbm_r        0.00%
521.wrf_r       -1.32%
526.blender_r   -0.47%
527.cam4_r       0.23%
538.imagick_r   -1.72%
544.nab_r       -0.56%
549.fotonik3d_r  0.12%
554.roms_r       0.43%
Geomean          0.02%

2. Significant impacts on eembc benchmarks are:

eembc/idctrn01   9.23%
eembc/nnet_test  29.26%

gcc/

	* config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
	(skylake_memset): Likewise.
	(skylake_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
	m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and m_CORE_AVX512.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-9.c: New test.
	* gcc.target/i386/memcpy-strategy-10.c: Likewise.
	* gcc.target/i386/memcpy-strategy-11.c: Likewise.
	* gcc.target/i386/memset-strategy-7.c: Likewise.
	* gcc.target/i386/memset-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-9.c: Likewise.
---
 gcc/config/i386/x86-tune-costs.h              | 27 ++++++++++++-------
 gcc/config/i386/x86-tune.def                  |  3 +--
 .../gcc.target/i386/memcpy-strategy-10.c      | 11 ++++++++
 .../gcc.target/i386/memcpy-strategy-11.c      | 18 +++++++++++++
 .../gcc.target/i386/memcpy-strategy-9.c       |  9 +++++++
 .../gcc.target/i386/memset-strategy-7.c       | 11 ++++++++
 .../gcc.target/i386/memset-strategy-8.c       |  9 +++++++
 .../gcc.target/i386/memset-strategy-9.c       | 17 ++++++++++++
 8 files changed, 93 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 0e00ff99df3..ffe810f2bcb 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = {
 
 /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
 static stringop_algs skylake_memcpy[2] =   {
-  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 
 static stringop_algs skylake_memset[2] = {
-  {libcall, {{6, loop_1_byte, true},
-             {24, loop, true},
-             {8192, rep_prefix_4_byte, true},
-             {-1, libcall, false}}},
-  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 
 static const
 struct processor_costs skylake_cost = {
@@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = {
   COSTS_N_INSNS (0),			/* cost of movzx */
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
-  6,					/* CLEAR_RATIO */
+  17,					/* CLEAR_RATIO */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
 					   Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 134916cc972..eb057a67750 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,8 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
 	  "prefer_known_rep_movsb_stosb",
-	  m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE
-	  | m_ALDERLAKE | m_SAPPHIRERAPIDS)
+	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
new file mode 100644
index 00000000000..970aa741971
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
new file mode 100644
index 00000000000..b6041944630
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  e_u8 b[4][MAXBC];
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
new file mode 100644
index 00000000000..b0dc7484d09
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 256);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-7.c b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
new file mode 100644
index 00000000000..07c2816910c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-8.c b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
new file mode 100644
index 00000000000..52ea882c814
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 256);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-9.c b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
new file mode 100644
index 00000000000..d4db031958f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = 1;
+}
-- 
2.30.2



* [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
  2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
@ 2021-03-22 13:16 ` H.J. Lu
  2021-03-22 13:29   ` Richard Biener
  2 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:16 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=generic:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256.

With -mtune=generic -O2,

1. On Ice Lake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r  0.51%
502.gcc_r        0.55%
505.mcf_r        0.38%
520.omnetpp_r   -0.74%
523.xalancbmk_r -0.35%
525.x264_r       2.99%
531.deepsjeng_r -0.17%
541.leela_r     -0.98%
548.exchange2_r  0.89%
557.xz_r         0.70%
Geomean          0.37%

503.bwaves_r     0.04%
507.cactuBSSN_r -0.01%
508.namd_r      -0.45%
510.parest_r    -0.09%
511.povray_r    -1.37%
519.lbm_r        0.00%
521.wrf_r       -2.56%
526.blender_r   -0.01%
527.cam4_r      -0.05%
538.imagick_r    0.36%
544.nab_r        0.08%
549.fotonik3d_r -0.06%
554.roms_r       0.05%
Geomean         -0.34%

Significant impacts on eembc benchmarks:

eembc/nnet_test      14.85%
eembc/mp2decoddata2  13.57%

2. On Cascadelake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.02%
502.gcc_r        0.10%
505.mcf_r       -1.14%
520.omnetpp_r   -0.22%
523.xalancbmk_r  0.21%
525.x264_r       0.94%
531.deepsjeng_r -0.37%
541.leela_r     -0.46%
548.exchange2_r -0.40%
557.xz_r         0.60%
Geomean         -0.08%

503.bwaves_r    -0.50%
507.cactuBSSN_r  0.05%
508.namd_r      -0.02%
510.parest_r     0.09%
511.povray_r    -1.35%
519.lbm_r        0.00%
521.wrf_r       -0.03%
526.blender_r   -0.83%
527.cam4_r       1.23%
538.imagick_r    0.97%
544.nab_r       -0.02%
549.fotonik3d_r -0.12%
554.roms_r       0.55%
Geomean          0.00%

Significant impacts on eembc benchmarks:

eembc/nnet_test      9.90%
eembc/mp2decoddata2  16.42%
eembc/textv2data3   -4.86%
eembc/qos            12.90%

3. On Znver3 processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.96%
502.gcc_r       -1.06%
505.mcf_r       -0.01%
520.omnetpp_r   -1.45%
523.xalancbmk_r  2.89%
525.x264_r       4.98%
531.deepsjeng_r  0.18%
541.leela_r     -1.54%
548.exchange2_r -1.25%
557.xz_r        -0.01%
Geomean          0.16%

503.bwaves_r     0.04%
507.cactuBSSN_r  0.85%
508.namd_r      -0.13%
510.parest_r     0.39%
511.povray_r     0.00%
519.lbm_r        0.00%
521.wrf_r        0.28%
526.blender_r   -0.10%
527.cam4_r      -0.58%
538.imagick_r    0.69%
544.nab_r       -0.04%
549.fotonik3d_r -0.04%
554.roms_r       0.40%
Geomean          0.15%

Significant impacts on eembc benchmarks:

eembc/aifftr01       13.95%
eembc/idctrn01       8.41%
eembc/nnet_test      30.25%
eembc/mp2decoddata2  5.05%
eembc/textv2data3    6.43%
eembc/qos           -5.79%

gcc/

	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Add m_GENERIC.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-10.c: Likewise.
	* gcc.target/i386/memset-strategy-11.c: Likewise.
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
---
 gcc/config/i386/x86-tune-costs.h              | 31 ++++++++++++-------
 gcc/config/i386/x86-tune.def                  |  2 +-
 .../gcc.target/i386/memcpy-strategy-12.c      |  9 ++++++
 .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-10.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-11.c      |  9 ++++++
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
 8 files changed, 63 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..30e7c3e4261 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = {
   "16",					/* Func alignment.  */
 };
 
-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips).  NB: rep_prefix_1_byte is used only
+   for known size.  */
 
 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static const
 struct processor_costs generic_cost = {
   {
@@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),			/* cost of movzx */
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
-  6,					/* CLEAR_RATIO */
+  17,					/* CLEAR_RATIO */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
 					   Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index eb057a67750..fd9c011a3f5 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
 	  "prefer_known_rep_movsb_stosb",
-	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..87f03352736
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 249);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..cfc3cfba623
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
new file mode 100644
index 00000000000..ade5e8da42c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
new file mode 100644
index 00000000000..d1b86152474
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 94dadd6cdbd..44fe7d2836e 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */
 
 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index aec095eda62..f61621e42bf 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
 #include <string.h>
-- 
2.30.2



* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
@ 2021-03-22 13:29   ` Richard Biener
  2021-03-22 13:38     ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-03-22 13:29 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Jan Hubicka, Hongtao Liu, Hongyu Wang

On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Simplify memcpy and memset inline strategies to avoid branches for
> -mtune=generic:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
> 3. Use memcpy/memset library function if data size is unknown or > 256.
>
> With -mtune=generic -O2,

Is there any visible code-size effect of increasing CLEAR_RATIO on
SPEC/eembc?  Did you play with other values of MOVE/CLEAR_RATIO?
17 memory-to-memory/memory-clear insns looks quite a lot.

> 1. On Ice Lake processor,
>
> Performance impacts on SPEC CPU 2017:
>
> 500.perlbench_r  0.51%
> 502.gcc_r        0.55%
> 505.mcf_r        0.38%
> 520.omnetpp_r   -0.74%
> 523.xalancbmk_r -0.35%
> 525.x264_r       2.99%
> 531.deepsjeng_r -0.17%
> 541.leela_r     -0.98%
> 548.exchange2_r  0.89%
> 557.xz_r         0.70%
> Geomean          0.37%
>
> 503.bwaves_r     0.04%
> 507.cactuBSSN_r -0.01%
> 508.namd_r      -0.45%
> 510.parest_r    -0.09%
> 511.povray_r    -1.37%
> 519.lbm_r        0.00%
> 521.wrf_r       -2.56%
> 526.blender_r   -0.01%
> 527.cam4_r      -0.05%
> 538.imagick_r    0.36%
> 544.nab_r        0.08%
> 549.fotonik3d_r -0.06%
> 554.roms_r       0.05%
> Geomean         -0.34%
>
> Significant impacts on eembc benchmarks:
>
> eembc/nnet_test      14.85%
> eembc/mp2decoddata2  13.57%
>
> 2. On Cascadelake processor,
>
> Performance impacts on SPEC CPU 2017:
>
> 500.perlbench_r -0.02%
> 502.gcc_r        0.10%
> 505.mcf_r       -1.14%
> 520.omnetpp_r   -0.22%
> 523.xalancbmk_r  0.21%
> 525.x264_r       0.94%
> 531.deepsjeng_r -0.37%
> 541.leela_r     -0.46%
> 548.exchange2_r -0.40%
> 557.xz_r         0.60%
> Geomean         -0.08%
>
> 503.bwaves_r    -0.50%
> 507.cactuBSSN_r  0.05%
> 508.namd_r      -0.02%
> 510.parest_r     0.09%
> 511.povray_r    -1.35%
> 519.lbm_r        0.00%
> 521.wrf_r       -0.03%
> 526.blender_r   -0.83%
> 527.cam4_r       1.23%
> 538.imagick_r    0.97%
> 544.nab_r       -0.02%
> 549.fotonik3d_r -0.12%
> 554.roms_r       0.55%
> Geomean          0.00%
>
> Significant impacts on eembc benchmarks:
>
> eembc/nnet_test      9.90%
> eembc/mp2decoddata2  16.42%
> eembc/textv2data3   -4.86%
> eembc/qos            12.90%
>
> 3. On Znver3 processor,
>
> Performance impacts on SPEC CPU 2017:
>
> 500.perlbench_r -0.96%
> 502.gcc_r       -1.06%
> 505.mcf_r       -0.01%
> 520.omnetpp_r   -1.45%
> 523.xalancbmk_r  2.89%
> 525.x264_r       4.98%
> 531.deepsjeng_r  0.18%
> 541.leela_r     -1.54%
> 548.exchange2_r -1.25%
> 557.xz_r        -0.01%
> Geomean          0.16%
>
> 503.bwaves_r     0.04%
> 507.cactuBSSN_r  0.85%
> 508.namd_r      -0.13%
> 510.parest_r     0.39%
> 511.povray_r     0.00%
> 519.lbm_r        0.00%
> 521.wrf_r        0.28%
> 526.blender_r   -0.10%
> 527.cam4_r      -0.58%
> 538.imagick_r    0.69%
> 544.nab_r       -0.04%
> 549.fotonik3d_r -0.04%
> 554.roms_r       0.40%
> Geomean          0.15%
>
> Significant impacts on eembc benchmarks:
>
> eembc/aifftr01       13.95%
> eembc/idctrn01       8.41%
> eembc/nnet_test      30.25%
> eembc/mp2decoddata2  5.05%
> eembc/textv2data3    6.43%
> eembc/qos           -5.79%
>
> gcc/
>
>         * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
>         (generic_memset): Likewise.
>         (generic_cost): Change CLEAR_RATIO to 17.
>         * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
>         Add m_GENERIC.
>
> gcc/testsuite/
>
>         * gcc.target/i386/memcpy-strategy-12.c: New test.
>         * gcc.target/i386/memcpy-strategy-13.c: Likewise.
>         * gcc.target/i386/memset-strategy-10.c: Likewise.
>         * gcc.target/i386/memset-strategy-11.c: Likewise.
>         * gcc.target/i386/shrink_wrap_1.c: Also pass
>         -mmemset-strategy=rep_8byte:-1:align.
>         * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> ---
>  gcc/config/i386/x86-tune-costs.h              | 31 ++++++++++++-------
>  gcc/config/i386/x86-tune.def                  |  2 +-
>  .../gcc.target/i386/memcpy-strategy-12.c      |  9 ++++++
>  .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++++
>  .../gcc.target/i386/memset-strategy-10.c      | 11 +++++++
>  .../gcc.target/i386/memset-strategy-11.c      |  9 ++++++
>  gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
>  gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
>  8 files changed, 63 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index ffe810f2bcb..30e7c3e4261 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = {
>    "16",                                        /* Func alignment.  */
>  };
>
> -/* Generic should produce code tuned for Core-i7 (and newer chips)
> -   and btver1 (and newer chips).  */
> +/* Generic should produce code tuned for Haswell (and newer chips)
> +   and znver1 (and newer chips).  NB: rep_prefix_1_byte is used only
> +   for known size.  */
>
>  static stringop_algs generic_memcpy[2] = {
> -  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
> -             {-1, libcall, false}}},
> -  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
> -             {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
>  static stringop_algs generic_memset[2] = {
> -  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
> -             {-1, libcall, false}}},
> -  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
> -             {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
>  static const
>  struct processor_costs generic_cost = {
>    {
> @@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = {
>    COSTS_N_INSNS (1),                   /* cost of movzx */
>    8,                                   /* "large" insn */
>    17,                                  /* MOVE_RATIO */
> -  6,                                   /* CLEAR_RATIO */
> +  17,                                  /* CLEAR_RATIO */
>    {6, 6, 6},                           /* cost of loading integer registers
>                                            in QImode, HImode and SImode.
>                                            Relative to reg-reg move (2).  */
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index eb057a67750..fd9c011a3f5 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
>     move/set sequences of bytes with known size.  */
>  DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
>           "prefer_known_rep_movsb_stosb",
> -         m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
> +         m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC)
>
>  /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
>     compact prologues and epilogues by issuing a misaligned moves.  This
> diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
> new file mode 100644
> index 00000000000..87f03352736
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic" } */
> +/* { dg-final { scan-assembler "rep movsb" } } */
> +
> +void
> +foo (char *dest, char *src)
> +{
> +  __builtin_memcpy (dest, src, 249);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
> new file mode 100644
> index 00000000000..cfc3cfba623
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic" } */
> +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep movsb" } } */
> +
> +void
> +foo (char *dest, char *src)
> +{
> +  __builtin_memcpy (dest, src, 257);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
> new file mode 100644
> index 00000000000..ade5e8da42c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic" } */
> +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep stosb" } } */
> +
> +void
> +foo (char *dest)
> +{
> +  __builtin_memset (dest, 0, 257);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
> new file mode 100644
> index 00000000000..d1b86152474
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=generic" } */
> +/* { dg-final { scan-assembler "rep stosb" } } */
> +
> +void
> +foo (char *dest)
> +{
> +  __builtin_memset (dest, 0, 253);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
> index 94dadd6cdbd..44fe7d2836e 100644
> --- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
> +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target { ! ia32 } } } */
> -/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
> +/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */
>
>  enum machine_mode
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
> index aec095eda62..f61621e42bf 100644
> --- a/gcc/testsuite/gcc.target/i386/sw-1.c
> +++ b/gcc/testsuite/gcc.target/i386/sw-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
> +/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
>  /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
>
>  #include <string.h>
> --
> 2.30.2
>


* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-22 13:29   ` Richard Biener
@ 2021-03-22 13:38     ` H.J. Lu
  2021-03-23  2:41       ` Hongyu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 13:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jan Hubicka, Hongtao Liu, Hongyu Wang

On Mon, Mar 22, 2021 at 6:29 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Simplify memcpy and memset inline strategies to avoid branches for
> > -mtune=generic:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >    load and store for up to 16 * 16 (256) bytes when the data size is
> >    fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> >       is a constant.
> >    b. Use loop if data size is not a constant.
> > 3. Use memcpy/memset libray function if data size is unknown or > 256.
> >
> > With -mtune=generic -O2,
>
> Is there any visible code-size effect of increasing CLEAR_RATIO on

Hongyue, please collect code size differences on SPEC CPU 2017 and
eembc.

> SPEC/eembc?  Did you play with other values of MOVE/CLEAR_RATIO?
> 17 memory-to-memory/memory-clear insns looks quite a lot.
>

Yes, we did.  256 bytes is the threshold above which the memcpy/memset
library functions in libc win.  Below 256 bytes, an inline sequence of up
to 16 by_pieces moves/stores is faster.

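To make that concrete, a minimal sketch (assuming -O2 -mtune=generic, as
in the new tests):

void
copy_128 (char *dest, char *src)
{
  /* Constant size <= 256: expanded inline, either as by_pieces
     moves/stores or as rep movsb.  */
  __builtin_memcpy (dest, src, 128);
}

void
copy_257 (char *dest, char *src)
{
  /* Constant size > 256: emitted as a call (or tail jump) to memcpy.  */
  __builtin_memcpy (dest, src, 257);
}
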
-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
@ 2021-03-22 14:10   ` Jan Hubicka
  2021-03-22 23:57     ` [PATCH v2 " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-22 14:10 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gcc-patches, Hongtao Liu, Hongyu Wang

> 
> gcc/
> 
> 	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> 	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> 	to SImode.
> 	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> 	"rep movsb/stosb" only for known sizes.
> 	* config/i386/i386-options.c (processor_cost_table): Use Ice
> 	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> 	Rapids and Alder Lake.
> 	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> 	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
> 	(icelake_memset): Likewise.
> 	(icelake_cost): Likewise.
> 	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> 	New.

It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
beneficial and independent of the rest of the changes.  I think we will
need to discuss the move ratio and the code size/uop cache pollution
issues a bit more - one option would be to use the increased limits for
-O3 only.

Can you break this out into an independent patch?  I also wonder whether
it would not be more readable to special-case this right at the beginning
of decide_alg.
> @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
>    const struct processor_costs *cost;
>    int i;
>    bool any_alg_usable_p = false;
> +  bool known_size_p = expected_size != -1;

expected_size is not -1 if we have profile feedback and detected the
average block size from the histogram.  It seems to me from the
description that you want the size to be an actual compile-time constant,
which would be min_size == max_size, I guess.

Honza

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 14:10   ` Jan Hubicka
@ 2021-03-22 23:57     ` H.J. Lu
  2021-03-29 13:43       ` H.J. Lu
                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: H.J. Lu @ 2021-03-22 23:57 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang

[-- Attachment #1: Type: text/plain, Size: 2292 bytes --]

On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> >
> > gcc/
> >
> >       * config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> >       For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> >       to SImode.
> >       (decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> >       "rep movsb/stosb" only for known sizes.
> >       * config/i386/i386-options.c (processor_cost_table): Use Ice
> >       Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> >       Rapids and Alder Lake.
> >       * config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> >       * config/i386/x86-tune-costs.h (icelake_memcpy): New.
> >       (icelake_memset): Likewise.
> >       (icelake_cost): Likewise.
> >       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> >       New.
>
> It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> benefical and independent of the rest of changes.  I think we will need
> to discuss bit more the move ratio and the code size/uop cache polution
> issues - one option would be to use increased limits for -O3 only.

My change only increases CLEAR_RATIO, not MOVE_RATIO.   We are
checking code size impacts on SPEC CPU 2017 and eembc.

> Can you break this out to independent patch?  I also wonder if it owuld

X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
only when the memcpy/memset costs and MOVE_RATIO are updated at the same
time, like:

https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html

Making it standalone means moving it from the Ice Lake patch to the Skylake patch.

> not be more readable to special case this just on the beggining of
> decide_alg.
> > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> >    const struct processor_costs *cost;
> >    int i;
> >    bool any_alg_usable_p = false;
> > +  bool known_size_p = expected_size != -1;
>
> expected_size is not -1 if we have profile feedback and we detected from
> histogram average size of a block.  It seems to me that from description
> that you want the const to be actual compile time constant that would be
> min_size == max_size I guess.
>

You are right.  Here is the v2 patch, which uses a min_size != max_size
check for unknown sizes.

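In outline, the check amounts to this (a sketch only, not the attached
patch, assuming the min_size/max_size arguments that decide_alg already
receives):

/* Sketch: the size is "known" exactly when the value range collapses
   to a single compile-time value.  */
static bool
known_size_p (unsigned HOST_WIDE_INT min_size,
              unsigned HOST_WIDE_INT max_size)
{
  return min_size == max_size;
}
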
Thanks.

--
H.J.

[-- Attachment #2: v2-0001-x86-Update-memcpy-memset-inline-strategies-for-Ic.patch --]
[-- Type: application/x-patch, Size: 18453 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-22 13:38     ` H.J. Lu
@ 2021-03-23  2:41       ` Hongyu Wang
  2021-03-23  8:19         ` Richard Biener
  0 siblings, 1 reply; 31+ messages in thread
From: Hongyu Wang @ 2021-03-23  2:41 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Biener, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

> Hongyue, please collect code size differences on SPEC CPU 2017 and
> eembc.

Here is code size difference for this patch

SPEC CPU 2017
                                   difference             w patch      w/o patch
500.perlbench_r              0.051%             1622637          1621805
502.gcc_r                         0.039%             6930877          6928141
505.mcf_r                         0.098%             16413              16397
520.omnetpp_r               0.083%             1327757          1326653
523.xalancbmk_r            0.001%             3575709          3575677
525.x264_r                       -0.067%           769095            769607
531.deepsjeng_r             0.071%             67629              67581
541.leela_r                       -3.062%           127629            131661
548.exchange2_r            -0.338%            66141              66365
557.xz_r                            0.946%            128061            126861

503.bwaves_r                  0.534%             33117              32941
507.cactuBSSN_r            0.004%             2993645          2993517
508.namd_r                     0.006%             851677            851629
510.parest_r                    0.488%             6741277          6708557
511.povray_r                   -0.021%           849290            849466
521.wrf_r                         0.022%             29682154       29675530
526.blender_r                  0.054%             7544057          7540009
527.cam4_r                      0.043%             6102234          6099594
538.imagick_r                  -0.015%           1625770          1626010
544.nab_r                         0.155%             155453            155213
549.fotonik3d_r              0.000%             351757            351757
554.roms_r                      0.041%             735837            735533

eembc
                                    difference        w patch      w/o patch
aifftr01                              0.762%             14813            14701
aiifft01                              0.556%             14477            14397
idctrn01                            0.101%             15853            15837
cjpeg-rose7-preset         0.114%             56125              56061
nnet_test                         -0.848%           35549              35853
aes                                   0.125%             38493            38445
cjpegv2data                     0.108%             59213              59149
djpegv2data                     0.025%             63821              63805
huffde                               -0.104%           30621              30653
mp2decoddata                -0.047%           68285              68317
mp2enf32data1              0.018%             86925              86909
mp2enf32data2              0.018%             89357              89341
mp2enf32data3              0.018%             88253              88237
mp3playerfixeddata       0.103%             46877              46829
ip_pktcheckb1m              0.191%             25213              25165
nat                                   0.527%             45757             45517
ospfv2                               0.196%             24573             24525
routelookup                     0.189%             25389              25341
tcpbulk                            0.155%             30925              30877
textv2data                        0.055%             29101              29085

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Mon, Mar 22, 2021 at 9:39 PM:
>
> On Mon, Mar 22, 2021 at 6:29 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Simply memcpy and memset inline strategies to avoid branches for
> > > -mtune=generic:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> > >       is a constant.
> > >    b. Use loop if data size is not a constant.
> > > 3. Use memcpy/memset libray function if data size is unknown or > 256.
> > >
> > > With -mtune=generic -O2,
> >
> > Is there any visible code-size effect of increasing CLEAR_RATIO on
>
> Hongyue, please collect code size differences on SPEC CPU 2017 and
> eembc.
>
> > SPEC/eembc?  Did you play with other values of MOVE/CLEAR_RATIO?
> > 17 memory-to-memory/memory-clear insns looks quite a lot.
> >
>
> Yes, we did.  256 bytes is the threshold above which memcpy/memset in libc
> win. Below 256 bytes, 16 by_pieces move/store is faster.
>
> --
> H.J.

-- 
Regards,

Hongyu, Wang

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-23  2:41       ` Hongyu Wang
@ 2021-03-23  8:19         ` Richard Biener
  2021-08-22 15:28           ` PING [PATCH] " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-03-23  8:19 UTC (permalink / raw)
  To: Hongyu Wang; +Cc: H.J. Lu, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > eembc.
>
> Here is code size difference for this patch

Thanks, nothing too bad although slightly larger impacts than envisioned.

> SPEC CPU 2017
>                                    difference             w patch      w/o patch
> 500.perlbench_r              0.051%             1622637          1621805
> 502.gcc_r                         0.039%             6930877          6928141
> 505.mcf_r                         0.098%             16413              16397
> 520.omnetpp_r               0.083%             1327757          1326653
> 523.xalancbmk_r            0.001%             3575709          3575677
> 525.x264_r                       -0.067%           769095            769607
> 531.deepsjeng_r             0.071%             67629              67581
> 541.leela_r                       -3.062%           127629            131661
> 548.exchange2_r            -0.338%            66141              66365
> 557.xz_r                            0.946%            128061            126861
>
> 503.bwaves_r                  0.534%             33117              32941
> 507.cactuBSSN_r            0.004%             2993645          2993517
> 508.namd_r                     0.006%             851677            851629
> 510.parest_r                    0.488%             6741277          6708557
> 511.povray_r                   -0.021%           849290            849466
> 521.wrf_r                         0.022%             29682154       29675530
> 526.blender_r                  0.054%             7544057          7540009
> 527.cam4_r                      0.043%             6102234          6099594
> 538.imagick_r                  -0.015%           1625770          1626010
> 544.nab_r                         0.155%             155453            155213
> 549.fotonik3d_r              0.000%             351757            351757
> 554.roms_r                      0.041%             735837            735533
>
> eembc
>                                     difference        w patch      w/o patch
> aifftr01                              0.762%             14813            14701
> aiifft01                              0.556%             14477            14397
> idctrn01                            0.101%             15853            15837
> cjpeg-rose7-preset         0.114%             56125              56061
> nnet_test                         -0.848%           35549              35853
> aes                                   0.125%             38493            38445
> cjpegv2data                     0.108%             59213              59149
> djpegv2data                     0.025%             63821              63805
> huffde                               -0.104%           30621              30653
> mp2decoddata                -0.047%           68285              68317
> mp2enf32data1              0.018%             86925              86909
> mp2enf32data2              0.018%             89357              89341
> mp2enf32data3              0.018%             88253              88237
> mp3playerfixeddata       0.103%             46877              46829
> ip_pktcheckb1m              0.191%             25213              25165
> nat                                   0.527%             45757             45517
> ospfv2                               0.196%             24573             24525
> routelookup                     0.189%             25389              25341
> tcpbulk                            0.155%             30925              30877
> textv2data                        0.055%             29101              29085
>
> H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Mon, Mar 22, 2021 at 9:39 PM:
> >
> > On Mon, Mar 22, 2021 at 6:29 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Mon, Mar 22, 2021 at 2:19 PM H.J. Lu via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > Simply memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
> > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > > >    fixed and known.
> > > > 2. Inline only if data size is known to be <= 256.
> > > >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> > > >       is a constant.
> > > >    b. Use loop if data size is not a constant.
> > > > 3. Use memcpy/memset libray function if data size is unknown or > 256.
> > > >
> > > > With -mtune=generic -O2,
> > >
> > > Is there any visible code-size effect of increasing CLEAR_RATIO on
> >
> > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > eembc.
> >
> > > SPEC/eembc?  Did you play with other values of MOVE/CLEAR_RATIO?
> > > 17 memory-to-memory/memory-clear insns looks quite a lot.
> > >
> >
> > Yes, we did.  256 bytes is the threshold above which memcpy/memset in libc
> > win. Below 256 bytes, 16 by_pieces move/store is faster.
> >
> > --
> > H.J.
>
> --
> Regards,
>
> Hongyu, Wang

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 23:57     ` [PATCH v2 " H.J. Lu
@ 2021-03-29 13:43       ` H.J. Lu
  2021-03-31  6:59       ` Richard Biener
  2021-03-31  8:05       ` Jan Hubicka
  2 siblings, 0 replies; 31+ messages in thread
From: H.J. Lu @ 2021-03-29 13:43 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang

On Mon, Mar 22, 2021 at 4:57 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > >
> > > gcc/
> > >
> > >       * config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> > >       For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> > >       to SImode.
> > >       (decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> > >       "rep movsb/stosb" only for known sizes.
> > >       * config/i386/i386-options.c (processor_cost_table): Use Ice
> > >       Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> > >       Rapids and Alder Lake.
> > >       * config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> > >       * config/i386/x86-tune-costs.h (icelake_memcpy): New.
> > >       (icelake_memset): Likewise.
> > >       (icelake_cost): Likewise.
> > >       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > >       New.
> >
> > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> > benefical and independent of the rest of changes.  I think we will need
> > to discuss bit more the move ratio and the code size/uop cache polution
> > issues - one option would be to use increased limits for -O3 only.
>
> My change only increases CLEAR_RATIO, not MOVE_RATIO.   We are
> checking code size impacts on SPEC CPU 2017 and eembc.
>
> > Can you break this out to independent patch?  I also wonder if it owuld
>
> X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
> only when memcpy/memset costs and MOVE_RATIO are updated the same time,
> like:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
>
> Make it a standalone means moving from Ice Lake patch to Skylake patch.
>
> > not be more readable to special case this just on the beggining of
> > decide_alg.
> > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> > >    const struct processor_costs *cost;
> > >    int i;
> > >    bool any_alg_usable_p = false;
> > > +  bool known_size_p = expected_size != -1;
> >
> > expected_size is not -1 if we have profile feedback and we detected from
> > histogram average size of a block.  It seems to me that from description
> > that you want the const to be actual compile time constant that would be
> > min_size == max_size I guess.
> >
>
> You are right.  Here is the v2 patch with min_size != max_size check for
> unknown size.
>

Hi Honza,

This patch only impacts Ice Lake.   Do you have any comments for the v2
patch?

Thanks.

--
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 23:57     ` [PATCH v2 " H.J. Lu
  2021-03-29 13:43       ` H.J. Lu
@ 2021-03-31  6:59       ` Richard Biener
  2021-03-31  8:05       ` Jan Hubicka
  2 siblings, 0 replies; 31+ messages in thread
From: Richard Biener @ 2021-03-31  6:59 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Tue, Mar 23, 2021 at 12:59 AM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Mon, Mar 22, 2021 at 7:10 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > >
> > > gcc/
> > >
> > >       * config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
> > >       For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
> > >       to SImode.
> > >       (decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
> > >       "rep movsb/stosb" only for known sizes.
> > >       * config/i386/i386-options.c (processor_cost_table): Use Ice
> > >       Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
> > >       Rapids and Alder Lake.
> > >       * config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
> > >       * config/i386/x86-tune-costs.h (icelake_memcpy): New.
> > >       (icelake_memset): Likewise.
> > >       (icelake_cost): Likewise.
> > >       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > >       New.
> >
> > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> > benefical and independent of the rest of changes.  I think we will need
> > to discuss bit more the move ratio and the code size/uop cache polution
> > issues - one option would be to use increased limits for -O3 only.
>
> My change only increases CLEAR_RATIO, not MOVE_RATIO.   We are
> checking code size impacts on SPEC CPU 2017 and eembc.
>
> > Can you break this out to independent patch?  I also wonder if it owuld
>
> X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
> only when memcpy/memset costs and MOVE_RATIO are updated the same time,
> like:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
>
> Make it a standalone means moving from Ice Lake patch to Skylake patch.
>
> > not be more readable to special case this just on the beggining of
> > decide_alg.
> > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> > >    const struct processor_costs *cost;
> > >    int i;
> > >    bool any_alg_usable_p = false;
> > > +  bool known_size_p = expected_size != -1;
> >
> > expected_size is not -1 if we have profile feedback and we detected from
> > histogram average size of a block.  It seems to me that from description
> > that you want the const to be actual compile time constant that would be
> > min_size == max_size I guess.
> >
>
> You are right.  Here is the v2 patch with min_size != max_size check for
> unknown size.

OK.

Thanks,
Richard.

> Thanks.
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-22 23:57     ` [PATCH v2 " H.J. Lu
  2021-03-29 13:43       ` H.J. Lu
  2021-03-31  6:59       ` Richard Biener
@ 2021-03-31  8:05       ` Jan Hubicka
  2021-03-31 13:09         ` H.J. Lu
  2 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31  8:05 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> > benefical and independent of the rest of changes.  I think we will need
> > to discuss bit more the move ratio and the code size/uop cache polution
> > issues - one option would be to use increased limits for -O3 only.
> 
> My change only increases CLEAR_RATIO, not MOVE_RATIO.   We are
> checking code size impacts on SPEC CPU 2017 and eembc.
> 
> > Can you break this out to independent patch?  I also wonder if it owuld
> 
> X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
> only when memcpy/memset costs and MOVE_RATIO are updated the same time,
> like:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
> 
> Make it a standalone means moving from Ice Lake patch to Skylake patch.
> 
> > not be more readable to special case this just on the beggining of
> > decide_alg.
> > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> > >    const struct processor_costs *cost;
> > >    int i;
> > >    bool any_alg_usable_p = false;
> > > +  bool known_size_p = expected_size != -1;
> >
> > expected_size is not -1 if we have profile feedback and we detected from
> > histogram average size of a block.  It seems to me that from description
> > that you want the const to be actual compile time constant that would be
> > min_size == max_size I guess.
> >
> 
> You are right.  Here is the v2 patch with min_size != max_size check for
> unknown size.

Patch is OK now.  I was wondering about using avx256 for moves of known
size (per the comment on MOVE_MAX_PIECES there is an issue with
MAX_FIXED_MODE_SIZE, but that seems not hard to fix).  Did you look into
it?

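For instance, just to illustrate what I mean (not tied to any particular
patch):

void
copy_64 (char *d, const char *s)
{
  /* A fixed 64-byte copy: with 32-byte by-pieces moves this would be two
     YMM load/store pairs instead of four XMM pairs.  */
  __builtin_memcpy (d, s, 64);
}
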
Honza
> 
> Thanks.
> 
> --
> H.J.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31  8:05       ` Jan Hubicka
@ 2021-03-31 13:09         ` H.J. Lu
  2021-03-31 13:40           ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-31 13:09 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

On Wed, Mar 31, 2021 at 1:05 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > It looks like X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is quite obviously
> > > benefical and independent of the rest of changes.  I think we will need
> > > to discuss bit more the move ratio and the code size/uop cache polution
> > > issues - one option would be to use increased limits for -O3 only.
> >
> > My change only increases CLEAR_RATIO, not MOVE_RATIO.   We are
> > checking code size impacts on SPEC CPU 2017 and eembc.
> >
> > > Can you break this out to independent patch?  I also wonder if it owuld
> >
> > X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB improves performance
> > only when memcpy/memset costs and MOVE_RATIO are updated the same time,
> > like:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
> >
> > Make it a standalone means moving from Ice Lake patch to Skylake patch.
> >
> > > not be more readable to special case this just on the beggining of
> > > decide_alg.
> > > > @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size,
> > > >    const struct processor_costs *cost;
> > > >    int i;
> > > >    bool any_alg_usable_p = false;
> > > > +  bool known_size_p = expected_size != -1;
> > >
> > > expected_size is not -1 if we have profile feedback and we detected from
> > > histogram average size of a block.  It seems to me that from description
> > > that you want the const to be actual compile time constant that would be
> > > min_size == max_size I guess.
> > >
> >
> > You are right.  Here is the v2 patch with min_size != max_size check for
> > unknown size.
>
> Patch is OK now.  I was wondering about using avx256 for moves of known

Done.   X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.   Can
you take a look at the patch for Skylake:

https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html

> size (per comment on MOVE_MAX_PIECES there is issue with
> MAX_FIXED_MODE_SIZE, but that seems not hard to fix). Did you look into
> it?

It requires some changes in the middle-end.   See
users/hjl/pieces/master branch:

https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/master

I am rebasing it.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:09         ` H.J. Lu
@ 2021-03-31 13:40           ` Jan Hubicka
  2021-03-31 13:47             ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 13:40 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> >
> > Patch is OK now.  I was wondering about using avx256 for moves of known
> 
> Done.   X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.   Can
> you take a look at the patch for Skylake:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html

I was wondering: if the CPU prefers rep movsb when rcx is a compile-time
constant, it probably does some logic at decode time (i.e. expands it
into some sequence), and if so, it may require the code setting the
register to be near the rep (via fusing or a similar mechanism).

Perhaps we want to have a fusing pattern for this, so we do not move them
far apart?
> 
> > size (per comment on MOVE_MAX_PIECES there is issue with
> > MAX_FIXED_MODE_SIZE, but that seems not hard to fix). Did you look into
> > it?
> 
> It requires some changes in the middle-end.   See

yep, I know - tried that too for zen3 tuning :)
> users/hjl/pieces/master branch:
> 
> https://gitlab.com/x86-gcc/gcc/-/tree/users/hjl/pieces/master
> 
> I am rebasing it.

Thanks, it would also help to reduce the code size bloat from bumping up
the move-by-pieces limits.  Clang is using those.

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:40           ` Jan Hubicka
@ 2021-03-31 13:47             ` Jan Hubicka
  2021-03-31 15:41               ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 13:47 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > >
> > > Patch is OK now.  I was wondering about using avx256 for moves of known
> > 
> > Done.   X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.   Can
> > you take a look at the patch for Skylake:
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
> 
> I was wondering, if CPU preffers rep movsb when rcx is a compile time
> constant, it probably does some logic at the decode time (i.e. expands
> it into some sequence) and if so, then it may require the code setting
> the register to be near rep (via fusing or simlar mechanism)
> 
> Perhaps we want to have fusing pattern for this, so we do not move them
> far apart?

Reading through the optimization manual, it seems that movsb is fast for
small blocks no matter whether the size is hard-wired.  In that case you
probably want to check whether max_size or expected_size is known to be
small, rather than max_size == min_size and both being small.

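Something along these lines, reusing the arguments decide_alg already
gets (sketch only, names are illustrative):

/* "Small" whenever the upper bound, or the profiled expectation, stays
   below the threshold; min_size == max_size is not required.  */
static bool
small_block_p (unsigned HOST_WIDE_INT max_size,
               HOST_WIDE_INT expected_size,
               unsigned HOST_WIDE_INT threshold)
{
  return (max_size <= threshold
          || (expected_size != -1
              && (unsigned HOST_WIDE_INT) expected_size <= threshold));
}
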
But it depends on what CPU really does.
Honza

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 13:47             ` Jan Hubicka
@ 2021-03-31 15:41               ` H.J. Lu
  2021-03-31 17:43                 ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-31 15:41 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

On Wed, Mar 31, 2021 at 6:47 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > >
> > > > Patch is OK now.  I was wondering about using avx256 for moves of known
> > >
> > > Done.   X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB is in now.   Can
> > > you take a look at the patch for Skylake:
> > >
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567096.html
> >
> > I was wondering, if CPU preffers rep movsb when rcx is a compile time
> > constant, it probably does some logic at the decode time (i.e. expands
> > it into some sequence) and if so, then it may require the code setting
> > the register to be near rep (via fusing or simlar mechanism)
> >
> > Perhaps we want to have fusing pattern for this, so we do not move them
> > far apart?
>
> Reading through the optimization manual it seems that mosvb is fast for
> small block no matter if the size is hard wired. In that case you
> probably want to check whetehr max_size or expected_size is known to be
> small rather than max_size == min_size and both being small.
>
> But it depends on what CPU really does.
> Honza

For small data sizes, rep movsb is faster only under certain conditions.  We
can continue fine-tuning rep movsb.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 15:41               ` H.J. Lu
@ 2021-03-31 17:43                 ` Jan Hubicka
  2021-03-31 17:54                   ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-03-31 17:43 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

> > Reading through the optimization manual it seems that mosvb is fast for
> > small block no matter if the size is hard wired. In that case you
> > probably want to check whetehr max_size or expected_size is known to be
> > small rather than max_size == min_size and both being small.
> >
> > But it depends on what CPU really does.
> > Honza
> 
> For small data size, rep movsb is faster only under certain conditions.   We
> can continue fine tuning rep movsb.

OK, I however wonder why you need the condition maxsize=minsize.
 - If the CPU is looking for movl $cst, %rcx then we probably want to be
   sure that it is not moved away from rep movsb, by adding a fused
   pattern.
 - If rep movsb is slower than a loop for very small blocks, then you
   want to set a lower bound on minsize & expected size, but you do not
   need to require maxsize=minsize.
 - If rep movsb is slower than a sequence of moves for small blocks, then
   one needs to tweak move by pieces.
 - If rep movsb is slower for larger blocks, then you want to test
   maxsize and expected size.
So in none of those scenarios does testing maxsize=minsize alone make
much sense to me...  What was the original motivation for differentiating
the precisely-known-size case?

I am mostly curious because it is not that uncommon to have a small
maxsize, since we are able to track the object size, and using a short
sequence for those would be nice.

Having a non-trivial minsize may not be that uncommon these days either,
given that we track value ranges (and under the assumption that the
memcpy/memset expanders were updated to take these into account).

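A small example of the kind of case I mean (illustrative only): the
length is not a compile-time constant, yet the value range still gives a
small maxsize and a non-trivial minsize:

void
clear_tail (char *buf, unsigned char n)
{
  if (n < 16)
    return;
  /* n is known to be in [16, 255] here: maxsize is small even though
     minsize != maxsize.  */
  __builtin_memset (buf, 0, n);
}
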
Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 17:43                 ` Jan Hubicka
@ 2021-03-31 17:54                   ` H.J. Lu
  2021-04-01  5:57                     ` Hongyu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-03-31 17:54 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Hongtao Liu, GCC Patches, Hongyu Wang

On Wed, Mar 31, 2021 at 10:43 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Reading through the optimization manual it seems that mosvb is fast for
> > > small block no matter if the size is hard wired. In that case you
> > > probably want to check whetehr max_size or expected_size is known to be
> > > small rather than max_size == min_size and both being small.
> > >
> > > But it depends on what CPU really does.
> > > Honza
> >
> > For small data size, rep movsb is faster only under certain conditions.   We
> > can continue fine tuning rep movsb.
>
> OK, I however wonder why you need condtion maxsize=minsize.
>  - If CPU is looking for movl $cst, %rcx than we probably want to be
>    sure that it is not moved away fro rep ;movsb by adding fused pattern
>  - If rep movsb is slower than loop for very small blocks then you want
>    to set lower bound on minsize & expected size, but you do not need
>    to require maxsize=minsize
>  - If rep movsb is slower than sequence of moves for small blocks then
>    one needs to tweak move by pieces
>  - If rep movsb is slower for larger blocks than you want to test
>    maxsize and expected size
> So in neither of those scenarios testing maxsize=minsize alone makes too
> much sense to me... What was the original motivation for differentiating
> between precisely known size?
>
> I am mostly curious because it is not that uncomon to have small maxsize
> because we are able to track the object size and using short sequence
> for those would be nice.
>
> Having minsize non-trivial may not be that uncommon these days either
> given that we track value ranges (and under assumption that
> memcpy/memset expanders was updated to take these into account).
>

Hongyu has done some analysis on this.  Hongyu, can you share what
you got?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] x86: Update memcpy/memset inline strategies for Ice Lake
  2021-03-31 17:54                   ` H.J. Lu
@ 2021-04-01  5:57                     ` Hongyu Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Hongyu Wang @ 2021-04-01  5:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

> > So in neither of those scenarios testing maxsize=minsize alone makes too
> > much sense to me... What was the original motivation for differentiating
> > between precisely known size?

There is a case that results in a small maxsize: https://godbolt.org/z/489Tf7ssj

typedef unsigned char e_u8;
#define MAXBC 8
void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
{
  e_u8 b[4][MAXBC];
  int i, j;

  for(i = 0; i < 4; i++)
    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
}

Here BC is an unsigned char, so maxsize will be 256.

If we set stringop_alg to rep_1_byte, the generated code looks like this:

 movzbl  %sil, %r8d
 movq    %rdi, %rdx
 leaq    -40(%rsp), %rax
 movq    %r8, %r9
 leaq    -8(%rsp), %r10

 testb   %r9b, %r9b
 je      .L5
 movq    %rdx, %rdi
 movq    %rax, %rsi
 movq    %r8, %rcx
 rep movsb

 addq    $8, %rax
 addq    $8, %rdx
 cmpq    %r10, %rax
 jne     .L2
 ret

In our test we found this is much slower than current trunk, because
rep movsb triggers machine-clear events, while on current trunk such
small sizes are handled in the loop move epilogue and rep movsq is
never executed.

So here we disabled inlining for unknown sizes to avoid potential issues like this.

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Thu, Apr 1, 2021 at 1:55 AM:
>
> On Wed, Mar 31, 2021 at 10:43 AM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > > > Reading through the optimization manual it seems that mosvb is fast for
> > > > small block no matter if the size is hard wired. In that case you
> > > > probably want to check whetehr max_size or expected_size is known to be
> > > > small rather than max_size == min_size and both being small.
> > > >
> > > > But it depends on what CPU really does.
> > > > Honza
> > >
> > > For small data size, rep movsb is faster only under certain conditions.   We
> > > can continue fine tuning rep movsb.
> >
> > OK, I however wonder why you need condtion maxsize=minsize.
> >  - If CPU is looking for movl $cst, %rcx than we probably want to be
> >    sure that it is not moved away fro rep ;movsb by adding fused pattern
> >  - If rep movsb is slower than loop for very small blocks then you want
> >    to set lower bound on minsize & expected size, but you do not need
> >    to require maxsize=minsize
> >  - If rep movsb is slower than sequence of moves for small blocks then
> >    one needs to tweak move by pieces
> >  - If rep movsb is slower for larger blocks than you want to test
> >    maxsize and expected size
> > So in neither of those scenarios testing maxsize=minsize alone makes too
> > much sense to me... What was the original motivation for differentiating
> > between precisely known size?
> >
> > I am mostly curious because it is not that uncomon to have small maxsize
> > because we are able to track the object size and using short sequence
> > for those would be nice.
> >
> > Having minsize non-trivial may not be that uncommon these days either
> > given that we track value ranges (and under assumption that
> > memcpy/memset expanders was updated to take these into account).
> >
>
> Hongyu has done some analysis on this.  Hongyu, can you share what
> you got?
>
> Thanks.
>
> --
> H.J.

-- 
Regards,

Hongyu, Wang

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
@ 2021-04-05 13:45   ` H.J. Lu
  2021-04-05 21:14     ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-04-05 13:45 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jan Hubicka, Uros Bizjak, Hongtao Liu, Hongyu Wang

On Mon, Mar 22, 2021 at 6:16 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> Simply memcpy and memset inline strategies to avoid branches for
> Skylake family CPUs:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
> 3. Use memcpy/memset libray function if data size is unknown or > 256.
>
> On Cascadelake processor with -march=native -Ofast -flto,
>
> 1. Performance impacts of SPEC CPU 2017 rate are:
>
> 500.perlbench_r  0.17%
> 502.gcc_r       -0.36%
> 505.mcf_r        0.00%
> 520.omnetpp_r    0.08%
> 523.xalancbmk_r -0.62%
> 525.x264_r       1.04%
> 531.deepsjeng_r  0.11%
> 541.leela_r     -1.09%
> 548.exchange2_r -0.25%
> 557.xz_r         0.17%
> Geomean         -0.08%
>
> 503.bwaves_r     0.00%
> 507.cactuBSSN_r  0.69%
> 508.namd_r      -0.07%
> 510.parest_r     1.12%
> 511.povray_r     1.82%
> 519.lbm_r        0.00%
> 521.wrf_r       -1.32%
> 526.blender_r   -0.47%
> 527.cam4_r       0.23%
> 538.imagick_r   -1.72%
> 544.nab_r       -0.56%
> 549.fotonik3d_r  0.12%
> 554.roms_r       0.43%
> Geomean          0.02%
>
> 2. Significant impacts on eembc benchmarks are:
>
> eembc/idctrn01   9.23%
> eembc/nnet_test  29.26%
>
> gcc/
>
>         * config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
>         (skylake_memset): Likewise.
>         (skylake_cost): Change CLEAR_RATIO to 17.
>         * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
>         Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
>         m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and m_CORE_AVX512.
>
> gcc/testsuite/
>
>         * gcc.target/i386/memcpy-strategy-9.c: New test.
>         * gcc.target/i386/memcpy-strategy-10.c: Likewise.
>         * gcc.target/i386/memcpy-strategy-11.c: Likewise.
>         * gcc.target/i386/memset-strategy-7.c: Likewise.
>         * gcc.target/i386/memset-strategy-8.c: Likewise.
>         * gcc.target/i386/memset-strategy-9.c: Likewise.
> ---
>  gcc/config/i386/x86-tune-costs.h              | 27 ++++++++++++-------
>  gcc/config/i386/x86-tune.def                  |  3 +--
>  .../gcc.target/i386/memcpy-strategy-10.c      | 11 ++++++++
>  .../gcc.target/i386/memcpy-strategy-11.c      | 18 +++++++++++++
>  .../gcc.target/i386/memcpy-strategy-9.c       |  9 +++++++
>  .../gcc.target/i386/memset-strategy-7.c       | 11 ++++++++
>  .../gcc.target/i386/memset-strategy-8.c       |  9 +++++++
>  .../gcc.target/i386/memset-strategy-9.c       | 17 ++++++++++++
>  8 files changed, 93 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 0e00ff99df3..ffe810f2bcb 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = {
>
>  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
>  static stringop_algs skylake_memcpy[2] =   {
> -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> -             {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
>
>  static stringop_algs skylake_memset[2] = {
> -  {libcall, {{6, loop_1_byte, true},
> -             {24, loop, true},
> -             {8192, rep_prefix_4_byte, true},
> -             {-1, libcall, false}}},
> -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> -             {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +    {256, loop, false},
> +    {-1, libcall, false}}}};
>
>  static const
>  struct processor_costs skylake_cost = {
> @@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = {
>    COSTS_N_INSNS (0),                   /* cost of movzx */
>    8,                                   /* "large" insn */
>    17,                                  /* MOVE_RATIO */
> -  6,                                   /* CLEAR_RATIO */
> +  17,                                  /* CLEAR_RATIO */
>    {4, 4, 4},                           /* cost of loading integer registers
>                                            in QImode, HImode and SImode.
>                                            Relative to reg-reg move (2).  */
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 134916cc972..eb057a67750 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -273,8 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
>     move/set sequences of bytes with known size.  */
>  DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
>           "prefer_known_rep_movsb_stosb",
> -         m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE
> -         | m_ALDERLAKE | m_SAPPHIRERAPIDS)
> +         m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
>
>  /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
>     compact prologues and epilogues by issuing a misaligned moves.  This
> diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
> new file mode 100644
> index 00000000000..970aa741971
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mno-sse" } */
> +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep movsb" } } */
> +
> +void
> +foo (char *dest, char *src)
> +{
> +  __builtin_memcpy (dest, src, 257);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
> new file mode 100644
> index 00000000000..b6041944630
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake" } */
> +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep movsb" } } */
> +
> +typedef unsigned char e_u8;
> +
> +#define MAXBC 8
> +
> +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
> +{
> +  e_u8 b[4][MAXBC];
> +  int i, j;
> +
> +  for(i = 0; i < 4; i++)
> +    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
> new file mode 100644
> index 00000000000..b0dc7484d09
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mno-sse" } */
> +/* { dg-final { scan-assembler "rep movsb" } } */
> +
> +void
> +foo (char *dest, char *src)
> +{
> +  __builtin_memcpy (dest, src, 256);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-7.c b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
> new file mode 100644
> index 00000000000..07c2816910c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mno-sse" } */
> +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep stosb" } } */
> +
> +void
> +foo (char *dest)
> +{
> +  __builtin_memset (dest, 0, 257);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-8.c b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
> new file mode 100644
> index 00000000000..52ea882c814
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake -mno-sse" } */
> +/* { dg-final { scan-assembler "rep stosb" } } */
> +
> +void
> +foo (char *dest)
> +{
> +  __builtin_memset (dest, 0, 256);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-9.c b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
> new file mode 100644
> index 00000000000..d4db031958f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake" } */
> +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */
> +/* { dg-final { scan-assembler-not "rep stosb" } } */
> +
> +typedef unsigned char e_u8;
> +
> +#define MAXBC 8
> +
> +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
> +{
> +  int i, j;
> +
> +  for(i = 0; i < 4; i++)
> +    for(j = 0; j < BC; j++) a[i][j] = 1;
> +}
> --
> 2.30.2
>

If there are no objections, I will check it in on Wednesday.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-04-05 13:45   ` H.J. Lu
@ 2021-04-05 21:14     ` Jan Hubicka
  2021-04-05 21:53       ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-04-05 21:14 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang

> >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
> >  static stringop_algs skylake_memcpy[2] =   {
> > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > -             {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}}};
> >
> >  static stringop_algs skylake_memset[2] = {
> > -  {libcall, {{6, loop_1_byte, true},
> > -             {24, loop, true},
> > -             {8192, rep_prefix_4_byte, true},
> > -             {-1, libcall, false}}},
> > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > -             {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +    {256, loop, false},
> > +    {-1, libcall, false}}}};
> >
> 
> If there are no objections, I will check it in on Wednesday.

On my skylake notebook if I run the benchmarking script I get:

jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native
memcpy
  block size  libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   sse     noalg   byte    PGO     dynamic    BEST
     8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18    0:00.19 sse
      819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09    0:00.09 libcall
       81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06    0:00.06 libcall
       20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09    0:00.05 rep1noalign
        8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05    0:00.04 rep1noalign
        4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07    0:00.05 libcall
        2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07    0:00.04 libcall
        1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06    0:00.06 libcall
         512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08    0:00.06 libcall
         256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12    0:00.10 libcall
         128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17    0:00.15 libcall
          64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28    0:00.25 loop
          48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31    0:00.32 unrl
          32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40    0:00.40 unrl
          24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50    0:00.50 unrlnoalign
          16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91    0:00.77 unrlnoalign
          14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99    0:00.94 unrl
          12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10    0:01.02 unrl
          10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38    0:01.23 unrlnoalign
           8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55    0:01.38 unrl
So indeed rep byte seems to consistently outperform rep4/rep8; however,
the unrolled variant seems to be better than rep byte for small block
sizes.  Do you have some data showing that blocks of size 8...256 are
faster with rep1 than with the unrolled loop, perhaps on more real-world
benchmarks?

The difference seems to get quite big for small blocks in the 8...16
byte range.  I noticed that before and sort of concluded that it is
probably branch prediction doing relatively well for those small block
sizes.  On the other hand, winding up the relatively long unrolled loop
just to catch this case is not very nice.

Do you know which of the three changes (preferring rep movsb/stosb, the
CLEAR_RATIO change, and the algorithm choice changes) causes the two
speedups on eembc?

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-04-05 21:14     ` Jan Hubicka
@ 2021-04-05 21:53       ` H.J. Lu
  2021-04-06  9:09         ` Hongyu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-04-05 21:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Hongtao Liu, Hongyu Wang

On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
> > >  static stringop_algs skylake_memcpy[2] =   {
> > > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > > -             {-1, libcall, false}}}};
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +    {256, loop, false},
> > > +    {-1, libcall, false}}},
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +    {256, loop, false},
> > > +    {-1, libcall, false}}}};
> > >
> > >  static stringop_algs skylake_memset[2] = {
> > > -  {libcall, {{6, loop_1_byte, true},
> > > -             {24, loop, true},
> > > -             {8192, rep_prefix_4_byte, true},
> > > -             {-1, libcall, false}}},
> > > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > > -             {-1, libcall, false}}}};
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +    {256, loop, false},
> > > +    {-1, libcall, false}}},
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +    {256, loop, false},
> > > +    {-1, libcall, false}}}};
> > >
> >
> > If there are no objections, I will check it in on Wednesday.
>
> On my skylake notebook if I run the benchmarking script I get:
>
> jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native
> memcpy
>   block size  libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   sse     noalg   byte    PGO     dynamic    BEST
>      8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18    0:00.19 sse
>       819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09    0:00.09 libcall
>        81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06    0:00.06 libcall
>        20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09    0:00.05 rep1noalign
>         8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05    0:00.04 rep1noalign
>         4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07    0:00.05 libcall
>         2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07    0:00.04 libcall
>         1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06    0:00.06 libcall
>          512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08    0:00.06 libcall
>          256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12    0:00.10 libcall
>          128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17    0:00.15 libcall
>           64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28    0:00.25 loop
>           48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31    0:00.32 unrl
>           32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40    0:00.40 unrl
>           24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50    0:00.50 unrlnoalign
>           16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91    0:00.77 unrlnoalign
>           14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99    0:00.94 unrl
>           12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10    0:01.02 unrl
>           10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38    0:01.23 unrlnoalign
>            8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55    0:01.38 unrl
> So indeed rep byte seems to consistently outperform rep4/rep8; however,
> the unrolled variant seems to be better than rep byte for small block
> sizes.

My patch generates "rep movsb" only in very limited cases:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.

As a result,  "rep stosb" is generated only when 128 < data size < 256
with -mno-sse.

> Do you have some data showing blocks of size 8...256 to be faster with
> rep1 compared to the unrolled loop, perhaps for more real-world benchmarks?

"rep movsb" isn't generated with my patch in this case since
MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
XMM registers.
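
To make this concrete, here is a small example of my own (it is not from
the testsuite; the comments describe the expected expansion under the new
strategy, and the exact output of course depends on the options used):

  char dst[256], src[256];

  void
  copy_200 (void)
  {
    /* Constant size <= 256: with SSE enabled and MOVE_RATIO == 17 this
       is expanded with XMM loads and stores, so no "rep movsb".  With
       -mno-sse the constant-size path falls back to "rep movsb".  */
    __builtin_memcpy (dst, src, 200);
  }

  void
  clear_200 (void)
  {
    /* Likewise for memset: vector stores with SSE, "rep stosb" with
       -mno-sse since 128 < 200 < 256.  */
    __builtin_memset (dst, 0, 200);
  }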

> The difference seems to get quite big for small blocks in the range
> 8...16 bytes.  I noticed that before and sort of concluded that it is
> probably the branch prediction working relatively well for those small
> block sizes.  On the other hand, winding up the relatively long
> unrolled loop is not very cool just to catch this case.
>
> Do you know which of the three changes (preferring rep movsb/stosb,
> CLEAR_RATIO and algorithm choice changes) causes the two speedups
> on eembc?

Hongyu, can you find out where the speedup came from?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-04-05 21:53       ` H.J. Lu
@ 2021-04-06  9:09         ` Hongyu Wang
  2021-04-06  9:51           ` Jan Hubicka
  0 siblings, 1 reply; 31+ messages in thread
From: Hongyu Wang @ 2021-04-06  9:09 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

> Do you know which of the three changes (preferring rep movsb/stosb,
> CLEAR_RATIO and algorithm choice changes) causes the two speedups
> on eembc?

An extracted testcase from nnet_test: https://godbolt.org/z/c8KdsohTP

This loop is transformed to builtin_memcpy and builtin_memset with size 280.

The current strategy for Skylake is {512, unrolled_loop, false} for such
a size, so it will generate unrolled loops with mov, while the patch
generates a memcpy/memset libcall, which uses vector moves.

For idctrn01 it is a memset with size 512, so the speedups come from the
algorithm change.
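
For reference, the loops have roughly the following shape (a sketch of
mine: the names and element type are invented, only the 280-byte total
matches the extracted case, 35 * sizeof (double) == 280).  GCC turns
such loops into the builtin calls mentioned above:

  #define N 35

  double w[N], w_copy[N];

  void
  save_weights (void)
  {
    for (int i = 0; i < N; i++)
      w_copy[i] = w[i];   /* becomes __builtin_memcpy (w_copy, w, 280) */
  }

  void
  clear_weights (void)
  {
    for (int i = 0; i < N; i++)
      w[i] = 0.0;         /* becomes __builtin_memset (w, 0, 280) */
  }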

H.J. Lu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Tue, Apr 6, 2021 at 5:55 AM:
>
> On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > > >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
> > > >  static stringop_algs skylake_memcpy[2] =   {
> > > > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > > > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > > > -             {-1, libcall, false}}}};
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +    {256, loop, false},
> > > > +    {-1, libcall, false}}},
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +    {256, loop, false},
> > > > +    {-1, libcall, false}}}};
> > > >
> > > >  static stringop_algs skylake_memset[2] = {
> > > > -  {libcall, {{6, loop_1_byte, true},
> > > > -             {24, loop, true},
> > > > -             {8192, rep_prefix_4_byte, true},
> > > > -             {-1, libcall, false}}},
> > > > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > > > -             {-1, libcall, false}}}};
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +    {256, loop, false},
> > > > +    {-1, libcall, false}}},
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +    {256, loop, false},
> > > > +    {-1, libcall, false}}}};
> > > >
> > >
> > > If there are no objections, I will check it in on Wednesday.
> >
> > On my skylake notebook if I run the benchmarking script I get:
> >
> > jan@skylake:~/trunk/contrib> ./bench-stringop 64 640000000 gcc -march=native
> > memcpy
> >   block size  libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   sse     noalg   byte    PGO     dynamic    BEST
> >      8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18    0:00.19 sse
> >       819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09    0:00.09 libcall
> >        81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06    0:00.06 libcall
> >        20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09    0:00.05 rep1noalign
> >         8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05    0:00.04 rep1noalign
> >         4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07    0:00.05 libcall
> >         2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07    0:00.04 libcall
> >         1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06    0:00.06 libcall
> >          512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08    0:00.06 libcall
> >          256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12    0:00.10 libcall
> >          128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17    0:00.15 libcall
> >           64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28    0:00.25 loop
> >           48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31    0:00.32 unrl
> >           32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40    0:00.40 unrl
> >           24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50    0:00.50 unrlnoalign
> >           16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91    0:00.77 unrlnoalign
> >           14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99    0:00.94 unrl
> >           12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10    0:01.02 unrl
> >           10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38    0:01.23 unrlnoalign
> >            8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55    0:01.38 unrl
> > So indeed rep byte seems to consistently outperform rep4/rep8; however,
> > the unrolled variant seems to be better than rep byte for small block
> > sizes.
>
> My patch generates "rep movsb" only in very limited cases:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with a simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
>
> As a result,  "rep stosb" is generated only when 128 < data size < 256
> with -mno-sse.
>
> > Do you have some data showing blocks of size 8...256 to be faster with
> > rep1 compared to the unrolled loop, perhaps for more real-world benchmarks?
>
> "rep movsb" isn't generated with my patch in this case since
> MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> XMM registers.
>
> > The difference seems to get quite big for small blocks in the range
> > 8...16 bytes.  I noticed that before and sort of concluded that it is
> > probably the branch prediction working relatively well for those small
> > block sizes.  On the other hand, winding up the relatively long
> > unrolled loop is not very cool just to catch this case.
> >
> > Do you know which of the three changes (preferring rep movsb/stosb,
> > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > on eembc?
>
> Hongyu, can you find out where the speedup came from?
>
> Thanks.
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-04-06  9:09         ` Hongyu Wang
@ 2021-04-06  9:51           ` Jan Hubicka
  2021-04-06 12:34             ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Hubicka @ 2021-04-06  9:51 UTC (permalink / raw)
  To: Hongyu Wang; +Cc: H.J. Lu, Hongtao Liu, GCC Patches, Hongyu Wang

> > Do you know which of the three changes (preferring rep movsb/stosb,
> > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > on eembc?
> 
> An extracted testcase from nnet_test: https://godbolt.org/z/c8KdsohTP
> 
> This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> 
> The current strategy for Skylake is {512, unrolled_loop, false} for such
> a size, so it will generate unrolled loops with mov, while the patch
> generates a memcpy/memset libcall, which uses vector moves.

This is good - I originally set the table based on this
micro-benchmarking script, and apparently the glibc used at that time
had a more expensive memcpy for small blocks.

One thing to consider, however, is that calling an external memcpy also
has the additional cost of clobbering all caller-saved registers.
Especially for code that uses SSE this is painful, since everything
needs to go to the stack in that case.  So I am not completely sure how
representative the micro-benchmark is in this respect, since it does
not use any SSE and register pressure is generally small.
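
As an illustration of that point (a sketch of mine, not part of the
benchmark script): in something like the following, an inline expansion
leaves the accumulator in a register, while a libcall forces it to be
spilled around the call, because all xmm registers are call-clobbered:

  #include <string.h>

  double
  sum_and_copy (double *dst, const double *src, double acc)
  {
    for (int i = 0; i < 32; i++)
      acc += src[i];                          /* acc lives in an xmm register */
    memcpy (dst, src, 32 * sizeof (double));  /* 256 bytes */
    return acc * 0.5;                         /* libcall: acc spilled and reloaded */
  }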

So with current glibc it seems a libcall is a win for blocks of size
greater than 64 or 128, at least if the register pressure is not big.
In this respect your change looks good.
> >
> > My patch generates "rep movsb" only in very limited cases:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >    load and store for up to 16 * 16 (256) bytes when the data size is
> >    fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >    a. Use "rep movsb/stosb" with a simple code sequence if the data size
> >       is a constant.
> >    b. Use loop if data size is not a constant.

Aha, this is very hard to read from the algorithm descriptor.  So we
still have the check that maxsize == minsize and use rep movsb only for
constant-sized blocks when the corresponding TARGET macro is defined.

I think it would be more readable if we introduced rep_1_byte_constant.
The descriptor is supposed to read as a sequence of rules where the
first one applies.  It is not obvious that we have another TARGET_*
macro that makes rep_1_byte be ignored in some cases.
(The TARGET macro will also interfere with the micro-benchmarking script.)
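
Something along these lines is what I have in mind (only a sketch of the
naming idea; rep_1_byte_constant does not exist today and would stand
for "rep movsb, but only when the block size is a compile-time
constant", making the X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB special case
visible in the table itself):

  static stringop_algs skylake_memcpy[2] = {
    {libcall,
     {{256, rep_1_byte_constant, true},
      {256, loop, false},
      {-1, libcall, false}}},
    {libcall,
     {{256, rep_1_byte_constant, true},
      {256, loop, false},
      {-1, libcall, false}}}};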

Still, I do not understand why a compile-time constant size makes rep
movsb/stosb better than a loop.  Is the CPU special-casing it at decode
time and requiring an explicit mov instruction?  Or is it only because
rep movsb is not good for blocks smaller than 128 bits?

> >
> > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > with -mno-sse.
> >
> > > Do you have some data showing blocks of size 8...256 to be faster with
> > > rep1 compared to the unrolled loop, perhaps for more real-world benchmarks?
> >
> > "rep movsb" isn't generated with my patch in this case since
> > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > XMM registers.

OK, so I guess:
  {libcall,
   {{256, rep_1_byte, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}},
  {libcall,
   {{256, rep_1_loop, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}}};

may still perform better, but the difference between the loop and the
unrolled loop is within a 10% margin.

So I guess the patch is OK and we should look into cleaning up the
descriptors.  I can make a patch for that once I understand the logic
above.

Honza
> >
> > > The difference seems to get quite big for small blocks in the range
> > > 8...16 bytes.  I noticed that before and sort of concluded that it is
> > > probably the branch prediction working relatively well for those small
> > > block sizes.  On the other hand, winding up the relatively long
> > > unrolled loop is not very cool just to catch this case.
> > >
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > > on eembc?
> >
> > Hongyu, can you find out where the speedup came from?
> >
> > Thanks.
> >
> > --
> > H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
  2021-04-06  9:51           ` Jan Hubicka
@ 2021-04-06 12:34             ` H.J. Lu
  0 siblings, 0 replies; 31+ messages in thread
From: H.J. Lu @ 2021-04-06 12:34 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Hongyu Wang, Hongtao Liu, GCC Patches, Hongyu Wang

On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > > on eembc?
> >
> > An extracted testcase from nnet_test: https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > The current strategy for Skylake is {512, unrolled_loop, false} for such
> > a size, so it will generate unrolled loops with mov, while the patch
> > generates a memcpy/memset libcall, which uses vector moves.
>
> This is good - I originally set the table based on this
> micro-benchmarking script, and apparently the glibc used at that time
> had a more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy also
> has the additional cost of clobbering all caller-saved registers.
> Especially for code that uses SSE this is painful, since everything
> needs to go to the stack in that case.  So I am not completely sure how
> representative the micro-benchmark is in this respect, since it does
> not use any SSE and register pressure is generally small.
>
> So with current glibc it seems a libcall is a win for blocks of size
> greater than 64 or 128, at least if the register pressure is not big.
> In this respect your change looks good.
> > >
> > > My patch generates "rep movsb" only in very limited cases:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with a simple code sequence if the data size
> > >       is a constant.
> > >    b. Use loop if data size is not a constant.
>
> Aha, this is very hard to read from the algorithm descriptor.  So we
> still have the check that maxsize == minsize and use rep movsb only for
> constant-sized blocks when the corresponding TARGET macro is defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first one applies.  It is not obvious that we have another TARGET_*
> macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the micro-benchmarking script.)
>
> Still, I do not understand why a compile-time constant size makes rep
> movsb/stosb better than a loop.  Is the CPU special-casing it at decode
> time and requiring an explicit mov instruction?  Or is it only because
> rep movsb is not good for blocks smaller than 128 bits?

Non-constant "rep movsb" triggers more machine clear events:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html

in hot loops of some workloads.

> > >
> > > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
> > >
> > > > Do you have some data showing blocks of size 8...256 to be faster with
> > > > rep1 compared to the unrolled loop, perhaps for more real-world benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>   {libcall,
>    {{256, rep_1_byte, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}},
>   {libcall,
>    {{256, rep_1_loop, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}}};
>
> may still perform better, but the difference between the loop and the
> unrolled loop is within a 10% margin.
>
> So I guess the patch is OK and we should look into cleaning up the
> descriptors.  I can make a patch for that once I understand the logic
> above.

I am checking in my patch.  We will improve it for GCC 12.  We will also revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* PING [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-03-23  8:19         ` Richard Biener
@ 2021-08-22 15:28           ` H.J. Lu
  2021-09-08  3:01             ` PING^2 " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-08-22 15:28 UTC (permalink / raw)
  To: Richard Biener
  Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> >
> > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > eembc.
> >
> > Here is code size difference for this patch
> 
> Thanks, nothing too bad although slightly larger impacts than envisioned.
> 

PING.

OK for master branch?

Thanks.

H.J.
 ---
Simplify memcpy and memset inline strategies to avoid branches for
-mtune=generic:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use the memcpy/memset library function if data size is unknown or > 256.
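
As an illustration of the three cases (an example of mine, not part of
the patch; the comments reflect the intended expansion described above,
and the exact code of course depends on the target options):

  void
  copy_small (char *d, const char *s)
  {
    __builtin_memcpy (d, s, 200);   /* constant <= 256: expanded inline */
  }

  void
  copy_var (char *d, const char *s, unsigned int n)
  {
    if (n <= 256)
      __builtin_memcpy (d, s, n);   /* not a constant: inline loop when the
                                       size is known to fit, else libcall */
  }

  void
  copy_large (char *d, const char *s)
  {
    __builtin_memcpy (d, s, 1024);  /* constant > 256: call memcpy */
  }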

With -mtune=generic -O2,

1. On Ice Lake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r  0.51%
502.gcc_r        0.55%
505.mcf_r        0.38%
520.omnetpp_r   -0.74%
523.xalancbmk_r -0.35%
525.x264_r       2.99%
531.deepsjeng_r -0.17%
541.leela_r     -0.98%
548.exchange2_r  0.89%
557.xz_r         0.70%
Geomean          0.37%

503.bwaves_r     0.04%
507.cactuBSSN_r -0.01%
508.namd_r      -0.45%
510.parest_r    -0.09%
511.povray_r    -1.37%
519.lbm_r        0.00%
521.wrf_r       -2.56%
526.blender_r   -0.01%
527.cam4_r      -0.05%
538.imagick_r    0.36%
544.nab_r        0.08%
549.fotonik3d_r -0.06%
554.roms_r       0.05%
Geomean         -0.34%

Significant impacts on eembc benchmarks:

eembc/nnet_test      14.85%
eembc/mp2decoddata2  13.57%

2. On Cascadelake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.02%
502.gcc_r        0.10%
505.mcf_r       -1.14%
520.omnetpp_r   -0.22%
523.xalancbmk_r  0.21%
525.x264_r       0.94%
531.deepsjeng_r -0.37%
541.leela_r     -0.46%
548.exchange2_r -0.40%
557.xz_r         0.60%
Geomean         -0.08%

503.bwaves_r    -0.50%
507.cactuBSSN_r  0.05%
508.namd_r      -0.02%
510.parest_r     0.09%
511.povray_r    -1.35%
519.lbm_r        0.00%
521.wrf_r       -0.03%
526.blender_r   -0.83%
527.cam4_r       1.23%
538.imagick_r    0.97%
544.nab_r       -0.02%
549.fotonik3d_r -0.12%
554.roms_r       0.55%
Geomean          0.00%

Significant impacts on eembc benchmarks:

eembc/nnet_test      9.90%
eembc/mp2decoddata2  16.42%
eembc/textv2data3   -4.86%
eembc/qos            12.90%

3. On Znver3 processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.96%
502.gcc_r       -1.06%
505.mcf_r       -0.01%
520.omnetpp_r   -1.45%
523.xalancbmk_r  2.89%
525.x264_r       4.98%
531.deepsjeng_r  0.18%
541.leela_r     -1.54%
548.exchange2_r -1.25%
557.xz_r        -0.01%
Geomean          0.16%

503.bwaves_r     0.04%
507.cactuBSSN_r  0.85%
508.namd_r      -0.13%
510.parest_r     0.39%
511.povray_r     0.00%
519.lbm_r        0.00%
521.wrf_r        0.28%
526.blender_r   -0.10%
527.cam4_r      -0.58%
538.imagick_r    0.69%
544.nab_r       -0.04%
549.fotonik3d_r -0.04%
554.roms_r       0.40%
Geomean          0.15%

Significant impacts on eembc benchmarks:

eembc/aifftr01       13.95%
eembc/idctrn01       8.41%
eembc/nnet_test      30.25%
eembc/mp2decoddata2  5.05%
eembc/textv2data3    6.43%
eembc/qos           -5.79%

Code size differences are:

SPEC CPU 2017

                  difference      w patch      w/o patch
500.perlbench_r     0.051%        1622637      1621805
502.gcc_r           0.039%        6930877      6928141
505.mcf_r           0.098%        16413        16397
520.omnetpp_r       0.083%        1327757      1326653
523.xalancbmk_r     0.001%        3575709      3575677
525.x264_r         -0.067%        769095       769607
531.deepsjeng_r     0.071%        67629        67581
541.leela_r        -3.062%        127629       131661
548.exchange2_r    -0.338%        66141        66365
557.xz_r            0.946%        128061       126861
503.bwaves_r        0.534%        33117        32941
507.cactuBSSN_r     0.004%        2993645      2993517
508.namd_r          0.006%        851677       851629
510.parest_r        0.488%        6741277      6708557
511.povray_r       -0.021%        849290       849466
521.wrf_r           0.022%        29682154     29675530
526.blender_r       0.054%        7544057      7540009
527.cam4_r          0.043%        6102234      6099594
538.imagick_r      -0.015%        1625770      1626010
544.nab_r           0.155%        155453       155213
549.fotonik3d_r     0.000%        351757       351757
554.roms_r          0.041%        735837       735533

eembc

aifftr01            0.762%        14813        14701
aiifft01            0.556%        14477        14397
idctrn01            0.101%        15853        15837
cjpeg-rose7-preset  0.114%        56125        56061
nnet_test          -0.848%        35549        35853
aes                 0.125%        38493        38445
cjpegv2data         0.108%        59213        59149
djpegv2data         0.025%        63821        63805
huffde             -0.104%        30621        30653
mp2decoddata       -0.047%        68285        68317
mp2enf32data1       0.018%        86925        86909
mp2enf32data2       0.018%        89357        89341
mp2enf32data3       0.018%        88253        88237
mp3playerfixeddata  0.103%        46877        46829
ip_pktcheckb1m      0.191%        25213        25165
nat                 0.527%        45757        45517
ospfv2              0.196%        24573        24525
routelookup         0.189%        25389        25341
tcpbulk             0.155%        30925        30877
textv2data          0.055%        29101        29085

gcc/

	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Add m_GENERIC.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-10.c: Likewise.
	* gcc.target/i386/memset-strategy-11.c: Likewise.
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
---
 gcc/config/i386/x86-tune-costs.h              | 31 ++++++++++++-------
 gcc/config/i386/x86-tune.def                  |  2 +-
 .../gcc.target/i386/memcpy-strategy-12.c      |  9 ++++++
 .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-10.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-11.c      |  9 ++++++
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
 8 files changed, 63 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..30e7c3e4261 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = {
   "16",					/* Func alignment.  */
 };
 
-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips).  NB: rep_prefix_1_byte is used only
+   for known size.  */
 
 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static const
 struct processor_costs generic_cost = {
   {
@@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),			/* cost of movzx */
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
-  6,					/* CLEAR_RATIO */
+  17,					/* CLEAR_RATIO */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
 					   Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 8f55da89c92..a9a023f33f5 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
 	  "prefer_known_rep_movsb_stosb",
-	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+	  m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..e9998b70ab2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 249);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..109bd675a51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
new file mode 100644
index 00000000000..685d6e5a5c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
new file mode 100644
index 00000000000..61ee463a8cf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 94dadd6cdbd..44fe7d2836e 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */
 
 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index a9c89fca4ec..234db0e67c2 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
 /* { dg-additional-options "-mno-avx" { target ia32 } } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* PING^2 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-08-22 15:28           ` PING [PATCH] " H.J. Lu
@ 2021-09-08  3:01             ` H.J. Lu
  2021-09-13 13:38               ` H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-09-08  3:01 UTC (permalink / raw)
  To: Richard Biener, Lili Cui
  Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > > eembc.
> > >
> > > Here is code size difference for this patch
> >
> > Thanks, nothing too bad although slightly larger impacts than envisioned.
> >
>
> PING.
>
> OK for master branch?
>
> Thanks.
>
> H.J.
>  ---
> Simplify memcpy and memset inline strategies to avoid branches for
> -mtune=generic:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
> 3. Use the memcpy/memset library function if data size is unknown or > 256.
>

PING:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PING^2 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-09-08  3:01             ` PING^2 " H.J. Lu
@ 2021-09-13 13:38               ` H.J. Lu
  2021-09-20 17:06                 ` PING^3 " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-09-13 13:38 UTC (permalink / raw)
  To: Richard Biener, Lili Cui
  Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > > > eembc.
> > > >
> > > > Here is code size difference for this patch
> > >
> > > Thanks, nothing too bad although slightly larger impacts than envisioned.
> > >
> >
> > PING.
> >
> > OK for master branch?
> >
> > Thanks.
> >
> > H.J.
> >  ---
> > Simplify memcpy and memset inline strategies to avoid branches for
> > -mtune=generic:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >    load and store for up to 16 * 16 (256) bytes when the data size is
> >    fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> >       is a constant.
> >    b. Use loop if data size is not a constant.
> > 3. Use the memcpy/memset library function if data size is unknown or > 256.
> >
>
> PING:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html
>

PING.  This should fix:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* PING^3 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-09-13 13:38               ` H.J. Lu
@ 2021-09-20 17:06                 ` H.J. Lu
  2021-10-01 15:24                   ` PING^4 " H.J. Lu
  0 siblings, 1 reply; 31+ messages in thread
From: H.J. Lu @ 2021-09-20 17:06 UTC (permalink / raw)
  To: Richard Biener, Lili Cui
  Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Mon, Sep 13, 2021 at 6:38 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> > > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > > > >
> > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > > > > eembc.
> > > > >
> > > > > Here is code size difference for this patch
> > > >
> > > > Thanks, nothing too bad although slightly larger impacts than envisioned.
> > > >
> > >
> > > PING.
> > >
> > > OK for master branch?
> > >
> > > Thanks.
> > >
> > > H.J.
> > >  ---
> > > Simplify memcpy and memset inline strategies to avoid branches for
> > > -mtune=generic:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > >    fixed and known.
> > > 2. Inline only if data size is known to be <= 256.
> > >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> > >       is a constant.
> > >    b. Use loop if data size is not a constant.
> > > 3. Use the memcpy/memset library function if data size is unknown or > 256.
> > >
> >
> > PING:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html
> >
>
> PING.  This should fix:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
>

PING.

-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* PING^4 [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic
  2021-09-20 17:06                 ` PING^3 " H.J. Lu
@ 2021-10-01 15:24                   ` H.J. Lu
  0 siblings, 0 replies; 31+ messages in thread
From: H.J. Lu @ 2021-10-01 15:24 UTC (permalink / raw)
  To: Richard Biener, Lili Cui, Jeff Law, Jakub Jelinek
  Cc: Uros Bizjak, Jan Hubicka, Hongtao Liu, GCC Patches, Hongyu Wang

On Mon, Sep 20, 2021 at 10:06 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Sep 13, 2021 at 6:38 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> > > > > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > > > > >
> > > > > > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > > > > > eembc.
> > > > > >
> > > > > > Here is code size difference for this patch
> > > > >
> > > > > Thanks, nothing too bad although slightly larger impacts than envisioned.
> > > > >
> > > >
> > > > PING.
> > > >
> > > > OK for master branch?
> > > >
> > > > Thanks.
> > > >
> > > > H.J.
> > > >  ---
> > > > Simplify memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
> > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > > >    fixed and known.
> > > > 2. Inline only if data size is known to be <= 256.
> > > >    a. Use "rep movsb/stosb" with simple code sequence if the data size
> > > >       is a constant.
> > > >    b. Use loop if data size is not a constant.
> > > > 3. Use the memcpy/memset library function if data size is unknown or > 256.
> > > >
> > >
> > > PING:
> > >
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html
> > >
> >
> > PING.  This should fix:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
> >
>
> PING.
>

Any comments or objections to this patch?


-- 
H.J.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-10-01 15:25 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-22 13:16 [PATCH 0/3] x86: Update memcpy/memset inline strategies H.J. Lu
2021-03-22 13:16 ` [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake H.J. Lu
2021-03-22 14:10   ` Jan Hubicka
2021-03-22 23:57     ` [PATCH v2 " H.J. Lu
2021-03-29 13:43       ` H.J. Lu
2021-03-31  6:59       ` Richard Biener
2021-03-31  8:05       ` Jan Hubicka
2021-03-31 13:09         ` H.J. Lu
2021-03-31 13:40           ` Jan Hubicka
2021-03-31 13:47             ` Jan Hubicka
2021-03-31 15:41               ` H.J. Lu
2021-03-31 17:43                 ` Jan Hubicka
2021-03-31 17:54                   ` H.J. Lu
2021-04-01  5:57                     ` Hongyu Wang
2021-03-22 13:16 ` [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs H.J. Lu
2021-04-05 13:45   ` H.J. Lu
2021-04-05 21:14     ` Jan Hubicka
2021-04-05 21:53       ` H.J. Lu
2021-04-06  9:09         ` Hongyu Wang
2021-04-06  9:51           ` Jan Hubicka
2021-04-06 12:34             ` H.J. Lu
2021-03-22 13:16 ` [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic H.J. Lu
2021-03-22 13:29   ` Richard Biener
2021-03-22 13:38     ` H.J. Lu
2021-03-23  2:41       ` Hongyu Wang
2021-03-23  8:19         ` Richard Biener
2021-08-22 15:28           ` PING [PATCH] " H.J. Lu
2021-09-08  3:01             ` PING^2 " H.J. Lu
2021-09-13 13:38               ` H.J. Lu
2021-09-20 17:06                 ` PING^3 " H.J. Lu
2021-10-01 15:24                   ` PING^4 " H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).